Squeakr is a new k-mer counter that looks exciting and we wanted to try it. Squeakr is based on a Counting Quotient Filter and promises to be faster than the leading k-mer counters for large datasets, and we like speed. Below I show how to install it on CentOS. Note that you may break your system and you need root access to do this!
Several people ask me about tips for learning new programming languages. Here, we talk about some of the broader concepts in learning a language.
For several years the NSF has been prototyping a spreadsheet-based conflicts reporting system. The spreadsheet typically has the following fields:
| C | Name: | Organizational Affiliation | Optional (email, Department) | Last Active |
The problem is you need to make this file every time you submit a grant. Here is a somewhat trivial solution, but hopefully it will help you create this file.
A lot of software benefits from paired fastq files that contain mate pair information, and usually you get these from your sequence provider. However, sometimes (e.g. when you download them from the SRA) you get sequences that are not appropriately paired.
Recently, however, we’ve been handling very large files, and the performance of these programs (yes, even the lowmem version) is hindering our ability to process these files.
Therefore, we introduce fastq_pair, a C implementation for pairing fastq files and sorting out which reads have matches in both files and which are singletons. This code starts with two fastq files and creates four output files. It is quick and efficient, especially if you tune the size of the hash table (which you can do with a command line option).
It takes advantage of random access to files. We open the first file and build an index of the IDs in the file and the positions at which those IDs occur. Then we read the second file, and if an ID matches one in the index, we scoot to the start of the appropriate line in the first file and write both sequences to the “pairs” files. We also set a flag in our data structure so we know that we’ve printed that sequence. If an ID from the second file has no match, we write that sequence to the “singles” file, and at the end of all the processing we go through the IDs in our data structure and print out those sequences we haven’t printed yet.
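The indexing approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the fastq_pair C code itself: a Python dict stands in for the hash table (so the hash table size option is not modeled), the function names and output file names are made up for this example, and it assumes plain four-line fastq records whose IDs may end in /1 or /2.

```python
def read_id(header):
    """Return the read ID from a fastq header line, dropping any /1 or /2 suffix."""
    name = header[1:].split()[0]
    if name.endswith(("/1", "/2")):
        name = name[:-2]
    return name


def index_fastq(path):
    """Map each read ID to [byte offset of its record, printed flag]."""
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            header = handle.readline()
            if not header:
                break
            for _ in range(3):  # skip the sequence, '+', and quality lines
                handle.readline()
            index[read_id(header.decode())] = [offset, False]
    return index


def read_record(handle, offset):
    """Seek to a stored offset and return the four-line record found there."""
    handle.seek(offset)
    return b"".join(handle.readline() for _ in range(4))


def pair_fastq(left, right, prefix):
    """Write paired and single reads to four files (hypothetical names)."""
    index = index_fastq(left)
    counts = {"pairs": 0, "left_singles": 0, "right_singles": 0}
    with open(left, "rb") as lfh, open(right, "rb") as rfh, \
            open(prefix + ".left.paired.fq", "wb") as lpair, \
            open(prefix + ".right.paired.fq", "wb") as rpair, \
            open(prefix + ".left.single.fq", "wb") as lsingle, \
            open(prefix + ".right.single.fq", "wb") as rsingle:
        while True:
            record = [rfh.readline() for _ in range(4)]
            if not record[0]:
                break
            entry = index.get(read_id(record[0].decode()))
            if entry is None:
                # No mate in the first file: this read is a singleton.
                rsingle.write(b"".join(record))
                counts["right_singles"] += 1
            else:
                # Jump straight to the mate in the first file and write both.
                lpair.write(read_record(lfh, entry[0]))
                rpair.write(b"".join(record))
                entry[1] = True  # flag the read as printed
                counts["pairs"] += 1
        # Anything in the index that was never flagged is a left singleton.
        for offset, printed in index.values():
            if not printed:
                lsingle.write(read_record(lfh, offset))
                counts["left_singles"] += 1
    return counts
```

Calling pair_fastq("reads_1.fastq", "reads_2.fastq", "out") would write the four files with the out prefix and return the number of pairs and singletons. Only the first file's IDs and offsets are held in memory; the sequences themselves are read from disk on demand.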
Take a look and give it a try!
As easy as it is to install PyFBA using the pip command, it can be quite cumbersome to do so when you are working on a system where you have not been granted administrative or sudo permissions. Here is a quick guide that has worked for me when installing PyFBA on a CentOS 6.3 system running a Sun Grid Engine cluster. If you are working on a Linux system and you do have admin and sudo permissions, please follow the install guide here.
We were curious about how many bp of metagenomes are in the SRA. This was partly inspired by our grant writing, and partly by this question on Twitter from Tom Delmont:
great Rob! Do you have the ratio in term of ‘file volumes’ between WGS and 16S amplicons? Just curious to know if WGS wins on this front 🙂
— tom delmont (@tomodelmont) March 30, 2017
This is how to answer the question!
CAMI (Critical Assessment of Metagenome Interpretation) is a community-led initiative designed to help tackle the problems faced by metagenomics analyses, aiming for an independent, comprehensive and bias-free evaluation of these metagenomics pipelines [source]. As part of the challenge, several simulated datasets were generated in order to evaluate each of the assembly, profiling, and binning tools submitted for review. Three distinct datasets were generated simulating microbiomes of varying complexities: low, medium, and high complexity. A pre-print version of the CAMI manuscript can be found on bioRxiv here: http://biorxiv.org/content/early/2017/01/09/099127
This blog post contains links to the binning and profiling results for those datasets.
Once again we are offering a one-week workshop on metagenomics data analysis at San Diego State University from June 26th to June 30th. The course will focus on random metagenomics sequencing and data analysis (not 16S sequencing). It will cover sequencing technologies and sequencing approaches, data analysis using the Linux command line, paired end sequencing, sequence assembly, mapping reads and visualization, population genomics, and extracting data from the Sequence Read Archive. If you are interested, click the read more link.