Category Archives: Lab blog

NSF Conflicts

For several years the NSF has been prototyping a spreadsheet-based conflicts reporting system. The spreadsheet typically has the following fields:

C | Name: | Organizational Affiliation | Optional (email, Department) | Last Active

The problem is that you need to create this file every time you submit a grant proposal. Here is a somewhat trivial solution that will hopefully help you create it.

Splitting and pairing fastq files

A lot of software benefits from paired fastq files that contain mate pair information, and usually you get these from your sequence provider. However, sometimes (e.g. when you download them from the SRA) you get sequences that are not appropriately paired.

There are lots of solutions (e.g. this thread suggests using Trimmomatic and this thread has an awk solution), but none of them both split and order the sequences. Until now.

We’ve developed a bunch of different solutions to this problem in Python (including fastq_pairs.py, pair_fastq_fast.py, pair_fastq_files.py, and pair_fastq_lowmem.py).

Recently, however, we’ve been handling very large files, and the performance of these programs (yes, even the lowmem version) is hindering our ability to process them.

Therefore, we introduce fastq_pair, a C implementation for pairing fastq files and sorting out which reads have matches in both files and which are singletons. This code starts with two fastq files and creates four output files. It is quick and efficient, especially if you adjust the size of the hash table (which you can do with a command line option).

It takes advantage of random file access. We open the first file and build an index of the IDs in that file and the positions at which those IDs occur. Then we read the second file, and if an ID matches one in the index, we seek to the start of the corresponding record in the first file and write both sequences to the “pairs” files. We also set a flag in our data structure so we know that we’ve printed that sequence. If the ID doesn’t match, we write the sequence to the “singles” file, and at the end of all the processing we go through the IDs in our data structure and print out those sequences we haven’t printed yet.
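The index-and-seek idea can be sketched in a few lines of Python. This is only an illustration of the approach described above, not the fastq_pair source: the function names, the output file names, and the rule that mates share an ID once a trailing /1 or /2 is removed are all assumptions made for the example, and a Python dict stands in for the hash table.

```python
def read_record(handle):
    """Read one four-line fastq record; return (read_id, record) or None at EOF."""
    lines = [handle.readline() for _ in range(4)]
    if not lines[0].strip():
        return None
    # Assumption: the ID is the first word of the header, minus a /1 or /2 suffix.
    read_id = lines[0][1:].split()[0]
    if read_id.endswith((b"/1", b"/2")):
        read_id = read_id[:-2]
    return read_id, b"".join(lines)


def pair_fastq(left, right, prefix):
    """Pair two fastq files; return a dict of the four (hypothetical) output paths."""
    # Pass 1: index the first file as read ID -> [byte offset, written flag].
    index = {}
    with open(left, "rb") as first:
        while True:
            offset = first.tell()
            record = read_record(first)
            if record is None:
                break
            index[record[0]] = [offset, False]

    names = ["left.paired", "right.paired", "left.single", "right.single"]
    paths = {name: f"{prefix}.{name}.fq" for name in names}
    out = {name: open(path, "wb") for name, path in paths.items()}
    try:
        with open(left, "rb") as first, open(right, "rb") as second:
            # Pass 2: stream the second file and look each ID up in the index.
            while True:
                record = read_record(second)
                if record is None:
                    break
                read_id, text = record
                entry = index.get(read_id)
                if entry is None:
                    out["right.single"].write(text)
                    continue
                # Seek straight to the mate in the first file and write both.
                first.seek(entry[0])
                out["left.paired"].write(read_record(first)[1])
                out["right.paired"].write(text)
                entry[1] = True
            # Finally, write the first-file reads that never found a mate.
            for offset, written in index.values():
                if not written:
                    first.seek(offset)
                    out["left.single"].write(read_record(first)[1])
    finally:
        for handle in out.values():
            handle.close()
    return paths
```

Only the IDs and byte offsets of the first file are held in memory; the sequences themselves stay on disk until they are written out. The paired output follows the order of the second file, so the two “pairs” files stay in step with each other.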

Take a look and give it a try!

Installing PyFBA (and necessary modules) without admin permissions

As easy as it is to install PyFBA using the pip command, it can be quite cumbersome to do so when you are working on a system where you have not been granted administrative or sudo permissions. Here is a quick guide that has worked for me when installing PyFBA on a CentOS 6.3 system running a Sun Grid Engine cluster. If you are working on a Linux system and you do have admin and sudo permissions, please follow the install guide here.

How many bp of metagenomes are there in the SRA?

We were curious about how many bp of metagenomes there are in the SRA. This was partly inspired by our grant writing, and partly by a question on Twitter from Tom Delmont.

This is how to answer the question!

CAMI challenge datasets

CAMI (Critical Assessment of Metagenome Interpretation) is a community-led initiative designed to help tackle the problems faced by metagenomics analyses, aiming for an independent, comprehensive and bias-free evaluation of these metagenomics pipelines [source]. As part of the challenge, several simulated datasets were generated in order to evaluate each of the assembly, profiling, and binning tools submitted for review. Three distinct datasets were generated simulating microbiomes of varying complexities: low, medium, and high complexity. A pre-print version of the CAMI manuscript can be found on bioRxiv here: http://biorxiv.org/content/early/2017/01/09/099127
This blog post contains links to the binning and profiling results for those datasets.

2017 Metagenomics Workshop

Once again we are offering a one-week workshop on metagenomics data analysis at San Diego State University from June 26th to June 30th. The course will focus on random metagenomics sequencing and data analysis (not 16S sequencing). It will cover sequencing technologies and sequencing approaches, data analysis using the Linux command line, paired-end sequencing, sequence assembly, mapping reads and visualization, population genomics, and extracting data from the Sequence Read Archive. If you are interested, click through to read more.

Continue reading