In genome sequencing projects one of the things we often need to do is split paired end sequence reads into the two ends. Like everything, there is the simple way, and the correct way to do this. I’m not saying which one this is.
After the break, I provide a really brief introduction to the problem, and describe a simple software solution that allows you to parse a sequence library and identify the paird ends.
To create a paired end library, random fragments of sequence are ligated to an adapter oligo. This is typically about 30-40 bp in length, and has a known sequence. Both ends of the fragment are ligated to the oligo, and the DNA is sheared by nebulization. The oligo contains a biotinylation site, and the fragments containing the oligo are enriched by streptavidin capture.
Here is a diagram from Roche-454 that describes the process:
Note that an alternative way to do this is to leave nicks in the DNA following adapter ligation, and then migrate the nicks around the circle (towards 9 and 3 on the clockface) using Klenow.
When you sequence the clonally amplified fragments you end up with the two ends of the sequences, but because of the orientation of the original fragment one of them needs to be reverse complemented to put it back in the correct orientation.
The problem at hand is given a paired end library constructed like this how do we efficiently identify all of the adapter sequences. Moreover, can we identify those sequences that have adapters where there is a sequencing error in the sequence.
The full script that removes the primers and creates fastq, fasta, and quality files is available on the lab’s sourceforge site.
Basically, the approach is to avoid regexps at all costs, and use index. Looking for substrings using index is way faster than firing up the regular expression engine, and so we start by filtering out the exact matches. Hopefully those are most of the sequences, and we can find them quickly. Index is so fast, we can even double check to make sure that there were no duplicated adapters.
Once we have those sequences, we want to look for adapters that have a mismatch. This is a trickier problem, and the solution that I took was to break the oligo into smaller overlapping fragments of 6-, 7- or 8-bp (you can set this, but 6 seems to work the best). Looking for regions where most of these fragments align will highlight the adapter. You then need to find the true start, which you can do by subtracting the index to the match from the index to the position in the array. After that you look for the correct start using index one more time.
With this combined approach, you can identify over 99% of the sequences that have adapters. The known problems are where the adapter is at the start, or the end, of the fragment, and we have not sequenced the whole adapter sequence. In general, those are flagged as not being able to identify the adapter, and we ignore them in the downstream assembly process.