AKA: how to remove contamination from your metagenome! We use sharks genomes, but it works with humans, corals, and other things too!
A while ago we wrote
This is trivial to do with modern sequence analysis tools, and so we provide recipes here for filtering your reads based on matches to a reference genome. Read more to find out how!
First, you need your sequence files (presumably as fastq files) and your reference genome that you are going to map to. For this example, we will use bowtie2, because it is still one of the most up-to-date mapping tools, but you can replace it with your mapper of choice.
Lets assume your sequences are in three files
left.fastq– the left reads in a
right.fastq– the right reads in a file
single.fastq– the singleton reads in a
And also your reference genome is called
sharks.fasta, because it contains sequences of sharks (why not?).
We’re going to use bowtie2 to index our sequences, search against them, and then use samtools to create a binary bam file of the alignment.
# build the index bowtie2-build sharks.fasta sharks # run the mapping using 16 processes bowtie2 -p 16 -x sharks -1 left.fastq -2 right.fastq -U single.fastq | samtools sort > seqs.sharks.bam
Now we have our
Note that for all these steps, we are making use of the samtools flags filters, using either -f or -F (match a flag or do not match a flag) or -G (exclude reads that match). You can learn more about the samtools flags and explore the options with this samtools flags calculator.
Sequences that map to the reference
We can extract all the sequences that map to the shark reference genomes into three files: the left reads, the right reads, and the unmapped reads.
# extract the left reads that match the reference genome samtools fastq -G 12 -f 65 seqs.sharks.bam > left.shark.fastq # extract the right reads that match the reference genome samtools fastq -G 12 -f 129 seqs.sharks.bam > right.shark.fastq # extract the single reads that match the reference genome samtools fastq -F 5 seqs.sharks.bam > single.shark.fastq
In the first two commands, we
-f 65) or it is the second read in the pair (
-f 129). This combination gives us either the left or right mapped reads.
Next, we look for reads that are not pairs (that would normally be
-F 1) and are mapped (that would normally be
-F 4), hence we use
-F 5 to find the singletons that are mapped to the reference, and are hence shark sequences.
Sequences that do not map to the reference
We apply a similar logic to find the sequences that do not map to the reference genome. Again, we’ll end up with three files, a left reads, right reads, and unmapped reads.
# extract the left reads that do NOT match the reference genome samtools fastq -f 77 seqs.sharks.bam > left.notshark.fastq # extract the right reads that do NOT match the reference genome samtools fastq -f 141 seqs.sharks.bam > right.notshark.fastq # extract the single reads that do NOT match the reference genome samtools fastq -f 4 -F 1 seqs.sharks.bam > single.notshark.fastq
The logic here is very similar. A
-f 77 means the read is
-f 141 means the same except it is the second mate in the pair.
Finally, we use two flags:
-f 4 (the read is unmapped)
-F 1 (the read is not paired) to find the single reads that are not mapped.
By combining the flags in the samtools files we can easily separate our sequences into those reads that match the reference genome, and those that do not. The next question is, which ones are you interested in!