command line deconseq

A while ago we wrote deconseq to allow you to remove contamination from your sequence libraries. We used an HTS-mapper to map the reads in your sequences to your reference genome, and then filtered the sequences after mapping.

This is trivial to do with modern sequence analysis tools, and so we provide recipes here for filtering your reads based on matches to a reference genome. Read more to find out how!

First, you need your sequence files (presumably as fastq files) and your reference genome that you are going to map to. For this example, we will use bowtie2, because it is still one of the most up-to-date mapping tools, but you can replace it with your mapper of choice.

Lets assume your sequences are in three files

  • left.fastq – the left reads in a fastq file
  • right.fastq – the right reads in a file
  • single.fastq – the singleton reads in a fastq file

And also your reference genome is called sharks.fasta, because it contains sequences of sharks (why not?).

We’re going to use bowtie2 to index our sequences, search against them, and then use samtools to create a binary bam file of the alignment.

# build the index
bowtie2-build sharks.fasta sharks
# run the mapping using 16 processes
bowtie2 -p 16 -x sharks -1 left.fastq -2 right.fastq -U single.fastq | samtools sort > seqs.sharks.bam

Now we have our bamfile we can use samtools to separate the sequences into those sequences that map to the sharks and those sequences that do not map to the sharks. We’ll do this in two separate steps.

Note that for all these steps, we are making use of the samtools flags filters, using either -f or -F (match a flag or do not match a flag) or -G (exclude reads that match). You can learn more about the samtools flags and explore the options with this samtools flags calculator.

Sequences that map to the reference

We can extract all the sequences that map to the shark reference genomes into three files: the left reads, the right reads, and the unmapped reads.

# extract the left reads that match the reference genome
samtools fastq -G 12 -f 65 seqs.sharks.bam > left.shark.fastq
# extract the right reads that match the reference genome
samtools fastq -G 12 -f 129 seqs.sharks.bam > right.shark.fastq
# extract the single reads that match the reference genome
samtools fastq -F 5 seqs.sharks.bam > single.shark.fastq

In the first two commands, we use -G 12 which excludes reads where the read is unmapped and the mate is unmapped, and then we look for reads that are paired and it is the first read in the pair (-f 65) or it is the second read in the pair (-f 129). This combination gives us either the left or right mapped reads.

Next, we look for reads that are not pairs (that would normally be -F 1) and are mapped (that would normally be -F 4), hence we use -F 5 to find the singletons that are mapped to the reference, and are hence shark sequences.

Sequences that do not map to the reference

We apply a similar logic to find the sequences that do not map to the reference genome. Again, we’ll end up with three files, a left reads, right reads, and unmapped reads.

# extract the left reads that do NOT match the reference genome
samtools fastq -f 77 seqs.sharks.bam > left.notshark.fastq
# extract the right reads that do NOT match the reference genome
samtools fastq -f 141 seqs.sharks.bam > right.notshark.fastq
# extract the single reads that do NOT match the reference genome
samtools fastq -f 4 -F 1 seqs.sharks.bam > single.notshark.fastq

The logic here is very similar. A samtools flag of -f 77 means the read is paird, neither the read nor mate is mapped, and the it is the first read in the pair, while a flag of -f 141 means the same except it is the second mate in the pair.

Finally, we use two flags: -f 4 (the read is unmapped) and -F 1 (the read is not paired) to find the single reads that are not mapped.

By combining the flags in the samtools files we can easily separate our sequences into those reads that match the reference genome, and those that do not. The next question is, which ones are you interested in!