We recently released a new version of our qudaich software, designed to compare short read sequence data sets to each other. Qudaich is built around a suffix trie and provides a rapid way to compare short read data sets at the DNA or protein level. Here is how to use qudaich to compare a set of metagenomes to find out how similar they are.
For this example, we are going to use human gut viromes from Rick Bushman’s lab. These are the metagenomes that appeared in the crAssphage poops up again post, so we know that there are some similar sequences in them.
The data comes from this paper: Chehoud C, Dryga A, Hwang Y, Nagy-Szakal D, Hollister EB, Luna RA, Versalovic J, Kellermayer R, Bushman FD. 2016.[Transfer of Viral Communities between Human Individuals during Fecal Microbiota Transplantation, and you can download the data from the sequence read archive.
We focused on four datasets from this study:
- SRR3403834 4,892,517 reads totaling 1,218,689,925 bp
- SRR3403835 646,661 reads totaling 154,849,037 bp
- SRR3403836 746,002 reads totaling 179,374,069 bp
- SRR3403848 4,367,579 reads totaling 1,093,194,181 bp
We chose those four because they have the most crAssphage like sequences in them! For this demonstration, we only used one end of the reads (the _1.fastq.gz files for each), although you should probably pair your data and/or compare both ends.
First, we used our new ceeqlib library (not published and really only in pre-alpha release) to convert the fastq files to fasta format. As you can see, two of the libraries have >4 million sequences in them, and so we also randomly subsampled (without replacement) those two libraries to get down to 750,000 sequences each. That way all four data sets have similar numbers of sequences.
The first step in qudaich is to build a suffix array for each pair of comparisons. qudaich treats the databases and queries differently in two ways: First, only the database is reverse complemented – the query is compared to both the forward and reverse complement of the database – and second, qudaich reports the best hit from the database for each query sequence.
Therefore in this analysis, we compared each sequence to each other sequence in both directions.
By default, qudaich only reports the sequences where the number of matches is above the average for all matches (i.e. it only reports strong hits). That option was used in the table below.
that have a match
|Unique database sequences
As an alternative, you can specify generating an alignment for every query sequence by providing the -a all command line option. That will generate the best possible alignment for every sequence.