While answering some reviewers' comments, I pulled out this data about the instruments used to generate the data submitted to the SRA. Clearly the HiSeq and MiSeq dominate the number of runs that people are submitting.
I love standards; there are always so many to choose from. The Sequence Read Archive strives hard to capture appropriate information about the sequences that people deposit, but in the end scientists are people too, and they are never uniform and standard. This means there are a lot of ways to describe metagenomes. To get other people to use your data (and cite your papers), make sure you tag it so we can find it!
There is a lot of metagenomics data in the SRA, but it is not very well organized. To get it all, you need some wicked SQL-FU … or you can copy these recipes!
These are all the attributes in the SRA files
NCBI’s fastq-dump has to be one of the worst-documented programs available online. The default parameters for fastq-dump are also ridiculous and certainly not what you want to use. It also mixes absolutely required parameters in with totally optional ones, so you have no idea what is required and what is optional. Here, we take a look at some of the options and hopefully help you decide which parameters to use.
The metadata in the Sequence Read Archive (SRA, aka the Short Read Archive) is complex! This is a brief guide to help you navigate it.
One key thing to remember is that:
A project (SRP) has one or more samples. Note, however, that projects are stored in the table called study.
A sample (SRS) has one or more experiments (SRX).
An experiment has one or more runs (SRR).
What you really want are the runs, and this is how you can get them!
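As a minimal sketch of the hierarchy above, here is a toy in-memory SQLite database that walks from a project down to its runs. Only the table name study comes from the text; the other table names, every column name, and all of the accessions are made up for illustration, and the real SRA metadata tables may well be laid out differently.

```python
import sqlite3


def build_toy_db():
    """Build a tiny in-memory database mirroring project -> sample -> experiment -> run.

    Assumption: the schema is hypothetical and only illustrates the hierarchy.
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE study (study_accession TEXT PRIMARY KEY);
        CREATE TABLE sample (sample_accession TEXT PRIMARY KEY, study_accession TEXT);
        CREATE TABLE experiment (experiment_accession TEXT PRIMARY KEY, sample_accession TEXT);
        CREATE TABLE run (run_accession TEXT PRIMARY KEY, experiment_accession TEXT);

        -- Made-up accessions: one project, two samples, two experiments, three runs.
        INSERT INTO study VALUES ('SRP000001');
        INSERT INTO sample VALUES ('SRS000001', 'SRP000001'), ('SRS000002', 'SRP000001');
        INSERT INTO experiment VALUES ('SRX000001', 'SRS000001'), ('SRX000002', 'SRS000002');
        INSERT INTO run VALUES ('SRR000001', 'SRX000001'),
                               ('SRR000002', 'SRX000001'),
                               ('SRR000003', 'SRX000002');
    """)
    return conn


def runs_for_project(conn, study_accession):
    """Return every run accession that belongs to one project, sorted."""
    rows = conn.execute(
        """
        SELECT run.run_accession
        FROM study
        JOIN sample ON sample.study_accession = study.study_accession
        JOIN experiment ON experiment.sample_accession = sample.sample_accession
        JOIN run ON run.experiment_accession = experiment.experiment_accession
        WHERE study.study_accession = ?
        ORDER BY run.run_accession
        """,
        (study_accession,),
    ).fetchall()
    return [row[0] for row in rows]


if __name__ == "__main__":
    conn = build_toy_db()
    print(runs_for_project(conn, "SRP000001"))
```

The point of the sketch is just that each level of the hierarchy is one join, so getting from a project to its runs takes three joins, starting from the study table.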