The NCBI SRA contains a lot of data – about 10^16 bp at the moment! However, searching it has always been problematic. We’re happy to unveil a new search SRA service that allows you to search the random metagenomes in the SRA using either DNA or protein queries.
We were curious about how many bp of metagenomes are in the SRA. This was partly inspired by our grant writing, and partly by this question on Twitter from Tom Delmont:
great Rob! Do you have the ratio in term of ‘file volumes’ between WGS and 16S amplicons? Just curious to know if WGS wins on this front 🙂
— tom delmont (@tomodelmont) March 30, 2017
This is how to answer the question!
The metadata in the SRA is not all the data you can get about a run. Here is how to get more data about a run from the SRA without going to the SRA website.
While answering some reviewers' comments, I pulled out this data about the instruments used to generate the data submitted to the SRA. Clearly the HiSeq and MiSeq dominate the number of runs that people are submitting.
I love standards; there are always so many to choose from. The Sequence Read Archive strives hard to capture appropriate information about the sequences that people deposit, but in the end scientists are people too, and they are never uniform and standard. This means there are a lot of ways to describe metagenomes. To get your data used by other people (and your papers cited), make sure you tag it so we can find it!
There is a lot of metagenomics data in the SRA, but it is not very well organized. To get it all, you need some wicked SQL-FU … or you can copy these recipes!
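As a rough illustration of the kind of query involved, here is a minimal sketch that totals bases per library strategy for metagenomic runs. It builds a toy in-memory SQLite table so it runs anywhere. The table name `sra`, the column names (`library_source`, `library_strategy`, `bases`), the run IDs, and the numbers are all assumptions made up for the example, loosely modeled on an SQLite dump of SRA metadata; they are not the recipes from the post, so check them against your own copy of the metadata.

```python
import sqlite3


def bases_by_strategy(conn):
    """Count runs and sum bases per library strategy, metagenomic runs only."""
    query = """
        SELECT library_strategy, COUNT(*) AS runs, SUM(bases) AS total_bases
        FROM sra
        WHERE library_source = 'METAGENOMIC'
        GROUP BY library_strategy
        ORDER BY total_bases DESC
    """
    return conn.execute(query).fetchall()


# Toy stand-in for a real metadata database (hypothetical schema and values).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sra ("
    "run_accession TEXT, library_source TEXT, "
    "library_strategy TEXT, bases INTEGER)"
)
conn.executemany(
    "INSERT INTO sra VALUES (?, ?, ?, ?)",
    [
        ("RUN0001", "METAGENOMIC", "WGS", 5000),
        ("RUN0002", "METAGENOMIC", "AMPLICON", 1200),
        ("RUN0003", "METAGENOMIC", "WGS", 3000),
        ("RUN0004", "GENOMIC", "WGS", 9000),  # excluded: not metagenomic
    ],
)

results = bases_by_strategy(conn)
for strategy, runs, total_bases in results:
    print(strategy, runs, total_bases)
```

Pointing `sqlite3.connect` at a real metadata file instead of `:memory:` (and dropping the toy inserts) would give the same breakdown for actual runs, provided the table and column names match.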
These are all the attributes in the SRA files.