Recall that in the SRA, a project (SRP) has one or more samples, a sample (SRS) has one or more experiments (SRX), and an experiment has one or more runs (SRR). [source: davetang.org]
How many experiments only have one run, and how many experiments have lots of runs?
I queried the database with sqlite3, building a temporary table and then running a second select against it:
sqlite3 SRAmetadb.sqlite 'create temporary table tt as select experiment_accession, count(1) as cnt from run group by experiment_accession; select cnt, count(1) from tt group by cnt'
Basically, we make a temporary table (called tt) with one row for each experiment accession present in the run table: we group by experiment accession and count the runs in each group. This first table shows us that one experiment (SRX661764) has 9,212 runs associated with it (I wish I had that sequencing budget!). The second select then groups the temporary table by the cnt column, showing us how many experiments have one run, how many have two, and so on.
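To make the two steps easier to follow, here is a minimal sketch of the same query run from Python against a tiny in-memory database. The run table here is a toy stand-in with made-up accessions (it only has the two columns the query needs), so the numbers are purely illustrative and are not taken from SRAmetadb.sqlite.

```python
import sqlite3

# Toy stand-in for the `run` table; these accessions are made up for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE run (run_accession TEXT, experiment_accession TEXT)")
conn.executemany(
    "INSERT INTO run VALUES (?, ?)",
    [
        ("SRR1", "SRX1"),
        ("SRR2", "SRX2"),
        ("SRR3", "SRX2"),
        ("SRR4", "SRX3"),
        ("SRR5", "SRX4"),
        ("SRR6", "SRX4"),
        ("SRR7", "SRX4"),
    ],
)

# Step 1: one row per experiment, with the number of runs it has.
conn.execute(
    "CREATE TEMPORARY TABLE tt AS "
    "SELECT experiment_accession, COUNT(1) AS cnt "
    "FROM run GROUP BY experiment_accession"
)

# Step 2: for each run count, how many experiments have that many runs.
histogram = conn.execute(
    "SELECT cnt, COUNT(1) FROM tt GROUP BY cnt ORDER BY cnt"
).fetchall()

print(histogram)  # [(1, 2), (2, 1), (3, 1)]
```

In the toy data, two experiments have a single run, one has two runs, and one has three. Pointing sqlite3.connect at the real SRAmetadb.sqlite file instead of ":memory:" (and skipping the toy inserts) should give the real counts.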
Here is a graph of the counts:
(Note that both axes are on a log scale.) There are a lot of experiments with only a single run. Note also that I took this from the run table, so by definition I will not find experiments that have no runs associated with them (though I believe they exist).
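If you wanted to check for experiments without any runs, one way would be a left join from the experiment table to the run table. This is only a sketch on toy data: it assumes the database has an experiment table with an experiment_accession column, which I have not verified here, and the accessions are made up.

```python
import sqlite3

# Toy tables; the `experiment` table and its column name are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE experiment (experiment_accession TEXT)")
conn.execute("CREATE TABLE run (run_accession TEXT, experiment_accession TEXT)")
conn.executemany(
    "INSERT INTO experiment VALUES (?)",
    [("SRX1",), ("SRX2",), ("SRX3",)],
)
conn.executemany(
    "INSERT INTO run VALUES (?, ?)",
    [("SRR1", "SRX1"), ("SRR2", "SRX1"), ("SRR3", "SRX3")],
)

# A LEFT JOIN keeps every experiment; experiments with no matching run
# come back with NULL in the run columns, so we filter on that.
no_runs = conn.execute(
    "SELECT e.experiment_accession "
    "FROM experiment e "
    "LEFT JOIN run r ON r.experiment_accession = e.experiment_accession "
    "WHERE r.run_accession IS NULL "
    "ORDER BY e.experiment_accession"
).fetchall()

print(no_runs)  # [('SRX2',)]
```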
How many experiments only have one run? Here are the counts of experiments with between 1 and 10 runs:
|Number of runs|Number of experiments with that number of runs|Percent of all runs|
82.6% of all runs are the only run for that experiment, and >90% of runs are associated with experiments that have fewer than four runs.
Note also that experiments tend to favor having an even number of runs rather than an odd number. (Data is rarely random!)
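For reference, here is a rough sketch of how a "percent of all runs" column and its running total could be computed from the (cnt, number of experiments) pairs that the query returns. I am assuming the percentage means the runs belonging to experiments with that run count, divided by all runs. The run_shares helper and the input numbers are made up for illustration.

```python
def run_shares(histogram):
    """Return (cnt, n_experiments, percent_of_runs, cumulative_percent) rows.

    `histogram` is a list of (cnt, n_experiments) pairs, i.e. the output of
    the second select. An experiment with `cnt` runs contributes `cnt` runs.
    """
    total_runs = sum(cnt * n_exp for cnt, n_exp in histogram)
    rows = []
    cumulative = 0.0
    for cnt, n_exp in sorted(histogram):
        share = 100.0 * cnt * n_exp / total_runs
        cumulative += share
        rows.append((cnt, n_exp, share, cumulative))
    return rows


# Made-up counts: 6 experiments with 1 run, 1 with 2 runs, 1 with 4 runs.
toy = [(1, 6), (2, 1), (4, 1)]
rows = run_shares(toy)
for cnt, n_exp, share, cumulative in rows:
    print(f"{cnt}\t{n_exp}\t{share:.1f}%\t{cumulative:.1f}%")
```

With the toy input there are 12 runs in total, so the single-run experiments account for 50% of all runs.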