Getting WGS data from GenBank via the SRA

Sometimes when you look at a record in RefSeq/GenBank it is a virtual record that is really a pointer to a set of records. For example, the entry for Callorhinchus milii isolate IMCB2004 points you to the WGS records AAVX02000001-AAVX02067420. Here we show how to get these records.

We have a script (of course) that uses standard Perl libraries to get the records through NCBI’s E-utilities. You can run that script like this:

perl AAVX02000000 100 AAVX02000000.out

In this case, AAVX02000000 is the base name of the record, 100 is the number of records to request at once (so we don’t run afoul of NCBI’s rules), and AAVX02000000.out is the output file.
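
To illustrate the batching idea, here is a minimal sketch in POSIX shell. It is not the Perl script itself. wgs_batches is a hypothetical helper, and we assume accessions are the six-character prefix (AAVX02) followed by a zero-padded six-digit number, as in the range AAVX02000001-AAVX02067420.

```shell
# Print WGS accessions (PREFIX + zero-padded 6-digit number) from FIRST to
# LAST, grouped into comma-separated batches of at most SIZE per line.
# wgs_batches is a hypothetical helper, not part of any NCBI tool.
# Pass FIRST and LAST as plain decimals (1, not 000001): printf would read
# a leading zero as octal.
wgs_batches() {
    prefix=$1
    first=$2
    last=$3
    size=$4
    i=$first
    line=""
    count=0
    while [ "$i" -le "$last" ]; do
        acc=$(printf '%s%06d' "$prefix" "$i")
        if [ -z "$line" ]; then
            line=$acc
        else
            line="$line,$acc"
        fi
        count=$((count + 1))
        # Emit a full batch and start a new one.
        if [ "$count" -eq "$size" ]; then
            printf '%s\n' "$line"
            line=""
            count=0
        fi
        i=$((i + 1))
    done
    # Emit the final, possibly short, batch.
    if [ -n "$line" ]; then
        printf '%s\n' "$line"
    fi
}
```

Calling wgs_batches AAVX02 1 67420 100 would print one comma-separated list of up to 100 accessions per line, and each line could then be sent as a single request.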

Here is another way to do that using NCBI’s tools.

First, you need to locate the record:

curl -s ""

The -s option puts curl in silent mode, so it doesn’t print progress. Change the part after acc= to the accession number you are interested in.

This will give you a block of JSON like so:

{
     "version": "2",
     "result": [
         {
             "bundle": "AAVX02000000",
             "status": 200,
             "msg": "ok",
             "files": [
                 {
                     "object": "wgs|AAVX02000000",
                     "name": "AAVX02.5",
                     "locations": [
                         {
                             "link": "",
                             "service": "ncbi"
                         }
                     ]
                 }
             ]
         }
     ]
}

The key item here is the link entry:

This tells you that the link currently points to AAVX02.5, and you can now use vdb-dump from the SRA Toolkit to download the sequences:

vdb-dump AAVX02.5 > AAVX02.5.out
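
If you want to script the step from the JSON response to the vdb-dump call, one rough option is to pull the name field out with sed. This is only a sketch: wgs_name is a hypothetical helper, and the pattern assumes the field and its value sit on one line, as in the example response. A real JSON parser would be more robust.

```shell
# Read the JSON response on stdin and print the value of each "name" field.
# wgs_name is a hypothetical helper, not part of the SRA Toolkit.
# Lines without a "name" field are skipped (-n with the p flag).
wgs_name() {
    sed -n 's/.*"name": *"\([^"]*\)".*/\1/p'
}
```

Piping the curl output into wgs_name should print AAVX02.5 for the response shown here, and that value is what you pass to vdb-dump.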

Note that the default vdb-dump output is not very useful for bioinformatics, so you probably want FASTA instead:

vdb-dump -f fasta AAVX02.5 > AAVX02.5.fna

Note that you could also output FASTQ or other formats. Use vdb-dump -h for more choices.
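
As a quick sanity check on the FASTA file, you can count the header lines and compare the total with the number of records you expect. This is a minimal sketch: count_fasta_records is a hypothetical helper, and comparing against 67420 assumes one FASTA record per accession in the range AAVX02000001-AAVX02067420.

```shell
# Count the records in a FASTA file by counting lines that start with ">".
# count_fasta_records is a hypothetical helper, not part of the SRA Toolkit.
count_fasta_records() {
    grep -c '^>' "$1"
}
```

For example, count_fasta_records AAVX02.5.fna prints the number of sequences that were written to the file.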