Accessing SRA in the Cloud

As part of the STRIDES initiative, the NIH has moved the SRA to the cloud, including both the metadata and the whole archive. Here, I show how to set up a new instance to access the Sequence Read Archive (SRA) in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.

For this example, I am using Google Cloud Platform. The data is also currently available on AWS, and setting it up is essentially the same, except that you use the AWS tab instead of the GCP tab when configuring vdb-config below.

First, log into the Google Cloud Console and launch a new instance. I am using a Debian version 10 instance, but the approach is the same if you use a different variant.

Once the machine starts, access it via ssh as you normally do. I am not giving instructions here on how to ssh to your Google Cloud instance; refer to the regular documentation for that.

Head to the SRA Toolkit GitHub download page and copy the link for the latest version of the toolkit. I am using the version called Ubuntu Linux 64 bit architecture - non-sudo tar archive. At the time of writing it is 2.10.2, but change the version below, because I am sure there will be a new version by the time you are reading this! Use curl to download the tarball:

curl -LO http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/2.10.2/sratoolkit.2.10.2-ubuntu64.tar.gz

(Update the version number in this URL as appropriate!)

Extract the archive:

tar xf sratoolkit.2.10.2-ubuntu64.tar.gz

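Then add the directory containing the executables to your PATH. Use an absolute path (here via $PWD) so that the tools are still found after you change directories:

export PATH=$PATH:$PWD/sratoolkit.2.10.2-ubuntu64/bin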
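Since the version number appears in several places, it can help to keep it in one variable. The following is a minimal sketch, not an official installer: the variable names and the install_sratoolkit function are my own, and it assumes that only the version number changes in the download URL from release to release.

```shell
# Toolkit version; 2.10.2 was current when this post was written.
SRA_VERSION="2.10.2"
SRA_DIR="sratoolkit.${SRA_VERSION}-ubuntu64"
# Assumption: newer releases follow the same URL pattern as 2.10.2.
SRA_URL="http://ftp-trace.ncbi.nlm.nih.gov/sra/sdk/${SRA_VERSION}/${SRA_DIR}.tar.gz"

# Download, extract, and add the bin directory to PATH (hypothetical helper).
install_sratoolkit() {
    curl -LO "$SRA_URL" &&
    tar xf "${SRA_DIR}.tar.gz" &&
    export PATH="$PATH:$PWD/${SRA_DIR}/bin"
}

echo "Will download: $SRA_URL"
# Run the three steps with:
#   install_sratoolkit
```

Paste this into your current shell (or source it) rather than running it as a separate script, otherwise the PATH change will not carry over to your session.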

Note that you probably want to make this permanent so that the next time you log in to this instance, you are ready to go. The trivial way to do this is to edit your .bashrc file and add that line, but the instructions for that are outside the scope of this blog.
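For completeness, here is one hedged way to do that without opening an editor. The add_path_line function is a hypothetical helper of mine, and the install location in the example call is an assumption; adjust it to wherever you extracted the tarball.

```shell
# Append a PATH line for bin_dir to rc_file unless that exact line is already there.
add_path_line() {
    rc_file="$1"
    bin_dir="$2"
    line="export PATH=\$PATH:$bin_dir"
    grep -qxF -e "$line" "$rc_file" 2>/dev/null || printf '%s\n' "$line" >> "$rc_file"
}

# Example call (the install location under $HOME is an assumption):
#   add_path_line "$HOME/.bashrc" "$HOME/sratoolkit.2.10.2-ubuntu64/bin"
```

The grep check means you can run it more than once without piling up duplicate lines in your .bashrc.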

If we try to use the SRA Toolkit at this point, we get an error:

$ srapath SRR000001
 This sra toolkit installation has not been configured.
 Before continuing, please run: vdb-config --interactive
 For more information, see https://www.ncbi.nlm.nih.gov/sra/docs/sra-cloud/

So we need to run vdb-config in interactive mode:

vdb-config -i

There are a couple of settings that you should change:

  1. Disable the cache:
    1. Press C to choose the Cache tab
    2. Press i to uncheck the box enable local file-caching
  2. Report cloud instance
    1. Press G to choose the GCP tab
    2. Press r to enable report cloud instance identity
  3. Press s to save the changes
  4. Press x to exit vdb-config
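If you would rather script these settings than press keys, recent toolkit releases appear to offer command-line equivalents. Treat the following as a sketch under that assumption: the --set and --report-cloud-identity options and the /repository/user/cache-disabled node are what I understand vdb-config to accept, so check vdb-config --help on your version first. The configure_sra_cloud function and the APPLY variable are hypothetical names. By default it only prints the commands.

```shell
# Print the vdb-config commands (dry run), or run them when APPLY=1 is set.
configure_sra_cloud() {
    run=echo
    [ "${APPLY:-0}" = "1" ] && run=""
    # Assumed equivalent of unchecking "enable local file-caching"
    $run vdb-config --set /repository/user/cache-disabled=true
    # Assumed equivalent of enabling "report cloud instance identity"
    $run vdb-config --report-cloud-identity yes
}

# Dry run: show what would be executed
configure_sra_cloud
# To apply the settings for real:
#   APPLY=1 configure_sra_cloud
```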

Now let's check where the SRA Toolkit is pulling the data from:

$ srapath SRR000001  
https://locate.ncbi.nlm.nih.gov/sdlr/sdlr.fcgi?jwt=eyJhbGciOiJSUzI1NiIsImtpZCI6InNkbGtpZDEiLCJ0eXAiOiJKV1QifQ.eyJleHAiOjE1ODE4ODA0NzEsImlhdCI6MTU3OTI4ODQ3MSwibGluayI6Imh0dHBzOi8vc3RvcmFnZS5nb29nbGXXaGlzLmNvbS9zcmEtcHViLXJ1bi03L1NSUjAwMDAwMS9TUlIwMDAwMDEuNCIsInJlZ2lvbiI6InVzIiwic2Vy2dmljZSI6ImdzIiwic2lnbmluZ0FjY291bnQiOiJzcmFfZ3MiLCJ0aW1lb3V0Ijo2MDAwfQ.jC1xVd60uevm_g-TjynZt_66X2-JpnBorRTGlRInlPrNFk7Zw27H5lpAjtBwOhvRaqC4payupnz6ymFw6TS5H1TJ8LAHAZbNg-qoSDqnPiict1qDswlr2tTwT3xctoUn2y2SjVbAlChJTprXVdXE17Fnptwy-OlT0I9sPXByvA_4OWggpD3EcrQSwuNwAOBSuyYX35n-Xnthl_Y-DdhFIu3Zmw8bMSHBfkCpR5QVU0_TazIvfWFaVorxq--E0Rvi9kCx7URTOS85DVHle2oYoi_pCONJT2DRmeL5nSTiQwLZvOfoK2tieoihYOpi_1TwEjI5bKzqL5lW9r2qA&ncbi_phid=939B877E5B0A11B500005DB9CB44A770.1.1

If you see a long and complex URL like this one then you are accessing SRA in the Cloud. Congratulations!

If, however, you see a short URL like this one:

$ srapath SRR000001
 https://sra-downloadb.be-md.ncbi.nlm.nih.gov/sos1/sra-pub-run-5/SRR000001/SRR000001.4

then, unfortunately, you are accessing SRA from NCBI, and you have probably missed one of the steps in vdb-config.
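If you want to automate this check, here is a rough sketch. Both functions are hypothetical helpers of mine. classify_sra_url relies only on the two URL shapes shown in this post, so other shapes come back as "unknown". decode_jwt_payload assumes that the jwt= parameter is a standard JSON Web Token whose middle section is base64url-encoded JSON, and that your base64 supports -d (GNU coreutils does).

```shell
# Label a URL printed by srapath, based only on the two patterns shown in this post.
classify_sra_url() {
    case "$1" in
        *jwt=*)          echo "cloud" ;;
        *sra-download*)  echo "ncbi" ;;
        *)               echo "unknown" ;;
    esac
}

# Print the decoded payload of the jwt= parameter in a URL.
decode_jwt_payload() {
    # Pull out the value of jwt=, then keep the second dot-separated section
    token=$(printf '%s\n' "$1" | sed -n 's/.*[?&]jwt=\([^&]*\).*/\1/p')
    payload=$(printf '%s' "$token" | cut -d. -f2 | tr '_-' '/+')
    # Restore the '=' padding that base64url encoding strips
    case $(( ${#payload} % 4 )) in
        2) payload="${payload}==" ;;
        3) payload="${payload}=" ;;
    esac
    printf '%s' "$payload" | base64 -d
}

# Example calls, once srapath is configured:
#   classify_sra_url "$(srapath SRR000001)"
#   decode_jwt_payload "$(srapath SRR000001)"
```

If the decoded payload mentions a cloud storage address rather than an NCBI host, that is a second confirmation that the data is coming from the cloud.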

Note: Unselecting the local file caching is my personal preference and is not required for accessing the SRA in the cloud. However, the local file cache will very quickly fill up your local hard drive. I recommend not keeping a local file cache!