NCBI’s fastq-dump has to be one of the worst-documented programs available online. The default parameters for fastq-dump are also ridiculous and certainly not what you want to use. They also have absolutely required parameters mixed in with totally optional parameters, and so you have no idea what is required and what is optional. Here, we take a look at some of the options and hopefully help you decide which parameters to run.
If you are working with SRA files you will need, at some point, to use fastq-dump. Unfortunately, it is not very well explained. In fact, the official NCBI documentation doesn’t even include all the options available to fastq-dump!. This is what we have learned from using it, and also what we use to extract sequences.
We usually use this command (changing SRR_ID to whatever the ID is).
fastq-dump --outdir fastq --gzip --skip-technical --readids --read-filter pass --dumpbase --split-3 --clip SRR_ID
After that, and depending on your downstream analyses, you may need to reorganize the fastq files so that the sequences in each file match and that you get file(s) of singletons. I suggest that you try fastq_pair to do that. Update: if you use –split-3 you will get three files, a left file, a right file, and a singletons file. Then you may not need to reorder your fastq files.
The official documentation has pretty minimal descriptions of what the outputs are and how you should use them. Here is a more detailed description, along with noting some options that you absolutely must use when you extract sequences from SRA (noted [required])
Skipping the general options (help and version), you get to the “Data formatting” options:
Splitting reads [required]
We start with two related parameters (these are the definitions from the NCBI website)
--split-spot Split spots into individual reads.
--split-files Dump each read into separate file. Files will receive suffix corresponding to read number.
--split-3 Legacy 3-file splitting for mate-pairs: First biological reads satisfying dumping conditions are placed in files *_1.fastq and *_2.fastq If only one biological read is present it is placed in *.fastq Biological reads and above are ignored.
Rationale: Some of the reads in SRA are paired-end reads where they sequenced (e.g.) from the left and right end of the sequence and have an estimated gap size between the ends (i.e. the average length of the fragments they are sequencing). It is important that you know if the sequences are paired-end for your downstream analysis, and most programs take the pairs into consideration.
–split-spot separates the read into left and right ends, and puts the reads in the same file. This creates a single file.
–split-files separates the read into left and right ends, and puts the forward and reverse reads in two separate files. This creates two files.
–split-3 separates the read into left and right ends, but if there is a left end without a matching right end, or a right end without a matching left end, they will be put in a single file.
You need to pass one of these two parameters to your dump-fastq command (but that is not all, you need –readids, see below). If you don’t, the paired end reads are concatenated together. This makes absolutely no sense and is a stupid default to have reads put out concatenated.
A simple one:
-fasta <[line width]> FASTA only, no qualities. Optional line wrap width (set to zero for no wrapping).
Rationale: This outputs the sequences in fasta format with the specified line width. e.g. -fasta 80 will use 80 characters per line.
As the help says, if you use -fasta 0 you don’t get any wrapping.
A lot of downstream software requires fasta sequences only, and this is a good way to get those sequences. However, there are quick ways to convert fastq to fasta, and so if you think the fastq format may be useful (e.g. if you are going to do assembly), use that for the initial extraction.
-I | --readids Append read id after spot id as 'accession.spot.readid' on defline.
Rationale: This is another one in the “what do you mean this is not default?” set of options. If you split your sequences into one (using –split-spot) or two (using –split-files) files, by default the sequences get the same ID. Huh? That makes no sense and will break your downstream processes because you have duplicate sequence IDs in the same file.
If you add the -I / –readids flag, one sequence gets appended the ID .1 and the other .2. This is pretty much what every program that accepts paired-end reads expects for input, so if you use –split-spot or –split-files you should use –readids. You should.
However, it was pointed out to me (thanks, Dave!) that the –readids option breaks BWA in downstream applications. You may need to omit this if you are using BWA. Using fastq-pair may solve this issue (but I haven’t tested it yet).
-F | --origfmt Defline contains only original sequence name.
Rationale: The SRA archive has rewritten the definition line in the sequence so that it contains the SRA ID and the length of the sequence. Therefore, a header that starts out:
@DRR000979.1 GFOWKFF06HMZ8O length=104
(this sequence is in SRA entry DRR000979).
This is a matter of personal preference, but I use the modified IDs so that all the sequences have the ID of the SRA entry that they come from so you can back-track them if needed.
-C | --dumpcs <[cskey]> Formats sequence using color space (default for SOLiD). "cskey" may be specified for translation.
Rationale: If your data comes from a SOLiD machine, you may want to enable this option so you can use color space.
-B | --dumpbase Formats sequence using base space (default for other than SOLiD).
Rationale: This will ensure that your output has A, T, G, and C instead of being put into color space. We include this by default because we don’t want colorspace, even by accident.
-Q | –offset <integer> Offset to use for ASCII quality scores. The default is 33 (“!”).
Rationale: In the old days, Illumina used slightly different ways to calculate the quality scores and a slightly different offset for the quality scores. You should leave this as the default value for almost all applications.
-N | --minSpotId <rowid> Minimum spot id to be dumped. Use with "X" to dump a range. -X | --maxSpotId <rowid> Maximum spot id to be dumped. Use with "N" to dump a range.
Rationale: If you want to output one or a few spots from the SRA file, you can use -N and -X. Note that if you use this, you should probably not start at position 0, but start a few thousand reads into the file. For reasons that we are not sure about (but we have some conspiracy theories about), many fastq files have “unusual” sequences at the start of the file. We usually use -N 10000 -X 110000 to get 100,000 reads.
-M | --minReadLen <len> Filter by sequence length >= <len>
Rationale: Read length is a good proxy for sequence quality, and so you may want to filter reads below a certain length so they are ignored. Again, we post-filter the sequences, so we don’t use this one.
Technical sequences [required]
--skip-technical Dump only biological reads.
Rationale: If the sequencing was done with the “Illumina multiplex library construction protocol” the SRA entry ends up with application reads and technical reads like this:
Application Read Forward -> Technical Read Forward <- Application Read Reverse - Technical Read Reverse.
The spots look like this:
You don’t want the technical reads – you only want the biological reads – so include –skip-technical to remove those technical parts. If you omit this option and include the –split-files you actually end up with three or four files per SRA archive!
-W|–clip Apply left and right clips
Rationale: Some of the sequences in the SRA contain tags that are used e.g. for whole genome amplification and need to be removed. This will remove those sequences for you. You should enable this flag as it will trim off those sequences for you.
Aligned and unaligned sequences
--aligned Dump only aligned sequences. Aligned datasets only; see sra-stat. --unaligned Dump only unaligned sequences. Will dump all for unaligned datasets
Rationale: If you are looking for reads that map to the human genome, for example, you may want only the aligned or unaligned part. This is optional and up to you.
Read filtering [required]
Rationale: You want to filter out reads that are all N’s or otherwise completely useless. This filter can be set to pass|reject|criteria|redacted but you want only those reads that pass filtering. Otherwise you just get a bunch of Ns.
Workflow and piping
These options control the output of fastq-dump and are straightforward but worth noting.
Output to a specific directory
-O | --outdir <path> Output directory, default is current working directory ('.').
Rationale: The default directory to write to is your current working directory. This will output the fastq file to an alternate directory.
Output to standard out.
-Z | --stdout Output to stdout, all split data become joined into single stream
Rationale: By default the output is written to one or more files. This will output the data to standard out, so you can (for example) pipe it into another command. You probably don’t want to do this.
--gzip Compress output using gzip. --bzip2 Compress output using bzip2.
You can compress the sequences files using one of two standard compression algorithms, gzip or bzip2. Gzip is probably more widely supported (but only just) and several common downstream programs like bowtie2 can use both gzip and bzip2 directly.
The command that we use to extract sequences from SRA archives is:
fastq-dump --gzip --skip-technical --readids --dumpbase --split-files --clip sra_filename
Most of this was figured out by trial and error, although we thank the anonymous reviewers of our partie paper that pointed out a couple of options we didn’t regularly include but should have!