This tutorial is not one of those! It is quick, hands-on, and we’ll jump right in with little explanation (OK, there will be some). But I encourage you to read those tutorials, especially Titus’s tutorials as you will learn a lot from them.
Snakemake is a versatile pipeline manager for many kinds of bioinformatics analysis, but its handling of wildcards is not transparent. Here are some tips and tricks that we have gathered to help you process lots of files easily.
The tools fastq-dump and fasterq-dump are used to extract reads from the Sequence Read Archive and export them to (for example) fastq format. There is a hidden gotcha that you should be aware of when using fastq-dump to extract data.
Sometimes when you look at a record in RefSeq/GenBank it is a virtual record that is really a pointer to a set of records. For example, the entry for Callorhinchus milii isolate IMCB2004 points you to the WGS records AAVX02000001-AAVX02067420. Here we show how to get these records.
As part of the STRIDES initiative, the NIH has moved the SRA to the cloud. This includes both the metadata and the whole archive. Here, I show how to set up a new instance to access the sequence read archive in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.
We have just transitioned most of the servers from CentOS6 or CentOS7 to CentOS8. Almost everything should now be unified on CentOS8 (unless you know what you are doing).
This brings several new changes (as always) and some added benefits. This is a summary and does not reflect all the changes.
To check your server’s operating system version, use this command:
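The command itself is not reproduced in this excerpt. As a minimal sketch, and not necessarily the command the post originally used, one common approach is to read `/etc/os-release`. The function name `os_version` is hypothetical, and the file path is an assumption: it should be present on CentOS 7 and 8, while older releases may only provide `/etc/redhat-release`.

```shell
# Hypothetical helper: print the human-readable OS name and version.
# Assumption: the release file follows the os-release format (KEY="value" lines).
os_version() {
    # Default to /etc/os-release, or read the file given as the first argument.
    # Source it in a subshell so its variables do not leak into your session.
    ( . "${1:-/etc/os-release}" && echo "$PRETTY_NAME" )
}
```

Calling `os_version` with no argument reads the system file; passing a path lets you inspect a copy taken from another machine.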
The biggest changes should allow you to install software by yourself! There are two different ways you can easily install software, if either is supported by whatever you are trying to install.
Please note that if you do not want to do either of these, that is fine. Just let me know and I am happy to install software for you (and everyone else) to use.
A lot of bioinformatics software is now available via conda. Conda itself is installed globally, but you cannot install packages globally. Instead, you can create your own environment and then use that.
The first time you use conda, you will need to create a local environment. Start with:
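The exact command is not shown in this excerpt. As a hedged sketch of what creating a personal environment typically looks like, the block below wraps `conda create` in a small function. The environment name `my_env` and the function name `make_env` are hypothetical, and this may differ from the command the post originally gave.

```shell
# Hypothetical environment name; choose your own.
ENV_NAME="my_env"

# Hypothetical helper: create an empty conda environment with the given name.
make_env() {
    # --yes skips the confirmation prompt; --name sets the environment name
    conda create --yes --name "$1"
}
```

Running `make_env "$ENV_NAME"` creates the environment, after which `conda activate my_env` switches to it so that packages you install land in your own environment rather than the global installation.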
We use anvi’o for all sorts of ‘omics analysis, but it is a pain to run on your laptop as you can’t watch netflix and youtube, check facebook, and post to twitter at the same time (well, you can, but why would you?).
Instead, we have the latest version of anvi’o installed on tatabox, one of the machines in our HPC environment. After you have run all the anvi-commands, very often you want to launch anvi-interactive, but tatabox is safely behind a firewall.
We can make a two-step connection to tatabox using port tunneling. Depending on how you do this, you may need up to three terminals open.
First, start anvi-interactive on tatabox, and keep that window open (or use screen or tmux, which are much better alternatives).
Next, open a terminal on your computer, and use this command. Change XXXX to a port near the one that you ssh to on edwards-data, change YYYY to the port that you normally use, and change USERNAME to your username.
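The command is not reproduced in this excerpt, so the following is only a sketch of what the first hop of such a tunnel usually looks like with `ssh -L`. It keeps the placeholders XXXX, YYYY, and USERNAME from the text. GATEWAY is a hypothetical stand-in for the full edwards-data hostname, which is not given here, and forwarding the same port number on both ends is an assumption. The helper `tunnel_cmd` only prints the command so you can check it before running it.

```shell
# Hypothetical helper: print the ssh command for the first hop of the tunnel.
#   $1 = XXXX     (port to forward; assumed to be the same on both ends)
#   $2 = YYYY     (the ssh port you normally connect to)
#   $3 = USERNAME
#   $4 = GATEWAY  (stand-in for the edwards-data hostname, not given in the text)
tunnel_cmd() {
    # -L local_port:host:remote_port forwards a local port through the ssh connection
    echo "ssh -L $1:localhost:$1 -p $2 $3@$4"
}

# Print the command with the placeholders still in place.
tunnel_cmd XXXX YYYY USERNAME GATEWAY
```

Once the placeholders are replaced with real values, the printed line can be pasted into the terminal on your computer. A second forward from the gateway to tatabox would still be needed to complete the two-step connection.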