Sometimes when you look at a record in RefSeq/GenBank it is a virtual record that is really a pointer to a set of records. For example, the entry for Callorhinchus milii isolate IMCB2004 points you to the WGS records AAVX02000001-AAVX02067420. Here we show how to get these records.
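The excerpt above does not show how the records are actually retrieved, so the following is only a minimal sketch of one preparatory step: expanding a WGS range such as AAVX02000001-AAVX02067420 into individual accession numbers. The function name `expand_accession_range` and the regular expression are assumptions made for illustration, not part of the original post.

```python
import re

# Assumed WGS accession layout: a letter prefix, a two-digit assembly
# version, then a zero-padded contig number (e.g. AAVX + 02 + 000001).
ACCESSION = re.compile(r"([A-Z]{4,6}\d{2})(\d+)")


def expand_accession_range(range_text):
    """Yield every accession in a range like 'AAVX02000001-AAVX02067420'."""
    first, last = (part.strip() for part in range_text.split("-"))
    first_match = ACCESSION.fullmatch(first)
    last_match = ACCESSION.fullmatch(last)
    if not first_match or not last_match:
        raise ValueError(f"Unrecognised accession range: {range_text}")
    if first_match.group(1) != last_match.group(1):
        raise ValueError("Both ends of the range must share the same prefix")
    prefix = first_match.group(1)
    width = len(first_match.group(2))  # keep the zero padding intact
    start, stop = int(first_match.group(2)), int(last_match.group(2))
    for number in range(start, stop + 1):
        yield f"{prefix}{number:0{width}d}"


accessions = list(expand_accession_range("AAVX02000001-AAVX02067420"))
print(len(accessions), accessions[0], accessions[-1])
```

The resulting list could then be passed, in batches, to whichever download tool you prefer.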
As part of the STRIDES initiative, the NIH has moved the SRA to the cloud. This includes the metadata, and the whole SRA archive. Here, I show how to set up a new instance to access the sequence read archive in the cloud. In a separate post, we’ll explore getting the metadata out of bigtable.
We have just made the transition of most of the servers from CentOS6 or CentOS7 to CentOS8. Most everything should be unified on CentOS8 (unless you know what you are doing).
This brings several new changes (as always) and some added benefits. This is a summary and does not reflect all the changes.
To check your server's operating system version, use this command:
The biggest changes should allow you to install software by yourself! There are two ways you can easily install software, if either is supported by whatever you are trying to install.
Please note that if you do not want to do either of these, that is fine. Just let me know and I am happy to install software for you (and everyone else) to use.
A lot of bioinformatics software is now available via conda. It is installed globally, but you cannot install packages globally. Instead, you can create your own environment and install packages there.
The first time you use conda, you will need to create a local environment. Start with:
conda create --name <username>
But use your username instead of <username>.
After this has run, any time you need to use conda, you can use the command
conda activate <username>
And you will get into your environment.
A simple test is to install my fastq-pair package and see if it works:
conda install -c bioconda fastq-pair
Once it has installed, this command should give some output:
Another popular way of sharing software is by using docker. We don’t support docker, but we support a drop-in replacement called podman.
Anywhere you see docker, you can use podman instead. For example, we created a focus docker image for the cami challenge described here: https://hub.docker.com/r/linsalrob/cami-focus and you can install that with
podman pull linsalrob/cami-focus
If you are trying to run some python code and don’t have the appropriate library, you should be able to use pip install as a user to add it. For example:
pip3 install --user xmlschema
This will install the appropriate libraries into your account. Of course, if you want them globally installed, just let me know.
Deprecated software and alternatives
| Deprecated | Alternative | Purpose | Notes |
|---|---|---|---|
| screen | tmux | Virtual terminals. You should use this! | tmux has similar keys to screen but uses |
| cd-hit | mmseqs | Clustering sequences | cd-hit is still an option if you want, but mmseqs2 appears to be much better |
For our SRA search engine, we want to remove the ribosomal RNA operon (not just the 16S gene, the whole operon) before we run the search, otherwise all our hits are to the rRNA genes!
Here’s how you can use PATRIC to download a genome and remove the 16S region. For the example, we’re going to use a Faecalibacterium prausnitzii genome, because, well, why not!
First, we download the genome and convert the GTO to fasta
p3-gto 657322.3
rast-export-genome -i 657322.3.gto contig_fasta > 657322.3.fna
Next, we use a couple of helper scripts from the EdwardsLab Git Repo. We start by converting the gto to a tab separated file with features and their locations
python3.7 ~/EdwardsLab/patric/parse_gto.py -f 657322.3.gto -p > 657322.3.tab
Then we can grep through that file for the ribosomal genes:
grep rna 657322.3.tab | grep Subunit
We only find two of the genes:
fig|657322.3.rna.5 Large Subunit Ribosomal RNA; lsuRNA; LSU rRNA FP929046 586941 - 589785 (-)
fig|657322.3.rna.6 Small Subunit Ribosomal RNA; ssuRNA; SSU rRNA FP929046 590567 - 591540 (-)
Now we can trim out the sequences and keep only the non-rRNA regions. Note that here I trim a little extra off the sequences, but you may not wish to do that.
python3.7 ~/EdwardsLab/manipulate_genomes/trim_fasta.py -f 657322.3.fna -e 576941 -c FP929046 > FP929046.fna
python3.7 ~/EdwardsLab/manipulate_genomes/trim_fasta.py -f 657322.3.fna -b 601540 -c FP929046 >> FP929046.fna
We run this twice, which is suboptimal, but this is definitely not the most computationally challenging thing we will do with those sequences!
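To make the coordinates above easier to follow: the two genes span 586941 to 591540 on FP929046, and the commands cut 10,000 bp further on each side (586941 - 10000 = 576941 and 591540 + 10000 = 601540). The sketch below shows the same idea in a single pass on an in-memory sequence. It is not the code in trim_fasta.py; the function name `flanking_regions` and the 1-based inclusive coordinate convention are assumptions for illustration.

```python
def flanking_regions(sequence, region_start, region_end, padding=0):
    """Return the sequence upstream and downstream of a region.

    Coordinates are assumed to be 1-based and inclusive; `padding` extra
    bases are removed on each side of the region.
    """
    if region_start > region_end:
        raise ValueError("region_start must not exceed region_end")
    # Last base kept on the left is region_start - 1 - padding.
    left_end = max(region_start - 1 - padding, 0)
    # First base kept on the right is region_end + padding + 1.
    right_start = min(region_end + padding, len(sequence))
    return sequence[:left_end], sequence[right_start:]


# Toy contig: the "CCCC" block (positions 5-8) stands in for the rRNA region.
contig = "AAAACCCCGGGGTTTT"
upstream, downstream = flanking_regions(contig, 5, 8)
padded_upstream, padded_downstream = flanking_regions(contig, 5, 8, padding=2)
print(upstream, downstream)
print(padded_upstream, padded_downstream)
```

Returning both flanks at once avoids reading the FASTA file twice, although as noted above that is hardly the expensive part of the analysis.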
We use anvi’o for all sorts of ‘omics analysis, but it is a pain to run on your laptop as you can’t watch netflix and youtube, check facebook, and post to twitter at the same time (well, you can, but why would you?).
Instead, we have the latest version of anvi’o installed on tatabox, one of the machines in our HPC environment. After you have run all the anvi-commands, very often you want to launch anvi-interactive, but tatabox is safely behind a firewall.
We can make a two step connection to tatabox using port tunneling. Depending on how you do this, you will need three terminals open.
Next, open a terminal on your computer, and use this command. Change XXXX to a port near the one that you ssh to on edwards-data, change YYYY to the port that you normally use, and change USERNAME to your username.
ssh -L 5555:localhost:XXXX -N -p YYYY
On edwards-data, run this command:
ssh -L XXXX:localhost:8080 -N USERNAME@tatabox
Finally, on your laptop, you should open a new browser window and paste this URL:
You should see the anvi-interactive interface appear, and you can get to work.
Our paper on the global phylogeography of crAssphage is published in Nature Microbiology. You can read the paper at the Nature Microbiology website or on ReadCube. The paper garnered international press attention, and here we have summarized the press coverage.
Please let Rob know if you are aware of any other reports that are not included here.
For several years NSF ran a trial where they would ask for a conflicts form in excel-type format. Recently, that has been codified into the Collaborators and Other Affiliations Information form. You can find more information about that form at the NSF Website and the NSF GPG.
We developed a simple script to help complete this form for you. It does not do all the work, but it gets you a long way there, and you can do the rest a lot easier. After all, you have a lot of other things to worry about when you are writing that grant. Read more to see how to use it.
One of the most essential tasks in bioinformatics is counting k-mers. These are short substrings of length k. Here, we will look at counting all the k-mers in a string.
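As a minimal sketch of the idea (not necessarily the approach taken in the full post), one way to count every overlapping k-mer in a string is to slide a window of length k along it and tally each substring. The function name `count_kmers` and the example sequence are made up for illustration.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in `sequence`."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # A sequence of length n contains n - k + 1 windows of length k;
    # if the sequence is shorter than k the range is empty.
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


counts = count_kmers("ATGATGA", 3)
for kmer, count in sorted(counts.items()):
    print(kmer, count)
```

A `Counter` keeps the code short; for whole genomes or read sets, a dedicated k-mer counting tool will be far more memory efficient.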
We had a great turnout for the 2019 Student Research Symposium as usual, with everyone in the lab either presenting their work or judging the work of others. Congratulations to Dean’s award for Science winners Holly Norman and Ashelyn Lutrick for their presentation on “Analyzing the Presence, in Humans, of crAssphage: A Highly Abundant Bacteriophage Found Around the Globe”.
Here are Jillian, Melissa, Shane, and Rob in front of Jillian and Shane’s posters.