Author Archives: Rob Edwards

New features in CentOS8

We have just moved most of the servers from CentOS6 or CentOS7 to CentOS8. Almost everything should now be unified on CentOS8 (unless you know what you are doing).

This brings several new changes (as always) and some added benefits. This is a summary and does not reflect all the changes.

To check your server’s operating system version, use this command:

cat /etc/redhat-release
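
If you need the version inside a script, a small helper can pull out the major release number. This is only a minimal sketch: the function name release_major is hypothetical, and it assumes the file uses the usual "... release 8.x ..." wording.

```shell
# Hypothetical helper: print the major release number from a redhat-release style file.
# Defaults to /etc/redhat-release; pass another path as the first argument.
release_major() {
    # Keep only the first run of digits that follows the word "release"
    sed -n 's/.*release \([0-9][0-9]*\).*/\1/p' "${1:-/etc/redhat-release}" | head -n 1
}

# Only call it where the file exists (i.e. on a Red Hat style system)
if [ -r /etc/redhat-release ]; then
    release_major
fi
```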

Software Installs

The biggest change is that you should now be able to install software by yourself! There are two easy ways to install software, provided that whatever you are trying to install supports one of them.

Please note that if you do not want to do either of these, that is fine. Just let me know and I am happy to install software for you (and everyone else) to use.

Conda

A lot of bioinformatics software is now available via conda. It is installed globally, but you cannot install packages globally. Instead, you can create your own environment and install packages into that.

The first time you use conda, you will need to create a local environment. Start with:

source /usr/local/anaconda3/bin/activate
conda create --name <username>

But use your username instead of <username>!

After this has run, any time you need to use conda, you can use the command

conda activate <username>

And you will get into your environment. 
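
The first-time setup can be made safe to rerun by checking whether the environment already exists. This is a sketch only: the function name env_exists is hypothetical, and it assumes you have already sourced the activate script shown above and that the environment is named after $USER.

```shell
# Hypothetical helper: succeed if a conda environment with the given name exists.
# `conda env list` prints one environment per line, with the name in the first column.
env_exists() {
    conda env list | awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}

# Usage (after sourcing the activate script):
#   env_exists "$USER" || conda create --name "$USER"
#   conda activate "$USER"
```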

A simple test is to install my fastq-pair package and see if it works:

conda install -c bioconda fastq-pair

Once it has installed, this command should give some output:

fastq-pair

Docker

Another popular way of sharing software is by using docker. We don’t support docker, but we support a drop-in replacement called podman.

Anywhere you see docker, you can use podman instead. For example, we created a focus docker image for the cami challenge described here: https://hub.docker.com/r/linsalrob/cami-focus and you can install that with

podman pull linsalrob/cami-focus
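
If you have scripts or instructions that call docker directly, a shell function can forward those calls to podman. This is a minimal sketch, assuming podman is on your PATH and that the commands you run are ones podman supports; the wrapper is only an illustration.

```shell
# Illustrative wrapper: forward every docker call to podman, arguments unchanged.
docker() {
    podman "$@"
}

# Usage, once the function is defined in your shell:
#   docker pull linsalrob/cami-focus
```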

pip

If you are trying to run some python code and don’t have the appropriate library, you should be able to use pip install as a user to add it. For example:

pip3 install --user xmlschema

This will install the appropriate libraries into your account. Of course, if you want them globally installed, just let me know.
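
To see where a --user install puts its files, you can ask Python for its per-user site-packages directory. This is a small sketch, assuming python3 is on your PATH; the function name user_site is hypothetical.

```shell
# Hypothetical helper: print the per-user site-packages directory,
# which is where `pip3 install --user` places libraries.
user_site() {
    python3 -m site --user-site
}

# Usage:
#   user_site
#   ls "$(user_site)"
```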

Deprecated software and alternatives

Deprecated | Alternate | Used For | Alternative
screen | tmux | Virtual terminals. You should use this! | tmux has similar keys to screen but uses ctrl-b instead of ctrl-a to access them, e.g. create a new window: “ctrl-b c”
cd-hit | mmseqs | Clustering sequences | cd-hit is still an option if you want, but mmseqs2 appears to be much better

Download a genome and remove the ribosomal RNA operon

For our SRA search engine, we want to remove the ribosomal RNA operon (not just the 16S gene, the whole operon) before we run the search; otherwise all our hits are to the rRNA genes!

Here’s how you can use PATRIC to download a genome and remove the rRNA operon. For the example, we’re going to use a Faecalibacterium prausnitzii genome, because, well, why not!

First, we download the genome and convert the GTO to fasta

p3-gto 657322.3
rast-export-genome -i 657322.3.gto contig_fasta > 657322.3.fna

Next, we use a couple of helper scripts from the EdwardsLab Git Repo. We start by converting the gto to a tab separated file with features and their locations

python3.7 ~/EdwardsLab/patric/parse_gto.py -f 657322.3.gto -p > 657322.3.tab

Then we can grep through that file for the ribosomal genes:

grep rna 657322.3.tab | grep Subunit

We only find two of the genes:

fig|657322.3.rna.5      Large Subunit Ribosomal RNA; lsuRNA; LSU rRNA   FP929046 586941 - 589785 (-)

fig|657322.3.rna.6      Small Subunit Ribosomal RNA; ssuRNA; SSU rRNA   FP929046 590567 - 591540 (-)

Now we can trim out the sequences and keep only the non-rRNA regions. Note that here I trim an extra 10 kb off each side of the rRNA genes, but you may not wish to do that.

python3.7 ~/EdwardsLab/manipulate_genomes/trim_fasta.py -f 657322.3.fna -e 576941 -c FP929046 > FP929046.fna
python3.7 ~/EdwardsLab/manipulate_genomes/trim_fasta.py -f 657322.3.fna -b 601540 -c FP929046 >> FP929046.fna

We run this twice, which is suboptimal, but this is definitely not the most computationally challenging thing we will do with those sequences!
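
The two cut points used above can also be computed from the grep output rather than by hand. This is a sketch under a few assumptions: the function name trim_bounds is hypothetical, each input line ends with the fields START - END (STRAND) as in the output shown earlier, all lines are on the same contig, and the padding defaults to the 10,000 bp used in the commands above.

```shell
# Hypothetical helper: read rRNA feature lines on stdin and print the
# -e and -b values used in the two trim_fasta.py commands.
# Optional first argument: padding in bp (default 10000).
trim_bounds() {
    awk -v pad="${1:-10000}" '
        NF >= 4 {
            lo = $(NF - 3) + 0   # feature start
            hi = $(NF - 1) + 0   # feature end
            if (min == "" || lo < min) min = lo
            if (hi > max) max = hi
        }
        END { printf "-e %d -b %d\n", min - pad, max + pad }
    '
}

# Usage:
#   grep rna 657322.3.tab | grep Subunit | trim_bounds
```

With the two features found above, this reproduces the 576941 and 601540 values passed to trim_fasta.py.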

Connecting to an anvi’o server on tatabox

We use anvi’o for all sorts of ‘omics analysis, but it is a pain to run on your laptop as you can’t watch netflix and youtube, check facebook, and post to twitter at the same time (well, you can, but why would you?).

Instead, we have the latest version of anvi’o installed on tatabox, one of the machines in our HPC environment. After you have run all the anvi-commands, very often you want to launch anvi-interactive, but tatabox is safely behind a firewall. 

We can make a two-step connection to tatabox using port tunneling. Depending on how you do this, you may need up to three terminals open.

First, start anvi-interactive on tatabox, and keep that window open (or use screen or tmux which are much better alternatives).

Next, open a terminal on your computer, and use this command. Change XXXX to a port near the one that you ssh to on edwards-data, change YYYY to the port that you normally use, and change USERNAME to your username.

ssh -L 5555:localhost:XXXX -N -p YYYY USERNAME@edwards-data.sdsu.edu

Next, open another terminal (or if you are using screen or tmux, open a new window), and log in to edwards-data.sdsu.edu using your normal account (the USERNAME from above).

On edwards-data, run this command:

ssh -L XXXX:localhost:8080 -N USERNAME@tatabox

Finally, on your laptop, you should open a new browser window and paste this URL:

http://localhost:5555/

You should see the anvi-interactive interface appear, and you can get to work.
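
Since the same port has to appear in both commands, it can help to generate them together. The helper below only prints the two commands and does not open any connection; the function name tunnel_commands and the example values in the usage line are hypothetical.

```shell
# Hypothetical helper: print the two ssh commands for the double tunnel.
# $1 = intermediate port (XXXX), $2 = ssh port on edwards-data (YYYY), $3 = username
tunnel_commands() {
    echo "laptop:       ssh -L 5555:localhost:$1 -N -p $2 $3@edwards-data.sdsu.edu"
    echo "edwards-data: ssh -L $1:localhost:8080 -N $3@tatabox"
}

# Usage (made-up example values):
#   tunnel_commands 7001 2222 alice
```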

Virus Hunting in the Cloud Codeathon v2!

We are pleased to announce the second installment of the Virus Hunting Codeathon!

From 4-6 November, 2019, the NCBI will help run a bioinformatics codeathon in College Park, MD  hosted by the UMIACS and CBCB at the University of Maryland. We are going to put a few hundred thousand metagenomic datasets on cloud infrastructure and further identify known, taxonomically definable and novel viruses with even faster approaches!  We’re specifically looking for folks who have experience in Computational Virus Hunting or adjacent fields! If this describes you, please apply! This event is for researchers, including students and postdocs, who are already engaged in the use of bioinformatics data or in the development of pipelines for virological analyses from high-throughput experiments. The event is open to anyone selected for the codeathon and willing to travel to College Park (see below).

Working groups of five to six individuals will be formed into five to eight teams.  These teams will build pipelines to analyze large datasets within a cloud infrastructure. 

TOPICS

  • Fast, federated indexing
    • Big Query
  • Metadata features 
  • Genome graphs for viruses
  • Approximate taxonomic analysis
  • Domain/HMM Boundary and Taxonomic Refinement
  • Bringing together approximate taxonomy and domain models
  • Sequence data quality metrics
  • Phage-host interactions

The final list of projects will be unveiled before the codeathon starts, and will build off of previous NCBI codeathons.

Organization

After a brief organizational session, teams will spend three days addressing a challenging set of scientific problems related to a group of datasets. Participants will analyze and combine datasets in order to work on these problems. We will be writing code and solving problems.

Throughout the three days, we will break out to discuss progress on each of the topics, bioinformatics best practices, coding styles, etc.

Datasets

Datasets will come from public repositories, with a focus on metagenomics datasets in the sequence read archive that have been ported to cloud infrastructure, as well as contigs derived from them.

Products

All pipelines and other scripts, software, and programs generated in this codeathon will be added to a public GitHub repository designed for that purpose (currently github.com/NCBI-Hackathons, but a new one may exist by the event).

Manuscripts describing the design and usage of the software tools constructed by each team may be submitted to an appropriate journal such as the F1000Research hackathons channel, BMC Bioinformatics, GigaScience, Genome Research or PLoS Computational Biology.  Ideally, we will present a searchable, streamlined virological index from these datasets on cloud infrastructure.

How To Apply

To apply, please complete this form (approximately 10 minutes to complete). Initial applications are due Monday, October 7th, 2019 by 3 pm ET. Participants will be selected based on the experience and motivation they provide on the form.

Prior participants and applicants are especially encouraged to apply. The first round of accepted applicants will be notified on October 8th by 11:59 pm ET, and have until October 11th at 4 pm ET to confirm their participation.  International applicants or those with particular skillsets may be accepted early. If you confirm, please make sure it is highly likely you can attend, as confirming and not attending prevents other data scientists from attending this event. Please include a monitored email address, in case there are follow-up questions.

Note: Participants will need to bring their own laptop to this program. A working knowledge of scripting (e.g., Shell, Python, R) is useful but not necessary to be successful in this event. Employment of higher level scripting or programming languages may also be useful. Applicants must be willing to commit to all three days of the event.

No financial support for travel, lodging or meals is available for this event. Also, note that the codeathon may extend into the evening hours each day. Please make any necessary arrangements to accommodate this possibility. Depending on the number of people that need accommodation, we will attempt to get a group rate at one of the local hotels. Please indicate on the registration form if you need a hotel room.

There will be no registration fee or cost associated with attending this event.

For more information, or with any questions, please contact Ben Busby (ben.busby@nih.gov).

Press about Global Phylogeography of crAssphage

Our paper on the global phylogeography of crAssphage is published in Nature Microbiology. You can read the paper at the Nature Microbiology website or on ReadCube. The paper garnered international press attention, and here we have summarized the press coverage.

Please let Rob know if you are aware of any other reports that are not included here.
