edwardslab

Lab blog

Entries in the NCBI Taxonomy Database

The NCBI Taxonomy database currently contains 865,348 different taxonomy entries that can be accessed using a unique identifier (the taxid). This unique identifier is an integer from 1 to 1,154,685 that can be used to access database entries at different taxonomic levels (kingdom, phylum, ...). The graphic below summarizes the content of the NCBI Taxonomy database and highlights the phyla or families with the most entries. Most taxonomy entries by domain are for Eukaryota, followed by Bacteria and Viruses. Unclassified sequences include, for example, entries for metagenomes.

[Figure: ncbi_taxon_entries]

 

Perl one-liners to extract sequences by their ID from a FASTA file

The first one-liner is useful if you only want to extract a few sequences by their identifier from a FASTA file.

perl -ne 'if(/^>(\S+)/){$c=grep{$_ eq $1}qw(id1 id2)}print if $c' fasta.file

This will extract the two sequences with the sequence identifiers id1 and id2. To extract the sequences you need, change the identifiers within the parentheses and separate them with spaces.

 

If you have a large number of sequences that you want to extract, then you most likely have the sequence identifiers in a separate file. Assuming that you have one sequence identifier per line in the file ids.file, you can use this one-liner:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file
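For readers who want to see the logic of the one-liners step by step, here is a minimal sketch of the same idea written out as a subroutine. The name extract_by_id and the in-memory arrays are illustrative assumptions and are not part of the original one-liners, which read directly from files.

```perl
use strict;
use warnings;

# Hypothetical helper: return the FASTA lines of all records whose ID is wanted.
# $ids   - array ref of sequence identifiers (one per line in ids.file)
# $lines - array ref of lines read from the FASTA file
sub extract_by_id {
    my ($ids, $lines) = @_;
    my %wanted = map { $_ => 1 } @$ids;   # plays the role of %i in the one-liner
    my $keep = 0;                         # plays the role of $c in the one-liner
    my @out;
    for my $line (@$lines) {
        # A header line decides whether the record that follows is kept.
        if ($line =~ /^>(\S+)/) {
            $keep = exists $wanted{$1};
        }
        push @out, $line if $keep;
    }
    return @out;
}

# Small in-memory example instead of real files.
my @fasta = (">id1 first\n", "ACGT\n", ">id2 second\n", "GGCC\n", ">id3 third\n", "TTAA\n");
print extract_by_id(['id1', 'id3'], \@fasta);
```

The hash lookup is why the second one-liner scales to long ID lists: each header costs one lookup, however many identifiers were loaded.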

 


 

Three easy ways to download multiple sequences from NCBI

There are several ways to download multiple sequences from the NCBI databases in a single request.

 

1) Using the batch Entrez website

http://www.ncbi.nlm.nih.gov/sites/batchentrez

 

2) Using Perl: (copy into your terminal and press return/enter)

perl -e 'use LWP::Simple;getstore("http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=text&id=".join(",",qw(6701965 6701969 6702094 6702105 6702160)),"seqs.fasta");'

This takes the IDs separated by spaces and the filename of the FASTA file that will be generated with the sequences (seqs.fasta). If you want data from a database other than nucleotide, you will have to change the database name (the db parameter) as well.

 

3) Using your browser: (paste this to the address field)

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nucleotide&rettype=fasta&retmode=text&id=6701965,6701969,6702094,6702105,6702160

This time the IDs are separated by commas. As above, if you need data from a different database, you just have to change the db parameter.
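Both the Perl and the browser approach use the same URL, so it can help to build it from its parts. The following is a minimal sketch; the helper name efetch_url is a hypothetical choice for illustration, and only the URL pattern itself comes from the examples above.

```perl
use strict;
use warnings;

# Hypothetical helper: build an efetch URL for a database and a list of IDs.
sub efetch_url {
    my ($db, @ids) = @_;
    my $base = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi";
    # IDs are joined with commas, as in the browser example.
    return "$base?db=$db&rettype=fasta&retmode=text&id=" . join(",", @ids);
}

# Same request as above, built from its parts.
my $url = efetch_url("nucleotide", qw(6701965 6701969 6702094 6702105 6702160));
print "$url\n";

# To download, pass the URL to LWP::Simple as in the one-liner:
#   use LWP::Simple; getstore($url, "seqs.fasta");
```

Changing the first argument (for example to protein) is all that is needed to query a different database.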
 

Forking jobs on a multiprocessor machine

If you're running on a multi-threaded or multi-processor machine but your application doesn't take advantage of the additional CPUs, you can run lots of jobs simultaneously. We use this approach to run the same application (e.g. phispy or mauve) on multiple genomes. Here's a code snippet to allow you to run as many jobs as you want simultaneously.
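As a rough illustration of the idea (this is not the snippet from the full post), here is a minimal sketch that uses Perl's built-in fork, exec and wait to keep a fixed number of jobs running at once. The name run_jobs and the echo commands are placeholders, and the sketch assumes a Unix-like system.

```perl
use strict;
use warnings;

# Hypothetical helper: run the given commands, at most $max at a time.
# Returns the number of commands that exited with a non-zero status.
sub run_jobs {
    my ($max, @commands) = @_;
    my ($running, $failed) = (0, 0);
    for my $cmd (@commands) {
        # When all slots are busy, wait for any child to finish first.
        if ($running >= $max) {
            wait();
            $failed++ if $?;
            $running--;
        }
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {
            # Child: replace this process with the command.
            exec($cmd) or die "exec failed: $!";
        }
        $running++;
    }
    # Collect the children that are still running.
    while ($running-- > 0) {
        wait();
        $failed++ if $?;
    }
    return $failed;
}

# Placeholder commands; substitute one real command per genome.
my @jobs = map { "echo processing genome$_" } 1 .. 4;
my $failures = run_jobs(2, @jobs);
print "$failures job(s) failed\n";
```

Setting the first argument to the number of available CPUs is a reasonable starting point when each job uses a single core.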

 

Read more...
 

Parsing Newick Trees

We often need to parse "newick" format phylogenetic trees to figure out some information. Writing a parser is good for the soul, because the best way to do it is through recursion.

After the readmore, I provide some Perl code for parsing newick phylogenetic trees into a lightweight data structure. Each node consists of an array of three things: [left child, right child, distance]. If the node is a leaf, then the node consists of ["node", the node name, distance]. This allows for very easy analysis of the tree, and simple ways to get data back. I also provide some example code for printing out the root-to-tip distance of every leaf in the tree.
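To make the data structure concrete, here is a minimal recursive sketch. It is not the code from the full post: the names parse_newick, _parse_node and tip_distances are assumptions for illustration, and the sketch only handles strictly binary trees, ignores internal node labels, and treats a missing distance as 0.

```perl
use strict;
use warnings;

# Internal node: [left child, right child, distance]
# Leaf:          ["node", name, distance]
sub parse_newick {
    my ($newick) = @_;
    $newick =~ s/\s+//g;     # drop whitespace and newlines
    $newick =~ s/;$//;       # drop the terminating semicolon
    my $tree = _parse_node(\$newick);
    die "unexpected trailing text in Newick string\n" if length $newick;
    return $tree;
}

# Consume one node from the front of the string referenced by $s.
sub _parse_node {
    my ($s) = @_;
    my @node;
    if ($$s =~ s/^\(//) {
        # "(" starts an internal node: recurse for both children.
        my $left = _parse_node($s);
        $$s =~ s/^,// or die "expected ',' in Newick string\n";
        my $right = _parse_node($s);
        $$s =~ s/^\)// or die "expected ')' in Newick string\n";
        $$s =~ s/^[^:,()]*//;            # ignore an optional internal label
        @node = ($left, $right);
    } else {
        $$s =~ s/^([^:,()]+)// or die "expected a leaf name in Newick string\n";
        @node = ("node", $1);
    }
    # Optional branch length after a colon.
    my $dist = ($$s =~ s/^:([^,()]+)//) ? $1 : 0;
    return [@node, $dist];
}

# Collect the root-to-tip distance of every leaf as { name => distance }.
sub tip_distances {
    my ($node, $so_far, $result) = @_;
    $so_far //= 0;
    $result //= {};
    $so_far += $node->[2];
    if (!ref $node->[0]) {               # leaves start with the string "node"
        $result->{ $node->[1] } = $so_far;
    } else {
        tip_distances($node->[0], $so_far, $result);
        tip_distances($node->[1], $so_far, $result);
    }
    return $result;
}

my $tree = parse_newick("((A:1,B:2):0.5,C:4);");
my $dist = tip_distances($tree);
print "$_\t$dist->{$_}\n" for sort keys %$dist;
```

Because every node is a three-element array, any analysis follows the same pattern as tip_distances: check whether the first element is a reference, then either handle the leaf or recurse into both children.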

 

Read more...
 

Server admin day a success, again!

More space, and a drive that is not warning of impending doom! Here are some tips and tricks for upgrading the disk capacity of a server with minimal downtime, and even letting users know about it!

 

Read more...
 

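Reverse complement function in Perl

This function also handles the IUPAC ambiguity codes:

# Return the reverse complement of a DNA sequence.
sub reverse_complement {
    my $new = $_[0];
    # Complement each base; S, W and N are their own complements and stay unchanged.
    $new =~ tr/acgtrymkbdhvACGTRYMKBDHV/tgcayrkmvhdbTGCAYRKMVHDB/;
    $new = reverse($new);
    return $new;
}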

 

 

Edwards Lab On TV!

Our recent expedition to the Abrolhos Islands off the coast of Brazil was featured on Good News on RedeTV! You can watch the full video on the RedeTV! website, or below. This show also includes an Ion Torrent, if you are watching carefully. The show is in two parts because you can't get all that corally goodness in just one segment.

Here are the shows on the RedeTV! website: Part 1 and Part 2, and a local version is below.

 

Read more...
 

Mapping UniRef100 to PhAnToMe

UniRef100 is another non-redundant database. In this post, I describe how to map the UniRef100 proteins to the proteins in the PhAnToMe database and get the subsystems for each.

This is similar to the description of how to map things to the SEED using the SEED servers, but this time we'll download everything and do it locally.

 

Read more...
 

Real time metagenomics

A while ago, we developed the Real Time Metagenomics web site (aka metagenomics using k-mers) and related applications to allow rapid annotation of metagenomic sequences using the SEED subsystems. In this post we discuss how this works, and how you can use real time metagenomics, either through the web site or directly on your own computer to analyze your data.

 

Read more...
 

SEED to GO Mapping using the SEED servers

The SEED contains most complete microbial and phage genomes, and includes an ontology built by annotators for annotators. The SEED systems contain the most complete microbial annotations anywhere.

The Gene Ontology project (GO) aims to unify annotations, but has long had a focus on eukaryotes and has repeatedly ignored prokaryotes. We got tired of building tables that map SEED functions to GO functions, so this post shows you how to do the mapping using the SEED servers, which lets you update the comparison any time you like.

 

Read more...
 