Author Archives: Robert

What is Bioinformatics?

The term “bioinformatics” has been defined in many different ways since its first use more than 20 years ago. However, the interdisciplinary application of computers to biological data has always been part of the definition. Here is a small collection of (short) answers to the question “What is bioinformatics?”:

“Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioral or health data, including those to acquire, store, organize, archive, analyze, or visualize such data.” – NIH Biomedical Information Science and Technology Initiative (2000)

“Bioinformatics is the application of computer technology to the management of biological information. Computers are used to gather, store, analyze and integrate biological and genetic information which can then be applied to gene-based drug discovery and development.” – (2001?)

“Bioinformatics is the application of statistics and computer science to the field of molecular biology. It includes computational biology, algorithm development, statistics techniques, data modeling and visualization.” – Owen White (2010)

“Bioinformatics is a science where we integrate computer science, genetics and genomics.” – Atul Butte (2010)

“Bioinformatics is the application of computer science and information technology to the field of biology and medicine.” – (2012)

For a more detailed definition of the term bioinformatics, take a look at the list provided by the International Society for Computational Biology (ISCB) or the Bioinformatics FAQ at

Profiling / Benchmarking Perl code

There is an easy way to measure the performance of every part of your Perl code – it’s called NYTProf.

If you don’t have it yet, install the profiling module Devel::NYTProf:

sudo perl -MCPAN -e 'install Devel::NYTProf'

Then run your Perl script with an additional call to the profiler:

perl -d:NYTProf your_script.pl input.file

The -d:NYTProf switch starts debug mode and is shorthand for -MDevel::NYTProf (it loads the module Devel::NYTProf before running your Perl script).

The profiler produces the file nytprof.out. Please note that the profiler will add some additional processing time to your script.

The last step is to generate the HTML output that will show you all the results of the profiler (including the time spent on each line of code used while running the script).

nytprofhtml -o nytprof -f nytprof.out

The -o defines the output directory where all the HTML files will be written to and the -f defines the input file name (useful if you want to compare multiple runs, otherwise it can be ignored as it defaults to nytprof.out).

Now open the index.html in the output directory and start improving your code!


Entries in the NCBI Taxonomy Database

The NCBI Taxonomy database currently contains 865,348 different taxonomy entries that can be accessed using a unique identifier (the taxid). This unique identifier is an integer from 1 to 1,154,685 that can be used to access database entries at different taxonomic levels (kingdom, phylum, …). The graphic below summarizes the content of the NCBI Taxonomy database and highlights the phyla or families with the most entries. Most taxonomy entries by domain are for Eukaryota, followed by Bacteria and Viruses. Unclassified sequences include, for example, entries for metagenomes.



Perl one liner to extract sequences by their ID from a FASTA file

The first one-liner is useful if you only want to extract a few sequences by their identifier from a FASTA file.

perl -ne 'if(/^>(\S+)/){$c=grep{/^$1$/}qw(id1 id2)}print if $c' fasta.file

This will extract the two sequences with the sequence identifiers id1 and id2. You only have to change the identifiers within the parentheses and separate them by spaces to extract the sequences you need.


If you have a large number of sequences that you want to extract, then you most likely have the sequence identifiers in a separate file. Assuming that you have one sequence identifier per line in the file ids.file, you can use this one-liner:

perl -ne 'if(/^>(\S+)/){$c=$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' ids.file fasta.file



Three easy ways to download multiple sequences from NCBI

There are several ways to download multiple sequences from the NCBI databases in a single request.


1) Using the batch Entrez website



2) Using Perl: (copy into your terminal and press return/enter)

perl -e 'use LWP::Simple;getstore("".join(",",qw(6701965 6701969 6702094 6702105 6702160)),"seqs.fasta");'

The one-liner takes the IDs separated by spaces and the name of the FASTA file that the sequences will be written to (seqs.fasta). If you want data from a database other than the nucleotide database, you will have to change the database name as well.


3) Using your browser: (paste this to the address field),6701969,6702094,6702105,6702160
This time the IDs are separated by commas. As before, if you need data from a different database, you just have to change the database name.

A List of software developed in our lab

Please use the tools Q&A site if you have any questions related to the tools listed below.

Jump to:

Tools for metagenomic and metatranscriptomic data
Tools for genomic data
Tools for phage data
Tools for mobile and other handheld devices
Tools for coral reef biology

Tools for metagenomic and metatranscriptomic data

PRINSEQ
A sequence processing tool that can be used to filter, reformat and trim genomic and metagenomic sequence data. It generates summary statistics of the input in graphical and tabular format that can be used for quality control steps. PRINSEQ is available as both standalone and web-based versions.
TagCleaner
Tool to automatically detect and efficiently remove tag sequences (e.g. WTA or MID tags) from metagenomic datasets. TagCleaner is available as both standalone and web-based versions.
DeconSeq
Tool to automatically detect and efficiently remove any type of known sequence contamination from metagenomic datasets. The tool uses a modified version of the BWA-SW aligner and can be applied to longer-read datasets (150+bp read length). DeconSeq is available as both standalone and web-based versions.
riboPicker
Tool to automatically identify and efficiently remove rRNA-like sequences from metatranscriptomic datasets. The tool was designed to process longer-read datasets (150+bp read length), but works on 100+bp reads too. riboPicker is available as both standalone and web-based versions.


Xipe-totec
A statistical tool written by Beltran Rodriguez in Forest Rohwer’s lab. We used it to analyze the first pyrosequencing-based metagenomes (from the Soudan Mine in Minnesota) and the original Sargasso samples from Venter. You can download the source code of Xipe-totec, or use the online version of the tool.


Metagenome annotation services across several platforms of technology, which include CGI scripting and web services (RTMg.web), the new Android cell phone operating system (RTMg.mob), and all OpenSocial-based social network sites (RTMg.os)
SEED database tools
The SEED database of microbial genomes is used in many projects. We work with our colleagues at Argonne National Laboratory to develop the SEED and tools that access and use it.
crAss: Reference-independent comparative metagenomics using cross-assembly
crAss is a web tool for comparative metagenomics using cross-assembly.
FOCUS: An Alignment-Free Model To Identify Organisms In Metagenomes Using Non-Negative Least Squares
FOCUS is an innovative and agile model to profile and report organisms present in metagenomic samples based on composition usage without sequence length dependencies.
SUPER-FOCUS: An agile homology-based approach using a reduced SEED database to report the subsystems present in metagenomic samples and profile their abundances.
CCOM: This web-based tool is used to reconstruct the uncultured genome from environmental samples.

Tools for genomic data

Scaffold_builder
Software to order contigs generated by draft sequencing along a reference sequence. Gaps are filled with N’s, small overlaps are aligned with Muscle, and the consensus is created with IUPAC codes. Scaffold_builder can help in the assembly and annotation of genomes by revealing what is missing and allowing targeted sequencing to close those gaps.

Tools for phage data


We are developing the PHage ANnotation TOols and MEthods (PhAnToME) site in collaboration with researchers in Arizona, Florida, and Virginia. Most of those tools are being released through that website.


PHAge Classification Tool Set, or PHACTS, is a web-based program that is used to calculate whether a phage is lytic or lysogenic.


Phage Proteomic Tree
A phylogenetic classic that is so good it’s in textbooks!


The Phage SEED
Database for annotating and comparing phage genomes.


The Phage Biobike
A graphical programming language for non-computer scientists to analyze biological data.


Phage Eco-Locator
Tool to identify phages in metagenomes.


A novel algorithm for finding prophages in microbial genomes that combines similarity-based and composition-based strategies.

Tools for mobile and other handheld devices

Mobile Metagenomics
An app that allows you to annotate metagenomes using Android phones. The source code is hosted on Google Code. Also check out our YouTube video!
GenomeSearch
An interactive app for searching genomes at the SEED.


Documentation for a tool which handles custom capture, save, and viewing of images taken with an Allied Vision Technologies camera.
Source code is available on github at:

Tools for coral reef biology

Barracuda Coral Reef Photography
New software that allows you to analyze images from coral reefs and identify movement from the background noise.

Random Selection

Student Research Symposium 2012

{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/srs2012{/gallery}









Conferences & Meetings

{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/misc-conferences{/gallery}










Lab Group Pictures

{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/lab-group{/gallery}


Lab trip to the Google I/O Conference

{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/2009-google-io{/gallery}




{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/2011-lab-bbq{/gallery}










{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/2011-phantome-meeting{/gallery}









Abrolhos Islands Field Trip, Brazil

{gallery sortcriterion=filename sortorder=ascending cols=5 alignment=left-float}sigplus/2011-abrolhos-islands{/gallery}

Add Posters to the Page Publications->Posters

(1) Convert the poster file into a PDF file

(2) Create a screen capture of the PDF file (e.g. import the PDF to Photoshop) and save it as a GIF file with the longest dimension being 200px

(3) Copy the PDF file to the server under /var/www/html/labsite/media/poster/

(4) Copy the GIF file to the server under /var/www/html/labsite/images/stories/poster/

(5) Add the new poster image and description to the poster web page (make sure it is added to the correct year)

That’s it!
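If you add posters regularly, steps (3) and (4) could be wrapped in a small shell function like the sketch below. The function name add_poster and the SITE_ROOT variable are made up for this example. The function assumes that it is run on the server itself and that the two target directories already exist.

```shell
# Hypothetical helper: copy a poster PDF and its GIF thumbnail into place.
# SITE_ROOT can be set to a test directory; by default the function uses
# the location named in steps (3) and (4).
add_poster() {
    pdf=$1
    gif=$2
    site_root=${SITE_ROOT:-/var/www/html/labsite}
    # Step (3): the PDF goes to media/poster/.
    # Step (4): the GIF goes to images/stories/poster/.
    cp "$pdf" "$site_root/media/poster/" &&
    cp "$gif" "$site_root/images/stories/poster/"
}
```

A call such as add_poster myposter.pdf myposter.gif (file names are placeholders) would then copy both files. Step (5), editing the poster web page, still has to be done by hand.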