Author Archives: Ramy

Will NCBI ever update taxonomy data?

An annoying thing that keeps coming up while I’m updating phage metadata is how heavily I rely on NCBI Taxonomy. Of course, I know I can’t blame them, since they post a not-so-funny disclaimer at the end of every record:

Disclaimer: The NCBI taxonomy database is not an authoritative source for nomenclature or classification – please consult the relevant scientific literature for the most reliable information.

However, it is really annoying, and it reflects everything else in NCBI: so static, unlike everything else on the web in the past 5 years… Now, to update the record of each of about 55 phages described as “unclassified” in NCBI, I have to spend anywhere between 10 and 100 minutes, and I may still end up without an answer. Take, for example, Gifsy-1 and Gifsy-2, two of the most famous prophages of Salmonella. They are lambdoid phages, i.e. tailed siphoviruses. Yet, their NCBI records say: unclassified!

ICTV doesn’t seem to be doing any better with individual viruses (see their latest list).

This is not why I started this post anyway. I’m trying to document the “evidence” behind my taxonomy updates to the metadata table, because it would be too messy to include these data in each cell of the Google Doc. However, maybe later we will come up with a ‘taxonomy evidence’ record, just as we have an annotation evidence one in the SEED database (called ‘Feature evidence’).

Metadata evidence:

  • Gifsy: from Salmonella: Methods and Protocols @ Google Books
  • Enterobacteria phage YYZ-2008: This one is tricky. BLASTN shows that its best matches are Enterobacteria phage 2851 (classified as Podoviridae) and Stx2-converting phage 1717 (classified as Siphoviridae)

The reference to number of phages on Earth

I have always taken for granted (and used) the number 10^31 for the phages on the planet. Normally, this is calculated from the estimate that there are 10 phages per prokaryotic cell, and that prokaryotic cells number about 10^30, so 10 × 10^30 = 10^31. The usual references for these numbers are: Jiang & Paul 1998, PMID 9687430 and Whitman 1998, PMID 9618454

Today I found what might be an older reference: Bergh et al. 1989, PMID 2755508, High abundance of viruses found in aquatic environments

Once I get access to the full-text paper (“thanks to” Nature’s unwillingness to open even older articles), I can confirm the exact phage number as claimed in 1989.

If you know of a better (aka older) reference, feel free to share it.

This number (10^31), by the way, can be read as: ten nonillion (by the US numbering system)

You gotta lyse that lysin before the lysin lyses you!

Phages kill bacteria. That’s their ultimate goal. Yet, they have to maintain the bacterial cell integrity until they’re done with making new phage particles. So, they carefully control the bacterial genome till they replicate their DNA and package it in nascent phage particles. Once these are formed and are ready to leave, they need to leave. They engage in a highly timed and orchestrated procedure of poking holes in the bacterial membranes (using phage holins), degrading the bacterial peptidoglycan-based cell wall, then—if the bacterial host happens to be a gram-negative cell—breaking the outer membrane too!

In the event a phage decides to remain “dormant” inside a bacterium, things get a bit more complicated. A so-called “arms race” is generated. For bacteria, phages are time bombs that can be induced at any time to kill the bacteria. How would bacteria avoid this fatal vampirish ending? They have to “tolerate mutations” in the phage’s most dangerous protein-encoding genes. If the gene that controls phage induction is damaged, this may salvage the bacteria. Other tempting targets are the lysis modules! If lysins or holins are disabled, the dormant prophages may remain captive forever (or rather until prince “helper phage” comes and frees them from that peptidoglycan-walled prison).

So, if you’re a bacterium, it’s smart to disable the lysin genes, one way or another. If you’re a scientist studying bacterial and phage genomes, there is no better way to find this out than using the subsystems-based SEED server. Using subsystems allows you to find out how closely related phages and prophages may have very different lysin genes. In the diagram below, a bunch of staphylococcal phage and prophage genomes are compared. You will notice immediately how some of their lysins (in Red, labeled # 1) are sometimes truncated. A truncated lysin is bad news for a phage. It means the phage is on its way to be enslaved by the bacterium for long years to come!

Truncated and intact lysins in staphylococcal phages



Phages are the most abundant biological entities on the planet and have had tremendous impact on biological sciences; however, phage genomes lag behind bacterial and eukaryotic genomes in the quality of annotation. For this purpose, the PhAnToMe project was launched to establish a phage annotation database, a rapid annotation pipeline for phage genomes (PhiRAST or phage Rapid Annotation Using Subsystem Technology), and a graphic programming interface for biologists (using the BioBIKE interface).

The PhAnToMe project involves multiple research centers in the United States and includes several stages. The SDSU center is in charge of developing phage genomic subsystems, phage protein families (FIGFams), and subsequently the first release of PhiRAST. As a member of this team, I am in charge of building or coordinating subsystems, and of establishing links with the phage research community. In addition, once the PhiRAST is developed, I will also be in charge of coordinating training workshops and developing testable hypotheses based on the PhAnToMe annotations and subsystems.

Perl tips: saving a hash to the disk

From: Perl Cookbook

use Storable;
store(\%hash, "filename");             # store() takes a reference to the hash
# later on...
$href = retrieve("filename");          # by ref
%hash = %{ retrieve("filename") };     # direct to hash

OR From Perl Monks

# Save ($main is a reference to the data to be saved)
use Data::Dumper;
$Data::Dumper::Purity = 1;
open FILE, ">$outfile" or die "Can't open '$outfile':$!";
print FILE Data::Dumper->Dump([$main], ['*main']);
close FILE;

# Restore
open FILE, $infile or die "Can't open '$infile':$!";
undef $/;                              # slurp the whole file at once
eval <FILE>;
close FILE;
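For what it's worth, here is a minimal round-trip sketch of the Storable approach that can be run as is. The hash contents and the temporary file are made up for illustration, and the variable names (%hash, $href, %copy) are just placeholders, not anything from the Cookbook recipe.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempfile);

# Made-up example data; nested structures survive the round trip too.
my %hash = (
    alpha => 1,
    beta  => [ 2, 3 ],
);

# A throwaway file for the demo (assumption: any writable path would do).
my ($fh, $filename) = tempfile(UNLINK => 1);
close $fh;

# store() wants a reference, hence the backslash.
store(\%hash, $filename);

# Later on: get the data back, either as a reference...
my $href = retrieve($filename);

# ...or copied straight into a new hash.
my %copy = %{ retrieve($filename) };

print "alpha = $href->{alpha}, last beta = $copy{beta}[1]\n";
```

Storable files are binary and tied to Perl, so the Data::Dumper route is probably the better pick when the saved file should stay human-readable.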

Perl tips: how to read a .gz file

Well, and since I’m in a blogging mode, here is one more thing I didn’t know about Perl (I can easily count the things I know!):

Question: Can Perl read a .gz file?

Answer: Of course. Ask the right question now.

Question: How can Perl read a .gz file?

Answer: Still not good enough a question…

Question: What is at least one way for Perl to read a .gz file?

Answer: Try:

if ($file =~ /\.gz$/) {
    open(IN, "gunzip -c $file |") || die "can't open pipe to $file";
} else {
    open(IN, $file) || die "can't open $file";
}

while (<IN>) {
    # each line of the file is now in $_
}

Answer Credit: Rob Edwards
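Another way, sketched here as an alternative rather than as Rob's method: the core module IO::Uncompress::Gunzip reads .gz files without shelling out to gunzip. The helper name open_maybe_gz and the temporary test file are hypothetical, made up just so the snippet runs on its own.

```perl
use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);
use File::Temp qw(tempfile);

# Hypothetical helper: returns something readable with <$in>,
# whether the file is gzipped or plain.
sub open_maybe_gz {
    my ($file) = @_;
    if ($file =~ /\.gz$/) {
        my $z = IO::Uncompress::Gunzip->new($file)
            or die "can't gunzip $file: $GunzipError";
        return $z;
    }
    open(my $fh, '<', $file) or die "can't open $file: $!";
    return $fh;
}

# Build a small gzipped file for the demo.
my ($tmp_fh, $tmp_name) = tempfile(SUFFIX => '.gz', UNLINK => 1);
close $tmp_fh;
my $text = "line one\nline two\n";
gzip(\$text => $tmp_name) or die "gzip failed: $GzipError";

# Read it back line by line.
my $in = open_maybe_gz($tmp_name);
my @lines;
while (defined(my $line = <$in>)) {
    chomp $line;
    push @lines, $line;
}

print scalar(@lines), " lines read\n";
```

The pipe to gunzip is shorter to type; the module version avoids depending on an external program and on how the shell treats odd file names.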

Blast output 8

Blast tabular output (#8) is arguably the most useful (usable?) blastall output. However, I always forget what the column headers are (although, theoretically, I shouldn’t). I never leave myself enough reminders. I know that I have one or two files on my computer with the column headers, but why not add one more (honestly, I may have already blogged about it):

Here are blast output 8 column headers:

query subject %id


mismatches gap