Edwards Lab

Delivering the best in bioinformatics...

Calculating the average and standard deviation in perl

The following two subroutines can be used as a drop-in replacement for Math::NumberCruncher. This is useful when trying to limit the number of dependencies of your Perl script.
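The Perl subroutines themselves are in the full post. As a rough illustration of the calculation only, here is a minimal Python sketch. The function names and the `sample` switch are assumptions made for this example, and whether the original divides by n or by n - 1 is not stated here, so both forms are offered.

```python
import math


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    if not values:
        raise ValueError("mean() needs at least one value")
    return sum(values) / len(values)


def standard_deviation(values, sample=True):
    """Standard deviation of a sequence of numbers.

    sample=True divides by n - 1 (sample form); sample=False divides by n
    (population form). Which one the original subroutine uses is an
    assumption left open here.
    """
    n = len(values)
    if n < (2 if sample else 1):
        raise ValueError("not enough values for a standard deviation")
    avg = mean(values)
    squared_diffs = sum((x - avg) ** 2 for x in values)
    return math.sqrt(squared_diffs / (n - 1 if sample else n))


if __name__ == "__main__":
    data = [2, 4, 4, 4, 5, 5, 7, 9]
    print(mean(data))
    print(standard_deviation(data, sample=False))
```

Using only the standard library keeps the sketch in the spirit of the post: no extra dependency just to get two summary statistics.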

Read more: Calculating the average and standard deviation in perl

How to remove human DNA sequence contamination from metagenomes

The immense amount of metagenomic data produced today requires an automated approach for data processing and analysis. Before any downstream analysis is performed, the datasets should be preprocessed to ensure the quality of the data and prevent erroneous conclusions. One step of your data preprocessing (usually the last) should be to check for sequence contamination (DNA from sources other than the sample). This post will show you how to identify and remove human sequence contamination from metagenomes, but the approach can also be applied to any other type of sequence dataset or contamination.

Read more: How to remove human DNA sequence contamination from metagenomes

Perl subroutine to extract fasta sequences from a file

The following Perl subroutine will read in a FASTA-formatted file, parse the file, and return all the sequences as a reference to a hash table.
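The Perl subroutine is in the full post. To illustrate the idea of parsing FASTA into a lookup table, here is a minimal Python sketch that returns a dictionary instead of a hash reference. The function name `read_fasta` and the choice to key each sequence by the first whitespace-delimited word of its header are assumptions made for this example, not details taken from the original.

```python
import io


def read_fasta(handle):
    """Return {sequence_id: sequence} parsed from a FASTA-formatted handle.

    Assumption: the ID is the first word after '>' on the header line.
    If the same ID appears twice, the later record replaces the earlier one.
    """
    chunks = {}
    current_id = None
    for raw_line in handle:
        line = raw_line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without an identifier")
            current_id = fields[0]
            chunks[current_id] = []
        elif current_id is None:
            raise ValueError("sequence data found before the first header")
        else:
            # Sequences may be wrapped over several lines; collect the pieces.
            chunks[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in chunks.items()}


if __name__ == "__main__":
    example = io.StringIO(">seq1 first example\nACGT\nTTGA\n>seq2\nGGCC\n")
    for seq_id, sequence in read_fasta(example).items():
        print(seq_id, sequence)
```

For a file on disk, the same function can be called as `read_fasta(open(path))`, where `path` is whatever FASTA file you want to load.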

Read more: Perl subroutine to extract fasta sequences from a file

How to convert FASTQ to FASTA

The following examples show how to convert a FASTQ file to a FASTA file. The commands assume a Unix-based operating system with Perl.
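The post's own commands are behind the link. As a sketch of what such a conversion involves, the Python function below copies the header and sequence lines of each record and drops the separator and quality lines. It assumes the common four-line FASTQ layout (sequence and quality not wrapped over multiple lines); the name `fastq_to_fasta` is made up for this example.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Convert four-line FASTQ records to FASTA; return the record count.

    Records are read by position (header, sequence, '+', quality), so a
    quality line that happens to start with '@' is not mistaken for a header.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        if not header.strip():
            continue  # skip blank lines between records
        sequence = fastq_handle.readline()
        separator = fastq_handle.readline()
        quality = fastq_handle.readline()
        if not header.startswith("@") or not separator.startswith("+") or not quality:
            raise ValueError("malformed or truncated FASTQ record: " + header.strip())
        # '@id description' becomes '>id description'; quality is discarded.
        fasta_handle.write(">" + header[1:].rstrip("\r\n") + "\n")
        fasta_handle.write(sequence.rstrip("\r\n") + "\n")
        count += 1
    return count


if __name__ == "__main__":
    source = io.StringIO("@read1\nACGT\n+\nIIII\n")
    target = io.StringIO()
    fastq_to_fasta(source, target)
    print(target.getvalue(), end="")
```

Because it works on file handles, the same function can stream a large file without loading it into memory.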

Read more: How to convert FASTQ to FASTA

Contamination of sequencing data (Pt. 2)

It is amazing how easily the processing of samples can lead to contamination of data. Something like 22% of sequenced genomes contain AluY elements from the human genome. As noted in the following posting from The Scientist, this alarming discovery could also indicate that sequenced genomes are contaminated by DNA from other sources, such as the commonly used E. coli, which could be problematic when working with other bacterial genomes. This possibility could have grave consequences when it comes to evaluating horizontal gene transfer.



It should be noted, though, that as per the article in The Scientist, this is only applicable to female scientists ("But probably the most common contaminant is the scientist herself." from paragraph 4).

Project Update: Multi-threading or Cluster Computing?

Recently, I've been faced with a problem: my metagenome comparator program is running too slowly. The main reason is that it performs the same operations many times in a loop. These operations include reading lines from text, creating objects, inserting those objects into a data structure, retrieving those objects from the data structure, and writing the data structures to disk (just to name a few). So it would be natural to suggest to someone in my position to parallelize it all, and that's exactly what I want to do. However, I've never written any type of parallel application, and thus I need to do a little bit of learning and researching into parallel programming. (More of my ramblings after the Read More break)

Read more: Project Update: Multi-threading or Cluster Computing?

Extreme caution is needed when sequencing

One of my friends from U of I sent me an interesting discussion of a recent paper with a potentially fatal error... Note the article's editor... hehe. Anyway, the paper talks about a gene horizontally transferred from the human genome to the genome of the intracellular pathogen Neisseria gonorrhoeae (http://mbio.asm.org/content/2/1/e00005-11.full).

The following blog has an interesting description of a more plausible reason for the published finding:


Reminds me of the story about Shewanella and Burkholderia being the clearly dominating organisms in one of the Sargasso Sea metagenomes...(http://www.nature.com/nrmicro/journal/v3/n6/pdf/nrmicro1158.pdf)

How to create a database for BWA and BWA-SW

The following example shows how to create a database from the human reference genome for use with BWA, BWA-SW and DeconSeq. The commands used below assume a Unix-based operating system.

Read more: How to create a database for BWA and BWA-SW

CSRC YouTube Video

Watch the CSRC YouTube video and see some of your favorite scientists at work!


Will NCBI ever update taxonomy data?

One thing that keeps annoying me while I'm trying to update phage metadata is how heavily I'm relying on NCBI Taxonomy. Obviously, I know I can't blame them, since they post a not-so-funny disclaimer at the end of every record:

Disclaimer: The NCBI taxonomy database is not an authoritative source for nomenclature or classification - please consult the relevant scientific literature for the most reliable information.

However, it is really annoying, and it reflects everything else in NCBI: so static, unlike everything else on the web in the past 5 years... Now, to update the record of each of about 55 phages described as "unclassified" in NCBI, I have to spend anywhere between 10 and 100 minutes, and I may still end up without an answer. Take, for example, Gifsy-1 and Gifsy-2, two of the most famous prophages of Salmonella. They are lambdoid phages, i.e. tailed siphoviruses. Yet their NCBI records say: unclassified!

ICTV doesn't seem to be doing any better with individual viruses (see their latest list).

This is not why I started this post anyway. I'm trying to document the "evidence" behind my taxonomy updates to the metadata table, because it would be too messy to include these data in each cell of the Google Doc. However, maybe later we will come up with a 'taxonomy evidence' record, as we have an annotation evidence one in the SEED database (called 'Feature evidence').

Metadata evidence:

  • Gifsy: from Salmonella: Methods and Protocols @ Google Books
  • Enterobacteria phage YYZ-2008: This one is tricky. BLASTN shows that its best matches are Enterobacteria phage 2851 (classified as Podoviridae) and Stx2-converting phage 1717 (classified as Siphoviridae)

The reference for the number of phages on Earth

I have always taken for granted (and used) 10^31 as the number of phages on the planet. Normally, this is calculated from the estimate that there are 10 phages per prokaryotic cell, and the number of prokaryotic cells is estimated to be 10^30, giving 10 x 10^30 = 10^31. Usually the references for these numbers are: Jiang & Paul 1998, PMID 9687430 and Whitman 1998, PMID 9618454

Today I found what might be an older reference: Bergh et al. 1989, PMID 2755508, High abundance of viruses found in aquatic environments

Once I get access to the full-text paper ("thanks to" Nature's unwillingness to open even older articles), I can confirm the exact phage number as claimed in 1989.

If you know of a better (aka older) reference, feel free to share it.

This number (10^31), by the way, can be read as: ten nonillion (by the US numbering system)
