SEED to GO Mapping using the SEED servers

The SEED contains most complete microbial and phage genomes, and includes an ontology built by annotators for annotators. The SEED systems contain the most complete microbial annotations anywhere.

The Gene Ontology project (GO) aims to unify annotations, but has long had a focus on eukaryotes and has repeatedly ignored prokaryotes. Tired of building tables mapping SEED functions to GO functions, this post will show you how to do so using the SEED servers, so that you may update the comparison any time you like.


Quick summary:

  1. Download and install the SEED servers:
  2. Download the latest version of Uniprot:
  3. Download the parser:
  4. Run the parser on the uniprot file: perl uniprot_sprot.dat.gz > uniprot_md5_go_prot.txt
  5. Download the code to map the parsed uniprot file to the SEED:
  6. Run the code to compare uniprot to SEED: perl -f uniprot_md5_go_prot.txt > uniprot_seed.txt
  7. Compare the results in uniprot_seed.txt




The SEED servers are a live API to access the SEED data, including all of the genomes in the SEED. There are several different servers, and an array of scripts and programs that you can use. If you have a mac, all you need to do is download the .dmg file, click the installer, and you have instant access to the SEED. Windows and Linnux installers are equally simple. You can download the installers from the SEED servers site. If you install the server codes you should be able to complete this demonstration with a few more lines of code.

The Gene Ontology group doesn’t appear to have a list of all their annotations in a useable form, so for the purposes of this demonstration, we’ll take the GO annotations from Swiss Prot and map them to the SEED.

To start, we need to get the Swiss Prot files. I use ncftpget to download the file, but you can browse their site and choose your own version.


This will download the gzipped file that contains all of Swiss Prot data. We don’t really need it all, so I wrote a small script to parse out the pieces that I want, explicitly:

  1. The identifier
  2. An accession number
  3. The GO terms
  4. The MD5 sum (*)
  5. The translation

Note that the MD5 sum is not included in the Swiss Prot file.

Calculating the MD5 sum. The MD5 sum is a hash for any string. It is 32 characters long, and there is a simple way to compress any string of any length into the MD5 sum. We use the MD5 sum to map between sequences because it is extremely fast and because it compresses the sequences to 32 characters we can store lots of MD5 sums in memory and build fast look up tables in databases. The potential downside is that two proteins with different protein sequences could potentially have the same MD5 sum. We have not seen that happen yet. If you are interested, our rules are: (i) Upper case protein sequences; (ii) No identifiers, only the sequences; (iii) no new lines or returns; (iii) the stop codon is not included. With these rules you will always get the same MD5 for the same sequence, regardless of when you compute it.

In this example code I use the perl module Digest::MD5 to calculate the MD5 sum, and insert it into the output.

Run the parser and redirect the output using a command like this:

perl uniprot_sprot.dat.gz > uniprot_md5_go_prot.txt

This will make a table with the columns noted above.

Next, download this simple script that uses the SEED servers to get all proteins and functions that have the same MD5s as the Uniprot functions.

Run this script, and capture the output like this:

perl -f uniprot_md5_go_prot.txt > uniprot_seed.txt

You now have a text file (uniprot_seed.txt) that has the following columns:

  1. The identifier
  2. An accession number
  3. The GO terms
  4. The MD5 sum (*)
  5. The translation
  6. The FIG id
  7. The functional role of the protein
  8. The subsystem that this protein is in
  9. The first level classification of the subsystem
  10. The second level classification of the subsystem

Note that this is a one to many mapping on multiple levels — each protein is mapped to an MD5 sum which then maps to many identical proteins. Each of those proteins may be in a different subsystem, and so those are reported separately.

An example of the output that you will get is available for download [gzip compressed text file; 64M]