We are interested in knowing when two things are in the same, or different, taxonomic groups, and so we were curious when two things would be considered the same based on protein similarity. We used a couple of python scripts and a couple of perl scripts to generate all the pairwise alignments and calculate the average protein similarity at different taxonomic levels.

** Edit**: This post has been updated. The original results are at the bottom. I redid the calculations without preclustering at 90% as suggested by @pangenomics on twitter while the computation was running:

@linsalrob Super interesting! A Q: Do you think preclustering at 90% removes a lot of intra-species/strain dists?

— Daniel Mende (@pangenomics) April 3, 2016

**Edit 2: **As suggested by Lewis_X I have included violin plots below (because I don’t know how to jitter my outliers, but it does sound fun!)

@linsalrob violin plots would probably be better than bars. Or jitter your outliers ?

— Lewis_X (@Lewis_Lab) April 6, 2016

As part of GenomePeek we have curated databases of different proteins, for this test we just chose the RecA protein, a well known protein involved in recombination. Our data set is available in our github repository for all of this data and code. The RecA protein data set has 6,633 proteins in it, and we computed all pairwise alignments for n-choose-2 combinations of these. We performed pairwise alignments of those using the ultra-fast Needleman-Wunsch alignment code that we forked from Isaac Turner. We ran this on ~22 million pairs of proteins, just printing out the two input files and the percent identity between them. This took basically overnight a long time on one of our computes, and is limited by file I/O not compute cycles (ie. we can compute the alignment as fast as the disk can read/write the input/output!).

For FOCUS2, we also developed a normalized taxonomy file that has all the NCBI taxonomy but only for selected taxonomic levels (namely: kingdom, phylum, class, order, family, genus, species, and strain). Armed with these two pieces of information, lets calculate the average percent identity for things that are the same. We munged that table a little bit to match up the names of the organisms with those in our RecA file. We end up with a file that has the name from the RecA file followed by its taxonomy.

First, we use pairwise_percent_ids.py to calculate a table of all ids and the percent pairwise identity [figshare] between them. Then we use a PERL script [average_pairwise.pl] to join the data from that percent pairwise identity file to the taxonomy file, and test whether the different taxonomic levels are the same. This generates a JSON file that has the taxonomic level and a list of percent identities between them.

Finally, we use a Python script [plot_pairwise_percents.py] to plot that data as box plots.

Here’s the results:

For all 21,995,028 pairs of (non-identical) RecA proteins, the average similarity is 58.81% and the median similarity is 59.89%.

Here is the breakdown by phylogenetic level:

PHYLOGENETIC LEVEL | NUMBER OF SIMILARITIES | MEAN PERCENT IDENTITY | MEDIAN PERCENT IDENTITY |
---|---|---|---|

kingdom | 19,059,690 | 61.39 | 60.30 |

phylum | 6,331,565 | 69.30 | 66.84 |

class | 3,167,385 | 75.39 | 72.21 |

order | 1,289,635 | 83.40 | 88.17 |

family | 849,970 | 90.95 | 94.83 |

genus | 418,628 | 93.07 | 99.3 |

species | 241,150 | 98.62 | 100.0 |

strain | 7,084 | 96.68 | 100.0 |

And here is the graphical summary of the similarity at different phylogenetic levels.

Here are the violin plots. The red lines are the mean of the data.

Here is the original data with 90% pre-clustering with CD-HIT for comparison to the full data set above.

The original post used a data set with 1,883 proteins that had been reduced from the data set above using CD-HIT with a 90% cutoff.

For all 3,543,806 pairs of (non-identical) RecA proteins, the average similarity is 53.95% and the median similarity is 58.64%.

Here is the breakdown by phylogenetic level:

Phylogenetic Level | Number of similarities | Mean percent identity | Median percent identity |
---|---|---|---|

kingdom | 2,559,426 | 58.75 | 59.78 |

phylum | 590,066 | 64.44 | 65.05 |

class | 206,872 | 68.19 | 69.9 |

order | 71,824 | 70.17 | 70.46 |

family | 35,730 | 69.26 | 71.51 |

genus | 9,266 | 76.90 | 79.81 |

species | 410 | 74.49 | 79.79 |

strain | 272 | 73.26 | 78.09 |

And here is the graphical summary of the similarity at different phylogenetic levels.

This data suggests that on average proteins are ~75% identical at the genus/species/strain level, which very little (and not statistically significant) differences between those levels. The average similarity drops to the low 70%/high 60% for family, class, order; and then kingdom and phylum are in the 50% – mid-60% similarity.