After reading “Tripp H.J., et al. Misannotations of rRNA can now generate 90% false positive protein matches in metatranscriptomic studies.”, I had some additional questions that were not addressed:
How many unique genes in protein databases are similar to known rRNA sequences at the nucleotide level?
What genes will most likely be removed from metatranscriptomes when removing rRNA-like sequences?
If you had similar questions, then this site will give you some answers: http://edwards.sdsu.edu/rrnavsprot/
The detailed steps are described in the readme file linked on the site above. Using the default parameters, the most common genes with similarities to rRNAs (or misannotations) in both the RefSeq and SEED databases are transposases.
The comparison was done between RefSeq or SEED and the rrnadb database from riboPicker (http://ribopicker.sourceforge.net/). Alternatively, SINA (http://www.arb-silva.de/aligner/sina-download/) could have been used in a similar approach.
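The readme has the actual steps. As a rough illustration only of the last part, finding the most common annotations among genes that matched rRNA, here is a minimal Python sketch. It assumes the hits have already been reduced to (gene ID, annotation) pairs; the function name `top_annotations` and the example data are hypothetical and are not taken from the site or from riboPicker.

```python
from collections import Counter

def top_annotations(hits, n=3):
    """Return the n most common annotations among unique rRNA-like genes.

    hits: iterable of (gene_id, annotation) pairs (hypothetical input format).
    """
    unique = {}
    for gene_id, annotation in hits:
        # Count each gene once, even if it matched several rRNA sequences.
        # Annotations are lower-cased so "Transposase" and "transposase" merge.
        unique.setdefault(gene_id, annotation.strip().lower())
    return Counter(unique.values()).most_common(n)

# Made-up example data, for illustration only.
hits = [
    ("gene1", "Transposase"),
    ("gene1", "Transposase"),
    ("gene2", "transposase"),
    ("gene3", "hypothetical protein"),
]
print(top_annotations(hits))
```

Counting unique gene IDs rather than raw hits keeps a single gene with many rRNA matches from dominating the ranking.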