Difference between timelogic and NCBI blastall

From EdwardsLab

Spurious differences between TimeLogic and NCBI Blastall

Overview

We've been seeing odd, inexplicable differences between Timelogic BLAST and NCBI BLASTALL results. This page is part of that investigation. Here I show that the scoring is somewhat different, and that the calculation is a bit different too.

Methods

Compare NCBI blastall using this command:

    blastall -d ~/databases/SEED_2007_01_01 -i test.fa -o seed-blastall.100.out -m 8 -e 100 -p blastx

With TimeLogic running this template file:

    [comment] The following block should not be changed by the user
    [algorithm] tera-blastx
    [target path] %TARGETPATH%
    [result path] %RESULTPATH%
    [query type] nt
    [query search] 1 2 3 -1 -2 -3
    [target type] aa
    [target frames] 1
    
    [comment] The following block will rarely be changed by the user
    [neighborhood threshold] off
    [word size] 3
    [query increment] 1
    [extension threshold] 12
    [gapped alignment] banded
    [matrix] %MATRIXPATH%/blosum62.maa
    [open penalty] -1
    [extend penalty] -1
    
    [comment] The following block will often be changed by the user
    [query filter] on
    [max scores] 500
    [max alignments] 500
    [significance] evalue
    [threshold] significance=0.1
    [output format] tab percentage fieldrecord 
    [field] querylocus targetlocus percentalignment alignmentlength matches gaps querystart queryend targetstart targetend significance score querylength targetlength

In theory these should be the same parameters. I ran six of Mike's test sequences against the SEED_2007_01_01 database. The statistics of the database are:

    Database: SEED non-redundant database, January 1st 2007 
                4,270,186 sequences; 1,500,715,585 total letters

    File names:
    /home/redwards/databases/SEED_2007_01_01.00
       Date: Jan 6, 2007 12:56 PM    Version: 4    Longest sequence: 36,621 res
    /home/redwards/databases/SEED_2007_01_01.01
       Date: Jan 6, 2007 12:59 PM    Version: 4    Longest sequence: 36,805 res 


The TimeLogic search used the same database, formatted with:

   /decypher/cli/bin/dc_new_target_rt -source /home/redwards/databases/SEED_2007_01_01 -template format_aa_into_aa -targ SEED_2007_01_01_formatted


Results

Example BLAST results

This table shows the corresponding values from one of the alignments, chosen essentially at random.

Raw output. Note that NCBI blast is standard -m 8 format, and Timelogic is as shown above.

NCBI blastall	10908495	xxx02961237     30.86   81      55      3       428     189     93      167	87		28.9
Timelogic 	10908495	xxx02961237     30      182     56      40      2       158     80      218	1.4e-019        99.58   159     386


    Parameter            NCBI blastall    TimeLogic
    query locus          10908495         10908495
    target locus         xxx02961237      xxx02961237
    percent alignment    30.86            30
    alignment length     81               182
    matches              26(*)            56
    gaps                 3                40
    query start          428              2
    query end            189              158
    target start         93               80
    target end           167              218
    significance         87               1.4e-019
    score                28.9             99.58


   (* note that NCBI blastall -m 8 output reports the number of mismatches, 55 in this case, rather than matches, so 26 is 81-55)


Correlation of scores

[Figure: correlation between NCBI and Timelogic scores]

For all the sequences that were deemed similar by both NCBI blastall and Timelogic, I compared the scores. You can download that comparison as an Excel file and also see the table in HTML.

Here is an image showing the correlation between the scores:

The good thing is that there is a correlation. The bad thing is that it is weak, it fits a power law rather than a straight line, and I can't explain where it comes from.
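A minimal sketch of how such a paired comparison could be assembled from the two result files. It assumes the score sits in the twelfth column of both outputs, which matches the standard NCBI -m 8 layout and the [field] line of the TimeLogic template above. The function names `read_scores` and `paired_scores` are hypothetical and are not part of either tool.

```python
def read_scores(lines, score_col=11):
    """Map (query, target) -> best score from whitespace-delimited hit lines."""
    scores = {}
    for line in lines:
        fields = line.split()  # tolerates tabs, repeated tabs and spaces
        if line.startswith("#") or len(fields) <= score_col:
            continue  # skip comments and short or blank lines
        key = (fields[0], fields[1])
        score = float(fields[score_col])
        # A query/target pair can have several alignments; keep the best one.
        if key not in scores or score > scores[key]:
            scores[key] = score
    return scores


def paired_scores(ncbi_lines, timelogic_lines):
    """Return {(query, target): (ncbi_score, timelogic_score)} for shared hits."""
    ncbi = read_scores(ncbi_lines)
    timelogic = read_scores(timelogic_lines)
    shared = ncbi.keys() & timelogic.keys()
    return {key: (ncbi[key], timelogic[key]) for key in shared}
```

Passing the open file handles for the two outputs (for example `seed-blastall.100.out` and the TimeLogic result file) gives the pairs that can then be plotted against each other.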

Discussion

I am not sure of the differences either! It appears that there is a relationship (not linear) between scores.

Based on this one dataset, you could correlate NCBI and Timelogic scores like this, to work out which Timelogic cutoff corresponds to the NCBI cutoff you want.

    10^    NCBI blastall    Timelogic
    1      10               5.52408E-13
    0      1                4E-14
    -1     0.1              2.89641E-15
    -2     0.01             2.0973E-16
    -3     0.001            1.51866E-17
    -4     0.0001           1.09966E-18
    -5     0.00001          7.96269E-20
    -6     0.000001         5.76581E-21
    -7     0.0000001        4.17503E-22
    -8     0.00000001       3.02315E-23
    -9     0.000000001      2.18907E-24
    -10    1E-10            1.58511E-25


NOTE: This is very dangerous, as n=1. This should be repeated lots of times!
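The table can be turned into a conversion function. The sketch below assumes both columns are significance cutoffs and fits a straight line in log-log space, which is the power law mentioned above. The pairs are copied from the table; the function names `fit_power_law` and `ncbi_to_timelogic` are hypothetical. The same n=1 warning applies to anything derived from it.

```python
import math

# (NCBI value, Timelogic value) pairs copied from the table above.
TABLE = [
    (1e1, 5.52408e-13), (1e0, 4e-14), (1e-1, 2.89641e-15),
    (1e-2, 2.0973e-16), (1e-3, 1.51866e-17), (1e-4, 1.09966e-18),
    (1e-5, 7.96269e-20), (1e-6, 5.76581e-21), (1e-7, 4.17503e-22),
    (1e-8, 3.02315e-23), (1e-9, 2.18907e-24), (1e-10, 1.58511e-25),
]


def fit_power_law(pairs):
    """Least-squares fit of log10(timelogic) = intercept + slope * log10(ncbi)."""
    xs = [math.log10(ncbi) for ncbi, _ in pairs]
    ys = [math.log10(timelogic) for _, timelogic in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return slope, intercept


def ncbi_to_timelogic(ncbi_value, slope, intercept):
    """Predict the Timelogic cutoff that corresponds to an NCBI cutoff."""
    return 10 ** (intercept + slope * math.log10(ncbi_value))


slope, intercept = fit_power_law(TABLE)
```

On these twelve rows the fitted exponent comes out at roughly 1.14, so each tenfold drop in the NCBI value corresponds to a little more than a tenfold drop in the Timelogic value.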

Summary

Something is going on with the way the stats are being calculated. The NCBI statistics are described at the NCBI site. I don't know of a site that describes how the Timelogic statistics are calculated.
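For reference, the NCBI statistics relate a bit score S' to an E value as E = m * n * 2^(-S'), where m is the query length and n is the database length. The sketch below applies that relation with raw lengths. That is an assumption: blastall uses corrected effective lengths, so its reported E value will be somewhat smaller. The function name `evalue_from_bit_score` is hypothetical, and nothing here describes how Timelogic computes its significance.

```python
def evalue_from_bit_score(bit_score, query_len, db_len):
    """Approximate E = m * n * 2**(-S') using raw, uncorrected lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


# Example alignment from the table above: bit score 28.9, a 159 nt query
# translated to 53 amino acids, against 1,500,715,585 database letters.
approx_e = evalue_from_bit_score(28.9, 159 // 3, 1_500_715_585)
```

For the example alignment this gives a value of the same order as the 87 that blastall reported, which is consistent with the NCBI side behaving as documented.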

Update, May 2009

I received an email from TimeLogic suggesting some alternate parameters that I could use to bring TeraBlast into agreement with NCBI blastall. Since I believe that these may be of interest to others, I am appending my results below.

PLEASE NOTE: I believe that the above information is factually correct. I believe that all of the statements I have made representing the differences between Timelogic's BLAST and NCBI BLASTALL are correct. If there are misstatements I will correct them. I will not remove the data unless it is wrong since it is (a) part of the education process to learn how experiments work and how data is interpreted, (b) if the data is correct, then there is no reason to remove the data since I am not misrepresenting anything, and (c) I will, as time permits, present the data based on an email from TimeLogic below. If I have made a factual error, I apologize and will correct it ASAP.


Note also that I state quite clearly that n=1, and you should be careful. Your results should hopefully vary.

New Parameters

Distilling the email, these are the parameters that are suggested.

Note in particular the values for the following options, which differ from those above and which I could not easily identify in the documentation when I was looking. (There are also a couple of minor differences, e.g. the E value.)

open penalty; extend penalty; neighborhood threshold; extension threshold; output format (see below);

In the output format:

numbernucleic: Writes the alignment data with respect to the original nucleic positioning when the data is translated from nucleotide to amino.

countdowncomplement: For alignments performed with a complement strand, shows the complement sequence according to its position in the original direct strand. Numbering of the alignment counts down from the end to the beginning; the end point of the alignment is shown lower than the start.

(These descriptions are taken from the DeCypher documentation).

   [comment] A blastx definition file that was supplied by TimeLogic
   [comment] This should more closely mimic the blastall searches
   [neighborhood threshold] off
   [comment] The following block should not be changed by the user
   [algorithm] tera-blastx
   [target path] %TARGETPATH%
   [result path] %RESULTPATH%
   [query type] nt
   [query search] 1 2 3 -1 -2 -3
   [target type] aa
   [target frames] 1

   [comment] The following block will rarely be changed by the user
   [neighborhood threshold] 12
   [word size] 3
   [query increment] 1
   [extension threshold] 20
   [gapped alignment] banded
   [matrix] %MATRIXPATH%/blosum62.maa
   [open penalty] -11
   [extend penalty] -1

   [comment] The following block will often be changed by the user
   [query filter] on
   [max scores] 500
   [max alignments] 250
   [significance] evalue
   [threshold] significance=0.1
   [output format] tab percentage numbernucleic countdowncomplement fieldrecord
   [field] querylocus targetlocus percentalignment alignmentlength matches gaps querystart queryend targetstart targetend significance score
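
To see exactly which settings changed between the original template and the one supplied by TimeLogic, the two files can be compared key by key. This is a minimal sketch that assumes the `[key] value` line syntax visible in the templates above; it is not taken from the DeCypher documentation. Where a key appears twice (as [neighborhood threshold] does in the new template) the sketch keeps the last value, which is also an assumption, since this page does not establish which one DeCypher honours. The function names are hypothetical.

```python
def parse_template(text):
    """Parse '[key] value' lines into a dict, ignoring [comment] lines."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("["):
            continue  # blank lines or anything that is not a setting
        key, _, value = line[1:].partition("]")
        if key == "comment":
            continue
        settings[key] = value.strip()  # assumption: a later duplicate wins
    return settings


def diff_templates(old_text, new_text):
    """Return {key: (old_value, new_value)} for settings that differ."""
    old = parse_template(old_text)
    new = parse_template(new_text)
    return {
        key: (old.get(key), new.get(key))
        for key in sorted(old.keys() | new.keys())
        if old.get(key) != new.get(key)
    }
```

Run on the two templates on this page, the open penalty (-1 versus -11) and the extension threshold (12 versus 20) are among the settings reported as different.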