03-511/711, 15-495/856 Course Notes - November 17, 2005


BLAST Statistics

RECAP

LOCAL ALIGNMENT STATISTICS (Karlin and Altschul)

Statistical questions:

  1. How are S and E related?
  2. Significance of matches reported.
  3. Information content of alignments. How short can Q be?
  4. How to choose the scoring matrix S[i,j]?

See "The Statistics of Sequence Similarity Scores" for more details.



1. How are S and E related?

2. Significance of matches reported.

For each match in a database search, BLAST reports both the HSP score and its significance. The statistical significance is expressed in terms of E and is often referred to as the "e value".

The user specifies the significance threshold for the search in terms of the maximum acceptable e value. Blast will report all matches that have an e value that is lower than (i.e., more significant than) the threshold.
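The thresholding rule above can be sketched in a few lines. This is only an illustration of the rule, not BLAST's implementation; the `Hit` record, the `filter_hits` helper, and the three hits are made-up names and values.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one reported match (not a BLAST data structure)."""
    subject_id: str
    score: float
    e_value: float


def filter_hits(hits, max_e_value):
    """Keep only the hits whose e value is lower than the user's threshold."""
    return [h for h in hits if h.e_value < max_e_value]


# Made-up hits, for illustration only.
hits = [
    Hit("seqA", 52.0, 1e-8),
    Hit("seqB", 31.5, 0.02),
    Hit("seqC", 24.0, 4.7),
]

significant = filter_hits(hits, max_e_value=0.1)
print([h.subject_id for h in significant])
```

A smaller threshold is stricter: lowering `max_e_value` can only shrink the list of reported matches.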



3. Information content of alignments.

Notice that the significance of a match depends on the size of the search space m n and the substitution matrix S[i,j] (see Equation 1). As the database grows, so does the probability of finding a match by chance. For a given substitution matrix, how long does the query sequence have to be in order to have some chance of finding significant matches?

Since E depends on the product of λ and S, we can modify the parameter λ at our convenience by making a corresponding change in the substitution matrix. Let λ = ln 2. Then, solving Equation (1) for S in terms of E, we can show that under the assumption that K ~ 0.1 and E ~ 0.1,

     S ~ log2 (m n).

Thus, S is the number of bits of information required to specify where the alignment starts in the query and in the database: log2 m bits for the position in the query plus log2 n bits for the position in the database.
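The algebra behind this step can be written out as follows. Equation (1) is not reproduced in this section, so the sketch assumes it has the standard Karlin-Altschul form E = K m n e^(-λS); if the equation given in lecture differs, the first line changes accordingly.

```latex
\begin{align*}
E &= K\,m\,n\,e^{-\lambda S} && \text{(assumed form of Equation 1)}\\
  &= K\,m\,n\,2^{-S}         && (\lambda = \ln 2)\\
S &= \log_2\frac{K\,m\,n}{E}
   = \log_2(m\,n) + \log_2\frac{K}{E}\\
  &\approx \log_2(m\,n)      && (K \approx E \approx 0.1 \;\Rightarrow\; \log_2(K/E) \approx 0)
\end{align*}
```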

For example,

A minimum of 38 bits of information are required to find a significant match in today's amino acid sequence database.
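As a rough numerical check, the required score can be computed directly from the sizes of the query and the database. The sketch below assumes the standard form E = K m n 2^(-S) with λ = ln 2, a query of m = 250 residues (the typical protein length used later in these notes), and a database of n = 10^9 residues. The database size is an assumed round number chosen for illustration, not a figure from the notes.

```python
import math


def required_bits(m, n, K=0.1, E=0.1):
    """Score S (in bits) at which E = K*m*n*2**(-S), i.e. S = log2(K*m*n/E)."""
    return math.log2(K * m * n / E)


m = 250             # typical protein length
n = 1_000_000_000   # ASSUMED database size in residues, for illustration only

bits = required_bits(m, n)
print(f"S is about {bits:.1f} bits")
```

With these assumed sizes the result is close to the 38 bits quoted above. Because S grows only logarithmically, doubling the database adds just one bit to the requirement.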

The amount of information available depends on the length of the query sequence and the amount of information per position provided by the substitution matrix. This is called the relative entropy:

         H = Σ_{i,j} q_ij S[i,j]
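A small worked example may make the sum concrete. The sketch below makes three assumptions that are not stated in this section: q_ij is the target frequency of the aligned pair (i, j), p_i is the background frequency of residue i, and the scores are log-odds in bits, S[i,j] = log2(q_ij / (p_i p_j)), which is the scaling that corresponds to λ = ln 2. The two-letter alphabet and its frequencies are made up.

```python
import math


def log_odds_bits(q, p):
    """Log-odds scores in bits: S[i][j] = log2(q[i][j] / (p[i] * p[j]))."""
    n = len(p)
    return [[math.log2(q[i][j] / (p[i] * p[j])) for j in range(n)]
            for i in range(n)]


def relative_entropy(q, S):
    """H = sum over all pairs (i, j) of q[i][j] * S[i][j]."""
    n = len(q)
    return sum(q[i][j] * S[i][j] for i in range(n) for j in range(n))


# Made-up two-letter alphabet: equal background frequencies, and target
# frequencies in which matching pairs are enriched.
p = [0.5, 0.5]
q = [[0.4, 0.1],
     [0.1, 0.4]]

S = log_odds_bits(q, p)
H = relative_entropy(q, S)
print(f"H = {H:.3f} bits per aligned position")
```

Matches score positive and mismatches negative, and the target-weighted average H is positive. If q_ij were equal to p_i p_j everywhere, every score would be zero and so would H.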

Note that H is maximized when



Information Content in Substitution Matrices

   BLOSUM    H (bits)    PAM    H (bits)    %I
                         20     2.95        83
                         30     2.57
                         50     2.00        63
   90        1.18        100    1.18        43
   80        0.99        120    0.98        38
   62        0.70        160    0.70        30
   45        0.38        250    0.36        20


Using this table, we can estimate the minimum query length by observing that we have m H bits of information available in a sequence of length m and will require at least log2 (m n) bits of information to find a significant match. Ideally, we would like the solution to the equation

     m H = log2 (m n)

but we can approximate it by replacing the m on the right-hand side with 250, the length of a typical amino acid sequence, yielding

     m = log2 (250 n)/H
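The equation m H = log2 (m n) has no closed-form solution, but it is easy to solve numerically. The sketch below uses simple fixed-point iteration, starting from the m = 250 approximation and repeatedly substituting the current estimate into the right-hand side. The function name `min_query_length` and the database size n = 10^9 are assumptions made for illustration; H = 0.70 is the PAM160 (or BLOSUM62) value from the table.

```python
import math


def min_query_length(n, H, m0=250.0, tol=1e-9, max_iter=100):
    """Solve m * H = log2(m * n) for m by iterating m <- log2(m * n) / H."""
    m = m0
    for _ in range(max_iter):
        m_next = math.log2(m * n) / H
        if abs(m_next - m) < tol:
            return m_next
        m = m_next
    return m


n = 1e9    # ASSUMED database size in residues, for illustration only
H = 0.70   # bits per position, PAM160 / BLOSUM62 row of the table

one_step = math.log2(250 * n) / H   # the approximation used in the notes
m = min_query_length(n, H)

print(f"one-step approximation: {one_step:.1f} residues")
print(f"fixed-point solution:   {m:.1f} residues")
```

The two answers differ by only a few residues, because m enters the right-hand side through a logarithm. That is why the one-step approximation is adequate for estimates of this kind.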

For example, if we require 38 bits of information, the query sequence must be approximately 38/H residues long: about 15 residues with a PAM30 matrix, 54 residues with a PAM160 matrix, and 105 residues with a PAM250 matrix.
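These figures follow from dividing the required number of bits by the relative entropy of each matrix. A minimal sketch, using the H values from the table above (the dictionary and function names are illustrative):

```python
# Relative entropy in bits per position, taken from the table in these notes.
H_BITS = {"PAM30": 2.57, "PAM160": 0.70, "PAM250": 0.36}


def query_length(bits_required, h):
    """Approximate query length that supplies bits_required bits at h bits per position."""
    return bits_required / h


lengths = {name: query_length(38, h) for name, h in H_BITS.items()}
for name, length in lengths.items():
    print(f"{name}: {length:.1f} residues")
```

The quotients are not whole numbers, so the residue counts quoted in the text are approximate. The trend is the important point: the less information a matrix provides per position, the longer the query must be.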


4. How to choose a scoring matrix.

Continued on Tuesday.

Last modified: November 17, 2005.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.