Local alignment using dynamic programming requires O(mn) time, where m is the length of the query and n is the length of the database. This is prohibitive when n is large.
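To see where the O(mn) cost comes from, here is a minimal sketch of the local alignment score computation (Smith-Waterman style). The function name and the match, mismatch and gap values are assumptions chosen for illustration, not parameters taken from these notes.

```python
def local_alignment_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b with a linear gap penalty.

    The two nested loops visit every cell of an (m+1) x (n+1) table once,
    which is the O(mn) cost. Only two rows are kept in memory.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            # A local alignment may restart anywhere, hence the floor at 0.
            cur[j] = max(0, prev[j - 1] + s, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

With a query of a few hundred residues and a database of around a billion residues, the table has on the order of 10^11 cells, which is why a faster heuristic is needed.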
Statistical questions:
$$E = K m n e^{-\lambda S} \qquad (1)$$
where λ is specified by the equation
$$1 = \sum_{i,j} p_i p_j e^{\lambda S[i,j]}$$
and K can be computed analytically for various substitution matrices from the theory. (Simulations are not necessary.)
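The equation for λ has no closed form in general, but it is easy to solve numerically. The sketch below uses bisection, which is my choice for illustration and not necessarily how Blast does it. The function names, the uniform DNA frequencies and the +1/-3 scoring are all assumptions made for the example.

```python
import math

def solve_lambda(freqs, score):
    """Find the positive root of sum_ij p_i p_j exp(lambda * S[i,j]) = 1 by bisection.

    freqs maps each letter to its background frequency p_i;
    score(a, b) returns S[a,b]. Both are hypothetical inputs.
    """
    letters = list(freqs)
    pairs = [(freqs[a] * freqs[b], score(a, b)) for a in letters for b in letters]
    expected = sum(w * s for w, s in pairs)
    if expected >= 0 or max(s for _, s in pairs) <= 0:
        # A positive root exists only if the expected score is negative
        # and at least one score is positive.
        raise ValueError("need a negative expected score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0.0:      # grow until f is positive
        hi *= 2.0
    lo = hi / 2.0
    while f(lo) >= 0.0:      # shrink until f is negative
        lo /= 2.0
    for _ in range(200):     # plain bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(mid) > 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Illustrative input: uniform DNA background, match +1, mismatch -3.
dna = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

def match_mismatch(a, b):
    return 1 if a == b else -3

lam = solve_lambda(dna, match_mismatch)
```

For this toy scoring system the root is close to 1.374. The trivial root λ = 0 always satisfies the equation, so the search deliberately starts away from zero.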
For each match in a DB search, Blast reports both the HSP score and its significance. The statistical significance is expressed in terms of E and is often referred to as the "e value".
The user specifies the significance threshold for the search in terms of the maximum acceptable e value. Blast will report all matches that have an e value that is lower than (i.e., more significant than) the threshold.
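As a rough illustration of the thresholding step, the sketch below evaluates Equation (1) for a few scores and keeps the hits below a chosen maximum e value. The hit names, scores, and the values of m, n, K and λ are hypothetical, and the functions are not part of any Blast API.

```python
import math

def expect_value(score, m, n, K, lam):
    """Equation (1): E = K * m * n * exp(-lambda * S)."""
    return K * m * n * math.exp(-lam * score)

def report(hits, m, n, K, lam, max_e):
    """Return (name, score, E) for hits with E below max_e, most significant first."""
    scored = [(name, s, expect_value(s, m, n, K, lam)) for name, s in hits]
    kept = [h for h in scored if h[2] < max_e]
    return sorted(kept, key=lambda h: h[2])

# Hypothetical search: a 250-residue query against a database of 10^9 residues.
hits = [("hit_a", 60), ("hit_b", 38), ("hit_c", 30)]
significant = report(hits, m=250, n=1_000_000_000, K=0.1,
                     lam=math.log(2), max_e=0.1)
```

With these numbers the hits scoring 60 and 38 pass the threshold and the hit scoring 30 does not. Because E is proportional to m n, doubling the database size doubles the e value of every hit.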
Notice that the significance of a match depends on the size of the search space m n and on the substitution matrix S[i,j] (see Equation 1). As the database grows, so does the probability of finding a match by chance. For a given substitution matrix, how long does the query sequence have to be in order to have some chance of finding significant matches?
Since E depends on the product of λ and S, we can set the parameter λ at our convenience by rescaling the substitution matrix correspondingly. Let λ = ln 2. Then, solving Equation (1) for S in terms of E, we can show that under the assumption that K ~ 0.1 and E ~ 0.1,
$$S \approx \log_2 (m n).$$
Thus, S is the number of bits of information required to specify where the alignment starts: log2 m bits for the position in the query and log2 n bits for the position in the database.
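The step from Equation (1) to this estimate can be written out as follows. It uses only λ = ln 2 and the assumption that K and E are of the same size.

```latex
\begin{aligned}
E &= K\, m\, n\, e^{-(\ln 2)\, S} = K\, m\, n\, 2^{-S} \\
2^{S} &= \frac{K\, m\, n}{E} \\
S &= \log_2 (m n) + \log_2 \frac{K}{E} \;\approx\; \log_2 (m n)
  \quad \text{when } K \approx E
\end{aligned}
```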
For example,
The amount of information available depends on the length of the query sequence and the amount of information per position provided by the substitution matrix. The latter is called the relative entropy of the matrix:
$$H = \sum_{i,j} q_{ij} S[i,j]$$
where q_ij is the target frequency of the aligned residue pair (i, j).
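A small numerical sketch of this sum may help. It assumes log-odds scores in bits (λ = ln 2), so that S[i,j] = log2(q_ij / (p_i p_j)). The four-letter alphabet and the target frequencies are made up for illustration and do not correspond to any PAM or BLOSUM matrix.

```python
import math

def relative_entropy(target, scores):
    """H = sum over pairs (i, j) of q_ij * S[i,j]."""
    return sum(q * scores[pair] for pair, q in target.items())

alphabet = "ACGT"
p = {a: 0.25 for a in alphabet}                    # assumed background frequencies

# Assumed target frequencies: 90% of aligned pairs are identities.
target = {(a, b): (0.225 if a == b else 0.1 / 12)
          for a in alphabet for b in alphabet}

# Log-odds scores in bits, consistent with lambda = ln 2.
scores_bits = {(a, b): math.log2(q / (p[a] * p[b]))
               for (a, b), q in target.items()}

H = relative_entropy(target, scores_bits)          # bits per aligned position
```

Matrices built for closely related sequences put most of the target frequency on identities, which gives large positive scores and a large H. Matrices built for distant relationships have target frequencies closer to the background, and H shrinks toward zero.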
Note that H is maximized when
Information Content in Substitution Matrices

| BLOSUM | H (bits) | PAM | H (bits) | %I |
|---|---|---|---|---|
| | | 20 | 2.95 | 83 |
| | | 30 | 2.57 | |
| | | 50 | 2.00 | 63 |
| 90 | 1.18 | 100 | 1.18 | 43 |
| 80 | 0.99 | 120 | 0.98 | 38 |
| 62 | 0.70 | 160 | 0.70 | 30 |
| 45 | 0.38 | 250 | 0.36 | 20 |
Using this table, we can estimate the minimum query length by observing that we have m H bits of information available in a sequence of length m and will require at least log2 (m n) bits of information to find a significant match. Ideally, we would like the solution to the equation
$$m H = \log_2 (m n)$$
but we can approximate it by replacing the m on the right-hand side with 250, the length of a typical amino acid sequence, yielding
$$m = \log_2 (250\, n) / H$$
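The approximation can be compared with a direct numerical solution of m H = log2(m n). The sketch below uses simple fixed-point iteration, which is an illustrative choice of mine. The function names and the database size n = 10^9 are assumptions.

```python
import math

def query_length_approx(n, H, typical_length=250):
    """m = log2(typical_length * n) / H, the approximation used in the notes."""
    return math.log2(typical_length * n) / H

def query_length_fixed_point(n, H, start=250.0, tol=1e-9, max_iter=100):
    """Solve m * H = log2(m * n) by iterating m <- log2(m * n) / H.

    Returns the last iterate if the tolerance is not reached in max_iter steps.
    """
    m = start
    for _ in range(max_iter):
        new_m = math.log2(m * n) / H
        if abs(new_m - m) < tol:
            return new_m
        m = new_m
    return m

n = 1e9                                    # assumed database size in residues
approx = query_length_approx(n, H=0.70)    # H for BLOSUM62 / PAM160 in the table
exact = query_length_fixed_point(n, H=0.70)
```

For these inputs the approximation gives roughly 54 residues and the iterated solution roughly 51, so replacing m with 250 slightly overestimates the required length whenever the true answer is shorter than 250.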
For example, if we require 38 bits of information, the query sequence must be about 15 residues long with a PAM30 matrix (38/2.57), 54 residues long with a PAM160 matrix (38/0.70) and 105 residues long with a PAM250 matrix (38/0.36).
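The arithmetic behind this example is just the required number of bits divided by the information per position. A minimal sketch, using the H values from the table above (the dictionary and function names are hypothetical):

```python
# Relative entropy H in bits per position, copied from the table.
PAM_H = {30: 2.57, 160: 0.70, 250: 0.36}

def residues_needed(bits, H):
    """Query length m such that m * H equals the required number of bits."""
    return bits / H

required_bits = 38
lengths = {pam: residues_needed(required_bits, H) for pam, H in PAM_H.items()}

# Database size for which log2(250 * n) equals 38 bits.
implied_n = 2 ** required_bits / 250
```

The ratios come out near 14.8, 54.3 and 105.6, which agree with the quoted lengths up to rounding. A requirement of 38 bits corresponds to a database of about 1.1 billion residues when the query length on the right-hand side is taken to be 250.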
4. How to choose a scoring matrix.
Continued on Tuesday.
Last modified: November 17, 2005.