03-511/711, 15-495/856 Course Notes

03-511/711, 15-495/856 Course Notes - November 17, 2005

BLAST Statistics

RECAP

Database Searching:
Local alignment using dynamic programming requires O(mn) time, where m is the length of the query and n is the length of the database. This is prohibitive when n is large.
BLAST:
- A fast heuristic for finding ungapped local alignments (high-scoring segment pairs, or HSPs) with score at least S
- Approach
  - Choose the desired E (the number of false positives tolerated). This gives a value for S.
  - Given S, select BLAST parameters to minimize the number of false negatives while keeping the search fast

LOCAL ALIGNMENT STATISTICS (Karlin and Altschul)

Statistical questions:

How are S and E related?
Significance of matches reported.
Information content of alignments. How short can Q be?
How to choose the scoring matrix S[i,j]?

See "The Statistics of Sequence Similarity Scores" for more details.

1. How are S and E related?

Let E be the expected number of false positives; i.e., E is the expected number of HSPs with score >= S given a "random" query sequence of length m and a "random" database sequence of length n. Here, "random" means sampled with background frequencies.
- H0 - sequences are unrelated (background frequencies). The probability of seeing i aligned with j is p_i p_j.
- Ha - sequences are related (target frequencies). The probability of seeing i aligned with j is q_ij.
Approach to estimating E
- Model alignment as a random walk
  
  G G A ... G
  G A A C ...
  - We can treat an alignment as a random walk, where a match is a step in the positive direction and a mismatch is a step in the negative direction.
  - Each site corresponds to one step in the walk.
  - Step size - S[i,j]
  - Step probability - p_i p_j
- HSPs in the alignment are "excursions" in the random walk.
  - We define a "ladder point" L to be a new low in this random walk.
  - Let the random variable Y_i describe the local maximum point between the ladder points L_i and L_(i+1).
  - An excursion is the walk from L_i to Y_i.
  - Excursions correspond to segment pairs.
- Apply known statistics about excursions of height at least S to get statistics about HSPs of score at least S:
  - In order to apply random walk theory, we require:
    1. at least one positive step (S[i,j] > 0) and one negative step (S[k,l] < 0)
    2. step size has negative mean: Σ_i,j S[i,j] p_i p_j < 0
  - Then the expected number of HSPs with score at least S under H0 is
```
   E = K m n e^(-λS)     (1)
```
    where λ is the unique positive solution of the equation
```
   1 = Σ_i,j p_i p_j e^(λ S[i,j])
```
    and K can be computed analytically for various substitution matrices from the theory. (Simulations are not necessary.)
- Note that
  - The PAM and BLOSUM matrices meet conditions 1 and 2: they contain positive and negative entries, and Σ_i,j S[i,j] p_i p_j < 0.
  - E increases with sequence lengths m and n and decreases exponentially with S.
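The random-walk picture and the equation for λ can be sketched in a few lines of Python. This is a minimal illustration, not part of the original notes: the function names `excursion_heights` and `solve_lambda`, the bisection solver, and the toy DNA scoring scheme (uniform background, match = +1, mismatch = -1) are all assumptions chosen for simplicity. For this toy scheme the equation reduces to (1/4)e^λ + (3/4)e^(-λ) = 1, whose positive root is λ = ln 3.

```python
import math

def excursion_heights(steps):
    """Cut a walk at its ladder points (new lows) and return, for each
    excursion, the height of its peak above the ladder point it starts from."""
    heights = []
    pos = low = peak = 0
    for s in steps:
        pos += s
        if pos < low:                 # new ladder point: close the excursion
            heights.append(peak - low)
            low = peak = pos
        elif pos > peak:              # new local maximum within the excursion
            peak = pos
    heights.append(peak - low)        # close the final excursion
    return heights

def solve_lambda(p, S, tol=1e-12):
    """Positive root of sum_ij p[i] p[j] exp(lam * S[i][j]) = 1, by bisection."""
    n = len(p)
    pairs = [(p[i] * p[j], S[i][j]) for i in range(n) for j in range(n)]
    mean = sum(w * s for w, s in pairs)
    if mean >= 0 or max(s for _, s in pairs) <= 0:
        raise ValueError("need a negative mean score and a positive score")

    def f(lam):
        return sum(w * math.exp(lam * s) for w, s in pairs) - 1.0

    hi = 1.0
    while f(hi) <= 0:                 # grow until f is positive
        hi *= 2.0
    lo = hi
    while f(lo) >= 0:                 # shrink until f is negative
        lo /= 2.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)

# Toy scoring scheme (an assumption, not from the notes).
p = [0.25] * 4
S = [[1 if i == j else -1 for j in range(4)] for i in range(4)]
lam = solve_lambda(p, S)
print(lam, math.log(3))

# Walk with steps +1 (match) and -1 (mismatch); one height per excursion.
print(excursion_heights([1, 1, -1, -1, -1, 1, 1, 1, -1]))
```

The largest excursion height equals the best ungapped local segment score in the walk, which is why statistics about excursion heights carry over to HSP scores.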

2. Significance of matches reported.

For each match in a database search, BLAST reports both the HSP score and its significance. The statistical significance is expressed in terms of E and is often referred to as the "E-value".

The user specifies the significance threshold for the search in terms of the maximum acceptable E-value. BLAST will report all matches that have an E-value lower than (i.e., more significant than) the threshold.
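As a rough sketch of this reporting step, the snippet below converts raw HSP scores to E-values with Equation (1) and keeps the hits below a user-supplied threshold. It is an illustration only, not BLAST's actual implementation: the function names, the hit list, and the parameter values (K = 0.1, λ = ln 2, m = 250, n = 10^6) are hypothetical.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of chance HSPs with at least this score: K m n e^(-lam S)."""
    return K * m * n * math.exp(-lam * score)

def report(hits, m, n, K, lam, threshold):
    """Return (name, score, E-value) for hits whose E-value is below the
    threshold, most significant first."""
    scored = [(name, s, e_value(s, m, n, K, lam)) for name, s in hits]
    kept = [h for h in scored if h[2] < threshold]
    return sorted(kept, key=lambda h: h[2])

# Hypothetical hits: (sequence name, HSP score in bits).
hits = [("a", 40), ("b", 20), ("c", 60)]
for name, score, e in report(hits, m=250, n=10**6, K=0.1,
                             lam=math.log(2), threshold=0.01):
    print(name, score, e)
```

With these assumed parameters the hit with score 20 has an E-value well above 1 and is dropped, while the hits with scores 40 and 60 are reported.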

3. Information content of alignments.

Notice that the significance of a match depends on the size of the search space, m n, and on the substitution matrix S[i,j] (see Equation 1). As the database grows, so does the probability of finding a match by chance. For a given substitution matrix, how long does the query sequence have to be in order to have some chance of finding significant matches?

Since E depends on λ and S only through their product, we can rescale λ at our convenience by making a corresponding change in the substitution matrix. Let λ = ln 2, so that scores are measured in bits. Solving Equation (1) for S in terms of E then gives S = log₂ (K m n / E). Under the assumption that K ~ 0.1 and E ~ 0.1, the ratio K/E is 1 and

     S ~ log₂ (m n).

Thus, S is the number of bits of information required to specify the start of the alignment in both the query and the database sequences: log₂ m bits for the position in the query plus log₂ n bits for the position in the database.

For example,

for two protein sequences of length 250 each, S ~ 16 bits
for a protein sequence of length 250 and a database sequence of length 1,000,000,000, S ~ 38 bits

A minimum of 38 bits of information is required to find a significant match in today's amino acid sequence database.
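The two figures above can be checked with a short calculation. This is a minimal sketch assuming λ = ln 2 and K = E = 0.1, as in the text; the function name `required_bits` is hypothetical.

```python
import math

def required_bits(m, n, K=0.1, E=0.1):
    """Score in bits needed for an expected E chance hits: S = log2(K m n / E)."""
    return math.log2(K * m * n / E)

print(required_bits(250, 250))      # two proteins of length 250
print(required_bits(250, 10**9))    # length-250 query against 10^9 residues
```

Rounding the results to the nearest bit reproduces the 16 and 38 bits quoted in the examples.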

The amount of information available depends on the length of the query sequence and on the average amount of information per aligned position provided by the substitution matrix. The latter is called the relative entropy:

H = Σ_i,j q_ij S[i,j]

Note that H is maximized when

|S[i,j]| >> 0; i.e., when the target frequencies are very different from the background frequencies.
|S[i,j]| corresponds to the target frequencies in the regions sought.
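To see how H grows as the target frequencies move away from the background frequencies, the sketch below computes H for a toy two-letter alphabet. It assumes the standard log-odds form of the scores in bits, S[i,j] = log₂(q_ij / (p_i p_j)), which these notes have not yet introduced; the function names and the frequency values are made up for illustration.

```python
import math

def log_odds_bits(q, p):
    """Log-odds scores in bits: S[i][j] = log2(q[i][j] / (p[i] * p[j]))."""
    n = len(p)
    return [[math.log2(q[i][j] / (p[i] * p[j])) for j in range(n)]
            for i in range(n)]

def relative_entropy(q, S):
    """H = sum_ij q[i][j] * S[i][j], the expected score per aligned pair
    when the pairs are drawn from the target frequencies."""
    n = len(S)
    return sum(q[i][j] * S[i][j] for i in range(n) for j in range(n))

p = [0.5, 0.5]                               # background frequencies
q_related = [[0.4, 0.1], [0.1, 0.4]]         # targets favour identities
q_unrelated = [[0.25, 0.25], [0.25, 0.25]]   # targets equal to p_i * p_j

H_related = relative_entropy(q_related, log_odds_bits(q_related, p))
H_unrelated = relative_entropy(q_unrelated, log_odds_bits(q_unrelated, p))
print(H_related, H_unrelated)
```

When the target frequencies equal the background product p_i p_j, every score is zero and H = 0; the further the targets move from the background, the more bits each aligned position contributes.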

Information Content in Substitution Matrices

BLOSUM   H (bits)   PAM   H (bits)   % identity
                    20    2.95       83
                    30    2.57
                    50    2.00       63
90       1.18       100   1.18       43
80       0.99       120   0.98       38
62       0.70       160   0.70       30
45       0.38       250   0.36       20

Using this table, we can estimate the minimum query length by observing that we have m H bits of information available in a sequence of length m and will require at least log₂ (m n) bits of information to find a significant match. Ideally, we would like the solution to the equation

m H = log₂ (m n)

but we can approximate it by replacing the m on the right-hand side with 250, the length of a typical amino acid sequence, yielding

m = log₂ (250 n)/H

For example, if we require 38 bits of information, the query sequence must be at least 15 residues long with a PAM30 matrix, 54 residues long with a PAM160 matrix and 105 residues long with a PAM250 matrix.
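The query lengths above can be reproduced from the table, and the exact equation m H = log₂ (m n) can be solved numerically for comparison. The sketch below is an illustration under stated assumptions: a database of n = 10^9 residues, H values copied from the PAM column of the table, and a simple fixed-point iteration started at m = 250 as the solver. The function names are hypothetical.

```python
import math

# Relative entropies (bits per position) from the PAM column of the table.
PAM_H = {30: 2.57, 160: 0.70, 250: 0.36}

def approx_min_query_length(H, n, typical_m=250):
    """Approximation from the text: m = log2(250 n) / H."""
    return math.log2(typical_m * n) / H

def exact_min_query_length(H, n, m0=250.0, tol=1e-9, max_iter=100):
    """Solve m H = log2(m n) by iterating m <- log2(m n) / H from m0."""
    m = m0
    for _ in range(max_iter):
        m_next = math.log2(m * n) / H
        if abs(m_next - m) < tol:
            return m_next
        m = m_next
    return m

n = 10**9  # assumed database size, as in the 38-bit example
approx = {pam: approx_min_query_length(H, n) for pam, H in PAM_H.items()}
exact = {pam: exact_min_query_length(H, n) for pam, H in PAM_H.items()}
for pam in PAM_H:
    print(pam, approx[pam], exact[pam])
```

The exact solutions come out slightly shorter than the approximations, because the true m is smaller than 250 and so the right-hand side log₂ (m n) is smaller than log₂ (250 n).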

4. How to choose a scoring matrix.

Continued on Tuesday.

Last modified: November 17, 2005.
Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.