- Similarity functions
- simplest:
*M, m, g* - more general:
*p(x,y)*- similarity of*x*and*y* - select
*p( )*to take into account- biophysical properties of residues.
- evolutionary divergence. Substitutions that are surprising in closely related sequences might not be surprising in distant ones.
- (possibly) multiple substitutions.

For comparison, consider the Jukes Cantor model, which is a Markov model of point mutations in nucleic acid sequences.

- Instantaneous rate matrix
- P[i,j]=α i≠j
- P[i,i]= 1-3 α

- Probability of change over time,
*t*.- 4 α t P(i,i) = 1/4 + 3/4 e

- 4 α t P(i,j) = 1/4 (1 - e)

This is a scoring matrix parameterized by the evolutionary distance,

*α t*. Note that when*t=0, p(i,i)=1*and*p(i,j)=0*. When*t=α, p(i,i)=P(i,j)=1/4*.However, this scoring matrix, isn't very useful because we don't know

*α t*. Instead, we use the Jukes Cantor model to correct for multiple substitutions by counting the number of mismatches and then correct the distance by observing that the expected

number of substitutions that occured is6 α t = 3/4 log (1 - 4/3 P(i≠j)),

where*P(i≠j)*is the expected number of*observed*mismatches and can be estimated by*m/n*, the number of observed mismatches divided by the alignment length .

- Jukes-Cantor model
- has an explicit evolutionary model
- takes evolutionary divergence into account
- is determined from first principles

- The Kimura 2 parameter model does take some biophysical properties into account (models both transitions and transversions), but does not capture all aspects.

Goal: Amino acid similarity matrices that take into account

- biophysical properties of residues,
- evolutionary divergence and
- multiple substitutions.

Markov models of sequence evolution require

- Nucleic acids 4 states, 16 transitions
- Amino acids 20
states, 400 transitions
Use data to infer transition probabilities for amino acids.

Two commonly used families of amino acid substitution matrices

- PAM - Dayhoff
*et al.*, 1978 - BLOSUM - Henikoff S., Henikoff JG., 1992.

- "Trusted" MSA's (ungapped)
- Count substitutions, correcting for sample bias in choice of sequences
- Estimate substitution frequencies

PAM - evolutionary model

BLOSUM - directly from data - Construct Log odds scoring matrix

**Definition:**- PAM = "accepted point mutation" or "percent accepted mutation".
- PAM is a unit of evolutionary distance.
- We say two sequence are
*n*PAMs apart if there are, on average,*n*actual changes (including multiple substitutions) beteen them per 100 residues. - Our goal is to construct a family of matrices parameterized by PAM distance.

**Approach:**- Construct a family of Markov chains with twenty states. If
the chain is in state
*j*at time*t*,we say that we see residue*j*at site*i*at time*t*. Note that this model assumes site independence. - Derive the PAM 1 transition probability,
*P*from closely related alignments (no multiple substitutions.)^{1}[j,k] - Extrapolate to obtain the PAM
*n*transition probability,*P*.^{n}[j,k]*P*is the probability that^{n}[j,k]*j*will be replaced with*k*in sequences that are*n*PAM units apart.

- Construct a family of Markov chains with twenty states. If
the chain is in state
**Strategy:**- "Trusted" ungapped MSAs of 71 groups of closely related sequences.
Within each group, the sequence similarity is >= 85%.

- Count replacements, correcting for sample bias in choice of sequences by
averaging over all most parsimonious trees. For each tree,
*T*, we calculate*A*by counting the number of edges connecting^{T}_{jk}*j*and*k*.*A*equals twice the number of edges connecting^{T}_{jj}*j*and*j*. We obtain overall counts by summing over all trees:A

_{jk}= (1/|T|) Σ_{T}A^{T}_{jk}

- Using
*A*, we obtain the transition matrix P_{jk}^{1}[*j,k*], the probability that amino acid*j*will be replaced by amino acid*k*sequences separated by one PAM of evolutionary distance, as follows:*P*^{1}[j,k] = m_{j}A_{j,k}---------- Σ_{i ≠ j}A_{j,i}*P*^{1}[j,j] = 1 - m_{j}

where------- ------------- n p*mj = 1*Σ_{i ≠ j}A_{j,i}_{j}z Σ_{h}Σ_{i ≠ h}A_{h,i}*p*is the background frequency of_{j}*j*and*n*is the length of the MSA. Select the nomalization factor,*z*, so that*Σ*_{j = 1 to 20}(p_{j}m_{j}) = 0.01

yielding

*m*--- ------------ p_{j}= 0.01 1 Σ_{i ≠ j}A_{j,i}_{j}Σ_{h}Σ_{i ≠ h}A_{h,i}

Note - P[

*j,k*] is a Markov chain- rows sum to 1
- history independent
- finite, aperiodic, irreducible => stationary distribution

- "Trusted" ungapped MSAs of 71 groups of closely related sequences.
Within each group, the sequence similarity is >= 85%.

PAM2 matrix:

P^{2}[j,k] = ΣP[j,l] P[l,k] = (P^{1}[j,k])^{2}

PAMn matrix:

P^{n}[j,k] = (P^{1}[j,k])^{n}

Let *q ^{n}(j,k) = p_{j} P^{n}[j,k]* be the probability that, at a given position, we see amino
acid

i.e., that amino acid

S[*j,k*] = λ log __q[jk]__

p_{j} p_{k}

= λ log *P ^{n}[j,k]
*

p_{k}

where λ is a constant. Typically λ = 10 and the entries of S[] are rounded to the nearest integer.

- Transition matrix - no

Replacing amino acid*i*with amino acid*j*is not the same as replacing*j*with*i*. - Scoring matrix - yes

In an alignment, we cannot determine direction of evolution.

Last modified: November 19, 2010.

Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.