For comparison, consider the Jukes-Cantor model, which is a Markov model of point mutations in nucleic acid sequences.
$$P(i,i) = \frac{1}{4} + \frac{3}{4} e^{-4 \alpha t}$$
$$P(i,j) = \frac{1}{4} \left(1 - e^{-4 \alpha t}\right), \quad i \neq j$$
This is a scoring matrix parameterized by the evolutionary distance, $\alpha t$. Note that when $t = 0$, $P(i,i) = 1$ and $P(i,j) = 0$. As $t \to \infty$, $P(i,i) = P(i,j) = 1/4$.
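The behaviour of these two probabilities at αt = 0 and for large αt can be checked numerically. The following is a minimal sketch; the function name `jc_transition_probs` and its argument `alpha_t` (the product αt) are illustrative choices, not part of the model's definition.

```python
import math

def jc_transition_probs(alpha_t):
    """Return (P(i,i), P(i,j)) under Jukes-Cantor for a given value of alpha*t."""
    decay = math.exp(-4.0 * alpha_t)
    p_same = 0.25 + 0.75 * decay      # probability the nucleotide is unchanged
    p_diff = 0.25 * (1.0 - decay)     # probability of one specific other nucleotide
    return p_same, p_diff

for alpha_t in (0.0, 0.1, 1.0, 10.0):
    p_same, p_diff = jc_transition_probs(alpha_t)
    # Each row of the 4x4 matrix has one diagonal and three off-diagonal entries.
    row_sum = p_same + 3.0 * p_diff
    print(f"alpha*t={alpha_t:5.1f}  P(i,i)={p_same:.4f}  "
          f"P(i,j)={p_diff:.4f}  row sum={row_sum:.4f}")
```

Every row sums to 1 for any value of αt, and both probabilities approach 1/4 as αt grows.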
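However, this scoring matrix isn't very useful because we don't know $\alpha t$. Instead, we use the Jukes-Cantor model to correct for multiple substitutions: count the number of mismatches, then correct the distance by observing that the expected number of substitutions per site that occurred is
$$6 \alpha t = -\frac{3}{4} \ln\left(1 - \frac{4}{3} P(i \neq j)\right),$$
where $P(i \neq j)$ is the observed fraction of mismatched sites, and each of the two sequences is taken to have evolved for time $t$ since their common ancestor, so that a total time of $2t$ separates them.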
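To see how much the correction changes the raw mismatch fraction, here is a small sketch of the Jukes-Cantor distance, d = -(3/4) ln(1 - (4/3) p), where p is the observed fraction of mismatched sites. The function name `jc_corrected_distance` is a hypothetical helper introduced for illustration.

```python
import math

def jc_corrected_distance(p_mismatch):
    """Expected substitutions per site, given the observed mismatch fraction."""
    # The logarithm is only defined while 1 - (4/3) p > 0, i.e. p < 3/4.
    if not 0.0 <= p_mismatch < 0.75:
        raise ValueError("the correction is only defined for 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * p_mismatch)

for p in (0.0, 0.05, 0.25, 0.5, 0.7):
    d = jc_corrected_distance(p)
    print(f"observed p={p:.2f}  corrected distance={d:.4f}")
```

The corrected distance is never smaller than the observed mismatch fraction, and it grows without bound as p approaches 3/4, the value expected for two unrelated sequences.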
Goal: Amino acid similarity matrices that take into account
Markov models of sequence evolution require
Use data to infer transition probabilities for amino acids.
Two commonly used families of amino acid substitution matrices
$$A_{jk} = \frac{1}{|T|} \sum_{T} A^{T}_{jk}$$
$$P^{1}[j,k] = m_j \, \frac{A_{j,k}}{\sum_{i \neq j} A_{j,i}} \quad (k \neq j), \qquad P^{1}[j,j] = 1 - m_j$$
$$m_j = \frac{1}{n \, p_j \, z} \cdot \frac{\sum_{i \neq j} A_{j,i}}{\sum_{h} \sum_{i \neq h} A_{h,i}}$$
where $p_j$ is the background frequency of $j$ and $n$ is the length of the MSA. Select the normalization factor, $z$, so that
$$\sum_{j=1}^{20} p_j \, m_j = 0.01,$$
which gives
$$m_j = \frac{0.01}{p_j} \cdot \frac{\sum_{i \neq j} A_{j,i}}{\sum_{h} \sum_{i \neq h} A_{h,i}}$$
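The construction of the one-step matrix from the counts can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the three-letter alphabet, the count matrix `toy_counts`, the frequencies `toy_freqs`, and the function name `pam1_from_counts`. A real PAM matrix uses 20 amino acids and counts taken from data.

```python
def pam1_from_counts(counts, freqs, rate=0.01):
    """Build P1 and the mutabilities m_j from substitution counts A[j][k].

    m_j      = (rate / p_j) * (sum_{i != j} A[j][i]) / (sum_h sum_{i != h} A[h][i])
    P1[j][k] = m_j * A[j][k] / sum_{i != j} A[j][i]   for k != j
    P1[j][j] = 1 - m_j
    """
    size = len(counts)
    # Off-diagonal row sums: total substitutions observed away from each residue.
    off_row = [sum(counts[j][i] for i in range(size) if i != j) for j in range(size)]
    total_off = sum(off_row)
    mutability = [rate * off_row[j] / (freqs[j] * total_off) for j in range(size)]
    p1 = [[0.0] * size for _ in range(size)]
    for j in range(size):
        for k in range(size):
            if k == j:
                p1[j][k] = 1.0 - mutability[j]
            else:
                p1[j][k] = mutability[j] * counts[j][k] / off_row[j]
    return p1, mutability

# Hypothetical toy data: 3 residue types instead of 20.
toy_counts = [[0, 30, 10],
              [30, 0, 20],
              [10, 20, 0]]
toy_freqs = [0.5, 0.3, 0.2]

p1, m = pam1_from_counts(toy_counts, toy_freqs)
for row in p1:
    print(["%.5f" % x for x in row], "row sum = %.5f" % sum(row))
print("sum_j p_j m_j = %.5f" % sum(p * mj for p, mj in zip(toy_freqs, m)))
```

Each row of the result sums to 1, and the frequency-weighted mutabilities sum to 0.01, i.e. one expected substitution per 100 positions.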
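Note: $P^{1}[j,k]$ is the transition matrix of a Markov chain, so the probabilities after several steps are entries of matrix powers of $P^{1}$:
$$P^{2}[j,k] = \sum_{l} P^{1}[j,l] \, P^{1}[l,k] = \left(P^{1}\right)^{2}[j,k]$$
$$P^{n}[j,k] = \left(P^{1}\right)^{n}[j,k]$$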
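The n-step matrix is the n-th matrix power of the one-step matrix, not the n-th power of each entry. The sketch below makes this concrete with plain Python lists; the 3x3 matrix `toy_p1` and the helper names `mat_mul` and `mat_pow` are hypothetical and chosen only for illustration.

```python
def mat_mul(a, b):
    """Product of two square matrices stored as lists of rows."""
    size = len(a)
    return [[sum(a[j][l] * b[l][k] for l in range(size)) for k in range(size)]
            for j in range(size)]

def mat_pow(p1, n_steps):
    """Return the n_steps-th matrix power of p1 by repeated multiplication."""
    size = len(p1)
    # Start from the identity matrix (zero steps of evolution).
    result = [[1.0 if j == k else 0.0 for k in range(size)] for j in range(size)]
    for _ in range(n_steps):
        result = mat_mul(result, p1)
    return result

# Hypothetical one-step matrix on a 3-letter alphabet; each row sums to 1.
toy_p1 = [[0.990, 0.006, 0.004],
          [0.010, 0.985, 0.005],
          [0.008, 0.007, 0.985]]

p2 = mat_pow(toy_p1, 2)
print("P2[0][1] via matrix product:", p2[0][1])
print("entrywise square (not the same thing):", toy_p1[0][1] ** 2)

p250 = mat_pow(toy_p1, 250)
for row in p250:
    print(["%.4f" % x for x in row], "row sum = %.4f" % sum(row))
```

Repeated multiplication is used for readability; for large n, exponentiation by squaring would need fewer multiplications.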
Let $q_n(j,k) = p_j \, P^{n}[j,k]$ be the probability that, at a given position, we see amino acid $j$ aligned with amino acid $k$; i.e., that amino acid $j$ is replaced by amino acid $k$ after $n$ PAMs of mutational change. Then the PAM $n$ scoring matrix is
$$S[j,k] = \lambda \log \frac{q_n(j,k)}{p_j \, p_k} = \lambda \log \frac{P^{n}[j,k]}{p_k}$$
where λ is a constant. Typically λ = 10 and the entries of S are rounded to the nearest integer.
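As a rough illustration of the last step, the sketch below turns an n-step transition matrix into integer scores using the log-odds form S[j,k] = λ log(q_n(j,k) / (p_j p_k)) = λ log(P^n[j,k] / p_k). Several things here are assumptions: the base-10 logarithm, the three-letter alphabet, the values in `toy_pn` and `toy_freqs`, and the function name `log_odds_scores`.

```python
import math

def log_odds_scores(pn, freqs, lam=10.0):
    """Integer log-odds scores from an n-step matrix and background frequencies."""
    size = len(pn)
    scores = []
    for j in range(size):
        row = []
        for k in range(size):
            q_jk = freqs[j] * pn[j][k]                 # joint probability q_n(j,k)
            odds = q_jk / (freqs[j] * freqs[k])        # equals pn[j][k] / freqs[k]
            row.append(round(lam * math.log10(odds)))  # base 10 is an assumption
        scores.append(row)
    return scores

# Hypothetical background frequencies and n-step matrix (rows sum to 1).
toy_freqs = [0.5, 0.3, 0.2]
toy_pn = [[0.70, 0.20, 0.10],
          [1.0 / 3.0, 0.50, 1.0 / 6.0],
          [0.25, 0.25, 0.50]]

for row in log_odds_scores(toy_pn, toy_freqs):
    print(row)
```

Pairs that occur more often than expected by chance receive positive scores, and pairs that occur less often receive negative scores. The toy matrix was chosen so that q_n(j,k) = q_n(k,j), which makes the resulting score matrix symmetric.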