We observe that a pairwise alignment of two sequences *r* and
*s* of length *n* has *m* mismatches (no gaps).

We suspect that this is an underestimate of the number of positions at which at least one substitution occurred. For example, if there is an A at the same position in both sequences, it could be because the ancestral state was also A and no change occurred (left-hand figure) or because parallel changes occurred in both sequences (right-hand figure).

Given the observed number of mismatches, we wish to estimate the expected number
of substitutions that actually occurred. Let *p = m/n*
be the observed frequency of mismatches. Then,
*p* is an estimator for the expected number of sites at which at least one substitution occurred.
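As a small illustration, the observed mismatch frequency can be computed directly from two aligned, gap-free sequences. This is only a sketch: the function name `mismatch_frequency` and the example sequences are made up for illustration and do not come from the notes.

```python
def mismatch_frequency(r, s):
    """Return p = m/n for two aligned, gap-free sequences of equal length."""
    if len(r) != len(s):
        raise ValueError("sequences must have the same length (no gaps)")
    if not r:
        raise ValueError("sequences must be non-empty")
    # m = number of positions at which the two sequences differ
    m = sum(1 for x, y in zip(r, s) if x != y)
    return m / len(r)


# Hypothetical example: the sequences differ at 2 of 10 aligned sites
p = mismatch_frequency("ACGTACGTAC", "ACGTTCGTAA")
print(p)
```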

Assume a constant rate of substitution, λ, in both lineages. Let *t* be the elapsed time since *s* and *r* diverged from a common ancestor.
The number of substitutions per site is *2λt*, which is unknown.

We define a Markov model of instantaneous change. The simplest
such model for DNA is the *Jukes-Cantor* model, which assumes

- that all point mutations (A->C, A->G, A->T, C->A...) are
equally likely and occur at a rate *a*.
- that the substitution probabilities at each site are independent.

| From \ To | A | G | C | T |
|---|---|---|---|---|
| **A** | 1-3a | a | a | a |
| **G** | a | 1-3a | a | a |
| **C** | a | a | 1-3a | a |
| **T** | a | a | a | 1-3a |

The consequences of this assumption are that

- the rate of substitution is *λ = 3a*,
- the expected number of substitutions per site is *2λt = 6at*, where *t* is the time since the two sites diverged from their common ancestor,
- given sufficient time, all nucleotides will appear equally frequently in
the sequence. In other words, the stationary distribution of this Markov chain
is *(0.25, 0.25, 0.25, 0.25)*.
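The convergence to the stationary distribution can be checked numerically. The sketch below treats the Jukes-Cantor matrix as the one-step transition matrix of a discrete-time chain, which is a simplifying assumption, and uses an arbitrary illustrative value *a* = 0.01. The helper names (`jc_matrix`, `mat_mul`, `mat_power`) are hypothetical.

```python
def jc_matrix(a):
    """One-step Jukes-Cantor transition matrix, nucleotide order A, G, C, T."""
    return [[1 - 3 * a if i == j else a for j in range(4)] for i in range(4)]


def mat_mul(x, y):
    """Multiply two 4x4 matrices given as lists of rows."""
    return [[sum(x[i][k] * y[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def mat_power(m, steps):
    """Raise m to a non-negative integer power by repeated multiplication."""
    result = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
    for _ in range(steps):
        result = mat_mul(result, m)
    return result


a = 0.01  # assumed value, chosen only for illustration
P = jc_matrix(a)
for steps in (1, 10, 100, 1000):
    # distribution over A, G, C, T for a site that started as A
    row_from_A = mat_power(P, steps)[0]
    print(steps, [round(x, 4) for x in row_from_A])
```

Each row of the one-step matrix sums to 1, and the probability of leaving the current state in one step is 3*a*, matching *λ = 3a*. As the number of steps grows, every row approaches (0.25, 0.25, 0.25, 0.25) regardless of the starting nucleotide.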

Strategy for estimating the expected number of substitutions per site:

- Define a Markov model of sequence evolution
- Use the model to derive an expression for *P_{mismatch}* in terms of *at*: *p = f(at)*
- Solve for *at = f^{-1}(p)*
- *E[substitutions/site] ≈ 6 f^{-1}(p)*

**First**, we define a Markov model of sequence evolution,
such as the Jukes-Cantor model introduced above.

**Second**, using this Markov chain we derive an expression describing how
changes accumulate over a period of time *t*. The probability that residue
*X* is in site *k* after time *t* is

p_{XX}(t) = 1/4 + 3/4 e^{- 4 a t}, if we started out in state *X*, and

p_{YX}(t) = 1/4 - 1/4 e^{- 4 a t}, if we did not start out in state *X* (that is, we started in some state *Y* ≠ *X*).
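To get a feel for these two probabilities, p_{XX}(t) = 1/4 + 3/4 e^{-4at} and p_{YX}(t) = 1/4 - 1/4 e^{-4at}, the sketch below tabulates them for a few values of *t*. The function names and the value *a* = 0.01 are assumptions made for illustration.

```python
import math


def p_same(a, t):
    """p_XX(t): probability of state X at time t, given a start in X."""
    return 0.25 + 0.75 * math.exp(-4 * a * t)


def p_diff(a, t):
    """p_YX(t): probability of state X at time t, given a start in Y != X."""
    return 0.25 - 0.25 * math.exp(-4 * a * t)


a = 0.01  # assumed value, chosen only for illustration
for t in (0, 1, 10, 100, 1000):
    same, diff = p_same(a, t), p_diff(a, t)
    # From any start, the site ends in the starting state or in one of the
    # three other states, so same + 3 * diff should equal 1.
    print(t, round(same, 4), round(diff, 4), round(same + 3 * diff, 4))
```

At *t* = 0 the site is certainly in its starting state; as *t* grows, both probabilities approach 1/4, in agreement with the stationary distribution.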

**Third**, using the above formulae, we derive an expression for
*P_{mismatch}*, the expected number of observable differences,
between the two sequences:

P_{mismatch}= 3/4 ( 1 - e^{ - 8 a t}).
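One way to see where this expression comes from is sketched below; the intermediate steps are not spelled out in these notes. The two sequences match at a site when both lineages, each evolving for time *t* from the same ancestral state, end in the same state: either both keep the ancestral state, or both change to the same one of the three other states. Writing *u* = e^{-4at} as shorthand:

```latex
\begin{align*}
P_{match} &= p_{XX}(t)^2 + 3\,p_{YX}(t)^2 \\
          &= \left(\tfrac{1}{4} + \tfrac{3}{4}u\right)^2
             + 3\left(\tfrac{1}{4} - \tfrac{1}{4}u\right)^2 \\
          &= \tfrac{1}{4} + \tfrac{3}{4}u^2
           = \tfrac{1}{4} + \tfrac{3}{4}e^{-8at}, \\
P_{mismatch} &= 1 - P_{match} = \tfrac{3}{4}\left(1 - e^{-8at}\right).
\end{align*}
```

The cross terms in *u* cancel, which is why only e^{-8at} remains. The exponent is doubled because the two sequences are separated by a total time of 2*t*.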

We solve the above equation to obtain an expression for
*a t* in terms of *P_{mismatch}*:

```
a t = -1/8 ln ( 1 - 4/3 p ).
```

**Fourth**, the expected number of substitutions per site (whether they are
observable or not) is *6 a t* (see above). Multiplying both sides of the
equation by 6, we obtain an expression for the expected number of
substitutions per site, in terms of the number of sites with an
observable difference:

E[sub] = - 3/4 ln ( 1 - 4/3 P_{mismatch}).

If we estimate the expected number of observable differences by the number of differences actually observed, then

E[sub] = - 3/4 ln ( 1 - 4/3 m/n ).

So, for example, if we observe mismatches at 10% of the sites, then the Jukes-Cantor model predicts that the actual number of substitutions per site is 0.107.
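The correction is easy to evaluate in code. The following is a minimal sketch, not a reference implementation; the function name `jukes_cantor_distance` is hypothetical. Note that the logarithm is only defined when *m/n* < 3/4, so the sketch rejects larger mismatch frequencies.

```python
import math


def jukes_cantor_distance(m, n):
    """Estimate substitutions per site from m mismatches in n aligned sites."""
    if n <= 0 or not 0 <= m <= n:
        raise ValueError("need n > 0 and 0 <= m <= n")
    p = m / n
    if p >= 0.75:
        # 1 - 4p/3 <= 0 here, so the logarithm is undefined
        raise ValueError("correction is undefined for m/n >= 3/4")
    return -0.75 * math.log(1 - 4 * p / 3)


# The example from the text: mismatches at 10% of the sites
print(round(jukes_cantor_distance(10, 100), 3))  # 0.107

# The gap between observed and corrected values widens as m/n grows
for m in (1, 10, 30, 50):
    print(m / 100, round(jukes_cantor_distance(m, 100), 3))
```

For small *m/n* the corrected value is close to the observed frequency, because multiple substitutions at the same site are rare. The correction grows rapidly as *m/n* approaches 3/4.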

For more details, see Mona Singh's Phylogeny notes or Durbin, *et al*:
8.1, 8.2.

Last modified: Oct. 7, 2010.

Maintained by Dannie Durand (durand@cs.cmu.edu) and Annette McLeod.