03-511/711, 15-495/856 Course Notes - Sept 30, 04

Review of distance-based methods

Obtain pairwise distances from a multiple sequence alignment. Correct for multiple substitutions.
Fitting distances to a tree
- Additive matrices
  - distances fit a tree
  - satisfy the 4-point condition
- Ultrametric matrices
  - distances fit a rooted tree where all paths from the leaves to the root are of equal length
  - satisfy the 3-point condition
  - rate of mutation is the same in all lineages
  - branch lengths are proportional to time
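
The two conditions above can be checked directly on a distance matrix. The following is a minimal sketch, assuming D is a symmetric list of lists that already satisfies the metric axioms; the function names and the tolerance parameter are illustrative and not part of the notes. The 3-point condition requires that, in every triple, the two largest distances are equal; the 4-point condition requires that, in every quadruple, the two largest of the three pairwise sums are equal.

```python
from itertools import combinations

def is_ultrametric(D, tol=1e-9):
    """3-point condition: in every triple the two largest distances are equal."""
    for i, j, k in combinations(range(len(D)), 3):
        d = sorted([D[i][j], D[i][k], D[j][k]])
        if abs(d[2] - d[1]) > tol:
            return False
    return True

def is_additive(D, tol=1e-9):
    """4-point condition: in every quadruple the two largest pairwise sums are equal."""
    for i, j, k, l in combinations(range(len(D)), 4):
        s = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(s[2] - s[1]) > tol:
            return False
    return True

# Hypothetical example matrices (taxa A, B, C, D in that order).
U = [[0, 2, 6, 6], [2, 0, 6, 6], [6, 6, 0, 4], [6, 6, 4, 0]]  # fits a clock-like tree
A = [[0, 5, 4, 5], [5, 0, 7, 8], [4, 7, 0, 5], [5, 8, 5, 0]]  # fits a tree, but no clock

print(is_ultrametric(U), is_additive(U))
print(is_ultrametric(A), is_additive(A))
```

Both checks enumerate all triples or quadruples, so they are meant for small illustrative matrices rather than large data sets.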

Distance-based phylogeny reconstruction when the observed distances do not fit a tree

UPGMA and Neighbor Joining (NJ) are greedy algorithms that will reconstruct the correct tree in polynomial time when D[i,j], the matrix of observed inter-taxon distances, is ultrametric or additive, respectively.

How should you reconstruct a tree when D[i,j] is not additive? NJ can be used as a greedy heuristic. It is not guaranteed to give the correct tree, but if D[i,j] does not deviate too far from additive it may give a tree which is a good hypothesis for the evolutionary events that occurred.

Alternatively, one may use distance-based exhaustive search. The basic approach is as follows:

For each tree topology, t, with k leaves
- Estimate the branch lengths.
- Score the tree.
Select the tree with optimal score.

This problem is NP-complete. Since the number of trees grows very rapidly with the number of taxa, exhaustive search can only be used for small data sets.
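
To make the growth concrete: the number of unrooted, fully resolved (binary) tree topologies on k labeled leaves is (2k-5)!! = 1 x 3 x 5 x ... x (2k-5) for k >= 3. The notes do not state this formula; it is the standard counting result, and the function name below is illustrative.

```python
def num_unrooted_trees(k):
    """Count unrooted binary tree topologies on k labeled leaves: (2k-5)!!"""
    if k < 3:
        raise ValueError("formula applies for k >= 3")
    count = 1
    for m in range(3, 2 * k - 4, 2):  # odd factors 3, 5, ..., 2k-5
        count *= m
    return count

for k in (4, 5, 10, 20):
    print(k, num_unrooted_trees(k))
```

Already at k = 10 there are more than two million topologies to score, which is why exhaustive search is limited to small data sets.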

The following questions remain:

How do we estimate the branch lengths?
How do we score the tree?

Branch Lengths

We select branch lengths that minimize the error in fitting the tree metric, T[i,j], to D[i,j]. There are a number of ways to measure this error. Many take the following approach:

	E = \sum_{i=1}^{k} \sum_{j=i+1}^{k} |D[i,j] - T[i,j]|^a	(1)

where

E is the error in fitting the distances to the tree
k is the number of taxa
D[i,j] is the pairwise distance estimate
T[i,j] is the distance between i and j on the tree
a is the exponent: a = 1 gives the sum of absolute errors, a = 2 gives the sum of squared errors

This model can be refined by choosing different values of a.
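
As a small illustration of Equation (1), the sketch below evaluates the error for a given pair of matrices. It assumes D and T are symmetric lists of lists indexed from 0, and it takes the absolute value of each difference so that positive and negative errors cannot cancel when a = 1 (an assumption about the intended form of the equation). The function name and the example matrices are hypothetical.

```python
def fit_error(D, T, a=2):
    """Sum of |D[i][j] - T[i][j]|**a over all pairs i < j."""
    k = len(D)
    return sum(abs(D[i][j] - T[i][j]) ** a
               for i in range(k) for j in range(i + 1, k))

# Hypothetical observed distances and tree distances for three taxa.
D = [[0, 3, 5], [3, 0, 6], [5, 6, 0]]
T = [[0, 4, 7], [4, 0, 6], [7, 6, 0]]

print(fit_error(D, T, a=1))  # sum of absolute errors: 1 + 2 + 0
print(fit_error(D, T, a=2))  # sum of squared errors:  1 + 4 + 0
```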

Computing the branch lengths requires considerable computational effort. We need to find edge lengths e₁, e₂, ... such that Equation (1) is minimized, subject to the following constraints:

The edge lengths are greater than zero: e_i > 0 .
The tree distances are at least as large as the observed distances: T[i,j] ≥ D[i,j]. This is because multiple substitutions tend to cause us to underestimate intertaxon distances.

Scoring the tree

There are also several approaches to scoring the tree, t. One approach is to score the tree according to its error, as given in Equation 1. In this case, the minimum error is sought.

Another frequently used approach is Minimum Evolution. In this method, the branch lengths are fitted using ordinary least squares (a=2). The tree is scored by the length of the tree; that is, by summing the lengths of all branches in the tree.

Rzhetsky and Nei (1993) claim that minimum evolution performs well on simulated trees and that under certain conditions the true tree minimizes the tree length score. However, Gascuel (2000) reported that minimum evolution did not work much better than NJ in his simulation studies.

Greedy methods for distance-based phylogeny reconstruction

Taxa are points in a metric space with pairwise distances, D[i,j]. Tree building is equivalent to hierarchical clustering of these points.

UPGMA and NJ both maintain a forest of subtrees, beginning with the set of singleton trees (i.e., trees with one leaf and no edges). At each iteration, the algorithm merges two neighboring subtrees in the forest. This step is repeated until only one tree remains, which is the final result.

The algorithms differ in

How neighbors to be merged are identified.
How the branch lengths are computed.

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

The UPGMA algorithm is a variant of average linkage. UPGMA is based on the molecular clock assumption. The consequences of this assumption are that

At each step, the two closest taxa are selected as neighbors.
The height of the least common ancestor of any pair of leaves is half the distance between the leaves.
If a distance matrix, D[], is ultrametric, then UPGMA will reconstruct the correct rooted tree in quadratic time.

However, if the assumption is violated (i.e., if D is not ultrametric), then

UPGMA will not give the correct tree even if D[] is additive.
The tree that results can have the incorrect topology and/or the incorrect branch lengths.
Errors in topology can occur because UPGMA selects pairs of taxa with minimum distance as neighbors, leading to the wrong tree when the closest taxa are not adjacent in the true tree.
Errors in the branch lengths can occur because UPGMA assumes that the internal node joining a pair of leaves is always equidistant between them.
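
The behaviour described above can be reproduced with a small UPGMA sketch. This is an illustrative implementation under stated assumptions, not the course's reference code: clusters are merged by smallest average distance, the new node is placed at height d/2, and distances to the merged cluster are size-weighted averages. The tree is returned as nested tuples (left, right, height); the function name and example matrices are hypothetical.

```python
def upgma(D, names):
    """Return the UPGMA tree as nested tuples (left, right, height)."""
    n = len(names)
    clusters = {i: (names[i], 1) for i in range(n)}  # id -> (subtree, size)
    dist = {(i, j): float(D[i][j]) for i in range(n) for j in range(i + 1, n)}
    next_id = n
    while len(clusters) > 1:
        # Greedy step: pick the closest pair of clusters.
        (i, j), d = min(dist.items(), key=lambda kv: kv[1])
        ti, ni = clusters.pop(i)
        tj, nj = clusters.pop(j)
        # Distance from the merged cluster to each remaining cluster m.
        new_dist = {}
        for m in clusters:
            dim = dist[(min(i, m), max(i, m))]
            djm = dist[(min(j, m), max(j, m))]
            new_dist[(m, next_id)] = (ni * dim + nj * djm) / (ni + nj)
        dist = {p: v for p, v in dist.items() if i not in p and j not in p}
        dist.update(new_dist)
        # Molecular clock: the ancestor sits at half the cluster distance.
        clusters[next_id] = ((ti, tj, d / 2.0), ni + nj)
        next_id += 1
    (tree, _size), = clusters.values()
    return tree

names = ["A", "B", "C", "D"]
U = [[0, 2, 6, 6], [2, 0, 6, 6], [6, 6, 0, 4], [6, 6, 4, 0]]  # ultrametric
A = [[0, 5, 4, 5], [5, 0, 7, 8], [4, 7, 0, 5], [5, 8, 5, 0]]  # additive only
print(upgma(U, names))
print(upgma(A, names))
```

The matrix A was built from an unrooted tree in which A and B are neighbors and C and D are neighbors, with edge lengths 1, 4, 2, 3 and an internal edge of length 1. Because the smallest entry is D[A,C] = 4, UPGMA joins A with C first and returns the wrong topology, even though the matrix is additive.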

Neighbor Joining (NJ)

The NJ algorithm deals with this problem by correcting for variations in the rate of change. The "corrected" distance between a pair of nodes i and j is calculated by subtracting from D[i,j] the average distance from i to all other leaves and the average distance from j to all other leaves.

Thm: If D is additive, the pair of taxa that minimizes the "corrected" distance are neighbors in the true tree.
Proof: Durbin et al., 7.8

If D is additive, then NJ will reconstruct the correct unrooted tree in polynomial time (O(k³) for the standard algorithm).
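
A compact NJ sketch is given below. It follows the usual textbook formulation, which is an assumption here since the notes do not spell out the formulas: with m active nodes, r[i] is the sum of distances from i to the other nodes divided by (m - 2), the pair minimizing D[i,j] - r[i] - r[j] is joined, and the new node is placed at distance (D[i,j] + r[i] - r[j]) / 2 from i. The function name and the internal node labels n0, n1, ... are hypothetical, and taxon names are assumed not to collide with them.

```python
def neighbor_joining(D, names):
    """Return the NJ tree as a list of (node, node, branch_length) edges."""
    d = {(a, b): float(D[i][j])
         for i, a in enumerate(names) for j, b in enumerate(names) if i != j}
    nodes = list(names)
    edges = []
    internal = 0
    while len(nodes) > 2:
        m = len(nodes)
        # Average distance from each node to the others (normalized by m - 2).
        r = {a: sum(d[a, b] for b in nodes if b != a) / (m - 2) for a in nodes}
        # Choose the pair with the smallest corrected distance.
        pairs = ((a, b) for x, a in enumerate(nodes) for b in nodes[x + 1:])
        i, j = min(pairs, key=lambda p: d[p] - r[p[0]] - r[p[1]])
        new = f"n{internal}"
        internal += 1
        # Branch lengths from i and j to their new parent.
        li = 0.5 * (d[i, j] + r[i] - r[j])
        lj = d[i, j] - li
        edges += [(i, new, li), (j, new, lj)]
        # Distances from the new node to every remaining node.
        for c in nodes:
            if c not in (i, j):
                d[new, c] = d[c, new] = 0.5 * (d[i, c] + d[j, c] - d[i, j])
        nodes = [c for c in nodes if c not in (i, j)] + [new]
    a, b = nodes
    edges.append((a, b, d[a, b]))
    return edges

# Additive but not ultrametric: true tree groups A with B and C with D.
A = [[0, 5, 4, 5], [5, 0, 7, 8], [4, 7, 0, 5], [5, 8, 5, 0]]
for edge in neighbor_joining(A, ["A", "B", "C", "D"]):
    print(edge)
```

On this hypothetical matrix the closest pair is A and C, yet the corrected distance is smallest for the pair A, B, so NJ joins the true neighbors and recovers the edge lengths 1, 4, 2, 3 and the internal edge of length 1.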

Determining the root of the tree

If D is ultrametric, then the root can be determined directly from the data as a consequence of the molecular clock hypothesis. The root is located at the midpoint of the longest path between two taxa. UPGMA does this automatically.

If D is not ultrametric, then additional information is needed. To root a tree, one should add an outgroup to the data set. An outgroup is a taxon for which external information (e.g., paleontological information, morphology, ...) indicates that it branched off before all other taxa. For example, bear could be used as an outgroup in a canine phylogeny.

The outgroup should not be too closely related to the taxa in question. Nor should the outgroup be very distantly related to the taxa. The use of more than one outgroup generally improves the estimate of tree topology.

Last modified: September 30th, 2004.
Maintained by Dannie Durand and Annette McLeod (durand@cs.cmu.edu).