03-511/711, 15-495/856 Course Notes - Oct 05, 2006


Distance-based methods

Greedy methods for distance-based phylogeny reconstruction

Taxa are points in a metric space with pairwise distances, O[i,j]. Tree building is equivalent to hierarchical clustering of these points.

Both of these greedy algorithms maintain a forest of subtrees, beginning with the set of singleton trees (i.e., trees with one leaf and no edges). At each iteration, the algorithm merges two neighboring subtrees in the forest. This step is repeated until only one tree remains - the final result.
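The merge loop described above can be sketched in Python as follows. This is a minimal illustration, not code from the course. It assumes that "neighboring" means the pair of subtrees with the smallest current distance, and that the distance to a merged subtree is the size-weighted average of the distances to its two parts (average linkage). The function name and the nested-tuple tree representation are hypothetical.

```python
def greedy_average_linkage(names, dist):
    """Merge the closest pair of subtrees until one tree remains.

    names: list of taxon labels; dist: symmetric matrix (list of lists).
    Returns the tree as nested 2-tuples of labels (topology only).
    """
    # The forest: cluster id -> (subtree, number of leaves in it).
    clusters = {i: (name, 1) for i, name in enumerate(names)}
    # Current distances, keyed by (smaller id, larger id).
    d = {(i, j): dist[i][j]
         for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        a, b = min(d, key=d.get)            # closest pair of subtrees
        (tree_a, n_a) = clusters.pop(a)
        (tree_b, n_b) = clusters.pop(b)
        for c in clusters:                  # distances to the merged subtree
            d_ac = d.pop((min(a, c), max(a, c)))
            d_bc = d.pop((min(b, c), max(b, c)))
            d[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        del d[(a, b)]
        clusters[next_id] = ((tree_a, tree_b), n_a + n_b)
        next_id += 1
    (tree, _), = clusters.values()
    return tree
```

For example, with taxa A, B, C, D where O[A,B] = 2, O[C,D] = 4 and all other distances are 8, the sketch first joins A and B, then C and D, then the two pairs.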

The algorithms differ in

Unweighted Pair Group Method with Arithmetic Mean (UPGMA)

The UPGMA algorithm is a variant of average linkage. UPGMA is based on the molecular clock assumption. The consequences of this assumption are that

However, if the assumption is violated (i.e., if O is not ultrametric), then


Neighbor Joining (NJ)

The NJ algorithm deals with this problem by correcting for variations in the rate of change. The "corrected" distance between a pair of nodes is calculated by subtracting the average of the distances to all other leaves.
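A small sketch of the correction step may help. It assumes the formulation in Durbin et al., where the corrected distance is D[i,j] = O[i,j] - (r[i] + r[j]) and r[i] is the sum of the distances from i to all other leaves divided by (n - 2). The function names are hypothetical, and only the pair selection is shown, not the full NJ algorithm.

```python
def nj_corrected(dist):
    """Return the corrected matrix D[i][j] = O[i][j] - (r[i] + r[j]).

    Assumes at least three taxa and a symmetric matrix with zero diagonal.
    """
    n = len(dist)
    r = [sum(row) / (n - 2) for row in dist]   # average distance to the others
    return [[dist[i][j] - (r[i] + r[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]


def closest_pair(matrix):
    """Return the index pair (i, j), i < j, with the smallest entry."""
    n = len(matrix)
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    return min(pairs, key=lambda p: matrix[p[0]][p[1]])


# Additive distances for the unrooted tree ((A,B),(C,D)) with leaf edges
# A=1, B=4, C=1, D=4 and an internal edge of length 1.
O = [[0, 5, 3, 6],
     [5, 0, 6, 9],
     [3, 6, 0, 5],
     [6, 9, 5, 0]]
```

In this example the smallest raw distance is between A and C, which are not neighbors, so joining the closest pair would give the wrong topology. After the correction, the minimum falls on A,B (tied with C,D), both of which are neighbor pairs in the tree.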

Thm:
If O is additive, the pair of taxa that minimizes the "corrected" distance is a pair of neighbors in the true tree.
Proof:
Durbin et al., 7.8

If O is additive, then NJ will reconstruct the correct unrooted tree in polynomial time; the standard algorithm takes time cubic in the number of taxa.


Distance-based phylogeny reconstruction when the observed distances do not fit a tree

UPGMA and Neighbor Joining (NJ) are greedy algorithms that will reconstruct the correct tree in polynomial time when O[i,j], the matrix of observed inter-taxon distances, is ultrametric or additive, respectively.
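Whether a given matrix meets these conditions can be checked directly. The sketch below uses the standard three-point condition for ultrametricity (in every triple, the two largest distances are equal) and the four-point condition for additivity (in every quadruple, the two largest of the three pairwise sums are equal). The function names and the tolerance parameter are assumptions made for illustration, and the input is assumed to be a symmetric matrix that already satisfies the metric axioms.

```python
from itertools import combinations


def is_ultrametric(d, tol=1e-9):
    """Three-point condition: the two largest distances in each triple tie."""
    for i, j, k in combinations(range(len(d)), 3):
        _, mid, largest = sorted((d[i][j], d[i][k], d[j][k]))
        if abs(largest - mid) > tol:
            return False
    return True


def is_additive(d, tol=1e-9):
    """Four-point condition: the two largest pairwise sums in each quadruple tie."""
    for i, j, k, l in combinations(range(len(d)), 4):
        _, mid, largest = sorted((d[i][j] + d[k][l],
                                  d[i][k] + d[j][l],
                                  d[i][l] + d[j][k]))
        if abs(largest - mid) > tol:
            return False
    return True
```

Every ultrametric matrix is also additive, but not the other way around, which is why NJ applies to a wider class of inputs than UPGMA.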

How should you reconstruct a tree when O[i,j] is not additive? NJ can be used as a greedy heuristic. It is not guaranteed to give the correct tree, but if O[i,j] does not deviate too far from additivity, it may give a tree that is a good hypothesis for the evolutionary events that occurred.

Alternatively, one may use distance-based exhaustive or heuristic search. The basic approach is as follows

This problem is NP-complete. Since the number of trees grows very rapidly with the number of taxa, exhaustive search can only be used for small data sets.
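To get a feel for how fast the search space grows, the sketch below counts tree topologies. It assumes that "trees" means unrooted, binary trees with labeled leaves, for which the count with n taxa is (2n - 5)!! = 1 x 3 x 5 x ... x (2n - 5). The function name is hypothetical.

```python
def num_unrooted_trees(n):
    """Number of unrooted binary tree topologies on n >= 3 labeled taxa."""
    if n < 3:
        raise ValueError("need at least 3 taxa")
    count = 1
    for k in range(3, 2 * n - 4, 2):   # odd factors 3, 5, ..., 2n - 5
        count *= k
    return count


for n in (4, 5, 10, 20):
    print(n, num_unrooted_trees(n))
```

With 10 taxa there are already more than two million topologies, so exhaustive search becomes impractical very quickly.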

The following questions remain:

  • Branch Lengths

    We select branch lengths that minimize the error in fitting the tree metric, T[i,j], to O[i,j]. There are a number of ways to measure this error. Many take the following approach:

    Q = \sum_{i=1}^{k} \sum_{j=i+1}^{k} (O[i,j] - T[i,j])^2     (1)

    where

    Computing the branch lengths requires considerable computational effort. We need to find edge lengths x = x1, x2, ... such that the error Q given in Equation (1) is minimized. We do this by setting the partial derivative of Q with respect to each edge length to zero and solving the resulting system of equations. A separate system must be solved for each tree topology. Sometimes the following constraints are imposed:

  • Scoring the tree

    There are several approaches to scoring the tree. One approach is to score the tree according to its error, as given in Equation 1. In this case, the tree with the minimum error is sought.

    Another frequently used approach is Minimum Evolution. In this method, the branch lengths are fitted using ordinary least squares (Equation 1). The score of the tree is the length of the tree; that is, the sum of the lengths of all branches in the tree. The tree with the smallest total length is sought.

    Rzhetsky and Nei (1993) claim that minimum evolution performs well on simulated trees and that under certain conditions the true tree minimizes the tree length score. However, Gascuel (2000) reported that minimum evolution did not work much better than NJ in his simulation studies.
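The branch-length fitting and the two scores described above can be sketched for a fixed topology as follows. The sketch assumes the topology is given as a 0/1 path matrix P, where P[p][e] = 1 if edge e lies on the path between the p-th pair of taxa, so that T = P x. Setting the derivative of Q to zero then gives the normal equations (P^T P) x = P^T O, which are solved here by plain Gaussian elimination. No constraints are imposed on the edge lengths, so fitted lengths may come out negative. All function and variable names are hypothetical.

```python
def solve(m, v):
    """Solve the square system m x = v by Gauss-Jordan elimination."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        # Partial pivoting: bring the largest entry in this column to the diagonal.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                f = aug[r][col] / aug[col][col]
                aug[r] = [x - f * y for x, y in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def fit_branch_lengths(paths, observed):
    """Return (edge lengths x, error Q from Equation (1), tree length)."""
    n_edges = len(paths[0])
    # Normal equations: (P^T P) x = P^T O.
    ptp = [[sum(row[a] * row[b] for row in paths) for b in range(n_edges)]
           for a in range(n_edges)]
    pto = [sum(row[a] * o for row, o in zip(paths, observed))
           for a in range(n_edges)]
    x = solve(ptp, pto)
    fitted = [sum(p * xe for p, xe in zip(row, x)) for row in paths]
    q = sum((o - t) ** 2 for o, t in zip(observed, fitted))
    return x, q, sum(x)


# Topology ((A,B),(C,D)); edges ordered [a, b, c, d, internal];
# pairs ordered AB, AC, AD, BC, BD, CD.
paths_ab_cd = [[1, 1, 0, 0, 0],
               [1, 0, 1, 0, 1],
               [1, 0, 0, 1, 1],
               [0, 1, 1, 0, 1],
               [0, 1, 0, 1, 1],
               [0, 0, 1, 1, 0]]
observed = [5, 3, 6, 6, 9, 5]
lengths, q, tree_length = fit_branch_lengths(paths_ab_cd, observed)
```

The observed distances in this example are additive on the topology ((A,B),(C,D)), so the fit recovers the edge lengths with Q close to zero. Fitting the same distances to a different topology leaves a positive error. In a search, this fit would be repeated for each candidate topology and the candidates compared by Q or by tree length.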



    Evaluating how well the data supports a given tree

    Bootstrapping, Branches and Partitions




    Last modified: October 5th, 2006.
    Maintained by Dannie Durand and Annette McLeod (durand@cs.cmu.edu).