## 03-511/711, 15-495/856 Course Notes - Oct. 28, 2010

### Gibbs Sampler

A method for discovering ungapped local alignments (Lawrence et al., 1993). See Ewens & Grant, section 6.6, for a straightforward introduction to their algorithm.

• Input: k sequences, t1, ..., tk, of lengths n1, n2, ..., nk
• Output: k subsequences of length w starting at positions i1...ik that are "most similar" to each other
• Search space: All possible alignments or, equivalently, all sets of starting positions.

Intuition:
• Input: Grandma's illegible pie recipe
• Initialization: guess quantities of butter, sugar, ...
• Repeat {
  • Select an ingredient i at random.
  • Vary the quantity of i. Bake. Taste. Update the quantity of i in the recipe.
} until (the pie stops changing or it's time for the party).
Mathematical Framework:

We seek a property of the joint probability distribution of a set of random variables: f(i1, i2, ..., ik). For example, these random variables might be the quantities of the ingredients in the pie recipe, or the starting positions of a conserved pattern in a set of k sequences. In many cases, calculating the joint probability is hard, but calculating the conditional probability f(ij | i1, ..., ij−1, ij+1, ..., ik) is relatively easy. The Gibbs method simulates the joint distribution by generating an instance of each random variable in turn, conditional on the current values of the remaining k−1 random variables.
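As a concrete illustration of sampling a joint distribution via its conditionals, here is a minimal sketch in Python. This toy example is not from the notes: it samples a bivariate normal with correlation rho, where the joint is awkward to sample directly but each conditional is a simple one-dimensional normal.

```python
# Toy Gibbs sampler for a bivariate normal with correlation rho.
# Each conditional is a 1-D normal:
#   X | Y=y ~ N(rho*y, 1 - rho^2),   Y | X=x ~ N(rho*x, 1 - rho^2)
import math
import random

def gibbs_bivariate_normal(rho, n_samples, burn_in=500):
    x, y = 0.0, 0.0                      # arbitrary starting state
    sd = math.sqrt(1.0 - rho * rho)      # conditional standard deviation
    samples = []
    for t in range(burn_in + n_samples):
        # Resample each variable conditioned on the current value of the other.
        x = random.gauss(rho * y, sd)
        y = random.gauss(rho * x, sd)
        if t >= burn_in:                 # discard the burn-in period
            samples.append((x, y))
    return samples

random.seed(0)
pts = gibbs_bivariate_normal(rho=0.8, n_samples=5000)
mean_x = sum(p[0] for p in pts) / len(pts)   # should be close to 0
```

After burn-in, the chain's samples approximate draws from the joint distribution, even though only one-dimensional conditionals were ever sampled.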

Algorithm:
Let t* = t1
For j = 2 to k
let ij be a random number between 1 and nj − w + 1
Let A be a (k−1) x w dimensional matrix containing the local MSA of the k-1 subsequences of length w starting at t2[i2], t3[i3]...tk[ik]
Repeat {
• Let S[|Σ|, w] be the PSSM of the local alignment A[ ]
• For each position i in t*, let

P[i] = score(t*, i, w) / Σj score(t*, j, w)     (1)

where the sum in the denominator is over all positions j = 1 to (n* − w + 1), with n* the length of t*, and

score(t*, j, w) = ΣL S[t*[j + L − 1], L], where 1 ≤ L ≤ w.

• Select a starting position i* in t* with probability P[i*]
• Select a row, r, in A[ ] at random and replace it with the subsequence of length w starting at t*[i*]
• Let t* be the sequence whose subsequence previously occupied row r.
} until (S stops changing)
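The algorithm above can be sketched in Python as follows. This is an illustrative implementation, not the authors' code: the names (`gibbs_motif`, `pssm`, `score`) are hypothetical, the PSSM stores pseudocount-smoothed frequencies rather than log-odds scores, the held-out sequence t* is chosen at random each iteration (a common variant of the row swap above), and the loop runs a fixed number of iterations instead of testing whether S has stopped changing.

```python
import random
from collections import Counter

ALPHABET = "ACGT"

def pssm(motifs, w, pseudocount=1.0):
    """Per-column character frequencies of the alignment, with pseudocounts."""
    cols = []
    for pos in range(w):
        counts = Counter(m[pos] for m in motifs)
        total = len(motifs) + pseudocount * len(ALPHABET)
        cols.append({c: (counts[c] + pseudocount) / total for c in ALPHABET})
    return cols

def score(S, seq, i, w):
    """Probability of the window seq[i:i+w] under the PSSM S."""
    p = 1.0
    for L in range(w):
        p *= S[L][seq[i + L]]
    return p

def gibbs_motif(seqs, w, n_iters=200, rng=random):
    k = len(seqs)
    # Initialization: a random starting position in each sequence.
    starts = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_iters):
        star = rng.randrange(k)          # the held-out sequence plays t*
        others = [seqs[j][starts[j]:starts[j] + w] for j in range(k) if j != star]
        S = pssm(others, w)              # PSSM of the remaining k-1 windows
        t = seqs[star]
        weights = [score(S, t, i, w) for i in range(len(t) - w + 1)]
        total = sum(weights)
        probs = [x / total for x in weights]   # equation (1)
        # Sample a new starting position i* with probability P[i*].
        r, acc, pick = rng.random(), 0.0, len(probs) - 1
        for i, p in enumerate(probs):
            acc += p
            if acc >= r:
                pick = i
                break
        starts[star] = pick
    return starts
```

Because the position is sampled rather than maximized, the chain can escape some local optima; in practice one still keeps the best-scoring alignment over several runs with different seeds.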

• We assume that each sequence contains exactly one copy of the pattern. Problems can arise if a sequence contains more than one copy or no copies, or if we did not guess the length of the pattern correctly. Also, we might find a biologically meaningful pattern that is not the one we hoped to find. For example, we might find direct repeats from a transposable element insertion when we were looking for a transcription factor binding site.
• Pseudocounts guarantee that every character is represented at every position.
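A small sketch of this point, using a hypothetical helper with add-one smoothing over the DNA alphabet: without the pseudocount, a character absent from a column would get frequency zero, zeroing out (or, in log-odds form, making undefined) the score of any window containing it.

```python
from collections import Counter

def column_freqs(chars, alphabet="ACGT", pseudocount=1.0):
    # Add `pseudocount` virtual observations of every character, so no
    # character ever has frequency exactly zero at this position.
    counts = Counter(chars)
    total = len(chars) + pseudocount * len(alphabet)
    return {c: (counts[c] + pseudocount) / total for c in alphabet}

col = column_freqs("AAAG")   # one alignment column: no C or T observed
# col["A"] = (3+1)/8 = 0.5, col["C"] = (0+1)/8 = 0.125 — small but nonzero
```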

• Convergence:
• The Gibbs method samples a Markov chain with N states, where N = Πi (ni − w + 1). The states correspond to all possible sets of starting positions.
• The transition probability is determined by equation (1).
• We require randomness to avoid being trapped in local optima.
• This Markov chain is finite, irreducible (the state space is connected), and aperiodic. Therefore, it has a stationary distribution.
• Given enough time, the algorithm will converge to the alignment that maximizes this distribution. (For details, see optional readings.)
• In practice, the algorithm can get stuck in local optima. This problem can be mitigated by executing several runs with different random starts.

• Black magic
• How to choose pseudocounts?
• How to choose the window length, w?
• When to quit?
• How to choose the initial alignment?

• More about theory, convergence proofs, black magic
• Ewens & Grant 10.5.2, Lawrence, et al. 1993, Casella & George, 1992.