## Pairwise Global Sequence Comparison

#### Applications

• Detecting sequencing errors by comparing the output of two sequencing runs of the same DNA fragment
• Comparing closely related gene and protein sequences
• Comparing cDNA with genomic DNA
• Protein structure prediction
• Identifying single nucleotide polymorphisms (SNPs)

## Definitions

#### Alignment

• Σ' = Σ ∪ {"_"}, where Σ is the sequence alphabet and "_" is the gap character
• Given s[1..m] and t[1..n],   α(s',t') is an alignment if
1. s', t' ∈ (Σ')*
2. |s'| = |t'| = l ≥ max{m,n}
3. Removing every "_" from s' yields s   (likewise, removing every "_" from t' yields t)
4. There is no position i for which s'[i] = t'[i] = "_".
• Goal: Find the optimal alignment w.r.t. a given scoring scheme
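
The conditions above can be checked mechanically. The following is a minimal sketch, not part of the original notes; the function name `is_alignment` and the constant `GAP` are illustrative choices. It does not check condition 1, because the alphabet Σ is not passed in.

```python
GAP = "_"  # gap character, as in the definition of the extended alphabet


def is_alignment(s, t, s_aln, t_aln, gap=GAP):
    """Return True if (s_aln, t_aln) satisfies conditions 2-4 for s and t."""
    # Condition 2: both rows have the same length l >= max(m, n).
    if len(s_aln) != len(t_aln) or len(s_aln) < max(len(s), len(t)):
        return False
    # Condition 3: deleting the gaps recovers the original sequences.
    if s_aln.replace(gap, "") != s or t_aln.replace(gap, "") != t:
        return False
    # Condition 4: no column consists of two gaps.
    return all(not (a == gap and b == gap) for a, b in zip(s_aln, t_aln))
```

For example, `is_alignment("ACGT", "AGT", "ACGT", "A_GT")` holds, while padding both rows with a trailing `"_"` violates condition 4.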

#### Distance Based Scoring

• D[s,t]   =  ∑ d(s'[i], t'[i]),   i  =  1..l,   where (s', t') is an optimal (minimum-cost) alignment of s and t
• d(x,x)  =  0
• d(x,y)   ≥   0
• d(x,"_")   ≥   0
• d(x,z)  ≤  d(x,y) + d(y,z)   (triangle inequality)
• NOTE:
• If d(x,y)  =  1 for all x ≠ y and d(x,"_")  =  1, then D(s,t) is the minimum number of operations required to transform s into t, where the operations are substitution, insertion and deletion. This is called the "edit distance".
• If d(x,y)  ≥  1 and d(x,"_")  ≥  1, then it is called the "weighted edit distance".
• D[s,t] is a metric. It satisfies the triangle inequality.
• D[s,t] is the sum of the distances for positions in the alignment. This implies that we assume positional independence.
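
As a small illustration of positional independence, the cost of a given alignment is just a column-by-column sum. This sketch is not from the original notes; `unit_cost` and `alignment_cost` are hypothetical names, and `unit_cost` assumes the edit-distance scheme (0 for a match, 1 for a mismatch or a gap).

```python
GAP = "_"  # gap character


def unit_cost(x, y):
    """Edit-distance scoring: 0 for identical symbols, 1 otherwise (gaps included)."""
    return 0 if x == y else 1


def alignment_cost(s_aln, t_aln, d=unit_cost):
    """Sum d over the columns of an alignment given as two equal-length rows."""
    if len(s_aln) != len(t_aln):
        raise ValueError("alignment rows must have equal length")
    # Each column contributes independently of its neighbours.
    return sum(d(a, b) for a, b in zip(s_aln, t_aln))
```

Two alignments of the same pair can differ in cost: `alignment_cost("ACGT", "A_GT")` is 1, whereas `alignment_cost("ACGT", "AGT_")` is 3. D[s,t] corresponds to the smallest such value over all alignments.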

#### Dynamic Programming Algorithm for Global Alignment

• Initialization
• D[0,0]  =  0
• D[0,j]  =  D[0,j-1] + d("_", t[j]),   j  =  1..n
• D[i,0]  =  D[i-1,0] + d(s[i], "_"),   i  =  1..m

• Recurrence
 D[i,j]  =   min  { D[i-1,j] + d(s[i], "_"),   D[i-1,j-1] + d(s[i], t[j]),   D[i,j-1] + d("_", t[j]) }

• Compute the score for all pairs of prefixes s[1..i] and t[1..j] in O(m · n) time. D[m,n] gives the score of the optimal alignment.
• Trace back through the alignment matrix in O(m+n) time to obtain the optimal alignment.
• There may be more than one optimal alignment; the traceback recovers one of them, depending on how ties are broken.
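
The initialization, recurrence and traceback can be sketched together as follows. This is one possible implementation under stated assumptions, not the notes' own code: the names `global_align` and `unit_cost` are illustrative, strings are 0-indexed (so s[i] in the notes is `s[i - 1]` here), and the default cost is the unit edit distance. The traceback compares costs with `==`, which is safe for integer costs; floating-point costs would need a tolerance.

```python
GAP = "_"  # gap character


def unit_cost(x, y):
    """Edit-distance scoring: 0 for identical symbols, 1 otherwise (gaps included)."""
    return 0 if x == y else 1


def global_align(s, t, d=unit_cost, gap=GAP):
    """Return (D[m][n], s_aln, t_aln) for one minimum-cost global alignment."""
    m, n = len(s), len(t)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    # Initialization: a prefix aligned against the empty string is all gaps.
    for i in range(1, m + 1):
        D[i][0] = D[i - 1][0] + d(s[i - 1], gap)
    for j in range(1, n + 1):
        D[0][j] = D[0][j - 1] + d(gap, t[j - 1])
    # Recurrence: fill the table in O(m * n) time.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = min(
                D[i - 1][j] + d(s[i - 1], gap),           # s[i] against a gap
                D[i - 1][j - 1] + d(s[i - 1], t[j - 1]),  # s[i] against t[j]
                D[i][j - 1] + d(gap, t[j - 1]),           # gap against t[j]
            )
    # Traceback in O(m + n) steps; ties are broken in favour of the diagonal.
    s_aln, t_aln = [], []
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0 and D[i][j] == D[i - 1][j - 1] + d(s[i - 1], t[j - 1]):
            s_aln.append(s[i - 1])
            t_aln.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and D[i][j] == D[i - 1][j] + d(s[i - 1], gap):
            s_aln.append(s[i - 1])
            t_aln.append(gap)
            i -= 1
        else:
            s_aln.append(gap)
            t_aln.append(t[j - 1])
            j -= 1
    return D[m][n], "".join(reversed(s_aln)), "".join(reversed(t_aln))
```

For instance, `global_align("ACGT", "AGT")` yields a distance of 1 with the rows `"ACGT"` and `"A_GT"`. A different tie-breaking order in the traceback may return a different alignment with the same score.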