15-451 Algorithms
Fall 2012
D. Sleator
String Matching Algorithms November 27, 2012
Knuth-Morris-Pratt (KMP) algorithm
Karp-Rabin (fingerprinting) Algorithm
--------------------------------------------------------------------------
There's an entire field dedicated to solving problems on strings.
The book "Algorithms on Strings, Trees, and Sequences" by Dan
Gusfield covers this field of research. (Not to mention the
algorithms in this lecture.)
Sample problems:
Given a text string and a pattern, find all occurrences of the
pattern in the text. (Classic text search)
The above problem where the pattern can have "don't cares" in it.
Given a string, find longest string that occurs twice in it.
Compute the "edit" distance between two strings, where various
editing operations are define, along cost.
Given a set of strings, compute the cheapest tree that
connects them all together (phylogeny tree)
Compute the shortest "superstring" of a set of strings. That is the
shortest string that contains all the given strings as a substring.
We're only going to scratch the surface of this field and talk about
a couple of these problems and algorithms for them.
The Knuth-Morris-Pratt Algorithm
================================
This algorithm can solve the classic text search problem in linear
time in the length of the text string. (It can also be used for a
variety of other string searching problems.)
You can view the algorithm as deterministic finite automaton. There
are a line of m states. We're in the jth state after we've already
matched a sequence of j characters in a particularly location. Now we
compare the next two characters. If we get a match we move to the
next state in the list (matching j+1 chars). If we get a mismatch, we
can skip back to a previous state --- the furthest back one that we
know we can go to based on the information that we have.
In the practical implementation below the data structure that we build
from preprocessing the pattern is all contained in an array of
integers called memo[]. This implementation illustrates the use of
the algorithm for finding all matches of a pattern s in a text t.
public class kmp {
public static void main(String[] args) {
char[] s = "test".toCharArray();
int n = s.length;
int[] memo = new int[n];
/* memo[i] will store the length of the longest prefix of s
that matches the tail of s1...si */
int j=0;
for (int i=1; i0 && s[i] != s[j]) j = memo[j-1];
if (s[i] == s[j]) j++;
memo[i] = j;
}
/* Here's another version of the algorithm that computes
exactly the same thing. */
/*
for (int i=1; i0 && s[i] != s[memo[j-1]]) j = memo[j-1];
memo[i] = (j>0) ? memo[j-1]+1 : 0;
}
*/
char[] t = "testestest hello there test!".toCharArray();
int m = t.length;
j=0;
for (int i=0; i0 && t[i] != s[j]) j = memo[j-1];
if (t[i] == s[j]) j++;
if (j==n) {
System.out.println("match found at "+(i-j+1));
j = memo[j-1];
}
}
}
}
The first phase of the algorithm computes the memo[] array.
This is a preprocessing step on the pattern, and does not
depend on the text string at all.
The second phase uses the memo array to find all the matches
of the pattern in the text.
The code is tricky to understand given how short it is.
The invariant that holds at the start of the loop is the following:
memo[k] for all k