15-451 Algorithms Fall 2012 D. Sleator String Matching Algorithms November 27, 2012 Knuth-Morris-Pratt (KMP) algorithm Karp-Rabin (fingerprinting) Algorithm -------------------------------------------------------------------------- There's an entire field dedicated to solving problems on strings. The book "Algorithms on Strings, Trees, and Sequences" by Dan Gusfield covers this field of research. (Not to mention the algorithms in this lecture.) Sample problems: Given a text string and a pattern, find all occurrences of the pattern in the text. (Classic text search) The above problem where the pattern can have "don't cares" in it. Given a string, find longest string that occurs twice in it. Compute the "edit" distance between two strings, where various editing operations are define, along cost. Given a set of strings, compute the cheapest tree that connects them all together (phylogeny tree) Compute the shortest "superstring" of a set of strings. That is the shortest string that contains all the given strings as a substring. We're only going to scratch the surface of this field and talk about a couple of these problems and algorithms for them. The Knuth-Morris-Pratt Algorithm ================================ This algorithm can solve the classic text search problem in linear time in the length of the text string. (It can also be used for a variety of other string searching problems.) You can view the algorithm as deterministic finite automaton. There are a line of m states. We're in the jth state after we've already matched a sequence of j characters in a particularly location. Now we compare the next two characters. If we get a match we move to the next state in the list (matching j+1 chars). If we get a mismatch, we can skip back to a previous state --- the furthest back one that we know we can go to based on the information that we have. In the practical implementation below the data structure that we build from preprocessing the pattern is all contained in an array of integers called memo[]. This implementation illustrates the use of the algorithm for finding all matches of a pattern s in a text t. public class kmp { public static void main(String[] args) { char[] s = "test".toCharArray(); int n = s.length; int[] memo = new int[n]; /* memo[i] will store the length of the longest prefix of s that matches the tail of s1...si */ int j=0; for (int i=1; i0 && s[i] != s[j]) j = memo[j-1]; if (s[i] == s[j]) j++; memo[i] = j; } /* Here's another version of the algorithm that computes exactly the same thing. */ /* for (int i=1; i0 && s[i] != s[memo[j-1]]) j = memo[j-1]; memo[i] = (j>0) ? memo[j-1]+1 : 0; } */ char[] t = "testestest hello there test!".toCharArray(); int m = t.length; j=0; for (int i=0; i0 && t[i] != s[j]) j = memo[j-1]; if (t[i] == s[j]) j++; if (j==n) { System.out.println("match found at "+(i-j+1)); j = memo[j-1]; } } } } The first phase of the algorithm computes the memo[] array. This is a preprocessing step on the pattern, and does not depend on the text string at all. The second phase uses the memo array to find all the matches of the pattern in the text. The code is tricky to understand given how short it is. The invariant that holds at the start of the loop is the following: memo[k] for all k