Recitation 12/1/97 ------------------ - hand out quiz 3. - The KMP string matching algorithm - running time - constructing the next array KMP algorithm: kmp(pattern, text) { next = compute_next(pattern); j=k=0; m=length of pattern n=length of text while (j=0 && text[k] != pattern[j]) j = next[j]; k=k+1; j = j+1; } if (j==m) match-found } Running time is O(n+m). O(m) to compute_next and O(n) to match. Although there are two nested loops the time bound for the matching part is O(n). The inner while loop only decreases j since by construction -1 <= next[j] < j. So the inner loop can only run when j>=0. However, j increases only when k increases. Thus, the inner loop can only run k times, since that is the number of times that j is increased. (A more formal treatment is beyond the scope of this class. It requires a potential function (the value of j) which can go up by at most 1 between the inner while and the j=j+1 line. Further, it is also >=0 at the j=j+1 line, so the inner loop is O(1) while the outer loop is O(n).) Lets review what the next function tells us and how KMP works and then compute the next function. At some point in the comparison of the pattern (ABABACA) to a text we find ourselves in the following position: k 0 1 2 3 4 5 6 7 8 9 Text String: B A C B A B A B A B A C A B A B | | | | | ? Pattern: A B A B A C A Next: (a) 0: -1 (b) 1: 0 (a) 2: -1 (b) 3: 0 (a) 4: -1 (c) 5: 3 (a) 6: -1 The first 5 characters have matched (j=5) and the 6th doesn't match. By keeping track of how many characters match we can determine the corresponding characters in the text. This means we can determine immediately whether it pays to shift the pattern over one character and check again. In this case does it pay? (no). In fact, we can determine immediately how many characters to shift the pattern over and start checking again. In the above, text[9] != pattern[5], so the body of the while loop executes and j=3. In other words, a shift of two over has been achieved. In fact, text[9] == pattern[3] and the loop terminates and progress is made in the text pointer, k. Why do we know that if we have matched 5 characters of the pattern that the next match can't possibly occur unless we shift the pattern over by 2? If we have matched j characters of the pattern and then have a mismatched character, then we know that the text string contains the first j characters of the pattern followed by some character != pattern[j+1]. E.g., the text string above contains . . A B A B A . . Now if we look at the part of the pattern that is matched we know that there is no point in shifting the pattern only 1 space to the right because the pattern doesn't start with a B. In fact, we want to shift the pattern over such that the longest prefix of the pattern is also a suffix of the the first j characters of the pattern. In this case A B A B A A B A We shift 2, because ABA is the longest prefix of ABABACA that is also a suffix of ABABA. So, (j-next[j]) tells us how much to shift the pattern so that the suffix of what has already matched is a prefix of the pattern, and thus is known to match the pattern. Since we know it matches the pattern we skip over those characters in the text. Below is the code that computes next. It uses the same algorithm as the matcher using previously computed values of next[] to calculate the future values of next, i.e., it is dynmaic programming. The key idea is that next[] always contains the longest prefix of the pattern that is a proper suffix of the portion of the pattern already matched. int* compute_next(char* pattern) { int m = strlen(pattern); int* next = new int[m]; next[0] = -1; int j=0; int k=-1; while (j=0) && (pattern[j] != pattern[k])) k = next[k]; k++; j++; if (pattern[j] == pattern[k]) next[j]=next[k]; else next[j] = k; } return next; }