11/25 lecture - KMP string matching - Quiz Naive string matching: O(mn) for (i=0; text[i]; i++) { for (j=0; text[i+j]==pattern[j]; j++); if (pattern[j] == 0) match-found; } 0 1 2 3 4 5 6 7 8 9 10 11 T: x y a b a b a c z b a c i=0: X i=1: X i=2: a b a X i=3: X i=4: a b a c i=5: X i=6: a X i=7: X i=8: X i=9: a X i=10: X But we get alot of information from mismatches. For instance, searching for "abcabcacab" a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | We know that last four chars have been abcx, x != a therefore can shift pattern by 4 characters. a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | We can shift three places since the largest prefix of the pattern that matches a suffix of the recenlty seen input is "abc" a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | Now try and match again. and fail and can shift four places a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | now advance text pointer til mismatch a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | Then shift three places a b c a b c a c a b b a b c b a b c a b c a a b c a b c a b c a c a b c | voila. Key is to have the auxilary table which tells us how much to shift. Lets look at a naive FA to do string matching in linear time. abac: +------------------------------+ | a | | b | | +----------------+ | | | a \|/ b \|/ a | c --------> empty ---------> "a" --------> "ab" --------> "aba" --------> "abac" ^ | | | | other | | | +---------------+-------------+----------------+ self loop of a From state+char we goto state which is longest suffix of state+char E.g., aba+b -> ab since suffix ab matches ab What KMP did is come up with a way to build a FA in linear time using linear space (above construction can use O(m^2) space and can take O(m^3) time to build. Lets abandon the FA forumlation and look at KMP. j=k=0; while (j=0 && text[k] != pattern[j]) j = next[j]; k=k+1; j = j+1; } if (j==m ) match found This constructs an FA where j is a state pointer and it either moves to the right or is reset based on next[j]. For abac: "a" "b" "a" "c" start ----> j=0 -----> j=1 ------> j=2 -----> j=3 ------> j=4 j=4 -- any -> j=0 j=3 -- mis -> j=1 j=2 -- mis -> j=0 j=1 -- mis -> j=0 either match or mismatch. Which state comes from the next array. Key is to build next. Lets say searching: x y a b a b a c z x b a c j=0 mis k=0 j=0 mis k=1 j=0 match a k=2 j=1 match b k=3 j=2 match a k=4 j=3 mis k=5 j=1 match b k=5 j=2 match a k=6 j=3 match c k=7 j=4 we can use this idea to make the code more efficient. j = 0; for (i = 0; i < n; i++) for (;;) { // loop until break if (T[i] == P[j]) { // matches? j++; // yes, move on to next state if (j == m) { // maybe that was the last state found a match; j = next[j]; } break; } else if (j == 0) break; // no match in state j=0, give up else j = next[j]; // try shorter partial match } How to compute next? We want it to be a table which contains the largest amount to shift the pattern such that the suffix of what we have just seen is a prefix of the pattern. We need to compute this for every prefix of the pattern. We will do the actual computation in recitation on monday.