15-451 Algorithms                                    Fall 2012
D. Sleator

                Suffix Trees and Suffix Arrays
                      December 4, 2012

References:
  Algorithms on Strings, Trees and Sequences by Dan Gusfield
  http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L10/lecture10.pdf
  http://courses.csail.mit.edu/6.897/spring03/scribe_notes/L11/lecture11.pdf

Outline:
  Suffix Trees
    definition
    properties (i.e. O(n) space)
    applications
  Suffix Arrays
    definition
    how to compute a suffix array (and prefix length array) in O(n log^2 n) time
    how to convert this into a suffix tree in O(n) time

----------------------------------------------------------------------------

Suffix Trees
------------

Consider a (long) string s of length n. Our goal is to preprocess s to
allow various kinds of queries on the string to be done efficiently. The
most basic example is simply this: given a pattern p, find all
occurrences of p in s. The time should be O(|p| + k), where k is the
number of occurrences of p in s.

An ideal solution to this problem will take O(n) time to do the
preprocessing, and O(n) space to store the data structure. Suffix trees
are a solution to this problem with all of these ideal properties. They
can be used to solve many other problems as well.

[In this lecture, we're going to consider the alphabet size to be O(1).]

Tries
-----

A trie is a data structure for storing a set of strings. Each edge of
the tree is labeled with a character of the alphabet, so each node
implicitly represents a certain string of characters. Specifically, a
node N represents the string of letters on the edges that we follow to
get from the root to N. Each node has a bit in it that indicates whether
the path from the root to this node is a member of the set. Since our
alphabet is small, we can use an array of pointers at each node to point
at its subtrees.
So to determine if a pattern p occurs in our set, we simply traverse
down from the root of the tree one character at a time until either
(1) we walk off the bottom of the tree, in which case p does not occur,
or (2) we stop at some node M. If M is marked, then p is in our set;
otherwise it is not. This process takes O(|p|) time, because each step
simply looks up the next character of p in the array of child pointers
of the current node.

Note that if we were to keep at each node a count of the number of
marked nodes in the subtree rooted there, we could then efficiently
determine, for a pattern p, how many members of our set of strings
begin with the characters of p.

Now, returning to suffix trees
------------------------------

Our first attempt to build a data structure that solves this problem is
to build a trie which stores all the strings which are suffixes of the
given string s. It's going to be useful to avoid having one suffix
match the beginning of another suffix. So in order to avoid this we
affix a special character, denoted "$", to the end of the string s,
which occurs nowhere else in s. (This character is lexicographically
less than any other character.)

Suppose we build a trie as described above using all the suffixes of s,
and we add the counts as described above to the trie. Now, given a
pattern p, we can count the number of occurrences of p in s in O(|p|)
time: we just walk down the trie, and when we run out of p, the count
of the node we're sitting on is our answer.

But there are a number of problems with this solution. First of all,
the space to store this data structure could be as large as O(n^2), and
it will also take too long to build. It's also unsatisfactory in that
it does not tell us where in s these patterns occur.

Because no suffix occurs as a prefix of any other, we can divide the
nodes of our trie into internal nodes and leaf nodes. The leaf nodes
have no children, and each represents a suffix of s.
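The trie-with-counts scheme described above can be sketched as follows. This is only an illustration, not code from these notes: the names TrieNode, insert, and countWithPrefix are mine, and it assumes an ASCII alphabet so that a plain array of child pointers suffices.

```java
import java.util.*;

class TrieNode {
    TrieNode[] child = new TrieNode[128];  // one slot per ASCII character
    int count = 0;          // number of marked nodes in this subtree
    boolean member = false; // is the root-to-here string in the set?

    // Insert the string w into the trie rooted here, updating counts.
    void insert(String w) {
        TrieNode v = this;
        v.count++;
        for (char c : w.toCharArray()) {
            if (v.child[c] == null) v.child[c] = new TrieNode();
            v = v.child[c];
            v.count++;          // one more member passes through this node
        }
        v.member = true;
    }

    // How many members of the set begin with the characters of p?
    // O(|p|) time: one array lookup per character of p.
    int countWithPrefix(String p) {
        TrieNode v = this;
        for (char c : p.toCharArray()) {
            if (v.child[c] == null) return 0;   // walked off the tree
            v = v.child[c];
        }
        return v.count;
    }
}
```

For example, inserting all suffixes of "banana$" and querying the prefix "an" counts the two suffixes "ana$" and "anana$".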
So we can have each leaf node point to the place in s where its suffix
begins.

We can also get the space consumption down to O(n). Suppose the trie
contains a long path in which every node has branching factor 1. The
string of characters along that path must occur in s, so we can
represent it implicitly by a pair of pointers into the string s. So an
edge is now labeled with a pair of indices into s instead of just a
single character. (Figure)

This representation uses O(n) space. (We count pointers as O(1) space.)
One nice way to see this is to imagine building this data structure by
adding the suffixes into it one at a time. To add a new suffix, we walk
down the current tree until we come to a place where the path leads off
of the current tree. (This must occur, because the suffix is not
already in the tree.) This can happen in the middle of an edge, or at
an already existing node. In the former case, we split the edge in two
and add a new node with branching factor 2 in the middle of it. In the
latter case, we simply add a new edge from an already existing node.
Either way, each suffix adds O(1) nodes, so the number of nodes in the
tree is O(n). (Figure)

The running time of this naive construction algorithm is still O(n^2).
We'll talk more later about how to make this more efficient.

Other Applications of Suffix Trees
----------------------------------

There are many other applications of suffix trees to practical problems
on strings. Gusfield discusses many of these in his book. I'll just
mention a couple here.

1. Longest Common Substring of Two Strings
------------------------------------------

Given two strings a and b, what is the longest substring that occurs in
both of them? For example, if a="boogie" and b="ogre" then the answer
is "og". The question is how to compute this efficiently, and the
answer is to use suffix trees. Here's how. Construct a new string
s = a%b.
That is, concatenate a and b with an intervening special character that
occurs nowhere else (indicated here by "%"). Now construct the suffix
tree for s. Every leaf of the suffix tree represents a suffix that
begins in a or in b. Mark every internal node with two bits: one
indicating that its subtree contains a suffix beginning in a, the other
that it contains a suffix beginning in b. These marks can be computed
by depth first search (linear time). Now take the deepest node (in the
sense of the longest string path from the root) that has both marks.
The string on the path to this node is the longest common substring.

Prior to suffix trees, Knuth believed that this problem could not be
solved in linear time.

2. Finding all Maximal Palindromes
----------------------------------

A maximal palindrome in a string s is a palindrome which cannot be
extended by grabbing another character of s on each end. Gusfield
explains on pages 197 and 198 how to find all of these in O(n) time.

Computing the Suffix Tree
-------------------------

I'll explain how to compute the suffix tree from two other constructs
of the string s: the suffix array and the common prefix lengths array.

Imagine that you write down all the suffixes of a string s, where the
ith suffix is the one that begins at position i. Now imagine that you
sort all of these suffixes, and you write down their indices in an
array in sorted order. This is the suffix array.

Example: s=banana$

      0 1 2 3 4 5 6
      b a n a n a $

      6: $
      5: a$
      3: ana$
      1: anana$
      0: banana$
      4: na$
      2: nana$

So the suffix array is: 6 5 3 1 0 4 2

Each successive suffix in this order matches the previous one in some
number of initial letters. These numbers form the common prefix lengths
array. In this case we have:

      suffix array:                 6 5 3 1 0 4 2
      common prefix lengths array:    0 1 3 0 0 2

Given these two things, the suffix tree can be computed in linear time.
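As a check on the banana$ example above, here is a straightforward (quadratic) computation of both arrays by explicitly sorting the suffixes. This is only an illustrative sketch (the class and method names are mine), not the efficient method.

```java
import java.util.*;

class NaiveSuffixArray {
    // Sort the suffix indices of s by comparing the suffixes directly.
    static int[] suffixArray(String s) {
        int n = s.length();
        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, (x, y) -> s.substring(x).compareTo(s.substring(y)));
        int[] out = new int[n];
        for (int i = 0; i < n; i++) out[i] = sa[i];
        return out;
    }

    // For each adjacent pair in suffix array order, count the number of
    // initial letters in which the two suffixes match.
    static int[] commonPrefixLengths(String s, int[] sa) {
        int[] lcp = new int[sa.length - 1];
        for (int i = 0; i + 1 < sa.length; i++) {
            int x = sa[i], y = sa[i + 1], k = 0;
            while (x + k < s.length() && y + k < s.length()
                   && s.charAt(x + k) == s.charAt(y + k)) k++;
            lcp[i] = k;
        }
        return lcp;
    }
}
```

On "banana$" this produces the suffix array 6 5 3 1 0 4 2 and the common prefix lengths array 0 1 3 0 0 2, matching the example.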
We add the suffixes one at a time into a partially built suffix tree,
in the order in which they appear in the suffix array. At any point in
time we keep the sequence of nodes on the path from the most recently
added leaf to the root. To add the next suffix, we find where its path
deviates from the current one. To do this we use the common prefix
length value: we walk up the path until we pass this prefix length.
This tells us where to add the new node. The time to build the suffix
tree in this fashion is the same as the time it takes to traverse the
suffix tree from left to right. That is, it's linear time.

So how do we compute the suffix array and the common prefix lengths
array? There are linear time algorithms for this, but here I will
describe a probabilistic method that runs in O(n log^2 n) time. It's
based on Karp-Rabin fingerprinting.

If we could compare two suffixes in O(1) time, we could just sort them
in O(n log n) time. Instead we use a method for comparing two suffixes
that works in O(log n) time. Using Karp-Rabin fingerprinting we can
compare two substrings for equality in O(1) time (as I explained in the
last lecture). To compare two suffixes for lexicographic order, we use
binary search to find the shortest length R such that the first R
characters of the two suffixes differ, but their first R-1 characters
are the same. Then the lexicographic order is determined by their Rth
characters. Furthermore, this also tells us the common prefix length of
the two suffixes.

Here's a java implementation of this technique.

/* O(n log^2(n)) algorithm to compute the suffix array of a string
   based on Karp-Rabin fingerprinting.

   D. Sleator   Dec 4, 2012
*/
import java.io.*;
import java.util.*;

public class Suffix_Array {
    static final long P = 1000000007;
    static long[] p;  // p[i] = P^i modulo 2^64
    static long[] a;  // a[i] = s[i-1]*p[0] + s[i-2]*p[1] + ... + s[0]*p[i-1]
    static char[] s;
    static int n;

    static long hh(int x, int y) {
        /* Assumes x<=y. Let k = y-x. This function returns
         *    s[x]*p[k] + s[x+1]*p[k-1] + ... + s[y]*p[0].
         * In other words, it's the hash function from x to y inclusive. */
        return a[y+1] - a[x]*p[y-x+1];
    }

    static int pre_len; /* a side effect of comp, which is the common
                           prefix length of the two strings just compared */

    static int comp(int x, int y) {
        /* Compare the two strings which are the suffix of s beginning
         * at x and beginning at y. Return <0, 0, or >0 depending on
         * the outcome. (Actually in this context they can't be equal.) */
        int R = Math.min(n-x-1, n-y-1);
        int L = 0;
        /* binary search for the common prefix length L, maintaining
           the invariant that the first L characters of the two
           suffixes have equal fingerprints */
        while (L < R) {
            int M = (L+R+1)/2;
            if (hh(x, x+M-1) == hh(y, y+M-1)) L = M; else R = M-1;
        }
        pre_len = L;
        return s[x+L] - s[y+L];
    }

    public static void main(String[] args) {
        Scanner sc = new Scanner(System.in);
        String str = sc.next() + "$";
        n = str.length();
        s = str.toCharArray();

        p = new long[n+1];
        a = new long[n+1];
        p[0] = 1;
        a[0] = 0;
        for (int i = 1; i <= n; i++) {
            p[i] = P * p[i-1];
            a[i] = s[i-1] + P * a[i-1];
        }

        Integer[] sa = new Integer[n];
        for (int i = 0; i < n; i++) sa[i] = i;
        Arrays.sort(sa, new Comparator<Integer>() {
            public int compare(Integer A, Integer B) { return comp(A, B); }
        });

        int[] prefix = new int[n-1];
        for (int i = 0; i < n-1; i++) {
            comp(sa[i], sa[i+1]);
            prefix[i] = pre_len;
        }

        for (int i = 0; i < n; i++) System.out.print(sa[i] + " ");
        System.out.println();
        for (int i = 0; i < n-1; i++) System.out.print(prefix[i] + " ");
        System.out.println();
    }
}
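The linear-time conversion described earlier (add the suffixes in suffix array order, walking up the path from the most recent leaf until we pass the common prefix length) can be sketched as follows. This is an illustration, not code from the notes: the names Node, build, and countNodes are mine, and nodes store string depths in place of the edge index pairs.

```java
import java.util.*;

class SuffixTreeFromSA {
    static class Node {
        int depth;                  // length of the string on the root-to-node path
        int suffix = -1;            // starting index in s, for leaves
        List<Node> children = new ArrayList<>();
        Node(int d) { depth = d; }
    }

    // n = length of s (including $), sa = suffix array,
    // lcp = common prefix lengths array (length n-1).
    static Node build(int n, int[] sa, int[] lcp) {
        Node root = new Node(0);
        Deque<Node> path = new ArrayDeque<>();   // path from newest leaf to root
        path.push(root);
        for (int k = 0; k < n; k++) {
            int d = (k == 0) ? 0 : lcp[k-1];     // walk up past this prefix length
            Node child = null;
            while (path.peek().depth > d) child = path.pop();
            Node top = path.peek();
            if (top.depth < d) {                 // deviation in the middle of an edge:
                Node mid = new Node(d);          // split it with a new internal node
                top.children.remove(child);
                top.children.add(mid);
                mid.children.add(child);
                path.push(mid);
                top = mid;
            }
            Node leaf = new Node(n - sa[k]);     // leaf depth = length of the suffix
            leaf.suffix = sa[k];
            top.children.add(leaf);
            path.push(leaf);
        }
        return root;
    }

    static int countNodes(Node v) {
        int c = 1;
        for (Node w : v.children) c += countNodes(w);
        return c;
    }
}
```

For s=banana$ with the two arrays computed above, this yields the 11-node suffix tree: the root, internal nodes for "a", "ana", and "na", and seven leaves. Each iteration does O(1) work plus the pops, and every node is pushed and popped at most once, so the whole construction is linear time.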