15-451 Algorithms                                                  10/24/00

* Union-find
    - list-based method
    - forest-based method

Midterm results: median 61/75, middle half 52/75 to 68/75.
=============================================================================

Today's topic is a data structure for the union-find problem. Here's the
problem: we want to maintain a collection of disjoint sets with the
following operations:

  MakeSet(x)     - creates a set containing just x. Returns ptr p_x.
  Union(p_x,p_y) - unions together the set x is in with the set y is in
                   (replacing those two sets with their union).
  Find(p_x)      - returns the unique ID for the set containing x (this is
                   just some representative element of this set).

[From now on, we'll just say "x" instead of "p_x", etc.]

E.g., Kruskal's algorithm:
  - Do MakeSet(v) for each node v.
  - For each edge (x,y), in order from shortest to longest:
      if Find(x) != Find(y), put (x,y) into the tree and do Union(x,y).

Notation: n = # of MakeSet operations (so at most n-1 unions),
          m = # of total operations.
This matches n = # of vertices, m = # of edges pretty well, since we do 2
finds for each edge.

Algorithm 1 (list-based): total cost will be O(m + n log n)
-----------------------------------------------------------
Each set will be just a linked list: each element has a pointer to the next
element in the list. But, augment this so that each element also has a
pointer directly to the head of the list. The head of the list is the
representative element.

MakeSet(x) - constant time: just create a list of 1 element and return a ptr.

Find(x) - also constant time: just follow the pointer to the head of the list.

Union(x,y) - to union the two lists (let's call them A and B, with lengths
L_A and L_B), we append B onto the end of A. What is the time needed?
Answer: we first have to walk down A to get to the bottom element. Then we
hook the head of B up to this. Then we walk down B updating the pointers to
the head. So the total time is O(L_A + L_B).

Q: Can we get just O(L_B)?
Method 1: splice B into the middle of A, at x.
Method 2: if we store a pointer to the tail in the head, this also avoids
walking down A.

Q: Can we reduce this to O(min(L_A, L_B))?
A: Yes. Just store the length of the list in the head. Then compare, append
the shorter one onto the end of the longer one, and update the length count.

Claim: this gives us the O(m + n log n) total running time.

Proof: The only part we need to worry about is the "walking down the
shorter list" part of the Union operation. All the rest is constant time
per op, so it goes into the O(m). So, let's look at this from the point of
view of some lowly element x sitting in a list somewhere. Over time, how
many times does x get stepped on as we walk down the list and update
pointers? Answer: at most O(log n) times. Reason: every time x gets stepped
on, the size of the list he is in (at least) doubles, since his list was
the shorter of the two. So, if we look at the total time for m operations,
we spend O(m) in the constant-time portion, and then we can charge off the
walking-down portion to the elements walked over. There are n elements, and
for each one the total cost is at most log(n). So, overall, the total is
O(m + n log n).

...This is a neat proof I think. A good time for questions...

Notice that in Kruskal's algorithm, it's the sorting time that dominates,
since that was O(m log m).

Algorithm 2: total cost will be O(m lg^*(n))
--------------------------------------------
-> lg^*(n) is the number of times you need to take the lg until you get
   down to 1. So,
     lg^*(2) = 1
     lg^*(2^2 = 4) = 2
     lg^*(2^2^2 = 16) = 3
     lg^*(2^2^2^2 = 65536) = 4
     lg^*(2^65536) = 5
-> So, for any practical input size, lg^*(n) is never bigger than 5.

Actually, the running time of this algorithm is even better: O(m alpha(m,n)),
where alpha is the inverse-Ackermann function I mentioned last time. This
grows even more slowly than lg^*. We'll prove the lg^*(n) bound because
it's a little easier, but it's still pretty tricky.

Algorithm:
- We'll represent each set by a tree with pointers going *up* the tree.
  The root is the "representative element".
- MakeSet(x) takes constant time: just set x->parent = x.

- To do a Find(x), we walk up the tree to the root, so the time taken is
  proportional to the depth of x. To speed things up, we'd like the trees
  to be bushy. To make the trees bushier, we'll do what's called PATH
  COMPRESSION: when we do a Find(x), we update all the nodes we pass over
  to point to the root. We can do this nicely by making Find(x) a
  recursive procedure:

    If x is the root, return x.
    Else, set x->parent = Find(x->parent) and return x->parent.

- To do Union(x,y), we do a Find on each one, and then make one of the
  roots point to the other. The second speedup is called UNION-BY-RANK:
  the tree whose root has the lower rank is made the "child tree".
  "Rank" is defined by:
    - When you do a MakeSet(x), x gets rank 0.
    - When you union a tree whose root has rank r1 with a tree whose root
      has rank r2, then if r1 == r2 you add 1 to the rank of the new root;
      otherwise you don't.
  (This speedup is kind of like our list-based algorithm.)

Note: properties of ranks
1. If x isn't a root, then rank(x) is strictly less than rank(x->parent).
2. rank(x) can only change while x is a root. Once x becomes a non-root,
   its rank never changes again. [Note: once you become a non-root, you're
   never a root again.]
3. There are at most n/2^r nodes of rank r. The reason is that when you
   create a (root) node of rank r, the tree it's in must have at least 2^r
   nodes in it (can see by induction). The nodes in that tree (except for
   the root) must all have rank < r. So, if we imagine, for each node of
   rank r, circling all the nodes "responsible" for it getting its rank
   (i.e., the ones in its tree at the time), then these circled sets are
   all disjoint. Since there are only n nodes total, this means there are
   at most n/2^r nodes of rank r.

Analysis:
* First of all, a Union does 2 Find operations plus a constant amount of
  extra work. So, we only need to worry about the time for the (at most
  2m) Find operations.

* When we do a Find(x), if x was a root then we pay $1, so that's ok.
  If x was a child of a root, we pay $2, so that's ok too. If x is lower,
  then the very rough idea is that (except for the last $2) every $ we
  spend is shrinking the tree because of our path compression, so we'll be
  able to amortize it somehow.

So, for the remaining part of the proof, we're only going to worry about
the steps taken in a Find(x) operation up until we reach the child of the
root, since the remainder is just constant time per operation. We'll
analyze using the ranks, and the properties we figured out before.

Step 1: Let's imagine putting ranks into buckets:

  bucket 0 has all of rank 0,
  bucket 1 has all of rank 1,
  bucket 2 has ranks 2 to 2^2 - 1,
  bucket 3 has ranks 2^2 to 2^2^2 - 1,
  bucket 4 has ranks 2^2^2 to 2^2^2^2 - 1, etc.

So there are about log^*(n) buckets. The point of this definition is that
the number of nodes whose rank lies in any given bucket is at most
2n/(upper bound of bucket + 1): summing the n/2^r bound from property 3
over the ranks r in the bucket gives a geometric series that is less than
twice its first term.

Step 2: When we walk up the tree in the Find(x) operation, we need to
charge the steps out to somebody. Here's the rule: if the step we take
moves us from a node u to a node v such that rank(v) is in the same bucket
as rank(u), then we charge it to the node u we stepped on (the step-ee).
But, if the rank of v is in a higher bucket, we charge that step to x (the
step-er).

The easy part of this is the charge to x. We can move up in buckets at
most log^*(n) times, since there are at most log^*(n) different buckets
ever. So the total cost of this kind (adding up over the m Find()
operations) is at most m log^*(n).

The harder part is the charge to the guy stepped on. The point here is
that, first of all, the guy u we charged isn't a root, so his rank isn't
ever going to change. Secondly, every time we charge him, the rank of his
new parent is at least 1 larger than the rank of his old parent. Finally,
once his parent has a rank that's in a higher bucket than his own, we
never charge him again for being stepped on. ... SO ... the maximum charge
that u gets is the range of his bucket.

But, we just said that a bucket with upper bound B-1 has at most 2n/B
elements in it, and the range of the bucket is <= B. So, the total charge
of this kind, adding up over all nodes u in that bucket, is at most 2n.
(Remember that the only elements being charged are non-roots, so once they
start getting charged, their rank is fixed; they can't jump to some other
bucket and start getting charged there too, or anything weird like that.)
So the *total* charge of this kind is at most 2n*(# buckets), which is
O(n log^*(n)).

QED (ta-da)
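To make Algorithm 1 concrete, here is a minimal Python sketch of the
list-based method with both speedups (a tail pointer stored in the head,
and appending the shorter list onto the longer one). The class and
function names are my own, not from the lecture:

```python
class ListNode:
    """One element of a set represented as a linked list."""
    def __init__(self, val):
        self.val = val
        self.head = self   # pointer to the representative (head of list)
        self.next = None   # next element in the list
        # The fields below are only kept up to date at the head:
        self.tail = self   # Method 2: avoids walking down A
        self.size = 1      # lets us pick the shorter list to append

def make_set(val):
    # Constant time: a list of one element, which is its own head.
    return ListNode(val)

def find(x):
    # Constant time: follow the pointer to the head of the list.
    return x.head

def union(x, y):
    a, b = find(x), find(y)
    if a is b:
        return
    if a.size < b.size:    # append the shorter list onto the longer one
        a, b = b, a
    a.tail.next = b        # hook B onto the end of A (no walk down A)
    a.tail = b.tail
    a.size += b.size
    node = b
    while node is not None:  # walk down B updating head pointers:
        node.head = a        # this is the part charged to B's elements
        node = node.next
```

For example, after `union(n0, n1)`, `union(n2, n3)`, and `union(n0, n2)`
on four fresh singletons, all four nodes report the same head, whose
`size` is 4.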
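The iterated-log definition used in the analysis is easy to sanity-check
in code; this small helper (my own naming) just counts how many times you
take the lg before the value drops to 1:

```python
import math

def lg_star(n):
    """Number of times lg must be applied to n until the result is <= 1."""
    count = 0
    while n > 1:
        n = math.log2(n)
        count += 1
    return count
```

This reproduces the table from the notes: lg_star(2) is 1, lg_star(16) is
3, and lg_star(65536) is 4.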
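Finally, here is a minimal Python sketch of Algorithm 2: a forest with
parent pointers going up, path compression in Find, and union by rank
(again, the names are my own, not the lecture's):

```python
class Node:
    """A forest node: parent pointer plus a rank."""
    def __init__(self):
        self.parent = self   # MakeSet: x->parent = x, so x is a root
        self.rank = 0        # MakeSet: rank starts at 0

def find(x):
    # PATH COMPRESSION, written as the recursive procedure from the
    # notes: every node passed over is re-pointed directly at the root.
    if x.parent is not x:
        x.parent = find(x.parent)
    return x.parent

def union(x, y):
    # UNION-BY-RANK: the root of lower rank becomes the child; on a
    # rank tie, the new root's rank goes up by 1.
    rx, ry = find(x), find(y)
    if rx is ry:
        return
    if rx.rank < ry.rank:
        rx, ry = ry, rx
    ry.parent = rx
    if rx.rank == ry.rank:
        rx.rank += 1
```

Note that after any `find(x)`, `x.parent` points directly at the root,
which is what keeps the trees bushy.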