3/10                    15-862 Randomized Algorithms

 * Analysis of local optimization algorithms
   (work in [Aldous-Vazirani] and [Dimitriou-Impagliazzo];
    see FOCS '94, pages 492-501, and STOC '96, pages 304-313)
===========================================================================

In practice, if you're given an instance of a hard problem to solve,
you'll typically use some sort of local optimization algorithm.  In
general, doing this involves 3 steps:

1) Setting up some solution space, and attaching a "goodness" or "value"
   to each solution.  Typically the size of this space is exponential in
   the size of the instance.

2) Defining some notion of "neighbor" among the solutions in the space.
   (I.e., we now have a graph with nodes as solutions; of course, this
   graph is only defined implicitly, since it has exponentially many
   nodes.)

3) Doing local search to try to find the best solution in the space.

For instance:

* MAX-SAT: Here (1) could be the space of all assignments, where the
  value of an assignment is just the number of clauses it satisfies.  We
  can do (2) by saying that we're allowed to move from one assignment to
  another by flipping a bit (i.e., 2 assignments are neighbors if they
  differ in one bit position).

* TSP: Here a natural solution space is the set of all tours, with value
  equal to the length of the tour (so value is now "badness").  Then, one
  notion of neighbor might be that you're allowed to take two directed
  edges a-->b and c-->d in the tour and replace them by a-->c and b-->d.
  (This is sometimes called the 2-OPT rule.)

* In machine learning, (1) corresponds to the "hypothesis space" -- the
  set of legal hypotheses under consideration.  Most learning algorithms
  correspond to some sort of natural neighborhood structure on this space
  and some natural local search algorithm.

Often much of the effort is in areas (1) and (2).  But we're going to
ignore that and instead focus on (3): what can we say about the behavior
of different local search algorithms, based on combinatorial properties
of the graph structure we're searching in?

Let's make a few simplifying assumptions:

(1) Values of solutions are integers in some range [0,d], where d = poly(n).
(2) A solution has at most poly(n) neighbors.
(3) If a solution has value i, then its neighbors have value either i-1
    or i+1.

(E.g., MAX-SAT satisfies (1) and (2) but not (3); assumption (3) is really
just to simplify matters.)

So, we can think of the solution space as a layered graph, and our goal is
to get to the deepest layer.

--> Interesting special case: a tree, and how one might use a tree to
    model the general structure of a search graph.

Here are three interesting classes of algorithms:

(1) Hill-climbing: begin at the top.  While not at a leaf, go to a random
    child.  When you get to a leaf, halt (and then start over at the top).

(2) Simulated-annealing style: like (1), but you also allow yourself to
    move up with some probability.  Or DFS: do hill-climbing, but back up
    to the previous decision point when you get stuck.

(3) Survival-of-the-fittest: run many particles in parallel.  In
    particular, let's define GW1 (a code sketch is given below):

    - Begin with B particles at the start.
    - If all particles are at leaves, then halt.  Otherwise, move each
      particle at a leaf to the location of a random non-leaf particle.
    - Move each particle to a random child, and repeat.
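For concreteness, here is a minimal sketch of GW1 (a "go with the winners"
style rule, in the spirit of [Aldous-Vazirani]) in Python.  It assumes the
tree is given implicitly by a caller-supplied children(node) function that
returns the (possibly empty) list of children; the names and interface are
just illustrative, not taken from the papers.

    import random

    def gw1(root, children, B):
        """Sketch of GW1: B particles walk down a tree.
        children(node) returns the node's children (empty list at a leaf)."""
        particles = [root] * B                      # all B particles start at the root
        while True:
            non_leaves = [p for p in particles if children(p)]
            if not non_leaves:                      # every particle is at a leaf: halt
                return particles
            # Regroup: each particle sitting at a leaf jumps to the location
            # of a randomly chosen non-leaf particle.
            particles = [p if children(p) else random.choice(non_leaves)
                         for p in particles]
            # Then every particle moves to a random child of its current node.
            particles = [random.choice(children(p)) for p in particles]

    # Tiny illustrative tree (a made-up example, not from the notes):
    tree = {'r': ['a', 'b'], 'a': ['c', 'd'], 'b': [], 'c': [], 'd': []}
    final_positions = gw1('r', lambda v: tree[v], B=16)

Note that "failure" for GW1 just means ending up with all particles at
shallow leaves; the regrouping step is what lets particles recover from
dead ends as long as at least one particle is still on a live branch.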
For example, what about this kind of tree:

            o
           / \
          o   o
         / \
        o   o
       / \
      o   o
     / \
    o   o
   / \
  o   o

In this case, GW1 only fails if at some branch point all particles go in
the wrong direction (so long as this doesn't happen, the particles will
regroup in the second step of the algorithm).  So, the probability of
failure is at most d*(1/2)^B, and we succeed w.h.p. for B >> log(d).

Here's a generalization:

      o
     / \
    T   o
        |
        o
       / \
      o   T
      |
      o
     / \
    T   o
        |
       ...

where "T" is a complete binary tree of some depth d' (say d' = sqrt(d)),
and "|" represents a path of length d'+1.  In this case, the same analysis
holds for GW1, but now DFS, simulated annealing, and hill-climbing all
fail.

Another interesting case is a branching process: given a node, flip a
coin; with prob gamma < 1/2 make it a leaf, and with prob 1-gamma give it
two children.  Then cut this off at depth d.  (If you look at the number
of coins still to flip, this is like a random walk starting at 1, where
each coin flip moves you 1 step to the left with prob gamma and 1 step to
the right with prob 1-gamma.)

--> In steady state there will be at most O(log B) particles per node, and
    at least B/c nodes occupied, for some constant c.

Will GW1 work for all trees having at most P = poly(d) nodes per level?
No.  E.g.,

o
|\
o o
| |\
o o o
| | |\
o o o o
|
o

If the depth is >> log(B), then w.h.p. we'll end up with no particles to
the right.  How can we modify the algorithm?

Idea: redistribute particles according to their degree.  Equivalently,
look at the union of the children of the currently occupied nodes and
distribute the B particles uniformly among those children (a code sketch
of this rule is given below).  This rule keeps particles evenly
distributed among the nodes at any given level.  (Note: we now fail on the
previous class of trees, though.)

Generalization: what if we have a search graph where we can group the
solutions in the search space into "supernodes", such that all solutions
in a given supernode have the same value, and furthermore:

(1) Each supernode C at level i has |C| > n_i/poly(n), where n_i is the
    number of solutions of value i.
(2) The bipartite graph between any supernode C at level i and the union
    of its children at level i+1 is a good expander.
(3) n_{i-1}/n_i < poly(n).

(By (2) we mean that for any set S of solutions at value i+1,
|N(N(S)) - S| > const*|S| for |S| < n_{i+1}/2.)

Then:

* If all nodes have the same up-degree, we can use the same tree
  algorithm, except that after we move the B particles to the children,
  we then do a random walk on the bipartite graph between the level-i and
  level-(i+1) nodes.  This will randomize the positions of the B particles
  among the children nodes.  The reason we want a random sample is that it
  ensures each child supernode receives a number of particles in
  proportion to its size.

* If the nodes DON'T all have the same up-degree, then we are
  over-counting those child nodes of higher degree.  We can compensate by,
  at the end, redistributing our B particles with probability inversely
  proportional to the up-degrees of the nodes.

Everything will work fine so long as B is sufficiently larger than the
maximum degree of any node.
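Here, in the same hedged spirit, is a sketch of the modified rule from the
"Idea" above for the tree case: instead of each particle independently
following a random child, we pool the children of all currently occupied
nodes and spread the B particles uniformly over that pool (equivalently,
an occupied node's children collectively receive particles in proportion
to its degree).  Again the children(node) interface and the function name
are just for illustration.

    import random

    def gw_uniform_children(root, children, B):
        """Sketch of the modified rule: at each level, spread the B particles
        uniformly over the union of the children of the currently occupied
        nodes, so occupancy stays roughly even across each level."""
        occupied = {root}                          # nodes currently holding particles
        while True:
            # Union of children of occupied nodes (in a tree each child has a
            # unique parent, so this list has no duplicates).
            pool = [c for v in occupied for c in children(v)]
            if not pool:                           # every occupied node is a leaf: halt
                return occupied
            # Place each particle on a uniformly random node of the pool; a child
            # of a high-degree node is no likelier to be chosen than any other.
            occupied = {random.choice(pool) for _ in range(B)}

Roughly speaking, this evens out occupancy across each level (so the
"rightmost spine" example above is no longer a problem), but, as the notes
observe, it can now fail on the earlier trees with the large traps T,
where the one good node at a level is only a small fraction of the pool.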