3/10                    15-862 Randomized Algorithms

 * Analysis of local optimization algorithms
   (work in [Aldous-Vazirani] and [Dimitriou-Impagliazzo];
    see FOCS '94, pages 492-501, and STOC '96, pages 304-313)
===========================================================================

In practice, if you're given an instance of a hard problem to solve,
you'll typically use some sort of local optimization algorithm.  In
general, doing this involves 3 steps:

1) Setting up some solution space, and attaching a "goodness" or "value"
   to each solution.  Typically the size of this space is exponential in
   the size of the instance.

2) Defining some notion of "neighbor" among the solutions in the space.
   (I.e., we now have a graph with nodes as solutions; of course, this
   graph is only defined implicitly, since it has exponentially many
   nodes.)

3) Doing local search to try to find the best solution in the space.

For instance:

* MAX-SAT: Here (1) could be the space of all assignments, where the
  value of an assignment is just the number of clauses it satisfies.  We
  can do (2) by saying that we're allowed to move from one assignment to
  another by flipping a bit (i.e., 2 assignments are neighbors if they
  differ in one bit position).

* TSP: Here a natural solution space is the set of all tours, with value
  equal to the length of the tour (so value is now "badness").  Then, one
  notion of neighbor might be that you're allowed to take two directed
  edges a-->b and c-->d in the tour and replace them by a-->c and b-->d.
  (This is sometimes called the 2-OPT rule.)

* In machine learning, (1) corresponds to the "hypothesis space" -- the
  set of legal hypotheses under consideration.  Most learning algorithms
  correspond to some sort of natural neighborhood structure on this space
  and some natural local search algorithm.

Often much of the effort is in areas (1) and (2).  But we're going to
ignore that and instead focus on (3): what can we say about the behavior
of different local search algorithms, based on combinatorial properties
of the graph structure we're searching in?

Let's make a few simplifying assumptions:

(1) Values of solutions are integers in some range [0,d], where d = poly(n).
(2) A solution has at most poly(n) neighbors.
(3) If a solution has value i, then its neighbors have value either i-1
    or i+1.

(E.g., MAX-SAT satisfies (1) and (2) but not (3); assumption (3) is really
just to simplify matters.)

So, we can think of the solution space as a layered graph, and our goal is
to get to the deepest layer.

--> Interesting special case: a tree, and how one might use a tree to
    model the general structure of a search graph.

Here are three interesting classes of algorithms:

(1) Hill-climbing: begin at the top.  While not at a leaf, go to a random
    child.  When you get to a leaf, halt (and then start over at the top).

(2) Simulated-annealing style: like (1), but you also allow yourself to
    move up with some probability.  Or DFS: do hill-climbing, but back up
    to the previous decision point when you get stuck.

(3) Survival-of-the-fittest: run many particles in parallel.  In
    particular, let's define GW1 (a code sketch is given below):

    - Begin with B particles at the start.
    - If all particles are at leaves, then halt.  Otherwise, move each
      particle at a leaf to the location of a random non-leaf particle.
    - Move each particle to a random child, and repeat.
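For concreteness, here is a minimal sketch of GW1 (a "go with the winners"
style rule, in the spirit of [Aldous-Vazirani]) in Python.  It assumes the
tree is given implicitly by a caller-supplied children(node) function that
returns the (possibly empty) list of children; the names and interface are
just illustrative, not taken from the papers.

    import random

    def gw1(root, children, B):
        """Sketch of GW1: B particles walk down a tree.
        children(node) returns the node's children (empty list at a leaf)."""
        particles = [root] * B                      # all B particles start at the root
        while True:
            non_leaves = [p for p in particles if children(p)]
            if not non_leaves:                      # every particle is at a leaf: halt
                return particles
            # Regroup: each particle sitting at a leaf jumps to the location
            # of a randomly chosen non-leaf particle.
            particles = [p if children(p) else random.choice(non_leaves)
                         for p in particles]
            # Then every particle moves to a random child of its current node.
            particles = [random.choice(children(p)) for p in particles]

    # Tiny illustrative tree (a made-up example, not from the notes):
    tree = {'r': ['a', 'b'], 'a': ['c', 'd'], 'b': [], 'c': [], 'd': []}
    final_positions = gw1('r', lambda v: tree[v], B=16)

Note that "failure" for GW1 just means ending up with all particles at
shallow leaves; the regrouping step is what lets particles recover from
dead ends as long as at least one particle is still on a live branch.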
For example, what about this kind of tree:

            o
           / \
          o   o
         / \
        o   o
       / \
      o   o
     / \
    o   o
   / \
  o   o

In this case, GW1 only fails if at some branch point all particles go in
the wrong direction (so long as this doesn't happen, the particles will
regroup in the second step of the algorithm).  So, the probability of
failure is at most d*(1/2)^B, and we succeed w.h.p. for B >> log(d).

Here's a generalization:

      o
     / \
    T   o
        |
        o
       / \
      o   T
      |
      o
     / \
    T   o
        |
       ...

where "T" is a complete binary tree of some depth d' (say d' = sqrt(d)),
and "|" represents a path of length d'+1.  In this case, the same analysis
holds for GW1, but now DFS, simulated annealing, and hill-climbing all
fail.

Another interesting case is a branching process: given a node, flip a
coin; with prob gamma < 1/2 make it a leaf, and with prob 1-gamma give it
two children.  Then cut this off at depth d.  (If you look at the number
of coins still to flip, this is like a random walk starting at 1, where
each coin flip moves you 1 step to the left with prob gamma and 1 step to
the right with prob 1-gamma.)

--> In steady state there will be at most O(log B) particles per node, and
    at least B/c nodes occupied, for some constant c.

Will GW1 work for all trees having at most P = poly(d) nodes per level?
No.  E.g.,

o
|\
o o
| |\
o o o
| | |\
o o o o
|
o

If the depth is >> log(B), then w.h.p. we'll end up with no particles to
the right.  How can we modify the algorithm?

Idea: redistribute particles according to their degree.  Equivalently,
look at the union of the children of the currently occupied nodes and
distribute the B particles uniformly among those children (a code sketch
of this rule is given below).  This rule keeps particles evenly
distributed among the nodes at any given level.  (Note: we now fail on the
previous class of trees, though.)

Generalization: what if we have a search graph where we can group the
solutions in the search space into "supernodes", such that all solutions
in a given supernode have the same value, and furthermore:

(1) Each supernode C at level i has |C| > n_i/poly(n), where n_i is the
    number of solutions of value i.
(2) The bipartite graph between any supernode C at level i and the union
    of its children at level i+1 is a good expander.
(3) n_{i-1}/n_i < poly(n).

(By (2) we mean that for any set S of solutions at value i+1,
|N(N(S)) - S| > const*|S| for |S| < n_{i+1}/2.)

Then:

* If all nodes have the same up-degree, we can use the same tree
  algorithm, except that after we move the B particles to the children,
  we then do a random walk on the bipartite graph between the level-i and
  level-(i+1) nodes.  This will randomize the positions of the B particles
  among the children nodes.  The reason we want a random sample is that it
  ensures each child supernode receives a number of particles in
  proportion to its size.

* If the nodes DON'T all have the same up-degree, then we are
  over-counting those child nodes of higher degree.  We can compensate by,
  at the end, redistributing our B particles with probability inversely
  proportional to the up-degrees of the nodes.

Everything will work fine so long as B is sufficiently larger than the
maximum degree of any node.
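Here, in the same hedged spirit, is a sketch of the modified rule from the
"Idea" above for the tree case: instead of each particle independently
following a random child, we pool the children of all currently occupied
nodes and spread the B particles uniformly over that pool (equivalently,
an occupied node's children collectively receive particles in proportion
to its degree).  Again the children(node) interface and the function name
are just for illustration.

    import random

    def gw_uniform_children(root, children, B):
        """Sketch of the modified rule: at each level, spread the B particles
        uniformly over the union of the children of the currently occupied
        nodes, so occupancy stays roughly even across each level."""
        occupied = {root}                          # nodes currently holding particles
        while True:
            # Union of children of occupied nodes (in a tree each child has a
            # unique parent, so this list has no duplicates).
            pool = [c for v in occupied for c in children(v)]
            if not pool:                           # every occupied node is a leaf: halt
                return occupied
            # Place each particle on a uniformly random node of the pool; a child
            # of a high-degree node is no likelier to be chosen than any other.
            occupied = {random.choice(pool) for _ in range(B)}

Roughly speaking, this evens out occupancy across each level (so the
"rightmost spine" example above is no longer a problem), but, as the notes
observe, it can now fail on the earlier trees with the large traps T,
where the one good node at a level is only a small fraction of the pool.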