15-750				04/19/01

* A simple coupling example (walk on a line)
* eigenvalues and rapid mixing
* conductance and expanders
* a practical example of MCMC (Markov Chain Monte Carlo).

See also: http://www.cs.berkeley.edu/~sinclair/mcmc.ps

==================================

One more (simple) coupling example:
Consider a random walk on the line 1,...,n.  You start at point 1.
At each step you flip a coin: heads means go left, tails means
go right. If you're at 1 and get a heads, just stay put (same if
you're at n and get a tails).

1. What is the stationary distribution?  Ans: uniform.  [Can verify
directly or view as special case of random walk on graph, where the
self-loops give each node the same out-degree]

2. How long until we get close?  Let's usse a coupling argument. We
are particle x, starting at location 1.  Also imagine particle y, at a
random spot according to stationary distribution.  Now, let's
correlate the walks: each time we get heads, so does y, and same with
tails.  So, y stays uniformly distirbuted.  But also, notice that our
distance to y never goes up, and sometimes goes down.  In particular,
by the time we reach point n, we must have coupled.

We can now use fact from before that expected time to reach n is at
most 2mn = 2n^2.  Let's use "Markov's inequality": the probability we
have reached by time 4n^2 is at least 1/2.  If not, then at least a
1/2 chance in next 4n^2 steps.  So, if we want success probability
1-epsilon, then we just have to walk for O(n^2* log(1/epsilon))
steps.  So, in this many steps, the prob we have *not* coupled is at
most epsilon, so by our lemma from last time, this means our variation
distance is at most epsilon.  That's it.

Note: the lemma about variation distance is kind of clever.
More straightforward way to do that last part of the argument is:
	Pr(x is at i) >= Pr(y is at i)*Pr(coupled | y is at i).  
We want this to be at least 1/n * (1-epsilon).
We know Pr(y is at i) is 1/n.  But the coupling probability might
depend on y [think of n=2 and a zero-step walk].  So you end up
getting a bit worse bounds if you do it this way.

==================================
Now we're going to talk more generally about what properties of a
graph make the walk rapidly mixing.  

One way to think of a random walk is as a matrix-vector product.  Say
we have n states.  Define transition matrix:

         M_ij = prob of going to j given that you're in state i.

 E.g., our random walk on the line...

 Write a prob distribution as a row vector q.  Then, one step of walk = qM

 stationary distribution: eigenvector of eigenvalue 1.


Say we're interested in rapid mixing.  Want to approach stationary
dist in time poly in log(n).  Can we talk about this in terms of
properties of the matrix?

Eigenvalue gap -> Rapid Mixing
==============================

THEOREM: Say M is a transition matrix with real eigenvalues and orthogonal
eigenvectors.  Then, for any starting distribution, the L_2 distance
between q^(t) (the distribution after t steps) and the stationary
distribution \pi is at most |\lambda_2|^t, where \lambda_2 is the
eigenvalue of second-largest absolute value.

First of all, when do we get real eigenvalues and orthogonal
eigenvectors?  Well, one case is when M is symmetric.  E.g., think of
a random walk on a regular graph (all nodes have the same degree).  In
fact, can generalize to ``time-reversible'' Markov chains: \pi_i *
M_{ij} = \pi_j * M_{ji} for all i,j.  E.g., random walk on an
undirected graph where nodes are not necessarily all the same degree.
[Non-time-reversible: random walk on a directed cycle].

Second: we know that the stationary distribution is an eigenvector of
eigenvalue equal to 1.  This is also the largest eigenvalue: all other
eigenvalues have magnitude <= 1.  (The proof of the theorem will show
us why.)  The difference between 1 and |\lambda_2| is called the
"eigenvalue gap".

So, if we have an eigenvalue gap of some amount gamma, and we want an
L_2 distance of delta_2, then we need only O((1/gamma)*log(1/delta_2))
steps.  We'll relate L_2 distance to variation distance in a minute.


PROOF OF THEOREM: 
Say orthogonal eigenvectors are e_1, \ldots, e_n.  e_1 = \pi.
Say we start at

        q^{(0)} = c_1e_1 + ... + c_ne_n.

By definition of eigenvectors (and since M(a+b) = Ma + Mb), after t
steps, we're at:

    q^{(t)} = c_1e_1 + (c_2(\lambda_2)^te_2 +... + c_n (\lambda_n)^te_n).

Notice that |lambda_2| cannot be larger than 1, since otherwise our
vector is getting bigger and bigger.  Can't have a probability vector
with a Euclidean length longer that 1 (Euclidean length is always
between 1/sqrt(n) and 1).  

Assuming that |lambda_2| < 1 (else the theorem is vacuous), and e_1 is
pi, we must have c_1 = 1.  So, the L_2 distance between q^(t) and e_1
is just the length of the vector corresponding to the second part of
the RHS.  We can rewrite this vector like:
		(lambda_2)^t * blah
where blah is shorter than (c_2e_2 + ... + c_ne_n), which for sure is
shorter than q^(0), which has length at most 1.  So, this means the
L_2 distance between q^(t) and e_1 is at most (lambda_2)^t.  QED


How does this relate to variation distance?  Variation distance = 1/2
of L_1 distance.  L_1 distance is at most sqrt(n) * L_2 distance.
[[E.g., if you have a unit-length vector in R^n, the largest the sum
of the entries can be is if they are all equal to 1/sqrt(n), in which
case the sum is sqrt(n).]]  So, if we want a variation distance of
delta_1, we just need to run for O((1/gamma)*log(n/delta_1)) steps.


What does the eigenvector of 2nd-largest eigenvalue look like?  It has
some positive entries and some negative entries.  E.g., if graph is
regular, then the eigenvector of highest eigenvalue has all entries
equal to 1/n; since the next one is perpendicular, the sum of its
positive entries must equal the sum of its negative entries.  If the
graph is dumbbell shaped, then the positives will be on one side and
the negatives will be on the other.  These describe the sets that in a
sense mix most slowly together.  In fact, this approach is often used
for graph partitioning (splitting into large pieces that have few
edges between them).
=============================================================================

When do we get rapid mixing?

Ans #1: if have gap between largest and 2nd largest eigenvalue.

Ans #2: when the Markov chain has high "conductance".

[Note: this has nothing to do with the notion of "resistance" we
discussed last time.]


Def:

* pi(S) = sum_{i in S} pi_i.
  (prob of being in S according to pi)

* flow(S) = sum_{i in S, j not in S} w_ij, 
  where w_ij = (pi_i)*(p_ij) = prob of being on edge i->j in pi.
  (measure of # edges leaving S).

* Conductance.  Let Phi(S) = flow(S)/pi(S).
  Conductance Phi is the minimum Phi(S) over all sets S of pi <= 1/2.

Think of Phi(S) this way:  If you started out at a random place in S
(according to pi), what's the chance that you won't be in S in your
next step? This is sum_{i in S, j not in S} Pr(was at i and went to j).
This is sum_{i in S, j not in S} p_{ij} * pi_i / pi(S).
This is flow(S) / pi(S).


If graph has *low* conductance,then it's pretty clear it's *not*
rapidly mixing.  In particular, if you start out at a random point in
the bad set S, it will take a long time to get out.  More interesting
thing is that if have high conductance, then it *is* rapidly mixing.

Note: very similar to definition of an Expander graph.
A graph is an epsilon-expander if for all sets S of <= n/2 nodes, 
                |N(S) - S| > epsilon*|S|.

Here is the main theorem (we won't give the proof): if diagonal entries
are constant > 0 (e.g., self-loops) then
	Eigenvalue gap is Omega(conductance^2).

In the other direction (also won't give the proof):
	Conductance is Omega(eigenvalue gap).

Intuition for main theorem: Imagine giving each node a number
according to eigenvector of 2nd largest eigenvalue, and pretend this
is a distribution (it's not: the values all sum to 0).  After 1 step
in the walk, the fact that we have high conductance means there will
be a lot of cancelling going on.  (and self-loops make surewe don't
just bounce between S and its complement.)

==================================================

Application: Gibbs sampling from a posterior.  Say you have some large
set of hypotheses about some phenomenon.  before you see any data, you
might have some prior belief about which is true.  Now, you see some
data, and use Bayes rule: Pr(h | d) = Pr(d | h)*Pr(h) / Pr(d).
Want to find h of highest posterior.

Let's ignore the denominator (it's the same for each h) and say we
know how to compute Pr(d | h) for any given h.  Problem is: too many
h's.  So, how about this: can we at least randomly *sample* h's
according to the posterior distribution?  This is called "Gibbs
sampling".

From now on, let value(h) = posterior prob of h.

Idea: set up some neighborhood structure and take a random walk using
"Metropolis filter"  (Metropolis is someone's name --- I used to think
this was because people were thinking of this as a walk on a grid...)
Specifically, let's say we set up a neighborhood structure so that
each h has d-1 neighbors, plus add self-loop for aperiodicity, so a
total of d neighbors.  Now, say we're at h and our coin toss tells us
to move to neighbor h'.  If value(h') >= value(h) then we go there.
Otherwise, we only go there with prob value(h')/value(h); otherwise we stay
put.  Claim is that this has the correct stationary distribution:

Say at time t, prob at h = value(h) for all h.  Then at time t+1,
Pr(h) = Pr(got here by taking an edge) + Pr(got here by not taking an edge)
      = sum_{nbrs h' of h} value(h')*(1/d)*min{1,value(h)/value(h')} +
        sum_{nbrs h' of h} value(h)*(1/d)*max{0,1 - value(h')/value(h)}.
    = avg_{nbrs h' of h} [min{value(h'), value(h)} + max{0,value(h)-value(h')}]

If h' has smaller value than h, then thing in [...] is
value(h')+value(h)-value(h') = value(h).  If h' has larger value than
h, then thing in [...] is value(h)+0.  Either way, we're averaging a
bunch of things all of which are value(h).

So, the stationary dist is just what we want.  We just hope the markov
chain has high conductance.  

Unfortunately, high conductance means not only that neighborhood
structure is expanding, but also that values are nice.  One place this
happens [from Kalai-Vempala'00] is when h's are CRP investment
strategies.  value(h) means how much money we would have had if we had
used that investment strategy from the start.  Goal is pick random
strategy in proportion to value.  Here, the natural neighborhood
structure has the property that similar strategies have similar
values.  So this all works.