15-854 Approximation and Online Algorithms	02/16/00

* Approximations using random projection -- approximate nearest neighbor.

============================================================================

-- recap SDP results: (1) use SDP to convert graph problem to problem
about points in n-dim space, then (2) "round" the solution by doing
some kind of projections.  E.g., for MAX-cut we just projected onto
random 1-dim space.  For 3-coloring (simpler version) we projected
into what could be thought of as a "random k-dim space" with 1 color
per quadrant. For the better version we computed dot products with a
collection of t random centers, one per color.

-- The idea of approximating a complicated space by a simple space is
a nice general approximation tool.  E.g., 
  - approximating distances in a graph by a tree (or distribution over
	trees)
  - approximating a high-dim space by a low-dim space.

Today we will look at this second idea.

Theorem [Johnson-Lindenstrauss]:  Given a set of n points in Euclidean
space (which we can assume is R^n), if we project onto a random
k-dimensional subspace for k = O((log n)/epsilon^2), then, whp, all
pairwise distances are preserved up to a factor of (1 + epsilon).
[actually we need to scale by multiplying all coordinates by
sqrt(n/k)].

Proof intuition:

If you take a random unit vector in R^n, what do we expect the x1
coordinate to look like?  Easier: what is expected value of the square
of the x1 coordinate?  Answer: 1/n (since the sum of all the squares
is 1, and by symmetry every coordinate has the same expectation).
For intuition, let's imagine the square of the x1, x2, ..., xk
coordinate values are each either 0 or 2/n, at random, independently.
If this was the case, what could we say about the length of the
projection onto these first k coordinates?  The expected square of the
projection is k/n.  Since things are independent, we can use Hoeffding
bounds: the chance we have epsilon*k more heads than tails or vice
versa (heads = 2/n, tails = 0) is at most 2e^{-k*epsilon^2/2}.
Plugging in k = 4(ln n)/epsilon^2 makes this 2/n^2. [or if we double k
we get 2/n^4].

So what do we have? With probability 1 - 1/poly(n), the projection has
(length)^2 in [(1-epsilon)k/n, (1+epsilon)k/n], or equivalently,
length in [sqrt((1-epsilon)k/n), sqrt((1+epsilon)k/n)], which is about
[(1 - epsilon/2)*sqrt(k/n), (1+epsilon/2)*sqrt(k/n)].  So, if we
project and then stretch by a factor of sqrt(n/k) we expect the
length to be in the range [1 - epsilon/2, 1 + epsilon/2].
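
A quick numerical sanity check of the claim above (a sketch, not part of
the proof): draw random unit vectors in R^n, keep the first k
coordinates, stretch by sqrt(n/k), and look at the resulting lengths.
The values n = 1000, k = 200, the number of trials, and the seed are
arbitrary choices made for illustration.

```python
import math
import random

def random_unit_vector(n, rng):
    """Uniform random direction in R^n: normalize a vector of iid Gaussians."""
    v = [rng.gauss(0.0, 1.0) for _ in range(n)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def scaled_projection_length(v, k):
    """Length of the projection onto x1..xk, stretched by sqrt(n/k)."""
    n = len(v)
    return math.sqrt(sum(x * x for x in v[:k])) * math.sqrt(n / k)

rng = random.Random(0)            # arbitrary fixed seed
n, k, trials = 1000, 200, 200     # illustrative sizes, not from the notes
lengths = [scaled_projection_length(random_unit_vector(n, rng), k)
           for _ in range(trials)]
print(min(lengths), max(lengths))  # both should be close to 1
```

Increasing k should tighten the spread around 1, roughly like
1/sqrt(k).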

Now, what does this have to do with our original problem?  Well, fix a
pair of points (p,q) in our set and look at the vector q - p.
Projecting onto a random k-dim space and asking for the new distance
between p and q is the same as taking a random vector of length |q-p|
and asking for the length of the projection onto x1,...,xk.  So, what
we have is that whp, after projecting and scaling, the distance
between p and q has been multiplied by some number in the range
[1-epsilon/2, 1+epsilon/2].  The last step is that this "whp" is so
high that in fact, by a union bound, whp all n-choose-2 distances
behave this way.
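
The last step can be written out as a union bound.  This assumes each
fixed pair is distorted with probability at most 2/n^4 (the figure
quoted above for the doubled value of k):

```latex
\Pr[\text{some pair } (p,q) \text{ is distorted}]
  \;\le\; \binom{n}{2}\cdot\frac{2}{n^{4}}
  \;=\; \frac{n(n-1)}{2}\cdot\frac{2}{n^{4}}
  \;<\; \frac{1}{n^{2}} .
```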

What we argued above wasn't a real proof because (a) each coordinate
doesn't behave in this binary way, and (b) there are small
correlations: e.g., if the x1 projection is 1 then you know the rest
must be 0.

Simpler argument with better bounds: instead of projecting onto random
k-dimensional space, pick k vectors v1, v2, ..., vk at random
according to n-dimensional gaussian, and then map each point p into
the k-tuple of dot products: (p.v1, p.v2, ... , p.vk).  Just like in
the previous method, in the projection, the distance between p and q
is the same as the length of the projection of vector q-p: it is
sqrt{(q.v1 - p.v1)^2 + ... + (q.vk - p.vk)^2}.  So, we just need to
show that given a unit vector, the length of the projection is tightly
concentrated around the expectation.  But, this is easier now, since
by rotational symmetry of the Gaussian each component is iid from the
standard 1-dimensional Gaussian.  So, it just boils down to: "I pick k
numbers iid from the standard normal distribution, and add up their
squares.  How tightly concentrated is this?".  (The sum is a
chi-squared random variable with k degrees of freedom, with
expectation k.)  That's still a bit of a mess, but you could imagine
looking it up.  Or, we could be really crude and calculate a value B
s.t. whp, none of the coordinates are > B in absolute value, and
reduce to the binary case.
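
A minimal sketch of this Gaussian map, using only the Python standard
library.  The division by sqrt(k) is an added normalization (the image
of a unit vector has expected squared length k before scaling); the
function name gaussian_map, the sizes, and the seed are illustrative
assumptions.

```python
import math
import random

def gaussian_map(points, k, rng):
    """Map each p in R^n to (p.v1, ..., p.vk) / sqrt(k), v_i iid Gaussian vectors."""
    n = len(points[0])
    vs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)   # assumed normalization, see lead-in
    return [[scale * sum(a * b for a, b in zip(p, v)) for v in vs]
            for p in points]

rng = random.Random(1)                     # arbitrary fixed seed
n_dim, num_points, k = 500, 20, 200        # illustrative sizes
points = [[rng.uniform(-1.0, 1.0) for _ in range(n_dim)]
          for _ in range(num_points)]
mapped = gaussian_map(points, k, rng)

# Ratio (new distance) / (old distance) for every pair of points.
ratios = [math.dist(mapped[i], mapped[j]) / math.dist(points[i], points[j])
          for i in range(num_points) for j in range(i + 1, num_points)]
print(min(ratios), max(ratios))            # both should be close to 1
```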

==================================================

Applications:

1. Machine learning: say I have a set S of red and blue points in an
n-dimensional space and want to separate them by a surface of some
type, e.g., a hyperplane.  Suppose there exists a separating
hyperplane that has a large "margin": can wiggle by angle epsilon and
still separate.  Then our above results tell us that a random
projection onto an O((log |S|)/epsilon^2) dimensional space should still
be linearly separable.  Why?  We showed that all pairwise distances
are approximately preserved.  Also, the distance of each point to the
origin is approximately preserved for the same reason (just think of
the origin as one more point).  So, this means angles are
approximately preserved too, in particular the angle to the normal
vector for that hyperplane (treat the normal vector as yet another
point).

This is nice because lower dimensional spaces are easier to deal
with.  Also, gives an alternate way of looking at things like "support
vector machines" (an algorithm that tries to find the hyperplane with
the largest margin) and why large margins imply less overfitting.
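
A small experiment illustrating the separability claim, under
simplifying assumptions: the hyperplane passes through the origin with
unit normal w, every point has unit length, and the margin is measured
as the dot product with w (an illustrative stand-in for the angular
margin in the text).  The sizes, gamma = 0.5, and the seed are
arbitrary.

```python
import math
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def project(vs, p):
    """Image of p under the random map: (p.v1, ..., p.vk) / sqrt(k)."""
    scale = 1.0 / math.sqrt(len(vs))
    return [scale * dot(p, v) for v in vs]

rng = random.Random(2)                      # arbitrary fixed seed
n_dim, k, num_points, gamma = 600, 150, 20, 0.5

w = [1.0] + [0.0] * (n_dim - 1)             # unit normal of the hyperplane
points, labels = [], []
for _ in range(num_points):
    y = rng.choice([-1, 1])
    # Random direction orthogonal to w (its first coordinate is 0).
    u = [0.0] + [rng.gauss(0.0, 1.0) for _ in range(n_dim - 1)]
    norm_u = math.sqrt(dot(u, u))
    # Unit-length point whose dot product with w is y * gamma.
    x = [y * gamma * wi + math.sqrt(1 - gamma ** 2) * ui / norm_u
         for wi, ui in zip(w, u)]
    points.append(x)
    labels.append(y)

vs = [[rng.gauss(0.0, 1.0) for _ in range(n_dim)] for _ in range(k)]
w_low = project(vs, w)                      # image of the normal vector
still_separated = all(y * dot(project(vs, x), w_low) > 0
                      for x, y in zip(points, labels))
print(still_separated)
```

With a much smaller margin or a much smaller k, some points would be
expected to land on the wrong side.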


2. Approximate nearest neighbor.  Given a big dataset in a high
dimensional space, we would like to store a data structure that lets
us compute the nearest neighbor to a query point quickly.  E.g., our
data set has n points in d dimensions, and we want to find the nearest
neighbor in o(n) time.  In low dimensional spaces there are a lot of
tools for this, but seems really hard in high dimensions.  Idea: if
we're satisfied with approximate answers, then we can use random
projection to reduce the number of dimensions.

Before going into details, note two naive baselines.  Baseline 1:
just store the data in a list and compare the query to each point.
This has linear storage, but also linear lookup time.  Baseline
2: say each axis has k values (e.g., if we're in the boolean cube,
then k=2, or if the queries are to b bits of precision then k = 2^b).
Then we could just precompute the answer to every possible query and
store in a hash table.  This produces fast lookup time, but storage is
O(k^d).  Our goal will be poly(n,d) storage, and polylog(n,d) lookup time.
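
For concreteness, a sketch of the two baselines: a linear scan, and
the number of entries a precomputed table would need.  The function
names are made up for illustration.

```python
import math

def nearest_neighbor_scan(data, q):
    """Baseline 1: compare the query against every stored point."""
    return min(data, key=lambda p: math.dist(p, q))

def table_entries(k, d):
    """Baseline 2: one precomputed answer per possible query, k values per axis."""
    return k ** d

data = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
print(nearest_neighbor_scan(data, (0.9, 1.2)))
print(table_entries(2, 30))   # boolean cube in 30 dimensions
```

Even on the boolean cube with d = 30 the table already has over a
billion entries, which is why the goal is poly(n,d) storage.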

Approach of [Indyk-Motwani]:

Simpler-to-think-of problem:  Given set S of n data points and a radius r,
create a data structure so that given a query q, can quickly answer
the question "is there a point in S within distance r from q, and if
so, produce one".  Approximate version: "if there is a point in S within
distance r from q, produce one within distance r(1+epsilon)".  Note:
for approximate version, the algorithm may or may not produce anything
if the nearest point to q is at distance > r but < r(1+epsilon).

How can we use a solution to this to solve our original problem?  If
smallest pairwise distance in our database is 2*r0, then create a
bunch of these data structures with radius r0, r0(1+epsilon),
r0(1+epsilon)^2,... up to (diameter of space)/2. Then when doing
querying, can do binary search on r.  If we let R = (diameter of
space)/(min distance), then our storage and construction time has
dependence O(log R) and our query time has dependence O(loglog R).  It
turns out that if R is big, there are other reductions you can use.
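
The reduction can be sketched as follows.  Here a brute-force scan
stands in for each fixed-radius data structure (an assumption made
only so the example runs), and the binary search relies on the
stand-in being exact; with truly approximate structures the search
needs more care.  All names are illustrative.

```python
import math

def radius_ladder(r0, eps, diameter):
    """Radii r0, r0(1+eps), r0(1+eps)^2, ... until reaching diameter/2."""
    radii = [r0]
    while radii[-1] < diameter / 2:
        radii.append(radii[-1] * (1 + eps))
    return radii

def near_point_bruteforce(data, q, r):
    """Stand-in for one structure: some point within r of q, or None."""
    for p in data:
        if math.dist(p, q) <= r:
            return p
    return None

def approx_nearest(data, q, radii):
    """Binary search for the smallest radius whose structure answers."""
    lo, hi, best = 0, len(radii) - 1, None
    while lo <= hi:
        mid = (lo + hi) // 2
        p = near_point_bruteforce(data, q, radii[mid])
        if p is not None:
            best, hi = p, mid - 1     # answered: try a smaller radius
        else:
            lo = mid + 1              # no answer: need a larger radius
    return best

data = [(0.0, 0.0), (2.0, 0.0), (50.0, 0.0), (0.0, 40.0)]
eps = 0.5
r0 = min(math.dist(a, b) for a in data for b in data if a != b) / 2
diameter = max(math.dist(a, b) for a in data for b in data)
radii = radius_ladder(r0, eps, diameter)
q = (20.0, 0.0)
result = approx_nearest(data, q, radii)
print(result, math.dist(result, q))
```

The point returned need not be the true nearest neighbor, only one
within a (1+epsilon) factor of the nearest distance.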

How to solve this simpler problem? ("PLEB": "Point Location in Equal
Balls")  Let's consider the case of d=2 (a 2-dimensional world).
Here's a really crude method: just plunk down graph paper with grid
spacing r*epsilon/sqrt(2), so that each cell has diameter r*epsilon.
For each grid cell that intersects some ball (of radius r around a
data point), store the coordinates of the
lower-left corner of that cell along with the associated ball in a
hash table.  To handle a query: round down each coordinate to find
which cell it is in and hash.  How many cells do we need to store?
It's basically the volume of a ball divided by the volume of a cell,
which in 2D is (pi*r^2)/(r^2*epsilon^2/2) = O(1/epsilon^2).
In general, for d dimensions we get O(1/epsilon^d).
[technical issue: the denominator has a "sqrt(d)^d" term, but so does
the numerator.  The volume of a d-dimensional ball of radius r is
(r^d)(pi^{d/2})/(d/2)! for even d; in general, replace (d/2)! by
Gamma(d/2 + 1).]

This is bad for large d.  But, we can use our random projection and we
only need d = O(log(n)/epsilon^2) to keep distances correct within a
(1+epsilon) factor.  So, the total space is 
    n^(O(log 1/epsilon)*(1/epsilon^2)).  
Not perfect, but at least polynomial for constant epsilon.  E.g., if
we're happy about finding a point that's at most twice as far as our
nearest neighbor, then epsilon=1.
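
The space bound can be checked with a short calculation.  It assumes
the number of cells per ball is at most (C/epsilon)^d for some constant
C, and that the projected dimension is d = c (ln n)/epsilon^2 for some
constant c; both constants are placeholders.

```latex
\left(\frac{C}{\epsilon}\right)^{d}
  = \exp\!\left(\frac{c\ln n}{\epsilon^{2}}\cdot\ln\frac{C}{\epsilon}\right)
  = n^{\,c\,\ln(C/\epsilon)/\epsilon^{2}}
  = n^{O\left(\log(1/\epsilon)/\epsilon^{2}\right)} .
```

Multiplying by the n balls only adds 1 to the exponent.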

One technical issue we glossed over: For any set S of n data points and
any query q, whp the projection approximately preserves distances to q
and so we get an approximate answer.  Or, for any set S of points and
set T of queries fixed in advance, whp the projection approximately
preserves distances for all queries if we apply the theorem with
n = |S| + |T|.  But, once we fix the data structure, we might be able
to find queries that fail.  One way to
handle that: if each coordinate of the query is given to b bits of
precision (so the total # of bits is b*d and there are 2^{b*d}
possible queries), create O(b*d) data structures, each iid at random.
Then we can argue that whp, the majority vote over these will be
correct for *every* query.
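
One way the majority-vote argument could go, as a sketch.  Assume each
of the m independent data structures is wrong on any fixed query with
probability at most 1/4 (an illustrative figure, not from the notes).
By Hoeffding, the majority is wrong on a fixed query q with
probability at most e^{-m/8}, and a union bound over all queries gives:

```latex
\Pr[\text{majority wrong on some } q]
  \;\le\; 2^{bd}\cdot e^{-2m(1/4)^{2}}
  \;=\; 2^{bd}\,e^{-m/8}
  \;\le\; \delta
  \quad\text{whenever}\quad
  m \;\ge\; 8\left(bd\ln 2 + \ln\frac{1}{\delta}\right) = O(bd).
```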