15-854 Approximation and Online Algorithms                     02/16/00

* Approximations using random projection
   -- approximate nearest neighbor.
============================================================================

-- Recap of SDP results: (1) use SDP to convert a graph problem into a
   problem about points in n-dim space, then (2) "round" the solution
   by doing some kind of projection.  E.g., for MAX-cut we just
   projected onto a random 1-dim space.  For 3-coloring (simpler
   version) we projected into what could be thought of as a "random
   k-dim space" with 1 color per quadrant.  For the better version we
   computed dot products with a collection of t random centers, one
   per color.

-- The idea of approximating a complicated space by a simple space is
   a nice general approximation tool.  E.g.,
     - approximating distances in a graph by a tree (or distribution
       over trees)
     - approximating a high-dim space by a low-dim space.
   Today we will look at this second idea.

Theorem [Johnson-Lindenstrauss]: Given a set of n points in Euclidean
space (which we can assume is R^n), if we project onto a random
k-dimensional space for k = O((log n)/epsilon^2), then, whp, all
pairwise distances are preserved up to a factor of (1 + epsilon).
[Actually, we need to scale by multiplying all coordinates by
sqrt(n/k).]

Proof intuition: If you take a random unit vector in R^n, what do we
expect the x1 coordinate to look like?  Easier: what is the expected
value of the square of the x1 coordinate?  Answer: 1/n (since the sum
of all the squares is 1, and by symmetry every coordinate has the same
expected square).

For intuition, let's imagine the squares of the x1, x2, ..., xk
coordinate values are each either 0 or 2/n, at random, independently.
If this were the case, what could we say about the length of the
projection onto these first k coordinates?  The expected square of the
projection is k/n.  Since things are independent, we can use Hoeffding
bounds: the chance that the number of heads differs from its
expectation k/2 by more than epsilon*k (heads = 2/n, tails = 0) is at
most 2e^{-2k*epsilon^2}.
Plugging in k = (ln n)/epsilon^2 makes this 2/n^2.  [Or if we double k
we get 2/n^4.]

So what do we have?  With probability 1 - 1/poly(n), the projection
has (length)^2 in [(1-epsilon)k/n, (1+epsilon)k/n] (rescaling epsilon
by a constant factor, which only changes the constant in k), or
equivalently, length in [sqrt((1-epsilon)k/n), sqrt((1+epsilon)k/n)],
which is about [(1 - epsilon/2)*sqrt(k/n), (1 + epsilon/2)*sqrt(k/n)].
So, if we project and then stretch by a factor of sqrt(n/k), we expect
the length to be in the range [1 - epsilon/2, 1 + epsilon/2].

Now, what does this have to do with our original problem?  Well, fix a
pair of points (p,q) in our set and look at the vector q - p.
Projecting onto a random k-dim space and asking for the new distance
between p and q is the same as taking a random vector of length |q-p|
and asking for the length of its projection onto x1,...,xk.  So, what
we have is that whp, after the projection and scaling, the distance
between p and q has been multiplied by some number in the range
[1 - epsilon/2, 1 + epsilon/2].

The last step is that this "whp" is so high that in fact whp all
n-choose-2 distances behave this way.  (Union bound: with failure
probability 2/n^4 per pair and fewer than n^2/2 pairs, the total
failure probability is below 1/n^2.)

What we argued above wasn't a real proof because (a) each coordinate
doesn't behave in this binary way, and (b) there are small
correlations: e.g., if the x1 projection is 1 then you know the rest
must be 0.

Simpler argument with better bounds: instead of projecting onto a
random k-dimensional space, pick k vectors v1, v2, ..., vk
independently at random from the standard n-dimensional Gaussian, and
then map each point p to the k-tuple of dot products:
(p.v1, p.v2, ..., p.vk).

Just like in the previous method, in the projection, the distance
between p and q is the same as the length of the projection of the
vector q-p: it is sqrt{(q.v1 - p.v1)^2 + ... + (q.vk - p.vk)^2}.  So,
we just need to show that given a unit vector, the length of the
projection is tightly concentrated around its expectation.  But this
is easier now, since each component is iid from the standard
1-dimensional Gaussian.
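
A minimal sketch of the Gaussian construction just described (not
from the original notes): it maps each point p to (p.v1, ..., p.vk)
and reports how far the pairwise distance ratios stray from 1.  The
function names, the seed parameter, and the extra 1/sqrt(k)
normalization (which makes squared lengths correct in expectation)
are assumptions made for illustration.

```python
import math
import random


def gaussian_projection(points, k, seed=0):
    """Map each n-dim point p to (p.v1, ..., p.vk) / sqrt(k).

    v1..vk are iid standard n-dimensional Gaussians.  The 1/sqrt(k)
    factor is an added normalization (assumption): E[(p.v)^2] = |p|^2,
    so the sum of k squared dot products has expectation k*|p|^2.
    """
    rng = random.Random(seed)
    n = len(points[0])
    vs = [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(k)]
    scale = 1.0 / math.sqrt(k)
    return [[scale * sum(x * y for x, y in zip(p, v)) for v in vs]
            for p in points]


def max_distortion(points, projected):
    """Largest |new_distance / old_distance - 1| over all pairs."""
    worst = 0.0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            old = math.dist(points[i], points[j])
            new = math.dist(projected[i], projected[j])
            worst = max(worst, abs(new / old - 1.0))
    return worst
```

For a fixed pair, the ratio fluctuates around 1 on the order of
1/sqrt(k), so increasing k should shrink the value returned by
max_distortion.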
So, it just boils down to: "I pick k numbers iid from the standard
normal distribution, and add up their squares.  How tightly
concentrated is this?"  (This sum is a chi-squared random variable
with k degrees of freedom.)  That's still a bit of a mess, but you
could imagine looking it up.  Or, we could be really crude and
calculate a value B s.t. whp none of the coordinates is > B, and
reduce to the binary case.

==================================================
Applications:

1. Machine learning: say I have a set S of red and blue points in an
n-dimensional space and want to separate them by a surface of some
type, e.g., a hyperplane.  Suppose there exists a separating
hyperplane that has a large "margin": we can wiggle it by an angle
epsilon and it still separates.  Then our results above tell us that
after a random projection onto an O((log |S|)/epsilon^2) dimensional
space, the points should still be linearly separable.

Why?  We showed that all pairwise distances are approximately
preserved.  The distance of each point to the origin is approximately
preserved too, for the same reason (just think of the origin as one
more point).  So, this means angles are preserved too; in particular,
the angle to the normal vector of that hyperplane.

This is nice because lower dimensional spaces are easier to deal with.
It also gives an alternate way of looking at things like "support
vector machines" (an algorithm that tries to find the hyperplane with
the largest margin) and why large margins imply less overfitting.

2. Approximate nearest neighbor.  Given a big dataset in a high
dimensional space, we would like to store a data structure that lets
us compute the nearest neighbor to a query point quickly.  E.g., our
data set has n points in d dimensions, and we want to find the nearest
neighbor in o(n) time.  In low dimensional spaces there are a lot of
tools for this, but it seems really hard in high dimensions.

Idea: if we're satisfied with approximate answers, then we can use
random projection to reduce the number of dimensions.

Before going into details, here are two naive baselines.
Baseline 1: just store the data in a list and compare the query to
each point.  This has linear storage, but also linear lookup time.

Baseline 2: say each axis has k values (e.g., if we're in the boolean
cube, then k=2, or if the queries are given to b bits of precision
then k = 2^b).  Then we could just precompute the answer to every
possible query and store it in a hash table.  This gives fast lookup
time, but storage is O(k^d).

Our goal will be poly(n,d) storage, and polylog(n,d) lookup time.

Approach of [Indyk-Motwani]:

Simpler-to-think-of problem: Given a set S of n data points and a
radius r, create a data structure so that given a query q, we can
quickly answer the question "is there a point in S within distance r
of q, and if so, produce one".

Approximate version: "if there is a point in S within distance r of q,
produce one within distance r(1+epsilon)".  Note: for the approximate
version, the algorithm may or may not produce anything if the nearest
point to q is at distance > r but < r(1+epsilon).

How can we use a solution to this to solve our original problem?  If
the smallest pairwise distance in our database is 2*r0, then create a
bunch of these data structures with radius r0, r0(1+epsilon),
r0(1+epsilon)^2, ... up to (diameter of space)/2.  Then at query time,
we can do binary search on r.  If we let
R = (diameter of space)/(min distance), then our storage and
construction time have dependence O(log R) and our query time has
dependence O(loglog R).  It turns out that if R is big, there are
other reductions you can use.

How to solve this simpler problem?  ("PLEB": "Point Location in Equal
Balls")

Let's consider the case of d=2 (a 2-dimensional world).  Here's a
really crude method: just plunk down graph paper with grid spacing
r*epsilon/sqrt(2), so that each cell has diameter r*epsilon.  For each
grid cell that intersects some ball of radius r around a data point,
store the coordinates of the lower-left corner of that cell, along
with the associated ball, in a hash table.  To handle a query: round
down each coordinate to find which cell it is in, and hash.
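
The crude 2-dimensional grid method above can be sketched as follows.
This is an illustrative sketch rather than the notes' own code: the
function names are made up, and cells are keyed by integer grid
indices instead of lower-left coordinates (an assumption that avoids
hashing floating-point values).

```python
import math


def cell_of(q, spacing):
    """Round each coordinate down to get the grid cell containing q."""
    return (math.floor(q[0] / spacing), math.floor(q[1] / spacing))


def build_grid(points, r, eps):
    """Hash every grid cell that intersects a radius-r ball around a point."""
    s = r * eps / math.sqrt(2)      # grid spacing; cell diameter is r*eps
    table = {}
    for idx, (px, py) in enumerate(points):
        lo_i, lo_j = cell_of((px - r, py - r), s)
        hi_i, hi_j = cell_of((px + r, py + r), s)
        for i in range(lo_i, hi_i + 1):
            for j in range(lo_j, hi_j + 1):
                # Distance from the ball's center to the closest point
                # of cell [i*s, (i+1)*s] x [j*s, (j+1)*s].
                dx = max(i * s - px, 0.0, px - (i + 1) * s)
                dy = max(j * s - py, 0.0, py - (j + 1) * s)
                if dx * dx + dy * dy <= r * r:
                    table.setdefault((i, j), idx)   # keep one ball per cell
    return table


def query(table, q, r, eps):
    """Return the index of some point within r*(1+eps) of q, or None."""
    return table.get(cell_of(q, r * eps / math.sqrt(2)))
```

Why the answer is within r(1+epsilon): a returned ball intersects the
query's cell, so its center is at most r plus one cell diameter
(r*epsilon) away from q.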
How many cells do we need to store?  Per ball, it's basically the
volume of a ball divided by the volume of a cell, which in 2D is
(pi*r^2)/(r^2*epsilon^2/2) = O(1/epsilon^2).  In general, for d
dimensions (with grid spacing r*epsilon/sqrt(d)) we get
O(1/epsilon)^d, i.e., (c/epsilon)^d for some constant c.

[Technical issue: the denominator has a "sqrt(d)^d" term, but so does
the numerator.  The volume of a d-dimensional ball of radius r is
(r^d)(2*pi^{d/2})/(d * Gamma(d/2)), which for even d equals
(r^d)(pi^{d/2})/(d/2)!.]

This is bad for large d.  But we can use our random projection, and
then we only need d = O(log(n)/epsilon^2) to keep distances correct
within a (1+epsilon) factor.  So, the total space is
n^{O(log(1/epsilon)/epsilon^2)}.  Not perfect, but at least polynomial
for constant epsilon.  E.g., if we're happy with finding a point
that's at most twice as far away as our nearest neighbor, then
epsilon=1.

One technical issue we glossed over: For any set S of n data points
and any query q, whp the projection approximately preserves distances
to q, and so we get an approximate answer.  Or, for any set S of
points and set T of queries, whp the projection approximately
preserves distances for all queries, provided we use n = |S| + |T| in
the bound on the number of dimensions.  But, once we fix the data
structure, we might be able to find queries that fail.  One way to
handle that: if each coordinate of the query is given to b bits of
precision (so the total # of bits is b*d and there are 2^{b*d}
possible queries), we create O(b*d) data structures, each iid at
random.  Then we can argue that whp, the majority vote over these will
be correct for *every* query.
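
To see where the O(b*d) count can come from, here is a small
back-of-the-envelope sketch.  It is not from the original notes: the
function name and the parameters delta (an assumed upper bound, below
1/2, on the chance a single structure fails on a fixed query) and
target (the allowed overall failure probability) are hypothetical.
It combines a Hoeffding bound on the majority vote with a union bound
over all 2^{b*d} queries.

```python
import math


def copies_needed(b, d, delta=0.25, target=1e-6):
    """Number m of independent structures so that the majority vote is
    correct on all 2^(b*d) queries with probability >= 1 - target.

    Assumes each structure errs on any fixed query with probability
    at most delta < 1/2.  Hoeffding gives
        P(majority of m errs on a fixed query) <= exp(-2*m*(1/2 - delta)^2),
    and a union bound over the 2^(b*d) queries requires
        2^(b*d) * exp(-2*m*(1/2 - delta)^2) <= target.
    """
    gap = 0.5 - delta
    log_queries = b * d * math.log(2)
    return math.ceil((log_queries + math.log(1 / target)) / (2 * gap * gap))
```

For fixed delta and target, the returned value grows linearly in b*d,
which matches the O(b*d) data structures mentioned above.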