Notes for March 4, 1998 * If you haven't yet, talk with me about your presentation/project. =============================================================================== Recap: * Fourier analysis of boolean functions: - for a given distribution, we can view functions as vectors in a 2^n-dimensional space where |f| = 1 and the dot product is Pr_D(f(x) = g(x)) - Pr_D(f(x) != g(x)). (which makes sense since = 1) what's nice about this is we have a lot of intuition and facts about vectors in space thjat we can use * We saw that a statistical query can be viewed as basically asking for the correlation of the target function with some query vector. (any query can be decomposed into a part that depends only on D (and not the target) and a part like this.) * Any unit vector can't have too many large coefficients. At most 1/tau^2 coeffs can have magnitude larger than tau. This is since the length of the vector is the sum of the squares of its coefficients. * In particular, given any set S of orthogonal unit vectors, this means that any query unit vector can have correlation > tau or < -tau with at most 1/tau^2 of them. * So, if the target class has a large number (say, N) pairwise uncorrelated functions, and if the target is a random one of those, then each query will only throw out 1/tau^2 possibilities. So, we need: (#queries) * (1/tau^2) >= N/2 on average, to get a query that gives any info about the target. * So, if N is super-polynomial then can't learn in poly time in SQ model. Equivalently, if we're using an SQ algorithm, need super-polynomial number of samples. * For example, size-n decision trees have Omega(n^(log n)) pairwise uncorrelated functions over the uniform distribution, since that class contains all "parities of up to log(n) variables", and any two parity functions are uncorrelated. So, the bad case is: target is a random parity of log(n) variables. I think this is really nice. Most algorithms are basically SQ algorithms and using Fourier analysis we are able to get strong bounds on what you can or can't do with these kinds of algorithms. * On the other hand, there *is* a (non-SQ) algorithm for learning parity. If target is (a_1,...,a_n) where a_i = 1 or 0 depending on whether x_i is in the target or not, then an example (x_1,...,x_n) gives us a linear equality a.x = 1 (mod 2). Draw a sample and then use gaussian elimination. * Big open problem: is there a poly-time algorithm for learning Decision Trees over the uniform distribution? We *can* solve in time n^(log n) over *any* distribution (see hwk). ============================================================================ We've seen how Fourier analysis can be used to analyze limitations on algorithms. Now turn to how can use it in an algorithm itself. This becomes particularly interesting if we extend our learning model to allow for "membership queries". Learning with Membership queries: in addition to getting examples from distribution D, the learning algorithm is allowed to ask for c(x) for *any* x of its choosing. E.g., this makes the most sense if we think of the target function as some device that we are trying to reverse-engineer. We can put it into "normal operating conditions" and observe its behavior, but we can also poke it with inputs of our own choosing and see what it does. So, membership queries are a kind of "active learning". For example, parity functions are particularly easy to learn with membership queries. My plan is to spend the next few classes talking about active learning in general. Today, segue into this, from our previous topic, see how to use membership queries to find fourier coefficients. In particular, using membership queries, we can find all the "large" fourier coefficients of a function, under the parity basis. I.e., any function f over {0,1}^n can be written as: f(x) = \sum_i \hat{f}_i \phi_i(x) where \phi_i is the ith parity function. What we can find are (estimates of) all \hat{f}_i that have magnitude > 1/poly(n). Why is this interesting? ------------------------ Suppose the target has most of itself in its high-weight coefficients, and say we find a set of coefficients S such that sum_{i \in S} (\hat{f}_i)^2 > 1-\epsilon then the claim is that g(x) = sum_{i \in S} (\hat{f}_i)^2 \phi_i(x) is a good approximation to f. Note that g may not be a boolean function. But, the claim is that the average squared error (over the uniform distribution) is low. In particular, E(f(x) - g(x))^2 < epsilon. This is because: E(f(x) - g(x))^2 = E( \sum_{i \not\in S} \hat{f}_i\phi_i(x) )^2 = E( \sum_{i \not\in S} (\hat{f}_i)^2 \phi_i(x)^2 (since the crossterms cancel) and this equals \sum_{i \not\in S} (\hat{f}_i)^2 < \epsilon. If we want a good hypothesis in the usual sense (a boolean function with a low probability of error) we can use h(x) = sign(g(x)). Why? Remember f is a {-1, +1} function. So, if the sign is wrong then the squared difference is at least 1. So there is at most a probability epsilon of error (under the uniform distribution). Furthermore, we can prove the following: * any poly-size decision tree has a poly(n, 1/epsilon) set of fourier coefficients whose sum of squares is at least 1-epsilon (i.e., they capture most of its "length"). So can learn DTs with membership queries, with respect to the uniform distribution. * Any DNF f has SOME |\hat{f}_i| > 1/(4s), where s = number of terms in f. So, we can use that parity function as a weak approximator of f over the uniform distribution. Actually, you can plug this into the boosting results. This is tricky since they modify the distribution. But you can keep track of this modification and compensate for it if you are careful. The end result is can strongly learn DNF over the uniform distribution, with membership queries. This was Jeff Jackson's thesis a couple years back. So, there are now three things: (1) prove fact about DNF (2) prove fact about decision trees (3) do the algorithm for finding fourier coefficients. Will do (1) today and (3) next time. Probably will skip (2). Theorem: any DNF of s terms has some fourier coefficient of magnitide at least 1/(4s), over the parity basis. Proof: First some notation. For a set of variables A, define \phi_A to be the function that is the parity of the variables in A. E.g., for A = {x_1, x_3} then you get x_1 \oplus x_3. Say T is a term of f, and V is the set of variables used in T. Claim: \sum_{A \subseteq V} |\hat{f}_A| >= 1 Why is this good enough: notice that if a term has L literals in it, then the probability it is satisfied by a random example is 1/2^L. So, if all the terms of f have size >= log_2(4s), then the probability that a random example is positive for f is at most 1/4. This means that "all negative" is a weak hypothesis. On the other hand, if some term of f has size < log_2(4s) then there are at most 4s values being summed above, with a total sum of 1, so at least one of them has to be at least 1/4s. Proof of Claim: Define the function T(x) as: T(x) = 1 if x satisfies T and T(x) = 0 otherwise. This is a little weird since it is not a boolean function according to our definition ({-1, +1}). But it is still a legal function and will be helpful. Let L = # of literals in T. What does the fourier decomposition of T look like: \hat{T}_A = E[T(x)\phi_A(x)] = 0 if A is not a subset of V. (V = set of variables in T) = (1/2^L)*\phi_A(T) if A IS a subset of V. (phi_A(T) is the value of phi_A on an example satisfying T) Now, here's the trick: f(x)T(x) = T(x), because if T(x)=1 then f(x)=1. So, clearly E[f(x)T(x)] = E[T(x)]. The right-hand-side of the above equation is 1/2^L. The left-hand-side of the above equation is the dot product of f and T. Using the fourier basis, this is sum_A \hat{f}_A \hat{T}_A which, by our analysis of \hat{T}_A is equal to (1/2^L) * sum_{A \subseteq V} \phi_A(T). So, we're done.