15-859(B) Machine Learning Theory 03/05/02 * Membership Query model * Some simple algorithms - parity - log(n)-variable functions - monotone DNF * KM (GL) algorithm for finding large Fourier coefficients ========================================================================= Learning with Membership queries: in addition to getting examples from distribution D, the learning algorithm is allowed to ask for c(x) for *any* x of its choosing. E.g., this makes the most sense if we think of the target function as some device that we are trying to reverse-engineer. We can put it into "normal operating conditions" and observe its behavior, but we can also poke it with inputs of our own choosing and see what it does. So, membership queries are a kind of "active learning". Some Simple Examples: 1. Learning parity functions (don't even need D for this). 2. Learning monotone DNF (here you need D). [go through Angluin's algorithm] 3. Learning the class "all functions that have at most log(n) relevant variables". [Go through algorithm. Show slides?] Cases where MQs make sense: trying to "break" encryption on a smart-card [mention Angluin's DFA algorithm], active exploration of environment, converting a complicated hypothesis (e.g., a neural net) into a more understandable form. Cases where MQs don't make so much sense: classifying documents/images (doesn't make sense to "walk" a document about topic X towards a document about topic Y and ask at what point the topic changes) Interesting area for project: define and prove something about other "reasonable" kinds of queries you might be able to ask in different settings. Need to make a reasonable model of how well the "teacher" can articulate what it knows. Plan for next few classes: - MQ algs based on Fourier. (e.g., learn decision trees, DNF wrt uniform) - Bshouty's alg: learn Decision trees in PAC+MQ over any distribution (or mistake-bound + MQs) - Angluin's algorithm for learning Finite Automata. Extensions of Rivest&Schapire. ======================================================================== Fourier-based MQ algorithms =========================== We know that any function f over {0,1}^n can be written as a sum of parity functions: f(x) = \sum_i \hat{f}_i \phi_i(x) where \phi_i is the ith parity function, and \hat{f}_i is the correlation of f with \phi_i under the uniform distribution. Claim: using MQs, we can find all i such that (\hat{f}_i)^2 > tau in time poly(n, 1/tau). 1. Why is this interesting? 2. How can we do it? Why is this interesting? ------------------------ Can prove that any DNF formula has at least one non-negligible fourier coefficient in parity basis. So, this lets us WEAK-LEARN DNF over the uniform distribution. Also, any DECISION TREE has 1-epsilon of its "length" (sum of squares of coefficients) in a small (poly(n/1/epsilon)) number of coefficients. So, this lets us strongly-learn decision trees, with respect to the uniform distribution. [Quick check: decision trees are a special case of DNF, right?] [Technical point: why is it enough to get "most" of the fourier coefficients? E.g., if we get all but epsilon of the (\hat{f}_i)^2, then the resulting g has (f-g)^2 = epsilon. The claim is that even if g doesn't correspond to a boolean function, we can use sign(g) as a good approx to f. The reason is that by definition, E[(f(x)-g(x))^2] = epsilon. Notice that every x where sign(g) is wrong contributes at least 1. So, the error of sign(g) is at most epsilon.] Then Jackson showed how to use boosting to get strong learning of DNF over unif. [not trivial --- remember that boosting changes the distribution, so you need to show that with appropriate boosting procedure, they stay "close enough" to uniform] Let's prove claims about DNF, DTs later. Go to alg for finding large fourier coeffs now. ====================================================================== Kushilevitz-Mansour (Goldreich-Levin) algorithm: -------------------------------------------------- Let's index parity functions by their indicator vectors. Our goal is to find all \hat{f}_v such that (\hat{f}_v)^2 > \tau. Say S is the set of all indicator vectors. So the sum, over v in S, of the (\hat{f}_v)^2 equals 1. Suppose we could split S into two equal-sized sets S_0 and S_1 and for each one, ask for the sum of squares of the \hat{f}_v in that set. And, suppose we could do this recursively. Then what? Then we could build up a tree, splitting each node whose answer is > tau, and ignoring those < tau. - max width of tree is 1/tau. - Just keep going to leaves at depth n. => Only n/tau questions to ask. So, all boils down to: how can we ask for the sum of squares of the \hat{f}_v's inside these sets? First of all, to be able to do this, we're going to need to split these sets in a particular way. Specifically, we'll do this by fixing a prefix of the variables, and then the set will be "all parities consistent with this prefix". For instance, S_0 = {v: v begins with 0} S_1 = {v: v begins with 1} Then, will split S_0 into S_00 = {v: v begins with 00} S_01 = {v: v begins with 01} etc. In general, our goal is now to estimate, for some string alpha of 0s and 1s, the sum of the squares of the \hat{f}_v's for all v that begin with alpha. (Note: we won't find the sum exactly. We'll find an estimate of the sum. But that's fine. It just means we throw out nodes whose estimate is < tau/2 instead of those whose estimate is < tau) There's now a neat trick for doing this. Let's do case of alpha = 00000 (a string of k zeros). So, we want the sum of squares over all parities that don't include any of first k variables. Remember, f is a {-1, +1}-valued function. The claim is that this sum just happens to equal the expected value, over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx). I.e. pick random suffix x. Pick two random prefixes y and z and multiply f(yx) by f(zx). Repeat many times and take the average. Intuition: independent y and z will cancel out correlations we don't want. E.g., what if f was a parity function? If f agrees with alpha, then f(yx)=f(zx) so E[f(yx)f(zx)] = 1. BUT, if f DOESN'T agree with alpha, then Pr[f(yx)=f(zx)] = 1/2, so E[f(yx)f(zx)] = 0. Now, since any f can be written as a sum of parities, hopefully we can reduce the case of general f to the case of "f is a parity function"..... It's actually almost but not quite so easy. Question we need to answer is: for two (possibly different) vectors v,w what is E [ phi_v(yx) * phi_w(zx) ] ? x,y,z Claim: This is 0 if v != w, and is also 0 if v,w aren't all 0s on the first k bits. If v=w and they are all 0 on 1st k bits, then the expectation is 1. This is enough. Now we get: E[f(yx)f(zx)] = E[ (sum_v \hat{f}_v \phi_v(yx)) (sum_w \hat{f}_w \phi_w(zx))] = sum of squares of coeffs starting with k zeroes. So, we're done for the case of alpha = "all zeros". What about other alpha (i.e., requiring parity to have a certain fixed dependence on the first k bits)? Since we know the dependence (alpha) just "undo it". Specifically, we instead ask for: E[f(yx)f(zx)\phi_{\alpha}(y)\phi_{\alpha}(z)] where phi_alpha is the parity of the variables set to 1 in alpha. (then we get same as in above Claim, except the non-zero case is when v=w and they are equal to alpha on the first k bits). That's it. Actually, one technical point: we need to estimate an expectation. We'll do it by sampling. OK since the quantity inside the expectation is bounded (-1 or +1). In Jeff Jackson's result about DNF, what he shows is that instead of modifying the distribution, you can modify f to a non-boolean function. Then, need to argue that (based on how the boosting results work) f doesn't get too large, so you can still estimate f(yx)f(zx) from sampling. ============================================================================