15-859(B) Machine Learning Theory                                03/19/08

* Learning with Membership Queries
  - Some simple algorithms: monotone DNF
* KM algorithm for finding large Fourier coefficients

=========================================================================

Learning with Membership Queries
================================

So far we have focused on learning in a passive model, where the learning
algorithm has no control over the examples it sees.  We are now going to
consider what happens if we allow a learning algorithm to ask for labels on
examples of its own construction.  These are called "Membership Queries"
(MQs).  These are natural for learning problems where the target function is
some device that we are trying to reverse-engineer.  Also, at the very
least, this allows for more interesting algorithms since there is more for
the algorithm to do.

[Terminology note: "active learning" corresponds to something weaker, where
you have a pool of unlabeled data and can only select among them.  MQs allow
the algorithm to completely specify the example to be labeled.]

Can define the following models:

PAC + MQ: Can sample from D and ask MQs.  Goal: low error wrt D.  Special
   case: D = uniform on {0,1}^n.  Can view this as: you have a black box
   that you can prod as you like, and you want an h that agrees with the
   target on most of the domain.

MB + MQ: Mistake-bound setting with queries.  Pay $1 per mistake and $1 per
   query.  Also called the "equivalence + membership query model".

Exact learning with MQ only: need to recover the target exactly using only
   queries.  Pretty tough.  E.g., can't even do conjunctions (why?).  But
   you *can* do parity (how?).

Cases where MQs make sense: trying to break the encryption on a smart card,
active exploration of an environment (example = series of actions),
converting a complicated hypothesis into a more understandable form.

Cases where MQs don't make so much sense: classifying documents/images (it
doesn't make sense to "walk" a document about topic X towards a document
about topic Y and ask at what point the topic changes).

Let's start with a simple algorithm.

Learning Monotone DNF in MB+MQ model
====================================
Ideas?

* Start with h(x) = "all-negative".
* If we predict negative on a positive example x:
  - "Walk" x down to a minimal positive example (go left to right along x,
    flipping 1's to 0's so long as they keep the example positive).
  - The conjunction of the variables still set to 1 in x must be a term of
    the target c.  So put it into h.
* Since we maintain that all terms of h are also terms of c, we don't make
  any mistakes on negatives.

Total number of mistakes is at most the number of terms in the target.
Number of queries is at most n per term.
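A minimal Python sketch of this procedure, assuming a membership-query
oracle member(x) that returns the target's label (True/False) on any n-bit
example; the names walk_down, MonotoneDNFLearner, and member are
illustrative, not from the notes:

    def walk_down(x, member):
        """'Walk' a positive example x down to a minimal positive example:
        scan left to right, flipping 1s to 0s as long as the example stays
        positive according to the membership-query oracle."""
        x = list(x)
        for i in range(len(x)):
            if x[i] == 1:
                x[i] = 0
                if not member(tuple(x)):  # the flip broke positivity; undo it
                    x[i] = 1
        # the variables still set to 1 form a term of the target
        return frozenset(i for i, b in enumerate(x) if b == 1)

    class MonotoneDNFLearner:
        """Mistake-bound + membership-query learner for monotone DNF (sketch)."""

        def __init__(self, member):
            self.member = member  # membership-query oracle: tuple of 0/1 -> bool
            self.terms = []       # each term is a frozenset of variable indices

        def predict(self, x):
            # h(x) = OR of the conjunctions (terms) found so far
            return any(all(x[i] == 1 for i in t) for t in self.terms)

        def update(self, x, label):
            # The only possible mistake is predicting negative on a positive
            # x, since every term of h is also a term of the target.
            if label and not self.predict(x):
                self.terms.append(walk_down(x, self.member))

Each call to walk_down makes at most n queries and adds one genuine term of
the target, matching the mistake and query counts above.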
Fourier-based MQ algorithms
===========================

We know that any f over {0,1}^n can be written as a sum of parity functions:

   f(x) = sum_i hat{f}_i phi_i(x),

where phi_i is the ith parity function and hat{f}_i is the correlation of f
with phi_i under the uniform distribution.

Claim 1: using MQs, we can find all i such that (hat{f}_i)^2 > tau in time
poly(n, 1/tau).

So, if most of the L_2 length of f is contained in its large Fourier
coefficients over the parity basis, then this lets us learn over the uniform
distribution (using the results from Monday).

Before showing this, let's get a feel for when a target function f might
have most of its length in the large Fourier coefficients.  Define

   L_1(f) = sum_v |hat{f}_v|.

So, this is the L_1 length of f in the parity basis.  Note that while the
L_2 length is independent of your choice of basis (and for boolean functions
it's always 1), the L_1 length depends greatly on your basis.  E.g., if f is
a parity function, then it has L_1 length 1 in the parity basis, but it has
L_1 length 2^{n/2} (as do all boolean functions) in the standard basis,
where you list f as (f(00...0)*2^{-n/2}, f(00...1)*2^{-n/2}, ...).

Claim 2: Any vector f with L_2 length 1 has the property that the sum of
squares of the coefficients of magnitude less than epsilon/L_1(f) is at most
epsilon.

Proof: a coefficient of magnitude alpha contributes alpha^2 to the sum of
squares and alpha to the sum of absolute values.  So, each small coefficient
contributes at least a factor of L_1(f)/epsilon *more* to the latter than to
the former.  Since in total the small coefficients can contribute at *most*
L_1(f) to the sum of absolute values, they can contribute at most epsilon to
the sum of squares.  QED

For example, we will see next time that for a decision tree, we have
L_1(f) <= 2m+1, where m is the number of positive leaves in the tree.

Now, on to the algorithm...

Kushilevitz-Mansour (Goldreich-Levin) algorithm:
------------------------------------------------

Let's index parity functions by their indicator vectors.  Our goal is to
find all hat{f}_v such that (hat{f}_v)^2 > tau.  Say S is the set of all
indicator vectors, so the sum, over v in S, of (hat{f}_v)^2 equals 1.

Suppose we split S into
   S_0 = {v: v begins with 0}
   S_1 = {v: v begins with 1}
and suppose for each of these we could ask for the sum of squares of the
hat{f}_v in that set.  And suppose we could do this recursively, e.g., ask
for the sum of squares in:
   S_00 = {v: v begins with 00}
   S_01 = {v: v begins with 01}

Then what?  Then we could build up a tree, splitting each node whose answer
is > tau and ignoring those < tau.
 - The max width of the tree is 1/tau (the sums of squares at any level add
   up to at most 1).
 - Just keep going to leaves at depth n.
 => Only about n/tau questions to ask.

So, it all boils down to: how can we ask for the sum of squares of the
hat{f}_v's inside these sets?  In general, our goal is now to estimate, for
some string alpha of 0s and 1s, the sum of the squares of the hat{f}_v's
over all v that begin with alpha.  (Note: we won't find the sum exactly;
we'll find an estimate of it.  But that's fine: it just means we throw out
nodes whose estimate is < tau/2 instead of those whose estimate is < tau.)

There's now a neat trick for doing this.  Let's do the case of alpha = 00...0
(a string of k zeros).  So, we want the sum of squares over all parities
that don't include any of the first k variables.  Remember, f is a
{-1,+1}-valued function.  The claim is that this sum just happens to equal
the expected value, over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of
f(yx)*f(zx).  I.e., pick a random suffix x, pick two random prefixes y and
z, and multiply f(yx) by f(zx).  Repeat many times and take the average.

Intuition: independent y and z will cancel out correlations we don't want.
E.g., what if f were a parity function?  If f agrees with alpha (i.e., f
ignores the first k variables), then f(yx) = f(zx), so E[f(yx)f(zx)] = 1.
BUT, if f DOESN'T agree with alpha, then Pr[f(yx) = f(zx)] = 1/2, so
E[f(yx)f(zx)] = 0.
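To make the estimator concrete, here is a small Python sketch for this
all-zeros case, assuming f is a membership-query oracle returning -1 or +1
on an n-bit tuple; the function name and the fixed sample count are
illustrative, and a careful implementation would choose the number of
samples to get the accuracy it needs (the averaged quantity is bounded in
[-1,+1], so standard Hoeffding bounds apply):

    import random

    def estimate_weight_zero_prefix(f, n, k, num_samples=10000):
        """Estimate the sum of (hat{f}_v)^2 over all v whose first k bits
        are 0, i.e. over parities that ignore the first k variables, using
            sum = E_{x,y,z}[ f(yx) * f(zx) ],
        where x is a uniform suffix in {0,1}^{n-k} and y, z are independent
        uniform prefixes in {0,1}^k."""
        total = 0.0
        for _ in range(num_samples):
            x = tuple(random.randint(0, 1) for _ in range(n - k))
            y = tuple(random.randint(0, 1) for _ in range(k))
            z = tuple(random.randint(0, 1) for _ in range(k))
            total += f(y + x) * f(z + x)
        return total / num_samples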
Now, since any f can be written as a sum of parities, hopefully we can
reduce the case of general f to the case of "f is a parity function"...
It's actually almost this easy.  The question we need to answer is: for two
(possibly different) indicator vectors v, w, what is

   E_{x,y,z}[ phi_v(yx) * phi_w(zx) ] ?

Claim: This is 0 if v != w, and is also 0 if v = w but v is not all 0s on
the first k bits.  If v = w and they are all 0 on the first k bits, then the
expectation is 1.

Proof: [go through the cases.  The expectation factors as
E_y[phi_{v_pre}(y)] * E_z[phi_{w_pre}(z)] * E_x[phi_{v_suf}(x) phi_{w_suf}(x)],
where v_pre, v_suf denote the first k and last n-k bits of v.  The last
factor is 0 unless v_suf = w_suf, and each prefix factor is 0 unless that
prefix is all 0s.]

This is enough.  Now we get:

   E[f(yx)f(zx)] = E[ (sum_v hat{f}_v phi_v(yx)) (sum_w hat{f}_w phi_w(zx)) ]
                 = sum_{v,w} hat{f}_v hat{f}_w E[ phi_v(yx) phi_w(zx) ]
                 = sum of squares of the coefficients starting with k zeros.

So, we're done for the case of alpha = "all zeros".  What about other alpha
(i.e., requiring the parity to have a certain fixed dependence on the first
k bits)?  Since we know the dependence (alpha), just "undo it".
Specifically, we instead ask for:

   E[ f(yx) f(zx) phi_alpha(y) phi_alpha(z) ],

where phi_alpha is the parity of the variables set to 1 in alpha.  (Then we
get the same as in the above Claim, except the non-zero case is when v = w
and they are equal to alpha on the first k bits.)  That's it.

Actually, one technical point: we need to estimate an expectation.  We'll do
it by sampling.  That's OK since the quantity inside the expectation is
bounded (-1 or +1).

In Jackson's result on learning DNF by boosting a weak learner, what he
shows is that instead of modifying the distribution, you can modify f into a
non-boolean function.  Then you need to argue (based on how the boosting
results work) that f doesn't get too large, so you can still estimate
f(yx)f(zx) by sampling.
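Putting the pieces above together, here is a rough Python sketch of the
whole Kushilevitz-Mansour search (the plain algorithm, not Jackson's
extension), assuming the same -1/+1 membership-query oracle; the names, the
tau/2 cutoff, and the fixed sample count are illustrative, and a real
implementation would pick the number of samples so each estimate is accurate
to within tau/2 with high probability:

    import random

    def parity(alpha, y):
        # phi_alpha(y) = (-1)^(number of positions i with alpha_i = y_i = 1)
        return -1 if sum(a & b for a, b in zip(alpha, y)) % 2 else 1

    def estimate_prefix_weight(f, n, alpha, num_samples=10000):
        """Estimate the sum of (hat{f}_v)^2 over all v beginning with the
        prefix alpha, via  E_{x,y,z}[ f(yx) f(zx) phi_alpha(y) phi_alpha(z) ]."""
        k = len(alpha)
        total = 0.0
        for _ in range(num_samples):
            x = tuple(random.randint(0, 1) for _ in range(n - k))
            y = tuple(random.randint(0, 1) for _ in range(k))
            z = tuple(random.randint(0, 1) for _ in range(k))
            total += f(y + x) * f(z + x) * parity(alpha, y) * parity(alpha, z)
        return total / num_samples

    def km_large_coefficients(f, n, tau):
        """Grow the prefix tree one bit at a time, keeping only prefixes
        whose estimated weight is at least tau/2; after n levels the
        surviving prefixes are the indicator vectors of the (estimated)
        large Fourier coefficients."""
        prefixes = [()]  # root: the empty prefix (total weight 1)
        for _ in range(n):
            survivors = []
            for p in prefixes:
                for bit in (0, 1):
                    child = p + (bit,)
                    if estimate_prefix_weight(f, n, child) >= tau / 2:
                        survivors.append(child)
            prefixes = survivors  # at most about 1/tau prefixes per level
        return prefixes

Since at most about 1/tau prefixes survive at each of the n levels, roughly
n/tau prefix estimates are made in total, matching the question count above.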