15-859(B) Machine Learning Theory 03/02/04 * Finish topic of learning over uniform distribution * Learning with Membership Queries (active learning) - Some simple algorithms: - monotone DNF - log(n)-variable functions - begin KM algorithm for finding large Fourier coefficients ========================================================================= Discuss project/presentation topics... Learning over uniform distribution on {0,1}^n ============================================= We've seen over last few classes a collection of results that relate to the problem of learning over the uniform distribution on {0,1}^n. [I.e., have access to uniform random examples, and goal is to find h such that h(x)=f(x) on most of the domain.] - SQ lower bounds imply that if C contains > poly(n) parity functions, then can't even weak-learn wrt uniform in poly time. E.g., decision trees, DNF have n^{log n} parities in them. - But you can weak-learn any monotone function. Saw simple alg with error < 1/2 - 1/sqrt(n). Lower bound saying can't do much better in general. [But if you allow # samples to be polynomial in the description length of the target as a monotone DNF, then lower-bound breaks down and this is an open problem] A few more results: - We can complement our SQ lower bound with a simple n^{O(log s)} time algorithm to learn DNF of s terms over uniform distribution. Ideas? + how important are terms of size >> log(s)? + Do greedy set-cover. Get DNF of s*log(1/epsilon) terms. Size is bounded so can use Occam to upper-bound sample size needed. [Can't do "list-and-cross-off" because hypothesis size too big] - Note: LMN use Fourier analysis to get this kind of n^polylog(s) bound for learning AC^0 circuits. (DNF are depth-2 circuits). - On hwk, you are giving alg to learn decision trees over *any* distribution in time n^{log n}. Learning with Membership queries ================================ In addition to getting random examples, the learning algorithm is allowed to ask for c(x) for *any* x of its choosing. These are called "membership queries" (MQs). E.g., this makes the most sense if we think of the target function as some device that we are trying to reverse-engineer. We can put it into "normal operating conditions" and observe its behavior, but we can also poke it with inputs of our own choosing and see what it does. So, membership queries are a kind of "active learning". Can define the following models: PAC + MQ: Can sample from D and ask MQs. Goal: low error wrt D. Unif + MQ: Special case D=uniform. Can view setting is you have a black box that you can prod as you like and you want h that agrees with target on most of the domain. MB + MQ: Mistake-bound setting with queries. Pay $1 per mistake and pay $1 per query. Also called the "equivalence + membership query model". Exact learning with MQ only: need to recover target exactly using only queries. Pretty tough. E.g., can't even do conjunctions (why?). But you *can* do parity (how?) Cases where MQs make sense: trying to "break" encryption on a smart-card [mention Angluin's DFA algorithm], active exploration of environment, converting a complicated hypothesis (e.g., a neural net) into a more understandable form. Cases where MQs don't make so much sense: classifying documents/images (doesn't make sense to "walk" a document about topic X towards a document about topic Y and ask at what point the topic changes) Interesting area for project: define and prove something about other "reasonable" query types you might be able to ask in different settings. Plan for next few classes: - Some simple algorithms - MQ algs based on Fourier. (for learning wrt uniform) - Bshouty's alg: learn Decision trees in PAC+MQ (or MB+MQ) model. - Angluin's algorithm for learning Finite Automata. Extensions of Rivest&Schapire. Some Simple Algorithms ====================== Angluin's algorithm for learning Monotone DNF in MB+MQ model ------------------------------------------------------------ * Start with h(x) = "all-negative". * If predict negative on a positive example x: - "walk" x down to a minimal positive example (go left to right along x, flipping 1's to 0's so long as they keep the example positive). - The conjunction of the variables in x must be a term of c. So put it into h. * Since we maintain that all terms of h are also in c, we don't make any mistakes on negatives. Total number of mistakes is at most # of terms in target. Number of queries at most n per term. Learning functions with r = O(log n) relevant variables -------------------------------------------------------- - randomly sample until you get a positive and a negative example, or halt if have only seen one type after 2^r log(1/delta) examples. [Can instead use "(n,r)-universal sets"] - Walk the positive and negative toward each other to identify a relevant variable x. - Build decision tree with x at root. Recursively learn each subtree. - alg gives exact learning with MQ only. ======================================================================== Fourier-based MQ algorithms =========================== We know that any f over {0,1}^n can be written as a sum of parity functions: f(x) = \sum_i \hat{f}_i \phi_i(x) where \phi_i is the ith parity function, and \hat{f}_i is the correlation of f with \phi_i under the uniform distribution. Claim: using MQs, we can find all i such that (\hat{f}_i)^2 > tau in time poly(n, 1/tau). 1. Why is this interesting? 2. How can we do it? Why is this interesting? ------------------------ Can prove that any DNF formula has at least one non-negligible fourier coefficient in parity basis. So, this lets us WEAK-LEARN DNF over the uniform distribution. Also, any DECISION TREE has 1-epsilon of its "length" (sum of squares of coefficients) in a (poly(n/1/epsilon)) number of coefficients. So, this lets us strongly-learn decision trees, with respect to the uniform distribution. QUESTION: why is it enough to get "most" of the fourier coefficients? E.g., if we get all but epsilon of the (\hat{f}_i)^2, then the resulting g has (f-g)^2 = epsilon. The claim is that even if g doesn't correspond to a boolean function, we can use sign(g) as a good approx to f. The reason is that by definition, E[(f(x)-g(x))^2] = epsilon. Notice that every x where sign(g) is wrong contributes at least 1. So, the error of sign(g) is at most epsilon. Then Jackson showed how to use boosting to get strong learning of DNF over unif. [not trivial --- remember boosting changes the distribution] Won't prove claims about DNF, DTs today. Instead, let's go to alg for finding large fourier coeffs. ====================================================================== Kushilevitz-Mansour (Goldreich-Levin) algorithm: -------------------------------------------------- Let's index parity functions by their indicator vectors. Our goal is to find all \hat{f}_v such that (\hat{f}_v)^2 > \tau. Say S is the set of all indicator vectors. So the sum, over v in S, of the (\hat{f}_v)^2 equals 1. Suppose we could split S into two equal-sized sets S_0 and S_1 and for each one, ask for the sum of squares of the \hat{f}_v in that set. And, suppose we could do this recursively. Then what? Then we could build up a tree, splitting each node whose answer is > tau, and ignoring those < tau. - max width of tree is 1/tau. - Just keep going to leaves at depth n. => Only n/tau questions to ask. So, all boils down to: how can we ask for the sum of squares of the \hat{f}_v's inside these sets? First of all, to be able to do this, we're going to need to split these sets in a particular way. Specifically, we'll do this by fixing a prefix of the variables, and then the set will be "all parity functions whose indicator vectors agree with this prefix". For instance, S_0 = {v: v begins with 0} S_1 = {v: v begins with 1} Then, will split S_0 into S_00 = {v: v begins with 00} S_01 = {v: v begins with 01} etc. In general, our goal is now to estimate, for some string alpha of 0s and 1s, the sum of the squares of the \hat{f}_v's for all v that begin with alpha. (Note: we won't find the sum exactly. We'll find an estimate of the sum. But that's fine. It just means we throw out nodes whose estimate is < tau/2 instead of those whose estimate is < tau) There's now a neat trick for doing this. Let's do case of alpha = 00000 (a string of k zeros). So, we want the sum of squares over all parities that don't include any of first k variables. Remember, f is a {-1, +1}-valued function. The claim is that this sum just happens to equal the expected value, over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx). I.e. pick random suffix x. Pick two random prefixes y and z and multiply f(yx) by f(zx). Repeat many times and take the average. Intuition: independent y and z will cancel out correlations we don't want. E.g., what if f was a parity function? If f agrees with alpha, then f(yx)=f(zx) so E[f(yx)f(zx)] = 1. BUT, if f DOESN'T agree with alpha, then Pr[f(yx)=f(zx)] = 1/2, so E[f(yx)f(zx)] = 0. Now, since any f can be written as a sum of parities, hopefully we can reduce the case of general f to the case of "f is a parity function"... It's actually almost but not quite so easy. Question we need to answer is: for two (possibly different) vectors v,w what is E [ phi_v(yx) * phi_w(zx) ] ? x,y,z Claim: This is 0 if v != w, and is also 0 if v,w aren't all 0s on the first k bits. If v=w and they are all 0 on 1st k bits, then the expectation is 1. This is enough. Now we get: E[f(yx)f(zx)] = E[ (sum_v \hat{f}_v \phi_v(yx)) (sum_w \hat{f}_w \phi_w(zx))] = sum of squares of coeffs starting with k zeroes. So, we're done for the case of alpha = "all zeros". What about other alpha (i.e., requiring parity to have a certain fixed dependence on the first k bits)? Since we know the dependence (alpha) just "undo it". Specifically, we instead ask for: E[f(yx)f(zx)\phi_{\alpha}(y)\phi_{\alpha}(z)] where phi_alpha is the parity of the variables set to 1 in alpha. (then we get same as in above Claim, except the non-zero case is when v=w and they are equal to alpha on the first k bits). That's it. Actually, one technical point: we need to estimate an expectation. We'll do it by sampling. OK since the quantity inside the expectation is bounded (-1 or +1). In Jeff Jackson's result about DNF, what he shows is that instead of modifying the distribution, you can modify f to a non-boolean function. Then, need to argue that (based on how the boosting results work) f doesn't get too large, so you can still estimate f(yx)f(zx) from sampling. ============================================================================