15-859(A) Machine Learning Theory                               03/16/04

Membership Query learning, contd
 * KM alg recap (finding large Fourier coefficients)
 * Bshouty's algorithm for learning decision trees and XORs of terms
 * Hardness results for learning
=======================================================================
Recap from last time:

- Membership queries: we get to ask for f(x) on examples x of our choosing.
  Think of reverse-engineering some black-box device.  Assume the input
  space is {0,1}^n.

- KM algorithm: finds all large (size > 1/poly) Fourier coefficients over
  the parity basis.  I.e., finds all parity functions that correlate by at
  least 1/poly (or at most -1/poly) with the target wrt the uniform
  distribution.

  -> If most of the "length" of f is in large Fourier coefficients, then
     we get a good approximation wrt the uniform distribution.
  -> E.g., we saw this was true for decision trees.  So this gives a
     poly-time algorithm for using MQs to learn DTs wrt the uniform
     distribution.  It also gives weak-learning for DNF, which can then be
     strengthened to strong-learning [didn't do in class].
======================================================
The question left open last time was: can you use MQs to learn DTs wrt
arbitrary distributions?

To put this into context: it turns out you can show that a bad
distribution can make MQs worthless for learning DNF, if crypto-hard
functions exist.  (A similar statement is that, modulo some crypto
assumptions, I could construct a 3-SAT formula such that even if you ask
me for a bunch of satisfying assignments, you still can't construct a new
one that I haven't shown you yet.)  But this argument doesn't go through
for decision trees.

Problem solved by Bshouty: an algorithm for learning DTs in the PAC+MQ
model.  Note: the algorithm won't be very practical, in the sense that
not only does it use MQs, it is also very sensitive to noise.  (In
contrast, the Fourier algorithm would still be reasonable if there was
random noise, since noise would just affect the correlations in a
predictable way.)
======================================================
Here is a simpler version of Bshouty's original algorithm, from an email
message that he sent me (not sure if this simpler version is in print).
You'll see some similarities to KM.

In fact, the algorithm works for a larger class: "XORs of terms".  We'll
assume the target can be written as f = T_1 XOR T_2 XOR ... XOR T_m,
where the T_i are conjunctions.  Why is this a generalization of DTs?
(A decision tree can be written this way: take one term per leaf labeled
1; since an example satisfies at most one such term, the OR of these
terms equals their XOR.)

We're going to solve the problem by building a decision-tree-like thing
(a "parity decision graph"?) of depth n and width O(m).  So, by Occam
results, we just need to take some initial sample S of size
O(nm/epsilon), and then use our membership queries to find a consistent
hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2 at
the next level, etc., until we get to > m leaves.  This will happen at
some depth d = log(m).

Define T_1', ..., T_m' to be what you get by taking the original terms
T_1, ..., T_m and removing all mention of the variables x_1, ..., x_d
from them.  So, if we look at any leaf of our tree so far, the value of
the target function, given that the example takes you to that leaf, can
be written as an XOR of some subset of the T' terms.

For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4),
and say d = 2.  Then the leaves look like:
    (T1' XOR T3'),  T1',  (T2' XOR T3'),  T2'.

Now, the claim is that since there are > m leaves, some of these must be
XORs of others.
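As a quick sanity check on this example, here is a small brute-force
sketch (my own addition, not from the lecture; x is represented as a
0-indexed bit tuple, so x_1 is x[0]):

from itertools import product

# f = T1 XOR T2 XOR T3, with x given as a 0-indexed tuple (so x_1 is x[0]).
T1 = lambda x: x[0] & x[2] & x[4]              # x1 x3 x5
T2 = lambda x: (1 - x[0]) & x[3] & x[5]        # not(x1) x4 x6
T3 = lambda x: x[1] & x[2] & (1 - x[3])        # x2 x3 not(x4)
f  = lambda x: T1(x) ^ T2(x) ^ T3(x)

# The T' terms: drop x1 and x2 (the variables branched on, d = 2).
T1p = lambda x: x[2] & x[4]                    # T1' = x3 x5
T2p = lambda x: x[3] & x[5]                    # T2' = x4 x6
T3p = lambda x: x[2] & (1 - x[3])              # T3' = x3 not(x4)

# Leaf reached by each setting of (x1, x2), and the claimed XOR of T' terms.
leaf = {(1, 1): lambda x: T1p(x) ^ T3p(x),
        (1, 0): lambda x: T1p(x),
        (0, 1): lambda x: T2p(x) ^ T3p(x),
        (0, 0): lambda x: T2p(x)}

for x in product((0, 1), repeat=6):
    assert f(x) == leaf[(x[0], x[1])](x)
print("each leaf's XOR of T' terms agrees with f on all 64 inputs")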
In fact, there must exist some set L of at most m leaves such that every
other leaf is an XOR of some subset of the leaves in L.  (L is a maximal
linearly-independent set mod 2.)  E.g., in the example above, any one
leaf is the XOR of the other three.

So the plan is now:
 1. (somehow) find such a set L, and write down, for each other leaf, how
    it can be written as an XOR of leaves in L.  (Oops: the description
    length of a level could be as high as m^2, so let's have S be of size
    nm^2/epsilon to be safe....)
 2. keep building the tree down from the leaves in L until we have more
    than m leaves again, and repeat.
 3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1: for each example, vary the first d bits (using MQs) to make
the example reach each of the active leaves.
  ==> We get a vector of outputs, i.e., what the output would have been
      if the first d bits were set to take you to leaf i.
If we write this out for all the examples, we get a big matrix with one
row per example, where the set L corresponds to a set of <= m columns
such that every other column is some linear combination of these (mod 2).
Find these with Gaussian elimination (a small sketch of this step appears
below, after the correctness argument).

(What's a little confusing here is: say this tells us that leaf 1 is the
XOR of leaves 2, 3, 5.  We don't really know if this is true over ALL of
{0,1}^n.  We just know it's true over examples with suffixes matching S.
But that's OK, since our goal is just to find a hypothesis consistent
with S.)

To do 3: follow the example down the tree.  When you get to an "inactive"
node, replace that position by the *set* of positions listed at the node.
So, our "current location" in the tree is now a set of nodes.  More
generally, if our current set of nodes is N and some node y in N is
inactive, then move to the set (N - {y}) XOR (positions listed at y).
Finally, as we go down the tree, the T' terms become constant 1 or 0, and
we just take the XOR.

Why is this legal?  We can see this by arguing bottom-up.  Say (as above)
that L is the set of active nodes at the current level, and inductively
assume that the trees rooted at nodes of L are correct on all examples
produced by taking some x in S and replacing its prefix by whatever is
needed to reach that node.  Then (by construction) the inactive nodes
(the ones not in L) are also correct on examples produced by taking some
x in S and replacing its prefix with whatever was needed to reach them.
Therefore, all *parents* of this level are correct on examples produced
by taking some x in S and replacing prefixes.  And so on.
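Here is a small sketch of the linear algebra in step 1 (my addition, in
Python, not Bshouty's actual code): given the 0/1 matrix of MQ answers,
one row per example in S and one column per active leaf, it runs Gaussian
elimination mod 2 to pick a maximal independent set of columns L and
records, for every other column, which columns of L XOR to it.

def independent_leaves(M):
    """M: 0/1 matrix with one row per example in S and one column per active
    leaf; M[i][j] is the MQ answer f(example i, first d bits reset to reach
    leaf j).  Returns (L, expr): L indexes a maximal set of columns that is
    linearly independent mod 2, and expr[j] is the subset of L whose XOR
    equals column j, for every column j not in L."""
    cols = list(zip(*M))                 # work column by column
    L, expr, pivots = [], {}, []         # pivots: (pivot row, reduced col, subset of columns it equals)
    for j, col in enumerate(cols):
        v, used = list(col), set()
        for r, pv, pl in pivots:         # reduce column j against the chosen pivots
            if v[r]:
                v = [a ^ b for a, b in zip(v, pv)]
                used ^= pl               # track the combination (symmetric difference)
        if any(v):                       # independent: keep column j as a new element of L
            pivots.append((v.index(1), v, {j} ^ used))
            L.append(j)
        else:                            # dependent: column j is the XOR of the columns in `used`
            expr[j] = used
    return L, expr

# Toy matrix: 5 examples, 4 active leaves; column 3 is the XOR of columns
# 0, 1, 2 (equivalently, each leaf is the XOR of the other three, as in the
# example above).
M = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 1]]
print(independent_leaves(M))             # ([0, 1, 2], {3: {0, 1, 2}})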
========================================================
Summary of learnability results
===============================
PAC+MQ (arbitrary distribution, with membership queries):
 - monotone DNF
 - decision trees
 - XORs of terms
 - DFAs (target is a DFA of n states; inputs are strings.  Will do this
   next class.)

MQ + uniform (membership queries; learning wrt the uniform distribution):
 - general DNF formulas

No queries, but allow n^polylog(n) time:
 - decision trees
 - DNF over the uniform distribution; AC^0 too [didn't do this in class]

No queries, polynomial time:
 - decision lists, LTFs, k-CNF, k-DNF.
========================================================
HARDNESS RESULTS
================
Three kinds of computational hardness results in machine learning.

(1) Representation dependent.  Show that it's hard to find a consistent
    or optimal hypothesis in a given class.
    - E.g., it's NP-hard to find an OR function that minimizes the number
      of mistakes on an arbitrary dataset (direct reduction from set
      cover).  Same for linear separators.
    - E.g., it's NP-hard to find an intersection of 2 halfspaces (an AND
      of two linear separators) that is consistent with a dataset.  Same
      for 2-term DNF.  Reduction from hypergraph 2-coloring.

    These statements can be very helpful.  E.g., they explain why one
    expects to see local minima in neural nets, and why all algorithms
    for training a linear separator end up minimizing something that is
    related to, but not exactly the same as, the number of mistakes made.
    On the other hand, they are really about the difficulty of finding
    the optimal member of some hypothesis class rather than the inherent
    difficulty of the target concept.  For instance, given data
    consistent with a 2-term DNF, we can easily find a consistent 2-CNF.

(2) Restricting the way the algorithm can interact with the data -- e.g.,
    the SQ model.  Here we got very strong hardness results.  What's nice
    is that we could also talk about hard distributions over targets --
    the hard case is when the target is picked at random from a large set
    of orthogonal functions.  No complexity assumptions needed.  The main
    drawback is that it only applies to SQ algorithms.

(3) Cryptographic-based hardness.  Show that the class of functions
    contains cryptographic objects.  Today we'll look (briefly) at two
    results:
    * Assuming factoring is hard, the class NC^1 (boolean circuits of
      logarithmic depth) is not weakly learnable, even if membership
      queries are allowed and even if D = uniform.  (Kharitonov)
    * Under the same assumption, DFAs can't be learned over arbitrary
      distributions without MQs.

Kharitonov's result
===================
DEFINITION: NC^1 is the class of O(log n)-depth, bounded fan-in circuits.
It is equivalent to the class of "boolean formulas" (like DNF, but you
can put ORs, ANDs, and NOTs together however you want).  The difference
between a "circuit" and a "formula" is that each gate in a circuit can
have arbitrary fan-out, but in a formula all gates have fan-out 1.
[Proving equivalence will probably be on hwk5]

Important property of NC^1: you can multiply n n-bit numbers modulo
another n-bit number in NC^1.

The cryptographic object we'll put into the NC^1 circuit is the BBS
pseudorandom bit generator: given N = p*q where p and q are primes
congruent to 3 mod 4, and given a seed s = s_0, define
    s_i = s_{i-1}^2 mod N,      b_i = LSB(s_i).
The generator outputs b_0, b_1, b_2, ...

This generator has the property that if you could distinguish (A) an
m-bit string (where m is poly(n)) produced by first choosing a random s
and then running the generator m times, from (B) a truly random m-bit
string, then you could use this to factor N in poly time.  In fact, all
you need is an algorithm that "weakly distinguishes" them, in the
"weak-learning" sense.

--> The idea is to encode N, p, q, and s into an NC^1 circuit so that on
input i, the circuit produces b_i as output.  It's not obvious how to do
this, since we need to compute LSB(s^{2^i} mod N), but let's first look
at why it gives a hardness result.

The point is that if an algorithm were able, by asking queries, to
determine (or even get a good bias on) the value f(x) of some x it hadn't
yet seen, it would be able to distinguish the b_i's from random.  One
technical issue is that the BBS generator is only known to be secure if
you are looking at an initial poly(n)-sized segment of the b_i's.  So, to
deal with this, imagine a version of this circuit where we mask off all
but the last k*log(n) bits of the input (we set the rest to 0).  If
someone's magic algorithm claims, say, to get bias 1/2 + 1/n^4 after n^5
queries, then we set k = 10.  After n^5 queries, the probability that a
new random example has the same suffix as something already seen is at
most n^5/n^10 = 1/n^5.  So, if A could really get a bias as good as
1/2 + 1/n^4, that means A could be used to get a weak bias on whether the
string b_0, b_1, ..., b_{n^10} was pseudorandom or truly random (and
therefore A could be used to factor in polynomial time).

--> To implement the generator in NC^1 we use the trick of precomputing
useful values.  Specifically, given input i, we compute
LSB(s^{2^i} mod N) in two steps.  First we compute j = 2^i mod phi(N),
and then we compute s^j mod N.  To do the first step, we precompute
2^1, 2^2, 2^4, 2^8, etc., modulo phi(N), and store these values in the
circuit.  Then, given input i, we use the bits of i to tell us which of
these precomputed values to multiply together modulo phi(N).  Remember,
we can multiply n n-bit numbers modulo another n-bit number in NC^1.  The
second step works the same way: we precompute s, s^2, s^4, s^8, ...
mod N, and use the bits of j to tell us which of these to multiply
together modulo N.
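Here is a quick numerical sanity check of that two-step trick (a sketch
I'm adding, with toy parameters chosen for illustration; a real instance
would use large primes).  It compares b_i computed directly from the
definition, by repeated squaring, against the value obtained from the
precomputed tables.

# Toy parameters (illustrative only -- a real instance uses large primes).
p, q = 7, 19                 # both congruent to 3 mod 4
N, phi = p * q, (p - 1) * (q - 1)
s = 11 * 11 % N              # seed: a quadratic residue mod N, coprime to N

def bbs_bit_direct(i):
    """b_i straight from the definition: square the seed i times mod N, take the LSB."""
    t = s
    for _ in range(i):
        t = t * t % N
    return t & 1

# Values "wired into" the circuit: 2^(2^k) mod phi(N) and s^(2^k) mod N.
n_bits = 8                   # the circuit's input i is an n_bits-bit index
exp_table = [pow(2, 2 ** k, phi) for k in range(n_bits)]
pow_table = [pow(s, 2 ** k, N) for k in range(phi.bit_length())]

def bbs_bit_precomputed(i):
    """b_i via the two steps from the notes: j = 2^i mod phi(N), then s^j mod N.
    Each step multiplies together a subset of precomputed values, selected by
    the bits of its input -- an iterated modular product, which NC^1 can do."""
    j = 1
    for k in range(n_bits):
        if (i >> k) & 1:
            j = j * exp_table[k] % phi
    out = 1
    for k in range(len(pow_table)):
        if (j >> k) & 1:
            out = out * pow_table[k] % N
    return out & 1

assert all(bbs_bit_direct(i) == bbs_bit_precomputed(i) for i in range(2 ** n_bits))
print("direct squaring and the precomputed two-step method agree on all indices")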
=========Probably save for next time====================================
Now we can use this to show that DFAs are hard to learn in the PAC model
when you don't have membership queries.

Idea: given an NC^1 circuit f over X = {0,1}^n, we will create a DFA g
over Y = {0,1,#}*.  We will also define a simple mapping h: X -> Y (the
mapping will be x -> x#x#x#...#x).  The DFA g will have the property that
g(h(x)) = f(x).  The number of copies of x will be 2^depth(f).

-> First, how would you create a DFA for a function like
   x_2 OR not(x_4)?  This is easy.  In fact, to help us later, we can do
   it so that the DFA always ends up in either a specific accept state A
   or a specific reject state R.

-> Given a DFA for the function f and a DFA for the function g, how do we
   create a DFA for "f AND g", assuming we're allowed to replicate the
   input?  This is easy too: the "#" transition from f's R state goes
   directly to g's R state, and the "#" transition from f's A state goes
   to g's start state.  So we double the length of the input needed each
   time we go up one level in the circuit.

That's it.  Notice that we have *not* shown that DFAs are hard to learn
with membership queries (a good thing, too, since we are going to give an
algorithm that learns DFAs using membership queries).  The reason is that
only queries of the form "x#x#x#x#x#x#x" actually map back to legal
queries for the original NC^1 circuit.
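To make the construction concrete, here is a small simulation sketch (my
own encoding and state names, not from the notes; the base case here is a
single literal rather than the two-literal formula above).  It builds two
literal DFAs, applies the "#"-transition AND-composition described above,
and checks g(h(x)) against x_2 AND not(x_4) on every input for a small n.

from itertools import product

def literal_dfa(n, k, positive=True):
    """DFA that reads one n-bit copy of x and ends in state ('A',) iff bit k
    (0-indexed) satisfies the literal, otherwise in ('R',).  States (i, out)
    record how many bits were read and the outcome once bit k has been seen."""
    A, R = ('A',), ('R',)
    trans = {}
    for i in range(n):
        for out in (None, A, R):
            if (out is None) != (i <= k):
                continue                         # outcome is known exactly after bit k
            for b in '01':
                nxt_out = (A if (b == '1') == positive else R) if i == k else out
                trans[((i, out), b)] = nxt_out if i + 1 == n else (i + 1, nxt_out)
    for c in '01#':
        trans[(R, c)] = R                        # R absorbs whatever comes after a rejection
    return {'trans': trans, 'start': (0, None), 'A': A, 'R': R, 'copies': 1}

def relabel(dfa, tag):
    ren = lambda s: (tag, s)
    return {'trans': {(ren(u), c): ren(v) for (u, c), v in dfa['trans'].items()},
            'start': ren(dfa['start']), 'A': ren(dfa['A']), 'R': ren(dfa['R']),
            'copies': dfa['copies']}

def and_dfa(f, g):
    """The AND trick from the notes: f reads its copies of x first; on '#',
    f's A state hands off to g's start, and f's R state jumps straight to g's R."""
    f, g = relabel(f, 'f'), relabel(g, 'g')
    trans = {**f['trans'], **g['trans']}
    trans[(f['A'], '#')] = g['start']
    trans[(f['R'], '#')] = g['R']
    return {'trans': trans, 'start': f['start'], 'A': g['A'], 'R': g['R'],
            'copies': f['copies'] + g['copies']}

def accepts(dfa, w):
    state = dfa['start']
    for c in w:
        state = dfa['trans'][(state, c)]
    return state == dfa['A']

# g computes x_2 AND not(x_4) (1-indexed, as in the notes) on input h(x) = x#x.
n = 5
g = and_dfa(literal_dfa(n, 1), literal_dfa(n, 3, positive=False))
for bits in product('01', repeat=n):
    x = ''.join(bits)
    w = '#'.join([x] * g['copies'])
    assert accepts(g, w) == (x[1] == '1' and x[3] == '0')
print("g(h(x)) agrees with x_2 AND not(x_4) on all", 2 ** n, "inputs")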