15-859(B) Machine Learning Theory                                      03/06/06

* Cryptographic hardness results for learning
* Some open computational problems in the PAC prediction model
=========================================================================

Today: computational hardness results for learning in the PAC prediction model, and some open problems.

We saw before that if we are not concerned with computation time, then we can learn any class of functions using a number of examples that is polynomial (actually, linear) in the number of bits needed to write down the target. But what happens when computation time enters the picture?

HARDNESS RESULTS
================
There are three kinds of computational hardness results in machine learning.

(1) Representation-dependent. Show that it's hard to find a low-error or near-optimal hypothesis in a given class.

 - E.g., it's NP-hard to determine whether a given dataset S is consistent with an intersection of 2 halfspaces. Same with 2-term DNF (reduction from hypergraph 2-coloring). This then implies it's hard to PAC-learn C by C, since you could let D be the uniform distribution over S and use epsilon = 1/(|S|+1).

 - E.g., it's NP-hard to find an OR function that minimizes the number of mistakes on an arbitrary dataset (direct reduction from set cover). Same for linear separators.

These statements can be very helpful. E.g., they explain why one expects to see local minima in neural nets, and why all algorithms for training a linear separator end up minimizing something that is related to, but not exactly the same as, the "number of mistakes made". On the other hand, they are really talking more about the difficulty of finding the optimal member of some hypothesis class than about the inherent difficulty of the target concept. For instance, given data consistent with a 2-term DNF, we can easily find a consistent 2-CNF.

Big open problem: can you PAC-learn the class of "intersections of 2 halfspaces" using some *other* hypothesis representation? Can you do agnostic learning of halfspaces or OR-functions using some other representation?

(2) Restricting the way the algorithm can interact with the data. We'll see this soon when we discuss the Statistical Query model. These hardness results won't require complexity assumptions. The main drawback is that they only apply to SQ algorithms.

(3) Cryptographic-based hardness. Show that the class of functions contains cryptographic objects. E.g., show that any good learning algorithm would allow you to factor efficiently. These results are nice because they apply without restricting the form of the hypotheses.

PRELIMINARIES: 3 learning models we will think about
====================================================
A) PAC prediction model.

B) PAC prediction over the uniform distribution D on {0,1}^n.

C) PAC model with the additional ability of the learning algorithm to query the target function f(x) on inputs x of its own choosing. These are called MEMBERSHIP QUERIES. So, the algorithm can make MQs, and also see examples from D. The goal is to get low error over D. Can look at a version of this where we fix D = uniform distribution.

In each case, the goal is to learn in time polynomial in the size of the target function (recall that a "concept class" really is also a representation language, so by "size" we mean size in that language).
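To make the three models concrete, here is a minimal Python sketch of the oracles each model gives the learner. The function names and the sampler interface are just illustrative, not standard notation:

    import random

    # Target f: {0,1}^n -> {0,1}; D is modeled as a sampler returning an n-bit tuple.

    def example_oracle(f, D, n):
        # Model (A): a labeled example (x, f(x)), with x drawn from an arbitrary D.
        x = D(n)
        return x, f(x)

    def uniform_example_oracle(f, n):
        # Model (B): same, but D is fixed to the uniform distribution on {0,1}^n.
        x = tuple(random.randint(0, 1) for _ in range(n))
        return x, f(x)

    def membership_query(f, x):
        # Model (C) adds this: the learner picks x itself and asks for f(x).
        return f(x)

In all three cases the learner's output hypothesis is still judged by its error over D (uniform D in model (B)).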
Today we'll look at two results, plus some open problems. Let's start with the open problems.

Some open problems
==================
1. Can DNF formulas be learned efficiently in model (A) or (B)? It is known that adding membership queries to the PAC prediction model (A) doesn't help for learning DNF (if you can do it with MQs, you can do it without). Intuitively, this is because of the difficulty of SAT (formally, one needs to look at digital signatures). On the other hand, Jeff Jackson (in his CMU PhD thesis) showed that you *can* use MQs to learn DNF formulas with respect to the uniform distribution (i.e., adding queries to model (B)).

2. Can decision trees be learned efficiently in model (A) or (B)? Interestingly, you *can* learn DTs in model (C) of PAC+MQs. We will see that later.

3. As a special case of problems (1) and (2), can the set of "functions that have only log(n) relevant variables" be learned efficiently in models (A) or (B)? [Note: the algorithm is not told *which* variables are relevant.] How about loglog(n) relevant variables? In that case, you even have time to try all possible truth-tables of width loglog(n), so the issue is just figuring out which variables are relevant. Can learn with MQs, though.

4. Can monotone DNF be learned efficiently in model (B)? In model (A) they are as hard as general DNF. In (B), you can weak-learn, but no strong-learning algorithm is known. The best error rate known is 1/2 - 1/sqrt(n).

Two hardness results for today:
===============================
* Assuming factoring is hard, the class NC^1 (boolean circuits of logarithmic depth) is not weakly learnable even if membership queries are allowed, and even if D = the uniform distribution over {0,1}^n (Kharitonov). This can even be extended to AC^0, if you assume that factoring is "really hard".

* Under the same assumption, DFAs can't be learned over arbitrary distributions without MQs. (Later in this course we will see an algorithm that learns them with MQs.)

Kharitonov's result
===================
NC^1 is the class of O(log n)-depth, bounded fan-in circuits. The only fact about NC^1 we will use is that you can multiply n n-bit numbers modulo another n-bit number in NC^1. The key idea is to use this power to encode some cryptographic function that is as hard as factoring.

Convenient function: the BBS pseudorandom bit generator:
 - Given N = p*q where p and q are primes congruent to 3 mod 4, and given a seed x_0, define x_i = (x_{i-1})^2 mod N and b_i = LSB(x_i).
 - The generator outputs b_0, b_1, b_2, ...

This generator has the property that if you could distinguish (a) an m-bit string produced by first choosing a random x_0 and then running the generator m times, from (b) a truly random m-bit string, then you could use this to factor N (in time polynomial in m and the running time of your algorithm). In fact, all you need is an algorithm that "weakly distinguishes" them, in the "weak-learning" sense.

The idea now is to show that for any given N, x_0, there must exist an NC^1 circuit that, on input i, produces b_i as output. Specifically, given input i, what we want to compute is:

    LSB( (x_0)^{2^i} mod N )

We will do this in two steps: first show how to compute j = 2^i mod phi(N), and then how to compute (x_0)^j mod N.

Unfortunately, we can't directly do repeated squaring in NC^1 (it is too sequential). However, here is a trick. We first precompute 2^1, 2^2, 2^4, 2^8, etc., modulo phi(N), and store these values in the circuit. Then, given input i, we use the bits of i to tell us which of these precomputed values to multiply together modulo phi(N). Remember, we can multiply n n-bit numbers modulo another n-bit number in NC^1. The second step works the same way: we precompute x_0, (x_0)^2, (x_0)^4, (x_0)^8, ..., modulo N, and use the bits of j to tell us which of these to multiply together modulo N.
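As a sanity check on the two-step construction, here is a small Python sketch (with toy parameters; nothing here is cryptographically meaningful, and the helper names are mine) verifying that "multiply together selected precomputed powers" produces the same bit b_i as running the generator i times:

    p, q = 19, 23                   # toy primes, both congruent to 3 mod 4
    N = p * q
    phi = (p - 1) * (q - 1)
    x0 = 101                        # seed, coprime to N

    def bbs_bit_iterative(i):
        # b_i by running the generator: x_t = (x_{t-1})^2 mod N, b_t = LSB(x_t)
        x = x0
        for _ in range(i):
            x = (x * x) % N
        return x & 1

    def bbs_bit_via_precomputed_powers(i):
        # b_i = LSB( x0^(2^i mod phi(N)) mod N ), computed the way the circuit does:
        # multiply together precomputed values, selected by the bits of the input.
        # Step 1: j = 2^i mod phi(N).  Precomputed table: 2^(2^k) mod phi(N).
        pow2 = [pow(2, 2 ** k, phi) for k in range(max(i.bit_length(), 1))]
        j = 1
        for k in range(i.bit_length()):
            if (i >> k) & 1:
                j = (j * pow2[k]) % phi
        # Step 2: x0^j mod N.  Precomputed table: x0^(2^k) mod N.
        powx = [pow(x0, 2 ** k, N) for k in range(max(j.bit_length(), 1))]
        x = 1
        for k in range(j.bit_length()):
            if (j >> k) & 1:
                x = (x * powx[k]) % N
        return x & 1

    assert all(bbs_bit_iterative(i) == bbs_bit_via_precomputed_powers(i)
               for i in range(200))

The point of the circuit version is that both tables are hardwired constants, and each step is a single product of selected n-bit numbers modulo an n-bit number, which NC^1 can do.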
Let C(x_0,N) be the NC^1 circuit that does this computation for a given x_0, N. Let C(x_0,N,t) be the same as C(x_0,N) except that all but the last t input bits are hardwired to 0 (it only looks at the last t bits of its input).

Formally, the way we will use this in a hardness proof is as follows. Say that A is a proposed n^k time/queries/samples algorithm to learn NC^1 circuits of size n with bias 1/2 + 1/n^k. Say we are given an N to factor. We pick a random x_0 as in BBS. Now, algorithm A must work on circuits of the form C(x_0,N,3k*lg(n)). Even though we can't construct this circuit (that required knowing phi(N)), we can simulate it by just producing the first n^{3k} entries in the BBS sequence. Now, since A only gets n^k queries or samples, when it comes time for A to make a prediction on a new random example, the probability that the new example has the same suffix as something A has already seen is at most n^k/n^{3k} = 1/n^{2k}; so with probability at least 1 - 1/n^{2k}, A is being asked to predict a bit of the sequence it has never seen. So, if A could really get a bias as good as 1/2 + 1/n^k, that means A could be used to get a weak bias on whether the string b_0, b_1, ..., b_{n^{3k}} was pseudorandom or truly random (and therefore A could be used to factor in polynomial time).

Can get statements about AC^0 too (constant depth, unbounded fan-in) using stronger assumptions about the hardness of factoring.

---------------------------

Now, we can use this to show that DFAs are hard to learn in the PAC model when you don't have membership queries.

Idea: given an NC^1 circuit f over X = {0,1}^n, we will create a DFA g over Y = {0,1}^m, where m = n*2^{depth(f)}. We will also define a simple mapping h: X -> Y, h(x) = xxx...x (x repeated 2^{depth(f)} times). The DFA g will have the property that g(h(x)) = f(x). Also, the number of states in g will be polynomial in m.

-> First, how would you create a DFA for the function x_2 OR not(x_4)? This is easy. In fact, notice we can do this so that the DFA always ends up in either a specific accept-state A or a specific reject-state R, and we can get rid of self-loops if we only care about inputs of length n.

-> Given a DFA for the function f1 and a DFA for the function f2, how do we create a DFA for "f1 AND f2", assuming we're allowed to replicate the input? Just make f1's A state be f2's start state, and have f1's R state go through a chain of dummy states to f2's R state. Similarly for OR. So we double the length of the input needed each time we go up 1 level in the circuit. That's it. (A small sketch of this gadget appears at the end of these notes.)

Notice that we have *not* shown that DFAs are hard to learn with membership queries (a good thing, too, since we are going to give an algorithm that learns DFAs using membership queries). The reason is that only queries of the form "xxx...x" actually map back to legal queries for the original NC^1 circuit.
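Here is a minimal Python sketch of the two building blocks above: a DFA for a single literal over fixed-length inputs, and the AND gadget that chains two DFAs on a doubled input. The dictionary-based representation and the helper names are mine, chosen for brevity; the OR gadget is symmetric (glue at the reject state instead), so only AND is tested here, on x_2 AND not(x_4) over all 4-bit inputs.

    from itertools import product

    def literal_dfa(n, var, negated=False):
        # DFA over {0,1}^n that accepts iff bit number `var` (1-indexed) is 1
        # (or 0, if negated).  It always ends in a dedicated accept state A or
        # reject state R, and has no self-loops, as in the construction above.
        trans = {}
        for i in range(var - 1):                       # skip bits before the literal
            for b in (0, 1):
                trans[(("pre", i), b)] = ("pre", i + 1)
        for b in (0, 1):                               # branch on the literal's bit
            sat = (b == 1) != negated
            trans[(("pre", var - 1), b)] = (("sat" if sat else "unsat"), var)
        for i in range(var, n):                        # skip the remaining bits
            for b in (0, 1):
                trans[(("sat", i), b)] = ("sat", i + 1)
                trans[(("unsat", i), b)] = ("unsat", i + 1)
        return {"trans": trans, "start": ("pre", 0), "length": n,
                "accept": ("sat", n), "reject": ("unsat", n)}

    def dfa_and(g1, g2):
        # AND gadget on input yz: g1 runs on y; its accept state A1 is glued to
        # g2's start state, and its reject state R1 walks a chain of dummy states
        # to g2's reject state R2 while the second half of the input is ignored.
        n2 = g2["length"]
        trans = {}
        for (s, b), t in g1["trans"].items():          # tag states to avoid clashes
            trans[((1, s), b)] = (1, t)
        for (s, b), t in g2["trans"].items():
            trans[((2, s), b)] = (2, t)
        for b in (0, 1):                               # A1 behaves like g2's start
            trans[((1, g1["accept"]), b)] = (2, g2["trans"][(g2["start"], b)])
        prev = (1, g1["reject"])                       # R1 -> dummy chain -> R2
        for i in range(n2 - 1):
            for b in (0, 1):
                trans[(prev, b)] = ("dummy", i)
            prev = ("dummy", i)
        for b in (0, 1):
            trans[(prev, b)] = (2, g2["reject"])
        return {"trans": trans, "start": (1, g1["start"]),
                "accept": (2, g2["accept"]), "reject": (2, g2["reject"]),
                "length": g1["length"] + n2}

    def dfa_run(g, bits):
        s = g["start"]
        for b in bits:
            s = g["trans"][(s, b)]
        return s == g["accept"]

    # Test: f(x) = x_2 AND not(x_4) on 4-bit x, fed to the DFA as the doubled input xx.
    g = dfa_and(literal_dfa(4, 2), literal_dfa(4, 4, negated=True))
    for x in product((0, 1), repeat=4):
        assert dfa_run(g, list(x) + list(x)) == (x[1] == 1 and x[3] == 0)

Note how the input length doubles at each AND, matching the m = n*2^{depth(f)} bound, and how the gadget relies on the literal DFAs reaching A or R only after reading their entire input.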