15-859(B) Machine Learning Theory 02/23/09 * Computational hardness results for learning ========================================================================= Today: computational hardness results for learning plus some open problems. We saw before that if we are not concerned with computation time, then we can learn any class of function using a number of examples that is polynomial (actually, linear) in the number of bits to write down the target. But what happens when computation time enters the picture? HARDNESS RESULTS ================ Three kinds of computational hardness results in machine learning. (1) Representation dependent. Show that it's hard to find a low-error or near-optimal hypothesis in a given class. - E.g., it's NP-hard to determine whether a given dataset S is consistent with an intersection of 2 halfspaces. Same with 2-term DNF. (Reduction from hypergraph 2-coloring). This then implies it's hard to PAC-learn C by C since you could let D be the uniform distribution over S and use epsilon = 1/(|S|+1). - E.g., it's NP-hard to find the lowest-error OR-function on a given dataset. Same for linear separators. Even finding one of error < 49% under the assumption that there exists one of error 1% is NP-hard. These statements help explain why one expects to see local minima in neural nets, and why all algorithms for training a linear separator always end up minimizing something like hinge loss. On the other hand, they are really talking more about the difficulty of finding the optimal member of some hypothesis class rather than the inherent difficulty of the target concept. For instance, given data consistent with a 2-term DNF, we can easily find a consistent 2-CNF. (2) by restricting the way the algorithms can interact with the data. We'll see this soon when we discuss the Statistical Query model. These hardness results won't require complexity assumptions. The main drawback here is that it only applies to SQ algorithms. (3) Cryptographic-based hardness. Show that the class of functions contains cryptographic objects. E.g., show that any good learning algorithm would allow you to factor efficiently. These results are nice because they will apply without restricting the form of hypotheses. We are going to focus on hardness results of type (3) today. PRELIMINARIES: 4 learning models we will think about ==================================================== A) Standard PAC model, where h can be any poly-time computable function. B) Same as (A) but restrict D to be uniform on {0,1}^n C) Model (A) but give learning algorithm the additional ability to query target function f(x) on inputs x of its own choosing. These are called MEMBERSHIP QUERIES. They make most sense in contexts where you think of f as a device you want to reverse-engineer. So, algorithm can make MQs, and also see examples from D. Goal is to get low-error over D. D) Model (B) + Membership queries. Here you can think of querying as the *only* thing the algorithm can do, and goal is to find h that agrees with f on 99% of the points in {0,1}^n. Goal is to do learning in time polynomial in size of target function (recall that a "Concept Class" really is also a representation language. So, by "size" we mean size in that language.) Today we'll look at two results + mention some classic open problems. Let's start with the open problems. Some open problems ================== 1. Can DNF formulas be learned efficiently in model (A) or (B)? A very nice result by Jeff Jackson (in his CMU PhD thesis) showed how to learn DNF formulas in model (D). We'll see in a later class an algorithm for using queries to *weak learn* with respect to the uniform distribution. (Note: boosting doesn't *immediately* follow since this is for a specific distribution). Also, under crypto assumptions, it is known that model (C) is as hard as model (A). Intuitively, this is because of difficulty of SAT. (Being able to produce any interesting query at all.) Formally, assumption is the existence of digital signatures. 2. Can Decision Trees be learned efficiently in model (A) or (B)? Interestingly, you *can* learn DTs with queries in model (C). Will see that later. You gave an n^O(log(s))-time algorithm for model (A). 3. Can Monotone DNF be learned efficiently in model (B)? In model (A) they are the same as general DNF. In (B), you can weak-learn but no strong-learning algorithm is known. Best error rate known is 1/2 - 1/sqrt(n). Model (C) is easy (will see that later). Two hardness results for today: =============================== * Assuming factoring is hard, the class of Boolean Formulas (like DNF but not restricted to OR of ANDs -- you allow an arbitrary "formula tree") or equivalently, NC^1 (boolean circuits of logarithmic depth) is not weakly-learnable even in model (D). (Kharitonov) [equivalence between NC^1 and BF will be on future hwk...] * Under the same assumption, DFA's can't be learned over arbitrary distributions without MQs (model (A)). Later in this course we will see an algorithm that learns them with MQs. Kharitonov's result =================== NC^1 is the class of O(log n) depth, bounded fan-in ciruits. The only fact about NC^1 we will be using is that you can multiply n n-bit numbers modulo an n-bit number using a circuit in NC^1. The key idea now is to try to use this power to encode some cryptographic function that is as hard as factoring. Convenient function: BBS pseudorandom bit generator: - Given N = p*q where p and q are primes congruent to 3 mod 4, and given a seed x_0, define x_i = x_{i-1}^2 mod N. b_i = LSB(x_i) - The generator outputs b_0, b_1, b_2, ... This generator has the property that if you could distinguish (I) an m-bit string produced by first choosing a random x_0 and then running the generator m times, from (II) a truly random m-bit string, then you could use this to factor N (in time polynomial in m and running time of your algorithm). In fact, all you need is an algorithm that "weakly distinguishes" them, in the "weak-learning" sense. The idea now is to try to show that for any given N, x_0, there must exist an NC^1 circuit C_{x_0,N} that on input i, produces b_i as output. Before showing that, let's see how that would imply what we want. First of all, define C_{x_0,N,t} to be the same as C(x_0,N) except that all but the last t input bits are hardwired to 0 (it only looks at the last t bits of input i). So, this is in NC^1 too. Now, the way we will use this in a hardness proof is like this: say that A is a proposed poly-time weak learner for NC^1. So, for some k, if the target has n gates, A uses at most n^k time/queries/samples and gets error at most 1/2 - 1/n^k. Now, say we are given an m-bit string that's either of type (I) or (II) above, for m = n^{3k}. What we do is use this as a lookup table. Given an input i, we look at the last 3k*lg(n) bits of i and output the corresponding position in our bit string. Now, since A only has n^k queries or samples, when it comes time for A to make a prediction on a new example, the probability that the new random example has the same suffix as something it has seen already is at most n^k/n^{3k} = 1/n^{2k}. So, if our bit string was type (II), then A would have probability at most 1/2 + 1/n^{2k} of getting it right. On the other hand, if our bit string was type (I), then we have been consistent with C_{x_0,N,3k*log(n)} and so A should have success probability at least 1/2 + 1/n^k. This is a weak-distinguisher and so you could factor. So, we just need to argue that we can make an NC^1 circuit that on input i computes: LSB( (x_0)^{2^i} mod N ) We will do this in two steps. First show how to compute j =2^i mod phi(N) and then how to compute (x_0)^j mod N. Unfortunately, we can't directly do repeated-squaring in NC^1. However, here is a trick. We first precompute 2^1, 2^2, 2^4, 2^8, etc., modulo phi(N), and store these values in the circuit. Then, given input i, we use i to tell us which of these precomputed values we want to multiply together modulo phi(N). Remember, we can multiply n n-bit numbers modulo another n-bit number in NC^1. The second step works the same way. We precompute x_0, (x_0)^2, (x_0)^4, (x_0)^8, ... modulo N, and use j to tell us which of these to multiply together modulo N. Can get statements about AC^0 too (constant depth, unbounded fanin) using stronger assumptions about hardness of factoring. --------------------------- Now, we can use this to show that DFA are hard to learn in the PAC model when you don't have membership queries. Idea: given an NC^1 circuit f over {0,1}^n, will create DFA g over {0,1}^m where m = n*2^{depth(f)}. Also will define a simple mapping r: x -> xxx...x. The DFA g will have the property that g(r(x)) = f(x). Also the number of states in g will be polynomial in m. -> first, how would you create a DFA for the function x_2 OR not(x_4)? This is easy. In fact, notice we can do this so that the DFA always ends up in either a specific accept-state A or a specific reject-state R,and we can get rid of self-loops if we only care about inputs of size n. -> Given a DFA for the function f1 and a DFA for the function f2, how to create a DFA for "f1 AND f2", assuming we're allowed to replicate the input? Just make f1's A state be f2's start state, and have f1's R state go through a chain of dummy states to f2's R state. Similarly for OR. So we double the length of the input needed each time we go up 1 level in the circuit. That's it. Notice that we have *not* shown that DFA's are hard to learn with membership queries (good thing too, since we are going to give an algorithm that learns DFAs using membership queries). The reason is that only queries of the form "xxx...x" actually map back to legal queries for the original NC^1 circuit. We have also *not* shown that DFA's are hard to learn over the uniform distribution on n-bit inputs. (This is a nice open problem).