15-859(B) Machine Learning Theory                                03/14/02

* Bshouty's algorithm for learning decision trees and XORs of terms
* Cryptographic hardness results for learning

ANNOUNCEMENT: starting next week, class will be in Wean 4623.
=========================================================================

Bshouty's algorithm
===================
-> see notes from last time.

Summary of learnability results
===============================

PAC+MQ (arbitrary distribution, with membership queries):
 - monotone DNF
 - decision trees
 - XORs of terms
 - CNF intersect DNF (functions with poly-size CNF and poly-size DNF
   representations.  Didn't prove this in class)
 - DFAs (target is a DFA of n states.  Inputs are strings.  Will do this
   next class)

MQ/uniform approx (membership queries; learning wrt the uniform distribution):
 - general DNF formulas

No queries, but allow n^polylog(n) time:
 - decision trees
 - DNF over the uniform distribution.  [can show this now]
 - AC^0 over the uniform distribution.  [uses Fourier]

No queries, polynomial time:
 - decision lists, LTFs, k-CNF, k-DNF.

HARDNESS RESULTS
================
Three kinds of computational hardness results in machine learning.

(1) Representation dependent.  Show that it's hard to find a consistent
    or optimal hypothesis in a given class.

     - E.g., it's NP-hard to find an OR function that minimizes the
       number of mistakes on an arbitrary dataset (direct reduction from
       set cover).  Same for linear separators.
     - E.g., it's NP-hard to find an intersection of 2 halfspaces (an AND
       of two linear separators) that is consistent with a dataset.  Same
       with 2-term DNF.  Reduction from hypergraph 2-coloring.

    These statements can be very helpful.  E.g., they explain why one
    expects to see local minima in neural nets, and why all algorithms
    for training a linear separator end up minimizing something that is
    related to, but not exactly the same as, the number of mistakes made.
    On the other hand, they are really about the difficulty of finding
    the optimal member of some hypothesis class rather than the inherent
    difficulty of the target concept.  For instance, given data
    consistent with a 2-term DNF, we can easily find a consistent 2-CNF.

    Big open problem: can you PAC-learn the class of "intersections of
    2 halfspaces" using some *other* hypothesis representation?

    Interestingly, the XOR of two halfspaces is easier.  Using the fact
    that "(P(x) > 0) XOR (Q(x) > 0)" is the same as "P(x)*Q(x) < 0"
    (so long as we don't care about the "= 0" case), we can learn using
    the class of "degree 2 threshold functions".  (A small sketch of this
    appears at the end of this overview, just before Kharitonov's result.)

(2) Restricting the way the algorithm can interact with the data -- e.g.,
    the SQ model.  Here we got very strong hardness results.  What's nice
    is we could also talk about hard distributions over the targets --
    the hard case is when the target is picked at random from the large
    set of orthogonal functions.  No complexity assumptions.  The main
    drawback is that it only applies to SQ algorithms.

(3) Cryptographic-based hardness.  Show that the class of functions
    contains cryptographic objects.  Today we'll look at two results:

 * Assuming factoring is hard, the class NC^1 (boolean circuits of
   logarithmic depth) is not weakly learnable even if membership queries
   are allowed, and even if D = the uniform distribution over {0,1}^n
   (Kharitonov).  Can even extend to AC^0, if you assume that factoring
   is "really hard".

 * Under the same assumption, DFAs can't be learned over arbitrary
   distributions without membership queries.
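To make the degree-2 reduction mentioned under (1) concrete, here is a
minimal Python sketch (not from the lecture).  The halfspace weights
p, p0, q, q0 are made-up toy values; the point is that -P(x)*Q(x) expands
into a linear function of the degree-2 monomials 1, x_i, x_i*x_j, so the
XOR of the two halfspaces is a single degree-2 threshold function.

    from itertools import product

    n = 4  # toy dimension, small enough to check exhaustively

    # Hypothetical target: the XOR of the halfspaces P(x) > 0 and Q(x) > 0,
    # where P(x) = p.x + p0 and Q(x) = q.x + q0.  The weights are made up
    # for this example; the half-integer offsets keep P and Q away from the
    # "= 0" boundary on {0,1}^n.
    p, p0 = [2, -2, 1, 1], -0.5
    q, q0 = [-2, 1, 2, -1], 0.5

    def P(x): return sum(pi * xi for pi, xi in zip(p, x)) + p0
    def Q(x): return sum(qi * xi for qi, xi in zip(q, x)) + q0
    def target(x): return (P(x) > 0) != (Q(x) > 0)   # XOR of the halfspaces

    def features(x):
        # degree-2 feature expansion: 1, x_i, and x_i*x_j for i <= j
        return [1] + list(x) + [x[i] * x[j] for i in range(n)
                                            for j in range(i, n)]

    # Explicit weights w with w.features(x) = -P(x)*Q(x), so the XOR is a
    # single linear threshold over the degree-2 features.
    w = [-(p0 * q0)]
    w += [-(p0 * q[i] + q0 * p[i]) for i in range(n)]
    for i in range(n):
        for j in range(i, n):
            w.append(-(p[i] * q[i]) if i == j
                     else -(p[i] * q[j] + p[j] * q[i]))

    def degree2_threshold(x):
        return sum(wi * fi for wi, fi in zip(w, features(x))) > 0

    for x in product([0, 1], repeat=n):
        assert degree2_threshold(x) == target(x)
    print("the XOR of the two halfspaces is a degree-2 threshold function")

Any linear-separator learner run over these expanded features can
therefore represent the XOR of two halfspaces (modulo the "= 0" boundary
caveat above).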
Kharitonov's result
===================
--> describe NC^1.

NC^1 is the class of O(log n) depth, bounded fan-in circuits.  It is
equivalent to the class of "boolean formulas".  See hwk6.  The difference
between a "circuit" and a "formula" is that each gate in a circuit can
have arbitrary fan-out, but in a formula every gate has fan-out 1.

--> describe the BBS (Blum-Blum-Shub) pseudorandom bit generator.

Given N = p*q where p and q are primes congruent to 3 mod 4, and given a
seed x_0, define

    x_i = x_{i-1}^2 mod N
    b_i = LSB(x_i)

The generator outputs b_0, b_1, b_2, ...

This generator has the property that if you could distinguish (A) an
s-bit string (where s is poly(n)) produced by first choosing a random x_0
and then running the generator s times, from (B) a truly random s-bit
string, then you could use this to factor N in poly time.  In fact, all
you need is an algorithm that "weakly distinguishes" them, in the
"weak-learning" sense.

--> The idea now is to encode N, p, q, and x_0 into an NC^1 circuit so
    that on input i, the circuit produces b_i as output.

--> Specifically: given an input x, we will look at the last k*log(n)
    bits of x and view them as a number i between 0 and n^k.  The circuit
    will then output b_i.  The reason we only use the last k*log(n) bits
    is that technically, the BBS pseudorandom bit generator is only known
    to be secure if you are looking at an initial poly(n)-sized segment
    of the b_i's.

The way we will use this in a hardness proof is as follows: say that A is
a proposed n^5 time/queries/samples algorithm to learn NC^1 circuits with
bias 1/2 + 1/n^4.  To foil this algorithm we build a circuit with k = 10.
Since A only has n^5 queries or samples, when it comes time for A to make
a prediction on a new example, the probability that the new random
example has the same suffix as something A has seen already is at most
n^5/n^10 = 1/n^5.  So, if A could really get a bias as good as
1/2 + 1/n^4, that means A could be used to get a weak bias on whether the
string b_0, b_1, ..., b_{n^k} was pseudorandom or truly random (and
therefore A could be used to factor in polynomial time).

--> The only fact about NC^1 we will be using is that you can multiply
    n n-bit numbers modulo an n-bit number.

--> To implement the generator in NC^1 we use the trick of precomputing a
    bunch of values.  Specifically, given input i, what we want to
    compute is

        LSB( (x_0)^{2^i} mod N )

    We will do this in two steps.  First we compute j = 2^i mod phi(N),
    and then we compute (x_0)^j mod N.

    To do the first step, we precompute 2^1, 2^2, 2^4, 2^8, etc., modulo
    phi(N), and store these values in the circuit.  Then, given input i,
    we use the bits of i to tell us which of these precomputed values we
    want to multiply together modulo phi(N).  Remember, we can multiply
    n n-bit numbers modulo another n-bit number in NC^1.

    The second step works the same way.  We precompute x_0, (x_0)^2,
    (x_0)^4, (x_0)^8, ..., modulo N, and use the bits of j to tell us
    which of these to multiply together modulo N.  (A small sketch of
    this computation appears below.)

Can get statements about AC^0 too (constant depth, unbounded fan-in)
using stronger assumptions about the hardness of factoring.
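Before moving on, here is a minimal Python sketch (not from the lecture)
of the BBS generator and of the two-step precomputation above.  The
primes p, q, the seed x_0, and the table size T are toy values made up
for illustration; in the real construction the tables are hard-wired into
the circuit and the products are computed by NC^1 iterated-multiplication
circuits, whereas here we just simulate the arithmetic.  The check at the
end confirms that iterating the generator agrees with the "indexed"
computation LSB( x_0^{2^i mod phi(N)} mod N ) from the precomputed tables.

    # Toy parameters: both primes are congruent to 3 mod 4 (far too small
    # to be secure; chosen only so the example runs instantly).
    p, q = 499, 547
    N = p * q
    phi = (p - 1) * (q - 1)
    x0 = 1234**2 % N          # seed: a quadratic residue mod N, coprime to N

    def bbs_stream(s):
        """Output b_0..b_{s-1}, where x_i = x_{i-1}^2 mod N, b_i = LSB(x_i)."""
        bits, x = [], x0
        for _ in range(s):
            bits.append(x & 1)    # b_i = LSB(x_i); b_0 comes from the seed
            x = (x * x) % N
        return bits

    # Precomputed tables "stored in the circuit": 2^(2^t) mod phi(N) and
    # x_0^(2^t) mod N, for t = 0, 1, 2, ...
    T = 20                        # enough bits to index the exponents we use
    pow2_table = [pow(2, 2**t, phi) for t in range(T)]
    x0_table   = [pow(x0, 2**t, N) for t in range(T)]

    def b(i):
        """Compute b_i the way the circuit would: select precomputed values
        according to the bits of i (resp. j) and multiply them together
        modulo phi(N) (resp. modulo N)."""
        # Step 1: j = 2^i mod phi(N)
        j = 1
        for t in range(T):
            if (i >> t) & 1:
                j = (j * pow2_table[t]) % phi
        # Step 2: x_0^j mod N
        y = 1
        for t in range(T):
            if (j >> t) & 1:
                y = (y * x0_table[t]) % N
        return y & 1

    stream = bbs_stream(200)
    assert all(b(i) == stream[i] for i in range(200))
    print("indexed computation matches the iterated generator")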
---------------------------

Now, we can use this to show that DFAs are hard to learn in the PAC model
when you don't have membership queries.

Idea: given an NC^1 circuit f over X = {0,1}^n, we will create a DFA g
over Y = {0,1,#}*.  We will also define a simple mapping h: X -> Y (this
mapping will be x -> x#x#x#...#x).  The DFA g will have the property that
g(h(x)) = f(x).  The number of copies of x will be 2^depth(f).

-> First, how would you create a DFA for the function x_2 OR not(x_4)?
   This is easy.  In fact, to help us later, we can do this so that the
   DFA always ends up in either a specific accept state A or a specific
   reject state R.

-> Given a DFA for the function f and a DFA for the function g, how do we
   create a DFA for "f AND g", assuming we're allowed to replicate the
   input?  This is easy: the "#" transition from f's R state goes
   directly to g's R state, and the "#" transition from f's A state goes
   to g's start state.  So we double the length of the input needed each
   time we go up one level in the circuit.  That's it.  (A small sketch
   of this construction appears at the end of these notes.)

Notice that we have *not* shown that DFAs are hard to learn with
membership queries (a good thing too, since we are going to give an
algorithm that learns DFAs using membership queries).  The reason is that
only queries of the form "x#x#x#x#x#x#x" actually map back to legal
queries for the original NC^1 circuit.
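Here is a minimal Python sketch (my reconstruction, not from the lecture)
of the construction for a toy AND-only formula with negations pushed into
the leaf literals.  The particular formula, the state-naming scheme, and
the helper names are made up for illustration.  Each leaf DFA reads one
copy of x and finishes in its own A or R state; and_dfa wires the "#"
transitions exactly as described above, and reject states absorb the 0/1
symbols of any segment they have to skip over.

    from itertools import product

    n = 4  # number of input bits (toy size, for exhaustive checking)

    # A DFA is a dict with 'start', 'A' (accept state), 'R' (reject state),
    # and 'trans' mapping (state, symbol) -> state.  Each DFA reads one
    # fixed-length segment and always finishes in its A or R state.

    def literal_dfa(idx, negated, name):
        """DFA reading an n-bit segment; accepts iff bit idx is 1 (or 0 if
        negated).  States are tagged with `name` so copies don't collide."""
        A, R = (name, 'A'), (name, 'R')
        def state(t, v):
            # state after reading t bits; v is the literal's value once known
            if t == n:
                return A if v else R
            return (name, t, v)
        trans = {}
        for t in range(n):
            for b in '01':
                if t < idx:
                    trans[(state(t, None), b)] = state(t + 1, None)
                elif t == idx:
                    v = (b == '1') != negated
                    trans[(state(t, None), b)] = state(t + 1, v)
                else:
                    for v in (True, False):
                        trans[(state(t, v), b)] = state(t + 1, v)
        # Reject absorbs 0/1; the '#' self-loops are defaults that the
        # AND-composition below overwrites where needed.
        trans[(R, '0')] = trans[(R, '1')] = R
        trans[(A, '#')] = A
        trans[(R, '#')] = R
        return {'start': state(0, None), 'A': A, 'R': R, 'trans': trans}

    def and_dfa(left, right):
        """Compose as in the lecture: '#' from left's A goes to right's
        start, '#' from left's R goes to right's R."""
        trans = dict(left['trans'])
        trans.update(right['trans'])
        trans[(left['A'], '#')] = right['start']
        trans[(left['R'], '#')] = right['R']
        return {'start': left['start'], 'A': right['A'], 'R': right['R'],
                'trans': trans}

    def run(dfa, s):
        q = dfa['start']
        for ch in s:
            q = dfa['trans'][(q, ch)]
        return q == dfa['A']

    # Toy depth-2, AND-only formula: f(x) = (x_0 AND not x_2) AND (x_1 AND x_3)
    g = and_dfa(and_dfa(literal_dfa(0, False, 'L1'), literal_dfa(2, True, 'L2')),
                and_dfa(literal_dfa(1, False, 'L3'), literal_dfa(3, False, 'L4')))

    def f(x):
        return (x[0] == '1') and (x[2] == '0') and (x[1] == '1') and (x[3] == '1')

    def h(x):
        return '#'.join([x] * 4)   # depth 2  ->  2^2 = 4 copies of x

    for bits in product('01', repeat=n):
        x = ''.join(bits)
        assert run(g, h(x)) == f(x)
    print("g(h(x)) = f(x) for all x in {0,1}^%d" % n)

The exhaustive check at the end confirms g(h(x)) = f(x) on all of {0,1}^n
for this toy formula.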