Notes for 3/30/98 * using cryptography to generate hard learning problems * using learning problems to generate cryptographic primitives (won't really get to this) --------------------------------------------------------------------- Hardness results we've seen so far have been of two kinds: (A) representation dependent. E.g., the following are NP-hard "Given set of data S, is there a consistent 3-node neural net?" "Given set of data S, is there a consistent 2-term DNF?" "Given set of data S, is there a linear threshold function that makes < m mistakes over S?" (the question if there exists a consistent LTF can be solved with linear programming) These statements can be very helpful. E.g., they explain why one expects to see local minima in neural nets, and why all algorithms for training a linear separator always end up minimizing something that is related to but not exactly the same as the "number of mistakes made". On the other hand, they are really talking more about the difficulty of finding the optimal member of some hypothesis class rather than the inherent difficulty of the target concept. For instance, given data consistent with a 2-term DNF, we can easily find a consistent 2-CNF. Big open problem: can you PAC-learn the class of "3-node neural nets" (as we defined earlier), using some *other* hypothesis representation. Interestingly, if you replace the top node of the network with an XOR, then you *can* learn, using the representation of "degree-2 threshold functions". This is using the fact that "(P(x) > 0) XOR (Q(x) > 0)" is the same as "P(x)*Q(x) > 0" (so long as we don't care about the "= 0" case. (B) in the SQ model. Here we got very strong hardness results. For instance, suppose the target function is a parity function selected by choosing a random subset of O(log(n)) of the n input variables as the "relevant variables" to the parity function. In this case we showed that *any* SQ algorithm will require n^(log n) queries. This is nice because first of all, it doesn't depend on the internal representation of the SQ algorithm, and second, it gives us an easy way of choosing the "hard target functions". Also, there is no complexity assumption involved. Even if P = NP, it would still be the case that no SQ algorithm could learn this class (and therefore no SQ algorithm can learn the classes of poly-size DNF formulas or poly-size decision trees). The main drawback here is that it only applies to SQ algorithms. Can we get results like (B) above, but that hold for all algorithms? To do this, we will need to make some sort of complexity theoretic assumption because, for instance, if P=NP, then we *could* learn any class of low VC-dimension. For instance, the problem of learning the class of "DNF of size <= s" is in NP because we can just draw a sample of size (1/epsilon)[s + ln(1/delta)] and ask "is there a consistent DNF of size <= s"? This question is in NP because it is easy to verify a guess (just check it on the sample). What we will do now is give results of this form based on cryptographic assumptions. E.g., if you could learn class C then you could factor quickly. Here is a result of Kharitonov's that shows that the class NC^1 (boolean circuits of logarithmic depth) is not weakly-learnable even if membership queries are allowed, and even if D = "the uniform distribution over {0,1}^n". --> describe NC^1. NC^1 is the class of O(log n) depth, bounded fan-in ciruits. It is equivalent to the class of "boolean formulas" (see hwk6). The difference between a "circuit" and a "formula" is that each gate in a circuit can have arbitrary fan-out, but in a formula all gates have fanout = 1. --> describe BBS pseudorandom bit generator. Given N = p*q where p and q are primes congruent to 3 mod 4, and given a seed x_0, define x_i = x_{i-1}^2 mod N. b_i = LSB(x_i) The generator outputs b_0, b_1, b_2, ... This generator has the property that if you could distinguish (A) an s-bit string (where s is poly(n)) produced by first choosing a random x_0 and then running the generator s times from (B) a truly random s-bit string, then you could use this to factor N in poly time. In fact, all you need is an algorithm that "weakly distinguishes" them, in the "weak-learning" sense. --> The idea now is to encode N, p, q, and x_0 into an NC^1 circuit so that on input i, the circuit produces b_i as output. --> Specifically: given an input x, we will look at the last k*log(n) bits of x and view that as a number i between 0 and n^k. The circuit will then output b_i. The reason we only use the last k*log(n) bits is that technically, the BBS pseudorandom bit generator is only known to be secure if you are looking at an initial poly(n)-sized segment of the b_i's. The way we will use this in a hardness proof is like this: say that A is a proposed n^5 time/queries/samples algorithm to learn NC^1 circuits with bias 1/2 + 1/n^4. To foil this algorithm we build a circuit with k=10. Since A only has n^5 queries or samples, when it comes time for A to make a prediction on a new example, the probability that the new random example has the same suffix as something it has seen already is at most 1 - n^5/n^10 = 1 - 1/n^5. So, if A could really get a bias as good as 1/2 + 1/n^4, that means that A could be used to get a weak bias on whether the string b_0,b_1,...,b_{n^k} was pseudorandom or truly random (and therefore A could be used to factor in polynomial time). --> only fact about NC^1 we will be using is that you can multiply n n-bit numbers modulo an n-bit number. --> To implement the generator in NC^1 we use the trick of precomputing a bunch of values. Specifically, given input i, what we want to compute is: LSB( (x_0)^{2^i} mod N) We will do this in two steps. First we compute j =2^i mod phi(N) and then we compute (x_0)^j mod N. To do the first step, we precompute 2^1, 2^2, 2^4, 2^8, etc., modulo phi(N), and store these values in the circuit. Then, given input i, we use i to tell us which of these precomputed values we want to multiply together modulo phi(N). Remember, we can multiply n n-bit numbers modulo another n-bit number in NC^1. The second step works the same way. We precompute x_0, (x_0)^2, (x_0)^4, (x_0)^8, ... modulo N, and use j to tell us which of these to multiply together modulo N. --------------------------- Now, we can use this to show that DFA are hard to learn in the PAC model when you don't have membership queries. Idea: given an NC^1 circuit f over X = {0,1}^n, will create DFA g over Y = {0,1,#}^(n^2). Also will define a simple mapping h: X->Y (This mapping will be: x -> x#x#x#x#....#x). The DFA g will have the property that g(h(x)) = f(x). -> first, how would you create a DFA for the function x_2 OR not(x_4)? This is easy. In fact, to help us later, we can do this so that the DFA always ends up in either a specific accept-state A or a specific reject-state R. -> Given a DFA for the function f and a DFA for the function g, how to create a DFA for "f AND g", assuming we're allowed to replicate the input? This is easy. The "#" transition from f's R state goes directly to g's R state. The "#" transition from f'a A state goes to g's start state. That's it. Notice that we have *not* shown that DFA's are hard to learn with membership queries (good thing too, since we earlier saw an algorithm that learns DFAs using membership queries). The reason is that only queries of the form "x#x#x#x#x#x#x" acutally map back to legal queries for the original NC^1 circuit. ----------------- Using learning problems to generare cryptographic primitives: see the paper on the web page. Basically, to go in the reverse direction, you need something a bit stronger than just saying "PAC learning C is hard over distribution D". You need that there is actually a hard distribution P over target concepts, and you need both P and D to be "simple" distributions in the sense that it is possible to samplefromthem in polynomial time. In other words, you need it to be possible for a polynomial time algorithm to play the role of the "environment" (by picking a target concept according to distribution P and then picking examples according to distribution D).