Notes for 3/30/98

* using cryptography to generate hard learning problems

* using learning problems to generate cryptographic primitives
  (won't really get to this)

---------------------------------------------------------------------

Hardness results we've seen so far have been of two kinds:

(A) representation dependent.  E.g., the following are NP-hard

  "Given set of data S, is there a consistent 3-node neural net?"

  "Given set of data S, is there a consistent 2-term DNF?"

  "Given set of data S, is there a linear threshold function that
	makes < m mistakes over S?"
   (the question if there exists a consistent LTF can be solved with
	linear programming)

  These statements can be very helpful.  E.g., they explain why one
  expects to see local minima in neural nets, and why all algorithms for
  training a linear separator always end up minimizing something that is
  related to but not exactly the same as the "number of mistakes made".
  On the other hand, they are really talking more about the difficulty
  of finding the optimal member of some hypothesis class rather than the
  inherent difficulty of the target concept.  For instance, given data
  consistent with a 2-term DNF, we can easily find a consistent 2-CNF.
  
  Big open problem: can you PAC-learn the class of "3-node neural nets" (as
  we defined earlier), using some *other* hypothesis representation.
  Interestingly, if you replace the top node of the network with an XOR,
  then you *can* learn, using the representation of "degree-2 threshold
  functions".  This is using the fact that "(P(x) > 0) XOR (Q(x) > 0)"
  is the same as "P(x)*Q(x) > 0" (so long as we don't care about the "=
  0" case. 

(B) in the SQ model.  Here we got very strong hardness results.  For
  instance, suppose the target function is a parity function
  selected by choosing a random subset of O(log(n)) of the n input
  variables as the "relevant variables" to the parity function.  In this
  case we showed that *any* SQ algorithm will require n^(log n)
  queries.   This is nice because first of all, it doesn't depend on the
  internal representation of the SQ algorithm, and second, it gives us
  an easy way of choosing the "hard target functions".  Also, there is
  no complexity assumption involved.  Even if P = NP, it would still be
  the case that no SQ algorithm could learn this class (and therefore no
  SQ algorithm can learn the classes of poly-size DNF formulas or
  poly-size decision trees).    The main drawback here is that it only
  applies to SQ algorithms.  

Can we get results like (B) above, but that hold for all algorithms?
To do this, we will need to make some sort of complexity theoretic
assumption because, for instance, if P=NP, then we *could* learn any
class of low VC-dimension.  For instance, the problem of learning the
class of "DNF of size <= s" is in NP because we can just draw a sample
of size (1/epsilon)[s + ln(1/delta)] and ask "is there a consistent
DNF of size <= s"?  This question is in NP because it is easy to
verify a guess (just check it on the sample).

What we will do now is give results of this form based on
cryptographic assumptions.  E.g., if you could learn class C then you
could factor quickly.

Here is a result of Kharitonov's that shows that the class NC^1
(boolean circuits of logarithmic depth) is not weakly-learnable even if
membership queries are allowed, and even if D = "the uniform
distribution over {0,1}^n".

--> describe NC^1.

NC^1 is the class of O(log n) depth, bounded fan-in ciruits.   It is
equivalent to the class of "boolean formulas" (see hwk6). The
difference between a "circuit" and a "formula" is that each gate in a
circuit can have arbitrary fan-out, but in a formula all gates have
fanout = 1.


--> describe BBS pseudorandom bit generator.
Given N = p*q where p and q are primes congruent to 3 mod 4, and given
a seed x_0, define 
	x_i = x_{i-1}^2 mod N.
	b_i = LSB(x_i)

The generator outputs b_0, b_1, b_2, ...

This generator has the property that if you could distinguish (A) an
s-bit string (where s is poly(n)) produced by first choosing a random
x_0 and then running the generator s times from (B) a truly random s-bit
string, then you could use this to factor N in poly time.  In fact,
all you need is an algorithm that "weakly distinguishes" them, in
the "weak-learning" sense.

--> The idea now is to encode N, p, q, and x_0 into an NC^1 circuit so
that on input i, the circuit produces b_i as output.

--> Specifically: given an input x, we will look at the last k*log(n)
bits of x and view that as a number i between 0 and n^k.  The circuit
will then output b_i.  The reason we only use the last k*log(n) bits
is that technically, the BBS pseudorandom bit generator is only known
to be secure if you are looking at an initial poly(n)-sized segment of
the b_i's.  

The way we will use this in a hardness proof is like this: say that  A
is a proposed n^5 time/queries/samples algorithm to learn NC^1
circuits with bias 1/2 + 1/n^4.  To foil this algorithm we build a
circuit with k=10.  Since A only has n^5 queries or samples, when it
comes time for A to make a prediction on a new example, the
probability that the new random example has the same suffix as
something it has seen already is at most 1 - n^5/n^10 = 1 - 1/n^5.
So, if A could really get a bias as good as 1/2 + 1/n^4, that means
that A could be used to get a weak bias on whether the string 
b_0,b_1,...,b_{n^k} was pseudorandom or truly random (and therefore A
could be used to factor in polynomial time).
 
--> only fact about NC^1 we will be using is that you can multiply n
n-bit numbers modulo an n-bit number.

--> To implement the generator in NC^1 we use the trick of
precomputing a bunch of values.  Specifically, given input i, what we
want to compute is:
		LSB( (x_0)^{2^i} mod N)

We will do this in two steps.  First we compute j =2^i mod phi(N) and
then we compute (x_0)^j mod N.

To do the first step, we precompute 2^1, 2^2, 2^4, 2^8, etc., modulo
phi(N), and store these values in the circuit.  Then, given input i,
we use i to tell us which of these precomputed values we want to
multiply together modulo phi(N). Remember, we can multiply n n-bit
numbers modulo another n-bit number in NC^1.

The second step works the same way.  We precompute x_0, (x_0)^2,
(x_0)^4, (x_0)^8, ... modulo N, and use j to tell us which of these to
multiply together modulo N.

---------------------------

Now, we can use this to show that DFA are hard to learn in the PAC model
when you don't have membership queries.

Idea: given an NC^1 circuit f over X = {0,1}^n, will create DFA g over
Y = {0,1,#}^(n^2).  Also will define a simple mapping h: X->Y
(This mapping will be: x -> x#x#x#x#....#x).  The DFA g will have the
property that g(h(x)) = f(x).

-> first, how would you create a DFA for the function x_2 OR not(x_4)?

This is easy.  In fact, to help us later, we can do this so that the
DFA always ends up in either a specific accept-state A or a specific
reject-state R. 

-> Given a DFA for the function f and a DFA for the function g, how to
create a DFA for "f AND g", assuming we're allowed to replicate the
input?  

This is easy.  The "#" transition from f's R state goes directly to
g's R state.  The "#" transition from f'a A state goes to g's start state.

That's it.  Notice that we have *not* shown that DFA's are hard to
learn with membership queries (good thing too, since we earlier saw an
algorithm that learns DFAs using membership queries).  The reason is
that only queries of the form "x#x#x#x#x#x#x" acutally map back to
legal queries for the original NC^1 circuit.  


-----------------

Using learning problems to generare cryptographic primitives:  see the
paper on the web page.  Basically, to go in the reverse direction, you
need something a bit stronger than just saying "PAC learning C is hard
over distribution D".  You need that there is actually a hard
distribution P over target concepts, and you need both P and D to be
"simple" distributions in the sense that it is possible to
samplefromthem in polynomial time.  In other words, you need it to be
possible for a polynomial time algorithm to play the role of the
"environment" (by picking a target concept according to distribution P
and then picking examples according to distribution D).