15-859(B) Machine Learning Theory                                      03/06/06

* Cryptographic hardness results for learning
* Some open computational problems in the PAC prediction model
=========================================================================

Today: computational hardness results for learning in the PAC prediction model, and some open problems.

We saw before that if we are not concerned with computation time, then we can learn any class of functions using a number of examples that is polynomial (actually, linear) in the number of bits needed to write down the target. But what happens when computation time enters the picture?

HARDNESS RESULTS
================
There are three kinds of computational hardness results in machine learning.

(1) Representation-dependent. Show that it's hard to find a low-error or near-optimal hypothesis in a given class.

 - E.g., it's NP-hard to determine whether a given dataset S is consistent with an intersection of 2 halfspaces. Same with 2-term DNF (reduction from hypergraph 2-coloring). This then implies it's hard to PAC-learn C by C, since you could let D be the uniform distribution over S and use epsilon = 1/(|S|+1).

 - E.g., it's NP-hard to find an OR function that minimizes the number of mistakes on an arbitrary dataset (direct reduction from set cover). Same for linear separators.

These statements can be very helpful. E.g., they explain why one expects to see local minima in neural nets, and why all algorithms for training a linear separator end up minimizing something that is related to, but not exactly the same as, the "number of mistakes made". On the other hand, they are really talking more about the difficulty of finding the optimal member of some hypothesis class than about the inherent difficulty of the target concept. For instance, given data consistent with a 2-term DNF, we can easily find a consistent 2-CNF.

Big open problem: can you PAC-learn the class of "intersections of 2 halfspaces" using some *other* hypothesis representation? Can you do agnostic learning of halfspaces or OR-functions using some other representation?

(2) Restricting the way the algorithm can interact with the data. We'll see this soon when we discuss the Statistical Query model. These hardness results won't require complexity assumptions. The main drawback is that they only apply to SQ algorithms.

(3) Cryptographic-based hardness. Show that the class of functions contains cryptographic objects. E.g., show that any good learning algorithm would allow you to factor efficiently. These results are nice because they apply without restricting the form of the hypotheses.

PRELIMINARIES: 3 learning models we will think about
====================================================
A) PAC prediction model.

B) PAC prediction over the uniform distribution D on {0,1}^n.

C) PAC model with the additional ability of the learning algorithm to query the target function f(x) on inputs x of its own choosing. These are called MEMBERSHIP QUERIES. So, the algorithm can make MQs, and also see examples from D. The goal is to get low error over D. Can look at a version of this where we fix D = uniform distribution.

In each case, the goal is to learn in time polynomial in the size of the target function (recall that a "concept class" really is also a representation language, so by "size" we mean size in that language).
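To make the three models concrete, here is a minimal Python sketch of the oracles each model gives the learner. The function names and the sampler interface are just illustrative, not standard notation:

    import random

    # Target f: {0,1}^n -> {0,1}; D is modeled as a sampler returning an n-bit tuple.

    def example_oracle(f, D, n):
        # Model (A): a labeled example (x, f(x)), with x drawn from an arbitrary D.
        x = D(n)
        return x, f(x)

    def uniform_example_oracle(f, n):
        # Model (B): same, but D is fixed to the uniform distribution on {0,1}^n.
        x = tuple(random.randint(0, 1) for _ in range(n))
        return x, f(x)

    def membership_query(f, x):
        # Model (C) adds this: the learner picks x itself and asks for f(x).
        return f(x)

In all three cases the learner's output hypothesis is still judged by its error over D (uniform D in model (B)).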
Today we'll look at two results, plus some open problems. Let's start with the open problems.

Some open problems
==================
1. Can DNF formulas be learned efficiently in model (A) or (B)? It is known that adding membership queries to the PAC prediction model (A) doesn't help for learning DNF (if you can do it with MQs, you can do it without). Intuitively, this is because of the difficulty of SAT (formally, one needs to look at digital signatures). On the other hand, Jeff Jackson (in his CMU PhD thesis) showed that you *can* use MQs to learn DNF formulas with respect to the uniform distribution (i.e., adding queries to model (B)).

2. Can decision trees be learned efficiently in model (A) or (B)? Interestingly, you *can* learn DTs in model (C) of PAC+MQs. We will see that later.

3. As a special case of problems (1) and (2), can the set of "functions that have only log(n) relevant variables" be learned efficiently in models (A) or (B)? [Note: the algorithm is not told *which* variables are relevant.] How about loglog(n) relevant variables? In that case, you even have time to try all possible truth-tables of width loglog(n), so the issue is just figuring out which variables are relevant. Can learn with MQs, though.

4. Can monotone DNF be learned efficiently in model (B)? In model (A) they are as hard as general DNF. In (B), you can weak-learn, but no strong-learning algorithm is known. The best error rate known is 1/2 - 1/sqrt(n).

Two hardness results for today:
===============================
* Assuming factoring is hard, the class NC^1 (boolean circuits of logarithmic depth) is not weakly learnable even if membership queries are allowed, and even if D = the uniform distribution over {0,1}^n (Kharitonov). This can even be extended to AC^0, if you assume that factoring is "really hard".

* Under the same assumption, DFAs can't be learned over arbitrary distributions without MQs. (Later in this course we will see an algorithm that learns them with MQs.)

Kharitonov's result
===================
NC^1 is the class of O(log n)-depth, bounded fan-in circuits. The only fact about NC^1 we will use is that you can multiply n n-bit numbers modulo another n-bit number in NC^1. The key idea is to use this power to encode some cryptographic function that is as hard as factoring.

Convenient function: the BBS pseudorandom bit generator:
 - Given N = p*q where p and q are primes congruent to 3 mod 4, and given a seed x_0, define x_i = (x_{i-1})^2 mod N and b_i = LSB(x_i).
 - The generator outputs b_0, b_1, b_2, ...

This generator has the property that if you could distinguish (a) an m-bit string produced by first choosing a random x_0 and then running the generator m times, from (b) a truly random m-bit string, then you could use this to factor N (in time polynomial in m and the running time of your algorithm). In fact, all you need is an algorithm that "weakly distinguishes" them, in the "weak-learning" sense.

The idea now is to show that for any given N, x_0, there must exist an NC^1 circuit that, on input i, produces b_i as output. Specifically, given input i, what we want to compute is:

    LSB( (x_0)^{2^i} mod N )

We will do this in two steps: first show how to compute j = 2^i mod phi(N), and then how to compute (x_0)^j mod N.

Unfortunately, we can't directly do repeated squaring in NC^1 (it is too sequential). However, here is a trick. We first precompute 2^1, 2^2, 2^4, 2^8, etc., modulo phi(N), and store these values in the circuit. Then, given input i, we use the bits of i to tell us which of these precomputed values to multiply together modulo phi(N). Remember, we can multiply n n-bit numbers modulo another n-bit number in NC^1. The second step works the same way: we precompute x_0, (x_0)^2, (x_0)^4, (x_0)^8, ..., modulo N, and use the bits of j to tell us which of these to multiply together modulo N.
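As a sanity check on the two-step construction, here is a small Python sketch (with toy parameters; nothing here is cryptographically meaningful, and the helper names are mine) verifying that "multiply together selected precomputed powers" produces the same bit b_i as running the generator i times:

    p, q = 19, 23                   # toy primes, both congruent to 3 mod 4
    N = p * q
    phi = (p - 1) * (q - 1)
    x0 = 101                        # seed, coprime to N

    def bbs_bit_iterative(i):
        # b_i by running the generator: x_t = (x_{t-1})^2 mod N, b_t = LSB(x_t)
        x = x0
        for _ in range(i):
            x = (x * x) % N
        return x & 1

    def bbs_bit_via_precomputed_powers(i):
        # b_i = LSB( x0^(2^i mod phi(N)) mod N ), computed the way the circuit does:
        # multiply together precomputed values, selected by the bits of the input.
        # Step 1: j = 2^i mod phi(N).  Precomputed table: 2^(2^k) mod phi(N).
        pow2 = [pow(2, 2 ** k, phi) for k in range(max(i.bit_length(), 1))]
        j = 1
        for k in range(i.bit_length()):
            if (i >> k) & 1:
                j = (j * pow2[k]) % phi
        # Step 2: x0^j mod N.  Precomputed table: x0^(2^k) mod N.
        powx = [pow(x0, 2 ** k, N) for k in range(max(j.bit_length(), 1))]
        x = 1
        for k in range(j.bit_length()):
            if (j >> k) & 1:
                x = (x * powx[k]) % N
        return x & 1

    assert all(bbs_bit_iterative(i) == bbs_bit_via_precomputed_powers(i)
               for i in range(200))

The point of the circuit version is that both tables are hardwired constants, and each step is a single product of selected n-bit numbers modulo an n-bit number, which NC^1 can do.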
Let C(x_0,N) be the NC^1 circuit that does this computation for a given x_0, N. Let C(x_0,N,t) be the same as C(x_0,N) except that all but the last t input bits are hardwired to 0 (it only looks at the last t bits of its input).

Formally, the way we will use this in a hardness proof is as follows. Say that A is a proposed n^k time/queries/samples algorithm to learn NC^1 circuits of size n with bias 1/2 + 1/n^k. Say we are given an N to factor. We pick a random x_0 as in BBS. Now, algorithm A must work on circuits of the form C(x_0,N,3k*lg(n)). Even though we can't construct this circuit (that required knowing phi(N)), we can simulate it by just producing the first n^{3k} entries in the BBS sequence. Now, since A only gets n^k queries or samples, when it comes time for A to make a prediction on a new random example, the probability that the new example has the same suffix as something A has already seen is at most n^k/n^{3k} = 1/n^{2k}; so with probability at least 1 - 1/n^{2k}, A is being asked to predict a bit of the sequence it has never seen. So, if A could really get a bias as good as 1/2 + 1/n^k, that means A could be used to get a weak bias on whether the string b_0, b_1, ..., b_{n^{3k}} was pseudorandom or truly random (and therefore A could be used to factor in polynomial time).

Can get statements about AC^0 too (constant depth, unbounded fan-in) using stronger assumptions about the hardness of factoring.

---------------------------

Now, we can use this to show that DFAs are hard to learn in the PAC model when you don't have membership queries.

Idea: given an NC^1 circuit f over X = {0,1}^n, we will create a DFA g over Y = {0,1}^m, where m = n*2^{depth(f)}. We will also define a simple mapping h: X -> Y, h(x) = xxx...x (x repeated 2^{depth(f)} times). The DFA g will have the property that g(h(x)) = f(x). Also, the number of states in g will be polynomial in m.

-> First, how would you create a DFA for the function x_2 OR not(x_4)? This is easy. In fact, notice we can do this so that the DFA always ends up in either a specific accept-state A or a specific reject-state R, and we can get rid of self-loops if we only care about inputs of length n.

-> Given a DFA for the function f1 and a DFA for the function f2, how do we create a DFA for "f1 AND f2", assuming we're allowed to replicate the input? Just make f1's A state be f2's start state, and have f1's R state go through a chain of dummy states to f2's R state. Similarly for OR. So we double the length of the input needed each time we go up 1 level in the circuit. That's it. (A small sketch of this gadget appears at the end of these notes.)

Notice that we have *not* shown that DFAs are hard to learn with membership queries (a good thing, too, since we are going to give an algorithm that learns DFAs using membership queries). The reason is that only queries of the form "xxx...x" actually map back to legal queries for the original NC^1 circuit.
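Here is a minimal Python sketch of the two building blocks above: a DFA for a single literal over fixed-length inputs, and the AND gadget that chains two DFAs on a doubled input. The dictionary-based representation and the helper names are mine, chosen for brevity; the OR gadget is symmetric (glue at the reject state instead), so only AND is tested here, on x_2 AND not(x_4) over all 4-bit inputs.

    from itertools import product

    def literal_dfa(n, var, negated=False):
        # DFA over {0,1}^n that accepts iff bit number `var` (1-indexed) is 1
        # (or 0, if negated).  It always ends in a dedicated accept state A or
        # reject state R, and has no self-loops, as in the construction above.
        trans = {}
        for i in range(var - 1):                       # skip bits before the literal
            for b in (0, 1):
                trans[(("pre", i), b)] = ("pre", i + 1)
        for b in (0, 1):                               # branch on the literal's bit
            sat = (b == 1) != negated
            trans[(("pre", var - 1), b)] = (("sat" if sat else "unsat"), var)
        for i in range(var, n):                        # skip the remaining bits
            for b in (0, 1):
                trans[(("sat", i), b)] = ("sat", i + 1)
                trans[(("unsat", i), b)] = ("unsat", i + 1)
        return {"trans": trans, "start": ("pre", 0), "length": n,
                "accept": ("sat", n), "reject": ("unsat", n)}

    def dfa_and(g1, g2):
        # AND gadget on input yz: g1 runs on y; its accept state A1 is glued to
        # g2's start state, and its reject state R1 walks a chain of dummy states
        # to g2's reject state R2 while the second half of the input is ignored.
        n2 = g2["length"]
        trans = {}
        for (s, b), t in g1["trans"].items():          # tag states to avoid clashes
            trans[((1, s), b)] = (1, t)
        for (s, b), t in g2["trans"].items():
            trans[((2, s), b)] = (2, t)
        for b in (0, 1):                               # A1 behaves like g2's start
            trans[((1, g1["accept"]), b)] = (2, g2["trans"][(g2["start"], b)])
        prev = (1, g1["reject"])                       # R1 -> dummy chain -> R2
        for i in range(n2 - 1):
            for b in (0, 1):
                trans[(prev, b)] = ("dummy", i)
            prev = ("dummy", i)
        for b in (0, 1):
            trans[(prev, b)] = (2, g2["reject"])
        return {"trans": trans, "start": (1, g1["start"]),
                "accept": (2, g2["accept"]), "reject": (2, g2["reject"]),
                "length": g1["length"] + n2}

    def dfa_run(g, bits):
        s = g["start"]
        for b in bits:
            s = g["trans"][(s, b)]
        return s == g["accept"]

    # Test: f(x) = x_2 AND not(x_4) on 4-bit x, fed to the DFA as the doubled input xx.
    g = dfa_and(literal_dfa(4, 2), literal_dfa(4, 4, negated=True))
    for x in product((0, 1), repeat=4):
        assert dfa_run(g, list(x) + list(x)) == (x[1] == 1 and x[3] == 0)

Note how the input length doubles at each AND, matching the m = n*2^{depth(f)} bound, and how the gadget relies on the literal DFAs reaching A or R only after reading their entire input.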