15-859(B) Machine Learning Theory 03/01/07
* Computational hardness results for learning
* Some open computational problems
=========================================================================
Today: computational hardness results for learning plus some open
problems. We saw before that if we are not concerned with computation
time, then we can learn any class of function using a number of examples
that is polynomial (actually, linear) in the number of bits to write down
the target. But what happens when computation time enters the picture?
HARDNESS RESULTS
================
There are three kinds of computational hardness results in machine learning.
(1) Representation dependent. Show that it's hard to find a
low-error or near-optimal hypothesis in a given class.
- E.g., it's NP-hard to determine whether a given dataset S is
consistent with an intersection of 2 halfspaces. Same with 2-term DNF.
(Reduction from hypergraph 2-coloring). This then implies it's hard
to PAC-learn C by C since you could let D be the uniform distribution
over S and use epsilon = 1/(|S|+1).
- E.g., it's NP-hard to find the lowest-error OR-function on a given
dataset. Same for linear separators.
These statements help explain why one expects to see local minima in
neural nets, and why algorithms for training a linear separator end
up minimizing something that is related to, but not exactly the same
as, the "number of mistakes made". On the other
hand, they are really talking more about the difficulty of finding
the optimal member of some hypothesis class rather than the inherent
difficulty of the target concept. For instance, given data
consistent with a 2-term DNF, we can easily find a consistent 2-CNF.
Big open problem: can you PAC-learn the class of "intersections of 2
halfspaces" using some *other* hypothesis representation?
A possibly easier problem: the best approximation guarantee known
for finding an OR-function with the fewest errors on a dataset is
sqrt(n) [i.e., making sqrt(n)*OPT mistakes]. Can you beat that?
(2) Restricted access. Show hardness by restricting the way the
algorithms can interact with the data.
We'll see this soon when we discuss the Statistical Query model.
These hardness results won't require complexity assumptions. The
main drawback is that they apply only to SQ algorithms.
(3) Cryptographic-based hardness. Show that the class of functions
contains cryptographic objects. E.g., show that any good learning
algorithm would allow you to factor efficiently. These results are
nice because they will apply without restricting the form of hypotheses.
We are going to focus on hardness results of type (3) today.
PRELIMINARIES: 4 learning models we will think about
====================================================
A) Standard PAC model, where h can be any poly-time computable function.
B) Same as (A) but restrict D to be uniform on {0,1}^n
C) Model (A) but give learning algorithm the additional ability to
query target function f(x) on inputs x of its own choosing. These
are called MEMBERSHIP QUERIES. They make most sense in contexts
where you think of f as a device you want to reverse-engineer. So,
algorithm can make MQs, and also see examples from D. Goal is to
get low-error over D.
D) Model (B) + Membership queries. Here you can think of querying as
the *only* thing the algorithm can do, and goal is to find h that
agrees with f on 99% of the points in {0,1}^n.
The goal is to do learning in time polynomial in the size of the
target function. (Recall that a "concept class" really is also a
representation language; by "size" we mean size in that language.)
Today we'll look at two results + some open problems. Let's start
with the open problems.
Some open problems
==================
1. Can DNF formulas be learned efficiently in model (A) or (B)?
A very nice result by Jeff Jackson (in his CMU PhD thesis) showed
how to learn DNF formulas in model (D). We'll see in a later class
an algorithm for using queries to *weak learn* with respect to the
uniform distribution. (Note: strong learning doesn't immediately
follow by boosting, since boosting requires weak learning over all
distributions, and this holds only for the uniform distribution.)
Also, under crypto assumptions, it is known that model (C) is as
hard as model (A). Intuitively, this is because producing any
interesting query at all is as hard as SAT. Formally, the
assumption is the existence of secure digital signatures.
2. Can Decision Trees be learned efficiently in model (A) or (B)?
Interestingly, you *can* learn DTs with queries in model (C). Will
see that later. We gave an n^O(log s)-time algorithm for model (A).
3. Can Monotone DNF be learned efficiently in model (B)? In model (A)
they are the same as general DNF. In (B), you can weak-learn but
no strong-learning algorithm is known. Best error rate known is 1/2 -
1/sqrt(n). Model (C) is easy (will see that later).
Two hardness results for today:
===============================
* Assuming factoring is hard, the class NC^1 (boolean circuits of
logarithmic depth) is not weakly-learnable even in model (D). (Kharitonov)
Can even extend to AC^0, if you assume that factoring is "really hard".
* Under the same assumption, DFA's can't be learned over arbitrary
distributions without MQs (model (A)). Later in this course we will
see an algorithm that learns them with MQs.
Kharitonov's result
===================
NC^1 is the class of O(log n) depth, bounded fan-in circuits.
The only fact about NC^1 we will be using is that you can multiply n
n-bit numbers modulo an n-bit number. The key idea now is to try to
use this power to encode some cryptographic function that is as hard
as factoring.
Convenient function: BBS pseudorandom bit generator:
- Given N = p*q where p and q are primes congruent to 3 mod 4, and given
a seed x_0, define
x_i = x_{i-1}^2 mod N.
b_i = LSB(x_i), the least-significant bit of x_i.
- The generator outputs b_0, b_1, b_2, ...
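As a sanity check, the generator above can be sketched in a few lines (toy primes for illustration only; real use requires large primes p, q congruent to 3 mod 4):

```python
# A minimal sketch of the BBS generator defined above.
# p = 7, q = 11 below are toy values; do not use sizes like this in practice.

def bbs_bits(p, q, x0, m):
    """Output the first m bits b_0, b_1, ... of the BBS generator."""
    assert p % 4 == 3 and q % 4 == 3
    N = p * q
    x = x0 % N
    bits = []
    for _ in range(m):
        bits.append(x & 1)    # b_i = LSB(x_i)
        x = (x * x) % N       # x_{i+1} = x_i^2 mod N
    return bits

# e.g. bbs_bits(7, 11, 2, 5) walks x through 2, 4, 16, 25, 9 mod 77
```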
This generator has the property that if you could distinguish
(I) an m-bit string produced by first choosing a random x_0 and then
running the generator m times, from
(II) a truly random m-bit string,
then you could use this to factor N (in time polynomial in m and
running time of your algorithm). In fact, all you need is an
algorithm that "weakly distinguishes" them, in the "weak-learning" sense.
The idea now is to try to show that for any given N, x_0, there must
exist an NC^1 circuit C_{x_0,N} that on input i, produces b_i as output.
Before showing that, let's see how that would imply what we want.
First of all, define C_{x_0,N,t} to be the same as C_{x_0,N} except
that all but the last t input bits are hardwired to 0 (it only looks
at the last t bits of input i). So, this is in NC^1 too.
Now, the way we will use this in a hardness proof is like this: say
that A is a proposed poly-time weak learner for NC^1. So, for some
constant k, if the target has n gates, A uses at most n^k
time/queries/samples and gets error at most 1/2 - 1/n^k. Now, say
we are given an m-bit
string that's either of type (I) or (II) above, for m = n^{3k}. What
we do is use this as a lookup table. Given an input i, we look at
the last 3k*lg(n) bits of i and output the corresponding position in
our bit string. Now, since A only has n^k queries or samples, when it
comes time for A to make a prediction on a new example, the
probability that the new random example has the same
suffix as something it has seen already is at most n^k/n^{3k} =
1/n^{2k}. So, if our bit string was type (II), then A would have
probability at most 1/2 + 1/n^{2k} of getting it right. On the other
hand, if our bit string was type (I), then we have been consistent
with C_{x_0,N,3k*log(n)} and so A should have success probability at
least 1/2 + 1/n^k. This is a weak distinguisher, and so you could factor.
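The lookup-table step can be made concrete with a small sketch (the function name is ours, for illustration): given the m-bit string s with m = 2^t, the simulated "target" answers a query i using only i's last t bits.

```python
# Hypothetical sketch of the lookup-table target used in the reduction:
# the m-bit string s (type I or II) is indexed by the last t bits of the
# query i, so when s is of type (I) the answers are consistent with the
# circuit C_{x_0,N,t}.

def lookup_target(s, t):
    """Return a 'target function' f over bit-vectors, backed by table s."""
    assert len(s) == 2 ** t
    def f(i_bits):
        # interpret the last t bits of i as an index into s
        idx = int(''.join(str(b) for b in i_bits[-t:]), 2)
        return s[idx]
    return f
```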
So, we just need to argue that we can make an NC^1 circuit that on
input i computes: LSB( (x_0)^{2^i} mod N )
We will do this in two steps: first show how to compute j = 2^i mod
phi(N), and then how to compute (x_0)^j mod N. (For x_0 coprime to
N, (x_0)^{2^i} = (x_0)^j mod N by Euler's theorem.)
Unfortunately, we can't directly do repeated-squaring in NC^1.
However, here is a trick. We first precompute 2^1, 2^2,
2^4, 2^8, etc., modulo phi(N), and store these values in the circuit.
Then, given input i, we use i to tell us which of these precomputed
values we want to multiply together modulo phi(N). Remember, we can
multiply n n-bit numbers modulo another n-bit number in NC^1.
The second step works the same way. We precompute x_0, (x_0)^2,
(x_0)^4, (x_0)^8, ... modulo N, and use j to tell us which of these to
multiply together modulo N.
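Both steps can be sketched together in code (function names and toy parameters are ours; `phiN` stands for phi(N)). The tables of repeated squares play the role of the values hardwired into the circuit, and the selected products are exactly the "multiply n n-bit numbers modulo an n-bit number" operations NC^1 supports:

```python
# Sketch of the two-step trick above, with toy parameters.

def power_table(base, modulus, nbits):
    """Precomputed table of base^(2^t) mod modulus for t = 0 .. nbits-1."""
    table = [base % modulus]
    for _ in range(nbits - 1):
        table.append((table[-1] * table[-1]) % modulus)
    return table

def bbs_bit(i, x0, N, phiN, nbits):
    """b_i = LSB(x0^(2^i) mod N), computed from the precomputed tables."""
    # Step 1: j = 2^i mod phi(N) -- the bits of i select which of the
    # stored values 2^(2^t) mod phi(N) get multiplied together.
    pow2 = power_table(2, phiN, nbits)
    j = 1
    for t in range(nbits):
        if (i >> t) & 1:
            j = (j * pow2[t]) % phiN
    # Step 2: x0^j mod N -- the bits of j select which stored values
    # x0^(2^t) mod N get multiplied together.
    powx = power_table(x0, N, max(j.bit_length(), 1))
    y = 1
    for t in range(j.bit_length()):
        if (j >> t) & 1:
            y = (y * powx[t]) % N
    return y & 1
```

With N = 7*11 = 77 and phi(N) = 60, this agrees with directly squaring the seed i times.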
Can get statements about AC^0 too (constant depth, unbounded fanin)
using stronger assumptions about hardness of factoring.
---------------------------
Now, we can use this to show that DFAs are hard to learn in the PAC
model when you don't have membership queries.
Idea: given an NC^1 circuit f over {0,1}^n, we will create a DFA g
over {0,1}^m where m = n*2^{depth(f)}. We will also define a simple
mapping r: x -> xxx...x (x replicated 2^{depth(f)} times). The DFA g
will have the property that g(r(x)) = f(x), and the number of states
in g will be polynomial in m.
-> First, how would you create a DFA for the function x_2 OR not(x_4)?
This is easy. In fact, notice we can do this so that the
DFA always ends up in either a specific accept-state A or a specific
reject-state R, and we can get rid of self-loops if we only care about
inputs of size n.
-> Given a DFA for the function f1 and a DFA for the function f2, how to
create a DFA for "f1 AND f2", assuming we're allowed to replicate the
input?
Just make f1's A state be f2's start state, and have f1's R state go
through a chain of dummy states to f2's R state. Similarly for OR.
So we double the length of the input needed each time we go up 1 level
in the circuit.
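As a toy illustration of these two steps (the state encoding here is our own, not from the lecture): a literal DFA that always halts in a designated accept state A or reject state R after exactly n symbols, and an AND combination whose machine consumes a second copy of the input, doubling the input length.

```python
# Toy sketch of the DFA construction: literal machines plus AND-chaining.
# A DFA is (transition dict, start, accept, reject); states are tuples.

def literal_dfa(i, n, negated=False):
    """DFA for x_i (or not x_i) on n-bit inputs; after exactly n symbols
    it sits in accept state ('A', n) or reject state ('R', n)."""
    delta = {}
    for pos in range(i - 1):          # before reaching x_i: just count
        for b in (0, 1):
            delta[(('?', pos), b)] = ('?', pos + 1)
    for b in (0, 1):                  # the symbol at position i decides
        good = (b == 0) if negated else (b == 1)
        delta[(('?', i - 1), b)] = ('A' if good else 'R', i)
    for pos in range(i, n):           # remember the answer to the end
        for b in (0, 1):
            for s in ('A', 'R'):
                delta[((s, pos), b)] = (s, pos + 1)
    return delta, ('?', 0), ('A', n), ('R', n)

def run_dfa(dfa, bits):
    delta, start, acc, rej = dfa
    state = start
    for b in bits:
        state = delta[(state, b)]
    return state == acc

def and_dfa(d1, d2, n):
    """AND of two machines: d1's accept state acts as d2's start state;
    d1's reject state feeds a chain of dummy states into d2's reject
    state. The input must be the replicated string r(x) = xx."""
    delta1, s1, a1, r1 = d1
    delta2, s2, a2, r2 = d2
    delta = dict(delta1)
    for (st, b), nxt in delta2.items():        # d2's states, tagged '2'
        delta[(('2', st), b)] = ('2', nxt)
    for b in (0, 1):
        delta[(a1, b)] = ('2', delta2[(s2, b)])
        delta[(r1, b)] = ('R*', 1) if n > 1 else ('2', r2)
    for pos in range(1, n):                    # dummy chain of length n
        for b in (0, 1):
            delta[(('R*', pos), b)] = ('R*', pos + 1) if pos + 1 < n else ('2', r2)
    return delta, s1, ('2', a2), ('2', r2)
```

For example, the machine for x_2 AND not(x_4) on 4-bit inputs reads 8 symbols (two copies of x), matching the length-doubling per circuit level described above.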
That's it. Notice that we have *not* shown that DFA's are hard to
learn with membership queries (good thing too, since we are going to
give an algorithm that learns DFAs using membership queries). The
reason is that only queries of the form "xxx...x" actually map
back to legal queries for the original NC^1 circuit. We have also
*not* shown that DFA's are hard to learn over the uniform distribution
on n-bit inputs. (This is a nice open problem).