15-859(A) Machine Learning Theory                               03/16/04

Membership Query learning, contd
 * KM alg recap (finding large Fourier coefficients)
 * Bshouty's algorithm for learning decision trees and XORs of terms
 * Hardness results for learning
=======================================================================
Recap from last time:

- Membership queries: we get to ask for f(x) on examples x of our choosing.
  Think of reverse-engineering some black-box device.  Assume the input
  space is {0,1}^n.

- KM algorithm: finds all large (size > 1/poly) Fourier coefficients over
  the parity basis.  I.e., finds all parity functions that correlate by at
  least 1/poly (or at most -1/poly) with the target wrt the uniform
  distribution.

  -> If most of the "length" of f is in large Fourier coefficients, then
     we get a good approximation wrt the uniform distribution.
  -> E.g., we saw this was true for decision trees.  So this gives a
     poly-time algorithm for using MQs to learn DTs wrt the uniform
     distribution.  It also gives weak-learning for DNF, which can then be
     strengthened to strong-learning [didn't do in class].
======================================================
The question left open last time was: can you use MQs to learn DTs wrt
arbitrary distributions?

To put this into context: it turns out you can show that a bad
distribution can make MQs worthless for learning DNF, if crypto-hard
functions exist.  (A similar statement is that, modulo some crypto
assumptions, I could construct a 3-SAT formula such that even if you ask
me for a bunch of satisfying assignments, you still can't construct a new
one that I haven't shown you yet.)  But this argument doesn't go through
for decision trees.

Problem solved by Bshouty: an algorithm for learning DTs in the PAC+MQ
model.  Note: the algorithm won't be very practical, in the sense that
not only does it use MQs, it is also very sensitive to noise.  (In
contrast, the Fourier algorithm would still be reasonable if there was
random noise, since noise would just affect the correlations in a
predictable way.)
======================================================
Here is a simpler version of Bshouty's original algorithm, from an email
message that he sent me (not sure if this simpler version is in print).
You'll see some similarities to KM.

In fact, the algorithm works for a larger class: "XORs of terms".  We'll
assume the target can be written as f = T_1 XOR T_2 XOR ... XOR T_m,
where the T_i are conjunctions.  Why is this a generalization of DTs?
(A decision tree can be written this way: take one term per leaf labeled
1; since an example satisfies at most one such term, the OR of these
terms equals their XOR.)

We're going to solve the problem by building a decision-tree-like thing
(a "parity decision graph"?) of depth n and width O(m).  So, by Occam
results, we just need to take some initial sample S of size
O(nm/epsilon), and then use our membership queries to find a consistent
hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2 at
the next level, etc., until we get to > m leaves.  This will happen at
some depth d = log(m).

Define T_1', ..., T_m' to be what you get by taking the original terms
T_1, ..., T_m and removing all mention of the variables x_1, ..., x_d
from them.  So, if we look at any leaf of our tree so far, the value of
the target function, given that the example takes you to that leaf, can
be written as an XOR of some subset of the T' terms.

For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4),
and say d = 2.  Then the leaves look like:
    (T1' XOR T3'),  T1',  (T2' XOR T3'),  T2'.

Now, the claim is that since there are > m leaves, some of these must be
XORs of others.
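As a quick sanity check on this example, here is a small brute-force
sketch (my own addition, not from the lecture; x is represented as a
0-indexed bit tuple, so x_1 is x[0]):

from itertools import product

# f = T1 XOR T2 XOR T3, with x given as a 0-indexed tuple (so x_1 is x[0]).
T1 = lambda x: x[0] & x[2] & x[4]              # x1 x3 x5
T2 = lambda x: (1 - x[0]) & x[3] & x[5]        # not(x1) x4 x6
T3 = lambda x: x[1] & x[2] & (1 - x[3])        # x2 x3 not(x4)
f  = lambda x: T1(x) ^ T2(x) ^ T3(x)

# The T' terms: drop x1 and x2 (the variables branched on, d = 2).
T1p = lambda x: x[2] & x[4]                    # T1' = x3 x5
T2p = lambda x: x[3] & x[5]                    # T2' = x4 x6
T3p = lambda x: x[2] & (1 - x[3])              # T3' = x3 not(x4)

# Leaf reached by each setting of (x1, x2), and the claimed XOR of T' terms.
leaf = {(1, 1): lambda x: T1p(x) ^ T3p(x),
        (1, 0): lambda x: T1p(x),
        (0, 1): lambda x: T2p(x) ^ T3p(x),
        (0, 0): lambda x: T2p(x)}

for x in product((0, 1), repeat=6):
    assert f(x) == leaf[(x[0], x[1])](x)
print("each leaf's XOR of T' terms agrees with f on all 64 inputs")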
In fact, there must exist some set L of at most m leaves such that every
other leaf is an XOR of some subset of the leaves in L.  (L is a maximal
linearly-independent set mod 2.)  E.g., in the example above, any one
leaf is the XOR of the other three.

So the plan is now:
 1. (somehow) find such a set L, and write down, for each other leaf, how
    it can be written as an XOR of leaves in L.  (Oops: the description
    length of a level could be as high as m^2, so let's have S be of size
    nm^2/epsilon to be safe....)
 2. keep building the tree down from the leaves in L until we have more
    than m leaves again, and repeat.
 3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1: for each example, vary the first d bits (using MQs) to make
the example reach each of the active leaves.
  ==> We get a vector of outputs, i.e., what the output would have been
      if the first d bits were set to take you to leaf i.
If we write this out for all the examples, we get a big matrix with one
row per example, where the set L corresponds to a set of <= m columns
such that every other column is some linear combination of these (mod 2).
Find these with Gaussian elimination (a small sketch of this step appears
below, after the correctness argument).

(What's a little confusing here is: say this tells us that leaf 1 is the
XOR of leaves 2, 3, 5.  We don't really know if this is true over ALL of
{0,1}^n.  We just know it's true over examples with suffixes matching S.
But that's OK, since our goal is just to find a hypothesis consistent
with S.)

To do 3: follow the example down the tree.  When you get to an "inactive"
node, replace that position by the *set* of positions listed at the node.
So, our "current location" in the tree is now a set of nodes.  More
generally, if our current set of nodes is N and some node y in N is
inactive, then move to the set (N - {y}) XOR (positions listed at y).
Finally, as we go down the tree, the T' terms become constant 1 or 0, and
we just take the XOR.

Why is this legal?  We can see this by arguing bottom-up.  Say (as above)
that L is the set of active nodes at the current level, and inductively
assume that the trees rooted at nodes of L are correct on all examples
produced by taking some x in S and replacing its prefix by whatever is
needed to reach that node.  Then (by construction) the inactive nodes
(the ones not in L) are also correct on examples produced by taking some
x in S and replacing its prefix with whatever was needed to reach them.
Therefore, all *parents* of this level are correct on examples produced
by taking some x in S and replacing prefixes.  And so on.
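Here is a small sketch of the linear algebra in step 1 (my addition, in
Python, not Bshouty's actual code): given the 0/1 matrix of MQ answers,
one row per example in S and one column per active leaf, it runs Gaussian
elimination mod 2 to pick a maximal independent set of columns L and
records, for every other column, which columns of L XOR to it.

def independent_leaves(M):
    """M: 0/1 matrix with one row per example in S and one column per active
    leaf; M[i][j] is the MQ answer f(example i, first d bits reset to reach
    leaf j).  Returns (L, expr): L indexes a maximal set of columns that is
    linearly independent mod 2, and expr[j] is the subset of L whose XOR
    equals column j, for every column j not in L."""
    cols = list(zip(*M))                 # work column by column
    L, expr, pivots = [], {}, []         # pivots: (pivot row, reduced col, subset of columns it equals)
    for j, col in enumerate(cols):
        v, used = list(col), set()
        for r, pv, pl in pivots:         # reduce column j against the chosen pivots
            if v[r]:
                v = [a ^ b for a, b in zip(v, pv)]
                used ^= pl               # track the combination (symmetric difference)
        if any(v):                       # independent: keep column j as a new element of L
            pivots.append((v.index(1), v, {j} ^ used))
            L.append(j)
        else:                            # dependent: column j is the XOR of the columns in `used`
            expr[j] = used
    return L, expr

# Toy matrix: 5 examples, 4 active leaves; column 3 is the XOR of columns
# 0, 1, 2 (equivalently, each leaf is the XOR of the other three, as in the
# example above).
M = [[1, 0, 1, 0],
     [1, 1, 0, 0],
     [0, 1, 1, 0],
     [1, 1, 1, 1],
     [0, 0, 1, 1]]
print(independent_leaves(M))             # ([0, 1, 2], {3: {0, 1, 2}})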
========================================================
Summary of learnability results
===============================
PAC+MQ (arbitrary distribution, with membership queries):
 - monotone DNF
 - decision trees
 - XORs of terms
 - DFAs (target is a DFA of n states; inputs are strings.  Will do this
   next class.)

MQ + uniform (membership queries; learning wrt the uniform distribution):
 - general DNF formulas

No queries, but allow n^polylog(n) time:
 - decision trees
 - DNF over the uniform distribution; AC^0 too [didn't do this in class]

No queries, polynomial time:
 - decision lists, LTFs, k-CNF, k-DNF.
========================================================
HARDNESS RESULTS
================
Three kinds of computational hardness results in machine learning.

(1) Representation dependent.  Show that it's hard to find a consistent
    or optimal hypothesis in a given class.
    - E.g., it's NP-hard to find an OR function that minimizes the number
      of mistakes on an arbitrary dataset (direct reduction from set
      cover).  Same for linear separators.
    - E.g., it's NP-hard to find an intersection of 2 halfspaces (an AND
      of two linear separators) that is consistent with a dataset.  Same
      for 2-term DNF.  Reduction from hypergraph 2-coloring.

    These statements can be very helpful.  E.g., they explain why one
    expects to see local minima in neural nets, and why all algorithms
    for training a linear separator end up minimizing something that is
    related to, but not exactly the same as, the number of mistakes made.
    On the other hand, they are really about the difficulty of finding
    the optimal member of some hypothesis class rather than the inherent
    difficulty of the target concept.  For instance, given data
    consistent with a 2-term DNF, we can easily find a consistent 2-CNF.

(2) Restricting the way the algorithm can interact with the data -- e.g.,
    the SQ model.  Here we got very strong hardness results.  What's nice
    is that we could also talk about hard distributions over targets --
    the hard case is when the target is picked at random from a large set
    of orthogonal functions.  No complexity assumptions needed.  The main
    drawback is that it only applies to SQ algorithms.

(3) Cryptographic-based hardness.  Show that the class of functions
    contains cryptographic objects.  Today we'll look (briefly) at two
    results:
    * Assuming factoring is hard, the class NC^1 (boolean circuits of
      logarithmic depth) is not weakly learnable, even if membership
      queries are allowed and even if D = uniform.  (Kharitonov)
    * Under the same assumption, DFAs can't be learned over arbitrary
      distributions without MQs.

Kharitonov's result
===================
DEFINITION: NC^1 is the class of O(log n)-depth, bounded fan-in circuits.
It is equivalent to the class of "boolean formulas" (like DNF, but you
can put ORs, ANDs, and NOTs together however you want).  The difference
between a "circuit" and a "formula" is that each gate in a circuit can
have arbitrary fan-out, but in a formula all gates have fan-out 1.
[Proving equivalence will probably be on hwk5]

Important property of NC^1: you can multiply n n-bit numbers modulo
another n-bit number in NC^1.

The cryptographic object we'll put into the NC^1 circuit is the BBS
pseudorandom bit generator: given N = p*q where p and q are primes
congruent to 3 mod 4, and given a seed s = s_0, define
    s_i = s_{i-1}^2 mod N,      b_i = LSB(s_i).
The generator outputs b_0, b_1, b_2, ...

This generator has the property that if you could distinguish (A) an
m-bit string (where m is poly(n)) produced by first choosing a random s
and then running the generator m times, from (B) a truly random m-bit
string, then you could use this to factor N in poly time.  In fact, all
you need is an algorithm that "weakly distinguishes" them, in the
"weak-learning" sense.

--> The idea is to encode N, p, q, and s into an NC^1 circuit so that on
input i, the circuit produces b_i as output.  It's not obvious how to do
this, since we need to compute LSB(s^{2^i} mod N), but let's first look
at why it gives a hardness result.

The point is that if an algorithm were able, by asking queries, to
determine (or even get a good bias on) the value f(x) of some x it hadn't
yet seen, it would be able to distinguish the b_i's from random.  One
technical issue is that the BBS generator is only known to be secure if
you are looking at an initial poly(n)-sized segment of the b_i's.  So, to
deal with this, imagine a version of this circuit where we mask off all
but the last k*log(n) bits of the input (we set the rest to 0).  If
someone's magic algorithm claims, say, to get bias 1/2 + 1/n^4 after n^5
queries, then we set k = 10.  After n^5 queries, the probability that a
new random example has the same suffix as something already seen is at
most n^5/n^10 = 1/n^5.  So, if A could really get a bias as good as
1/2 + 1/n^4, that means A could be used to get a weak bias on whether the
string b_0, b_1, ..., b_{n^10} was pseudorandom or truly random (and
therefore A could be used to factor in polynomial time).

--> To implement the generator in NC^1 we use the trick of precomputing
useful values.  Specifically, given input i, we compute
LSB(s^{2^i} mod N) in two steps.  First we compute j = 2^i mod phi(N),
and then we compute s^j mod N.  To do the first step, we precompute
2^1, 2^2, 2^4, 2^8, etc., modulo phi(N), and store these values in the
circuit.  Then, given input i, we use the bits of i to tell us which of
these precomputed values to multiply together modulo phi(N).  Remember,
we can multiply n n-bit numbers modulo another n-bit number in NC^1.  The
second step works the same way: we precompute s, s^2, s^4, s^8, ...
mod N, and use the bits of j to tell us which of these to multiply
together modulo N.
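Here is a quick numerical sanity check of that two-step trick (a sketch
I'm adding, with toy parameters chosen for illustration; a real instance
would use large primes).  It compares b_i computed directly from the
definition, by repeated squaring, against the value obtained from the
precomputed tables.

# Toy parameters (illustrative only -- a real instance uses large primes).
p, q = 7, 19                 # both congruent to 3 mod 4
N, phi = p * q, (p - 1) * (q - 1)
s = 11 * 11 % N              # seed: a quadratic residue mod N, coprime to N

def bbs_bit_direct(i):
    """b_i straight from the definition: square the seed i times mod N, take the LSB."""
    t = s
    for _ in range(i):
        t = t * t % N
    return t & 1

# Values "wired into" the circuit: 2^(2^k) mod phi(N) and s^(2^k) mod N.
n_bits = 8                   # the circuit's input i is an n_bits-bit index
exp_table = [pow(2, 2 ** k, phi) for k in range(n_bits)]
pow_table = [pow(s, 2 ** k, N) for k in range(phi.bit_length())]

def bbs_bit_precomputed(i):
    """b_i via the two steps from the notes: j = 2^i mod phi(N), then s^j mod N.
    Each step multiplies together a subset of precomputed values, selected by
    the bits of its input -- an iterated modular product, which NC^1 can do."""
    j = 1
    for k in range(n_bits):
        if (i >> k) & 1:
            j = j * exp_table[k] % phi
    out = 1
    for k in range(len(pow_table)):
        if (j >> k) & 1:
            out = out * pow_table[k] % N
    return out & 1

assert all(bbs_bit_direct(i) == bbs_bit_precomputed(i) for i in range(2 ** n_bits))
print("direct squaring and the precomputed two-step method agree on all indices")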
=========Probably save for next time====================================
Now we can use this to show that DFAs are hard to learn in the PAC model
when you don't have membership queries.

Idea: given an NC^1 circuit f over X = {0,1}^n, we will create a DFA g
over Y = {0,1,#}*.  We will also define a simple mapping h: X -> Y (the
mapping will be x -> x#x#x#...#x).  The DFA g will have the property that
g(h(x)) = f(x).  The number of copies of x will be 2^depth(f).

-> First, how would you create a DFA for a function like
   x_2 OR not(x_4)?  This is easy.  In fact, to help us later, we can do
   it so that the DFA always ends up in either a specific accept state A
   or a specific reject state R.

-> Given a DFA for the function f and a DFA for the function g, how do we
   create a DFA for "f AND g", assuming we're allowed to replicate the
   input?  This is easy too: the "#" transition from f's R state goes
   directly to g's R state, and the "#" transition from f's A state goes
   to g's start state.  So we double the length of the input needed each
   time we go up one level in the circuit.

That's it.  Notice that we have *not* shown that DFAs are hard to learn
with membership queries (a good thing, too, since we are going to give an
algorithm that learns DFAs using membership queries).  The reason is that
only queries of the form "x#x#x#x#x#x#x" actually map back to legal
queries for the original NC^1 circuit.
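To make the construction concrete, here is a small simulation sketch (my
own encoding and state names, not from the notes; the base case here is a
single literal rather than the two-literal formula above).  It builds two
literal DFAs, applies the "#"-transition AND-composition described above,
and checks g(h(x)) against x_2 AND not(x_4) on every input for a small n.

from itertools import product

def literal_dfa(n, k, positive=True):
    """DFA that reads one n-bit copy of x and ends in state ('A',) iff bit k
    (0-indexed) satisfies the literal, otherwise in ('R',).  States (i, out)
    record how many bits were read and the outcome once bit k has been seen."""
    A, R = ('A',), ('R',)
    trans = {}
    for i in range(n):
        for out in (None, A, R):
            if (out is None) != (i <= k):
                continue                         # outcome is known exactly after bit k
            for b in '01':
                nxt_out = (A if (b == '1') == positive else R) if i == k else out
                trans[((i, out), b)] = nxt_out if i + 1 == n else (i + 1, nxt_out)
    for c in '01#':
        trans[(R, c)] = R                        # R absorbs whatever comes after a rejection
    return {'trans': trans, 'start': (0, None), 'A': A, 'R': R, 'copies': 1}

def relabel(dfa, tag):
    ren = lambda s: (tag, s)
    return {'trans': {(ren(u), c): ren(v) for (u, c), v in dfa['trans'].items()},
            'start': ren(dfa['start']), 'A': ren(dfa['A']), 'R': ren(dfa['R']),
            'copies': dfa['copies']}

def and_dfa(f, g):
    """The AND trick from the notes: f reads its copies of x first; on '#',
    f's A state hands off to g's start, and f's R state jumps straight to g's R."""
    f, g = relabel(f, 'f'), relabel(g, 'g')
    trans = {**f['trans'], **g['trans']}
    trans[(f['A'], '#')] = g['start']
    trans[(f['R'], '#')] = g['R']
    return {'trans': trans, 'start': f['start'], 'A': g['A'], 'R': g['R'],
            'copies': f['copies'] + g['copies']}

def accepts(dfa, w):
    state = dfa['start']
    for c in w:
        state = dfa['trans'][(state, c)]
    return state == dfa['A']

# g computes x_2 AND not(x_4) (1-indexed, as in the notes) on input h(x) = x#x.
n = 5
g = and_dfa(literal_dfa(n, 1), literal_dfa(n, 3, positive=False))
for bits in product('01', repeat=n):
    x = ''.join(bits)
    w = '#'.join([x] * g['copies'])
    assert accepts(g, w) == (x[1] == '1' and x[3] == '0')
print("g(h(x)) agrees with x_2 AND not(x_4) on all", 2 ** n, "inputs")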