15-859(B) Machine Learning Theory            04/03/06
Membership Query learning, contd
* Recap of where we are now
* Fourier spectrum of Decision Trees and DNF
* Bshouty's algorithm for learning decision trees and XORs of terms
=======================================================================
Recap of where we are now: 
   - Model is we have an unknown function f from {0,1}^n -> {-1,1},
     which we can ping with inputs x and get out f(x).  Our goal is to come
     up with an approximation g which agrees with f on most of the domain.

   - Fourier idea: if think of function as vector of its values,
     scaled by sqrt(1/2^n), then <f,g> = Pr(f(x)=g(x))-Pr(f(x)!=g(x)) =
     correlation of f and g.  There are 2^n parity functions phi_v:
     x_1, x_2, x_1+x_2,... that are all pairwise uncorrelated, so they
     form a basis.  This means we can write any function f as: f(x) =
     sum_v hat{f}_v phi_v(x), where  hat{f}_v = <f,phi_v>.

   - KM algorithm: find all large (magnituge > 1/poly) fourier coefficients
     over parity basis.  Alg idea: estimate sum of squares of all
     fourier coefficients corresponding to a given prefix.  Trick of
     using E[f(yx)f(zx)].  Grow a tree, expand nodes with value > tau. 

   - If most of "length" of f is in large fourier coefficients, then
     get a good approx wrt uniform distribution.  If get at least
     some non-negligable coefficient, then have weak-learning.

On the Fourier Spectrum of DNF and Decision Trees
=================================================
Consider a single conjunction (not necessarily monotone).  Let T(x) =
1 if x satisfies the conjunction and T(x)=0 otherwise.  So, it's not a
boolean fn in our sense.  In fact <T,T> = E[T(x)T(x)] = 1/2^|T|.
If we analyze the fourier coefficients (which we did last time) we
get:
        hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1]*Pr[T(x)=1]

                  = 0 if phi_v has a relevant variable outside T (by
		      usual parity argument) 
                  = (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the
		    value of phi_v on an example satisfying T.

         E.g., so it's very simple.  Sum of absolute values is 1.  Sum
         of squares is 1/2^|T| (which we knew already)

Analyzing Decision Trees
========================
Suppose f is a decision tree.  Using the terms corresponding to
positive leaves, we can write f as  

	 f(x) = [T_1(x) + ... + T_m(x)]*2 - 1

So, to get the fourier coefficients hat(f)_v, we just add up for the
corresponding functions.  One easy thing to notice is that each
function here has a sum of absolute values of coefficients equal to 1
(even the function "-1" which has coefficient 1 along the empty parity
function and 0 everywhere else).  So, really crudely, this immediately
means that
	   sum_v |hat(f)_v| <= 2m+1
where m is the number of positive leaves.

Claim: this fact by itself implies that f must have most of its length
in large fourier coefficients.   In particular, any vector f with L_2
length 1 (i.e., <f,f>=1) has the property that the sum of squares of
the coefficients less than epsilon/L_1(f) is at most epsilon, where
L_1(f) is the L_1 length of f (the sum of absolute-values of
coefficients).

Proof: a coefficient of magnitide alpha contributes alpha^2 to the sum
of squares, and alpha to the sum of absolute values.  So, the small
coefficients contribute at least a L_1(f)/epsilon factor *more* to the
latter than to the former.  Since in total they can contribute at
*most* L_1(f) to the sum of absolute values, they can contribute at
most epsilon to the sum of squares.  QED

Analyzing DNF
=============
For DNF we can't say that most of the DNF is in the large fourier
coefficients.  The problem with the above reasoning is that the terms
aren't disjoint so we can't just add them up.  But we *can* say that a
DNF formula must have at least one reasonably-sized fourier
coefficient.

Claim: any m-term DNF f has a fourier coefficient of magnitide at least 1/(4m).

Proof: We can assume f has a term T of size at most lg(4m), else the 
the probability that a random example is positive for f is at most
1/4, and so the empty "all negative" parity function has good correlation.

Now, here's the point: f has a noticeable correlation with T, because
when T=1, f=1 too, and when T=0 it doesn't matter.  Formally, <f,T> =
E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).
Now, we use what we know about T to argue f has reasonable fourier
coefficients too.  Formally,

   <f,T> = sum_v hat(f)_v * hat(T)_v.

Now, using what we know about hat(T)_v, we get:

   <f,T> = sum_{v with rel vars inside T} hat(f)_v * (1/2^|T|)phi_v(T)

and we know this equals 1/2^|T|.  So, we get:

    sum_{v with rel vars inside T} hat(f)_v * phi_v(T) = 1.

so at least one of them has magnitude at least 1/(4m).

======================================================
Can we use MQs to learn DTs wrt arbitrary distributions?

To put this into context: It turns out you can show that a bad
distribution can make MQs worthless for learning DNF, if crypto-hard
functions exist.  But this doesn't go through for decision trees.
Problem solved by Bshouty: algorithm for learning DTs in PAC+MQ model.

Note: algorithm won't be very practical in the sense that not only
will it use MQs, but it will also be very sensitive to noise.  (in
contrast, the Fourier algorithm would still be reasonable if there was
random noise, since this would affect the correlations in a reasonable way).

Bshouty's algorithm
===================
Here is a simpler version of Bshouty's original algorithm, from an
email message that he sent me (not sure if this simpler version is in
print).   You'll see some similarities to KM.

In fact, algorithm works for a larger class: "XOR of terms".  We'll
assume target can be written as f = T_1 XOR T_2 XOR ... XOR T_m, where
T_i are conjunctions.

Why is this a generalization of DTs?

We're going to solve by building a decision-tree-like-thing (a "parity
decision graph"?) of depth n and width O(m).  So, by Occam results,
just need to take some initial sample S of size O(nm/epsilon), and then use
our membership queries to find a consistent hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2
at the next level, etc., until we get to > m leaves.  This will happen
at some depth d = log(m).

Define T_1', ..., T_m' to be what you get by taking the
original terms T_1,...,T_m and removing all mention of variables
x_1,...,x_d from them.  

So, if we look at all leaves of our tree so far, the value of the
target function, given that the example takes you to that leaf, can be
written as an XOR of some subset of the T' terms.

For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4)

and say d=2.  Then the leaves look like:

        (T1' XOR T3'), T1', (T2' XOR T3'), T2'.

Now, the claim is that since there are > m leaves, some of these must
be XORs of others.  In fact, there must exist some set L of at most m
leaves, such that any other leaf is an XOR of some subset of leaves in
L. (L is a maximal linearly-independent set mod 2.) E.g., above any
leaf is the XOR of the other three leaves.

So the plan is now:
1. (somehow) find such a set L, and write down for other leaves how
        they can be written as an XOR of leaves in L.

  (oops: the description length of a level could be as high as m^2, so
   let's have S be of size nm^2/epsilon to be safe....)

2. keep building the tree down from leaves in L until have more than m
  leaves again, and repeat.

3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1:  For each example, vary the first d bits (using MQs) to
make the example reach all active leaves.  ==> Get a vector of
outputs.  I.e., what would the output have been if first d bits were
set to take you to leaf i. 

If we write this out for all the examples, we get a big matrix with 1
row per example, where the set L corresponds to a set of <= m columns
such that every other column is some linear combination of these (mod
2).  Find these with Gaussian elimination.  (What's a little confusing
here is: say this tells us that leaf 1 is the XOR of leaves 2,3,5.  We
don't really know if this is true over ALL of {0,1}^n.  We just know
it's true over examples with suffixes matching S.  But, that's OK
since our goal is just to find a hypothesis consistent with S.)

To do 3: Follow example down the tree.  When you get to an "inactive"
node, then replace position by the *set* of positions listed in the node.
So, our "current location" in the tree is now a set of nodes.  More
generally, if our current set of nodes is N and say node y in N is
inactive, then look at set (N - {y}) XOR (positions listed at y).

Finally, as we go down the tree, the T' terms will become constant 1
or 0, and we just take the XOR.

Why is this legal?  We can see this by arguing bottom-up: Say (as
above) that L is the set of active nodes at the current level, and
inductively assume that the trees rooted at nodes of L are correct on all
examples produced by taking some x in S and replacing its prefix by
whatever is needed to reach that node.  Then (by construction) the
inactive nodes (the ones not in L) are also correct on examples
produced by taking some x in S and replacing its prefix with whatever
was needed to reach them, too.  Therefore, all *parents* of this level
are correct on examples produced by taking some x in S and replacing
prefixes. And so on.

========================================================
Summary of learnability results
===============================
PAC+MQ: (arbitrary distribution, with Memb queries)
  - monotone DNF
  - decision trees
  - XORs of terms
  - DFAs (target is a DFA of n states.  Inputs are strings.  Will do
    this next class)

MQ+uniform (memb queries.  learning wrt uniform distribution)
  - general DNF formulas

No queries, but allow n^polylog(n) time:
  - decision trees
  - DNF over uniform distribution  AC^0 too [didn't do this in class]

No queries, polynomial time:
  - decision lists, LTFs, k-CNF, k-DNF.
========================================================