15-859(B) Machine Learning Theory                              03/24/08

Membership Query learning, contd
 * Fourier spectrum of Decision Trees and DNF
 * Bshouty's algorithm for learning decision trees and XORs of terms
=======================================================================

From last time: We are looking at learning a function f: {0,1}^n -> {-1,1},
where our goal is to come up with an approximation g that agrees with f on
most of the domain (i.e., we want to do well with respect to the uniform
distribution).  In the Fourier view, think of a function as a vector in a
2^n-dimensional space.  In this view, any boolean function corresponds to a
vector of L_2 length equal to 1.

Last time we saw that an interesting quantity to look at is the L_1 length
of the target under the parity basis.  (Note: the L_2 length doesn't depend
on the basis, but the L_1 length *does* depend on the basis.)  As a quick
sanity check: what are the smallest and largest possible L_1 lengths for a
unit L_2-length vector?  [Smallest = 1.  Largest = sqrt(2^n).]

We proved:

Thm: any boolean function f must have most of its L_2 length in Fourier
coefficients of magnitude at least epsilon/L_1(f):

   sum_{v: |hat{f}_v| < epsilon/L_1(f)} (hat{f}_v)^2 < epsilon.

KM algorithm: if you have the ability to make membership queries, you can
find all large (magnitude > 1/poly) Fourier coefficients over the parity
basis.  Algorithm idea: estimate the sum of squares of all Fourier
coefficients corresponding to a given prefix, using the trick of estimating
E[f(yx)f(zx)].  Grow a tree, expanding nodes with value > tau.  Combining
this with the structural statement above, we can learn any function in
time/queries polynomial in L_1(f).


On the Fourier Spectrum of DNF and Decision Trees
=================================================

Consider a single conjunction (not necessarily monotone).  Let T(x) = 1 if
x satisfies the conjunction and T(x) = 0 otherwise.  So, it's not a boolean
function in our sense.  In fact, <T,T> = E[T(x)T(x)] = 1/2^|T|.  If we
analyze the Fourier coefficients we get:

   hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1] * Pr[T(x)=1]
            = 0 if phi_v has a relevant variable outside T
                 (by the usual parity argument)
            = (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the value of
                 phi_v on an example satisfying T.

So the spectrum is very simple.  The sum of absolute values of the
coefficients is 1, and the sum of squares is 1/2^|T| (which we knew
already).


Analyzing Decision Trees
========================

Suppose f is a decision tree.  Using the terms corresponding to the positive
leaves, we can write f as

   f(x) = [T_1(x) + ... + T_m(x)]*2 - 1.

So, to get the Fourier coefficients hat{f}_v, we just add up the
corresponding coefficients of these functions.  One easy thing to notice is
that each function here has a sum of absolute values of coefficients equal
to 1 (even the function "-1", which has coefficient 1 on the empty parity
function and 0 everywhere else).  So, really crudely, this immediately means
that

   sum_v |hat{f}_v| <= 2m+1,

where m is the number of positive leaves.  So, combining this with KM, we
can learn decision trees with membership queries in time polynomial in the
size of the tree.
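
As a sanity check of the two calculations above, here is a small brute-force
Python sketch (the choice of n, the particular terms, and the encoding of a
conjunction as a {variable index: required bit} dictionary are illustrative
assumptions, not notation from the lecture).  It verifies that a single
conjunction has coefficients of magnitude 0 or 1/2^|T| summing in absolute
value to 1, and that a small 2-positive-leaf tree obeys the 2m+1 bound.

    from itertools import product, combinations

    n = 5
    cube = list(product([0, 1], repeat=n))          # all of {0,1}^n

    def phi(v, x):
        """Parity basis function phi_v(x) = (-1)^(sum of x_i over i in v)."""
        return -1 if sum(x[i] for i in v) % 2 else 1

    def fourier(f):
        """All 2^n Fourier coefficients hat{f}_v = E_x[f(x)*phi_v(x)]."""
        return {v: sum(f(x) * phi(v, x) for x in cube) / 2**n
                for r in range(n + 1) for v in combinations(range(n), r)}

    def term(x, t):
        """Conjunction encoded as {index: required bit}; returns 0/1."""
        return 1 if all(x[i] == b for i, b in t.items()) else 0

    T = {0: 1, 2: 0}                                # T = x_0 AND not(x_2)
    cT = fourier(lambda x: term(x, T))
    assert abs(sum(abs(c) for c in cT.values()) - 1) < 1e-9       # L_1(T) = 1
    assert all(abs(c) < 1e-9 or abs(abs(c) - 1/2**len(T)) < 1e-9  # 0 or 1/2^|T|
               for c in cT.values())

    # decision tree with m=2 positive leaves, written as 2*(T_1 + T_2) - 1
    T1, T2 = {0: 1, 1: 1}, {0: 0, 2: 1}             # disjoint (differ on x_0)
    cf = fourier(lambda x: 2*(term(x, T1) + term(x, T2)) - 1)
    print(sum(abs(c) for c in cf.values()), "<= 2m+1 =", 2*2 + 1)
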

Analyzing DNF
=============

For DNF we can't say that most of the function sits in the large Fourier
coefficients.  The problem with the above reasoning is that the terms aren't
disjoint, so we can't just add them up.  But we *can* say that a DNF formula
must have at least one reasonably-sized Fourier coefficient.

Claim: any m-term DNF f has a Fourier coefficient of magnitude at least
1/(4m).

Proof: We can assume f has a term T of size at most lg(4m), since otherwise
the probability that a random example is positive for f is at most 1/4, and
so the empty "all negative" parity function already has good correlation
with f.  Now, here's the point: f has a noticeable correlation with T,
because when T=1 we have f=1 too, and when T=0 it doesn't matter.  Formally,

   <f,T> = E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).

Now, we use what we know about T to argue that f has a reasonable Fourier
coefficient too.  Formally, <f,T> = sum_v hat{f}_v * hat{T}_v.  Using what
we know about hat{T}_v, we get

   <f,T> = sum_{v with rel vars inside T} hat{f}_v * (1/2^|T|)*phi_v(T),

and we know this equals 1/2^|T|.  So, we get

   sum_{v with rel vars inside T} hat{f}_v * phi_v(T) = 1.

Since this sum has only 2^|T| <= 4m terms, at least one of them must have
magnitude at least 1/(4m).  QED

======================================================

Strong-learning DNF wrt the uniform distribution: the above result gives us
weak-learning.  Jackson (CMU PhD thesis '94) showed how to use a specific
boosting algorithm, one that doesn't change the distribution too much, to
turn this into strong learning.  Key idea: in the Fourier view, we can think
of keeping the distribution fixed but having the target change and become
non-boolean.  If it doesn't change too much, we can still argue about having
a large Fourier coefficient.

======================================================

Can we use MQs to learn DTs wrt arbitrary distributions?  To put this into
context: you can show that a bad distribution can make MQs worthless for
learning DNF, if crypto-hard functions exist.  But that argument doesn't go
through for decision trees.  The problem was solved by Bshouty, who gave an
algorithm for learning DTs in the PAC+MQ model.  Note: unlike the
Fourier-based algorithms, this one will be very sensitive to noise.

Bshouty's algorithm
===================

Here is a simpler version of Bshouty's original algorithm, from an email
message that he sent me (I'm not sure if this simpler version is in print).
You'll see some similarities to KM.  In fact, the algorithm works for a
larger class: "XORs of terms".  We'll assume the target can be written as
f = T_1 XOR T_2 XOR ... XOR T_m, where the T_i are conjunctions.  Why is
this a generalization of DTs?

We're going to solve this by building a decision-tree-like thing (a "parity
decision graph") of depth n and width O(m).  So, by Occam results, we just
need to take some initial sample S of size O(nm/epsilon), and then use our
membership queries to find a consistent hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2 at the
next level, etc., until we get to > m leaves.  This will happen at some
depth d = log(m).  Define T_1', ..., T_m' to be what you get by taking the
original terms T_1,...,T_m and removing all mention of the variables
x_1,...,x_d from them.  Then, for any leaf of the tree so far, the value of
the target function, given that the example takes you to that leaf, can be
written as an XOR of some subset of the T' terms.

For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4), and
say d=2.  Then the leaves look like: (T1' XOR T3'), T1', (T2' XOR T3'), T2'.

Now, the claim is that since there are > m leaves, some of these must be
XORs of others.  In fact, there must exist some set L of at most m leaves
such that any other leaf is an XOR of some subset of the leaves in L.  (L is
a maximal linearly-independent set mod 2.)  E.g., above, any leaf is the XOR
of the other three leaves.
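
Here is a tiny brute-force Python check of this claim on the example above
(the dictionary encoding of terms and the helper names are illustrative, not
from the lecture): on every setting of x_3,...,x_6, the four depth-2 leaf
functions XOR to 0, so any one of them is the XOR of the other three.

    from itertools import product

    def term(x, lits):
        """Conjunction given as {variable index: required bit}; returns 0/1."""
        return int(all(x[i] == b for i, b in lits.items()))

    TERMS = [{1: 1, 3: 1, 5: 1},       # T1 = x_1 x_3 x_5
             {1: 0, 4: 1, 6: 1},       # T2 = not(x_1) x_4 x_6
             {2: 1, 3: 1, 4: 0}]       # T3 = x_2 x_3 not(x_4)

    def f(x):                          # f = T1 XOR T2 XOR T3
        return term(x, TERMS[0]) ^ term(x, TERMS[1]) ^ term(x, TERMS[2])

    def leaf(b1, b2, suffix):
        """Value of f at the depth-2 leaf x_1=b1, x_2=b2, on suffix x_3..x_6."""
        x = {1: b1, 2: b2}
        x.update({i + 3: bit for i, bit in enumerate(suffix)})
        return f(x)

    # the four leaf functions XOR to 0 on every suffix, so each leaf is
    # the XOR of the other three (only 3 of the 4 are independent mod 2)
    for s in product([0, 1], repeat=4):
        assert (leaf(1,1,s) ^ leaf(1,0,s) ^ leaf(0,1,s) ^ leaf(0,0,s)) == 0
    print("any leaf = XOR of the other three")
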
So the plan is now:

1. (somehow) find such a set L, and write down, for each other leaf, how it
   can be written as an XOR of leaves in L.  (Oops: the description length
   of a level could be as high as m^2, so let's have S be of size
   nm^2/epsilon.)

2. Keep building the tree down from the leaves in L until we have more than
   m leaves again, and repeat.

3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1: For each example in S, vary the first d bits (using MQs) to make
the example reach all the active leaves.  ==> We get a vector of at most 2m
outputs, i.e., what the output would have been if the first d bits were set
to take you to leaf i.  If we write these vectors out for all the examples,
we get a matrix with one row per example, where the set L corresponds to a
set of <= m columns such that every other column is some linear combination
of these (mod 2).  Find these with Gaussian elimination (a small sketch of
this mod-2 elimination appears at the end of these notes).

A subtle issue here: say this tells us that leaf 1 is the XOR of leaves
2, 3, 5.  We don't really know if this is true over ALL of {0,1}^n.  We just
know it's true for strings whose suffixes match examples in S and whose
prefixes match the paths down to the current nodes.  But that's OK, since
our goal is just to find a hypothesis consistent with S, and we will
maintain inductively (see step 3 below) that in following the decision graph
to classify an example in S, we are always looking at strings with that
property.

To do 3: Follow the example down the tree.  When you get to an "inactive"
node, replace your position by the *set* of positions listed at that node,
and conceptually assign the prefixes accordingly.  So, our "current
location" in the tree is now a set of nodes.  More generally, if our current
set of nodes is N and some node y in N is inactive, then move to the set
(N - {y}) XOR (positions listed at y).  Finally, as we go down the tree, the
T' terms become constant 1 or 0, and we just take the XOR.

========================================================

Summary of learnability results
===============================

PAC+MQ (arbitrary distribution, with membership queries):
 - monotone DNF
 - decision trees
 - XORs of terms
 - DFAs (target is a DFA of n states; inputs are strings.  Will do this
   next class.)

MQ + uniform (membership queries, learning wrt the uniform distribution):
 - general DNF formulas

No queries, but allow n^polylog(n) time:
 - decision trees
 - DNF over the uniform distribution; AC^0 too [didn't do this in class]

No queries, polynomial time:
 - decision lists, LTFs, k-CNF, k-DNF

========================================================
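
Appendix: as referenced in step 1 of Bshouty's algorithm above, here is a
minimal Python sketch of the mod-2 Gaussian elimination (my own
illustration, not from Bshouty's write-up; the function name and the
bitmask encoding of columns as Python ints are assumptions).  Given the 0/1
matrix whose columns record each leaf's outputs on the sample, it greedily
picks a maximal linearly-independent set of columns L and records, for every
other column, which columns of L XOR to it.

    def xor_basis(columns):
        """Greedy Gaussian elimination over GF(2).

        `columns` are 0/1 column vectors packed into ints (bit i = row i).
        Returns (basis_idx, expansions): basis_idx lists a maximal
        linearly-independent set of column indices; expansions[j], for every
        other column j, is the set of basis indices whose columns XOR to
        column j."""
        basis = []            # (pivot bit, reduced vector, original-column combo)
        basis_idx, expansions = [], {}
        for j, col in enumerate(columns):
            v, combo = col, set()
            for pivot, vec, src in basis:
                if v & pivot:             # eliminate this entry's pivot bit
                    v ^= vec
                    combo ^= src          # track which original columns we used
            if v:                         # independent: becomes a new basis entry
                basis.append((v & -v, v, combo ^ {j}))
                basis_idx.append(j)
            else:                         # dependent: col_j = XOR of cols in combo
                expansions[j] = combo
        return basis_idx, expansions

    # toy example: column 2 equals column 0 XOR column 1
    cols = [0b0110, 0b0100, 0b0010, 0b1000]
    print(xor_basis(cols))                # -> ([0, 1, 3], {2: {0, 1}})
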