Notes for class 03/09/98

* Using membership queries to:
 - find large fourier coefficients
 - learn XOR of terms, which is a generalization of Decision Trees
=============================================================================
So far, we've been talking pretty much talking only about learning
from passive observations.  Today: start on algorithms for learning
when algorithm can be more active.  In particular, when algorithm can
request classification of examples of its own choosing.  These are
called Membership Queries.

Today, give two interesting algorithms.  The first one (by Kushilevitz
and Mansour, and also by Goldreich and Levin) is for finding all large
(1/poly) fourier coefficients of some boolean function, over parity basis.

Reminder: if A is some set of variables, then the Ath fourier
coefficient of f (over parity basis) is defined as the correlation of
f with the parity function phi_A, over the uniform distribution.  If f
is a boolean function, then the sum of the coeffs squared equals 1.

Last time we showed:
 - any DNF has at least one non-negligable fourier coefficient.  This
means we can use the algorithm to WEAK-LEARN any DNF with respect to
the uniform distribution (using membership queries).  Jeff Jackson
showed how could combine with boosting to STRONG-LEARN with respect
to the uniform distribution

 - if can find "most" of f (fourier coeffs whose sum of squares is
almost 1) then have a good approx of f, with respect to uniform
distribution.   This is nice since, for instance, any poly size
DECISION TREE can be approximated by its large fourier coefficients.

   Rough Argument: 
    Fourier coeff is zero for any parity that is not a subset of some
	path to leaf. (prove by induction.  For instance, look at tree
	with x1 at root and x2 and x3 as children, and the parity
	function x1+x2+x3).  Also, very deep leaves don't affect
        coeffs much since unlikely a random example will go that far.
	So, there is a relatively small number of parities that are
	sharing most of the function.

KM algorithm: how to find the large fourier coeffs
--------------------------------------------------

Let's say our goal is to find all \hat{f}_A such that (\hat{f}_A)^2 > \tau.

Say S is the set of all parity functions, so the sum, over A in S, of
the (\hat{f}_A)^2 equals 1.  Suppose we could split S into two
equal-sized sets S_0 and S_1 and for each one, ask for the sum of
squares of the \hat{f}_A in that set.  And, suppose we could do this
recursively.  Then what?  

Then we could build up a tree, splitting each node whose answer is >
tau, and ignoring those < tau.
 - max width of tree is 1/tau.  
 - Just keep going to leaves at depth n.  
 => Only n/tau questions to ask.

So, all boils down to: how can we ask for the sum of squares of the
\hat{f}_A's inside these sets?  First of all, to be able to do this,
we're going to need to split these sets in a particular way.
Specifically, we'll do this by fixing a prefix of the variables, and
then the set will be "all parities consistent with this prefix".  For
instance,
	S_0 = {A: A doesn't include x_1}
	S_1 = {A: A does include x_1}
Then, will split S_0 into
	S_00 = {A: A doesn't include x_1 or x_2}
	S_01 = {A: A doesn't include x_1 but does include x_2}
etc.

In general, our goal is now to estimate, for some string alpha of 0s
and 1s, the sum of the squares of the \hat{f}_A's for all A in
S_alpha.

(Note: we won't find the sum exactly.  We'll find an estimate of the
sum.  But that's fine.  It just means we throw out nodes whose
estimate is < tau/2 instead of those whose estimate is < tau)

There's now a neat trick for doing this.  Let's do case of alpha = 00000
(a string of k zeros).  So, we want the sum of squares over all
parities that don't include any of first k variables.  Remember, f is
a {-1, +1}-valued function.

The claim is that this sum just happens to equal the expected value,
over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx).

I.e. pick random suffix x.  Pick two random prefixes y and z and
multiply f(yx) by f(zx).  Repeat many times and take the average. 
Intuition: independent y and z will cancel out correlations we don't
want.  E.g., what if f was a parity function?  If f agrees with alpha,
then f(yx)=f(zx) so E[f(yx)f(zx)] = 1.  If f doesn't agree with alpha,
then Pr[f(yx)=f(zx)] = 1/2, so E[f(yx)f(zx)] = 0.

Now, since any f can be written as a sum of parities, hopefully we can
reduce the case of general f to the case of "f is a parity function".....
It's actually almost but not quite so easy.  Question we need to
answer is: for two (possibly different) sets A and B, what is

	E [ phi_A(yx) * phi_B(zx) ]  ?
      x,y,z

Claim:  if A contains (at least) one of first k bits then this is 0
	if B contains (at least) one of first k bits then this is 0
	else if A != B, then this is 0
	else (A = B and they don't contain any of first k bits) then this is 1.

This is enough.  Now we get:

	E[f(yx)f(zx)]
	 = E[ (sum_A \hat{f}_A \phi_A(yx)) (sum_B \hat{f}_B \phi_B(zx))]
	 = what we want.

So, we're done for the case of alpha = "all zeros".  What about other
alpha (i.e., requiring parity to have a certain fixed dependence on
the first k bits)?  Since we know the dependence (alpha) just "undo
it".  Specifically, we instead ask for:

	E[f(yx)f(zx)\phi_{\alpha}(y)\phi_{\alpha}(z)]

where phi_alpha is the parity of the variables set to 1 in alpha. 

That's it.  qed. Ta da.

Actually, one technical point: we need to estimate an expectation.
We'll do it by sampling.  OK since the quantity inside the expectation
is bounded (-1 or +1). In Jeff Jackson's result about DNF, what he
shows is that instead of modifying the distribution, you can
modify f to a non-boolean function.  Then, need to argue that (based
on how the boosting results work) f doesn't get too large, so you can
still estimate f(yx)f(zx) from sampling.
============================================================================

Here's another Membership query algorithm that in some ways is more
powerful than the above.  Use to learn decision trees over *any*
distribution.  

"PAC + MQ" model: draw from D, and also can ask MQs.

In fact, algorithm works for a larger class: "XOR of terms".
We'll assume target can be written as f = T_1 XOR T_2 XOR ... XOR T_t,
where T_i are conjunctions.

Why is this a generalization of DTs?  There's a trick to this....

We're going to solve by building a decision-tree-like-thing (a "parity
decision graph"?) of depth n and width O(t).  So, by Occam results,
just need to take some initial sample S of size O(nt/epsilon), and then use
our membership queries to find a consistent hypothesis of size O(nt).

Idea: let's just build a decision tree with x_1 at the root, then x_2
at the next level, etc., until we get to > t leaves.  This will happen
at some depth d = log(t).

Define \hat{T}_1, ..., \hat{T}_t to be what you get by taking the
original terms T_1,...,T_t and removing all mention of variables
x_1,...,x_d from them.  

So, if we look at all leaves of our tree so far, the value of the
target function, given that the example takes you to that leaf, can be
written as an XOR of some subset of the \hat{T}'s.

For example, say f = x_1x_3x_5 XOR not(x_1)x_4x_6 XOR x_2x_3not(x_4)

and say d=2.  Then the leaves look like ((add hats to these)):

	(T1 XOR T3), T1, (T2 XOR T3), T2.

Now, the claim is that since there are > t leaves, some of these must
be XORs of others.  In fact, there must exist some set L of at most t
leaves, such that any other leaf is an XOR of some subset of leaves in
L.  E.g., above any leaf is the XOR of the other three leaves.

So the plan is now:
1. (somehow) find such a set L, and write down for other leaves how
	they can be written as an XOR of leaves in L.

  (oops: the description length of a level could be as high as t^2, so
   let's have S be of size nt^2/epsilon to be safe....)

2. keep building the tree down from leaves in L until have more than t
  leaves again, and repeat.

3. (somehow) understand what this means in terms of a final hypothesis.


Let's do 1:  For each example, vary the first d bits (using MQs) to
make the example reach all active leaves.  ==> Get a vector of
outputs.  I.e., what would the output have been if first d bits were
set to take you to leaf i. 

If we write this out for all the examples, we get a big matrix where
the set L corresponds to a set of <= t columns such that every other column
is some linear combination of these (mod 2).  Find these with Gaussian
elimination.  (What's a little confusing here is: say this tells us
that leaf 1 is the XOR of leaves 2,3,5.  We don't really know if this
is true over ALL of {0,1}^n.  We just know it's true over examples
with suffixes matching S.  But, that's OK since our goal is just to
find a hypothesis consistent with S.)


To do 3: Follow example down the tree.  When you get to an "inactive"
node, then replace position by the *set* of positions listed in the node.
So, our "current location" in the tree is now a set of nodes.  More
generally, if our current set of nodes is N and say node y in N is
inactive, then look at set (N - {y}) XOR (positions listed at y).


Why is this legal?  We can see this by arguing bottom-up: Say (as
above) that L is the set of active nodes at the current level, and
inductively assume that the trees rooted at nodes of L are correct on all
examples produced by taking some x in S and replacing its prefix by
whatever is needed to reach that node.  Then (by construction) the
inactive nodes (the ones not in L) are also correct on examples
produced by taking some x in S and replacing its prefix with whatever
was needed to reach them, too.  Therefore, all *parents* of this level
are correct on examples produced by taking some x in S and replacing
prefixes. And so on.