15-859(B) Machine Learning Theory                                03/29/06

* Learning with Membership Queries
  - Some simple algorithms: monotone DNF
* KM algorithm for finding large Fourier coefficients
=========================================================================

Learning with Membership Queries
================================
So far we have focused on learning in a passive model, where the learning
algorithm has no control over the examples it sees.  It is also natural to
consider what happens if we allow a learning algorithm to ask for labels on
examples of its own choosing.  These are called "Membership Queries" (MQs).
E.g., these are natural if we think of the target function as some device
that we are trying to reverse-engineer.  Also, at the very least, this
allows for more interesting algorithms, since there is more for the
algorithm to do.

We can define the following models:

PAC + MQ:  Can sample from D and ask MQs.  Goal: low error wrt D.

Unif + MQ: Special case D = uniform.  Can view the setting as: you have a
           black box that you can prod as you like, and you want an h that
           agrees with the target on most of the domain.

MB + MQ:   Mistake-bound setting with queries.  Pay $1 per mistake and $1
           per query.  Also called the "equivalence + membership query"
           model.

Exact learning with MQ only: need to recover the target exactly using only
           queries.  Pretty tough.  E.g., can't even do conjunctions (why?).
           But you *can* do parity (how?)

Cases where MQs make sense: trying to "break" the encryption on a smart-card
[mention Angluin's DFA algorithm], active exploration of an environment,
converting a complicated hypothesis into a more understandable form.

Cases where MQs don't make so much sense: classifying documents/images (it
doesn't make sense to "walk" a document about topic X towards a document
about topic Y and ask at what point the topic changes).

Plan for the next few classes:
 - Some simple algorithms.
 - MQ algorithms based on Fourier analysis (for learning wrt uniform).
 - Bshouty's algorithm: learn Decision Trees in the PAC+MQ (or MB+MQ) model.
 - Angluin's algorithm for learning Finite Automata.  Extensions due to
   Rivest & Schapire.

Angluin's algorithm for learning Monotone DNF in the MB+MQ model
=================================================================
* Start with h(x) = "all-negative".
* If we predict negative on a positive example x:
  - "Walk" x down to a minimal positive example (go left to right along x,
    flipping 1's to 0's so long as the example stays positive).
  - The conjunction of the variables set to 1 in this minimal example must
    be a term of c, so put it into h.
* Since we maintain that all terms of h are also terms of c, we never make
  a mistake on a negative example.

The total number of mistakes is at most the number of terms in the target,
and the number of queries is at most n per term.
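Here is a minimal Python sketch of this procedure; the membership oracle
"mq" and the representation of examples as 0/1 lists are assumptions made
for illustration, not part of the notes.

    def walk_down(x, mq):
        """Walk a positive example x (a 0/1 list) down to a minimal positive
        example: scan left to right, flipping 1's to 0's as long as the
        membership oracle mq still says "positive".  Return the new term as
        the set of indices of the variables left set to 1.
        (mq is an assumed membership-query oracle returning True/False.)"""
        x = list(x)
        for i in range(len(x)):
            if x[i] == 1:
                x[i] = 0
                if not mq(x):    # flipping made it negative, so this
                    x[i] = 1     # variable is needed: flip it back
        return frozenset(i for i, bit in enumerate(x) if bit == 1)

    def predict(h, x):
        """h is a set of terms (each a frozenset of variable indices);
        predict positive iff some term of h is satisfied by x."""
        return any(all(x[i] == 1 for i in term) for term in h)

    def handle_mistake(h, x, mq):
        """Called when predict(h, x) was negative but x is actually
        positive: add the term obtained by walking x down."""
        h.add(walk_down(x, mq))

Each call to walk_down makes at most n membership queries, matching the
per-term query bound above.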
Fourier-based MQ algorithms
===========================
We know that any f over {0,1}^n can be written as a sum of parity functions:

   f(x) = \sum_i \hat{f}_i \phi_i(x),

where \phi_i is the ith parity function and \hat{f}_i is the correlation of
f with \phi_i under the uniform distribution.

Claim: using MQs, we can find all i such that (\hat{f}_i)^2 > tau in time
poly(n, 1/tau).

Kushilevitz-Mansour (Goldreich-Levin) algorithm:
------------------------------------------------
Let's index parity functions by their indicator vectors.  Our goal is to
find all v such that (\hat{f}_v)^2 > tau.  Say S is the set of all indicator
vectors; the sum, over v in S, of (\hat{f}_v)^2 equals 1.

Suppose we split S into
   S_0 = {v : v begins with 0}
   S_1 = {v : v begins with 1}
and suppose for each of these we could ask for the sum of squares of the
\hat{f}_v in that set.  And suppose we could do this recursively, e.g., ask
for the sum of squares in:
   S_00 = {v : v begins with 00}
   S_01 = {v : v begins with 01}

Then what?  Then we could build up a tree, splitting each node whose answer
is > tau and ignoring those that are < tau.
 - The max width of the tree is 1/tau.
 - Just keep going to leaves at depth n.
 => Only about n/tau questions to ask.

So, it all boils down to: how can we ask for the sum of squares of the
\hat{f}_v's inside these sets?

In general, our goal is now to estimate, for some string alpha of 0s and 1s,
the sum of the squares of the \hat{f}_v's over all v that begin with alpha.
(Note: we won't find the sum exactly; we'll find an estimate of it.  But
that's fine: it just means we throw out nodes whose estimate is < tau/2
instead of those whose estimate is < tau.)

There's a neat trick for doing this.  Let's do the case alpha = 00000 (a
string of k zeros).  So, we want the sum of squares over all parities that
don't include any of the first k variables.  Remember, f is a {-1,+1}-valued
function.  The claim is that this sum just happens to equal the expected
value, over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx).
I.e., pick a random suffix x, pick two random prefixes y and z, and multiply
f(yx) by f(zx).  Repeat many times and take the average.

Intuition: independent y and z will cancel out the correlations we don't
want.  E.g., what if f were a parity function?  If f agrees with alpha
(has no relevant variables among the first k), then f(yx) = f(zx), so
E[f(yx)f(zx)] = 1.  BUT, if f DOESN'T agree with alpha, then
Pr[f(yx) = f(zx)] = 1/2, so E[f(yx)f(zx)] = 0.

Now, since any f can be written as a sum of parities, hopefully we can
reduce the case of general f to the case "f is a parity function"...  It's
actually almost, but not quite, that easy.  The question we need to answer
is: for two (possibly different) vectors v, w, what is

   E_{x,y,z}[ \phi_v(yx) * \phi_w(zx) ] ?

Claim: this is 0 if v != w, and is also 0 if v = w but v is not all 0s on
the first k bits.  If v = w and v is all 0 on the first k bits, then the
expectation is 1.

Proof: [go through the cases]

This is enough.  Now we get:

   E[f(yx)f(zx)] = E[ (\sum_v \hat{f}_v \phi_v(yx)) (\sum_w \hat{f}_w \phi_w(zx)) ]
                 = sum of squares of the coefficients starting with k zeros.

So, we're done for the case alpha = "all zeros".  What about other alpha
(i.e., requiring the parity to have a certain fixed dependence on the first
k bits)?  Since we know the dependence (alpha), just "undo it".
Specifically, we instead ask for:

   E[ f(yx) f(zx) \phi_alpha(y) \phi_alpha(z) ],

where \phi_alpha is the parity of the variables set to 1 in alpha.  (Then we
get the same as in the above Claim, except that the non-zero case is when
v = w and they equal alpha on the first k bits.)  That's it.

Actually, one technical point: we need to estimate an expectation.  We'll do
it by sampling.  This is OK since the quantity inside the expectation is
bounded (it is -1 or +1).

In Jeff Jackson's result about DNF, what he shows is that instead of
modifying the distribution, you can modify f to a non-boolean function.
Then you need to argue (based on how the boosting results work) that f
doesn't get too large, so you can still estimate E[f(yx)f(zx)] by sampling.
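To make this concrete, here is a small Python sketch of the two ingredients:
estimating the Fourier weight of a prefix "bucket" via
E[f(yx)f(zx)\phi_alpha(y)\phi_alpha(z)], and growing the tree of surviving
prefixes down to depth n.  The oracle name "f_mq" (an assumed membership-
query oracle returning -1 or +1) and the fixed sample count are illustrative
choices; a real implementation would pick the number of samples based on tau
and the desired confidence.

    import random

    def estimate_bucket(f_mq, alpha, n, num_samples=1000):
        """Estimate the sum of (hat f_v)^2 over all v that begin with the
        prefix alpha (a 0/1 list of length k), for a {-1,+1}-valued f
        accessed via membership queries f_mq(x) on 0/1 lists of length n.
        (num_samples is an illustrative fixed choice.)"""
        k = len(alpha)
        total = 0.0
        for _ in range(num_samples):
            x = [random.randint(0, 1) for _ in range(n - k)]   # random suffix
            y = [random.randint(0, 1) for _ in range(k)]       # random prefix
            z = [random.randint(0, 1) for _ in range(k)]       # another prefix
            # phi_alpha(y), phi_alpha(z): parity (+/-1) of the bits of y
            # (resp. z) in the positions where alpha is 1
            phi_y = (-1) ** sum(a & b for a, b in zip(alpha, y))
            phi_z = (-1) ** sum(a & b for a, b in zip(alpha, z))
            total += f_mq(y + x) * f_mq(z + x) * phi_y * phi_z
        return total / num_samples

    def km_find_large_coefficients(f_mq, n, tau):
        """Grow the KM tree of prefixes: keep any child prefix whose
        estimated bucket weight exceeds tau/2; the surviving prefixes of
        length n are the indicator vectors of the large coefficients."""
        frontier = [[]]                       # start from the empty prefix
        for _ in range(n):
            next_frontier = []
            for alpha in frontier:
                for bit in (0, 1):
                    child = alpha + [bit]
                    if estimate_bucket(f_mq, child, n) > tau / 2:
                        next_frontier.append(child)
            frontier = next_frontier
        return frontier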
============================================================================

On the Fourier Spectrum of DNF and Decision Trees
=================================================
Let's start by analyzing the Fourier spectrum of a single term (conjunction)
T.  In fact, let's define the function T(x) as:

   T(x) = 1 if x satisfies term T, and T(x) = 0 otherwise.

This is a little weird since it is not a boolean function according to our
{-1,+1} definition, but it is still a legal function and will be very
helpful, since we can start adding such functions up.  E.g., since a
decision tree can be viewed as an OR of the terms leading to its positive
leaves, and since any example satisfies at most one of them, we can write

   f(x) = [T_1(x) + ... + T_m(x)]*2 - 1,

where the T_i are the terms corresponding to the positive leaves.

What does the Fourier decomposition of T look like?

   \hat{T}_v = E[T(x)*\phi_v(x)]
             = E[\phi_v(x) | T(x)=1] * Pr[T(x)=1]
             = 0 if \phi_v has a relevant variable outside T (by the usual
               parity argument)
             = (1/2^|T|)*\phi_v(T) otherwise,

where \phi_v(T) is the value of \phi_v on an example satisfying T.

So it's very simple.  The sum of the absolute values of the coefficients is
1.  The sum of their squares is 1/2^|T| (which we knew already, since the
sum of squares = E_x[T(x)^2] = Pr_x[T(x)=1]).

Analyzing Decision Trees
========================
Suppose f is a decision tree.  Using the terms corresponding to the positive
leaves, we can write f as

   f(x) = [T_1(x) + ... + T_m(x)]*2 - 1.

So, to get the Fourier coefficients \hat{f}_v, we just add up the
coefficients of the corresponding functions.  One easy thing to notice is
that each function here has sum of absolute values of coefficients equal to
1 (even the constant function "-1", whose only nonzero coefficient is on the
empty parity function and has magnitude 1).  So, really crudely, this
immediately means that

   \sum_v |\hat{f}_v| <= 2m + 1,

where m is the number of positive leaves.

Claim: this fact by itself implies that f must have most of its length in
its large Fourier coefficients.  In particular, any vector f with L_2 length
1 (i.e., whose squared coefficients sum to 1) has the property that the sum
of squares of the coefficients of magnitude less than epsilon/L_1(f) is at
most epsilon, where L_1(f) is the L_1 length of f (the sum of the absolute
values of its coefficients).

Proof: a coefficient of magnitude alpha contributes alpha^2 to the sum of
squares and alpha to the sum of absolute values.  So, each small coefficient
(one of magnitude < epsilon/L_1(f)) contributes at least a factor
L_1(f)/epsilon *more* to the latter than to the former.  Since in total the
small coefficients can contribute at *most* L_1(f) to the sum of absolute
values, they contribute at most epsilon to the sum of squares.  QED
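As a quick sanity check on these formulas, here is a brute-force Python
sketch (the helper names and the example term are chosen for illustration)
that enumerates {0,1}^n, computes the Fourier coefficients of the 0/1 term
function T, and confirms that the absolute values of the coefficients sum
to 1 while their squares sum to 1/2^|T|.

    from itertools import product

    def term_indicator(term, x):
        """0/1 indicator of the conjunction 'term' (a set of variable
        indices, e.g. {0, 2} meaning x_0 AND x_2) on the 0/1 vector x."""
        return 1 if all(x[i] == 1 for i in term) else 0

    def fourier_coefficients(g, n):
        """Brute-force Fourier coefficients of g over {0,1}^n:
        hat(g)_v = E_x[ g(x) * (-1)^(v.x) ], indexed by the vectors v."""
        points = list(product([0, 1], repeat=n))
        coeffs = {}
        for v in points:
            total = sum(g(x) * (-1) ** sum(vi & xi for vi, xi in zip(v, x))
                        for x in points)
            coeffs[v] = total / len(points)
        return coeffs

    # Example: n = 4, T = (x_0 AND x_2), so |T| = 2.
    n, term = 4, {0, 2}
    coeffs = fourier_coefficients(lambda x: term_indicator(term, x), n)
    print(sum(abs(c) for c in coeffs.values()))   # 1.0
    print(sum(c * c for c in coeffs.values()))    # 0.25 = 1/2^|T|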