15-859(B) Machine Learning Theory             03/17/10
* use of Fourier analysis for learning
* Learning via Membership queries I
* KM algorithm for finding large Fourier coefficients
=========================================================================
Recap SQ analysis
=================
- Fourier view of functions as vectors (once we fix a distribution over data)
- Showed that if target is picked randomly from a large set of
  orthogonal functions, then SQ alg is in trouble.
   - can view *any* SQ alg as basically proposing a unit vector and
     asking for correlation (dot-product) with target.
     (Can break any SQ into one part that looks like this, and one
     part that depends only on data distribution)

   - Then use the fact that in any orthogonal coordinate system, a
     unit vector can have at most t^2 coefficients that are >= 1/t.

   - This means that if target is picked from set S of orthogonal
     functions, Pr(dot-product >= 1/t) <= t^2/|S|.

   - So, if > poly(n) orthogonal functions, then can't even weak-learn
     with poly # of queries, each of tolerance 1/poly.


Using Fourier analysis for learning
===================================
- useful tool for learning wrt uniform distribution. (basis of parity fns)

- esp useful with queries.  Will talk about an alg for *finding*
  large fourier coeffs over parity basis if can prod target f with
  inputs x of our own choosing.

Some general results
====================
Suppose phi_1(x), phi_2(x), ... is an orthonomal basis of boolean functions.
(<phi_i,phi_j>=0 for all i!=j)

	f(x) = sum_i hat{f}_i phi_i(x)

hat{f}_i = correlation of f with phi_i = <f, phi_i>.

Claim 1: if you can find non-negligable fourier coefficient then you
can weak-learn.
Proof: Just directly from defn of correlation. Use phi_i or its complement.

Now, suppose we can find "most" of the fourier coefficients.  Say we
can find a set S that captures all but epsilon of the (hat{f}_i)^2.
Let g = sum_{i in S} hat{f}_i phi_i (note, this may not be a boolean
function).  Then, as vectors, we have |f-g|^2 = epsilon.

Claim 2: Pr[f(x) != sign(g(x))] <= epsilon.  So sign(g) is a good apx to f.  

Proof: f-g = (...,sqrt[Pr(x)](f(x)-g(x)),...) so |f-g|^2 = <f-g,f-g> =
E[(f(x)-g(x))^2] = epsilon.  Notice that every x where sign(g) is
wrong contributes at least 1.  So, the error of sign(g) is at most
epsilon.

In fact, can see from above argument that we don't need to get the fourier
coeffs exactly.  If someone hands us S, we can estimate the coefficients
by looking at correlations, and get a good apx to g.  That's all we need.

Learning wrt uniform distrib
============================
Above general statements hold for generic setting.  Now, focus on
learning wrt uniform distribution.  Specific basis of parity
functions.  So, view like this: we have some unknown function f on n
inputs in a box.  Can see f on uniform random pts.  Maybe we're
allowed to query f on inputs of our choosing (the Membership Query
model).  Goal, find g that approximates f in the sense that it gets
most of the domain correct.

Below: Kushilevitz-Mansour algorithm that using Membership Queries,
will find all i such that (\hat{f}_i)^2 > tau in time poly (n, 1/tau).
In the next class we will prove that any Decision Tree must have most
of its length in large fourier coefficients to get a poly-time
algorithm to learn decision trees with MQs wrt the uniform
distribution (via Claim 2 above).

Learning with Membership queries
================================
So far we have focused on learning in a passive model, where the
learning algorithm has no control over the examples it sees.  We are
now going to consider what happens if we allow a learning algorithm
to ask for labels on examples of its own construction.  These are called
"Membership Queries" (MQs).  These are natural for learning problems
where the target function is some device that we are trying to
reverse-engineer.  One thing that makes MQs interesting is now there
is a wider space of things the algorithm can do.

[Terminology note: "active learning" generally corresponds to a setting
where you have a pool of unlabeled data and can only select among them.  
MQs allow the algorithm to completely specify the example to be labeled.] 

Can define the following models:

PAC + MQ: Can sample from D and ask MQs.   Goal: low error wrt D.

    Special case D=uniform on {0,1}^n.  Can view this is you have a
    black box that you can prod as you like and you want h that agrees
    with target on most of the domain.

MB + MQ: Mistake-bound setting with queries.  Pay $1 per mistake and
    pay $1 per query.   Also called the "equivalence + membership
    query model".

Exact learning with MQ only: need to recover target exactly using only
    queries.  Pretty tough.  E.g., can't even do conjunctions (why?).
    But you *can* do parity (how?)

Let's start with a simple algorithm.

Learning Monotone DNF in MB+MQ model
====================================
Ideas?
 * Start with h(x) = "all-negative".
 * If predict negative on a positive example x:
   - "walk" x down to a minimal positive example (go left to right along
      x, flipping 1's to 0's so long as they keep the example positive).
   -  The conjunction of the variables in x must be a term of c.  So put
      it into h.
 * Since we maintain that all terms of h are also in c, we don't make
   any mistakes on negatives.

Total number of mistakes is at most # of terms in target.  Number of
queries at most n per term.

OK, now let's go to the Kushilevitz-Mansour algorithm.

Kushilevitz-Mansour (Goldreich-Levin) algorithm: 
===============================================
Let's index parity functions by their indicator vectors.
Our goal is to find all hat{f}_v such that (hat{f}_v)^2 > tau.

Say S is the set of all indicator vectors.  So the sum, over v in S,
of the (hat{f}_v)^2 equals 1.  Suppose we split S into 
	S_0 = {v: v begins with 0}
	S_1 = {v: v begins with 1}
and suppose for each of these we could ask for the sum of squares of
the hat{f}_v in that set.  And, suppose we could do this recursively.
E.g., ask for sum of squares in:
	S_00 = {v: v begins with 00}
	S_01 = {v: v begins with 01}

Then what?  Then we could build up a tree, splitting each node whose
answer is > tau, and ignoring those < tau.
 - max width of tree is 1/tau.  
 - Just keep going to leaves at depth n.  
 => Only n/tau questions to ask.

So, all boils down to: how can we ask for the sum of squares of the
hat{f}_v's inside these sets?  In general, our goal is now to
estimate, for some string alpha of 0s and 1s, the sum of the squares
of the hat{f}_v's for all v that begin with alpha.

(Note: we won't find the sum exactly.  We'll find an estimate of the
sum.  But that's fine.  It just means we throw out nodes whose
estimate is < tau/2 instead of those whose estimate is < tau)

There's now a neat trick for doing this.  Let's do case of alpha = 00000
(a string of k zeros).  So, we want the sum of squares over all
parities that don't include any of first k variables.  Remember, f is
a {-1, +1}-valued function.

The claim is that this sum just happens to equal the expected value,
over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx).

I.e. pick random suffix x.  Pick two random prefixes y and z and
multiply f(yx) by f(zx).  Repeat many times and take the
average. Intuition: independent y and z will cancel out correlations
we don't want.  E.g., what if f was a parity function?  If f agrees
with alpha, then f(yx)=f(zx) so E[f(yx)f(zx)] = 1.  BUT, if f DOESN'T
agree with alpha, then Pr[f(yx)=f(zx)] = 1/2, so E[f(yx)f(zx)] = 0.

Now, since any f can be written as a sum of parities, hopefully we can
reduce the case of general f to the case of "f is a parity function"...
It's actually almost this easy.  

	E[f(yx)f(zx)]
	 = E[ (sum_v hat{f}_v phi_v(yx)) (sum_w hat{f}_w phi_w(zx))]

So just have to calculate: for two vectors v,w what is

	E [ phi_v(yx) * phi_w(zx) ]  ?
      x,y,z

Claim:  This is 0 if v != w.  If v=w we already showed this is 0 if
	v is not all 0s on the first k bits, else it is 1.
Proof:  If v!=w then just imagine we pick that differening bit last

So,	E[f(yx)f(zx)]
	 = E[ (sum_v hat{f}_v phi_v(yx)) (sum_w hat{f}_w phi_w(zx))]
	 = sum of squares of coeffs starting with k zeroes.

So, we're done for the case of alpha = "all zeros".  What about other
alpha (i.e., requiring parity to have a certain fixed dependence on
the first k bits)?  Since we know the dependence (alpha) just "undo
it".  Specifically, we instead ask for:

	E[f(yx)f(zx)\phi_{alpha}(y)phi_{alpha}(z)]

where phi_alpha is the parity of the variables set to 1 in alpha. 
(then we get same as in above Claim, except the non-zero case is when
v=w and they are equal to alpha on the first k bits).

That's it.

Actually, one technical point: we need to estimate an expectation.
We'll do it by sampling.  OK since the quantity inside the expectation
is bounded (in fact, it is always -1 or +1).