15-859(B) Machine Learning Theory                                03/17/08

 * Use of Fourier analysis for learning (preliminaries)
 * Some simple algorithms for learning wrt the uniform distrib.
=========================================================================

Recap SQ analysis
=================
- Fourier view of functions as vectors (once we fix a distribution over
  data).
- Showed that if the target is picked randomly from a large set of
  orthogonal functions, then any SQ alg is in trouble.
- Can view *any* SQ alg as basically proposing a unit vector and asking
  for its correlation (dot-product) with the target.  (Can break any SQ
  into one part that looks like this, and one part that depends only on
  the data distribution.)
- Then use the fact that in any orthogonal coordinate system, a unit
  vector can have at most t^2 coefficients that are >= 1/t.
- This means that if the target is picked from a set S of orthogonal
  functions, Pr(dot-product >= 1/t) <= t^2/|S|.
- So, if there are > poly(n) orthogonal functions, then we can't even
  weak-learn with a poly number of queries, each of tolerance 1/poly.

Using Fourier analysis for learning
===================================
- Useful tool for learning wrt the uniform distribution (basis of parity
  fns).
- Especially useful with queries.  Will see next time an algorithm for
  *finding* large Fourier coefficients over the parity basis if we can
  query the target f on inputs x of our own choosing.

Some general results
====================
Suppose phi_1(x), phi_2(x), ... is a basis of boolean functions.

  f(x) = sum_i hat{f}_i phi_i(x)

  hat{f}_i = correlation of f with phi_i = <f, phi_i> = E[f(x) phi_i(x)]
  (expectation over the underlying distribution).

Claim 1: if you can find a non-negligible Fourier coefficient then you
can weak-learn.

Proof: Just directly from the defn of correlation.  Use phi_i or its
complement.

Now, suppose we can find "most" of the Fourier coefficients.  Say we can
find a set S that captures all but epsilon of the (hat{f}_i)^2.  Let
g = sum_{i in S} hat{f}_i phi_i.  Then, as vectors, we have
|f-g|^2 = epsilon, and also we have <f,g> = 1 - epsilon.

Claim 2: Pr[f(x) != sign(g(x))] <= epsilon.  So sign(g) is a good apx
to f.

Proof: By definition, E[(f(x)-g(x))^2] = epsilon.  Notice that every x
where sign(g) is wrong contributes at least 1 to (f(x)-g(x))^2, since
f(x) is +/-1 and g(x) has the opposite sign.  So, the error of sign(g)
is at most epsilon.

In fact, we can see from the above argument that we don't need to get
the Fourier coefficients exactly.  If someone hands us S, we can
estimate the coefficients by looking at correlations, and get a good apx
to g.  That's all we need.

A more refined bound
====================
What if we can only find a set S that captures, say, half of the
(hat{f}_i)^2?  Then the above isn't so helpful.  Can we do any better?

Claim 3: Say |f-g|^2 = epsilon.  Then we can define a randomized
function h based on g such that Pr[f(x) != h(x)] <= epsilon/2.  So,
this is interesting even for epsilon = 1/2.

Let's first prove something a little easier.

Claim 4: Say <f,g> = 1 - epsilon *and* g(x) in [-1,1].  Then we can
define a randomized h based on g s.t. Pr[f(x) != h(x)] <= epsilon/2.

Proof: Define h as:

  h(x) =  1 with probability (1 + g(x))/2,
  h(x) = -1 with probability (1 - g(x))/2.

So, for any given x, Pr[f(x) != h(x)] = (1 - f(x)g(x))/2.  So, overall,
Pr[f(x) != h(x)] = (1 - E[f(x)g(x)])/2 = epsilon/2.

For the proof of Claim 3, we need to define h in a messier way:

  h(x) =  1 with probability (1 + g(x))^2 / (2(1 + g^2(x))),
  h(x) = -1 with probability (1 - g(x))^2 / (2(1 + g^2(x))).

[A little algebra to check this sums to 1:
(1+g)^2 + (1-g)^2 = 2(1 + g^2).]

So, for any given x, Pr[f(x) != h(x)] = (f(x) - g(x))^2 / (2(1+g^2(x))).
So, overall, Pr[f(x) != h(x)] <= E[(f(x)-g(x))^2]/2 = epsilon/2.
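
As a concrete illustration of Claim 2 (and the remark that estimated
coefficients suffice), here is a minimal Python sketch, assuming we are
simply handed a candidate set S of parities (subsets of variables) and
can sample f on uniform random +/-1 inputs.  The function names
(estimate_coefficient, sign_of_g_hypothesis) and the toy target are made
up for the example, not part of the notes.

    import random
    from itertools import combinations

    def parity(S, x):
        """chi_S(x) = product of x_i over i in S, with x_i in {-1,+1}."""
        v = 1
        for i in S:
            v *= x[i]
        return v

    def estimate_coefficient(f, S, n, samples=2000):
        """Estimate hat{f}_S = E[f(x) chi_S(x)] over uniform x in {-1,+1}^n."""
        total = 0
        for _ in range(samples):
            x = [random.choice((-1, 1)) for _ in range(n)]
            total += f(x) * parity(S, x)
        return total / samples

    def sign_of_g_hypothesis(f, subsets, n):
        """Claim 2: build g = sum_{S} hat{f}_S chi_S and predict with sign(g)."""
        coeffs = {S: estimate_coefficient(f, S, n) for S in subsets}
        def h(x):
            g = sum(c * parity(S, x) for S, c in coeffs.items())
            return 1 if g >= 0 else -1
        return h

    if __name__ == "__main__":
        # Toy target: majority of the first 3 of n=6 variables.
        n = 6
        f = lambda x: 1 if x[0] + x[1] + x[2] > 0 else -1
        # Hand the learner all parities of degree <= 2 as the set S.
        subsets = [S for k in range(3) for S in combinations(range(n), k)]
        h = sign_of_g_hypothesis(f, subsets, n)
        test = [[random.choice((-1, 1)) for _ in range(n)] for _ in range(2000)]
        err = sum(h(x) != f(x) for x in test) / len(test)
        print("empirical error of sign(g):", err)

Here the degree <= 2 parities capture 3/4 of the squared coefficients of
this target, yet sign(g) already recovers it almost everywhere.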
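
Similarly, here is a minimal sketch of the randomized predictor from
Claim 4, assuming we already have some g(x) in [-1,1] correlated with f;
Claim 3's version just swaps in the probabilities
(1 +/- g)^2 / (2(1+g^2)).  The helper names and toy f, g are
hypothetical.

    import random

    def randomized_predictor(g):
        """Claim 4: given g(x) in [-1,1], output +1 w.p. (1+g(x))/2, else -1.
        Then Pr[h(x) != f(x)] = (1 - f(x)g(x))/2 for every x."""
        def h(x):
            return 1 if random.random() < (1 + g(x)) / 2 else -1
        return h

    if __name__ == "__main__":
        n = 5
        f = lambda x: 1 if sum(x) > 0 else -1   # majority on 5 +/-1 bits
        g = lambda x: sum(x) / n                # bounded correlate of f, in [-1,1]
        h = randomized_predictor(g)
        trials = 20000
        errs, corr = 0, 0.0
        for _ in range(trials):
            x = [random.choice((-1, 1)) for _ in range(n)]
            errs += (h(x) != f(x))
            corr += f(x) * g(x)
        print("empirical Pr[h != f]:", errs / trials)
        print("(1 - E[fg]) / 2     :", (1 - corr / trials) / 2)

The two printed quantities should agree up to sampling noise, which is
exactly the statement of Claim 4.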
Learning wrt uniform distrib
============================
The general statements above hold in a generic setting.  Now, focus on
learning wrt the uniform distribution, with the specific basis of parity
functions.

So, view it like this: we have some unknown function f on n inputs in a
box.  We can see f on uniform random points.  Maybe we're also allowed
to query f on inputs of our choosing (the Membership Query model).
Goal: find g that approximates f in the sense that it gets most of the
domain correct.

E.g., next time: the Kushilevitz-Mansour algorithm that, using
Membership Queries, will find all i such that (hat{f}_i)^2 > tau in time
poly(n, 1/tau).  Can combine this with the structural statement that any
Decision Tree must have most of its Fourier weight on large coefficients
to get a poly-time algorithm to learn decision trees with MQs wrt the
uniform distribution (via Claim 2 above).

For now, some simple algorithms that don't use queries.
-------------------
DNF: We can complement our SQ lower bound with a simple n^{O(log s)}-time
algorithm to learn DNF of s terms over the uniform distribution.
 + There must exist *some* term that covers at least a 1/s fraction of
   the positives.
 + We can find such a term in time n^{O(log(s/epsilon))}.
 + Do greedy set-cover.  Get a DNF of s*log(1/epsilon) terms.  The size
   is bounded, so we can use Occam to upper-bound the sample size
   needed.  (A code sketch of this greedy covering step appears at the
   end of these notes.)
 - Note: LMN use Fourier analysis to get this kind of n^{polylog(s)}
   bound for learning AC^0 circuits.  (DNF are depth-2 circuits.)

(Weak)-Learning Monotone functions
----------------------------------
A monotone function is one such that if you take some example x and flip
some bit from 0 to 1, then the value of f(x) can't change from positive
to negative.

Quick sanity check: why are we focusing on the uniform distribution?  If
you allow an arbitrary distribution, then you're sunk.  E.g., consider
putting all the probability mass on examples x with |x| = n/2 (the
"middle slice").  The target can now be arbitrary on these, so there is
nothing to learn.

But under uniform, you can't have too many orthogonal monotone
functions.  In particular, a classic result is that any monotone
function has to have Omega(1/n) correlation with one of
{0, 1, x_1, ..., x_n}.  Specifically:

- If Pr(f(x) = 1) is far from 1/2, we're done (we can just say
  "everything is positive" or "everything is negative").  So, for now
  let's assume Pr(f(x) = 1) is between 1/4 and 3/4.

- Want to argue that some variable has correlation.  I.e., for some x_i,
  Pr(f(x) = 1 | x_i = 1) > Pr(f(x) = 1 | x_i = 0) + epsilon.

  [Sanity check that this is the same as our previous def of
  correlation, i.e., that x_i is a weak hypothesis: if the error rate is
  alpha when x_i = 0, then the error rate is at most 1-(alpha+epsilon)
  when x_i = 1, so the overall error rate is their average, which is at
  most (1-epsilon)/2.  A code sketch of the resulting weak learner
  appears at the end of these notes.]

How to prove this: look at the boolean cube {0,1}^n.  Let's color an
example red if positive and blue if negative.  Assume roughly half of
each: e.g., say at least 1/4 of the nodes are red and at least 1/4 are
blue.  Want: no matter how you color, there are a lot of edges crossing
from red to blue.  Say we can prove that there must be Omega(2^n) edges
with one endpoint of each color.  Then there must be some direction i
with Omega(2^n / n) edges in that direction.  Then x_i has good
correlation [[why?]].

So, it all boils down to saying that if you want to slice a cube into
two roughly equal halves, you need to cut a lot of edges.  We can prove
this using a "canonical path" argument:
For each pair of examples (x,y) where x is red and y is blue, define the
canonical path from x to y to be: "starting with x, scan down the bits
from left to right, flipping where needed".  This path must use at least
one red-blue edge.  There are at least (1/4)(3/4)(2^{2n}) pairs.  But
any *given* red-blue edge can be used by at most 2^{n-1} of these paths.
(Why: if the edge is 01010100 -> 01011100, flipping the 5th bit, then x
must end with 0100 and y must begin with 01011, so there are at most
2^4 * 2^3 = 2^{n-1} such pairs.)  So, there must be at least
(3/8)(2^n) red-blue edges.

In fact, you can even show a stronger result (which we won't have time
to prove today): any apx balanced monotone function must have
Omega(1/sqrt(n)) correlation with the majority function over all the
variables.  In particular, this implies an even simpler algorithm: just
ask two SQs:

 1. What is Pr(f(x) = 1)?
 2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?

Claim: any monotone function must have Omega(1/sqrt(n)) correlation with
(a) the all-positive function, (b) the all-negative function, or (c) the
majority function "predict positive iff |x| > n/2".
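
Here is the weak-learner sketch promised in the monotone section above:
a minimal Python illustration, assuming we can sample f on uniform
random {0,1}^n inputs with +/-1 labels.  It just estimates the
correlation of every hypothesis from the claim (the two constants, each
single variable x_i, and majority) and returns the best one; the helper
names and the toy monotone target are made up.

    import random

    def correlation(f, h, n, samples=20000):
        """Estimate E[f(x)h(x)] over uniform x in {0,1}^n (labels in {-1,+1})."""
        total = 0
        for _ in range(samples):
            x = [random.randint(0, 1) for _ in range(n)]
            total += f(x) * h(x)
        return total / samples

    def monotone_weak_learner(f, n):
        """Try all-positive, all-negative, majority, and each x_i; for a
        monotone f, one of these has correlation Omega(1/sqrt(n))."""
        candidates = [("always +1", lambda x: 1),
                      ("always -1", lambda x: -1),
                      ("majority",  lambda x: 1 if sum(x) > n / 2 else -1)]
        for i in range(n):
            candidates.append(("x_%d" % i,
                               lambda x, i=i: 1 if x[i] == 1 else -1))
        scored = [(correlation(f, h, n), name, h) for name, h in candidates]
        return max(scored, key=lambda t: t[0])

    if __name__ == "__main__":
        n = 7
        # A monotone target: f(x) = +1 iff (x_0 AND x_1) OR x_2.
        f = lambda x: 1 if (x[0] and x[1]) or x[2] else -1
        corr, name, h = monotone_weak_learner(f, n)
        print("best hypothesis:", name, "correlation about", round(corr, 3))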
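
And here is the greedy covering sketch promised in the DNF discussion:
a toy Python illustration of the n^{O(log(s/epsilon))} idea, assuming we
work from a finite sample of labeled examples rather than statistical
queries, enumerate all conjunctions of at most k literals, keep only
those false on every negative example, and greedily cover the positives.
The names (greedy_dnf_cover, etc.), the hand-set k, and the toy target
are all just for the example.

    import random
    from itertools import combinations, product

    def term_satisfied(term, x):
        """term is a list of (index, required_bit) pairs; x is a 0/1 vector."""
        return all(x[i] == b for i, b in term)

    def all_short_terms(n, k):
        """All conjunctions of at most k literals over n 0/1 variables."""
        for size in range(1, k + 1):
            for idxs in combinations(range(n), size):
                for bits in product((0, 1), repeat=size):
                    yield list(zip(idxs, bits))

    def greedy_dnf_cover(pos, neg, n, k):
        """Greedy set cover: repeatedly pick a short term that is false on
        every negative example and covers the most uncovered positives."""
        terms = [t for t in all_short_terms(n, k)
                 if not any(term_satisfied(t, x) for x in neg)]
        uncovered = list(pos)
        hypothesis = []
        while uncovered and terms:
            best = max(terms,
                       key=lambda t: sum(term_satisfied(t, x) for x in uncovered))
            if sum(term_satisfied(best, x) for x in uncovered) == 0:
                break  # no consistent short term covers anything more
            hypothesis.append(best)
            uncovered = [x for x in uncovered if not term_satisfied(best, x)]
        return hypothesis  # predict +1 iff some term in the list is satisfied

    if __name__ == "__main__":
        n, k = 8, 2
        target = lambda x: (x[0] and x[1]) or (x[2] and not x[3])  # 2-term DNF
        sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(500)]
        pos = [x for x in sample if target(x)]
        neg = [x for x in sample if not target(x)]
        print("learned terms:", greedy_dnf_cover(pos, neg, n, k))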