15-859(B) Machine Learning Theory                                   03/27/06

* finish SQ hardness
* use of Fourier analysis for learning (preliminaries)
* some simple algorithms for learning wrt the uniform distribution
=========================================================================

Recap of SQ analysis
====================
- Fourier view of functions as vectors (once we fix a distribution over data).
- Showed that if the target is picked randomly from a large set of orthogonal
  functions, then any SQ algorithm is in trouble.
- Can view *any* SQ as basically proposing a unit vector and asking for its
  correlation (dot-product) with the target. (Can break any SQ into one part
  that looks like this, and one part that depends only on the data
  distribution.)
- Then use the fact that in any orthonormal coordinate system, a unit vector
  can have at most 1/tau^2 coefficients that are >= tau.
- This means that if the target is picked uniformly from a set S of
  orthogonal functions, then Pr(dot-product >= tau) <= 1/(tau^2 |S|).
- So, if the class contains more than poly(n) orthogonal functions, we can't
  even weak-learn with a polynomial number of queries, each of tolerance
  1/poly(n).

Using Fourier analysis for learning
===================================
- A useful tool for learning with respect to the uniform distribution.
- Especially useful with queries.

Can think of the problem like this: we have a black box containing some
unknown function f:{0,1}^n -> {0,1} from some class C. We can prod f with
inputs x of our own choosing. Our goal is to come up with a good
approximation of f, in the sense that it gets most of the domain correct.

Some general results
====================
Suppose \phi_1(x), \phi_2(x), ... is an orthonormal basis of boolean
functions. Then

    f(x) = \sum_i \hat{f}_i \phi_i(x),

where \hat{f}_i = <f, \phi_i> is the correlation of f with \phi_i.

Claim 1: if you can find a non-negligible Fourier coefficient, then you can
weak-learn.

Proof: directly from the definition of correlation. Use \phi_i or its
complement, whichever agrees with f on more than half the domain.

Now, suppose we can find "most" of the Fourier coefficients. Say we can find
a set S of indices that captures all but epsilon of \sum_i (\hat{f}_i)^2.
Let g = \sum_{i in S} \hat{f}_i \phi_i. Then, as vectors, we have
|f-g|^2 <= epsilon.

Claim 2: even if the above g is not a boolean function, sign(g) is a good
approximation to f.

Proof: by definition, E[(f(x)-g(x))^2] <= epsilon (here we view f as
+/-1-valued). Notice that every x where sign(g(x)) != f(x) contributes at
least 1 to this expectation. So, the error rate of sign(g) is at most
epsilon.

In fact, the argument above shows we don't need the Fourier coefficients
exactly. If someone hands us the set S, we can estimate each coefficient by
measuring correlations on random samples, and still get a good approximation
to g. That's all we need. (A concrete sketch of this recipe is given below.)

Learning wrt the uniform distribution
=====================================
The general statements above hold in a generic setting. Now we focus on
learning with respect to the uniform distribution, using the specific basis
of parity functions.

So, view it like this: we have some unknown function f on n inputs in a box.
We can see f on uniform random points. Maybe we're also allowed to query f
on inputs of our choosing (the Membership Query model). Goal: find g that
approximates f in the sense that it gets most of the domain correct.

E.g., next time we will see the Kushilevitz-Mansour algorithm, which, using
Membership Queries, finds all i such that (\hat{f}_i)^2 > tau in time
poly(n, 1/tau). Combining this with the structural statement that any small
decision tree must have most of its length (L_2 weight) in large Fourier
coefficients, we get (via Claim 2 above) a poly-time algorithm for learning
decision trees with MQs with respect to the uniform distribution.

For now, some simple algorithms that don't use queries.
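Before turning to those, here is a minimal Python sketch of the recipe
behind Claim 2 (my own illustration, not from the lecture): given a
candidate set S of index sets, estimate each coefficient as the empirical
correlation of f with the corresponding parity function under the uniform
distribution, then predict with sign(g). The sample size m and all names
are illustrative assumptions, and f is assumed to be a +/-1-valued black
box.

    import random

    def parity(x, T):
        # phi_T(x) = (-1)^(sum of x_i for i in T): the +/-1 parity on T.
        return -1 if sum(x[i] for i in T) % 2 == 1 else 1

    def estimate_coeffs(f, n, S, m=10000):
        # Estimate hat{f}_T = E[f(x) * parity(x, T)] for each T in S,
        # using m uniform random samples per coefficient.
        coeffs = {}
        for T in S:
            total = 0
            for _ in range(m):
                x = [random.randint(0, 1) for _ in range(n)]
                total += f(x) * parity(x, T)
            coeffs[T] = total / m
        return coeffs

    def sign_of_g(coeffs):
        # Claim 2: predict with sign(g), g(x) = sum_T hat{f}_T * parity(x, T).
        def h(x):
            g = sum(c * parity(x, T) for T, c in coeffs.items())
            return 1 if g >= 0 else -1
        return h

E.g., taking S to be the empty tuple () plus all singletons (i,) for i in
range(n) builds the best approximation to f by a degree-<=1 polynomial.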
-------------------
DNF: We can complement our SQ lower bound with a simple n^{O(log s)}-time
algorithm for learning DNF formulas of s terms over the uniform
distribution. Ideas?

+ How important are terms of size >> log(s/epsilon)? Not very: such a term
  is satisfied with probability < epsilon/s, so dropping all of them changes
  f on less than an epsilon fraction of inputs. This leaves only
  n^{O(log(s/epsilon))} candidate terms to consider.
+ Do greedy set-cover over these candidates. Get a DNF of O(s log(1/epsilon))
  terms. The hypothesis size is bounded, so we can use Occam to upper-bound
  the sample size needed. [Can't do "list-and-cross-off" because the
  hypothesis space is too big.]
- Note: LMN use Fourier analysis to get this kind of n^{polylog(s)} bound
  for learning AC^0 circuits. (DNF formulas are depth-2 circuits.)
- On the homework, you are giving an algorithm to learn decision trees of
  size s over *any* distribution in time n^{O(log s)}.

(Weak)-Learning Monotone Functions
----------------------------------
A monotone function is one such that if you take some example x and flip
some bit from 0 to 1, then the value of f(x) can't change from positive to
negative.

Quick sanity check: why are we focusing on the uniform distribution? If you
allow an arbitrary distribution, then you're sunk. E.g., consider putting
all probability mass on examples x with |x| = n/2 (the "middle slice"). The
target can be arbitrary on these points, so there is nothing to learn.

But under the uniform distribution, you can't have too many orthogonal
monotone functions. In particular, a classic result is that any monotone
function must have Omega(1/n) correlation with one of {0, 1, x_1, ..., x_n}.
Specifically:

- If Pr(f(x)=1) is far from 1/2, we're done (can just say "everything is
  positive" or "everything is negative"). So, for now, let's assume
  Pr(f(x)=1) is between 1/4 and 3/4.
- Want to argue that some variable has correlation: i.e., for some x_i,
  Pr(f(x)=1 | x_i=1) > Pr(f(x)=1 | x_i=0) + epsilon.
  [Sanity check that this is the same as our previous definition of
  correlation, i.e., that x_i is a weak hypothesis: if the error rate of
  predicting with x_i is alpha when x_i=0, then it is at most
  1-(alpha+epsilon) when x_i=1, so the overall error rate is their average,
  which is at most (1-epsilon)/2.]

How to prove this: look at the boolean cube {0,1}^n, and color an example
red if positive and blue if negative. Assume roughly half of each; e.g.,
say at least 1/4 of the nodes are red and at least 1/4 are blue. Want: no
matter how you color, there are a lot of edges crossing from red to blue.
Say we can prove that there must be Omega(2^n) edges with one endpoint of
each color. Then there must be some direction i with Omega(2^n / n) such
edges, and then x_i has good correlation. [[Why? By monotonicity, each
red-blue edge in direction i goes from negative at x_i=0 to positive at
x_i=1, and Pr(f=1 | x_i=1) - Pr(f=1 | x_i=0) equals the number of such
edges divided by 2^{n-1}, the total number of edges in direction i.]]

So, it all boils down to saying that if you want to slice a cube into two
roughly equal halves, you need to cut a lot of edges. We can prove this
using a "canonical path" argument. For each pair of examples (x,y) where x
is red and y is blue, define the canonical path from x to y to be:
"starting with x, scan down the bits from left to right, flipping where
needed". Each path must use at least one red-blue edge. There are at least
(1/4)(3/4)2^{2n} pairs. But any *given* red-blue edge can be used by at
most 2^{n-1} of these paths. (Why: if the edge is 01010100 -> 01011100,
flipping the 5th bit, then x must end with 0100 and y must begin with
01011; in general this leaves 2^{i-1} choices for x and 2^{n-i} choices
for y, i.e., 2^{n-1} pairs.) So, there must be at least (3/8)2^n red-blue
edges. A sketch of the weak learner this argument yields appears below.
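A minimal sketch of that weak learner (my own illustration, with arbitrary
sample size and thresholds, not code from the notes): estimate Pr(f(x)=1);
if it is noticeably unbalanced, output a constant; otherwise find the
variable x_i with the largest estimated gap Pr(f=1 | x_i=1) -
Pr(f=1 | x_i=0), which the argument above guarantees is Omega(1/n) for
some i.

    import random

    def weak_learn_monotone(f, n, m=20000):
        # Weak learner for monotone f: {0,1}^n -> {0,1} under the uniform
        # distribution. m is an illustrative sample size; to detect a gap
        # of tau it should grow like 1/tau^2.
        sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
        labels = [f(x) for x in sample]
        p = sum(labels) / m
        if p >= 0.6:                 # f unbalanced: a constant weak-learns
            return lambda x: 1
        if p <= 0.4:
            return lambda x: 0
        def gap(i):                  # est. Pr(f=1 | x_i=1) - Pr(f=1 | x_i=0)
            ones = [y for x, y in zip(sample, labels) if x[i] == 1]
            zeros = [y for x, y in zip(sample, labels) if x[i] == 0]
            return sum(ones) / max(len(ones), 1) - sum(zeros) / max(len(zeros), 1)
        best = max(range(n), key=gap)
        return lambda x: x[best]     # predict f(x) = x_best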
In fact, you can show an even stronger result (which we won't have time to
prove today): any approximately balanced monotone function must have
Omega(1/sqrt(n)) correlation with the majority function over all the
variables. In particular, this implies an even simpler algorithm that just
asks two SQs:
1. What is Pr(f(x) = 1)?
2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?

Claim: any monotone function must have Omega(1/sqrt(n)) correlation with
(a) the all-positive function, (b) the all-negative function, or (c) the
majority function "predict positive iff |x| > n/2".
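A sketch of the resulting two-query algorithm, simulating each SQ with
uniform samples (again my own illustration; the sample size and advantage
threshold gamma are assumptions, and the final complement branch is just a
safe default):

    import random

    def two_sq_learner(f, n, m=20000, gamma=0.05):
        # Weak learner for monotone f using only the two statistics above,
        # each simulated by sampling. The Claim guarantees an
        # Omega(1/sqrt(n)) advantage for one of branches (a), (b), (c).
        sample = [[random.randint(0, 1) for _ in range(n)] for _ in range(m)]
        labels = [f(x) for x in sample]
        p = sum(labels) / m                      # SQ 1: Pr(f(x) = 1)
        if p >= 0.5 + gamma:
            return lambda x: 1                   # (a) all-positive
        if p <= 0.5 - gamma:
            return lambda x: 0                   # (b) all-negative
        upper = [y for x, y in zip(sample, labels) if sum(x) > n / 2]
        q = sum(upper) / max(len(upper), 1)      # SQ 2: Pr(f=1 | more 1s)
        if q >= 0.5:
            return lambda x: 1 if sum(x) > n / 2 else 0   # (c) majority
        return lambda x: 0 if sum(x) > n / 2 else 1       # safe default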