15-859(B) Machine Learning Theory                               02/28/02

1. Finish proof from last time (Fourier analysis to prove that we can't
   learn C in the SQ model if C has more than polynomially-many
   uncorrelated functions).
2. More SQ algorithms: weak-learning monotone functions under the
   uniform distribution.
========================================================================

1. --> See notes from last time.

2. Weak-learning monotone functions under the uniform distribution
==================================================================

Here is an interesting SQ algorithm for weak-learning (getting 1/n
correlation with) any *monotone* function over the uniform distribution.
We learn from examples drawn uniformly at random from {0,1}^n, and the
target function is *monotone*: if you take some example x and flip some
bit from 0 to 1, the value of f(x) can't change from positive to
negative. There are no queries: we just observe uniform random examples.

Given only that the function is monotone, it turns out there is an easy
way to *weak-learn* with respect to the uniform distribution. Any
guesses?

- Assume Pr(f(x) = 1) = 1/2 (at least approximately), since otherwise
  we can just predict "everything is positive" or "everything is
  negative".

- We want to argue that some variable has noticeable correlation with
  the target. I.e., for some x_i,

      Pr(f(x) = 1 | x_i = 1) > Pr(f(x) = 1 | x_i = 0) + epsilon.

  (If we can find such an x_i, then we can predict positive when
  x_i = 1 and negative when x_i = 0.) In other words, the target has
  some noticeable Fourier coefficient among these n+1 (the n singleton
  coefficients plus the constant).

How to prove this: look at the boolean cube {0,1}^n, and color an
example red if it is positive and blue if it is negative. Assume roughly
half of each; say at least 1/4 of the nodes are red and at least 1/4 are
blue. We want to show that no matter how you color, there must be a lot
of edges crossing from red to blue. Say we can prove that there must be
Omega(2^n) edges with one endpoint of each color. Then there must be
some direction i with Omega(2^n / n) edges in that direction.
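Concretely, the weak learner this argument justifies is very simple:
estimate the gap Pr(f(x) = 1 | x_i = 1) - Pr(f(x) = 1 | x_i = 0) for
each coordinate from the sample, and predict with the best single
variable. A minimal sketch, using a hypothetical monotone target chosen
only for illustration (not from the notes):

```python
import random

random.seed(0)
n = 8
SAMPLES = 20000

def target(x):
    # hypothetical monotone target, for illustration only:
    # f(x) = x_0 OR (x_1 AND x_2)
    return 1 if x[0] or (x[1] and x[2]) else 0

# observe uniform random labeled examples (no queries)
examples = [[random.randint(0, 1) for _ in range(n)] for _ in range(SAMPLES)]
labels = [target(x) for x in examples]

def gap(i):
    # empirical estimate of Pr(f=1 | x_i=1) - Pr(f=1 | x_i=0)
    on = [y for x, y in zip(examples, labels) if x[i] == 1]
    off = [y for x, y in zip(examples, labels) if x[i] == 0]
    return sum(on) / len(on) - sum(off) / len(off)

best = max(range(n), key=gap)  # the most correlated single variable

# weak hypothesis: predict f(x) = x_best; check accuracy on fresh data
fresh = [[random.randint(0, 1) for _ in range(n)] for _ in range(SAMPLES)]
accuracy = sum(x[best] == target(x) for x in fresh) / SAMPLES
print(best, round(accuracy, 3))
```

Note each gap(i) is itself a statistical query, so this runs unchanged
in the SQ model: a polynomial-sized sample estimates all n gaps to
within the tolerance needed to detect an Omega(1/n) gap.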
Then x_i has good correlation [[why?]]. (Hint: by monotonicity, on
every bichromatic edge in direction i the x_i = 1 endpoint is the red
one, so each such edge contributes 2^{-(n-1)} to
Pr(f(x) = 1 | x_i = 1) - Pr(f(x) = 1 | x_i = 0).)

So, it all boils down to saying that if you want to slice a cube into
two roughly equal halves, you need to cut a lot of edges. We can prove
this using a "canonical path" argument: for each pair of examples (x,y)
where x is red and y is blue, define the canonical path from x to y to
be "starting with x, scan down the bits from left to right, flipping
wherever x and y disagree". Since each color occupies between 1/4 and
3/4 of the cube, there are at least (1/4)(3/4)(2^{2n}) = (3/16)(2^{2n})
such pairs. But any *given* edge can be used by at most 2^n of these
paths. (Why? Because if the edge is 01010100 -> 01011100, then x must
end with 100 and y must begin with 0101.) So, there must be at least
(3/16)(2^n) bichromatic edges.

This algorithm gives 1/n correlation. It turns out you can get
1/sqrt(n) correlation by an even simpler algorithm that just asks two
SQs:

1. What is Pr(f(x) = 1)?
2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?

I.e., we're asking: what is the correlation of f with the "all
positive" hypothesis, and what is the correlation of f with the
majority function?

One can also prove that no algorithm using a polynomial-sized sample
can hope to do better than log(n)/sqrt(n) correlation for an arbitrary
monotone function. We won't do the proofs here; this is from a paper
with Carl Burch and John Langford.

Nice (hard) open question: if you assume the target can be written as a
polynomial-size monotone DNF formula, can you do better? [The lower
bound doesn't hold in that case.]

(Probably easier) open problem: close the log(n) gap.
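As a sanity check on the two-SQ algorithm above, here is a small
simulation estimating both queries from uniform samples. The target
function and sample size are illustrative assumptions, not from the
notes:

```python
import random

random.seed(1)
n = 9  # odd, so "more 1s than 0s" is a strict majority
SAMPLES = 40000

def target(x):
    # hypothetical monotone target, for illustration only:
    # true iff at least 4 of the first 7 bits are 1
    return 1 if sum(x[:7]) >= 4 else 0

examples = [[random.randint(0, 1) for _ in range(n)] for _ in range(SAMPLES)]
labels = [target(x) for x in examples]

# SQ 1: Pr(f(x) = 1) -- correlation with the "all positive" hypothesis
p_all = sum(labels) / SAMPLES

# SQ 2: Pr(f(x) = 1 | x has more 1s than 0s) -- correlation with majority
maj_labels = [y for x, y in zip(examples, labels) if sum(x) > n // 2]
p_maj = sum(maj_labels) / len(maj_labels)

print(round(p_all, 3), round(p_maj, 3))
```

For this target Pr(f = 1) is about 1/2, while the conditional
probability given a majority of 1s is noticeably larger, so predicting
with the majority function already beats random guessing, exactly as
the two-SQ algorithm requires.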