Notes for March 4, 1998

* If you haven't yet, talk with me about your presentation/project.

===============================================================================

Recap:

* Fourier analysis of boolean functions:

  - for a given distribution, we can view functions as vectors in a
    2^n-dimensional space where |f| = 1 and the dot product <f,g> is

	    Pr_D(f(x) = g(x)) - Pr_D(f(x) != g(x)).

    (which makes sense since <f,f> = 1)

    what's nice about this is we have a lot of intuition and facts
    about vectors in space thjat we can use

* We saw that a statistical query can be viewed as basically asking
  for the correlation of the target function with some query vector.
  (any query can be decomposed into a part that depends only on D (and
  not the target) and a part like this.)
  
* Any unit vector can't have too many large coefficients.  At most
  1/tau^2 coeffs can have magnitude larger than tau.  This is since the
  length of the vector is the sum of the squares of its coefficients.

* In particular, given any set S of orthogonal unit vectors, this
  means that any query unit vector can have correlation > tau or < -tau
  with at most 1/tau^2 of them.

* So, if the target class has a large number (say, N) pairwise
  uncorrelated functions, and if the target is a random one of those,
  then each query will only throw out 1/tau^2 possibilities.  So, we
  need:
	(#queries) * (1/tau^2) >= N/2 
  on average, to get a query that gives any info about the target.


* So, if N is super-polynomial then can't learn in poly time in SQ
  model.  Equivalently, if we're using an SQ algorithm, need
  super-polynomial number of samples.

* For example, size-n decision trees have Omega(n^(log n)) pairwise
  uncorrelated functions over the uniform distribution, since that class
  contains all "parities of up to log(n) variables", and any two parity
  functions are uncorrelated.  So, the bad case is: target is a random
  parity of log(n) variables.

I think this is really nice.  Most algorithms are basically SQ
algorithms and using Fourier analysis we are able to get strong bounds
on what you can or can't do with these kinds of algorithms.


* On the other hand, there *is* a (non-SQ) algorithm for learning
  parity.  If target is (a_1,...,a_n) where a_i = 1 or 0 depending on
  whether x_i is in the target or not, then an example (x_1,...,x_n)
  gives us a linear equality a.x = 1 (mod 2).  Draw a sample and then
  use gaussian elimination.

* Big open problem: is there a poly-time algorithm for learning
  Decision Trees over the uniform distribution?  We *can* solve in time
  n^(log n) over *any* distribution (see hwk).

============================================================================

We've seen how Fourier analysis can be used to analyze limitations on
algorithms.  Now turn to how can use it in an algorithm itself.  This
becomes particularly interesting if we extend our learning model to
allow for "membership queries".

Learning with Membership queries:  in addition to getting examples
from distribution D, the learning algorithm is allowed to ask for c(x)
for *any* x of its choosing.  E.g., this makes the most sense if we
think of the target function as some device that we are trying to
reverse-engineer.  We can put it into "normal operating conditions"
and observe its behavior, but we can also poke it with inputs of our
own choosing and see what it does.  So, membership queries are a kind
of "active learning".

For example, parity functions are particularly easy to learn with
membership queries.

My plan is to spend the next few classes talking about active learning
in general.  Today, segue into this, from our previous topic, see how
to use membership queries to find fourier coefficients.

In particular, using membership queries, we can find all the "large"
fourier coefficients of a function, under the parity basis.  I.e., any
function f over {0,1}^n can be written as:
	f(x) = \sum_i \hat{f}_i \phi_i(x)
where \phi_i is the ith parity function.  What we can find are
(estimates of) all \hat{f}_i that have magnitude > 1/poly(n).

Why is this interesting?
------------------------
Suppose the target has most of itself in its high-weight coefficients,
and say we find a set of coefficients S such that

	sum_{i \in S} (\hat{f}_i)^2  >  1-\epsilon

then the claim is that

	g(x) = sum_{i \in S} (\hat{f}_i)^2 \phi_i(x)

is a good approximation to f.  Note that g may not be a boolean
function.  But, the claim is that the average squared error (over the
uniform distribution) is low.   In particular, E(f(x) - g(x))^2 < epsilon.
This is because:
	E(f(x) - g(x))^2 = E( \sum_{i \not\in S} \hat{f}_i\phi_i(x) )^2
	                 
			 = E( \sum_{i \not\in S} (\hat{f}_i)^2 \phi_i(x)^2
(since the crossterms cancel)

and this equals \sum_{i \not\in S} (\hat{f}_i)^2 < \epsilon.


If we want a good hypothesis in the usual sense (a boolean function
with a low probability of error) we can use h(x) = sign(g(x)).

Why?  Remember f is a {-1, +1} function.  So, if the sign is wrong
then the squared difference is at least 1.  So there is at most a
probability epsilon of error (under the uniform distribution).

Furthermore, we can prove the following:

* any poly-size decision tree has a poly(n, 1/epsilon) set of fourier
coefficients whose sum of squares is at least 1-epsilon (i.e., they
capture most of its "length").  So can learn DTs with membership
queries, with respect to the uniform distribution.

* Any DNF f has SOME |\hat{f}_i| > 1/(4s), where s = number of terms in
f.  So, we can use that parity function as a weak approximator of f over the
uniform distribution.

Actually, you can plug this into the boosting results.  This is tricky
since they modify the distribution.  But you can keep track of this
modification and compensate for it if you are careful.  The end result
is can strongly learn DNF over the uniform distribution, with
membership queries.  This was Jeff Jackson's thesis a couple years back.

So, there are now three things:
(1) prove fact about DNF
(2) prove fact about decision trees
(3) do the algorithm for finding fourier coefficients.

Will do (1) today and (3) next time.  Probably will skip (2).

Theorem: any DNF of s terms has some fourier coefficient of magnitide at
least 1/(4s), over the parity basis.

Proof:

First some notation.  For a set of variables A, define \phi_A to be
the function that is the parity of the variables in A.  E.g., for A =
{x_1, x_3} then you get x_1 \oplus x_3.

Say T is a term of f, and V is the set of variables used in T.

Claim:  
	\sum_{A \subseteq V} |\hat{f}_A|  >=  1

Why is this good enough:  notice that if a term has L literals in it,
then the probability it is satisfied by a random example is 1/2^L.
So, if all the terms of f have size >= log_2(4s), then the probability
that a random example is positive for f is at most 1/4.  This means
that "all negative" is a weak hypothesis.
On the other hand, if some term of f has size < log_2(4s) then there
are at most 4s values being summed above, with a total sum of 1, so at
least one of them has to be at least 1/4s.

Proof of Claim:  Define the function T(x) as: T(x) = 1 if x satisfies
T and T(x) = 0 otherwise.  This is a little weird since it is not a
boolean function according to our definition ({-1, +1}).  But it is
still a legal function and will be helpful.  Let L = # of literals in T.

What does the fourier decomposition of T look like:

	\hat{T}_A = E[T(x)\phi_A(x)]

                  = 0 if A is not a subset of V.  (V = set of variables in T)

                  = (1/2^L)*\phi_A(T) if A IS a subset of V.

	(phi_A(T) is the value of phi_A on an example satisfying T)
                

Now, here's the trick:

	f(x)T(x) = T(x), because if T(x)=1 then f(x)=1.

So, clearly
	E[f(x)T(x)] = E[T(x)].

The right-hand-side of the above equation is 1/2^L.

The left-hand-side of the above equation is the dot product of f and
T.  Using the fourier basis, this is

	sum_A \hat{f}_A \hat{T}_A

which, by our analysis of \hat{T}_A is equal to

	(1/2^L) * sum_{A \subseteq V} \phi_A(T).

So, we're done.