15-859(A) Machine Learning Theory 03/04/04
Membership Query learning, contd
* KM algorithm for finding large Fourier coefficients
* Fourier spectrum of DNF and Decision Trees
=========================================================================
Recap of fourier representation
===============================
Let's recap the fourier representation. View a function f over {0,1}^n as a
vector f = (f_0, f_1,...) where f_x = sqrt(Pr(x))*f(x), with Pr taken over a
distribution D.
So, <f,g> = sum_x Pr(x)f(x)g(x) = E_{x in D}[f(x)g(x)].
Boolean functions are unit-length (Boolean now means f:{0,1}^n -> {-1,+1})
If we fix D to be the uniform distribution, then the parity functions
are orthogonal, and form a basis. Let's index parity functions by
their indicator vectors: phi_v(x) = v.x mod 2, scaled to be {-1,+1}
(i.e., phi_v(x) = (-1)^(v.x)).
We can write any function f as a sum of parity functions:
f(x) = sum_v hat(f)_v phi_v(x)
where
hat(f)_v = = E_x[f(x)phi_v(x)]
and expectation is over the uniform distribution.
Since this is just a change of basis, we can write
<f,g> = sum_v hat(f)_v * hat(g)_v
Suppose we are able to find most of the fourier coefficients of f.
E.g., all but epsilon of the (hat(f)_v)^2. Then the resulting
function g may not be boolean, but it's close (as a vector) to f. In
particular, |f-g|^2 <= epsilon. Then sign(g) has to be a good
approximation to f, since by definition E[(f(x)-g(x))^2] <= epsilon,
and every x for which sign(g) is wrong contributes at least 1 to the
quantity inside the expectation. So, Pr[f(x) != sign(g(x))] <= epsilon.
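To see this bound concretely, here is a small brute-force sketch (the target
function f, the dimension n = 4, and the cutoff tau are all hypothetical
choices for illustration): compute every coefficient of a small boolean f,
keep only the large ones to form g, and check that sign(g) errs with
probability at most the dropped weight epsilon.

```python
from itertools import product

n = 4
points = list(product([0, 1], repeat=n))

def f(x):  # hypothetical target: majority of 4 bits (ties count as +1)
    return 1 if sum(x) >= 2 else -1

def phi(v, x):
    # parity function indexed by v, scaled to {-1,+1}: phi_v(x) = (-1)^(v.x)
    return -1 if sum(a * b for a, b in zip(v, x)) % 2 else 1

# brute-force Fourier coefficients: hat(f)_v = E_x[f(x)*phi_v(x)], x uniform
coeffs = {v: sum(f(x) * phi(v, x) for x in points) / len(points) for v in points}

# keep only the large coefficients; g is the resulting (non-boolean) function
tau = 0.1
big = {v: c for v, c in coeffs.items() if c * c > tau}

def g(x):
    return sum(c * phi(v, x) for v, c in big.items())

# eps = |f - g|^2 = sum of the dropped squared coefficients
eps = sum(c * c for v, c in coeffs.items() if v not in big)

# every point where sign(g) disagrees with f contributes >= 1 to E[(f-g)^2]
err = sum(1 for x in points if f(x) * g(x) <= 0) / len(points)
print(eps, err, err <= eps)
```

Note that since f is boolean, the coefficients' squares sum to 1 (Parseval),
so eps is exactly the weight that g fails to capture.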
We now give an algorithm, due to Kushilevitz and Mansour, that finds all v
such that (hat(f)_v)^2 > tau, in time poly(n, 1/tau). This is good for
functions that have most of their length in large fourier coefficients.
KM ALGORITHM
============
-> See notes from last time.
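Since the details are in the previous lecture's notes, here is only a minimal
sketch of the idea, with the subtree weights computed exactly by brute force
where KM would *estimate* them by sampling with membership queries (the
target function is a hypothetical example):

```python
from itertools import product

n = 4

def f(x):  # hypothetical target: parity of x0 and x1, i.e. f = phi_(1,1,0,0)
    return -1 if (x[0] + x[1]) % 2 else 1

def phi(v, x):
    # parity function scaled to {-1,+1}
    return -1 if sum(a * b for a, b in zip(v, x)) % 2 else 1

def subtree_weight(prefix):
    # sum over all v extending `prefix` of (hat(f)_v)^2.  Computed exactly
    # here; KM instead estimates this with random samples and queries to f.
    k = len(prefix)
    heads = list(product([0, 1], repeat=k))
    suffixes = list(product([0, 1], repeat=n - k))
    total = 0.0
    for y in suffixes:
        # inner value = E_x[f(xy) * phi_prefix(x)], x over the first k bits
        val = sum(f(x + y) * phi(prefix, x) for x in heads) / len(heads)
        total += val * val
    return total / len(suffixes)

def km(tau):
    # grow prefixes level by level, discarding any whose subtree weight is
    # too small to contain a coefficient with square > tau
    live = [()]
    for _ in range(n):
        live = [p + (b,) for p in live for b in (0, 1)
                if subtree_weight(p + (b,)) > tau]
    return live

print(km(0.25))  # recovers [(1, 1, 0, 0)]: f itself is the only coefficient
```

With exact weights, at most 1/tau prefixes survive at each level, since the
subtree weights of one level's prefixes sum to <f,f> = 1; this is what makes
the search polynomial.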
To convert this guarantee into theorems about functions we're
interested in (like DNF formulas, decision trees, etc), we need to be
able to analyze what their fourier coefficients look like.
On the Fourier Spectrum of DNF and Decision Trees
=================================================
Let's start by analyzing the fourier spectrum of a single term
(conjunction) T. In fact, let's define the function T(x) as: T(x) = 1
if x satisfies term T and T(x) = 0 otherwise.
This is a little weird since it is not a boolean function according to
our definition {-1, +1}. But it is still a legal function and will be
very helpful, since we can start adding them up. E.g., since a
decision tree can be viewed as an OR of terms going to its positive
leaves, and since any example satisfies at most one of them, we can
write f(x) = [T_1(x) + ... + T_m(x)]*2 - 1, where the T_i are all the
terms corresponding to positive leaves.
What does the fourier decomposition of T look like?
hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1]*Pr[T(x)=1]
= 0 if phi_v has a relevant variable outside T (by
usual parity argument)
= (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the
value of phi_v on an example satisfying T.
E.g., so it's very simple. The sum of absolute values of the
coefficients is 1. The sum of squares is 1/2^|T| (which we knew
already, since the sum of squares = <T,T> = E_x[T(x)^2] = Pr_x[T(x)=1]).
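These three facts can be checked by brute force (the term T = x0 AND x1 and
the dimension n = 4 are hypothetical choices for illustration):

```python
from itertools import product

n = 4
term_vars = (0, 1)  # hypothetical term T = x0 AND x1, so |T| = 2

def T(x):
    # 0/1-valued indicator of the term, as in the notes (not a {-1,+1} function)
    return 1 if all(x[i] == 1 for i in term_vars) else 0

def phi(v, x):
    return -1 if sum(a * b for a, b in zip(v, x)) % 2 else 1

points = list(product([0, 1], repeat=n))
coeffs = {v: sum(T(x) * phi(v, x) for x in points) / len(points) for v in points}

# nonzero coefficients sit exactly on the v supported inside T's variables,
# each of magnitude 1/2^|T| = 1/4
nonzero = {v: c for v, c in coeffs.items() if abs(c) > 1e-12}
print(sorted(nonzero))                        # supports are subsets of {x0, x1}
print(sum(abs(c) for c in coeffs.values()))   # sum of absolute values = 1.0
print(sum(c * c for c in coeffs.values()))    # sum of squares = 1/2^|T| = 0.25
```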
Analyzing Decision Trees
------------------------
Suppose f is a decision tree. Using the terms corresponding to
positive leaves, we can write f as
f(x) = [T_1(x) + ... + T_m(x)]*2 - 1
So, to get the fourier coefficients hat(f)_v, we just add up the
coefficients of the corresponding functions. One easy thing to notice is that each
function here has a sum of absolute values of coefficients equal to 1
(even the function "-1" which has coefficient 1 along the empty parity
function and 0 everywhere else). So, really crudely, this immediately
means that
sum_v |hat(f)_v| <= 2m+1
where m is the number of positive leaves.
Claim: this fact by itself implies that f must have most of its length
in large fourier coefficients. In particular, any vector f with L_2
length 1 (i.e., <f,f> = 1) has the property that the sum of squares of
the coefficients of magnitude less than epsilon/L_1(f) is at most epsilon,
where L_1(f) is the L_1 length of f (the sum of absolute values of
the coefficients).
Proof: a coefficient of magnitude alpha contributes alpha^2 to the sum
of squares, and alpha to the sum of absolute values. So, the small
coefficients contribute at least a L_1(f)/epsilon factor *more* to the
latter than to the former. Since in total they can contribute at
*most* L_1(f) to the sum of absolute values, they can contribute at
most epsilon to the sum of squares.
So, that's it.
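Both the crude L_1 bound and the tail claim can be verified on a toy tree (the
depth-2 tree below is a hypothetical example):

```python
from itertools import product

n = 4
points = list(product([0, 1], repeat=n))

# hypothetical depth-2 decision tree: root splits on x0, children on x1 / x2.
# Its m = 2 positive leaves give terms T_1 = (x0 AND x1), T_2 = (NOT x0 AND x2),
# and f = 2*(T_1 + T_2) - 1.
def f(x):
    if x[0] == 1:
        return 1 if x[1] == 1 else -1
    return 1 if x[2] == 1 else -1

def phi(v, x):
    return -1 if sum(a * b for a, b in zip(v, x)) % 2 else 1

coeffs = {v: sum(f(x) * phi(v, x) for x in points) / len(points) for v in points}
m = 2

# the crude bound: L_1(f) <= 2m + 1
L1 = sum(abs(c) for c in coeffs.values())
print(L1, L1 <= 2 * m + 1)

# the tail claim: coefficients of magnitude < eps/L_1(f) carry at most eps weight
eps = 0.5
small = sum(c * c for c in coeffs.values() if abs(c) < eps / L1)
print(small, small <= eps)
```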
Analyzing DNF
=============
For DNF we can't say that most of the DNF is in the large fourier
coefficients. The problem with the above reasoning is that the terms
aren't disjoint so we can't just add them up. But we *can* say that a
DNF formula must have at least one reasonably-sized fourier
coefficient.
Claim: any m-term DNF f has a fourier coefficient of magnitude at least 1/(4m).
Proof: We can assume f has a term T of size at most lg(4m): otherwise
every term is satisfied with probability less than 1/(4m), so by a union
bound the probability that a random example is positive is less than 1/4,
and the empty "all-negative" parity function has good correlation with f.
Now, here's the point: f has a noticeable correlation with T, because
when T=1, f=1 too, and when T=0 it doesn't matter. Formally, <f,T> =
E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).
Now, we use what we know about T to argue f has reasonable fourier
coefficients too. Formally,
<f,T> = sum_v hat(f)_v * hat(T)_v.
Now, using what we know about hat(T)_v, we get:
<f,T> = sum_{v with rel vars inside T} hat(f)_v * (1/2^|T|)*phi_v(T)
and we know <f,T> = 1/2^|T|. So, we get:
sum_{v with rel vars inside T} hat(f)_v * phi_v(T) = 1.
Since this sum has only 2^|T| <= 4m terms, and each phi_v(T) is +-1, at
least one hat(f)_v has magnitude at least 1/(4m).
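A brute-force check of the claim on a small DNF (the 2-term formula below,
with deliberately overlapping terms, is a hypothetical example):

```python
from itertools import product

n = 4
points = list(product([0, 1], repeat=n))

# hypothetical 2-term DNF whose terms overlap (so the decision-tree trick of
# summing disjoint terms does not apply): f = (x0 AND x1) OR (x1 AND x2)
terms = [(0, 1), (1, 2)]
m = len(terms)

def f(x):
    return 1 if any(all(x[i] for i in t) for t in terms) else -1

def phi(v, x):
    return -1 if sum(a * b for a, b in zip(v, x)) % 2 else 1

coeffs = [sum(f(x) * phi(v, x) for x in points) / len(points) for v in points]
biggest = max(abs(c) for c in coeffs)
print(biggest, biggest >= 1 / (4 * m))  # the claim: some |hat(f)_v| >= 1/(4m)
```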
=========
This means we can weak-learn DNF over the uniform distribution (with
membership queries). Jackson then combines this with a boosting
procedure, controlled so that the intermediate distributions stay close
to uniform, to get strong learning.