15-859(B) Machine Learning Theory 03/19/14
More Fourier:
* Learning via Membership queries II
- Connecting KM algorithm to L_1 length
- Fourier spectrum of Decision Trees and DNF
* Classes of high SQ dimension have no large-margin kernels.
=======================================================================
Fourier analysis and Query learning
===================================
Looking at learning function f: {0,1}^n -> {-1,1}, where our goal is
to come up with an approximation g which agrees with f on most of the
domain (i.e., want to do well with respect to the uniform distribution).
Considering basis of parity functions. I.e., \hat{f}_v = <f,phi_v> =
E[f(x)phi_v(x)] is the correlation of f with the parity function
having indicator vector v, with respect to the uniform distribution.
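For small n, these coefficients can be computed by brute force over the whole domain. A minimal sketch (the helper names phi and fourier_coeff are my own, not from lecture):

```python
from itertools import product

def phi(v, x):
    # Parity function with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

def fourier_coeff(f, v, n):
    # \hat{f}_v = E[f(x) phi_v(x)], averaged over all 2^n inputs (uniform)
    return sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n

# Example: if f is itself a parity function, its coefficient on that v is 1
# and its coefficient on every other parity is 0.
n = 4
v = (1, 0, 1, 0)
f = lambda x: phi(v, x)
print(fourier_coeff(f, v, n))             # 1.0
print(fourier_coeff(f, (1, 1, 0, 0), n))  # 0.0
```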
Last time, we proved:
THM (KM algorithm): if you have ability to make membership queries,
can find all large (|\hat{f}_v| > 1/poly(n)) fourier coeffs in poly time.
Alg idea: estimate sum of squares of all fourier coefficients
corresponding to a given prefix. Trick of using E[f(yx)f(zx)]. Grow
a tree, expand nodes with value > tau.
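A toy version of the algorithm, enumerating the expectations exactly instead of estimating them by sampling as KM actually would (all function names are illustrative):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

def prefix_weight(f, alpha, n):
    # Sum of \hat{f}_{alpha.beta}^2 over all suffixes beta.  Uses the trick
    # from lecture: this equals E_x[(E_y[f(yx) phi_alpha(y)])^2], which
    # expands to E_{x,y,z}[f(yx) f(zx) phi_alpha(y) phi_alpha(z)].
    k = len(alpha)
    total = 0.0
    for x in product((0, 1), repeat=n - k):
        s = sum(f(y + x) * phi(alpha, y)
                for y in product((0, 1), repeat=k)) / 2**k
        total += s * s
    return total / 2**(n - k)

def km_large_coeffs(f, n, tau):
    # Grow a tree of prefixes, expanding only nodes with weight > tau^2.
    # Any v with |\hat{f}_v| > tau has every prefix-weight > tau^2, so it
    # survives; Parseval keeps each level's frontier below 1/tau^2 nodes.
    frontier = [()]
    for _ in range(n):
        frontier = [a + (b,) for a in frontier for b in (0, 1)
                    if prefix_weight(f, a + (b,), n) > tau**2]
    return frontier

# Example: a single parity function; KM recovers its one big coefficient.
n, v = 4, (1, 0, 1, 0)
f = lambda x: phi(v, x)
print(km_large_coeffs(f, n, tau=0.5))   # [(1, 0, 1, 0)]
```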
So, this means we can learn functions that have most of their length
in their large fourier coefficients. Today: what kinds of functions
have this property? Start with a useful def and structural result:
Define L_1(f) = sum_v |hat(f)_v|.
So, this is the L_1 length of f in the parity basis. Note that while
L_2 length is independent of your choice of basis (and for boolean
functions it's always 1), the L_1 length depends on your basis! E.g.,
if f is a parity function, then it has L_1 length = 1 in the parity
basis, but it has L_1 length = 2^{n/2} (as do all boolean functions)
in the standard basis where you list f as (f(0)*2^{-n/2}, f(1)*2^{-n/2},...).
THM: any boolean function f must have most of its L_2 length in
fourier coefficients that are larger than epsilon/L_1(f):
sum_{v: |hat{f}_v| < epsilon/L_1(f)} (hat{f}_v)^2 < epsilon
Proof: LHS < sum_{v: |hat{f}_v| < epsilon/L_1(f)} |hat{f}_v|*epsilon/L_1(f)
<= epsilon/L_1(f) * sum_{v} |hat{f}_v|
= epsilon. QED
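Both Parseval and this theorem can be sanity-checked on a small random function by brute-forcing all 2^n coefficients (a hypothetical test; phi is my own helper):

```python
import random
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

# A random boolean function f: {0,1}^4 -> {-1,1}
random.seed(0)
n = 4
table = {x: random.choice((-1, 1)) for x in product((0, 1), repeat=n)}
f = table.__getitem__

coeffs = [sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
          for v in product((0, 1), repeat=n)]
L1 = sum(abs(c) for c in coeffs)                  # L_1 length, parity basis
assert abs(sum(c * c for c in coeffs) - 1) < 1e-9  # Parseval: L_2 length is 1

# The theorem: coefficients smaller than eps/L_1(f) carry < eps of the mass.
eps = 0.25
small_mass = sum(c * c for c in coeffs if abs(c) < eps / L1)
assert small_mass < eps
```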
On the Fourier Spectrum of DNF and Decision Trees
=================================================
Consider a single conjunction (not necessarily monotone). Let T(x) =
1 if x satisfies the conjunction and T(x)=0 otherwise. So, it's not a
boolean fn in our sense. In fact <T,T> = E[T(x)T(x)] = 1/2^|T|.
If we analyze the fourier coefficients we get:
hat{T}_v = E[T(x)*phi_v(x)] = E[phi_v(x) | T(x)=1]*Pr[T(x)=1]
= 0 if phi_v has a relevant variable outside T (by
usual parity argument)
= (1/2^|T|)*phi_v(T) otherwise, where phi_v(T) is the
value of phi_v on an example satisfying T.
So it's very simple: the sum of absolute values of the coefficients
is 1, and the sum of squares is 1/2^|T| (which we knew already).
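A quick numeric check of these facts for a sample conjunction (the formula and names like T_vars are illustrative, not from lecture):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 4
# Conjunction T = (x0 AND NOT x1) as a 0/1 indicator; |T| = 2 literals.
T_vars = {0: 1, 1: 0}          # variable index -> required value
T = lambda x: 1 if all(x[i] == b for i, b in T_vars.items()) else 0

coeffs = {v: sum(T(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
          for v in product((0, 1), repeat=n)}

# Coefficient is 0 whenever phi_v has a relevant variable outside T ...
assert all(c == 0 for v, c in coeffs.items()
           if any(v[i] for i in range(n) if i not in T_vars))
# ... sum of absolute values is 1, and sum of squares is 1/2^|T| = 1/4.
assert abs(sum(abs(c) for c in coeffs.values()) - 1) < 1e-9
assert abs(sum(c * c for c in coeffs.values()) - 1 / 2**len(T_vars)) < 1e-9
```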
Analyzing Decision Trees
========================
Suppose f is a decision tree. Using the terms corresponding to its
positive leaves (these are disjoint: each x reaches exactly one leaf),
we can write f as
f(x) = [T_1(x) + ... + T_m(x)]*2 - 1
So, to get the fourier coefficients hat(f)_v, we just add up the
coefficients of the corresponding functions. One easy thing to notice is that each
function here has a sum of absolute values of coefficients equal to 1
(even the function "-1" which has coefficient 1 along the empty parity
function and 0 everywhere else). So, really crudely, this immediately
means that
sum_v |hat(f)_v| <= 2m+1
where m is the number of positive leaves. So, this means we can use
the KM algorithm to learn Decision Trees with queries wrt the uniform
distribution.
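The L_1 bound can be verified on a tiny decision tree (the tree and all names below are my own example):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 3
# Decision tree: test x0; if 1, output +1; else test x1: if 1 output -1,
# if 0 output +1.  Positive leaves give terms T1=(x0=1), T2=(x0=0, x1=0),
# so m = 2 and f = 2*(T1 + T2) - 1.
def f(x):
    return 1 if x[0] == 1 or x[1] == 0 else -1

m = 2  # number of positive leaves
coeffs = [sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
          for v in product((0, 1), repeat=n)]
L1 = sum(abs(c) for c in coeffs)
assert L1 <= 2 * m + 1                              # the crude L_1 bound
assert abs(sum(c * c for c in coeffs) - 1) < 1e-9   # Parseval sanity check
```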
Analyzing DNF
=============
For DNF we can't say that most of the function's length is in the
large fourier coefficients. The problem with the above reasoning is
that the terms aren't disjoint, so we can't just add them up. But we
*can* say that a DNF formula must have at least one reasonably-sized
fourier coefficient.
Claim: any m-term DNF f has a fourier coefficient of magnitude at least 1/(4m).
Proof: We can assume f has a term T of size at most lg(4m), else the
probability that a random example is positive for f is at most 1/4,
and so the empty "all negative" parity function has good correlation.
Now, here's the point: f has a noticeable correlation with T, because
when T=1, f=1 too, and when T=0 it doesn't matter. Formally, <f,T> =
E[f(x)T(x)] = E[T(x)T(x)] = 1/2^|T| >= 1/(4m).
Now, we use what we know about T to argue f has reasonable fourier
coefficients too. Formally,
<f,T> = sum_v hat(f)_v * hat(T)_v.
Now, using what we know about hat(T)_v, we get:
= sum_{v with rel vars inside T} hat(f)_v * (1/2^|T|)phi_v(T)
and we know this equals 1/2^|T|. So, we get:
sum_{v with rel vars inside T} hat(f)_v * phi_v(T) = 1.
There are only 2^|T| <= 4m terms in this sum, and each phi_v(T) is
+-1, so at least one hat(f)_v must have magnitude at least
1/2^|T| >= 1/(4m).
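A brute-force check of the claim on a small example DNF (a hypothetical formula; for real n one would estimate coefficients by sampling as in KM):

```python
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 4
# 2-term DNF: (x0 AND x1) OR (x2 AND NOT x3), as a +-1 function.
terms = [{0: 1, 1: 1}, {2: 1, 3: 0}]
def f(x):
    return 1 if any(all(x[i] == b for i, b in t.items())
                    for t in terms) else -1

m = len(terms)
coeffs = [abs(sum(f(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n)
          for v in product((0, 1), repeat=n)]
# The claim: some fourier coefficient has magnitude >= 1/(4m).
assert max(coeffs) >= 1 / (4 * m)
```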
(plus this says something about what their Fourier coefficients look like)
======================================================
Strong-learning DNF wrt uniform distribution: the above result gives
us weak-learning. Jackson (CMU PhD thesis '94) showed how to use a
specific boosting algorithm, one which doesn't change the distribution
too much, to convert this into strong learning. Key idea: in the
Fourier view, one can think of keeping the distribution fixed, but
having the target change and become non-boolean. If it doesn't change
too much, can still argue about having a large fourier coeff.
======================================================
High SQ dimension => no large-margin kernel
===========================================
Two lectures ago we proved that any class C with N > poly(n) pairwise
uncorrelated functions wrt D cannot be weak-learned over D in the SQ
model. Will start today with a related result (in some sense an
implication): any such class cannot have a large-margin kernel.
In particular, if C has N pairwise uncorrelated functions wrt D, then
for any kernel K, there must be some f in C with margin < 8/sqrt(N).
Moreover, this holds even if we allow average hinge-loss as large as 1/2.
Proof:
- Say the N uncorrelated functions are f_1, ..., f_N
- We know for any hypothesis h, sum_i E_D[h(x)f_i(x)]^2 <= 1.
- So, their average is at most 1/N, and since E[|X|] <= sqrt(E[X^2]),
    avg_i |E_D[h(x)f_i(x)]| <= 1/sqrt(N).
- Now, suppose we had a good kernel for all f_i (say margin > gamma
on all points).
- Fix one such f_i. We showed earlier that for a *random* linear
separator h, E_h[|E_D[h(x)f_i(x)]|] = Omega(gamma).
- In fact, even if we allow average hinge-loss as large as 1/2, we get
    E_h[|E_D[h(x)f_i(x)]|] >= gamma/8.
- Averaging over i and swapping the order of expectations gives
  E_h[avg_i |E_D[h(x)f_i(x)]|] >= gamma/8, so there must *exist* an h
  such that
    avg_i |E_D[h(x)f_i(x)]| >= gamma/8.
- But we showed above that avg_i |E_D[h(x)f_i(x)]| <= 1/sqrt(N) for
  *every* h. So gamma/8 <= 1/sqrt(N), i.e., gamma <= 8/sqrt(N). QED
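The Parseval and averaging steps at the start of the proof can be checked numerically, taking the parity functions as the pairwise-uncorrelated class and an arbitrary boolean h (an illustrative test with my own helper names):

```python
import math
import random
from itertools import product

def phi(v, x):
    # Parity with indicator vector v: (-1)^(v . x)
    return -1 if sum(vi & xi for vi, xi in zip(v, x)) % 2 else 1

n = 4
N = 2**n   # the N = 2^n parity functions are pairwise uncorrelated wrt uniform

# An arbitrary boolean hypothesis h: {0,1}^4 -> {-1,1}
random.seed(1)
table = {x: random.choice((-1, 1)) for x in product((0, 1), repeat=n)}
h = table.__getitem__

# Correlations E_D[h(x) f_i(x)] of h with each f_i (each parity).
corrs = [sum(h(x) * phi(v, x) for x in product((0, 1), repeat=n)) / 2**n
         for v in product((0, 1), repeat=n)]

assert sum(c * c for c in corrs) <= 1 + 1e-9                    # Parseval
assert sum(abs(c) for c in corrs) / N <= 1 / math.sqrt(N) + 1e-9  # avg bound
```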