15-859(A) Machine Learning Theory 02/26/04
Recap SQ analysis from last time.
New problem: (Weak-)learning monotone functions over uniform distribution
- good case-study of analyzing distrib-specific learning
- Upper-bounds will be simple SQ algs. Lower bounds will be info-based.
========================================================================
Recap SQ analysis from last time.
================================
- Fourier view of functions as vectors (once we fix a distribution over data)
- Showed that if target is picked randomly from a large set of
orthogonal functions, then SQ alg is in trouble.
- can view *any* SQ alg as basically proposing a unit vector and
asking for correlation (dot-product) with target.
(Can break any SQ into one part that looks like this, and one
part that depends only on data distribution)
- Then use the fact that in any orthogonal coordinate system, a
unit vector can have at most 1/tau^2 coefficients that are >= tau
- This means that if target is picked from set S of orthogonal
functions, Pr(dot-product >= tau) <= 1/(tau^2 |S|).
- So, if > poly(n) orthogonal functions, then can't even weak-learn
with poly # of queries, each of tolerance 1/poly.
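The pigeonhole fact above is easy to sanity-check numerically (this sketch is not from the lecture; the dimension and tau are just sample values): the squared coordinates of a unit vector sum to 1, so at most 1/tau^2 of them can have magnitude >= tau.

```python
import math
import random

random.seed(0)

def random_unit_vector(dim):
    """Gaussian vector scaled to unit length."""
    v = [random.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

tau = 0.1
v = random_unit_vector(10_000)
# Squared coordinates sum to 1, so at most 1/tau^2 = 100 of them
# can have |coordinate| >= tau.
big = sum(1 for x in v if abs(x) >= tau)
assert big <= 1 / tau ** 2
```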
(Weak)-Learning Monotone functions
==================================
Today, we're going to look at the problem of (weak)-learning MONOTONE
functions, over the uniform distribution on {0,1}^n. Monotone means
that if you take some example x and flip some bit from 0 to 1, then
the value of f(x) can't change from positive to negative. Use |x| to
denote the number of 1s in x (the "size" of x).
We're going to allow targets that are arbitrarily complicated (i.e.,
consider algs that don't depend on description length of target) and
see what kinds of upper and lower bounds we can get. It turns out
that even if you assume the target has poly-size representation as
monotone DNF, no better upper-bounds (algorithmic guarantees) are
known than the ones we will discuss today. Whether you can do better
in that case is a big open question.
Quick sanity check: why are we focusing on uniform distribution? If
you allow arbitrary distribution, then you're sunk. E.g., consider
all probability mass on examples x with |x|=n/2 (the "middle slice").
Target can now be arbitrary on these, so nothing to learn.
But under uniform, you can't have too many orthogonal monotone
functions. In particular, the first analysis we will do is show that
any monotone function has to have Omega(1/n) correlation with one of
{0,1, x_1, ..., x_n}.
Specifically:
- If Pr(f(x) = 1) is far from 1/2, we're done (can just say
"everything is positive" or "everything is negative"). So, for now
let's assume Pr(f(x)=1) is between 1/4 and 3/4.
- Want to argue that some variable has correlation. I.e., for some
x_i, Pr(f(x) = 1 | x_i = 1) > Pr(f(x) = 1 | x_i = 0) + epsilon.
[Sanity check that this is same as our previous def of correlation,
i.e., x_i is a weak hypothesis. If error rate is alpha when x_i=0, then
error rate is at most 1-(alpha+epsilon) when x_i=1, so overall error
rate is their average which is (1-epsilon)/2.]
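The bracketed averaging step can be checked with concrete numbers (alpha and epsilon here are just sample values, not from the lecture):

```python
# If Pr(f=1 | x_i=0) = alpha, the hypothesis h(x) = x_i errs with
# probability alpha on the x_i=0 half and at most 1-(alpha+eps) on
# the x_i=1 half; under uniform, each half has probability 1/2.
alpha, eps = 0.30, 0.10
err_when_0 = alpha
err_when_1 = 1 - (alpha + eps)
overall = 0.5 * err_when_0 + 0.5 * err_when_1
assert abs(overall - (1 - eps) / 2) < 1e-12  # overall error = (1-eps)/2
```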
How to prove: look at boolean cube {0,1}^n. Let's color example red if
positive and blue if negative. Assume roughly half of each. E.g.,
say at least 1/4 of nodes are red and at least 1/4 are blue. Want: no
matter how you color, a lot of edges crossing from red to blue. Say
we can prove that there must be Omega(2^n) edges with one endpoint of each
color. Then there must be some direction i with Omega(2^n / n) of those
edges. And then x_i has good correlation: the 2^{n-1} edges in
direction i pair each example with x_i=0 against its neighbor with
x_i=1, and by monotonicity every crossing edge goes from negative
(x_i=0) to positive (x_i=1), so
Pr(f(x)=1 | x_i=1) - Pr(f(x)=1 | x_i=0)
= (#crossing edges in direction i)/2^{n-1} = Omega(1/n).
So, it all boils down to saying that if you want to slice a cube into
two roughly equal halves, you need to cut a lot of edges.
Can prove this using a "canonical path" argument. For each pair of
examples (x,y) where x is red and y is blue, define the canonical path
from x to y to be: "starting with x, scan down the bits from left to
right, flipping where needed". There are at least (1/4)(3/4)(2^{2n})
pairs. But, any *given* edge can be used by at most 2^n of these
paths (why: because if edge is 01010100 -> 01011100, then x must end
with 100 and y must begin with 0101). So, there must be at least
(3/16)(2^n) edges.
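The edge-counting theorem can be verified by brute force on a small cube (this check is not from the lecture; majority is just one sample monotone function with both color classes large):

```python
from itertools import product

def crossing_edges(f, n):
    """Count cube edges whose two endpoints get different labels.
    Each edge is counted once, from its x_i = 0 endpoint."""
    count = 0
    for x in product([0, 1], repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]
                if f(x) != f(y):
                    count += 1
    return count

n = 10
maj = lambda x: sum(x) > n // 2          # monotone, roughly balanced
frac_pos = sum(maj(x) for x in product([0, 1], repeat=n)) / 2 ** n
assert 1 / 4 <= frac_pos <= 3 / 4        # both color classes are large
# theorem: at least (3/16) * 2^n crossing edges
assert crossing_edges(maj, n) >= (3 / 16) * 2 ** n
```

Here every crossing edge sits between the middle slice and the slice above it, and the count (1260) comfortably beats the (3/16)*2^10 = 192 bound.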
This algorithm gives error 1/2 - Omega(1/n). Can we improve? There is
a simple lower bound of Omega(1/sqrt(n)) on the error: the "slice
function" case, where the target is random on the middle slice
(negative below the slice, positive above it). Each example has an
Omega(1/sqrt(n)) chance of having |x|=n/2, and on the slice no
algorithm can beat random guessing. This is a pretty big gap, though.
Turns out we can do better on both sides:
- an algorithm with error 1/2 - Omega(1/sqrt(n)).
- a lower bound showing no algorithm can guarantee error better than
1/2 - O(log(n)/sqrt(n)).
A better algorithm
==================
Actually, the algorithm is even simpler than the one above. Just asks two SQs:
1. What is Pr(f(x) = 1)?
2. What is Pr(f(x) = 1 | x has more 1s than 0s in it)?
Claim: any monotone function must have Omega(1/sqrt(n)) correlation
with (a) the all positive function, (b) the all negative function, or
(c) the majority function "predict positive iff |x| > n/2".
Proof: (modulo a classic thm called Kruskal-Katona thm)
- for target c, define p_k = Pr[c(x)=1 | |x|=k].
E.g., what does p_k look like for c = x_1? For c = majority?
- Claim 1: for all k, p_k >= p_{k-1}. Seems intuitively like it has
to be true. Proof: one way to pick a random x at a given level is to
start from the all 0s example and put 1s into it at random. Since
example can only switch neg->pos in this process, Pr(x is positive
after k 1s) >= Pr(x is positive after k-1 1s). So, p_k >= p_{k-1}.
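Claim 1 is easy to check by brute force for any particular monotone target (a sketch, not from the lecture; the function x_1 OR (x_2 AND x_3) is just a sample monotone f):

```python
from itertools import product

n = 8

def f(x):
    # a sample monotone function: x_1 OR (x_2 AND x_3)
    return x[0] or (x[1] and x[2])

# p_k = Pr[f(x)=1 | |x| = k], the positive density of each level
p = []
for k in range(n + 1):
    level = [x for x in product([0, 1], repeat=n) if sum(x) == k]
    p.append(sum(1 for x in level if f(x)) / len(level))

# Claim 1: p_k is nondecreasing in k
assert all(p[k] >= p[k - 1] for k in range(1, n + 1))
assert p[0] == 0 and p[n] == 1
```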
- Claim 2: In fact, Kruskal-Katona gives a quantitative version of
Claim 1: the p_k increase quickly enough that, given
1/4 <= Pr(f(x)=1) <= 3/4, we get
Pr(f(x)=1 | |x| > n/2) - Pr(f(x)=1 | |x| < n/2) = Omega(1/sqrt(n)).
So one of (a), (b), (c) has the claimed Omega(1/sqrt(n)) correlation.
A lower bound (even with membership queries)
============================================
Theorem: there is a distribution P_s over monotone targets c such that
for any algorithm making at most s membership queries (MQs),
Pr(alg predicts c(x) incorrectly) >= 1/2 - O(log(sn)/sqrt(n)),
where probability is over choice of both c and x.
We will prove a weaker bound, using [log(sn)]^{3/2} instead of log(sn).
Here is P_s.
- let t = lg(3sn).
- Put in each conjunction of t variables into c iid with probability p.
- define p so that x with |x|=n/2 has 50/50 chance of being positive.
I.e.,
(1-p)^{(n/2 choose t)} = 1/2.
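Solving for p is one line (a sketch, not from the lecture; n and s are small illustrative sizes):

```python
import math

n, s = 40, 10                         # illustrative sizes; s = # of MQs
t = round(math.log2(3 * s * n))       # t = lg(3sn), rounded to an int
m = math.comb(n // 2, t)              # an |x|=n/2 example satisfies m
                                      # candidate terms of size t
p = 1 - 0.5 ** (1 / m)                # solves (1-p)^m = 1/2
assert abs((1 - p) ** m - 0.5) < 1e-6
```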
The first part of the analysis: how well can an algorithm do without
making any MQs? If an example x has |x|=n/2 then we can't do better
than guessing, by definition of p. But what if, say, |x| = n/2 - k? Then,
Pr(x is negative) = (1-p)^{(n/2 - k choose t)}
                  = (1/2)^{(n/2 - k choose t)/(n/2 choose t)}
                  ~ (1/2)^{(1 - 2k/n)^t}
                  ~ (1/2)^{1 - 2kt/n}
                  = (1/2)*(1 + O(kt/n)),
so examples with fewer 1s are (slightly) more likely to be negative.
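This approximation chain can be checked numerically (a sketch with illustrative sizes, not from the lecture; the approximation needs kt << n):

```python
import math

n, s = 10_000, 10
t = round(math.log2(3 * s * n))        # t = lg(3sn) = 18 here
k = 100                                 # an example with |x| = n/2 - k

# exact: (1-p)^C(n/2-k, t) = (1/2)^{C(n/2-k,t)/C(n/2,t)}
exact = 0.5 ** (math.comb(n // 2 - k, t) / math.comb(n // 2, t))
approx = 0.5 ** (1 - 2 * k * t / n)     # the (1/2)^{1 - 2kt/n} step
assert exact > 0.5                      # fewer 1s => more likely negative
assert abs(exact - approx) < 0.05
```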
So, say we set k to be O(sqrt(n log n)). There's at most a 1/n chance
that a random x has |x| \not\in [n/2 - k, n/2 + k]. For those rare x,
maybe we can get a good bias, but for the rest, our bias is at most
O(kt/n) = O(log(sn)^{3/2}/sqrt(n)).
So, that's where our log^{3/2} is coming from. Can be more careful
and just make it a single log. But now the rest of the proof is
showing that MQs don't help much.
Here's the idea. Let V_s be a vector of (n choose t) entries, giving
the probability of each term of size t being in the target function
conditioned on the information the alg has so far. So, initially we have:
V_s = (p,p,.....,p).
Now, suppose the algorithm asks a query x and gets back the answer
"negative". Then the conditional distribution is just to zero out
the entries for conjunctions of size t satisfied by x. For positive
answers, to keep the conditional clean, let's actually give the
algorithm *more* information, by telling it the lexicographically
first term in the target that is satisfied by x. So, after a positive
example, we get something like:
V_s = (p,...,p, 0,0,...,0,1,p,....,p)
Claim 1: after s queries, we have <= s ones in V_s.
Proof: obvious.
Claim 2: after s queries, Pr(#zeros > 2s/p) < e^{-s/4}.
Proof: Can think of each query as flipping coins of bias p until get a
heads or else you run out of coins (terms that x might satisfy). So,
the question is equivalent to asking: how many times will we flip a
coin of bias p until we see s heads? In 2s/p flips we expect to see
2s heads. Chernoff says that the chance we see less than half is at
most e^{-expectation/8} = e^{-s/4}. Since the flipping ends at s
heads, this gives the claim.
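A quick Monte Carlo check of Claim 2's coin-flipping view (a sketch, not from the lecture; p and s are sample values, and we let each query's flips run until a head, which can only overcount zeros relative to running out of candidate terms):

```python
import random

random.seed(1)
p, s, trials = 0.05, 50, 1000
exceed = 0
for _ in range(trials):
    zeros = 0
    for _ in range(s):                 # one geometric run per query
        while random.random() >= p:    # each tail zeros out one entry
            zeros += 1
    if zeros > 2 * s / p:              # did we pass 2s/p zeros?
        exceed += 1
# Claim 2: Pr(#zeros > 2s/p) < e^{-s/4}, about 3.7e-6 for s = 50
assert exceed == 0
```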
So, now say we have a random x.
Q: What is Pr(x satisfies a "1-entry")?
A: <= s/2^t = s/(3sn) = 1/(3n).
Q: What is the expected number of 0-entries satisfied?
A: <= (2s/p)/2^t = 2/(3pn).
By Markov, at most 1/sqrt(n) chance it satisfies more than 2/(3p*sqrt(n)).
==> since each term is in target with prob p, this affects the
probability that x is positive by at most p*[2/(3p*sqrt(n))] = 2/(3sqrt(n)).
So, that does it. The queries can only help by O(1/sqrt(n)).
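The final arithmetic, with t = lg(3sn) kept exact so that 2^t = 3sn (a sketch, not from the lecture; p and the sizes are illustrative):

```python
import math

n, s, p = 40, 10, 0.01
t = math.log2(3 * s * n)                 # keep t exact: 2^t = 3sn
# Pr(x satisfies a 1-entry) <= s/2^t = 1/(3n)
assert abs(s / 2 ** t - 1 / (3 * n)) < 1e-9
# E[# 0-entries satisfied] <= (2s/p)/2^t = 2/(3pn)
assert abs((2 * s / p) / 2 ** t - 2 / (3 * p * n)) < 1e-6
# Markov: at most 1/sqrt(n) chance of more than sqrt(n) times that,
# and p times 2/(3p*sqrt(n)) terms shifts Pr(x positive) by 2/(3*sqrt(n))
shift = p * (2 / (3 * p * math.sqrt(n)))
assert abs(shift - 2 / (3 * math.sqrt(n))) < 1e-12
```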
Open problems
=============
Hard open question: strong-learn polynomial-size monotone DNF formulas
without queries?
Probably easier problem: close the log(n) gap.