15-859(A) Machine Learning Theory                     02/19/04, 02/24/04

* SQ learnable => PAC-learnable with random noise.
* Characterizing what's learnable in SQ model via Fourier analysis
========================================================================

Last time: introduced the random classification noise model. We imagine
there is some true target function c, but the labels are sent through a
noisy channel with noise rate eta. The goal is to produce h that is
epsilon-close to c; epsilon might be << eta. Saw how to do this for
conjunctions.

 - nice project: try to look at variants of noise models.

Statistical Query model
-----------------------
No noise. The algorithm asks "what is the probability that a labeled
example will have property chi? Please tell me up to additive error
tau."

Formally, chi: X x {0,1} --> {0,1} must be poly-time computable, and
tau > 1/poly. We are asking for Pr_D[chi(x,c(x)) = 1].

Examples:
 - Can ask for the error rate of the current hypothesis h
   (what is chi(x,y)?)
 - How could we view the ID3 decision-tree algorithm in this model?
   E.g., we might ask for the correlation of x_2 with the target given
   that x_1 = 1. So we ask for Pr(x_2 = c(x) AND x_1 = 1) and we ask
   for Pr(x_1 = 1), and then divide to get Pr(x_2 = c(x) | x_1 = 1).
 - Why would it be "cheating" if we allowed exponentially small tau?
   What would that allow you to do?

(Can also handle chi with range [0,1].)

The algorithm may repeat this polynomially many times, and in the end it
must output a hypothesis with error < epsilon (no delta in this model).
The algorithm may also ask to see unlabeled examples.

Example: conjunctions.
 - Ask for Pr[c(x) = 1 AND x_i = 0] with tau = epsilon/2n.
 - Produce the conjunction of all x_i such that the response was
   <= epsilon/2n.

SQ => PAC + noise
=================
Want to show how to convert any SQ algorithm into one that works in the
random classification noise model. In particular, given a query chi, we
need to estimate it from noisy data.

Idea:
 - Break chi into a part predictably affected by noise and a part
   unaffected by noise.
 - Estimate these parts separately.

Can draw fresh examples for each query, or estimate many queries on the
same sample if the VC-dimension of the algorithm's possible queries is
small.

Specifically, we'll break the instance space X into two parts:
 - CLEAN = {x : chi(x,0) = chi(x,1)}.
 - NOISY = {x : chi(x,0) != chi(x,1)}.

For example, suppose we are simulating ID3. We want to ask for, e.g.,
the correlation of x_2 with the target given that x_1 = 1. So we ask
for Pr(x_2 = c(x) AND x_1 = 1) and we ask for Pr(x_1 = 1) and then
divide. For the first query, what are CLEAN and NOISY? What about for
the 2nd query?

We can get Pr((chi = 1) AND (x in CLEAN)) without even looking at the
labels: we draw a large sample, test each x for membership in CLEAN by
running chi, and count.

Now, what about Pr((chi = 1) AND (x in NOISY))?
 - First estimate Pr[x in NOISY]. If it is small enough, we are done.
 - Otherwise, we just need to estimate P = Pr(chi = 1 | x in NOISY).
   The point here is that noise affects this in a predictable way.
   In particular, given that x is in NOISY, the observed probability is
   P_eta = (1-eta)P + eta(1-P). So,
     - estimate P_eta;
     - use P = (P_eta - eta)/(1 - 2*eta);
     - we just need to estimate P_eta up to additive error
       O(tau(1-2eta)).
   If we don't know eta, we can guess and verify. (A code sketch of
   this simulation follows below.)
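To make the simulation concrete, here is a minimal sketch (in Python) of
answering one statistical query from noise-corrupted examples, assuming
the noise rate eta < 1/2 is known; the names (simulate_sq, noisy_sample,
etc.) are just for illustration, not part of the lecture:

    def simulate_sq(chi, noisy_sample, eta, tau):
        """Estimate Pr_D[chi(x, c(x)) = 1] from (x, y) pairs whose labels
        were flipped independently with probability eta (eta < 1/2)."""
        m = len(noisy_sample)

        # CLEAN: chi ignores the label, so flipped labels don't matter.
        clean = [(x, y) for (x, y) in noisy_sample if chi(x, 0) == chi(x, 1)]
        noisy = [(x, y) for (x, y) in noisy_sample if chi(x, 0) != chi(x, 1)]

        p_clean = sum(chi(x, 0) for (x, y) in clean) / m   # Pr[chi=1 AND x in CLEAN]
        pr_noisy = len(noisy) / m                          # Pr[x in NOISY]

        if pr_noisy <= tau / 2:     # NOISY part too small to matter
            return p_clean

        # On NOISY points the observed frequency is P_eta = (1-eta)P + eta(1-P),
        # so invert to recover P = Pr[chi=1 | x in NOISY].
        p_eta = sum(chi(x, y) for (x, y) in noisy) / len(noisy)
        p = (p_eta - eta) / (1 - 2 * eta)
        return p_clean + pr_noisy * p

If eta is unknown, one would run this for a grid of guessed values of
eta and verify, as mentioned above.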
Characterization of SQs using Fourier analysis
==============================================

SQ-dimension
============
- We will say that functions f and g are uncorrelated if
  Pr_D[f(x) = g(x)] = 1/2.
- We will define the "SQ-dimension" of a class C under distribution D to
  be the size of the largest set S of "nearly uncorrelated" functions in
  C. Specifically, the largest S such that for all f,g in S,
     | Pr_D[f(x) = g(x)] - 1/2 | < 1/|S|.

Theorem 1: If SQdim_D(C) = poly(n), then you *can* weak-learn C over D
by SQ algorithms.

Theorem 2: If SQdim_D(C) is superpolynomial in n, then you *can't*
weak-learn C over D by SQ algorithms.

Example: parity
 - Let D be the uniform distribution over {0,1}^n.
 - Fact: any two distinct parity functions are uncorrelated over the
   uniform distribution.
 - So, C = {all 2^n parity functions over {0,1}^n} is not SQ-learnable
   over the uniform distribution.

Example: parity_log
 - Let C_log = { parity functions of size log(n) }.
 - |C_log| = (n choose log n), which is roughly n^{log n} > poly(n).
   So, this class is also not SQ-learnable.
 - This means that the class of poly-size DNF and the class of poly-size
   decision trees are not SQ-learnable. Why? Because they contain C_log.

Can anyone think of a non-SQ algorithm to learn parity? Learnability of
parity with noise is a big open problem in cryptography and coding
theory.

Theorem 1 is the simpler of the two.

Proof of Theorem 1: Let d = SQdim_D(C). The algorithm will be specific
to the distribution D. In particular, we store in advance a maximal set
H of functions such that for all f,g in H,
   | Pr[f(x) != g(x)] - 1/2 | < 1/(d+1).
So, |H| <= d. To learn, ask for Pr[c(x) = h(x)] for each h in H, with
tau = 1/(4(d+1)). We know there must be some h such that the correct
answer differs from 1/2 by at least 1/(d+1), else H wouldn't be maximal.
So, just keep asking until we find an h whose response differs from 1/2
by at least 3/(4(d+1)), which means the *true* answer for that h differs
from 1/2 by at least 1/(2(d+1)). Then output either h or not(h), as
appropriate.

(Note: for this to imply PAC learning for all D, we need small SQ-dim
for all D, and we also need to be able to construct the hypotheses in
poly time.)
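A minimal sketch of this weak learner, assuming we are handed the
precomputed set H and an SQ oracle sq(chi, tau) that returns
Pr_D[chi(x, c(x)) = 1] up to additive error tau (both names are
hypothetical):

    def weak_learn(H, sq, d):
        """Weak-learn by querying the agreement of each h in H with the
        target, following the Theorem 1 argument (labels are 0/1)."""
        tau = 1.0 / (4 * (d + 1))
        for h in H:
            agree = sq(lambda x, y: 1 if h(x) == y else 0, tau)
            if abs(agree - 0.5) >= 3.0 / (4 * (d + 1)):
                # true agreement differs from 1/2 by >= 1/(2(d+1)),
                # so h (or its negation) is a weak hypothesis
                return h if agree > 0.5 else (lambda x: 1 - h(x))
        # cannot happen if H is maximal for the threshold 1/(d+1)
        return None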
Now, let's go to Theorem 2, which is the more interesting one. To make
things simpler, let's change "nearly uncorrelated" to "uncorrelated".
That is, we will show that if C contains a super-poly-size set of
uncorrelated functions over D, then C is not even weakly SQ-learnable
over D.

Fourier analysis of finite functions
------------------------------------
Fancy name for something not inherently complicated. But it *is* subtle
conceptually. I think it's neat, so *please* stop me when it gets
confusing....

First of all, let's think of positive as +1 and negative as -1, so a
boolean concept is a function from {0,1}^n to {-1,+1}.

Think of a function f as a vector of 2^n elements, one for each input.
We'll associate with f the vector

   ( sqrt(Pr(000))*f(000), sqrt(Pr(001))*f(001), ..., sqrt(Pr(111))*f(111) ).

In other words, this is the truth table of f, but weighted by sqrt(Pr(x)).

What is the (L^2) length of f as a vector? It is
sqrt( sum_x Pr(x)*f(x)^2 ) = 1, since f(x)^2 = 1.

What is <f,g>?

   <f,g> = sum_x Pr(x)*f(x)*g(x) = E_D[f(x)g(x)]
         = Pr(f(x) = g(x)) - Pr(f(x) != g(x)).

Let's call this the correlation of f and g with respect to the
distribution D. <f,g> = 0 means that f and g are uncorrelated: they are
orthogonal as vectors.

So, this is a nice way of looking at a set of functions. Each function
is a vector in this (2^n)-dimensional space, and the dot product between
two vectors tells us how correlated they are.

"Fourier analysis" is just a fancy way of saying that we want to talk
about what happens when you make a change of basis. An "orthonormal
basis" is a set of orthogonal unit vectors that span the space. For
instance, in 2-d, let x' be the unit vector in the x direction and y'
be the unit vector in the y direction. Then the vector v = (3,2) is
3*x' + 2*y'. Say we have two other orthogonal unit vectors a and b.
Then we could also write v in terms of a and b as

   v = <v,a>*a + <v,b>*b.

This is called a change of basis.

In our 2^n-dimensional space, an orthonormal basis is a set of 2^n
orthogonal unit vectors. For instance, if D is the uniform distribution,
then the parity functions are one possible orthonormal basis. Let's fix
one: phi_1, phi_2, ..., phi_{2^n}. Given a vector f, let f_i be the ith
entry in the standard basis, i.e., f_i = f(i)*sqrt(Pr(i)). Then
hat{f}_i = <f, phi_i> is the ith entry in the phi basis. For instance,
we could write the vector f as

   f = sum_i hat{f}_i * phi_i.

The hat{f}_i are called the "Fourier coefficients of f" in the phi
basis. Since, as a vector, we have f = sum_i hat{f}_i * phi_i, this also
means that as functions,

   f(x) = sum_i hat{f}_i * phi_i(x).

The reason is that the left-hand side is the xth coordinate of f,
divided by sqrt(Pr(x)). So, all we're saying is that if one vector is
the sum of a bunch of other vectors, then the xth coordinate of the
vector is the sum of the xth coordinates of the other vectors.

So, nothing fancy so far, but here's one neat thing that comes out of
the point of view we're using: since f is a unit vector, the sum of the
squares of its coefficients is equal to 1. So,

   sum_i (hat{f}_i)^2 = 1.

(If you want to be fancy, you could call this fact "Parseval's
identity".)

One thing this implies is: at most t^2 of the hat{f}_i can be greater
than 1/t in absolute value. In other words, any boolean function f can
be weakly correlated with at most a polynomial number of the phi's.
More generally, if we have some set S of pairwise uncorrelated
functions, then f can be weakly correlated with at most a polynomial
number of functions in S (since we can extend this set arbitrarily into
an orthonormal basis).

Here is the point: Suppose we ask a query "what is the correlation of
the target with my hypothesis h?" with some tau = 1/poly. Then, if the
target function is chosen randomly from some superpolynomial-sized set
of uncorrelated functions (like the parity functions), w.h.p. the
correct answer is infinitesimally small, so a legal response is 0.
(Each question throws out only a polynomial number of possibilities.)

Or, to think about what this means back in the PAC-style setting: if
you have a fixed set of data and you're doing hill-climbing, then each
time you check your current error, w.h.p. the correct answer is
infinitesimally close to 1/2, and all the variation is due to the
variation in the data sample. It will take more than a polynomial
number of steps to stumble on a good query.

It turns out that ANY statistical query can basically be viewed in this
way.
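As a sanity check on this picture, here is a small sketch that computes
the Fourier coefficients of a boolean function in the parity basis under
the uniform distribution and verifies Parseval's identity; the function
and variable names are just for illustration:

    import itertools

    n = 4
    points = list(itertools.product([0, 1], repeat=n))       # all of {0,1}^n

    def parity(S):
        # chi_S(x) = (-1)^(sum of x_i over i in S), a +/-1 valued parity
        return lambda x: 1 - 2 * (sum(x[i] for i in S) % 2)

    def fourier_coef(f, phi):
        # hat{f}_S = <f, phi> = E_{x ~ uniform}[ f(x) * phi(x) ]
        return sum(f(x) * phi(x) for x in points) / len(points)

    # an arbitrary boolean function, mapped to {-1,+1}
    f = lambda x: 1 if sum(x) >= 2 else -1

    coefs = [fourier_coef(f, parity(S))
             for r in range(n + 1)
             for S in itertools.combinations(range(n), r)]

    print(sum(c * c for c in coefs))                 # Parseval: sums to 1.0
    t = 2
    print(sum(1 for c in coefs if abs(c) > 1 / t))   # at most t^2 = 4 such coefficients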
Proof:

Step 1: Say the set S of orthogonal functions is phi_1, ..., phi_m.
Let's extend this arbitrarily to an orthonormal basis
phi_1, ..., phi_m, ..., phi_{2^n} (only the first m of these are
functions in our concept class).

Instead of saying a statistical query is a 0/1 function, let's make it
a -1/+1 function. So, a query chi takes {0,1}^n x label -> {-1,+1}, and
we're asking for E_D[chi(x, c(x))], where c is one of the phi's
(phi_target). We want to think of this as asking for a correlation.

Step 2: One way that chi is different from a hypothesis is that it is a
function of the label too. To handle this technical difficulty, let's
extend our setting to the domain {0,1}^n x {-1,+1}.
 * Define the distribution D' = D x (uniform on {-1,+1}).
 * Define phi_i(x,l) = phi_i(x). These are still orthogonal:
      Pr_{D'}[phi_i(x,l) = phi_j(x,l)] = Pr_D[phi_i(x) = phi_j(x)] = 1/2.
 * We need 2^n more basis functions. Let's define h_i(x,l) = phi_i(x)*l.
   Need to verify this is OK:
    - check that h_i and h_j are orthogonal (for i != j);
    - check that h_i and phi_j are orthogonal:
         Pr[h_i(x,l) = phi_j(x)]
           = 1/2 * Pr[phi_i(x) = phi_j(x)] + 1/2 * Pr[-phi_i(x) = phi_j(x)],
      which is 1/2 even if i = j.

Step 3: Do a Fourier decomposition of chi:

   chi = sum_i alpha_i*phi_i + sum_i beta_i*h_i,
   where sum_i (alpha_i)^2 + sum_i (beta_i)^2 = 1.

So, we can write the quantity we're asking about as

   E_D[chi(x, c(x))]
     = E_D[ sum_i alpha_i*phi_i(x) + sum_i beta_i*h_i(x, c(x)) ]
     = sum_i alpha_i*E_D[phi_i(x)] + sum_i beta_i*E_D[c(x)*phi_i(x)].

The first term is some constant offset that depends on the query but is
independent of the target function. What about the second term? Since
c = phi_target is one of the phi's, and since they are orthogonal, each
expectation E_D[c(x)*phi_i(x)] is 0 except for i = target. So, the
second term is just beta_target. And since at most 1/tau^2 of the betas
can be larger than tau in absolute value, and the target was chosen
randomly from a superpolynomial-sized set, w.h.p. beta_target is less
than tau. So, w.h.p. we can respond with just the first part, which is
independent of the target function. That's it.
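To see Step 3 numerically, here is a toy sketch (uniform distribution,
parity targets, one fixed query; all names are illustrative): the
oracle's answer E_D[chi(x, c(x))] is identical for every parity target
except the single one the query happens to correlate with, so one query
rules out only a tiny fraction of the candidate targets.

    import itertools

    n = 10
    points = list(itertools.product([0, 1], repeat=n))

    def parity(S):
        return lambda x: 1 - 2 * (sum(x[i] for i in S) % 2)   # +/-1 valued

    # a toy +/-1 query: "does the label agree with x_0?"
    chi = lambda x, label: label * (1 - 2 * x[0])

    # exact value of E_D[chi(x, c(x))] for every parity target of size <= 2
    answers = {}
    for r in range(3):
        for S in itertools.combinations(range(n), r):
            c = parity(S)
            answers[S] = sum(chi(x, c(x)) for x in points) / len(points)

    # Only the target S = (0,) correlates with this query; every other
    # parity gets the same answer (0), so the reply reveals almost nothing
    # about which parity is the target.
    print(answers[(0,)], [a for S, a in answers.items() if S != (0,)][:5])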