15-859(B) Machine Learning Theory            03/02/04

* Finish topic of learning over uniform distribution
* Learning with Membership Queries (active learning)
  - Some simple algorithms:
     - monotone DNF
     - log(n)-variable functions
  - begin KM algorithm for finding large Fourier coefficients
=========================================================================

Discuss project/presentation topics...

Learning over uniform distribution on {0,1}^n
=============================================
We've seen over last few classes a collection of results that relate
to the problem of learning over the uniform distribution on {0,1}^n.
[I.e., have access to uniform random examples, and goal is to find h
such that h(x)=f(x) on most of the domain.]

- SQ lower bounds imply that if C contains > poly(n) parity
  functions, then can't even weak-learn wrt uniform in poly time.
  E.g., decision trees, DNF have n^{log n} parities in them.

- But you can weak-learn any monotone function.  Saw simple alg with
  error < 1/2 - 1/sqrt(n).  Lower bound saying can't do much better in
  general.   [But if you allow # samples to be polynomial in the
  description length of the target as a monotone DNF, then lower-bound
  breaks down and this is an open problem]

A few more results:

- We can complement our SQ lower bound with a simple n^{O(log s)} time
  algorithm to learn DNF of s terms over uniform distribution.   Ideas?

  + how important are terms of size >> log(s)?
  + Do greedy set-cover.
    Get DNF of s*log(1/epsilon) terms.  Size is bounded so can use Occam
    to upper-bound sample size needed.

    [Can't do "list-and-cross-off" because hypothesis size too big]

- Note: LMN use Fourier analysis to get this kind of n^polylog(s)
  bound for learning AC^0 circuits.  (DNF are depth-2 circuits). 

- On hwk, you are giving alg to learn decision trees over *any*
  distribution in time n^{log n}.   


Learning with Membership queries
================================

In addition to getting random examples, the learning algorithm is
allowed to ask for c(x) for *any* x of its choosing.  These are called
"membership queries" (MQs).  E.g., this makes the most sense if we
think of the target function as some device that we are trying to
reverse-engineer.  We can put it into "normal operating conditions"
and observe its behavior, but we can also poke it with inputs of our
own choosing and see what it does.  So, membership queries are a kind
of "active learning".  Can define the following models:


PAC + MQ: Can sample from D and ask MQs.   Goal: low error wrt D.

Unif + MQ: Special case D=uniform.  Can view setting is you have a
    black box that you can prod as you like and you want h that agrees
    with target on most of the domain.

MB + MQ: Mistake-bound setting with queries.  Pay $1 per mistake and
    pay $1 per query.   Also called the "equivalence + membership
    query model".

Exact learning with MQ only: need to recover target exactly using only
    queries.  Pretty tough.  E.g., can't even do conjunctions (why?).
    But you *can* do parity (how?)


Cases where MQs make sense: trying to "break" encryption on a
smart-card [mention Angluin's DFA algorithm], active exploration of
environment, converting a complicated hypothesis (e.g., a neural net)
into a more understandable form.

Cases where MQs don't make so much sense: classifying documents/images
(doesn't make sense to "walk" a document about topic X towards a
document about topic Y and ask at what point the topic changes)

Interesting area for project:  define and prove something about other
"reasonable" query types you might be able to ask in different settings. 

Plan for next few classes:
- Some simple algorithms
- MQ algs based on Fourier.  (for learning wrt uniform)
- Bshouty's alg: learn Decision trees in PAC+MQ (or MB+MQ) model.
- Angluin's algorithm for learning Finite Automata.  Extensions of 
  Rivest&Schapire. 


Some Simple Algorithms
======================

Angluin's algorithm for learning Monotone DNF in MB+MQ model
------------------------------------------------------------
 * Start with h(x) = "all-negative".
 * If predict negative on a positive example x:
   - "walk" x down to a minimal positive example (go left to right along
      x, flipping 1's to 0's so long as they keep the example positive).
   -  The conjunction of the variables in x must be a term of c.  So put
      it into h.
 * Since we maintain that all terms of h are also in c, we don't make
   any mistakes on negatives.

Total number of mistakes is at most # of terms in target.  Number of
queries at most n per term.


Learning functions with r = O(log n) relevant variables
--------------------------------------------------------
- randomly sample until you get a positive and a negative example, or
  halt if have only seen one type after 2^r log(1/delta) examples.
  [Can instead use "(n,r)-universal sets"]

- Walk the positive and negative toward each other to identify a
  relevant variable x.

- Build decision tree with x at root.  Recursively learn each subtree.

- alg gives exact learning with MQ only.

========================================================================

Fourier-based MQ algorithms
===========================

We know that any f over {0,1}^n can be written as a sum of parity functions:

	f(x) = \sum_i \hat{f}_i \phi_i(x)

where \phi_i is the ith parity function, and \hat{f}_i is the
correlation of f with \phi_i under the uniform distribution.

Claim: using MQs, we can find all i such that (\hat{f}_i)^2 > tau in
time poly(n, 1/tau). 

1. Why is this interesting?
2. How can we do it?

Why is this interesting?
------------------------
Can prove that any DNF formula has at least one non-negligible fourier
coefficient in parity basis.  So, this lets us WEAK-LEARN DNF over the
uniform distribution.  Also, any DECISION TREE has 1-epsilon of its
"length" (sum of squares of coefficients) in a (poly(n/1/epsilon))
number of coefficients.  So, this lets us strongly-learn decision
trees, with respect to the uniform distribution.

QUESTION: why is it enough to get "most" of the fourier coefficients?
E.g., if we get all but epsilon of the (\hat{f}_i)^2, then the
resulting g has (f-g)^2 = epsilon.  The claim is that even if g
doesn't correspond to a boolean function, we can use sign(g) as a good
approx to f.  The reason is that by definition,
	    E[(f(x)-g(x))^2] = epsilon.
Notice that every x where sign(g) is wrong contributes at least 1.
So, the error of sign(g) is at most epsilon.

Then Jackson showed how to use boosting to get strong learning of DNF
over unif.  [not trivial --- remember boosting changes the distribution]

Won't prove claims about DNF, DTs today.  Instead, let's go to alg for
finding large fourier coeffs.

======================================================================

Kushilevitz-Mansour (Goldreich-Levin) algorithm: 
--------------------------------------------------
Let's index parity functions by their indicator vectors.

Our goal is to find all \hat{f}_v such that (\hat{f}_v)^2 > \tau.

Say S is the set of all indicator vectors.  So the sum, over v in S,
of the (\hat{f}_v)^2 equals 1.  Suppose we could split S into two
equal-sized sets S_0 and S_1 and for each one, ask for the sum of
squares of the \hat{f}_v in that set.  And, suppose we could do this
recursively.  Then what?

Then we could build up a tree, splitting each node whose answer is >
tau, and ignoring those < tau.
 - max width of tree is 1/tau.  
 - Just keep going to leaves at depth n.  
 => Only n/tau questions to ask.

So, all boils down to: how can we ask for the sum of squares of the
\hat{f}_v's inside these sets?  First of all, to be able to do this,
we're going to need to split these sets in a particular way.
Specifically, we'll do this by fixing a prefix of the variables, and
then the set will be "all parity functions whose indicator vectors
agree with this prefix".  For instance,
	S_0 = {v: v begins with 0}
	S_1 = {v: v begins with 1}
Then, will split S_0 into
	S_00 = {v: v begins with 00}
	S_01 = {v: v begins with 01}
etc.

In general, our goal is now to estimate, for some string alpha of 0s
and 1s, the sum of the squares of the \hat{f}_v's for all v that begin
with alpha.

(Note: we won't find the sum exactly.  We'll find an estimate of the
sum.  But that's fine.  It just means we throw out nodes whose
estimate is < tau/2 instead of those whose estimate is < tau)

There's now a neat trick for doing this.  Let's do case of alpha = 00000
(a string of k zeros).  So, we want the sum of squares over all
parities that don't include any of first k variables.  Remember, f is
a {-1, +1}-valued function.

The claim is that this sum just happens to equal the expected value,
over x in {0,1}^{n-k}, y in {0,1}^k, z in {0,1}^k, of f(yx)*f(zx).

I.e. pick random suffix x.  Pick two random prefixes y and z and
multiply f(yx) by f(zx).  Repeat many times and take the
average. Intuition: independent y and z will cancel out correlations
we don't want.  E.g., what if f was a parity function?  If f agrees
with alpha, then f(yx)=f(zx) so E[f(yx)f(zx)] = 1.  BUT, if f DOESN'T
agree with alpha, then Pr[f(yx)=f(zx)] = 1/2, so E[f(yx)f(zx)] = 0.

Now, since any f can be written as a sum of parities, hopefully we can
reduce the case of general f to the case of "f is a parity function"...
It's actually almost but not quite so easy.  Question we need to
answer is: for two (possibly different) vectors v,w what is

	E [ phi_v(yx) * phi_w(zx) ]  ?
      x,y,z

Claim:  This is 0 if v != w, and is also 0 if v,w aren't all 0s on the
	first k bits.  If v=w and they are all 0 on 1st k bits, then
	the expectation is 1.

This is enough.  Now we get:

	E[f(yx)f(zx)]
	 = E[ (sum_v \hat{f}_v \phi_v(yx)) (sum_w \hat{f}_w \phi_w(zx))]
	 = sum of squares of coeffs starting with k zeroes.

So, we're done for the case of alpha = "all zeros".  What about other
alpha (i.e., requiring parity to have a certain fixed dependence on
the first k bits)?  Since we know the dependence (alpha) just "undo
it".  Specifically, we instead ask for:

	E[f(yx)f(zx)\phi_{\alpha}(y)\phi_{\alpha}(z)]

where phi_alpha is the parity of the variables set to 1 in alpha. 
(then we get same as in above Claim, except the non-zero case is when
v=w and they are equal to alpha on the first k bits).

That's it.

Actually, one technical point: we need to estimate an expectation.
We'll do it by sampling.  OK since the quantity inside the expectation
is bounded (-1 or +1). In Jeff Jackson's result about DNF, what he
shows is that instead of modifying the distribution, you can
modify f to a non-boolean function.  Then, need to argue that (based
on how the boosting results work) f doesn't get too large, so you can
still estimate f(yx)f(zx) from sampling.
============================================================================