15-859(B) Machine Learning Theory                              04/21/14

(A few loose ends)
 * Compression bounds
 * Bshouty's algorithm for learning decision trees in PAC+MQ.
===============================================================

Compression Bounds
==================

For some learning algorithms, the hypothesis produced can be uniquely
described by a small subset of the training examples, say k of them.
E.g., if you are learning an interval on the line using the simple
algorithm ``take the smallest interval that encloses all the positive
examples,'' then the hypothesis can be reconstructed from just the two
outermost positive examples, so k = 2.  For a conservative Mistake-Bound
learning algorithm, you can reconstruct the hypothesis by just looking at
the examples on which a mistake was made, so k <= M, where M is the
algorithm's mistake bound.  Note that in this case, we would also care
about the {\em order} in which those examples arrived.

Let's say that a learning algorithm compresses a sample S of size m down
to a hypothesis of size k if it can produce a set (or sequence) of k
examples in S such that its hypothesis can be reconstructed from that
sequence.  The question we want to ask is: can we get a sample-complexity
bound (or error bound) as a function of k?

Let's be a little more formal, and then see how we can prove a PAC
guarantee based on k.  First, before seeing the data, the algorithm
specifies a description language: a mapping from sequences of examples to
hypotheses.  So, for a given sequence S' of examples we have a
well-defined hypothesis h_{S'}.

Claim: If |S| >= (1/epsilon)[k ln(|S|) + epsilon*k + ln(1/delta)], then
with probability at least 1 - delta, every h_{S'} with zero error on S
has true error <= epsilon (for S' a sequence of k examples in S).
[[Note: we have a ln(|S|) on the RHS, but it can be replaced with
O(ln(k/(epsilon*delta))).]]

Proof: Fix a sequence s' of k indices from 1..m.  For a given list of m
examples S, define S|s' to be the examples of S in the positions given by
the indices in s'.  Now imagine choosing S by drawing the examples in the
positions s' first.  At this point, h_{S|s'} is fully defined.  If the
true error of h_{S|s'} is greater than epsilon, then the probability that
it makes zero mistakes on the remaining m - k examples is at most
(1 - epsilon)^{m-k}.  Formally, let A be the event ``some h_{S'} with
zero error on S has true error > epsilon'' (whose probability we wish to
bound), and let A_{s'} be the event ``h_{S|s'} has error 0 on
S - {x_i : i in s'} but true error > epsilon'', so that
A = Union_{s'} A_{s'}.  Then Pr(A_{s'}) <= (1 - epsilon)^{m-k}, and
therefore Pr(A) <= m^k * (1 - epsilon)^{m-k}, which is at most delta for
m satisfying the above inequality.

Note that it's important that we are doing a union bound over the
sequences of indices s' and not over the set of possible hypotheses
h_{S'}.  In particular, there are potentially infinitely many possible
hypotheses h_{S'}, so you don't want to union-bound over all possible
k-example sequences S' drawn from D.
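As a quick numerical illustration (not part of the original notes), here
is a small Python sketch that finds the smallest m with
m^k (1 - epsilon)^{m-k} <= delta by direct search and compares it to the
claim's closed-form sufficient condition.  The function names and the
particular values of k, epsilon, delta are just for illustration.

    import math

    def union_bound(m, k, eps):
        # The quantity bounded in the proof: m^k * (1 - epsilon)^(m - k).
        return m**k * (1.0 - eps)**(m - k)

    def smallest_m(k, eps, delta):
        # Smallest m making the union bound drop below delta.  The bound is
        # unimodal in m (increases, then decays geometrically), so a linear
        # scan finds the threshold.
        m = k + 1
        while union_bound(m, k, eps) > delta:
            m += 1
        return m

    k, eps, delta = 2, 0.1, 0.05      # e.g., the interval example has k = 2
    m = smallest_m(k, eps, delta)
    # The claim's (slightly more conservative) sufficient condition, evaluated at m:
    rhs = (1.0 / eps) * (k * math.log(m) + eps * k + math.log(1.0 / delta))
    print("direct threshold:", m, "  claim's formula at this m:", round(rhs, 1))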
================================================
Bshouty's algorithm
===================

And now for something completely different...  When we were talking about
membership-query algorithms, we discussed the KM algorithm for finding
large Fourier coefficients and argued how this allowed us to learn
decision trees over the uniform distribution.  A natural question then
is: can we use MQs to learn decision trees with respect to arbitrary
distributions?  We've seen that without MQs we can learn decision trees
in n^(log s) time by converting to decision lists, but we want polynomial
time.

To put this into context: it turns out you can show that a bad
distribution can make MQs worthless for learning DNF, but this argument
doesn't go through for decision trees.  The problem was solved by
Bshouty, who gave an algorithm for learning DTs in the PAC+MQ model.  In
fact, the algorithm works for a larger class: "XOR of terms".  We'll
assume the target can be written as f = T_1 XOR T_2 XOR ... XOR T_m,
where the T_i are conjunctions.  Why is this a generalization of DTs?
(Because the leaves of a decision tree correspond to disjoint
conjunctions, so the OR of the terms for the positive leaves is the same
as their XOR.)

We're going to solve this by building a decision-tree-like thing (a
"parity decision graph") of depth n and width O(m).  So, by Occam
results, we just need to take some initial sample S of size
O(nm/epsilon), and then use our membership queries to find a consistent
hypothesis of size O(nm).

Idea: let's just build a decision tree with x_1 at the root, then x_2 at
the next level, etc., until we get more than m leaves.  This will happen
at some depth d ~ log(m).  Define T_1', ..., T_m' to be what you get by
taking the original terms T_1, ..., T_m and removing all mention of the
variables x_1, ..., x_d from them.  So, if we look at any leaf of our
tree so far, the value of the target function, given that the example
takes you to that leaf, can be written as an XOR of some subset of the T'
terms.  For example, say

    f = x_1 x_3 x_5  XOR  not(x_1) x_4 x_6  XOR  x_2 x_3 not(x_4)

and say d = 2.  Then the leaves (ordered by x_1 x_2 = 11, 10, 01, 00)
look like: (T1' XOR T3'), T1', (T2' XOR T3'), T2'.

Now, the claim is that since there are more than m leaves, but each leaf
is (as a function) in the span of T_1', ..., T_m' over GF(2), some of
these leaves must be XORs of others.  In fact, there must exist some set
L of at most m leaves such that any other leaf is an XOR of some subset
of the leaves in L.  (L is a maximal linearly independent set mod 2.)
E.g., above, any leaf is the XOR of the other three leaves.

So the plan is now:
1. (somehow) find such a set L, and write down for each other leaf how it
   can be written as an XOR of leaves in L.  (Oops: the description
   length of a level could be as high as m^2, so let's have S be of size
   nm^2/epsilon.)
2. Keep building the tree down from the leaves in L until we have more
   than m leaves again, and repeat.
3. (somehow) understand what this means in terms of a final hypothesis.

Let's do 1: For each example, vary its first d bits (using MQs) so that
the example reaches each of the active leaves in turn.  ==> We get a
vector with one entry per active leaf, i.e., what the output would have
been if the first d bits were set to take you to leaf i.  If we write
these vectors out for all the examples, we get a matrix with one row per
example, and the set L corresponds to a set of at most m columns such
that every other column is a linear combination of these (mod 2).  Find
these with Gaussian elimination (see the sketch below).

A subtle issue here: say this tells us that leaf 1 is the XOR of leaves
2, 3, 5.  We don't really know whether this is true over ALL of {0,1}^n;
we just know it's true for strings whose suffixes match examples in S and
whose prefixes match the paths down to the current nodes.  But that's OK,
since our goal is just to find a hypothesis consistent with S, and we
will maintain inductively (see step 3 below) that in following the
decision graph to classify an example in S we are always looking at
strings with that property.
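To make step 1 concrete, here is a minimal Python sketch (ours, not
Bshouty's code) of the mod-2 Gaussian elimination: given the 0/1 matrix
of MQ answers, one row per example and one column per active leaf, it
returns a maximal linearly independent set of columns L and, for every
other column, the subset of L whose XOR agrees with it on all rows.  The
function name and matrix encoding are assumptions for illustration.

    import numpy as np

    def find_basis_leaves(M):
        # M: 0/1 matrix with one row per example in S and one column per
        # active leaf; entry (i, j) is the MQ answer when example i's first
        # d bits are rewritten so that it reaches leaf j.
        # Returns (L, expr): L is a list of column indices forming a maximal
        # linearly independent set mod 2, and expr[j] (for each j not in L)
        # is the subset of L whose XOR equals column j on every row of M.
        M = np.asarray(M, dtype=np.uint8) % 2
        pivots = []            # (pivot row, reduced column, subset of L it equals)
        L, expr = [], {}
        for j in range(M.shape[1]):
            residual = M[:, j].copy()
            coeffs = set()     # subset of L accumulated during reduction
            for r, vec, c in pivots:
                if residual[r]:
                    residual ^= vec
                    coeffs ^= c
            nz = np.flatnonzero(residual)
            if nz.size == 0:   # column j is dependent: record its expression
                expr[j] = coeffs
            else:              # column j is new: add it to the basis L
                L.append(j)
                pivots.append((int(nz[0]), residual, coeffs ^ {j}))
        return L, expr

    # Toy matrix in the spirit of the example above: the last column is the
    # XOR of the other three, so L should be [0, 1, 2] and expr[3] = {0, 1, 2}.
    M = [[1, 1, 0, 0],
         [0, 1, 0, 1],
         [0, 0, 1, 1]]
    print(find_basis_leaves(M))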
To do 3: Follow the example down the tree.  When you get to an "inactive"
node, replace that position by the *set* of positions listed at the node,
and conceptually assign the prefixes accordingly.  So our "current
location" in the tree is now a set of nodes.  More generally, if our
current set of nodes is V and some node y in V is inactive, then move to
the set (V - {y}) XOR (the positions listed at y); this is a symmetric
difference, since any node appearing twice contributes its value twice
and so cancels mod 2.

Finally, as we go down the tree, the T' terms become constant 1 or 0, and
we just take the XOR of these constants over our current set of nodes.
Pretty clever.
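To tie this back to the running example, here is a short Python sketch
(ours, not from the notes) that restricts each term of
f = x_1 x_3 x_5 XOR not(x_1) x_4 x_6 XOR x_2 x_3 not(x_4) by each depth-2
prefix and checks by brute force over x_3..x_6 that every leaf computes
exactly the stated XOR of T' terms.  The term encoding is an assumption
for illustration.

    from itertools import product

    # The running example; each term is stored as
    # (set of positive variables, set of negated variables).
    TERMS = [({1, 3, 5}, set()),     # T1 = x_1 x_3 x_5
             ({4, 6}, {1}),          # T2 = not(x_1) x_4 x_6
             ({2, 3}, {4})]          # T3 = x_2 x_3 not(x_4)

    def eval_term(pos, neg, x):
        # Evaluate a conjunction on the assignment x (dict: variable -> 0/1).
        return int(all(x[v] for v in pos) and not any(x[v] for v in neg))

    def f(x):
        # The target: XOR of the terms.
        val = 0
        for pos, neg in TERMS:
            val ^= eval_term(pos, neg, x)
        return val

    def restrict(pos, neg, prefix):
        # Remove x_1..x_d from the term given their values in the prefix;
        # return None if the prefix already falsifies the term (it then
        # contributes 0 at this leaf).
        for v, b in prefix.items():
            if (v in pos and b == 0) or (v in neg and b == 1):
                return None
        return (pos - set(prefix), neg - set(prefix))

    # For each depth-2 leaf (a setting of x_1, x_2), check over all settings
    # of x_3..x_6 that f equals the XOR of the surviving restricted terms.
    for x1, x2 in product([1, 0], repeat=2):
        prefix = {1: x1, 2: x2}
        survivors = [restrict(pos, neg, prefix) for pos, neg in TERMS]
        for rest in product([0, 1], repeat=4):
            x = {**prefix, **dict(zip([3, 4, 5, 6], rest))}
            leaf = 0
            for t in survivors:
                if t is not None:
                    leaf ^= eval_term(t[0], t[1], x)
            assert leaf == f(x)
        names = ["T%d'" % (i + 1) for i, t in enumerate(survivors) if t is not None]
        print("leaf x_1=%d x_2=%d:" % (x1, x2), " XOR ".join(names))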