15-859(B) Machine Learning Theory                                 04/21/08
 * Active learning
 * Retrospective
 * Connections to other areas
===========================================================================
Admin:
 - no class on Wed (I need to catch a flight)
 - sign up for project presentations
 - UCA course assessments
===========================================================================

Active learning
===============
A while back we talked about learning with Membership Queries, where the
algorithm could ask for the label f(x) of examples x of its own choosing.
This makes sense for problems like reverse-engineering some device, but not
so much for problems like image classification (e.g., recognizing
hand-written digits or faces), since a made-up image might not look like
anything.

A more practical model for such settings is what's called Active Learning.
In Active Learning, you are given a large pool U of unlabeled data, e.g.,
unlabeled images or digits.  Your goal can be either to do well on U or to
do well on new examples from the same distribution as U.  The algorithm is
then allowed to query for f(x) for examples x of its choosing IN U.  So it
can actively SELECT which examples to have labeled, but not construct them
out of whole cloth.

Where this fits: in MQ learning, we were mostly concerned with
*computational* advantages.  Here, our concern will instead be *sample
complexity*.  I.e., can we get away with substantially fewer labeled
examples if we have some say in which examples get labeled?  (Unlike MQ
learning, Active Learning doesn't help with running time, since a PAC
algorithm given labels for everything in U could just ignore the labels the
active learner wouldn't have asked for.)

The key point is that active learning can in some cases give us an
exponential improvement over passive learning.

Intervals
=========
Consider the class C = {[0,a] : a >= 0}.  Examples are points in R.
Normally, to get error epsilon we need O((1/epsilon)*log(1/delta))
examples.  How could we use active learning to get only a logarithmic
dependence on 1/epsilon?  [Ans: do binary search on U until the error rate
on U is < epsilon/2.]

On the other hand, what about the class C = { [a,b] }, where we allow both
endpoints to float?  Can anyone see a problem there?  The problem is if the
target is an interval of probability mass 2*epsilon (for concreteness,
imagine partitioning the distribution into 1/(2*epsilon) disjoint intervals
of probability mass 2*epsilon each and choosing one of them at random to be
the target).  Then any algorithm needs Omega(1/epsilon) queries in
expectation just to see a single positive example!  So, you can't beat
linear in 1/epsilon.

On the other hand (where's that 3rd hand when you need it?), a recent paper
of Balcan-Hanneke-Wortman (COLT'08) shows that if you switch the order of
quantification, then the difficulty in this (and more general) cases goes
away.  In particular, note that in the above bad example, the adversary got
to pick the target as a function of epsilon.  What if we instead say the
adversary has to pick the target first, and we then look at our labeled
sample complexity as epsilon->0?  E.g., consider the following algorithm:
pick random points in U and ask for them to be labeled.  If no positives
have been seen so far, maintain the "all negative" hypothesis.  However,
once you see a positive example, do 2-sided binary search to estimate the
interval boundaries (see the sketch below).  Now, if the target has zero
probability mass on positive examples, then this algorithm is correct from
the beginning.  If the target has probability mass P on positives, then the
number of label queries needed until the error rate is < epsilon is
O((1/P) + log(1/epsilon)).
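To make this concrete, here is a minimal Python sketch of the strategy just
described for C = {[a,b]}: random label queries until a positive is seen
(else output "all negative"), then a binary search for each endpoint over
the sorted pool.  The names pool, query_label, and budget are illustrative
(not from the paper), and the budget is only enforced in the random phase
to keep the sketch short.

    import random

    def active_learn_interval(pool, query_label, budget):
        """Sketch of the two-sided strategy above for C = {[a,b]}.
        pool: the unlabeled sample U (a list of reals).
        query_label(x): returns the true label f(x) as True/False.
        budget: cap on the number of random label queries in phase 1."""
        U = sorted(pool)

        # Phase 1: query random points; if no positive is ever seen,
        # output the "all negative" hypothesis, which is correct whenever
        # the target puts zero probability mass on positives.
        pos = None
        for _ in range(budget):
            i = random.randrange(len(U))
            if query_label(U[i]):
                pos = i
                break
        if pos is None:
            return lambda x: False

        # Phase 2: two binary searches (about 2*log2|U| more label queries)
        # to locate the leftmost and rightmost positive points in U.
        lo, hi = 0, pos
        while lo < hi:                      # leftmost positive index
            mid = (lo + hi) // 2
            if query_label(U[mid]):
                hi = mid
            else:
                lo = mid + 1
        left = lo

        lo, hi = pos, len(U) - 1
        while lo < hi:                      # rightmost positive index
            mid = (lo + hi + 1) // 2
            if query_label(U[mid]):
                lo = mid
            else:
                hi = mid - 1
        right = lo

        return lambda x: U[left] <= x <= U[right]

With |U| on the order of 1/epsilon unlabeled points, the binary-search
phase costs about 2*log(1/epsilon) label queries, which is where the
log(1/epsilon) term in the bound above comes from.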
Equivalently, we can think of a model where the algorithm is allowed to ask
q questions.  The above algorithm initially asks queries at random until it
sees a positive and then does binary search; if no positive is seen in the
q questions, it just outputs "all negative".  If you look at the error rate
as a function of q, then for the "all negative" target it is always 0.  For
a target of positive probability mass P, it starts at P, and once q >> 1/P
it starts dropping exponentially to 0.

An interesting thing they point out is that in the first case, the
algorithm doesn't *know* whether its error rate is low or not.  (We saw a
related thing when discussing Semi-Supervised Learning, where, e.g., under
the "independence given the label" assumption, we could do co-training from
just 1 labeled example.  But this is too few labels to *verify* that our
error is really low - we were really relying on our belief in our
assumptions.)  In the usual PAC bounds, you have enough data to test out
your hypotheses, so this issue doesn't come up.

Generic algorithms
==================
One conservative and fairly generic active learning algorithm, given some
class C, is to only query examples on which the hypotheses in C that are
consistent with the labels seen so far still disagree
(Cohen-Atlas-Ladner'92).  E.g., consider learning a linear separator in
R^2.  It is hard to analyze generically when this gives a big improvement.
In fact, most work analyzing active learning has focused on the special
case of linear separators under the uniform distribution, where you can get
a good handle on what is going on.

Learning linear separators
==========================
Let's look at the problem of learning a linear separator through the
origin, under the uniform distribution over the ball in R^d.  So, the error
rate of a hypothesis vector w is equal to the angle between w and the
target w^*, divided by pi.

Here's an interesting algorithm by Balcan-Broder-Zhang.  It's more
aggressive than the above algorithm in that it focuses on a tighter region
of uncertainty to sample from.

Step 1: query random examples to get a hypothesis w_1 of error 1/8.

Now, our goal is going to be to use not too many labeled examples (roughly
O(d log(1/epsilon))) in order to cut the error rate by a factor of 2.
Given hypothesis w_k, we will query random points satisfying
|w_k . x| <= b_k, and then find a separator w_{k+1} consistent with these,
subject to having distance at most 1/2^k from w_k (angle at most pi/2^k).
Use b_k approx k/(2^k * sqrt(d)).  [draw picture]

Analysis: Let alpha = error of w_k.  We want to show that the error of
w_{k+1} is <= alpha/2.

  err(w_{k+1}) =   Pr(w_{k+1} makes a mistake on x, |w_k . x| >= b_k)
                 + Pr(w_{k+1} makes a mistake on x, |w_k . x| <= b_k)

Let's look at the first part first.  We want to show it's at most alpha/4.
It's easier to look at w_k first: b_k is chosen so that at most alpha/8
probability mass of w_k's mistakes lies outside that margin (a geometric
fact, using our assumption that the target has distance at most alpha from
w_k).  By the same reasoning, at most alpha/8 probability mass of w_k's
"mistakes with respect to w_{k+1}" lies outside that margin.  All mistakes
of w_{k+1} have to be either disagreements between w_k and the target, or
disagreements between w_{k+1} and w_k, so overall this is at most alpha/4.

Second part: the key trick here is that the overall probability mass of the
region |w_k . x| <= b_k is relatively low: like O(alpha * log(1/alpha)).
So, it's enough to have an error *rate*, conditioned on belonging to this
set, that's O(1/log(1/alpha)).  But this we can handle!!
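Here is a hedged sketch of what one round of this margin-based scheme might
look like in code, in the spirit of Balcan-Broder-Zhang but not their exact
procedure: the fit_direction helper is only a stand-in for the "find a
separator consistent with the new labels, staying close to w_k" step, and
the knobs rounds and m are illustrative, not the constants from the
analysis.

    import numpy as np

    def fit_direction(X, y, anchor=None):
        """Stand-in for the constrained consistency step: average y_i * x_i
        (optionally blended with the previous hypothesis) and normalize.
        The real algorithm solves a constrained consistent-hypothesis
        problem; this proxy only illustrates the data flow."""
        w = (y[:, None] * X).mean(axis=0)
        if anchor is not None:
            w = w + anchor          # crude way of staying near the old hypothesis
        return w / np.linalg.norm(w)

    def margin_based_learner(pool, query_label, rounds=10, m=200):
        """Illustrative margin-based active learning loop.
        pool: (n, d) array of unit vectors drawn from the uniform distribution.
        query_label(x): returns +1/-1, i.e., sign(w* . x) for the target w*."""
        d = pool.shape[1]
        rng = np.random.default_rng(0)

        # Step 1: label a few random points to get an initial hypothesis w_1.
        idx = rng.choice(len(pool), size=m, replace=False)
        y = np.array([query_label(x) for x in pool[idx]])
        w = fit_direction(pool[idx], y)

        for k in range(1, rounds):
            b_k = k / (2**k * np.sqrt(d))        # margin width b_k ~ k/(2^k sqrt(d))
            near = pool[np.abs(pool @ w) <= b_k] # region of uncertainty around w_k
            if len(near) == 0:
                break
            idx = rng.choice(len(near), size=min(m, len(near)), replace=False)
            y = np.array([query_label(x) for x in near[idx]])
            w = fit_direction(near[idx], y, anchor=w)  # aim: halve the error
        return w

The design point to notice is that only points near the current decision
boundary ever get labeled; the points far from the boundary are exactly the
ones the analysis above shows contribute at most alpha/4 to the error of
w_{k+1}.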
Retrospective
=============
Main topic of the course: what can we say theoretically about what types of
tasks we can learn, from what types of data, using what types of
algorithms?  What types of guarantees can we get?

Types of data: batch iid (PAC), online adversarial (Mistake Bound),
algorithm-selected (Membership Query and Active Learning).

Types of guarantees: Occam bounds say that if computation time is no
object, we can learn any function with sample size proportional to the
number of bits needed to write it down.  Same with the number of mistakes.
If we do worry about computation time, then things get harder.  We analyzed
what's doable with Statistical Query algorithms.  Membership query
algorithms allow for poly-time algorithms in cases where we can't (or don't
know how to) manage without them (e.g., DFAs, monotone DNF, decision trees,
XOR of terms).

As a practical matter, samples are more expensive than time: it's not
enough to just say "sufficient to have # examples proportional to # bits".
This motivates a more careful look at sample complexity, e.g., in terms of
description length, VC-dimension, margins, and Rademacher averages.  We
tried to get a more detailed understanding of how much data we really need
in order to generalize well - to understand the inherent relationship
between the kinds of things we are trying to learn and how much data we
need to see to learn them.

We also saw generic conversion techniques, like the Weighted-Majority
algorithm and the Kalai-Vempala algorithm for getting good regret bounds,
and Boosting for converting weak learning into strong learning.

We also looked at things from the perspective of: given a desired
guarantee, what types of assumptions do we need to make to get it?  (E.g.,
maybe assume the data comes from the uniform distribution.)  Try to
understand what it is about the data that might make the learning problem
easier (like large margins).

Connections to other areas
==========================
 -> statistics
 -> cryptography
 -> game theory
 -> auction design
 -> complexity theory
 -> application areas (lots)