15-859(B) Machine Learning Theory                                   02/27/12
Guest Lecturer: Liu Yang

* Intro to Active Learning
* Disagreement-based Active Learning
* Active Learning with Noisy Labels: Algorithm
* Surrogate Losses; Drifting Distribution
* Bayesian Active Learning

========================================================================
Intro to Active Learning
========================================================================

- Example: Threshold (1 dimension)

  Question 1: How many random labeled pts are sufficient (w.p. 1-delta)
  to find the threshold Z, to within ±eps?

  ----------------------------------------------------------------
  Analysis
  ----------------------------------------------------------------
  Idea: If we get a - and a + both within eps of Z, the midpoint between
  them is within eps of Z.

  Say we have m random pts X1, X2, ..., Xm.
    P(Z <= Xi <= Z+eps) = eps
    P(exists i <= m: Xi in [Z, Z+eps]) = 1 - P(forall i <= m: Xi notin [Z, Z+eps])
                                       = 1 - (1-eps)^m >= 1 - e^{-eps*m}
  Similarly for Xi in [Z-eps, Z), so w.p. 1 - 2e^{-eps*m} there exist a +
  and a - within eps of Z. To make this >= 1-delta, it suffices to take
  any m >= (1/eps) ln(2/delta).

  Question 2: How many *active* labeled pts are sufficient (w.p. 1-delta)
  to find the threshold, to within ±eps? How can we use active learning
  to reduce the number of labels?

  Take m random unlabeled data points.
  Repeatedly request the label of the median point between the closest
  known + and -.
  Take the mid-point threshold between the adjacent +/- points.
  The classifier is the same one we would get from m labeled data
  points, but we requested only log(m) labels!

  Label complexity analysis: m = (1/eps) ln(2/delta) random points are
  sufficient, so log((1/eps) ln(2/delta)) label requests are sufficient.

- Formal Definitions
  A hypothesis class C of classifiers h: X -> {±1}.
  err(h) = P(h(X) neq h*(X)) for X ~ D.
  i.i.d. data points X1, X2, ... ~ D.
  The labels h*(Xi) are hidden from the learning alg until individually
  requested. The alg chooses some Xi, requests its label, and receives
  Yi = h*(Xi); repeat up to n times.
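The binary-search scheme from the threshold example above can be sketched as a small simulation (a minimal sketch, not from the lecture: the hidden threshold Z = 0.37, the pool size, and the function names are illustrative):

```python
import random

def active_threshold(points, label):
    """Learn a 1-d threshold by binary search over the sorted unlabeled
    pool, requesting only O(log m) labels instead of m."""
    xs = sorted(points)
    queries = 2
    if label(xs[0]) == +1:        # whole pool is positive
        return xs[0], queries
    if label(xs[-1]) == -1:       # whole pool is negative
        return xs[-1], queries
    lo, hi = 0, len(xs) - 1       # invariant: xs[lo] is -, xs[hi] is +
    while hi - lo > 1:
        queries += 1
        mid = (lo + hi) // 2
        if label(xs[mid]) == +1:
            hi = mid
        else:
            lo = mid
    return (xs[lo] + xs[hi]) / 2, queries  # midpoint of adjacent -/+ pair

random.seed(0)
Z = 0.37                                   # hidden target threshold
label = lambda x: +1 if x >= Z else -1     # simulated labeling oracle
pts = [random.random() for _ in range(1000)]
h, q = active_threshold(pts, label)
# h lands within eps of Z w.h.p., after only ~log2(1000) label requests
```

The returned classifier is the same one passive learning would produce from all 1000 labels, matching the log(m) label-complexity claim above.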
  The alg tries to choose h to minimize the error err(h).

  Def: Alg A(n) achieves label complexity Lambda(eps, delta, h*, D) if it
  outputs a classifier h_n after at most n label requests, and for all
  eps, delta in (0, 1) and all n >= Lambda(eps, delta, h*, D),
  P(err(h_n) <= eps) >= 1 - delta.

========================================================================
Disagreement-based Active Learning
========================================================================

- A simple idea from Cohn, Atlas & Ladner (1994).

  CAL:
    V = C
    for i = 1, 2, ...
      if exists h, g in V s.t. h(Xi) neq g(Xi)
        request Yi, V <- {h in V : h(Xi) = Yi}
      if requested n labels already, return any h in V

  Examples: thresholds, intervals

  Historically, this type of algorithm became popular due to the later
  A^2 (Agnostic Active) algorithm of Balcan-Beygelzimer-Langford,
  designed for the agnostic case. The A^2 algorithm guarantees
  correctness (error always at most OPT + eps), and the BBL paper proved
  that it never does worse than passive learning and that in a number of
  interesting cases it does dramatically better. Subsequently, Hanneke
  gave a general analysis of A^2, introducing the notion of the
  disagreement coefficient to characterize its performance. The
  disagreement coefficient is a property of the concept class (like the
  VC dimension) that determines the number of label requests used.
  Interestingly, it can be used to analyze the CAL algorithm in the
  realizable case as well.

  Note that CAL-type algorithms are particularly "mellow", as opposed to
  more aggressive algorithms known for the realizable case and for
  certain types of noise. E.g., better results are known for linear
  separators in the realizable case. It is unknown whether one can
  achieve better bounds by being more aggressive in the agnostic case.

  OK, now let's go on to analyze CAL in the realizable case.

- Def: Disagreement Coefficient
  DIS(H) = {x : exists h, g in H s.t. h(x) neq g(x)}.
  B(f, r) = {h in C : P(h(x) neq f(x)) <= r}
  theta(eps) = sup_{r > eps} P(DIS(B(h*, r))) / r.

- Label Complexity of CAL:
    theta(eps) (d log(theta(eps)) + log(log(1/eps)/delta)) log(1/eps)

  Pf idea: Say V' = V at a given time i' where sup_{g in V'} err(g) > eps.
  Then after roughly 2 d theta(eps) more label requests, we have at
  least 2 d theta(eps) labeled points in DIS(V'), so now every h in V
  has (by the classic PAC bounds)
    err(h | DIS(V')) <= (roughly) d / (2 d theta(eps)) = 1 / (2 theta(eps)).
  So err(h) = err(h | DIS(V')) P(DIS(V'))
           <= P(DIS(V')) / (2 theta(eps))
           <= (theta(eps) * sup_{g in V'} err(g)) / (2 theta(eps))
            = sup_{g in V'} err(g) / 2.
  So we halve sup_{g in V} err(g) at least once every (roughly)
  2 d theta(eps) queries, and we need to halve it at most log(1/eps)
  times to get sup_{g in V} err(g) < eps.

- When is theta(eps) bounded?

========================================================================
Active Learning with Noisy Labels
========================================================================

  Now we deal with data from a general distribution D_XY on (X, Y), so
  err(h*) can be > 0 (where h* has minimal err(h) among h in C).
  Define the empirical error rate:
    err(h; L) = (1/|L|) sum_{(x,y) in L} I[h(x) neq y].

- Making CAL robust to noise:
  We can't just throw away a classifier for making a single mistake.
  Solution: only discard a classifier if it makes many more mistakes
  than some other classifier in V. "Many more" is based on results for
  the difference between empirical and true error rates (only throw
  away a classifier whose empirical error rate is larger than some
  upper bound on the empirical error rate of h*). This is guaranteed to
  keep h* in V.

  A robust algorithm based on this idea is BBL's A^2 algorithm.

  A^2 Algorithm(C, n)
  - V <- C, L <- {}
  - For m = 1, 2, ...
    - If exists h, g in V s.t.
        h(Xm) neq g(Xm)
      - Request Ym
      - L <- L \cup {(Xm, Ym)}
    - If m is a power of 2
      - V <- {h in V : err(h; L) - min_{h' in V} err(h'; L) <= T(V, L)}
      - L <- {}
    - If used n label requests, return any h in V

  T(V, L) is a confidence bound with the guarantee that
    err(h*; L) - min_{h' in V} err(h'; L) <= T(V, L).
  (Note L is conditionally i.i.d. given |L| and V, so we can borrow
  T(V, L) from the passive learning literature.)

- The A^2 algorithm never eliminates h*.
- How about its label complexity?

  Definition (Tsybakov noise):
    P(h(x) neq h*(x)) <= (err(h) - err(h*))^{1/kappa}.

  Passive learning has tilde(O)(d * eps^{1/kappa - 2}) sample complexity.

  Label complexity of Robust CAL (for an appropriate T(V, L), based on
  Rademacher complexity):
    theta(eps^{1/kappa}) eps^{2/kappa - 2} d log^2(d/(eps*delta)).

- There is a minimax lower bound of eps^{2/kappa - 2}, so bounded theta
  => nearly minimax optimal.

---------------------------------------------------------------------------
Surrogate Losses
---------------------------------------------------------------------------
- It is computationally hard to run Robust CAL (we need to minimize the
  number of mistakes in one step, which can be NP-hard).
- So we may want to use a convex relaxation of that optimization problem.
- This is a common trick in passive learning, and it can be analyzed and
  shown to still be effective under certain noise conditions.

----------------------------------------------------------------------------
Distribution Drift
----------------------------------------------------------------------------
- What if each Xi is sampled from a different unknown distribution
  (with h* held fixed)?
- If the distributions are from a totally bounded family, the algorithm
  still works.
- In particular, if theta(eps) is uniformly bounded (over the family of
  distributions) by an o(1/eps) function, then if we run a stream of
  examples through CAL, its expected number of queries is sublinear in
  the number of examples it has processed.
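The sublinear-query behavior of CAL on a drifting stream can be checked in a small simulation (a sketch under illustrative assumptions, not from the lecture: 1-d thresholds on [0,1], a fixed target at 0.5, and a sinusoidally sliding uniform window as the drifting distribution):

```python
import math
import random

def cal_query_count(n, target=0.5):
    """Run CAL for 1-d thresholds on a stream of n points whose
    distribution drifts (a width-1 uniform window sliding over [0,1],
    clipped to the unit interval) and count the label requests.
    The version space is the interval (lo, hi) of thresholds still
    consistent with the answers received so far."""
    lo, hi = 0.0, 1.0
    queries = 0
    for i in range(n):
        center = 0.5 + 0.3 * math.sin(2 * math.pi * i / n)   # drift
        x = min(1.0, max(0.0, center + random.uniform(-0.5, 0.5)))
        if lo < x < hi:            # x is in the disagreement region
            queries += 1           # so CAL requests its label
            if x >= target:        # true label is +
                hi = x
            else:                  # true label is -
                lo = x
    return queries

random.seed(1)
n = 10000
q = cal_query_count(n)
# q grows roughly like log(n), far below the n labels passive reads
```

Each query shrinks the version-space interval, so later stream points rarely fall in the disagreement region; the drift does not hurt here because every window in the family keeps mass around the target.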
========================================================================
Bayesian Active Learning
========================================================================

  If there is a prior over h*, we can get a better label complexity.

- Verifiable vs Unverifiable:
  A self-verifying algorithm takes eps as an argument and decides its
  own budget adaptively. It not only needs to find an eps-good
  classifier, but also must *know* that it has found a good one.
- Without a prior, if the VC dimension is finite, we can always get
  o(1/eps) label complexity if the alg doesn't have to be
  self-verifying.
- Without a prior, for C = interval classifiers and h* = the empty
  interval, the label complexity of self-verifying active learning is
  1/eps = passive.
- With a prior, we can *always* guarantee o(1/eps) label complexity,
  even for self-verifying algorithms.

  Lower bound:
- We can get a rough lower bound on the label complexity of Bayesian
  active learning via the entropy of a Voronoi partition of C induced
  by a maximal tilde{O}(eps)-packing (if the pseudo-metric
  P(h(x) neq g(x)) is "doubling" for h, g in C).

-----------------------------------------------------------------------------------
Here are more details on Bayesian active learning, which we won't have
time to cover in lecture:
-----------------------------------------------------------------------------------
Self-Verifiable Active Learning
-----------------------------------------------------------------------------------
- Self-verifying (a special type of stopping criterion)
  - Given eps, adaptively decides # of queries, then halts
  - Has the property that E[err] < eps when it halts

  Question: Can you do this with E[#queries] = o(1/eps)?
  (Passive learning needs 1/eps labels.)

  Example: Intervals
  Suppose D is uniform on [0, 1].

  Verification Lower Bound
  In the non-Bayesian setting, suppose h* is the empty interval.
  Given any classifier h, just to verify err(h) < eps, we need to
  verify that h* is not an interval of width 2*eps. We need an example
  in each of Omega(1/eps) regions to verify this fact.
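The verification lower bound is a pigeonhole argument, and it can be made concrete in code (an illustrative helper, not from the lecture: split [0,1] into 1/(2*eps) disjoint width-2*eps cells; any algorithm that queried fewer points than cells leaves some cell untouched, and a target interval could still hide there):

```python
def unqueried_cell(queried_points, eps):
    """Return a width-2*eps cell of [0,1] containing no queried point,
    or None if every cell was probed. While such a cell exists, a
    learner cannot verify err(empty interval) < eps: the target could
    still be an interval inside that cell, giving error ~2*eps."""
    cells = int(1 / (2 * eps))
    for k in range(cells):
        a, b = 2 * eps * k, 2 * eps * (k + 1)
        if not any(a < x < b for x in queried_points):
            return (a, b)
    return None

# With eps = 0.01 there are 50 cells, so 49 queries can never cover
# them all (pigeonhole); a hiding place for the target remains:
gap = unqueried_cell([k / 49.0 for k in range(49)], 0.01)
```

Only querying at least one point per cell, i.e. Omega(1/eps) labels, rules out every hiding place, matching the 1/eps verification cost claimed above.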
- Learning with a prior
  Suppose we know a distribution the target is sampled from; call it
  the *prior*.

- Interval Example with prior
  - Algorithm: Query random pts till we find the first +, then do
    binary search to find the end-points. Halt when we reach a
    prespecified prior-based query budget. Output the posterior's Bayes
    classifier.
  - Let the budget N be high enough so E[err] < eps.
    - N = o(1/eps) is sufficient for E[err | w* > 0] < eps: if the
      target interval's width w* > 0, even a prior-independent analysis
      needs only E[#queries | w*] = O(1/w* + log(1/eps)) = o(1/eps).
    - N = o(1/eps) is sufficient for E[err | w* = 0] < eps: if
      P(w* = 0) > 0, then after some L = O(log(1/eps)) queries,
      w.p. > 1-eps most of the probability mass is on the empty
      interval, so the posterior's Bayes classifier has 0 error rate.

- Can do o(1/eps) for any VC-class
  Theorem: With the prior, we can get o(1/eps) query complexity.
  - There are methods that find a good classifier in o(1/eps) queries
    (though they aren't self-verifying) [BHW08].
  - We need to set a stopping criterion for those algorithms.
  - The stopping criterion we use: a budget.
  - Set the budget to be just large enough so E[err] < eps.

-----------------------------------------------------------------------------------
Bayesian Active Learning using Arbitrary Binary Valued Queries
-----------------------------------------------------------------------------------
  To get a lower bound for Bayesian active learning, let's consider a
  much more powerful model: arbitrary binary valued queries.

- Problem: We want to learn some concept up to eps precision. If we can
  ask any yes/no question, how many questions do we need to ask?
  Bayesian version: we know a distribution for the concept and want eps
  expected precision; how many questions are enough in expectation?

- Abstract Model: Lossy Source Coding
  Z: some space
  rho: Z x Z -> [0, 1], some pseudo-metric (the *loss*)
  pi: a known distribution on Z
  target: X ~ pi
  A code is any pair (C, D) where C: Z -> {0,1}* is the encoder (C(z)
  is called the codeword) and D: {0,1}* -> Z is the decoder.
  Ci(z): the ith bit of C(z).
  A code (C, D) is prefix-free if for any z1, z2 in Z with
  C(z1) neq C(z2) and |C(z1)| <= |C(z2)|, there exists i <= |C(z1)|
  with Ci(z1) neq Ci(z2) (no codeword is a prefix of another).
  Let PF be the set of all prefix-free codes.

- The Optimal Lossy Code
  For eps > 0, define
    OPT(eps) = inf {E[|C(X)|] : (C, D) in PF with E[rho(D(C(X)), X)] <= eps},
  the minimum E[code length] required to guarantee E[loss] at most eps.

- Example: Bayesian Active Learning
  Z: the concept space (a set of binary classifiers h: R^k -> {0,1})
  pi: a *prior* on the target concept X in Z
  rho(h, g) = P(w : h(w) neq g(w)), so rho(h, X) is the error rate.
  A learning alg A(eps) makes yes/no queries, then outputs a classifier
  h: representable as a code (C, D).
  OPT(eps): the minimum E[#queries] required to guarantee
  E[err(h)] <= eps.

- Doubling Dimension
  A way to define a dimension for (Z, rho): it is doubling.
  Def: the doubling dimension d is minimal s.t. forall z in Z, eps > 0,
  {y in Z : rho(z, y) <= 2*eps} can be covered with <= 2^d eps-balls;
  in other words, exist {z_i}_{i=1}^{2^d} subset Z s.t.
  {y in Z : rho(z, y) <= 2*eps} subseteq \cup_i {y in Z : rho(z_i, y) <= eps}.

- Packing Entropy
  For eps > 0, Y(eps) subset Z is a maximal eps-packing of Z, i.e.
  Y(eps) is a largest set s.t. forall distinct z, y in Y(eps),
  rho(z, y) >= eps.
  For eps > 0, the induced Voronoi partition is
    P(eps) = {{x in Z : y = argmin_{z in Y(eps)} rho(x, z)} : y in Y(eps)}.
  Def: packing entropy: H(P(eps)) = - sum_{S in P(eps)} pi(S) log_2 pi(S).

- A straightforward upper bound:
  Theorem: forall eps > 0, OPT(eps) <= H(P(eps)) + 1.
  Proof: Let P_eps(X) be the element of P(eps) containing X, and
  Y_eps(X) the sole element of Y(eps) \cap P_eps(X).
  Let C be a lossless Huffman code for P_eps(X), and let D decode as
  D(C(X)) = Y_eps(X).
  Huffman proved E[|C(X)|] <= H(P_eps(X)) + 1 = H(P(eps)) + 1.
  A maximal eps-packing is also an eps-cover, so
  rho(D(C(X)), X) = rho(Y_eps(X), X) <= eps.

- A nearly-matching lower bound:
  Theorem: forall eps > 0,
    H(P(eps * log_2(1/eps))) - O(d) <= OPT(eps) <= H(P(eps)) + 1.
  Pf: Let alpha = eps log_2(1/eps).
  Fano's inequality: for a Q-valued random variable q and any code
  (C, D),
    P(q neq D(C(q))) >= (H(q | C(q)) - 1) / log_2(|Q|).

  If the regions of P(eps) were pointwise eps-separated, Fano's
  inequality would give a lower bound. Why? Because in that case
    E[rho(D(C(X)), X)] >= eps * P(P_eps(D(C(X))) neq P_eps(X)).

  To compensate for close regions, assign each P in P(alpha) a color
  K(P) in N so that no two regions P_1, P_2 of the same color contain
  points within alpha of each other; take the coloring K(.) of minimum
  entropy. One can show there exists a valid coloring using <= 12^d
  colors (based on bounding the size of an alpha-packing of a
  3*alpha-ball). So H(K(P_alpha(X))) <= d log_2(12).

  No z in Z is alpha/2-close to more than one region P in P(alpha) of a
  given color. So
    E[E[rho(D(C(X)), X) | K(P_alpha(X))]]
      >= (alpha/2) * (H(P_alpha(X) | C(X), K(P_alpha(X))) - 1) / log_2(|Y(alpha)|),
  where
    log_2(|Y(alpha)|) <= d log_2(4/alpha)
  and
    H(P_alpha(X) | C(X), K(P_alpha(X)))
      >= H(P_alpha(X)) - H(C(X)) - H(K(P_alpha(X)))
      >= H(P_alpha(X)) - E[|C(X)|] - d log_2(12).
  So if eps >= E[rho(D(C(X)), X)] = E[E[rho(D(C(X)), X) | K(P_alpha(X))]],
  then
    eps >= (eps log_2(1/eps) / 2)
           * (H(P(eps log_2(1/eps))) - E[|C(X)|] - d log_2(12)) / (2 d log_2(16/eps)).
  Solving: E[|C(X)|] >= H(P(eps log_2(1/eps))) - O(d).
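The upper-bound proof above leans on Huffman's guarantee E[|C(X)|] <= H(P(eps)) + 1. A minimal sketch checking that guarantee on a toy prior over four packing cells (the distribution is illustrative; only the stdlib is used):

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for a finite distribution."""
    # Heap entries: (subtree probability, tiebreak id, leaf indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, l1 = heapq.heappop(heap)
        p2, j, l2 = heapq.heappop(heap)
        for i in l1 + l2:          # merging deepens every leaf below
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, j, l1 + l2))
    return lengths

pi = [0.4, 0.3, 0.2, 0.1]          # toy prior pi over packing cells
L = huffman_lengths(pi)
expected_len = sum(p * l for p, l in zip(pi, L))
entropy = -sum(p * math.log2(p) for p in pi)
# Huffman's theorem: entropy <= expected_len <= entropy + 1
```

Decoding each codeword to the cell's packing center then gives loss at most eps, since a maximal eps-packing is an eps-cover; this is exactly the code (C, D) used in the upper-bound proof.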