15-859(B) Machine Learning Theory                                   02/27/12
Guest Lecturer: Liu Yang

* Intro to Active Learning
* Disagreement-based Active Learning
* Active Learning with Noisy Labels: Algorithm
* Surrogate Losses; Drifting Distribution
* Bayesian Active Learning

========================================================================
Intro to Active Learning
========================================================================

- Example: Threshold (1 dimension)

  Question 1: How many random labeled pts are sufficient (w.p. 1-delta)
  to find the threshold Z, to within ±eps?

  ----------------------------------------------------------------
  Analysis
  ----------------------------------------------------------------
  Idea: If we get a - and a + both within eps of Z, the midpoint between
  them is within eps of Z.

  Say we have m random pts X1, X2, ..., Xm.
    P(Z <= Xi <= Z+eps) = eps
    P(exists i <= m: Xi in [Z, Z+eps]) = 1 - P(forall i <= m: Xi notin [Z, Z+eps])
                                       = 1 - (1-eps)^m >= 1 - e^{-eps*m}
  Similarly for Xi in [Z-eps, Z), so w.p. 1 - 2e^{-eps*m} there exist a +
  and a - within eps of Z. To make this >= 1-delta, it suffices to take
  any m >= (1/eps) ln(2/delta).

  Question 2: How many *active* labeled pts are sufficient (w.p. 1-delta)
  to find the threshold, to within ±eps? How can we use active learning
  to reduce the number of labels?

  Take m random unlabeled data points.
  Repeatedly request the label of the median point between the closest
  known + and -.
  Take the mid-point threshold between the adjacent +/- points.
  The classifier is the same one we would get from m labeled data
  points, but we requested only log(m) labels!

  Label complexity analysis: m = (1/eps) ln(2/delta) random points are
  sufficient, so log((1/eps) ln(2/delta)) label requests are sufficient.

- Formal Definitions
  A hypothesis class C of classifiers h: X -> {±1}.
  err(h) = P(h(X) neq h*(X)) for X ~ D.
  i.i.d. data points X1, X2, ... ~ D.
  The labels h*(Xi) are hidden from the learning alg until individually
  requested. The alg chooses some Xi, requests its label, and receives
  Yi = h*(Xi); repeat up to n times.
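The binary-search scheme from the threshold example above can be sketched as a small simulation (a minimal sketch, not from the lecture: the hidden threshold Z = 0.37, the pool size, and the function names are illustrative):

```python
import random

def active_threshold(points, label):
    """Learn a 1-d threshold by binary search over the sorted unlabeled
    pool, requesting only O(log m) labels instead of m."""
    xs = sorted(points)
    queries = 2
    if label(xs[0]) == +1:        # whole pool is positive
        return xs[0], queries
    if label(xs[-1]) == -1:       # whole pool is negative
        return xs[-1], queries
    lo, hi = 0, len(xs) - 1       # invariant: xs[lo] is -, xs[hi] is +
    while hi - lo > 1:
        queries += 1
        mid = (lo + hi) // 2
        if label(xs[mid]) == +1:
            hi = mid
        else:
            lo = mid
    return (xs[lo] + xs[hi]) / 2, queries  # midpoint of adjacent -/+ pair

random.seed(0)
Z = 0.37                                   # hidden target threshold
label = lambda x: +1 if x >= Z else -1     # simulated labeling oracle
pts = [random.random() for _ in range(1000)]
h, q = active_threshold(pts, label)
# h lands within eps of Z w.h.p., after only ~log2(1000) label requests
```

The returned classifier is the same one passive learning would produce from all 1000 labels, matching the log(m) label-complexity claim above.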
  The alg tries to choose h to minimize the error err(h).

  Def: Alg A(n) achieves label complexity Lambda(eps, delta, h*, D) if it
  outputs a classifier h_n after at most n label requests, and for all
  eps, delta in (0, 1) and all n >= Lambda(eps, delta, h*, D),
  P(err(h_n) <= eps) >= 1 - delta.

========================================================================
Disagreement-based Active Learning
========================================================================

- A simple idea from Cohn, Atlas & Ladner (1994).

  CAL:
    V = C
    for i = 1, 2, ...
      if exists h, g in V s.t. h(Xi) neq g(Xi)
        request Yi, V <- {h in V : h(Xi) = Yi}
      if requested n labels already, return any h in V

  Examples: thresholds, intervals

  Historically, this type of algorithm became popular due to the later
  A^2 (Agnostic Active) algorithm of Balcan-Beygelzimer-Langford,
  designed for the agnostic case. The A^2 algorithm guarantees
  correctness (error always at most OPT + eps), and the BBL paper proved
  that it never does worse than passive learning and that in a number of
  interesting cases it does dramatically better. Subsequently, Hanneke
  gave a general analysis of A^2, introducing the notion of the
  disagreement coefficient to characterize its performance. The
  disagreement coefficient is a property of the concept class (like the
  VC dimension) that determines the number of label requests used.
  Interestingly, it can be used to analyze the CAL algorithm in the
  realizable case as well.

  Note that CAL-type algorithms are particularly "mellow", as opposed to
  more aggressive algorithms known for the realizable case and for
  certain types of noise. E.g., better results are known for linear
  separators in the realizable case. It is unknown whether one can
  achieve better bounds by being more aggressive in the agnostic case.

  OK, now let's go on to analyze CAL in the realizable case.

- Def: Disagreement Coefficient
  DIS(H) = {x : exists h, g in H s.t. h(x) neq g(x)}.
  B(f, r) = {h in C : P(h(x) neq f(x)) <= r}
  theta(eps) = sup_{r > eps} P(DIS(B(h*, r))) / r.

- Label Complexity of CAL:
    theta(eps) (d log(theta(eps)) + log(log(1/eps)/delta)) log(1/eps)

  Pf idea: Say V' = V at a given time i' where sup_{g in V'} err(g) > eps.
  Then after roughly 2 d theta(eps) more label requests, we have at
  least 2 d theta(eps) labeled points in DIS(V'), so now every h in V
  has (by the classic PAC bounds)
    err(h | DIS(V')) <= (roughly) d / (2 d theta(eps)) = 1 / (2 theta(eps)).
  So err(h) = err(h | DIS(V')) P(DIS(V'))
           <= P(DIS(V')) / (2 theta(eps))
           <= (theta(eps) * sup_{g in V'} err(g)) / (2 theta(eps))
            = sup_{g in V'} err(g) / 2.
  So we halve sup_{g in V} err(g) at least once every (roughly)
  2 d theta(eps) queries, and we need to halve it at most log(1/eps)
  times to get sup_{g in V} err(g) < eps.

- When is theta(eps) bounded?

========================================================================
Active Learning with Noisy Labels
========================================================================

  Now we deal with data from a general distribution D_XY on (X, Y), so
  err(h*) can be > 0 (where h* has minimal err(h) among h in C).
  Define the empirical error rate:
    err(h; L) = (1/|L|) sum_{(x,y) in L} I[h(x) neq y].

- Making CAL robust to noise:
  We can't just throw away a classifier for making a single mistake.
  Solution: only discard a classifier if it makes many more mistakes
  than some other classifier in V. "Many more" is based on results for
  the difference between empirical and true error rates (only throw
  away a classifier whose empirical error rate is larger than some
  upper bound on the empirical error rate of h*). This is guaranteed to
  keep h* in V.

  A robust algorithm based on this idea is BBL's A^2 algorithm.

  A^2 Algorithm(C, n)
  - V <- C, L <- {}
  - For m = 1, 2, ...
    - If exists h, g in V s.t.
        h(Xm) neq g(Xm)
      - Request Ym
      - L <- L \cup {(Xm, Ym)}
    - If m is a power of 2
      - V <- {h in V : err(h; L) - min_{h' in V} err(h'; L) <= T(V, L)}
      - L <- {}
    - If used n label requests, return any h in V

  T(V, L) is a confidence bound with the guarantee that
    err(h*; L) - min_{h' in V} err(h'; L) <= T(V, L).
  (Note L is conditionally i.i.d. given |L| and V, so we can borrow
  T(V, L) from the passive learning literature.)

- The A^2 algorithm never eliminates h*.
- How about its label complexity?

  Definition (Tsybakov noise):
    P(h(x) neq h*(x)) <= (err(h) - err(h*))^{1/kappa}.

  Passive learning has tilde(O)(d * eps^{1/kappa - 2}) sample complexity.

  Label complexity of Robust CAL (for an appropriate T(V, L), based on
  Rademacher complexity):
    theta(eps^{1/kappa}) eps^{2/kappa - 2} d log^2(d/(eps*delta)).

- There is a minimax lower bound of eps^{2/kappa - 2}, so bounded theta
  => nearly minimax optimal.

---------------------------------------------------------------------------
Surrogate Losses
---------------------------------------------------------------------------
- It is computationally hard to run Robust CAL (we need to minimize the
  number of mistakes in one step, which can be NP-hard).
- So we may want to use a convex relaxation of that optimization problem.
- This is a common trick in passive learning, and it can be analyzed and
  shown to still be effective under certain noise conditions.

----------------------------------------------------------------------------
Distribution Drift
----------------------------------------------------------------------------
- What if each Xi is sampled from a different unknown distribution
  (with h* held fixed)?
- If the distributions are from a totally bounded family, the algorithm
  still works.
- In particular, if theta(eps) is uniformly bounded (over the family of
  distributions) by an o(1/eps) function, then if we run a stream of
  examples through CAL, its expected number of queries is sublinear in
  the number of examples it has processed.
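The sublinear-query behavior of CAL on a drifting stream can be checked in a small simulation (a sketch under illustrative assumptions, not from the lecture: 1-d thresholds on [0,1], a fixed target at 0.5, and a sinusoidally sliding uniform window as the drifting distribution):

```python
import math
import random

def cal_query_count(n, target=0.5):
    """Run CAL for 1-d thresholds on a stream of n points whose
    distribution drifts (a width-1 uniform window sliding over [0,1],
    clipped to the unit interval) and count the label requests.
    The version space is the interval (lo, hi) of thresholds still
    consistent with the answers received so far."""
    lo, hi = 0.0, 1.0
    queries = 0
    for i in range(n):
        center = 0.5 + 0.3 * math.sin(2 * math.pi * i / n)   # drift
        x = min(1.0, max(0.0, center + random.uniform(-0.5, 0.5)))
        if lo < x < hi:            # x is in the disagreement region
            queries += 1           # so CAL requests its label
            if x >= target:        # true label is +
                hi = x
            else:                  # true label is -
                lo = x
    return queries

random.seed(1)
n = 10000
q = cal_query_count(n)
# q grows roughly like log(n), far below the n labels passive reads
```

Each query shrinks the version-space interval, so later stream points rarely fall in the disagreement region; the drift does not hurt here because every window in the family keeps mass around the target.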
========================================================================
Bayesian Active Learning
========================================================================

  If there is a prior over h*, we can get a better label complexity.

- Verifiable vs Unverifiable:
  A self-verifying algorithm takes eps as an argument and decides its
  own budget adaptively. It not only needs to find an eps-good
  classifier, but also must *know* that it has found a good one.
- Without a prior, if the VC dimension is finite, we can always get
  o(1/eps) label complexity if the alg doesn't have to be
  self-verifying.
- Without a prior, for C = interval classifiers and h* = the empty
  interval, the label complexity of self-verifying active learning is
  1/eps = passive.
- With a prior, we can *always* guarantee o(1/eps) label complexity,
  even for self-verifying algorithms.

  Lower bound:
- We can get a rough lower bound on the label complexity of Bayesian
  active learning via the entropy of a Voronoi partition of C induced
  by a maximal tilde{O}(eps)-packing (if the pseudo-metric
  P(h(x) neq g(x)) is "doubling" for h, g in C).

-----------------------------------------------------------------------------------
Here are more details on Bayesian active learning, which we won't have
time to cover in lecture:
-----------------------------------------------------------------------------------
Self-Verifiable Active Learning
-----------------------------------------------------------------------------------
- Self-verifying (a special type of stopping criterion)
  - Given eps, adaptively decides # of queries, then halts
  - Has the property that E[err] < eps when it halts

  Question: Can you do this with E[#queries] = o(1/eps)?
  (Passive learning needs 1/eps labels.)

  Example: Intervals
  Suppose D is uniform on [0, 1].

  Verification Lower Bound
  In the non-Bayesian setting, suppose h* is the empty interval.
  Given any classifier h, just to verify err(h) < eps, we need to
  verify that h* is not an interval of width 2*eps. We need an example
  in each of Omega(1/eps) regions to verify this fact.
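The verification lower bound is a pigeonhole argument, and it can be made concrete in code (an illustrative helper, not from the lecture: split [0,1] into 1/(2*eps) disjoint width-2*eps cells; any algorithm that queried fewer points than cells leaves some cell untouched, and a target interval could still hide there):

```python
def unqueried_cell(queried_points, eps):
    """Return a width-2*eps cell of [0,1] containing no queried point,
    or None if every cell was probed. While such a cell exists, a
    learner cannot verify err(empty interval) < eps: the target could
    still be an interval inside that cell, giving error ~2*eps."""
    cells = int(1 / (2 * eps))
    for k in range(cells):
        a, b = 2 * eps * k, 2 * eps * (k + 1)
        if not any(a < x < b for x in queried_points):
            return (a, b)
    return None

# With eps = 0.01 there are 50 cells, so 49 queries can never cover
# them all (pigeonhole); a hiding place for the target remains:
gap = unqueried_cell([k / 49.0 for k in range(49)], 0.01)
```

Only querying at least one point per cell, i.e. Omega(1/eps) labels, rules out every hiding place, matching the 1/eps verification cost claimed above.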
- Learning with a prior
  Suppose we know a distribution the target is sampled from; call it
  the *prior*.

- Interval Example with prior
  - Algorithm: Query random pts till we find the first +, then do
    binary search to find the end-points. Halt when we reach a
    prespecified prior-based query budget. Output the posterior's Bayes
    classifier.
  - Let the budget N be high enough so E[err] < eps.
    - N = o(1/eps) is sufficient for E[err | w* > 0] < eps: if the
      target interval's width w* > 0, even a prior-independent analysis
      needs only E[#queries | w*] = O(1/w* + log(1/eps)) = o(1/eps).
    - N = o(1/eps) is sufficient for E[err | w* = 0] < eps: if
      P(w* = 0) > 0, then after some L = O(log(1/eps)) queries,
      w.p. > 1-eps most of the probability mass is on the empty
      interval, so the posterior's Bayes classifier has 0 error rate.

- Can do o(1/eps) for any VC-class
  Theorem: With the prior, we can get o(1/eps) query complexity.
  - There are methods that find a good classifier in o(1/eps) queries
    (though they aren't self-verifying) [BHW08].
  - We need to set a stopping criterion for those algorithms.
  - The stopping criterion we use: a budget.
  - Set the budget to be just large enough so E[err] < eps.

-----------------------------------------------------------------------------------
Bayesian Active Learning using Arbitrary Binary Valued Queries
-----------------------------------------------------------------------------------
  To get a lower bound for Bayesian active learning, let's consider a
  much more powerful model: arbitrary binary valued queries.

- Problem: We want to learn some concept up to eps precision. If we can
  ask any yes/no question, how many questions do we need to ask?
  Bayesian version: we know a distribution for the concept and want eps
  expected precision; how many questions are enough in expectation?

- Abstract Model: Lossy Source Coding
  Z: some space
  rho: Z x Z -> [0, 1], some pseudo-metric (the *loss*)
  pi: a known distribution on Z
  target: X ~ pi
  A code is any pair (C, D) where C: Z -> {0,1}* is the encoder (C(z)
  is called the codeword) and D: {0,1}* -> Z is the decoder.
  Ci(z): the ith bit of C(z).
  A code (C, D) is prefix-free if for any z1, z2 in Z with
  C(z1) neq C(z2) and |C(z1)| <= |C(z2)|, there exists i <= |C(z1)|
  with Ci(z1) neq Ci(z2) (no codeword is a prefix of another).
  Let PF be the set of all prefix-free codes.

- The Optimal Lossy Code
  For eps > 0, define
    OPT(eps) = inf {E[|C(X)|] : (C, D) in PF with E[rho(D(C(X)), X)] <= eps},
  the minimum E[code length] required to guarantee E[loss] at most eps.

- Example: Bayesian Active Learning
  Z: the concept space (a set of binary classifiers h: R^k -> {0,1})
  pi: a *prior* on the target concept X in Z
  rho(h, g) = P(w : h(w) neq g(w)), so rho(h, X) is the error rate.
  A learning alg A(eps) makes yes/no queries, then outputs a classifier
  h: representable as a code (C, D).
  OPT(eps): the minimum E[#queries] required to guarantee
  E[err(h)] <= eps.

- Doubling Dimension
  A way to define a dimension for (Z, rho): it is doubling.
  Def: the doubling dimension d is minimal s.t. forall z in Z, eps > 0,
  {y in Z : rho(z, y) <= 2*eps} can be covered with <= 2^d eps-balls;
  in other words, exist {z_i}_{i=1}^{2^d} subset Z s.t.
  {y in Z : rho(z, y) <= 2*eps} subseteq \cup_i {y in Z : rho(z_i, y) <= eps}.

- Packing Entropy
  For eps > 0, Y(eps) subset Z is a maximal eps-packing of Z, i.e.
  Y(eps) is a largest set s.t. forall distinct z, y in Y(eps),
  rho(z, y) >= eps.
  For eps > 0, the induced Voronoi partition is
    P(eps) = {{x in Z : y = argmin_{z in Y(eps)} rho(x, z)} : y in Y(eps)}.
  Def: packing entropy: H(P(eps)) = - sum_{S in P(eps)} pi(S) log_2 pi(S).

- A straightforward upper bound:
  Theorem: forall eps > 0, OPT(eps) <= H(P(eps)) + 1.
  Proof: Let P_eps(X) be the element of P(eps) containing X, and
  Y_eps(X) the sole element of Y(eps) \cap P_eps(X).
  Let C be a lossless Huffman code for P_eps(X), and let D decode as
  D(C(X)) = Y_eps(X).
  Huffman proved E[|C(X)|] <= H(P_eps(X)) + 1 = H(P(eps)) + 1.
  A maximal eps-packing is also an eps-cover, so
  rho(D(C(X)), X) = rho(Y_eps(X), X) <= eps.

- A nearly-matching lower bound:
  Theorem: forall eps > 0,
    H(P(eps * log_2(1/eps))) - O(d) <= OPT(eps) <= H(P(eps)) + 1.
  Pf: Let alpha = eps log_2(1/eps).
  Fano's inequality: for a Q-valued random variable q and any code
  (C, D),
    P(q neq D(C(q))) >= (H(q | C(q)) - 1) / log_2(|Q|).

  If the regions of P(eps) were pointwise eps-separated, Fano's
  inequality would give a lower bound. Why? Because in that case
    E[rho(D(C(X)), X)] >= eps * P(P_eps(D(C(X))) neq P_eps(X)).

  To compensate for close regions, assign each P in P(alpha) a color
  K(P) in N so that no two regions P_1, P_2 of the same color contain
  points within alpha of each other; take the coloring K(.) of minimum
  entropy. One can show there exists a valid coloring using <= 12^d
  colors (based on bounding the size of an alpha-packing of a
  3*alpha-ball). So H(K(P_alpha(X))) <= d log_2(12).

  No z in Z is alpha/2-close to more than one region P in P(alpha) of a
  given color. So
    E[E[rho(D(C(X)), X) | K(P_alpha(X))]]
      >= (alpha/2) * (H(P_alpha(X) | C(X), K(P_alpha(X))) - 1) / log_2(|Y(alpha)|),
  where
    log_2(|Y(alpha)|) <= d log_2(4/alpha)
  and
    H(P_alpha(X) | C(X), K(P_alpha(X)))
      >= H(P_alpha(X)) - H(C(X)) - H(K(P_alpha(X)))
      >= H(P_alpha(X)) - E[|C(X)|] - d log_2(12).
  So if eps >= E[rho(D(C(X)), X)] = E[E[rho(D(C(X)), X) | K(P_alpha(X))]],
  then
    eps >= (eps log_2(1/eps) / 2)
           * (H(P(eps log_2(1/eps))) - E[|C(X)|] - d log_2(12)) / (2 d log_2(16/eps)).
  Solving: E[|C(X)|] >= H(P(eps log_2(1/eps))) - O(d).
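The upper-bound proof above leans on Huffman's guarantee E[|C(X)|] <= H(P(eps)) + 1. A minimal sketch checking that guarantee on a toy prior over four packing cells (the distribution is illustrative; only the stdlib is used):

```python
import heapq
import math

def huffman_lengths(probs):
    """Codeword lengths of a Huffman code for a finite distribution."""
    # Heap entries: (subtree probability, tiebreak id, leaf indices).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    while len(heap) > 1:
        p1, _, l1 = heapq.heappop(heap)
        p2, j, l2 = heapq.heappop(heap)
        for i in l1 + l2:          # merging deepens every leaf below
            lengths[i] += 1
        heapq.heappush(heap, (p1 + p2, j, l1 + l2))
    return lengths

pi = [0.4, 0.3, 0.2, 0.1]          # toy prior pi over packing cells
L = huffman_lengths(pi)
expected_len = sum(p * l for p, l in zip(pi, L))
entropy = -sum(p * math.log2(p) for p in pi)
# Huffman's theorem: entropy <= expected_len <= entropy + 1
```

Decoding each codeword to the cell's packing center then gives loss at most eps, since a maximal eps-packing is an eps-cover; this is exactly the code (C, D) used in the upper-bound proof.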