15-859(B) Machine Learning Theory                                 04/21/08
 * Active learning
 * Retrospective
 * Connections to other areas
===========================================================================
Admin:
 - no class on Wed (I need to catch a flight)
 - sign up for project presentations
 - UCA course assessments
===========================================================================

Active learning
===============
A while back we talked about learning with Membership Queries, where the
algorithm could ask for the label f(x) of examples x of its own choosing.
This makes sense for problems like reverse-engineering some device, but not
so much for problems like image classification (e.g., recognizing
hand-written digits or faces), since a made-up image might not look like
anything.

A more practical model for such settings is what's called Active Learning.
In Active Learning, you are given a large pool U of unlabeled data, e.g.,
unlabeled images or digits.  Your goal can be either to do well on U or to
do well on new examples from the same distribution as U.  The algorithm is
then allowed to query for f(x) for examples x of its choosing IN U.  So it
can actively SELECT which examples to have labeled, but not construct them
out of whole cloth.

Where this fits: in MQ learning, we were mostly concerned with
*computational* advantages.  Here, our concern will instead be *sample
complexity*.  I.e., can we get away with substantially fewer labeled
examples if we have some say in which examples get labeled?  (Unlike MQ
learning, Active Learning doesn't help with running time, since a PAC
algorithm given labels for everything in U could just ignore the labels the
active learner wouldn't have asked for.)

The key point is that active learning can in some cases give us an
exponential improvement over passive learning.

Intervals
=========
Consider the class C = {[0,a] : a >= 0}.  Examples are points in R.
Normally, to get error epsilon we need O((1/epsilon)*log(1/delta))
examples.  How could we use active learning to get only a logarithmic
dependence on 1/epsilon?  [Ans: do binary search on U until the error rate
on U is < epsilon/2.]

On the other hand, what about the class C = { [a,b] }, where we allow both
endpoints to float?  Can anyone see a problem there?  The problem is if the
target is an interval of probability mass 2*epsilon (for concreteness,
imagine partitioning the distribution into 1/(2*epsilon) disjoint intervals
of probability mass 2*epsilon each and choosing one of them at random to be
the target).  Then any algorithm needs Omega(1/epsilon) queries in
expectation just to see a single positive example!  So, you can't beat
linear in 1/epsilon.

On the other hand (where's that 3rd hand when you need it?), a recent paper
of Balcan-Hanneke-Wortman (COLT'08) shows that if you switch the order of
quantification, then the difficulty in this (and more general) cases goes
away.  In particular, note that in the above bad example, the adversary got
to pick the target as a function of epsilon.  What if we instead say the
adversary has to pick the target first, and we then look at our labeled
sample complexity as epsilon->0?  E.g., consider the following algorithm:
pick random points in U and ask for them to be labeled.  If no positives
have been seen so far, maintain the "all negative" hypothesis.  However,
once you see a positive example, do 2-sided binary search to estimate the
interval boundaries (see the sketch below).  Now, if the target has zero
probability mass on positive examples, then this algorithm is correct from
the beginning.  If the target has probability mass P on positives, then the
number of label queries needed until the error rate is < epsilon is
O((1/P) + log(1/epsilon)).
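To make this concrete, here is a minimal Python sketch of the strategy just
described for C = {[a,b]}: random label queries until a positive is seen
(else output "all negative"), then a binary search for each endpoint over
the sorted pool.  The names pool, query_label, and budget are illustrative
(not from the paper), and the budget is only enforced in the random phase
to keep the sketch short.

    import random

    def active_learn_interval(pool, query_label, budget):
        """Sketch of the two-sided strategy above for C = {[a,b]}.
        pool: the unlabeled sample U (a list of reals).
        query_label(x): returns the true label f(x) as True/False.
        budget: cap on the number of random label queries in phase 1."""
        U = sorted(pool)

        # Phase 1: query random points; if no positive is ever seen,
        # output the "all negative" hypothesis, which is correct whenever
        # the target puts zero probability mass on positives.
        pos = None
        for _ in range(budget):
            i = random.randrange(len(U))
            if query_label(U[i]):
                pos = i
                break
        if pos is None:
            return lambda x: False

        # Phase 2: two binary searches (about 2*log2|U| more label queries)
        # to locate the leftmost and rightmost positive points in U.
        lo, hi = 0, pos
        while lo < hi:                      # leftmost positive index
            mid = (lo + hi) // 2
            if query_label(U[mid]):
                hi = mid
            else:
                lo = mid + 1
        left = lo

        lo, hi = pos, len(U) - 1
        while lo < hi:                      # rightmost positive index
            mid = (lo + hi + 1) // 2
            if query_label(U[mid]):
                lo = mid
            else:
                hi = mid - 1
        right = lo

        return lambda x: U[left] <= x <= U[right]

With |U| on the order of 1/epsilon unlabeled points, the binary-search
phase costs about 2*log(1/epsilon) label queries, which is where the
log(1/epsilon) term in the bound above comes from.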
Equivalently, we can think of a model where the algorithm is allowed to ask
q questions.  The above algorithm initially asks queries at random until it
sees a positive and then does binary search; if no positive is seen in the
q questions, it just outputs "all negative".  If you look at the error rate
as a function of q, then for the "all negative" target it is always 0.  For
a target of positive probability mass P, it starts at P, and once q >> 1/P
it starts dropping exponentially to 0.

An interesting thing they point out is that in the first case, the
algorithm doesn't *know* whether its error rate is low or not.  (We saw a
related thing when discussing Semi-Supervised Learning, where, e.g., under
the "independence given the label" assumption, we could do co-training from
just 1 labeled example.  But this is too few labels to *verify* that our
error is really low - we were really relying on our belief in our
assumptions.)  In the usual PAC bounds, you have enough data to test out
your hypotheses, so this issue doesn't come up.

Generic algorithms
==================
One conservative and fairly generic active learning algorithm, given some
class C, is to only query examples on which the hypotheses in C that are
consistent with the labels seen so far still disagree
(Cohen-Atlas-Ladner'92).  E.g., consider learning a linear separator in
R^2.  It is hard to analyze generically when this gives a big improvement.
In fact, most work analyzing active learning has focused on the special
case of linear separators under the uniform distribution, where you can get
a good handle on what is going on.

Learning linear separators
==========================
Let's look at the problem of learning a linear separator through the
origin, under the uniform distribution over the ball in R^d.  So, the error
rate of a hypothesis vector w is equal to the angle between w and the
target w^*, divided by pi.

Here's an interesting algorithm by Balcan-Broder-Zhang.  It's more
aggressive than the above algorithm in that it focuses on a tighter region
of uncertainty to sample from.

Step 1: query random examples to get a hypothesis w_1 of error 1/8.

Now, our goal is going to be to use not too many labeled examples (roughly
O(d log(1/epsilon))) in order to cut the error rate by a factor of 2.
Given hypothesis w_k, we will query random points satisfying
|w_k . x| <= b_k, and then find a separator w_{k+1} consistent with these,
subject to having distance at most 1/2^k from w_k (angle at most pi/2^k).
Use b_k approx k/(2^k * sqrt(d)).  [draw picture]

Analysis: Let alpha = error of w_k.  We want to show that the error of
w_{k+1} is <= alpha/2.

  err(w_{k+1}) =   Pr(w_{k+1} makes a mistake on x, |w_k . x| >= b_k)
                 + Pr(w_{k+1} makes a mistake on x, |w_k . x| <= b_k)

Let's look at the first part first.  We want to show it's at most alpha/4.
It's easier to look at w_k first: b_k is chosen so that at most alpha/8
probability mass of w_k's mistakes lies outside that margin (a geometric
fact, using our assumption that the target has distance at most alpha from
w_k).  By the same reasoning, at most alpha/8 probability mass of w_k's
"mistakes with respect to w_{k+1}" lies outside that margin.  All mistakes
of w_{k+1} have to be either disagreements between w_k and the target, or
disagreements between w_{k+1} and w_k, so overall this is at most alpha/4.

Second part: the key trick here is that the overall probability mass of the
region |w_k . x| <= b_k is relatively low: like O(alpha * log(1/alpha)).
So, it's enough to have an error *rate*, conditioned on belonging to this
set, that's O(1/log(1/alpha)).  But this we can handle!!
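Here is a hedged sketch of what one round of this margin-based scheme might
look like in code, in the spirit of Balcan-Broder-Zhang but not their exact
procedure: the fit_direction helper is only a stand-in for the "find a
separator consistent with the new labels, staying close to w_k" step, and
the knobs rounds and m are illustrative, not the constants from the
analysis.

    import numpy as np

    def fit_direction(X, y, anchor=None):
        """Stand-in for the constrained consistency step: average y_i * x_i
        (optionally blended with the previous hypothesis) and normalize.
        The real algorithm solves a constrained consistent-hypothesis
        problem; this proxy only illustrates the data flow."""
        w = (y[:, None] * X).mean(axis=0)
        if anchor is not None:
            w = w + anchor          # crude way of staying near the old hypothesis
        return w / np.linalg.norm(w)

    def margin_based_learner(pool, query_label, rounds=10, m=200):
        """Illustrative margin-based active learning loop.
        pool: (n, d) array of unit vectors drawn from the uniform distribution.
        query_label(x): returns +1/-1, i.e., sign(w* . x) for the target w*."""
        d = pool.shape[1]
        rng = np.random.default_rng(0)

        # Step 1: label a few random points to get an initial hypothesis w_1.
        idx = rng.choice(len(pool), size=m, replace=False)
        y = np.array([query_label(x) for x in pool[idx]])
        w = fit_direction(pool[idx], y)

        for k in range(1, rounds):
            b_k = k / (2**k * np.sqrt(d))        # margin width b_k ~ k/(2^k sqrt(d))
            near = pool[np.abs(pool @ w) <= b_k] # region of uncertainty around w_k
            if len(near) == 0:
                break
            idx = rng.choice(len(near), size=min(m, len(near)), replace=False)
            y = np.array([query_label(x) for x in near[idx]])
            w = fit_direction(near[idx], y, anchor=w)  # aim: halve the error
        return w

The design point to notice is that only points near the current decision
boundary ever get labeled; the points far from the boundary are exactly the
ones the analysis above shows contribute at most alpha/4 to the error of
w_{k+1}.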
Retrospective
=============
Main topic of the course: what can we say theoretically about what types of
tasks we can learn, from what types of data, using what types of
algorithms?  What types of guarantees can we get?

Types of data: batch iid (PAC), online adversarial (Mistake Bound),
algorithm-selected (Membership Query and Active Learning).

Types of guarantees: Occam bounds say that if computation time is no
object, we can learn any function with sample size proportional to the
number of bits needed to write it down.  Same with the number of mistakes.
If we do worry about computation time, then things get harder.  We analyzed
what's doable with Statistical Query algorithms.  Membership query
algorithms allow for poly-time algorithms in cases where we can't (or don't
know how to) manage without them (e.g., DFAs, monotone DNF, decision trees,
XOR of terms).

As a practical matter, samples are more expensive than time: it's not
enough to just say "sufficient to have # examples proportional to # bits".
This motivates a more careful look at sample complexity, e.g., in terms of
description length, VC-dimension, margins, and Rademacher averages.  We
tried to get a more detailed understanding of how much data we really need
in order to generalize well - to understand the inherent relationship
between the kinds of things we are trying to learn and how much data we
need to see to learn them.

We also saw generic conversion techniques, like the Weighted-Majority
algorithm and the Kalai-Vempala algorithm for getting good regret bounds,
and Boosting for converting weak learning into strong learning.

We also looked at things from the perspective of: given a desired
guarantee, what types of assumptions do we need to make to get it?  (E.g.,
maybe assume the data comes from the uniform distribution.)  Try to
understand what it is about the data that might make the learning problem
easier (like large margins).

Connections to other areas
==========================
 -> statistics
 -> cryptography
 -> game theory
 -> auction design
 -> complexity theory
 -> application areas (lots)