o VC dimension ------------------------------------ Consider problem of learning an initial subinterval of [0,1]. Example is a real number in [0,1]. C = {[0,a] : a <= 1} Here's an algorithm: pick the interval [0,b] where b is the largest positive example seen so far. Question: how much data does this need to achieve PAC guarantee? Answer: consider the region [a',a] of probability measure espilon. If we see an example inside that region, then we're done. Prob we fail in m examples is (1-epsilon)^m. ==> 1/epsilon * log(1/delta) are sufficient. We're doing a lot better than formula that had log(|C|) in it. Why?? --> Only one "degree of freedom". How much is a degree of freedom worth? Are they all worth the same? --> Look at "effective number" of concepts as a function of number of data points seen. Only grows linearly.... Notion that captures this. Number of "effective degrees of freedom": VC-dimension. Go to slides for Defs of Shattering and VCdim Shown that VC-dim of lines in the plane is at least 3. To show exactly 3 need to say NO set of 4 pts is shattered. Proof: if one inside convex hull of others, cant get outsides -, inside +. Otherwise, label alternatingly in cycle. More Examples: Intervals: VC=2. (cant get + - +). Nice proof for axis parallel rectangles: consider 5 pts. draw smallest enclosing box. For each side of box, pick one point on it and color red. Must be at least one pt left -> color blue. Can't have red be positive and blue be negative. Disjunctions: 100, 010, 001. ==> VC-dim >= n. Not n+1 since only 2^n disjunctions. --> 2^VC-DIM(C) <= |C|. All functions: VC-dim = 2^n. Go to slides for theorems. Note: we're calling C[m] for what in book is PI_C(m). Lower bound: Basically same as did in class before. Consider d points that can be shattered and a target that is random on these. 1-4*epsilon on one point, 4*epsilon/(d-1) on the rest. Need to see > 1/2 of rest to hope to get error epsilon (i.e., can only hope to do 50/50 on the ones you didn't see). So, expected error at 1/8 * VCdim/epsilon examples is at least epsilon. Can you learn infinite VC classes? Yes: since allowed examples depending on size(c). Does this mean "complicated hypotheses are bad"? No. Just means you need a lot of data to learn complicated things. --------------------------------------------------------------------- Begin proof of first theorem. Define "m double-choose d" as sum from i=0 to m of m choose i. Number of ways of choosing d or fewer elements out of m. Called PHI_d(m) in the book. Easy to see ((m d)) = ((m-1 d)) + ((m-1 d-1)). Proof: to choose d or fewer out of m you either choose the first element and then d-1 or fewer out of the remaining m-1 or you don't, and then choose d or fewer out of the remaining m. First Theorem: If d = VCDIM(C), then C[m] is at most ((m d)). Proof: Let's say we have a set S of m examples. We want to bound number of ways of splitting S using concepts in C. Define PI_C(S) = set of different ways of splitting (partitioning) S by concepts in C. Think of these as concepts defined only on S. Do proof by induction on m and d. Pick some x in S. So, we know that PI_C(S - {x}) has at most ((m-1 d)) partitions. How are there more splittings of S than S - {x}? Well, increase is exactly the number of pairs in PI_C(S) that differ only on x, since these are the ones that collapse when you remove x. Let's count this by considering C' = {c in PI_C(S) that label x negative for which there is another concept in PI_C(S) that is identical to c except it labels x positive}. (these are concepts that are only defined on S) Note that |C'| = |PI_C'(S - {x})|. If we can show VCDIM(C') <= d-1, then we're done since then by induction, |C'| <= ((m-1 d-1)). But, this is easy since consider any subset S' of S shattered by C'. First of all, S' doesn't include x. But, by definition of C', we must have that S' U {x} is shattered by C. Since d = VCDIM(C), then VCDIM(C') is at most d-1.