Notes for Jan 14, 1998

Last time we discussed the "consistency model". We noticed some problems with it:

 problem: it is not clear that a positive result says anything about the
   ability to predict well on new examples.

 problem (or peculiarity): a negative result doesn't seem related to the
   inherent predictability of the concept class. It is more about the ease
   of finding an optimal representation.

The latter is sometimes what you want, though. E.g., neural nets: powerful
and useful; we train them with a local optimization rule (backprop /
gradient descent), which often finds a local minimum (a solution with
locally the smallest error) that is not a global minimum.

Basic question we might ask: can we come up with a more closed-form
solution, or a poly-time algorithm, to find the global optimum? To do this
we would need to solve the consistency problem. For basically any
interesting neural net, consistency turns out to be NP-hard ==> achieving
this guarantee is unlikely.

We'll give a proof for a simple network: an AND of two linear threshold
functions. I.e., a typical concept looks like:

  (v_1*x_1 + ... + v_n*x_n > v_0)  AND  (w_1*x_1 + ... + w_n*x_n > w_0)

So, given a set of data, the goal is to set the v_i's and w_i's to fit the
data, if possible.

Theorem: the consistency problem is NP-hard for this concept class.

Proof: reduce from set-splitting. In set-splitting you have n points and a
collection of subsets S_1, S_2, ..., S_m. Your goal is to color the points
red or blue so that no set is monochromatic. (Also called hypergraph
2-coloring.)

What we want to do is: given a set-splitting problem, construct a set of
data that has a consistent net if and only if the set-splitting problem
was solvable.

Idea: work in n-dimensional space. The origin is a positive example, and
there is one negative example at each neighbor of the origin (each unit
vector e_i), corresponding to the n points in the set-splitting problem.
Note that every negative example must be separated from the origin by at
least one of the two planes.

We want to set things up so that (A) given a nnet solution, the points
separated from the origin by plane 1 can be colored red, those separated
by plane 2 can be colored blue (and those separated by both can be colored
either way), and (B) given a set-splitting solution, we can find two
planes. Ideas?

Basically, for each set S_j we want to make sure its points can't all be
on the far side of the same plane, while still allowing every other split.
So, put a positive example for each S_j at the location with 1's in the
coordinates of the elements of the set.

nnet ==> split: color point i red if e_i is separated from the origin by
plane 1 and blue if by plane 2 (either color if both). To see that no set
is monochromatic, look at a fixed set S_j: say all of its points were
colored red, i.e., cut off by plane 1, defined by v . x > v_0. Since the
origin is positive we have 0 > v_0, and each i in S_j has v_i <= v_0, so
the positive example for S_j has sum_{i in S_j} v_i <= |S_j|*v_0 <= v_0,
meaning it too would be cut off by plane 1. Contradiction.

split ==> nnet: the red plane is w_1*x_1 + ... + w_n*x_n > -1/2, where
w_i = -1 if i is colored red and w_i = +n if i is colored blue (the blue
plane is symmetric). The origin and each S_j point satisfy both
inequalities (each S_j contains a blue point, so its sum is at least
n - (n-1) = 1), while each red e_i fails the red plane and each blue e_i
fails the blue plane.
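To make the construction concrete, here is a small Python sketch (my own
illustration, not from the lecture) of the "split ==> nnet" direction: it
builds the labeled examples from a set-splitting instance and checks that
the two planes defined above classify them correctly. The function names
and the tiny example instance are just placeholders.

    def build_dataset(n, sets):
        """Labeled examples from a set-splitting instance on points 0..n-1."""
        examples = [([0] * n, 1)]                       # the origin is positive
        for i in range(n):                              # each neighbor e_i of the origin is negative
            e = [0] * n
            e[i] = 1
            examples.append((e, 0))
        for S in sets:                                  # one positive example per set S_j,
            x = [1 if i in S else 0 for i in range(n)]  # with 1's on the elements of S_j
            examples.append((x, 1))
        return examples

    def planes_from_coloring(n, red):
        """Two halfspaces (w, threshold); the concept is the AND of (w.x > threshold)."""
        w_red  = [-1 if i in red else n for i in range(n)]  # cuts off the red e_i's
        w_blue = [n if i in red else -1 for i in range(n)]  # cuts off the blue e_i's
        return (w_red, -0.5), (w_blue, -0.5)

    def net(x, planes):
        return int(all(sum(wi * xi for wi, xi in zip(w, x)) > t for w, t in planes))

    # Tiny instance: points {0,1,2}, sets {0,1} and {1,2}; coloring red = {1} splits both sets.
    n, sets = 3, [{0, 1}, {1, 2}]
    planes = planes_from_coloring(n, red={1})
    assert all(net(x, planes) == label for x, label in build_dataset(n, sets))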
Mistake-bound model
-------------------

The consistency model had the problem that a positive result didn't
necessarily imply finding a rule with good "predictive power" in any
sense. The mistake-bound model addresses this problem by explicitly
modeling learning as an on-line process.

In the MB model, learning proceeds in stages. In each stage:

 1. The learner gets an unlabeled example.
 2. The learner predicts its classification.
 3. The learner is told the correct answer.

Goal: bound the total number of mistakes.

Definition: An algorithm A learns C in the mistake-bound model if, for any
sequence of examples consistent with some concept in C, the total number
of mistakes ever made by A is bounded by some polynomial in n and s, where

  n = the size of an example,
  s = the size (description length) of the simplest concept in C
      consistent with the data so far.

(s is only important if C contains concepts with description length more
than polynomial in n, like DNF formulas or decision trees.)

Definition: C is (efficiently) learnable in the mistake-bound model if
there exists an algorithm A that learns C, where the running time of A per
stage is also polynomial in n and s.

Conjunctions:

  Start: h = x_1 not(x_1) x_2 not(x_2) ... x_n not(x_n), the conjunction
    of all 2n literals.
  Mistake on a positive example: remove from h all literals not satisfied
    by the example.
  Mistake on a negative example: say "no consistent conjunction."

  Guarantee: {literals in target} is contained within {literals in h}, so
  a mistake on a negative example really does mean no conjunction is
  consistent.
  Max # of mistakes: n+1.
  (A code sketch of this algorithm appears at the end of these notes.)

--------------------------

Disjunctions: same thing, just reverse the roles of positive and negative.

--------------------------

Lower bound of n:

We've seen an algorithm that makes at most n+1 mistakes. In fact, *no*
algorithm can guarantee a mistake bound of less than n. This is easiest to
see with disjunctions. Imagine seeing the following sequence of examples:

  1 0 0 0 0 0
  0 1 0 0 0 0
  0 0 1 0 0 0
  ...
  0 0 0 0 0 1

*Any* labeling of these is consistent with some concept in C (the
disjunction of the variables corresponding to the examples labeled
positive), so an adversary can declare every prediction wrong: for any
algorithm, there exists a sequence of examples forcing it to make n
mistakes.

[[This in some ways foreshadows the notion of "VC-dimension" that we will
get to later...]]

Relation to consistency model
-----------------------------

Being able to solve the consistency problem doesn't necessarily give an
algorithm for the mistake-bound model. But the other direction IS true, in
the following sense:

Theorem: if class C is learnable in the MB model by a poly-time algorithm
A that uses hypotheses in C, then C is learnable in the consistency model.

Proof: Given a set S of data, run the algorithm through the set over and
over until it makes a pass with no mistakes. Let's assume the algorithm is
"conservative", meaning that it only changes its hypothesis when it makes
a mistake. Then, if it makes a pass with no mistakes, we have a consistent
hypothesis. On the other hand, if it keeps making mistakes and exceeds its
mistake bound, then we know there is no consistent hypothesis.

What about "non-conservative" algorithms?
 --> we can reset the algorithm's state after each example on which it
     makes no mistake, or
 --> we can just use its current hypothesis to check whether any example
     in S is inconsistent with it, and if so feed that example in.

So in some cases learning in the MB model can be a bit trickier than
learning in the consistency model; here is an interesting example.
(Note: using cryptographic assumptions (like "factoring is hard") one can
construct weird concept classes that *are* learnable in the consistency
model but not in the MB model. We'll get to this later when we talk about
the relations between learning and cryptography.)

Decision lists
--------------

Say we have the variables "rainy?", "warm?", "have-jacket?", "snowy?", and
we want to give a rule for when you like to go for a walk. Say you don't
like to go for a walk if it's raining. Otherwise, you like to go for a
walk if it's warm, or if it's snowy so long as you have a jacket. We could
describe this as a decision list like this:

  if rainy then no
  else if warm then yes
  else if not(have-jacket) then no
  else if snowy then yes
  else no.

(continue next time....)
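Referring back to the conjunctions algorithm in the mistake-bound section,
here is a short Python sketch (my own code, not from the notes; the
representation of a literal as an (index, sign) pair and the small example
target are my choices).

    def learn_conjunction(examples, n):
        """Mistake-bound learner for conjunctions over x_0..x_{n-1}.
        examples: on-line sequence of (x, label), x a 0/1 tuple, label 0/1.
        A literal (i, True) means x_i, (i, False) means not(x_i)."""
        h = {(i, s) for i in range(n) for s in (True, False)}  # start with all 2n literals
        mistakes = 0
        for x, label in examples:
            predict = int(all(x[i] == 1 if sign else x[i] == 0 for i, sign in h))
            if predict == label:
                continue
            mistakes += 1
            if label == 1:
                # mistake on a positive: drop every literal the example does not satisfy
                h = {(i, sign) for i, sign in h if (x[i] == 1) == sign}
            else:
                # mistake on a negative: impossible if the data is consistent with
                # some conjunction, since the target's literals always stay in h
                raise ValueError("no consistent conjunction")
        return h, mistakes

    # Example: target is x_1 AND not(x_3), i.e. literals (0, True) and (2, False).
    data = [((1, 0, 0), 1), ((1, 1, 0), 1), ((0, 1, 0), 0), ((1, 1, 1), 0)]
    h, m = learn_conjunction(data, n=3)
    assert {(0, True), (2, False)} <= h and m <= 4   # target literals are kept; few mistakes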
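And here is the walk rule from the decision-list example above written out
in code (again my own sketch; representing a decision list as an ordered
list of (condition, output) pairs with a default at the end is one natural
choice, not something the notes specify).

    def evaluate(decision_list, default, example):
        """Return the output of the first rule whose condition holds."""
        for condition, output in decision_list:
            if condition(example):
                return output
        return default

    walk_rule = [
        (lambda e: e["rainy"], "no"),
        (lambda e: e["warm"], "yes"),
        (lambda e: not e["have-jacket"], "no"),
        (lambda e: e["snowy"], "yes"),
    ]

    day = {"rainy": False, "warm": False, "have-jacket": True, "snowy": True}
    print(evaluate(walk_rule, default="no", example=day))   # -> yes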