15-859(B) Machine Learning Theory 03/26/02 * Maximum likelihood and Bayesian learning * Hidden Markov Models * EM and Baum-Welsh =========================================================================== Maximum likelihood hypotheses ============================= On the first day of class, we talked about the consistency model: given a set of examples, find a consistent h in C if one exists. This is natural if we assume there really is a deterministic target function in class C. A generalization of this is to assume data is generated through a probabilistic process, and our goal is to find the function that maximizes the probability of the labels we saw. This is called the "maximum likelihood" hypothesis. For example, lets say we want to fit a line to some data points (x_i,y_i). Usually, there is no perfect line, so a common thing to do is to fit the line f(x) that minimizes the mean squared error: (1/m) * sum_i (y_i - f(x_i))^2. Here is one rationale for doing that: Let's imagine there really is a straight-line target function f(x), but then the labels got corrupted by gaussian noise. I.e. y = f(x) + e, where e is gaussian distributed. Then we could ask "what is the hypothesis that maximizes the probability density of the labels we saw?" Why is this the same as the one that minimizes mean squared error? Bayesian learning ================= The term "maximum likelihood" is a little weird because it sounds like you are asking for the most likely *hypothesis*. To talk about that, you need to talk about priors. Bayes rule: Pr(h | data) = Pr(data | h) * Pr(h) / Pr(data). Pr(h) is the prior on h. The "MAP" (maximum a posteriori) hyp is the one that maximizes P(h | data). Can factor out Pr(data) since it's the same for all hyps. For example: suppose someone gets a funny rash and thinks they might have anthrax. Hyps h_a (anthrax) and h_na (no anthrax). Suppose this really is a symptom: a random person without anthrax has only a 2% chance of having a funny rash, but 100% of anthrax cases do have funny rashes. But, most people don't have anthrax: Pr(h_a) = 0.0001 (1/100 of 1%). Let's calculate the odds: Pr(h_a | data) Pr(data | h_a) * Pr(h_a) 1.0 * 0.0001 -------------- = ------------------------ = ------------- = 0.005 Pr(h_na | data) Pr(data | h_na)*Pr(h_na) 0.02 * 0.9999 In other words, your odds of having anthrax have gone up by a factor of 50 (1.0 / 0.02), but they are still small. Recap definitions: Maximum likelihood hypothesis: argmax_h Pr(data | h) MAP hypothesis: argmax_h Pr(h | data) = argmax_h Pr(data|h)*Pr(h) Bayesian-optimal prediction of label of new example: argmax_label sum_h Pr(label | h)*Pr(h | data) Gibbs algorithm: pick label from distribution Pr(label) = sum_h Pr(label | h)*Pr(h | data) Expectation-Maximization ======================== -> see slides Learning Hidden Markov Models. ============================== -> this problem is a lot harder than DFA problem because (A) we don't get to experiment, we only get to observe, and (B) the state transitions are probabiistic. As a result, we won't be able to say too much theoretically. On the other hand, there's a nice algorithm that is used in practice called the Baum-Welsh algorithm. Like backprop in neural networks it isn't guaranteed to converge in poly time, but in practice converges pretty quickly. The real problem (in practice and in theory) is that it is only guaranteed to find a *local* optimum. -> see HMM slides. Solving Evaluation problem: dynamic programming. E.g., "BBA" t = 0 1 2 3 ----------------------------------------------------------------- | alpha_0(t)| 1---0.5*1/3--->1/6-----0.5*1/3-->0.028-----0.5*2/3---->0.0401 | \ \ / | \___0.5_____ \______0.5____ ______0.1_____/ | \ \ / alpha_1(t)| 0 1/2----0.9*1/2--->0.308------0.9*1/2---->0.13875