This is a revision. As Rosie, Hua Yu, and others have pointed out, there are several bugs in the original homework problem. It took me about an hour to work it all out carefully. (Do not infer too much from these numbers -- all of them, including the counts, are made up for this problem.)

Extra information:

Delete the unigram count "COME 29". "COME FROM" has a count of 30, so "COME" should have a unigram count of 43.

There are a total of 15042 words in the corpus.

Since the trigram "THE CAT CAUGHT" exists, the bigram "THE CAT" must exist. Summing over the trigram counts gives 17. (This was my error; you were not expected to do this.)

    THE CAT 17

The same holds for the bigram count of "WHERE DID". Use:

    WHERE DID 1

================================================================

[Revised Feb-17-1997; all changes marked with a leading !]

This exercise is meant to solidify your understanding of how backing-off language models work. You should send your answer to me (Paul) by Friday Feb. 21.

Using the tables and other information included below, find the per-sentence perplexity and the word-by-word probabilities of the following two sentences:

    1. THE CAT CAUGHT A BROWN MOUSE
    2. WHERE DID THE MOUSE COME FROM

Pretend you are using a Katz-style trigram language model, based on the counts, alpha values, and discounting table below. (Calculate the first probability of each sentence as a bigram.)

! There are a total of 15042 words in the corpus.

Discounting table (only 1 through 5 are actually discounted):

    c | d_c
    -------
    1 | 0.2
    2 | 0.8
    3 | 2.1
    4 | 3.6
    5 | 4.75
    6 | 6
    7 | 7
    ...
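A minimal Python sketch of one way to read the discounting table above. It assumes that d_c is the discounted count that replaces a raw n-gram count c in the numerator of a conditional probability, so that a seen n-gram gets d_c / count(context). That reading is an assumption, as are the names DISCOUNT, discounted_count, and seen_prob, which are hypothetical.

```python
# Discounted counts d_c copied from the table; counts above 5 are left unchanged.
DISCOUNT = {1: 0.2, 2: 0.8, 3: 2.1, 4: 3.6, 5: 4.75}


def discounted_count(c):
    """Return d_c for a raw count c (assumed to replace c in the numerator)."""
    if c <= 0:
        raise ValueError("discounting applies only to n-grams that were seen")
    return DISCOUNT.get(c, float(c))


def seen_prob(ngram_count, context_count):
    """Probability of a seen n-gram: discounted count over raw context count."""
    return discounted_count(ngram_count) / context_count
```

For instance, seen_prob(2, 17) would correspond to a trigram seen twice whose two-word context was seen 17 times, such as "THE CAT CAUGHT" with the revised bigram count "THE CAT 17".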
Trigrams:

    THE BROWN           11
    THE CAT              9
    THE DOG             11
    THE QUICK           11
    WHERE DID            1
    WHERE WAS            5
    CAT CAUGHT FIVE      2
    CAT CAUGHT THE       3
    CAUGHT A COLD        9
    CAUGHT A FISH        3
    CAUGHT A MOUSE       4
    CAUGHT THE BALL      7
    CAUGHT THE FLU       5
    COME FROM            2
    COME FROM HOME       6
    COME FROM THE        9
    THE CAT CAUGHT       2
    THE CAT CHASED       5
    THE CAT RAN          4
    THE CAT SAT          6
    THE MOUSE ATE        1
    THE MOUSE RAN        1
    WHERE DID THE        1

Bigrams:

    A                  157
    CAN                 39
    THE                200
    WHERE                7
    WILL                40
    A BALL               9
    A BROWN              3
    A DOG               10
    CAUGHT A            20
    CAUGHT FIVE          7
    CAUGHT THE          23
    COME FROM           30
    MOUSE                7
  ! THE CAT             17
    THE MOUSE            3
  ! WHERE DID            1

Unigrams:

    2150              2150
    A                  857
    BALL                21
    BROWN               42
    CAT                 39
    CAUGHT             251
  ! COME                43
    DOG                 47
    FLU                  5
    MOUSE               37
    THE               1047

alphas (*):

    A BROWN           0.91
    A DOG             1.91
    CAT CAUGHT        0.79
    CAUGHT A          1.23
    CAUGHT THE        2.11
    DID THE           0.97
    DOG CAUGHT        1.09
    THE MOUSE         1.98
    BALL              1.17
    BROWN             2.10
    MOUSE             1.31

(* If the context hasn't been seen, the effective alpha is 1.)
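The back-off procedure these tables feed into can be sketched in Python as follows. This is only an illustration under stated assumptions, not a worked solution: a seen n-gram is scored as d_c / count(context); an unseen one as alpha(context) times the probability under the shortened context; a unigram is scored as count / 15042; and perplexity is taken as the exponential of the negative mean log probability. Any context missing from the alpha table is given alpha 1 here, which the footnote licenses only for unseen contexts. All names (COUNTS, ALPHAS, katz_prob, perplexity) are hypothetical, and only the last four words of sentence 1 are scored.

```python
import math

DISCOUNT = {1: 0.2, 2: 0.8, 3: 2.1, 4: 3.6, 5: 4.75}  # d_c from the discounting table
TOTAL_WORDS = 15042

# Only the table entries needed for this demonstration are copied here.
COUNTS = {
    ("THE", "CAT", "CAUGHT"): 2,
    ("THE", "CAT"): 17,
    ("CAUGHT", "A"): 20,
    ("A", "BROWN"): 3,
    ("CAUGHT",): 251,
    ("A",): 857,
    ("MOUSE",): 37,
}
ALPHAS = {
    ("CAT", "CAUGHT"): 0.79,
    ("CAUGHT", "A"): 1.23,
    ("A", "BROWN"): 0.91,
    ("BROWN",): 2.10,
}


def katz_prob(word, context):
    """Back-off probability of `word` given a tuple of 0 to 2 preceding words."""
    if not context:
        # Assumption: unigram probability is the plain relative frequency.
        return COUNTS.get((word,), 0) / TOTAL_WORDS
    c = COUNTS.get(context + (word,), 0)
    if c > 0:
        # Seen n-gram: discounted count over the raw count of its context.
        return DISCOUNT.get(c, float(c)) / COUNTS[context]
    # Unseen n-gram: scale the lower-order estimate by the context's alpha
    # (1.0 when the context has no entry in the alpha table).
    return ALPHAS.get(context, 1.0) * katz_prob(word, context[1:])


def perplexity(probs):
    """Geometric-mean inverse probability of a sequence of word probabilities."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))


words = ["THE", "CAT", "CAUGHT", "A", "BROWN", "MOUSE"]
# Score CAUGHT, A, BROWN, MOUSE, each given its two preceding words.
tail_probs = [katz_prob(words[i], tuple(words[i - 2:i])) for i in range(2, len(words))]
for w, p in zip(words[2:], tail_probs):
    print(f"{w}: {p:.6g}")
```

In this sketch CAUGHT is scored directly from the trigram table, A and BROWN back off once to a bigram, and MOUSE backs off twice to a unigram. Passing a full list of word probabilities to perplexity would give a per-sentence figure under the definition assumed above.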