This is a revision. As Rosie, Hua Yu, and others have pointed out,
there are several bugs in the original homework problem. It took me
about an hour to work it all out carefully.
(Do not infer too much from these numbers -- all of them, including
the counts, are made up for this problem.)
Extra information:
Delete the unigram count "COME 29". The bigram "COME FROM" has a
count of 30, so "COME" must occur at least 30 times; use a unigram
count of 43.
There are a total of 15042 words in the corpus.
Since the trigram "THE CAT CAUGHT" exists, the bigram "THE CAT" must
exist. Summing the counts of the four "THE CAT" trigrams
(2 + 5 + 4 + 6) gives 17. (This was my error; you were not expected
to work this out.)
THE CAT 17
The same holds for the bigram count of "WHERE DID". Use:
WHERE DID 1
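The consistency argument above can be checked mechanically: every occurrence of a trigram contains an occurrence of the bigram it starts with, so the bigram count can be recovered (as a lower bound) by summing over the listed trigrams. A minimal Python sketch follows; the helper name `prefix_sum` is made up for illustration, and only the relevant trigram rows are copied in.

```python
# Trigram counts copied from the tables in this problem (relevant rows only).
trigrams = {
    ("THE", "CAT", "CAUGHT"): 2,
    ("THE", "CAT", "CHASED"): 5,
    ("THE", "CAT", "RAN"): 4,
    ("THE", "CAT", "SAT"): 6,
    ("WHERE", "DID", "THE"): 1,
}

def prefix_sum(w1, w2):
    """Sum the counts of all listed trigrams that begin with the bigram (w1, w2)."""
    return sum(c for (a, b, _), c in trigrams.items() if (a, b) == (w1, w2))

print(prefix_sum("THE", "CAT"))    # 17
print(prefix_sum("WHERE", "DID"))  # 1
```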
[Revised Feb-17-1997; all changes marked with a leading !]
This exercise is meant to solidify your understanding of how
backing-off language models work. You should send your answer to me
(Paul) by Friday Feb. 21.
Using the tables and other information included below, find the
per-sentence perplexity and the word-by-word probabilities of the
following two sentences:
1. <s> THE CAT CAUGHT A BROWN MOUSE </s>
2. <s> WHERE DID THE MOUSE COME FROM </s>
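The problem does not spell out a definition of per-sentence perplexity. One common choice, assumed here, is the inverse geometric mean of the word probabilities, PP = (p_1 * ... * p_n)^(-1/n). Whether the closing boundary token counts toward n is a convention this sketch leaves to the caller. The function name `perplexity` is illustrative, and the sample probabilities are made up rather than taken from the tables.

```python
import math

def perplexity(word_probs):
    """Inverse geometric mean of the per-word probabilities.

    Computed in log space so that long sentences do not underflow.
    """
    if not word_probs:
        raise ValueError("need at least one probability")
    log_sum = sum(math.log(p) for p in word_probs)
    return math.exp(-log_sum / len(word_probs))

# Made-up probabilities, not derived from the tables in this problem.
print(perplexity([0.5, 0.25, 0.125]))  # approximately 4
```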
Pretend you are using a Katz-style trigram language model, based on
the counts, alpha values, and discounting table below. (Calculate the
first probability of each sentence as a bigram.)
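A minimal sketch of the backoff recursion, under assumptions the problem does not confirm: the table entry d_c is read as the discounted count for a raw count c; a seen n-gram gets d_c divided by the count of its context; an unseen n-gram backs off to the shorter context, scaled by that context's alpha; and unigrams are plain relative frequencies. The function name, argument layout, and the small data subset are hypothetical.

```python
def katz_prob(word, context, counts, alphas, discounted, total):
    """P(word | context) under the assumptions stated in the lead-in.

    context    -- tuple of 0, 1, or 2 preceding words
    counts     -- dict mapping word tuples (any length) to raw counts
    alphas     -- dict mapping context tuples to backoff weights
    discounted -- dict mapping a raw count c to its discounted count d_c
    total      -- number of words in the corpus
    """
    if not context:
        # Assumption: unigrams are undiscounted relative frequencies.
        return counts.get((word,), 0) / total
    c = counts.get(context + (word,), 0)
    if c > 0 and context in counts:
        # Seen n-gram: discounted count over the count of its context.
        return discounted.get(c, c) / counts[context]
    # Unseen n-gram: drop the oldest context word and scale by alpha.
    # A context with no listed alpha gets the effective alpha of 1.
    return alphas.get(context, 1.0) * katz_prob(
        word, context[1:], counts, alphas, discounted, total
    )

# A small subset of the tables, enough for two lookups.
counts = {
    ("THE", "CAT", "CAUGHT"): 2,
    ("THE", "CAT"): 17,
    ("CAUGHT", "A"): 20,
    ("CAUGHT",): 251,
}
alphas = {("CAT", "CAUGHT"): 0.79}
discounted = {1: 0.2, 2: 0.8, 3: 2.1, 4: 3.6, 5: 4.75}

# Seen trigram: discounted count 0.8 over count("THE CAT") = 17.
p_seen = katz_prob("CAUGHT", ("THE", "CAT"), counts, alphas, discounted, 15042)
# Unseen trigram "CAT CAUGHT A": alpha 0.79 times the bigram estimate 20 / 251.
p_backoff = katz_prob("A", ("CAT", "CAUGHT"), counts, alphas, discounted, 15042)
print(p_seen, p_backoff)
```

Keeping all n-gram orders in one dictionary keyed by tuples lets the same function handle trigram, bigram, and unigram lookups by shortening the context.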
! There are a total of 15042 words in the corpus.
Discounting table (only 1 through 5 are actually discounted):
c | d_c
--+-----
1 | 0.2
2 | 0.8
3 | 2.1
4 | 3.6
5 | 4.75
6 | 6
7 | 7
...
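The entries for 6 and 7 map to themselves, which suggests that d_c is the discounted count for a raw count c rather than a multiplicative ratio. Under that reading (an assumption, since the problem does not define d_c), the implied ratios d_c / c can be listed with a short Python sketch. The names `DISCOUNTED` and `discounted_count` are illustrative.

```python
# Discounting table from the problem; counts of 6 and above are left unchanged.
DISCOUNTED = {1: 0.2, 2: 0.8, 3: 2.1, 4: 3.6, 5: 4.75}

def discounted_count(c):
    """Return d_c, read here as the discounted count for raw count c (assumption)."""
    return DISCOUNTED.get(c, float(c))

# Print each raw count, its discounted count, and the implied ratio d_c / c.
for c in range(1, 8):
    d = discounted_count(c)
    print(c, d, round(d / c, 2))
```

On this reading the ratio rises from 0.2 at c = 1 to 0.95 at c = 5 and is 1 from c = 6 onward, so rare n-grams give up proportionally more probability mass to unseen events.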
Trigrams:
<s> THE BROWN 11
<s> THE CAT 9
<s> THE DOG 11
<s> THE QUICK 11
<s> WHERE DID 1
<s> WHERE WAS 5
CAT CAUGHT FIVE 2
CAT CAUGHT THE 3
CAUGHT A COLD 9
CAUGHT A FISH 3
CAUGHT A MOUSE 4
CAUGHT THE BALL 7
CAUGHT THE FLU 5
COME FROM </s> 2
COME FROM HOME 6
COME FROM THE 9
THE CAT CAUGHT 2
THE CAT CHASED 5
THE CAT RAN 4
THE CAT SAT 6
THE MOUSE ATE 1
THE MOUSE RAN 1
WHERE DID THE 1
Bigrams:
<s> A 157
<s> CAN 39
<s> THE 200
<s> WHERE 7
<s> WILL 40
A BALL 9
A BROWN 3
A DOG 10
CAUGHT A 20
CAUGHT FIVE 7
CAUGHT THE 23
COME FROM 30
MOUSE </s> 7
! THE CAT 17
THE MOUSE 3
! WHERE DID 1
Unigrams:
<s> 2150
</s> 2150
A 857
BALL 21
BROWN 42
CAT 39
CAUGHT 251
! COME 43
DOG 47
FLU 5
MOUSE 37
THE 1047
alphas (*):
A BROWN 0.91
A DOG 1.91
CAT CAUGHT 0.79
CAUGHT A 1.23
CAUGHT THE 2.11
DID THE 0.97
DOG CAUGHT 1.09
THE MOUSE 1.98
BALL 1.17
BROWN 2.10
MOUSE 1.31
(* If the context hasn't been seen, the effective alpha is 1.)
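The footnote's rule amounts to a table lookup with a default of 1. A small sketch, with a hypothetical helper `alpha` and only a few rows of the table copied in; treating "not listed in the table" as "not seen" is an assumption of this sketch.

```python
# A few rows from the alpha table; keys are context tuples.
ALPHAS = {
    ("A", "BROWN"): 0.91,
    ("THE", "MOUSE"): 1.98,
    ("BROWN",): 2.10,
    ("MOUSE",): 1.31,
}

def alpha(context):
    """Backoff weight for a context; 1.0 when the context is not listed (assumption)."""
    return ALPHAS.get(tuple(context), 1.0)

print(alpha(("THE", "MOUSE")))   # 1.98
print(alpha(("MOUSE", "COME")))  # 1.0, no alpha is listed for this context
# Backing off twice multiplies the weights, here from "A BROWN" down to the unigram.
print(alpha(("A", "BROWN")) * alpha(("BROWN",)))
```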
