Homework 8
Out: Feb-10 Due: Feb-15, Wednesday night (12:00 midnight)
To submit: Send the NFS path containing your work to Stan (scjou@cs.cmu.edu).

In this homework we are going to build language models (LMs). Please follow the steps below:
- Use the SRI LM Toolkit and follow Exercise-8 to complete the tasks below. The training, development, and test sets are /project/Class-11-753/data/CH/trl.utf8.set/trl.utf8.{train|dev|test}, respectively.
- Task 8-2: Build word-based language models (1-, 2-, and 3-gram) for Mandarin from the training data and measure their perplexity on the training and development sets.
- Task 8-3: Build character-based language models (1-gram through 6-gram) for Mandarin from the training data and measure their perplexity on the training and development sets.
- To estimate character-based language models, the data sets must be segmented into single-character words. That is, every word in the corpora should contain exactly one UTF-8 character, and adjacent Mandarin characters should be separated by at least one space (ASCII \x20).
- Task 8-4: Collect more language model data and add it to the training data. Build language models and measure the perplexity.
- You may collect data from the web, use existing data you have access to, or use the small data set we provide (/project/Class-11-753/data/CH/lm-data/FBIS.2004.prep.utf8). Note that the FBIS data set is segmented with a different vocabulary from the GlobalPhone data we use. (This means the task is much easier with character-based language models, because the segmentation problem no longer arises.) Also note that the FBIS data set contains punctuation marks, which should be removed with appropriate text normalization procedures.
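The character segmentation and punctuation removal described above could be sketched in Python roughly as follows. This is only an illustration, not part of Exercise-8: the function names strip_punctuation and to_char_tokens are made up here, and treating every character in a Unicode "P" (punctuation) category as removable is an assumption about what "appropriate text normalization" should cover. Note that the sketch also splits ASCII letters and digits into single characters, which matches the one-character-per-word requirement but may not be what you want for other data.

```python
import unicodedata


def strip_punctuation(line: str) -> str:
    """Replace each Unicode punctuation character with an ASCII space.

    Assumption: "punctuation" means any character whose Unicode general
    category starts with "P" (this covers both ASCII and full-width marks).
    """
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in line
    )


def to_char_tokens(line: str) -> str:
    """Rewrite a line so that every token is exactly one character.

    All existing whitespace (including the full-width space U+3000) is
    dropped, and the remaining characters are joined with a single \x20.
    """
    return " ".join(ch for ch in line if not ch.isspace())


if __name__ == "__main__":
    sample = "你好，世界。"
    print(to_char_tokens(strip_punctuation(sample)))  # prints: 你 好 世 界
```

To process a whole file, read it with open(path, encoding="utf-8"), apply the two functions line by line, and write the result with the same encoding. The output can then be passed to the language model tools in the same way as the word-segmented data.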
Please send the NFS paths of your work to Stan. If you have problems with UTF-8 processing or the Mandarin language itself, please feel free to discuss them with Stan. Currently, running UTF-8 processing on spoon.is seems to be more stable. Since spoon.is is a major computation machine, please be considerate when you run your jobs and DO NOT overload it. Thanks!
Last modified: Fri Feb 10 15:07:04 EST 2006
Maintainer: scjou@cs.cmu.edu.