TurboParser for Kinyarwanda and Malagasy
Trained by Noah Smith (nasmith@cs.cmu.edu)
April 2013

First, install TurboTagger and TurboParser version 2.0.2:

    http://www.cs.cmu.edu/~afm/TurboParser/TurboParser-2.0.2.tar.gz

The scripts scripts/kin-parse.pl and scripts/mlg-parse.pl will tag and
parse Kinyarwanda and Malagasy sentences, respectively. The input is
tokenized text, one sentence per line; the output is CoNLL dependency
format. You should modify the three annotated lines in these scripts so
that they find the TurboTagger and TurboParser executables and put the
temporary files where you'd like them to go.

Example usage:

    > scripts/kin-parse.pl from-Tahira/test/kin.words.test.punc.50

(The output will be identical to
generated/kin/basic.C1e-3.tags-from-bigram-tagger.C1e-3.pred)

Model files:

    Kinyarwanda tagger: generated/kin/bigram-tagger.C1e-3.model
    Kinyarwanda parser: generated/kin/basic.C1e-3.tags-from-bigram-tagger.C1e-3.model
    Malagasy tagger:    generated/mlg/bigram-tagger.C1e-2.model
    Malagasy parser:    generated/mlg/full+cl.C100.tags-from-bigram-tagger.C1e-2.model

Documentation of the procedure for generating the models:

1. We begin with manually POS-tagged and parsed data from the Linguistic
   Core MURI project, prepared by Jason Baldridge and colleagues at the
   University of Texas. There are 270 sentences for Kinyarwanda and 168
   for Malagasy. The annotations include head rules. The data were
   transformed into basic unlabeled dependency trees by Tahira Naseem at
   MIT.

2. Our main training set consists of all sentences in the data except
   those in Tahira's "test" set or the 10 sentences per language she
   held out for semisupervised learning ("semi"). For Kinyarwanda there
   are 190 training instances; for Malagasy, 107. We produced 10 random
   splits of each language's dataset, called split-0 ... split-9, each
   of which holds out a 10% held-out set. These splits are used for
   picking hyperparameters.

   Training data (including gold-standard tags and dependencies):

       generated/kin/train.conll
       generated/mlg/train.conll

3. For each language, we train POS taggers with settings (S, G, C),
   where:
   - S is the split (0, 1, ..., 9); train on the 90% subset for that
     split
   - G is the tagging model, bigram or trigram
   - C is the regularization strength (100, 10, 1, 1e-1, 1e-2, 1e-3,
     1e-4), with a lower value indicating stronger regularization
   Training used TurboTagger v. 2.0.2. For each tuple, we calculate
   tagging accuracy on S's held-out set.

4. For each language, we pick the pair (G, C) that achieves the best
   average tagging accuracy over the held-out sets. These turn out to
   be:
   - Kinyarwanda: bigram, 1e-3, 83.0% (+/- 1.6%)
   - Malagasy: bigram, 1e-2, 83.5% (+/- 4.3%)

5. In a 10-fold cross-validation setup, we train the selected models on
   9 folds and tag the 10th, then concatenate the outputs to get an
   auto-tagged dataset for training the parser (a sketch of this
   jackknifing follows this step). The tagging performance of these
   models on Tahira's test set is:
   - Kinyarwanda: 80.3%
   - Malagasy: 82.6%

   These results are found in the files:

       generated/mlg/bigram-tagger.C1e-2.pred.log
       generated/kin/bigram-tagger.C1e-3.pred.log

   CoNLL files with these automatic tags (and gold-standard
   dependencies):

       generated/kin/train-tags-from-bigram-tagger.C1e-3.conll
       generated/kin/test-tags-from-bigram-tagger.C1e-3.conll
       generated/mlg/train-tags-from-bigram-tagger.C1e-2.conll+cl
       generated/mlg/test-tags-from-bigram-tagger.C1e-2.conll+cl
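For concreteness, here is a minimal Python sketch of the jackknifing in
step 5. It is not part of the released scripts: the helper run_tagger
(train file, input file, output file) is a hypothetical wrapper around
the actual TurboTagger train/test invocations, and the fold file names
are illustrative.

    # Jackknife tagging sketch: tag each fold with a tagger trained on
    # the other folds, so no sentence is tagged by a model that saw it.
    # run_tagger(train_conll, input_conll, output_conll) is hypothetical.

    def read_sentences(path):
        """Read a CoNLL file as a list of sentences (lists of lines)."""
        sentences, current = [], []
        with open(path) as f:
            for line in f:
                line = line.rstrip("\n")
                if line:
                    current.append(line)
                elif current:
                    sentences.append(current)
                    current = []
        if current:
            sentences.append(current)
        return sentences

    def write_sentences(path, sentences):
        with open(path, "w") as f:
            for sent in sentences:
                f.write("\n".join(sent) + "\n\n")

    def jackknife_tag(train_path, run_tagger, k=10):
        sentences = read_sentences(train_path)
        folds = [sentences[i::k] for i in range(k)]
        tagged = []
        for i in range(k):
            # Train on all folds except fold i, then tag fold i.
            held_in = [s for j, f in enumerate(folds) if j != i for s in f]
            write_sentences("fold-train.conll", held_in)
            write_sentences("fold-input.conll", folds[i])
            run_tagger("fold-train.conll", "fold-input.conll",
                       "fold-output.conll")
            tagged.extend(read_sentences("fold-output.conll"))
        # Concatenated auto-tagged data for parser training.
        write_sentences("train-auto-tagged.conll", tagged)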
6. Now, parsing. We trained unlabeled dependency parsers with settings
   (S, P, L, C, A), where:
   - S is the split (0, 1, ..., 9); train on the 90% subset for that
     split
   - P is the parsing model (basic or full)
   - L indicates inclusion of clusters (for Malagasy only); these are
     coarse- and fine-grained features from Chris Dyer's run of Brown
     clustering.
   - C is the regularization strength (100, 10, 1, 1e-1, 1e-2, 1e-3,
     1e-4), with a lower value indicating stronger regularization
   - A indicates whether gold-standard or automatic tags (from the
     tagger above) were given to the parser.
   Training used TurboParser v. 2.0.2. For each tuple, we calculate
   dependency accuracy on S's held-out set.

7. For each language, we pick the tuple (P, L, C) that achieves the
   best average dependency accuracy over the held-out sets (A is fixed
   to use automatic tags; a sketch of this selection follows step 8).
   These turn out to be:
   - Kinyarwanda: basic, 1e-3, 70.8% (+/- 4.4%)
   - Malagasy: full, with clusters, 100, 72.9% (+/- 4.1%)

   Test data parses can be found in:

       generated/kin/basic.C1e-3.tags-from-bigram-tagger.C1e-3.pred
       generated/mlg/full+cl.C100.tags-from-bigram-tagger.C1e-2.pred

8. We train the dependency parsers with the selected settings on
   automatic tags (each instance is tagged by a tagger that was not
   trained on it, via the cross-validation setup above). The results on
   Tahira's test set are:
   - Kinyarwanda: 70.1%
   - Malagasy: 72.4%

   These results are found in the files:

       generated/kin/basic.C1e-3.tags-from-bigram-tagger.C1e-3.pred.log
       generated/mlg/full+cl.C100.tags-from-bigram-tagger.C1e-2.pred.log
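The model selection in steps 4 and 7 is the same procedure: average
each setting's held-out accuracy over the 10 splits and take the best.
A minimal Python sketch follows; the accuracy callback and the settings
grid shown are illustrative, not part of the released scripts.

    # Pick the setting with the best mean held-out accuracy across splits.
    from statistics import mean, stdev

    def pick_best(settings, splits, accuracy):
        """accuracy(setting, split) -> held-out score for that split.
        Returns (best_setting, mean_accuracy, stdev_accuracy)."""
        best = None
        for setting in settings:
            scores = [accuracy(setting, s) for s in splits]
            if best is None or mean(scores) > best[0]:
                best = (mean(scores), setting, stdev(scores))
        mean_acc, setting, sd = best
        return setting, mean_acc, sd

    # Example grid for the tagger settings (G, C) of step 3:
    settings = [(g, c) for g in ("bigram", "trigram")
                for c in (100, 10, 1, 1e-1, 1e-2, 1e-3, 1e-4)]
    # best, acc, sd = pick_best(settings, range(10), accuracy)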
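Finally, a sketch of how dependency accuracies like those in steps 7-8
can be computed from CoNLL files: unlabeled attachment score is the
fraction of tokens whose predicted head matches the gold head (the HEAD
column is field 7 of CoNLL-X). This simple version scores every token
and ignores any punctuation-exclusion convention, so it may not
reproduce the reported numbers exactly.

    def heads(path):
        """Yield the HEAD field for every token line of a CoNLL file."""
        with open(path) as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) > 6:
                    yield fields[6]

    def uas(gold_path, pred_path):
        # Assumes the two files contain the same tokens in the same order.
        pairs = list(zip(heads(gold_path), heads(pred_path)))
        return sum(g == p for g, p in pairs) / len(pairs)

    # e.g. uas("generated/kin/test-tags-from-bigram-tagger.C1e-3.conll",
    #          "generated/kin/basic.C1e-3.tags-from-bigram-tagger.C1e-3.pred")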