LRDEP (a.k.a. KSDEP)


A dependency parser developed by
Kenji Sagae
at Tsujii Lab, University of Tokyo
and USC's Institute for Creative Technologies


This is the dependency parser described in

Sagae, K., Tsujii, J. 2007. Dependency parsing and domain adaptation with LR models and parser ensembles. Proceedings of the CoNLL 2007 Shared Task. Joint Conferences on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL'07). Prague, Czech Republic.

and used in the experiments in

Miyao, Y., Saetre, R., Sagae, K., Matsuzaki, T. and Tsujii, J. 2008. Task-oriented Evaluation of Syntactic Parsers and Their Representations. Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics (ACL'08:HLT).

Models for GENIA (biomedical domain) and WSJ are provided. A Linux binary is currently available.

For parsing biomedical text, use GDep (GENIA Dependency parser). GDep is a version of KSDep for biomedical text.

Source code and a Mac OS X (Intel) binary are available for a newer implementation of the parser (models are not compatible with previous versions, and only a WSJ model is provided for now).

Usage:
(Usage for the new version is slightly different. See the README file included in the version you download for usage information.)
./ksdep -m MODEL -b 10 INPUTFILE
where MODEL can be
- wsj.mod (a model trained on PTB WSJ 02-21)
- genia.mod (a model trained on the GENIA Treebank)
- combo-genia-wsj.mod (a model trained on GENIA + WSJ)
and INPUTFILE is in CoNLL-X format.
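For readers who have not worked with CoNLL-X before: the format uses one token per line with ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line after each sentence. The Python sketch below shows one way to produce such a file from tokenized, POS-tagged sentences. The function name to_conllx is hypothetical, and filling the unknown columns (including HEAD and DEPREL) with "_" is an assumption; this page does not say which placeholder values the parser expects.

```python
# Column order defined by the CoNLL-X shared task format.
CONLLX_COLUMNS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
                  "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def to_conllx(sentences):
    """Render tagged sentences as CoNLL-X text (hypothetical helper).

    `sentences` is a list of sentences; each sentence is a list of
    (word, pos_tag) pairs. Columns with no known value are set to "_",
    which is an assumption about what the parser accepts.
    """
    blocks = []
    for sentence in sentences:
        rows = []
        for index, (word, tag) in enumerate(sentence, start=1):
            fields = {name: "_" for name in CONLLX_COLUMNS}
            fields["ID"] = str(index)      # token ids start at 1
            fields["FORM"] = word
            fields["CPOSTAG"] = tag        # same tag in both POS columns
            fields["POSTAG"] = tag
            rows.append("\t".join(fields[name] for name in CONLLX_COLUMNS))
        blocks.append("\n".join(rows))
    # Every sentence is followed by a blank line.
    return "".join(block + "\n\n" for block in blocks)
```

The returned string can be written to a file and passed as INPUTFILE.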

The -b option controls the beam width. For deterministic parsing, use -b 1. Larger values (for example, -b 1000) may improve accuracy slightly, at the expense of much greater computational cost.
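To compare beam widths from a script, the command line shown in the usage can be wrapped as in the sketch below. The function names build_parse_command and parse are hypothetical, the binary is assumed to be ./ksdep in the current directory, and the assumption that the parse is written to standard output is not confirmed on this page.

```python
import subprocess


def build_parse_command(model, input_file, beam=10, binary="./ksdep"):
    """Return the argument list for one parser run (hypothetical helper)."""
    if beam < 1:
        raise ValueError("beam width must be at least 1")
    # Mirrors the documented usage: ./ksdep -m MODEL -b 10 INPUTFILE
    return [binary, "-m", model, "-b", str(beam), input_file]


def parse(model, input_file, beam=10, binary="./ksdep"):
    """Run the parser and return its standard output.

    Assumes the parser prints its result to standard output.
    """
    command = build_parse_command(model, input_file, beam, binary)
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout
```

For example, parse("wsj.mod", "input.conll", beam=1) would run the deterministic parser, and the same call with a larger beam value runs the slower beam search.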

The parser can be trained using the -t option. When training, -c FLOAT sets the regularization parameter (-c 1.0 is usually a good guess; smaller values may overfit more). The -i INT option sets the number of iterations for maxent training (every 100 iterations, a snapshot of the model is written to disk). The -m STRING option sets the model name, as it does in parse mode (except that in training mode the file is created, or overwritten if it already exists).
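A training run could be assembled along the same lines. This is only a sketch: the function name build_train_command is hypothetical, the order of the options is an assumption, the default of 500 iterations is an arbitrary illustration, and passing the training data as a final positional argument (as INPUTFILE is passed in parse mode) is an assumption as well. Check the README shipped with your version before relying on it.

```python
def build_train_command(model, training_file, c=1.0, iterations=500,
                        binary="./ksdep"):
    """Return the argument list for a training run (hypothetical helper).

    Uses only the options described above: -t, -m, -c and -i.
    The position of `training_file` is an assumption.
    """
    if iterations < 1:
        raise ValueError("iterations must be at least 1")
    return [binary, "-t",
            "-m", model,            # model file to create or overwrite
            "-c", str(c),           # regularization parameter
            "-i", str(iterations),  # number of maxent training iterations
            training_file]
```

The resulting list can be passed to subprocess.run. Keep in mind that the model file named by -m is overwritten if it already exists.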

Download:

GDep (GENIA Dependency parser)
GDep is a version of KSDep that does part-of-speech tagging, named entity recognition and dependency parsing, tuned specifically for biomedical text using the GENIA Treebank. GDep takes plain text as input, with one sentence per line.

KSDep with a GENIA model
Use this for parsing biomedical text if you want to use your own part-of-speech tagger (for plain text input, use GDep instead). Input must be in CoNLL-X format, with part-of-speech tags as in the GENIA Treebank. Included: Linux binary and source code. To build on Mac OS X or Windows (with Cygwin), just type "make" in the directory where you unpacked the files.

WSJ tagging + parsing
This version includes tokenization and POS tagging (by Yoshimasa Tsuruoka), and takes plain text sentences as input. Source code with Linux binary and WSJ models for tagging and parsing. Don't try to train with this parser; it won't work. If you need to train new parsing models, use the one below.

(New version) Source code with Mac OS X binary and a WSJ model. Note: Models created with the previous version of the parser cannot be used with this version.

Linux binary with WSJ and GENIA models. Older parser, but should work fine under Linux.

Anyone is free to download and use the parser and the models included. However, because this is an alpha release, I strongly recommend you contact me (sagae+lrdep at cs dot cmu dot edu) if you want to do anything beyond simple testing.