Language Modeling with Power Low Rank Ensembles

This is the code site for the paper:

A.P. Parikh, A. Saluja, C. Dyer, and E.P. Xing, Language Modeling with Power Low Rank Ensembles, To Appear in the Conference on Empirical Methods in Natural Language Processing (EMNLP 2014) [pdf][supplemental]

Here is the code zip containing a C++ implementation of the PLRE algorithm described in the paper. The code has two external dependencies that must be downloaded separately:

Eigen 3.2.2, a matrix library (headers only, does not need to be compiled)

Boost 1.55, a C++ library that needs to be compiled with the serialization option

Compilation instructions can be found in the README file.

The input data must be formatted to include start/end markers and an out-of-vocabulary (oov) token. For example, consider the dataset:

the house is green .
the bricks are red !

This must be reformatted to:

<s> the house is green . </s>
<s> the bricks are red ! </s>

Furthermore, one symbol must be designated as the out-of-vocabulary (oov) token, which stands in for words that appear in the test set but not in the training set. For the datasets in the paper, words that appeared only once in the training set were replaced with the oov token. The small-russian dataset is included in the code zip as an example. Feel free to contact me if you have any questions.
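As a rough sketch of the singleton replacement used for the paper's datasets, the following C++ function counts word frequencies in the training sentences and rewrites every word that occurs exactly once as the oov token. The function name replaceSingletons is hypothetical and not taken from the released code, the oov symbol is left as a parameter because this page does not fix one, and leaving the <s> and </s> markers untouched is an assumption.

```cpp
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical helper: replace every word that occurs exactly once in
// `sentences` with `oovToken`. Sentences are assumed whitespace-tokenized.
std::vector<std::string> replaceSingletons(
        const std::vector<std::string>& sentences,
        const std::string& oovToken) {
    // First pass: count how often each word occurs in the training data.
    std::unordered_map<std::string, int> counts;
    for (const std::string& sentence : sentences) {
        std::istringstream tokens(sentence);
        std::string word;
        while (tokens >> word) ++counts[word];
    }

    // Second pass: rebuild each sentence, swapping singletons for the oov token.
    std::vector<std::string> result;
    result.reserve(sentences.size());
    for (const std::string& sentence : sentences) {
        std::istringstream tokens(sentence);
        std::string word;
        std::string rebuilt;
        while (tokens >> word) {
            // Assumption: sentence markers are never replaced.
            const bool isMarker = (word == "<s>" || word == "</s>");
            if (!isMarker && counts[word] == 1) word = oovToken;
            if (!rebuilt.empty()) rebuilt += ' ';
            rebuilt += word;
        }
        result.push_back(rebuilt);
    }
    return result;
}
```

At test time, the same token would be substituted for any word missing from the training vocabulary. The symbol itself (for example "<unk>") is a free choice here.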