Inference and Estimation of a Long-Range Trigram Model

Abstract

We describe an implementation of a simple probabilistic link grammar. This probabilistic language model extends trigrams by allowing a word to be predicted not only from the two immediately preceeding words, but potentially from any preceeding pair of adjacent words that lie within the same sentence. In this way, the trigram model can skip over less informative words to make its predictions. The underlying ``grammar'' is nothing more than a list of pairs of words that can be linked together with one or more intervening words; this word-pair grammar is automatically inferred from a corpus of training text. We present a novel technique for indexing the model parameters that allows us to avoid all sorting in the M-step of the training algorithm. This results in significant savings in computation time, and is applicable to the training of a general probabilistic link grammar. Results of preliminary experiments carried out for this class of models are presented.