Statistical Language Modeling Using Grammatical Information

Abstract

We propose to investigate the use of grammatical information to build improved statistical language models. Until recently, language models were influenced primarily by local lexical constraints. Today, language models often utilize longer-range lexical information to aid in their predictions. All of these language models ignore grammatical considerations other than those implicitly induced by the statistics of lexical constraints. We believe that properly incorporating additional grammatical structure will yield improved language models.

We will use link grammar as our grammatical base. Because the link grammar formalism is highly lexical in nature, it will allow us to integrate more traditional modeling schemes with grammatical ones. An efficient, robust link grammar parser will assist in this undertaking.

We will initially build finite-state language models that utilize relatively simple grammatical information, such as part-of-speech data, along with the information sources used by other language models. Our models feature a new framework for probabilistic automata that makes use of hidden data to construct context-sensitive probabilities. The maximum entropy principle employed by these {\em Gibbs-Markov models} facilitates the easy integration of multiple information sources.
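As an illustrative sketch only (the particular features and conditioning information used in our Gibbs-Markov models are not specified in this abstract), a conditional maximum entropy model of Gibbs form assigns probabilities
\[
  P_\Lambda(w \mid h) \;=\; \frac{1}{Z_\Lambda(h)} \exp\Bigl(\sum_i \lambda_i f_i(h, w)\Bigr),
  \qquad
  Z_\Lambda(h) \;=\; \sum_{w'} \exp\Bigl(\sum_i \lambda_i f_i(h, w')\Bigr),
\]
where $h$ is the conditioning context (for example, the automaton state together with any hidden data), $w$ is the predicted word, each feature $f_i$ encodes one information source, and $\lambda_i$ is its weight. Adding a new information source then amounts to adding new features $f_i$, rather than redesigning the model.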

We will also build language models that take greater advantage of link grammar by including more sophisticated grammatical considerations. These models will include both probabilistic automata and models more closely related to the link grammar formalism.

The expected contributions of this work are to demonstrate that grammatical information can be used to construct language models with low perplexity, and that such models can be used to reduce the error rates of speech recognition systems.
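For concreteness, perplexity here refers to the standard test-set perplexity of a model $P$ on a held-out word sequence $w_1, \ldots, w_N$,
\[
  PP \;=\; P(w_1, \ldots, w_N)^{-1/N}
      \;=\; 2^{-\frac{1}{N} \sum_{t=1}^{N} \log_2 P(w_t \mid w_1, \ldots, w_{t-1})},
\]
so that lower perplexity corresponds to a tighter probabilistic fit to unseen text.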