Linguistic Priors

Data-driven techniques are widely used in natural language processing. However, these techniques require large amounts of data to perform well, and even with large amounts of data there is always a long tail of infrequent linguistic events. Most words, for example, occur only a few times even in a very large corpus. Poor statistical estimation of these rare events therefore remains a problem for data-driven techniques, especially when only small amounts of data are available.

One proposed solution is to augment corpus-derived statistics with linguistic knowledge, available in the form of existing lexical and semantic resources. Such resources include lexical databases like WordNet, knowledge bases like Cyc, thesauri like Roget's, and machine-readable dictionaries like the Longman Dictionary of Contemporary English. These linguistic resources have been used for many natural language processing tasks, such as resolving syntactic ambiguity, identifying spelling errors, and disambiguating word senses. However, because they are not frequency-based, it is not clear in general how to use them within a statistical framework. We investigate the usefulness of the information encoded in the lexical database WordNet for two language-modelling tasks: reducing the perplexity of a bigram language model trained on very little data, and finding longer-distance correlations in text.