Human Language Technology (HLT) is the essence of the 'L' in SILK and is crucial for 'S' and 'K'. In my research in HLT, my tools are information theory and statistics. My raw materials are huge amounts of text of various types. My end products are new modeling techniques, improved performance of real systems, and new insights into the statistical nature of human language.
Statistical Language Modeling is useful, often crucial, to all human language technologies. These include speech recognition, machine translation, document classification and routing, information retrieval, textual datamining, optical character recognition, handwriting recognition, spelling correction, and many others. In all these cases, language models guide the system by acting as a knowledge source and imposing soft constraints on the system's expectations.
Modeling of human language is at the intersection of statistics and traditional machine learning. Technically speaking, it can be viewed as a statistical estimation problem, albeit in a very sparse domain. But its subject matter, human language, has been a focus of intense research in Artificial Intelligence over the past few decades, research which has relied on computational linguistics and machine learning techniques. Consequently, our research group consists of computer scientists and statisticians. We develop statistical frameworks for modeling various aspects of human language, implement them, and test their effect on various language technology applications. Some of the problems we have been tackling recently are:
Modeling the Structure of Language: Left-to-right Markovian language models (n-grams), invented in the '70s, are too crude as models of natural language, yet are surprisingly hard to beat. More sophisticated modeling techniques are needed in order to capture long-distance correlations, grammaticality and other more subtle linguistic phenomena. To this end, we have developed and are experimenting with a whole-sentence exponential model, which can incorporate arbitrary sentence-level features in a consistent manner.
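As a rough illustration of the contrast drawn above, the following Python sketch scores a sentence first with a left-to-right bigram model and then with a whole-sentence score that adds weighted sentence-level features to the baseline log-probability. It is a minimal sketch under stated assumptions, not the group's implementation: the add-one smoothing, the function names, the toy corpus, the feature `has_known_verb` and its weight are all hypothetical, and the score is left unnormalized.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count histories and bigrams over tokenized sentences with boundary markers."""
    unigrams, bigrams = Counter(), Counter()
    for sent in sentences:
        tokens = ["<s>"] + sent + ["</s>"]
        unigrams.update(tokens[:-1])                  # every token that acts as a history
        bigrams.update(zip(tokens[:-1], tokens[1:]))  # (history, next word) pairs
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return unigrams, bigrams, vocab

def bigram_logprob(sent, model):
    """Left-to-right log-probability with add-one smoothing (an assumed choice).

    Assumes every word of `sent` is in the training vocabulary.
    """
    unigrams, bigrams, vocab = model
    tokens = ["<s>"] + sent + ["</s>"]
    return sum(
        math.log((bigrams[(h, w)] + 1) / (unigrams[h] + len(vocab)))
        for h, w in zip(tokens[:-1], tokens[1:])
    )

def whole_sentence_score(sent, model, features, weights):
    """Unnormalized log-score: log p0(s) + sum_i lambda_i * f_i(s).

    `features` are functions of the whole sentence, so they may look at
    properties that a fixed-length history cannot see.
    """
    return bigram_logprob(sent, model) + sum(
        lam * f(sent) for f, lam in zip(features, weights)
    )

# Hypothetical sentence-level feature: does the sentence contain a known verb?
def has_known_verb(sent):
    return 1.0 if any(w in {"barks", "sleeps"} for w in sent) else 0.0

corpus = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
model = train_bigram(corpus)
for candidate in (["the", "dog", "barks"], ["the", "dog", "the"]):
    print(candidate,
          round(bigram_logprob(candidate, model), 3),
          round(whole_sentence_score(candidate, model, [has_known_verb], [0.5]), 3))
```

Turning the score into a proper probability would require a normalizing constant summed over all sentences, which the sketch leaves out; it only shows how arbitrary whole-sentence features can be layered on top of an n-gram baseline.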
Feature Induction via Interactive Discrimination: The whole-sentence model described above can accommodate arbitrary features, but finding good features is still a hard open problem. By formulating this as a discrimination problem, we have been using psycholinguistic experiments (a.k.a. Shannon games) to elicit useful feature candidates from human subjects.
Modeling Semantic Coherence: Natural language is semantically coherent. At a sentence surface level, one observes that content words tend to co-occur in specific clusters. Perhaps the most glaring deficiency of existing language models is their inability to capture this phenomenon. Straightforward statistical modeling of semantic coherence proved very difficult due to high dimensionality and extreme sparseness. We are attacking this problem using machine learning techniques.
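To make the co-occurrence observation concrete, here is a small Python sketch that measures how strongly pairs of content words co-occur within sentences, using pointwise mutual information (PMI). PMI is a swapped-in textbook statistic chosen for illustration, not the machine learning approach referred to above; the function name `sentence_pmi`, the stopword list and the toy sentences are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def sentence_pmi(sentences, stopwords=frozenset()):
    """PMI of word pairs, where an 'event' is a word appearing in a sentence.

    PMI(a, b) = log( P(a, b) / (P(a) * P(b)) ), with probabilities estimated
    as the fraction of sentences containing the word or the pair.
    """
    n = len(sentences)
    word_df, pair_df = Counter(), Counter()
    for sent in sentences:
        content = sorted(set(sent) - stopwords)   # unique content words, stable order
        word_df.update(content)
        pair_df.update(combinations(content, 2))  # each unordered pair once per sentence
    return {
        (a, b): math.log(c * n / (word_df[a] * word_df[b]))
        for (a, b), c in pair_df.items()
    }

toy = [
    ["the", "stock", "market", "fell"],
    ["the", "stock", "market", "rose"],
    ["the", "rain", "fell"],
    ["the", "rain", "cloud"],
]
pmi = sentence_pmi(toy, stopwords=frozenset({"the"}))
for pair, value in sorted(pmi.items(), key=lambda kv: -kv[1]):
    print(pair, round(value, 3))
```

Only pairs that were actually observed together receive a value, and the number of possible pairs grows quadratically with the vocabulary, which gives a small-scale view of the dimensionality and sparseness problem described above.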
Encoding Linguistic Knowledge as Prior: Most language models have so far failed to realize significant benefits from linguistic knowledge. We believe that linguistic knowledge can be much more profitable if it is used more flexibly, perhaps as a prior in a Bayesian model. Eliciting such knowledge and encoding it as a prior is an exciting project we have just started.
Using the Web to Improve Language Modeling: Most corpora used nowadays for language modeling contain up to several hundred million words. In comparison, the spiderable text portion of the Web as of this writing is well over 100 billion words. This is a potentially very useful resource for language modeling, with distinct characteristics of homogeneity and timeliness. We are experimenting with practical methods for tapping this unique resource.