Statistical Machine Translation

The "Candide" project was an experimental machine translation system under development at IBM TJ Watson Research Center in the early 1990's. I was a member of this group from 1992-1995. A survey paper (available in full in postscript form and abstract-only in html) describes the prototype system we built. A considerably longer paper addresses some applications of maximum entropy in machine translation, and the abstract of a patent based on that work. Also available is a highly selective bibliography for statistical machine translation.

In 1949, Warren Weaver proposed that statistical techniques from the emerging field of information theory might make it possible to use modern digital computers to translate text from one natural language to another automatically. Although Weaver's scheme foundered on the rocky reality of the limited computer resources of the day, a group of IBM researchers in the late 1980's felt that the increase in computer power over the previous forty years made reasonable a new look at the applicability of statistical techniques to translation. Thus the "Candide" project, aimed at developing an experimental machine translation system, was born at IBM TJ Watson Research Center.

Machine Translation (MT) is widely considered among the most difficult tasks in natural language processing, and in artificial intelligence in general, because accurate translation seems to be impossible without a comprehension of the text to be translated. At least, this is what many prominent researchers concluded in the 1960's after some fruitless attempts to create working MT systems, at which time they threw up their hands in defeat. The path to developing a working MT system, so the thinking went, required decades of hand-crafting grammars, dictionaries, and translation rules in close consultation with human translation experts.

The guiding idea of the Candide project was to admit from the start that this approach is completely untenable. After all, a human translator can't possibly set down in sufficient detail the "algorithm" he applies when translating a document. Instead, the idea was to let a system learn by itself how to translate. Given a vast collection of documents in both English and French (the proceedings of the Canadian parliament, known as "Hansards," are an especially convenient source) the Candide system automatically learns how English and French are related. As a historical note, most of the members of the Candide group spoke little or no French.

The Candide group adopted an information-theoretic perspective on the MT problem, which goes as follows. In speaking a French sentence F, a French speaker originally thought up a sentence E in English, but somewhere in the noisy channel between his brain and mouth, the sentence E got "corrupted" to its French translation F. The task of an MT system is to discover E* = argmax(E') p(F|E') p(E'); that is, the MAP-optimal English sentence, given the observed French sentence. This approach involves constructing a model of likely English sentences, and a model of how English sentences translate to French sentences. Both these tasks are accomplished automatically with the help of a large amount of bilingual text.

As wacky as this perspective might sound, it's no stranger than the view that an English sentence gets corrupted into an acoustic signal in passing from the person's brain to his mouth, and this perspective is now essentially universal in automatic speech recognition.