CMU leads a multi-university effort to develop translation programs for less-common languages
By Jason Togyer
About 12 million people worldwide are fluent in Kinyarwanda, an African language used in Rwanda and parts of neighboring Burundi and Uganda. That may sound like a lot--but consider that about 1.4 billion people speak Mandarin and 1.8 billion speak English. That makes it relatively simple to find someone who can translate English into Mandarin, but not so simple to find someone who can turn English into Kinyarwanda and back again.
The lack of translators for these less-commonly spoken languages becomes a serious problem when trouble--such as a natural disaster, or, in the case of Rwanda, civil war and genocide--erupts in a place where such languages are the native tongues. "Imagine setting up an army base or Red Cross tent," says Jaime Carbonell, director of CMU's Language Technologies Institute, "and people are coming to you for medical help--but you can't speak their language." Aid workers or peacekeepers then must rely on native translators who can be difficult to find, and in some cases are fellow victims (or even perpetrators) of the ongoing crisis.
While computerized translation systems for widely used languages such as English, French, Mandarin and Spanish have flourished, less-commonly-used languages have been largely left behind. A new five-year research project led by Carbonell and including colleagues from CMU and three other universities will try to close that gap. The research--valued at more than $1 million per year--is being funded by the U.S. Army's Multidisciplinary University Research Initiative, or MURI. Partner universities are MIT, the University of Southern California and the University of Texas at Austin.
Researchers will develop machine translation models for Kinyarwanda and for Malagasy, which is spoken by about 20 million people worldwide and is the national language of the island Republic of Madagascar. The researchers' broader goal is studying how a combination of computing power and human intelligence might create faster, better translation systems than either method alone.
"MURI is interested in examining languages in potential hot-spots in the world that are also less-commonly spoken languages," Carbonell says. "Some of them are actually spoken by a lot of people, but there's very little written data, and it's difficult to map oral traditions onto computer models."
Building translation systems for "resource-poor" languages isn't simple. Early machine translation systems relied on grammar rules, but the practice went out of fashion because programming them took tens of thousands of person-hours, says Lori Levin, an associate research professor in the LTI. Since the early 1990s, most modern machine translation systems instead have been built on statistical models. Large bodies of parallel text in two or more languages--news articles or transcripts of United Nations or European Union proceedings--are analyzed and parallel words or phrases are matched based upon the frequency of their occurrence and the probability that they have the same meanings. In the case of languages such as Spanish and English, there are "terabytes of data" to work with, Carbonell says.
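The statistical matching described above can be illustrated with a small sketch. It is not the project's own system; it is a simplified version of a classic word-alignment model from the early 1990s (IBM Model 1, here without the usual NULL word), and the function names and the four-word English-Spanish corpus are invented for illustration. The idea is that words which keep appearing in paired sentences gradually earn a higher translation probability.

```python
# Toy parallel corpus (invented for illustration): (English, Spanish) sentence pairs.
TOY_PAIRS = [
    (["the", "house"], ["la", "casa"]),
    (["a", "house"], ["una", "casa"]),
    (["the", "table"], ["la", "mesa"]),
]


def train_word_model(pairs, iterations=10):
    """Estimate t[(target_word, source_word)] with expectation-maximization."""
    # Start with equal weight for every word pair that co-occurs in some sentence pair.
    t = {}
    for src, tgt in pairs:
        for tw in tgt:
            for sw in src:
                t[(tw, sw)] = 1.0

    for _ in range(iterations):
        count = {}  # expected number of times sw is translated as tw
        total = {}  # expected number of times sw is translated at all
        for src, tgt in pairs:
            for tw in tgt:
                # Share one unit of credit for tw among the source words
                # in the same sentence, in proportion to current beliefs.
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] = count.get((tw, sw), 0.0) + frac
                    total[sw] = total.get(sw, 0.0) + frac
        # Re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return t


def best_translation(t, source_word):
    """Return the target word with the highest probability for source_word."""
    candidates = {tw: p for (tw, sw), p in t.items() if sw == source_word}
    return max(candidates, key=candidates.get)


if __name__ == "__main__":
    model = train_word_model(TOY_PAIRS)
    for word in ["the", "house", "a", "table"]:
        print(word, "->", best_translation(model, word))
```

With only three sentence pairs the sketch can separate "house" from "the" because each word also appears in a different context. That is the same reason real systems need so many sentence pairs: rare words seldom appear in enough contexts to be pinned down.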
Another member of the research team, Stephan Vogel, says many of the models being used were adapted from other disciplines such as electrical engineering, physics or computer science. "From a linguistics point of view, these models are fairly dumb, but from a practical point of view, they're quite good," says Vogel, an assistant research professor in the LTI. "It's amazing how much machine translation has improved over the past 10 years."
But with a language such as Malagasy or Kinyarwanda, there aren't those collections of large, parallel texts with which to build statistical models. "If you have 10 million sentence pairs, you can build up a very good translation," Vogel says. "In this case, we don't have even 1 million sentence pairs. We may not have 100,000."
Using a brute force method--paying people to hand-translate documents from Malagasy or Kinyarwanda into English, and then training machine-translation algorithms on those bodies of text--would be both slow and expensive, Carbonell says.
Instead, the team will use existing texts such as the Bible, Koran, government documents and other written works. In the case of Kinyarwanda, researchers also have access to translated testimonies given by the survivors of the Rwandan genocide. But those documents suffer from a problem called "domain specificity," says team member Jason Baldridge, a computational linguist at UT Austin. The genocide documentation hopefully doesn't represent topics that native speakers of Kinyarwanda talk about in everyday life, he says, and formal works such as the Bible and Koran aren't typical of modern speech patterns, either. (People don't speak in sequences of "begats.")
"We don't have as much data to rely on," adds another project researcher, Noah Smith, an assistant professor of language technologies and machine learning at CMU, "so we have to rely on deeper linguistic principles and squeeze more information out of the data we do have."
To do that, the MURI researchers will experiment with hybrid translation systems that blend statistical models with language rules--a linguistic-core approach. "I think putting some linguistic knowledge back into the models is going to improve the quality of the translations," Levin says. "Some of the statistical methods are beginning to be maxed out, and I think everybody knows that."
Active-learning algorithms will then be used to explore the data being collected by the researchers and determine where statistical models alone are capable of devising accurate translation rules, and where linguistic and grammar rules written by humans will have a major impact. "Active learning is very good at determining what you don't know that would make the biggest difference if you did know it," Carbonell says. It's too soon to say how much of a finished hybrid translation system will rely on probability models, and how much will rely on linguistics, he says: "This is a five-year project, and we're only five months in. We've got a long way to go."
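Carbonell's description of active learning can be made concrete with a short sketch. The article does not say which active-learning strategy the team uses; the example below assumes one common approach, uncertainty sampling, in which the system asks a human about the items its current model is least sure of. The phrase labels, the probabilities and the function names are all hypothetical.

```python
import math


def entropy(probs):
    """Shannon entropy in bits; higher means the model is less certain."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def pick_queries(predictions, budget):
    """Return the `budget` items with the most uncertain predictions.

    `predictions` maps an item (say, a source phrase) to the model's
    probabilities over its candidate translations.
    """
    ranked = sorted(predictions, key=lambda item: entropy(predictions[item]),
                    reverse=True)
    return ranked[:budget]


# Hypothetical model output for three untranslated phrases.
EXAMPLE_PREDICTIONS = {
    "phrase_a": [0.98, 0.01, 0.01],  # model is nearly certain
    "phrase_b": [0.40, 0.35, 0.25],  # model is close to guessing
    "phrase_c": [0.70, 0.20, 0.10],  # somewhere in between
}

if __name__ == "__main__":
    # With a budget of two human judgments, ask about the least certain phrases.
    print(pick_queries(EXAMPLE_PREDICTIONS, budget=2))
```

In this toy case the human's time goes to "phrase_b" and "phrase_c", where a correction or a hand-written rule would change the model most, rather than to "phrase_a", which the statistics already handle.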
Although a big test of the team's work will be creating reliable translators for Malagasy and Kinyarwanda, the larger goal is learning how researchers can speed up the process of creating machine-translation systems for less-commonly used languages. "It's no good if you respond to an emergency and a year later you say, 'OK, now we have a system in place and we can translate,'" Vogel says. "Can you do it in three days?"
Jason Togyer | 412-268-8721 | jt3y [atsymbol] cs.cmu.edu