Sanjika Hewavitharana
Faculty Advisor: Stephan Vogel

Title: Utilizing Machine Translation to
Support Disaster Relief Efforts

Short Bio

Sanjika Hewavitharana is a Ph.D. student in the Language Technologies Institute at Carnegie Mellon University. His research interests include machine translation and information retrieval, with a focus on low-resource scenarios, natural language processing, and machine learning. He is currently working on his thesis on learning from comparable corpora for statistical machine translation under the supervision of Stephan Vogel. He received his bachelor's degree in Computer Science from the University of Colombo in Sri Lanka and his master's degree in Language Technologies from Carnegie Mellon University.

Project Synopsis

One of the biggest problems faced by relief organizations when responding to international emergencies is the inability of first responders to speak the local language. This was amply evident in the aftermath of the earthquake in Haiti in January 2010, and more recently after the tsunami in Japan in March 2011. Human interpreters are often employed for the translation task, but finding them in the right place at the right time is a challenge.

As part of the emergency response in Haiti, an information service called “Mission 4636” collected text messages (SMS) sent by local people in Haiti. The messages were originally written in Haitian Creole, requesting food and other supplies for particular places, inquiring about the whereabouts of missing family members, and so on. They were translated into English by a group of volunteers spread across the globe and relayed back to the emergency responders on the ground. Using the location information available in the messages, the volunteers also generated maps that the responders used for quick responses. Understandably, the messages are often noisy, contain incomplete information, and are sometimes written in several languages.

Our objective in this project is to evaluate to what extent current machine translation technology can be applied to this particular usage scenario. Even when fully automatic translation is unreliable, automated language processing, and translation technology in particular, may still be able to support human translators, making their task easier and more manageable.

We analyze the available data to identify which supporting tools can be built in this regard. Specifically, the analysis covers language identification, vocabulary coverage, the identification and resolution of peculiarities (such as abbreviations and spelling variants), and translation quality. We test the quality of a fully automatic translation system and explore to what extent its output can be classified to indicate which translations can be used unmodified, which can be provided for post-editing, and which should be translated by human translators from scratch. Additionally, we test methods to classify the messages into predetermined categories to indicate their priority.
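As an illustration of one such supporting tool, language identification for short SMS messages can be approximated with character n-gram profiles. The sketch below is a minimal, hypothetical example, not the project's actual pipeline: the sample messages, the `train_profiles` and `identify` helpers, and the two-language setup are all assumptions made for demonstration.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Extract overlapping character n-grams from lowercased text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def train_profiles(samples):
    """Build a normalized n-gram frequency profile for each language."""
    profiles = {}
    for lang, texts in samples.items():
        counts = Counter()
        for t in texts:
            counts.update(char_ngrams(t))
        total = sum(counts.values())
        profiles[lang] = {g: c / total for g, c in counts.items()}
    return profiles

def identify(message, profiles):
    """Return the language whose profile best covers the message's n-grams."""
    grams = char_ngrams(message)
    return max(profiles,
               key=lambda lang: sum(profiles[lang].get(g, 0.0) for g in grams))

# Toy training data (invented examples, not real Mission 4636 messages).
samples = {
    "ht": ["nou bezwen manje ak dlo", "tanpri ede nou nan katye a"],
    "en": ["we need food and water", "please help us in the neighborhood"],
}
profiles = train_profiles(samples)
```

In practice a tool like this would be trained on much larger monolingual corpora and would also need to handle mixed-language messages, but the same profile-matching idea underlies many practical language identifiers.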

Translation quality in machine translation systems depends primarily on two factors: the amount and the quality of the available parallel data. More parallel data typically results in better translation quality. Large amounts of parallel data are not available for many language pairs, especially for less commonly spoken languages such as Haitian Creole. We exploit other resources, such as comparable corpora, to extract parallel data that supplements the limited amount of available data. We explore different ways this data can be utilized to improve the translation quality of the machine translation system.
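A core step in mining comparable corpora is scoring candidate sentence pairs for translational equivalence. One common baseline is lexical coverage under a bilingual dictionary: keep a pair if enough source words have a known translation in the target sentence. The sketch below is only illustrative; the toy lexicon, the `extract_parallel` helper, and the 0.5 threshold are assumptions, not the method used in the project.

```python
def pair_score(src_tokens, tgt_tokens, lexicon):
    """Fraction of source tokens that have a dictionary translation
    appearing in the target sentence."""
    if not src_tokens:
        return 0.0
    covered = sum(1 for s in src_tokens
                  if any(t in lexicon.get(s, ()) for t in tgt_tokens))
    return covered / len(src_tokens)

def extract_parallel(src_sents, tgt_sents, lexicon, threshold=0.5):
    """Greedily pair each source sentence with its best-scoring target
    sentence, keeping only pairs above the coverage threshold."""
    pairs = []
    for src in src_sents:
        s_tok = src.split()
        best = max(tgt_sents,
                   key=lambda t: pair_score(s_tok, t.split(), lexicon))
        if pair_score(s_tok, best.split(), lexicon) >= threshold:
            pairs.append((src, best))
    return pairs

# Toy bilingual lexicon (invented entries for demonstration).
lexicon = {
    "nou": {"we"}, "bezwen": {"need"},
    "manje": {"food"}, "dlo": {"water"},
}
```

Real systems replace the dictionary with probabilistic lexicons learned from whatever parallel data exists, and add length ratios and alignment features, but the coverage score above captures the basic filtering idea.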

Although the study focuses on Haitian Creole and English, the analysis and the proposed tools are not specific to any particular language. The proposed approach can therefore be applied to other languages and other scenarios in the future.