INCA:  An Integrated Cluster Computing Architecture for Machine Translation

SUMMARY: Progress in the field of machine translation (MT) has come to depend heavily on open-source toolkits, which make it easier for new research groups to tackle the problem at lower cost, broadening participation. Unfortunately, current toolkits fall short in three ways. They have not kept up with the computing infrastructure (e.g., the MapReduce framework) required for modern "big data" approaches to MT. Their "primitives" are hard to extend to new models, because they focus on pipeline components rather than algorithmic concepts. And experiment management has been all but ignored.

The Integrated Cluster Computing Architecture (INCA) for translation is being developed to overcome these challenges by implementing an extensible, open-source toolkit that can leverage MapReduce clusters and flexibly implement many types of MT systems. MT is not a perfect fit for MapReduce: its models have massive memory footprints, and training requires iterative algorithms. New algorithms are therefore being developed to take advantage of the framework without being limited by it. Experiment management, evaluation, and advice about "best practices" are also part of the toolkit, to make it as widely accessible as possible.
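To make the MapReduce mismatch concrete, here is a minimal single-machine sketch of one iterative MT training algorithm (EM for an IBM Model 1 style word translation table) written as a map step and a reduce step. It is an illustration only, not INCA's or Chaski's implementation: the function names `map_sentence_pair`, `reduce_counts`, and `train` are hypothetical, the NULL source word is omitted for brevity, and a plain Python loop stands in for a cluster job.

```python
from collections import defaultdict


def map_sentence_pair(src, tgt, t):
    """Map step (E-step): emit expected counts for one sentence pair.

    `t[(e, f)]` is the current estimate of P(f | e). Every mapper needs
    read access to the whole table, which is the memory-footprint issue.
    """
    emitted = []
    for f in tgt:
        # Normalizer over all source words that could have generated f.
        z = sum(t[(e, f)] for e in src)
        for e in src:
            emitted.append(((e, f), t[(e, f)] / z))
    return emitted


def reduce_counts(pairs):
    """Reduce step (M-step): sum counts per key, normalize per source word."""
    counts = defaultdict(float)
    totals = defaultdict(float)
    for (e, f), c in pairs:
        counts[(e, f)] += c
        totals[e] += c
    return {(e, f): c / totals[e] for (e, f), c in counts.items()}


def train(corpus, iterations=5):
    """Run EM; each iteration corresponds to one full map + reduce pass."""
    tgt_vocab = {f for _, tgt in corpus for f in tgt}
    uniform = 1.0 / len(tgt_vocab)
    t = defaultdict(lambda: uniform)  # uniform initialization
    for _ in range(iterations):
        emitted = [
            kv
            for src, tgt in corpus
            for kv in map_sentence_pair(src, tgt, t)
        ]
        # The updated table must be redistributed before the next pass.
        t = defaultdict(float, reduce_counts(emitted))
    return t
```

Calling `train` on a small list of `(source_words, target_words)` pairs returns a table keyed by `(source_word, target_word)`. Both difficulties named above are visible in the sketch: every iteration is a separate pass over the data, and the full table `t` has to be available to every call of the map step.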

This project is expected to have broad impact in MT research through the open-source toolkit to be made available to the research community. A course project that uses the toolkit, suitable for undergraduates, will be developed and shared openly. Technical solutions to problems in large-scale, parallelized MT will be applicable in areas of data-intensive natural language processing and machine learning, and elements of the toolkit are expected to be useful in such research efforts as well.

This project is funded by NSF under the CluE (Cluster Exploration) program, award IIS 084450, from Feb 2009 – Jan 2011.

Publications:

Qin Gao and Stephan Vogel. Training phrase-based machine translation models on the cloud: Open source machine translation toolkit Chaski. The Prague Bulletin of Mathematical Linguistics No. 93, 2010, pp. 37–46.

Kevin Gimpel and Noah A. Smith.  Feature-Rich Translation by Quasi-Synchronous Lattice Parsing. In Proceedings of EMNLP, Singapore, August 2009.

Qin Gao and Stephan Vogel. Parallel Implementations of Word Alignment Tool. Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp. 49–57, June 2009.

Tools:

mGIZA++: a multi-threaded, multimode version of GIZA++, with a number of other interesting extensions

Chaski: a distributed toolkit for machine translation

Team INCA:

Stephan Vogel (Co-PI)

Noah Smith (Co-PI)

Qin Gao

Kevin Gimpel

Language Technologies Institute
Carnegie Mellon University

5000 Forbes Ave.
Pittsburgh, PA 15213

NEWS:
04/23/09: NSF-Google-IBM press release announcing the NSF awards made under the CluE program