INCA: An Integrated Cluster Computing Architecture for Machine Translation
SUMMARY:
Progress in the field of machine translation (MT) has come to depend heavily on open-source toolkits, which make it easier for new research groups to tackle the problem at lower cost, broadening participation. Unfortunately, toolkits have not kept up with the computing infrastructure (e.g., the MapReduce framework) required for modern "big data" approaches to MT; the "primitives" in most toolkits are hardly extensible to new models, since they focus on pipeline components rather than algorithmic concepts; and experiment management has been all but ignored.
The Integrated Cluster Computing Architecture (INCA) for translation is being developed to overcome these challenges by implementing an extensible, open-source toolkit that can leverage MapReduce clusters and flexibly implement many types of MT systems. MT is not a perfect fit for MapReduce: it has massive memory footprints and requires iterative algorithms. New algorithms are therefore being developed to take advantage of the framework without being limited by it. Experiment management, evaluation, and advice about "best practices" are also part of the toolkit, to make it as widely accessible as possible.
This project is expected to have broad impact in MT research through the open-source toolkit to be made available to the research community. A course project suitable for undergraduates, built on the toolkit, will be developed and shared openly. Technical solutions to problems in large-scale, parallelized MT will be applicable in other areas of data-intensive natural language processing and machine learning, and elements of the toolkit are expected to be useful in such research efforts as well.
This project is funded by NSF from Feb 2009 – Jan 2011.
Team INCA:
Stephan Vogel (Co-PI)
Noah Smith (Co-PI)

Language Technologies Institute
Carnegie Mellon University
5000 Forbes Ave.
Pittsburgh, PA 15213
NEWS:
04/23/09: NSF-Google-IBM press release announcing the NSF awards made under the CluE program