Realtime Adaptive Translation Systems with cdec

Tutorial by Michael Denkowski

Uses software written by Michael Denkowski, Chris Dyer, Victor Chahuneau,
Vladimir Eidelman, Adam Lopez, Kenneth Heafield

Updated February 19, 2014

If you use Realtime in your work, please cite the following:

@inproceedings{realtime,
  author    = {Michael Denkowski and Chris Dyer and Alon Lavie},
  title     = {Learning from Post-Editing: Online Model Adaptation for Statistical Machine Translation},
  booktitle = {Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics},
  year      = {2014},
}
@inproceedings{cdec,
  author    = {Dyer, Chris  and  Lopez, Adam  and  Ganitkevitch, Juri  and  Weese, Jonathan  and  Ture, Ferhan  and
               Blunsom, Phil  and  Setiawan, Hendra  and  Eidelman, Vladimir  and  Resnik, Philip},
  title     = {cdec: A Decoder, Alignment, and Learning Framework for Finite-State and Context-Free Translation Models},
  booktitle = {Proceedings of the ACL 2010 System Demonstrations},
  year      = {2010},
}

Table of Contents:

0. Installing Software
1. Preparing Data
2. Translation Models
3. Language Models
4. Parameter Tuning
5. Packaging
6. Translating
7. Command Line Interface
8. Python API

0. Installing Software

The steps in this tutorial have been verified in several computing environments, including a fresh installation of Ubuntu Linux 12.04 (LTS). Currently, only Linux operating systems are supported. Compiling the cdec and cpyp toolkits requires recent versions of several software packages. The easiest way to ensure these dependencies are met is to install Ubuntu Linux 12.04 and run the following command:
sudo apt-get install autoconf automake build-essential flex gcc-multilib git libicu-dev libtool python2.7-dev zlib1g-dev
Alternatively, equivalent packages can be installed manually on the Linux distribution of your choice. Sufficiently recent versions of GCC and Boost will be built from source in the steps below. First, create a prefix where software will be installed:
mkdir -p ~/prefix/sw
echo '
export PREFIX="$HOME/prefix"
export PATH="$PREFIX/bin:$PATH"
export CPATH="$PREFIX/include"
export LIBRARY_PATH="$PREFIX/lib:$PREFIX/lib64"  
export LD_RUN_PATH="$PREFIX/lib:$PREFIX/lib64"
export LD_LIBRARY_PATH="$PREFIX/lib:$PREFIX/lib64"
export PYTHONPATH="$PREFIX/sw/cdec/python"
' >> ~/prefix/env
To use the prefix (either for installing software in this tutorial or for running the resulting translation systems), enter the following command to set the appropriate environment variables. If you integrate a Realtime system with your own software, make sure the same environment variables are set. Including the same command before the realtime.py invocation is usually sufficient.
. ~/prefix/env
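For example, when launching the translation system (built later in this tutorial) from a script, the environment can be set in the same command:
. ~/prefix/env && ~/prefix/sw/cdec/realtime/realtime.py -c demo-es-en.d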
Install GCC 4.8:
cd ~/prefix/sw
wget 'http://www.netgull.com/gcc/releases/gcc-4.8.0/gcc-4.8.0.tar.gz'
tar xf gcc-4.8.0.tar.gz
cd gcc-4.8.0
./contrib/download_prerequisites
cd ..
mkdir objdir
cd objdir
../gcc-4.8.0/configure --prefix=$PREFIX
LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:$LIBRARY_PATH make
make install
Install the Boost C++ libraries:
cd ~/prefix/sw
wget 'http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download' -O boost_1_49_0.tar.gz
tar xf boost_1_49_0.tar.gz
cd boost_1_49_0
./bootstrap.sh
./b2 install --prefix=$PREFIX
Download and build the cdec toolkit:
cd ~/prefix/sw
git clone https://github.com/redpony/cdec.git
cd cdec
autoreconf -ifv
./configure --prefix=$PREFIX
make
Build the Python interface to cdec:
cd ~/prefix/sw/cdec/python
python setup.py build_ext --inplace
Download and build the cpyp toolkit:
cd ~/prefix/sw
git clone https://github.com/redpony/cpyp.git
cd cpyp/hpyplm
Edit the hpyplm Makefile, changing the following line:
BOOST_ROOT=/cab0/tools/boost-1.49.0
to:
BOOST_ROOT=$(PREFIX)
Compile the hpyplm tools:
make hpyplm_train
make libcdec_ff_hpyplm.so
You are now ready to build and deploy adaptive translation systems.

1. Preparing Data

Translation systems are built using large bilingual and monolingual texts. This tutorial uses a small subset of the Spanish-English data set from the 2012 ACL WMT Shared Translation Task. When building systems for other scenarios, substitute your own data files; files are typically named by language, so an Arabic-English system would use corpus.ar and corpus.en, and so on. To get started, download the WMT12 Europarl, news commentary, and news development data to a system building directory:
mkdir ~/demo-es-en
cd ~/demo-es-en
wget http://statmt.org/wmt12/training-parallel.tgz
tar vxf training-parallel.tgz
wget http://statmt.org/wmt12/dev.tgz
tar vxf dev.tgz
First, normalize all training and development data with cdec's tokenizer (split off punctuation, contractions, and such to convert input text into lines of space-delimited translation units). The translation models will be built on news commentary and the language model will be built on news commentary plus Europarl.
~/prefix/sw/cdec/corpus/tokenize-anything.sh <training/news-commentary-v7.es-en.en >training/news-commentary-v7.es-en.en.tok
~/prefix/sw/cdec/corpus/tokenize-anything.sh <training/news-commentary-v7.es-en.es >training/news-commentary-v7.es-en.es.tok
~/prefix/sw/cdec/corpus/tokenize-anything.sh <training/europarl-v7.es-en.en >training/europarl-v7.es-en.en.tok
Dev and test set sizes are limited to 1000 sentences each to reduce training time (when building full scale systems, 2000-3000 sentences are preferred for better results).
head -n 1000 dev/newstest2010.es |~/prefix/sw/cdec/corpus/tokenize-anything.sh >dev.es
head -n 1000 dev/newstest2010.en |~/prefix/sw/cdec/corpus/tokenize-anything.sh >dev.en
head -n 1000 dev/newstest2011.es |~/prefix/sw/cdec/corpus/tokenize-anything.sh >test.es
head -n 1000 dev/newstest2011.en |~/prefix/sw/cdec/corpus/tokenize-anything.sh >test.en
Next, paste together the parallel text files into the triple pipe ( ||| ) format used by cdec:
~/prefix/sw/cdec/corpus/paste-files.pl training/news-commentary-v7.es-en.es.tok training/news-commentary-v7.es-en.en.tok \
  >training/news-commentary-v7.es-en.es-en.tok
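Each line of the resulting file contains a source sentence and its translation separated by the triple pipe. As a minimal Python sketch (reading the file just created), the two sides can be recovered by splitting on the delimiter:
import io

# Sketch: iterate over a ||| -delimited bitext and split each line into
# its source and target sides.
pairs = 0
with io.open('training/news-commentary-v7.es-en.es-en.tok', encoding='utf-8') as f:
    for line in f:
        source, target = line.strip().split(' ||| ')
        pairs += 1
print('sentence pairs: %d' % pairs)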
Filter out long (over 80 words) and mismatched sentences, resulting in the training corpus that will be used for translation model estimation:
~/prefix/sw/cdec/corpus/filter-length.pl -80 training/news-commentary-v7.es-en.es-en.tok >corpus.es-en
Split off the target (English) side of the final corpus for language model estimation:
~/prefix/sw/cdec/corpus/cut-corpus.pl 2 <corpus.es-en >corpus.en
Combine the English side of the bilingual news commentary data and the monolingual Europarl data, resulting in the corpus used for language model estimation:
cat training/europarl-v7.es-en.en.tok corpus.en >mono.en

2. Translation Models

Translation models (encoded as sentence-level bilingual grammars) map source language sentences to target language sentences. Model estimation begins with learning word-level mappings between source and target (Spanish and English) text. Run forward and reverse word alignment with cdec's fast_align and save the model parameters for aligning future data:
~/prefix/sw/cdec/word-aligner/fast_align -i corpus.es-en -d -v -o -p corpus.es-en.fwd_params \
  >corpus.es-en.fwd_align 2>corpus.es-en.fwd_err
~/prefix/sw/cdec/word-aligner/fast_align -i corpus.es-en -r -d -v -o -p corpus.es-en.rev_params \
  >corpus.es-en.rev_align 2>corpus.es-en.rev_err
Symmetrize the alignments (convert two sets of directional word alignments into a single set of bidirectional word alignments) using cdec's atools:
~/prefix/sw/cdec/utils/atools -i corpus.es-en.fwd_align -j corpus.es-en.rev_align -c grow-diag-final-and >corpus.es-en.gdfa
Index the aligned bitext with a suffix array to facilitate extracting on-demand grammars for input sentences:
python -m cdec.sa.compile -a corpus.es-en.gdfa -b corpus.es-en -o sa >sa.ini
Use the parameters learned from word-aligning the corpus to align the dev set. The target side (references) is used to simulate post-editing during tuning:
~/prefix/sw/cdec/corpus/paste-files.pl dev.es dev.en >dev.es-en
~/prefix/sw/cdec/word-aligner/force_align.py corpus.es-en.fwd_params corpus.es-en.fwd_err \
  corpus.es-en.rev_params corpus.es-en.rev_err <dev.es-en >dev.es-en.gdfa
~/prefix/sw/cdec/corpus/paste-files.pl dev.es dev.en dev.es-en.gdfa >dev.es-en.in
Extract sentence-level grammars for the dev set. For tuning, these grammars must be generated in advance. In production, these grammars will be automatically generated on-demand as new sentences are translated.
cat dev.es-en.in |python -m cdec.sa.extract -c sa.ini -g dev.es.g-o -z -o

3. Language Models

Language models are used to determine how similar translation outputs are to observed text in the target language. These models are estimated from much larger data than translation models. Estimate a large 4-gram language model using KenLM:
~/prefix/sw/cdec/klm/lm/builder/lmplz -o 4 -T. <mono.en >mono.arpa
~/prefix/sw/cdec/klm/lm/build_binary trie mono.arpa mono.klm
Use cpyp to train an adaptive language model that exactly covers the translation model training data. Post-edit data will be added to this model as well as the translation model. Also copy the hpyplm feature function library for use with cdec.
~/prefix/sw/cpyp/hpyplm/hpyplm_train corpus.en corpus.hpyplm 100
cp ~/prefix/sw/cpyp/hpyplm/libcdec_ff_hpyplm.so .

4. Parameter Tuning

Create the following decoder configuration file named cdec.ini:
formalism=scfg
cubepruning_pop_limit=200
feature_function=WordPenalty
feature_function=ArityPenalty
add_pass_through_rules=true
Add the locations of your language models (absolute paths):
echo "feature_function=KLanguageModel -n LM $(pwd)/mono.klm" >>cdec.ini
echo "feature_function=External $(pwd)/libcdec_ff_hpyplm.so $(pwd)/corpus.hpyplm -r $(pwd)/dev.en" >>cdec.ini
Create the following feature weights file named weights.init:
PassThrough 0
Glue 0
WordPenalty 0
Arity_0 0
Arity_1 0
Arity_2 0
EgivenFCoherent 0
SampleCountF 0
CountEF 0
MaxLexFgivenE 0
MaxLexEgivenF 0
IsSingletonF 0
IsSingletonFE 0
IsSupportedOnline 0
LM 0
LM_OOV 0
HPYPLM 0
HPYPLM_OOV 0
These files specify the decoder configuration and the initial weights (all zeroes) used for parameter tuning. Run cdec's MIRA optimizer to learn feature weights that produce good translations on the dev set. Increase the number of jobs (-j) to speed up training; depending on the size of the data used to build the models, the number of jobs may be constrained by available memory before the number of CPUs:
~/prefix/sw/cdec/training/mira/mira.py -d dev.es-en -c cdec.ini -w weights.init -j 4 -o mira-work --max-iterations 20 \
  --optimizer 2 --metric-scale 1 -k 500 --update-size 500 --step-size 0.01 --hope 1 --fear 1 -g dev.es.g-o/grammar

5. Packaging

Run the following to copy all configuration and model files for the translation system into a self-contained directory:
~/prefix/sw/cdec/realtime/mkconfig.py corpus.es-en.fwd_params corpus.es-en.fwd_err corpus.es-en.rev_params \
  corpus.es-en.rev_err sa sa.ini mono.klm libcdec_ff_hpyplm.so corpus.hpyplm cdec.ini mira-work/weights.final demo-es-en.d
The newly created demo-es-en.d directory contains all language- and dataset-dependent files required to run an adaptive translation system. It can be relocated or copied to other machines (with the caveat that libcdec_ff_hpyplm.so may need to be recompiled on production machines whose environments differ from the development machine).

6. Translating

The system is now ready to translate new sentences. To translate unseen text with your system, run the following:
~/prefix/sw/cdec/realtime/realtime.py -c demo-es-en.d
The system reads lines from standard input and writes lines to standard output.

For verbose mode (various information written to standard error), use the -v flag. If you are translating unnormalized (raw) text, which is likely in a production environment, use the -n flag to tokenize input text prior to translation and detokenize MT output before it is written.

As an example, use the following to prepare a test input set:
~/prefix/sw/cdec/realtime/mkinput.py test.es test.en >test.es.in
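A hypothetical excerpt of the resulting test.es.in, in the command format described in section 7, looks like:
TR ||| la primera frase de prueba
LEARN ||| la primera frase de prueba ||| the first test sentence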
This alternates between translate and learn lines, using reference translations to simulate human post-editing. Next, translate the test set using the adaptive system:
~/prefix/sw/cdec/realtime/realtime.py -c demo-es-en.d -v <test.es.in >test.hyp
Evaluate system performance with cdec's scoring tools:
~/prefix/sw/cdec/mteval/fast_score -r test.en <test.hyp
The system is now ready for deployment or integration with other software systems.

7. Command Line Interface

realtime.py accepts lines consisting of commands followed by arguments, with a named context optionally specified:

COMMAND [context] ||| arg1 [||| arg2 ...]
A translation context is an instance of those system components that must be replicated for each adaptive system. It consists of a table of incremental data added to the grammar extractor, an adaptive language model that is updated by incremental data, and a set of decoding weights that is incrementally updated. By default, realtime.py uses a single default context for all commands. Adding named contexts to each command allows Realtime to concurrently adapt any number of translation systems that share the same base system (base grammar extraction data and language model). For example, Realtime can manage an individual context for each of several human translators without having to load several copies of the base system. Contexts can be added, dropped, saved, and loaded for efficient integration with computer-aided translation environments.

(Technically, each context instantiates a new lookup table in the grammar extractor, creates a new grammar cache directory, and starts a new decoder instance with a copy of the HPYP language model. Dropping a context frees these system resources.)

The following commands are accepted; if no context is given, the default context is used. With a specified context "contextname", the commands are used as follows:

TR contextname ||| source sentence
LEARN contextname ||| source sentence ||| reference translation
SAVE contextname ||| contextname.state
DROP contextname
LOAD contextname ||| contextname.state
Active contexts can be listed with the LIST command:
LIST
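As a sketch of driving this interface programmatically, the following Python snippet starts realtime.py as a subprocess and speaks the command protocol over its standard input and output (the context name and sentences are hypothetical, and the environment from section 0 is assumed to be set):
import os
import subprocess

# Start realtime.py as a child process (sketch; adjust paths as needed)
realtime = os.path.expanduser('~/prefix/sw/cdec/realtime/realtime.py')
p = subprocess.Popen([realtime, '-c', 'demo-es-en.d'],
                     stdin=subprocess.PIPE, stdout=subprocess.PIPE,
                     universal_newlines=True)

# Request a translation in the context "editor1" and read the result
p.stdin.write('TR editor1 ||| esta es una frase de prueba\n')
p.stdin.flush()
print(p.stdout.readline().strip())

# Feed back a post-edited translation so the models in "editor1" adapt
p.stdin.write('LEARN editor1 ||| esta es una frase de prueba ||| this is a test sentence\n')
p.stdin.flush()

# Closing standard input ends the session
p.stdin.close()
p.wait()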

8. Python API

Realtime can be easily integrated with other systems at the API level. realtime.py is a simple example of using the API to provide a translation service over standard IO.

Make sure Realtime is on your PYTHONPATH and import the rt module:
import rt
Instantiate a RealtimeTranslator:
translator = rt.RealtimeTranslator(configdir, tmpdir='/tmp', cache_size=5, norm=False)
where configdir is the packaged system directory from section 5 (demo-es-en.d in this tutorial) and norm corresponds to the -n flag (tokenize input, detokenize output); tmpdir and cache_size are optional and default to the values shown.

The commands handled by realtime.py correspond to method calls on RealtimeTranslator.

Translate a sentence:
translation = translator.translate('Translate this sentence', ctx_name=None)
Add data to models:
translator.learn('Source sentence', 'Target sentence', ctx_name=None)
Save state to a file or standard out (None). If passed a StringIO object, state lines will be written to the object, resulting in an in-memory representation of state that is useful for operations such as storing state in a database:
translator.save_state(file_or_stringio=None, ctx_name=None)
Drop state, freeing system resources:
translator.drop_ctx(ctx_name=None)
Load state from a file or standard in (None). If passed a StringIO object, state lines will be read from its getvalue() method. For example, state could be read from a database, written to a StringIO object, and passed to load_state():
translator.load_state(file_or_stringio=None, ctx_name=None)
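For example, the in-memory round trip might look like the following sketch (hypothetical context name, with an already-instantiated translator as above):
from io import StringIO  # on Python 2, StringIO.StringIO may be needed instead

# Capture the context's incremental state in memory rather than in a file
state = StringIO()
translator.save_state(file_or_stringio=state, ctx_name='editor1')

# state.getvalue() could now be stored in a database; later, restore it
translator.load_state(file_or_stringio=state, ctx_name='editor1')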
List active contexts:
ctx_name_list = translator.list_ctx().split(' ||| ')[1].split()
Shut down the decoder (necessary before exiting your program):
translator.close()
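Putting these calls together, a minimal adaptive loop might look like the following sketch (hypothetical sentences and context name; it assumes the packaged demo-es-en.d system and the environment from section 0):
import rt

# Load the packaged system; norm=True tokenizes input and detokenizes output
translator = rt.RealtimeTranslator('demo-es-en.d', norm=True)
try:
    for source, post_edit in [('Esta es una prueba.', 'This is a test.')]:
        hypothesis = translator.translate(source, ctx_name='editor1')
        # After a human post-edits the hypothesis, feed the correction back
        translator.learn(source, post_edit, ctx_name='editor1')
finally:
    translator.close()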
For additional information and examples, see realtime.py and rt/rt.py.