This page contains resources for machine translation research developed
by members of Noah's ARK in the
Language Technologies Institute at
Carnegie Mellon University.

Specifically, we provide scripts that perform paired bootstrap resampling (Koehn, 2004) for three commonly used MT automatic evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), and METEOR (Banerjee and Lavie, 2005). All of these metrics compute document-level scores from a small amount of information per segment, so any number of paired-sample comparisons can be run quickly once the segment-level information is available. The process therefore has two steps: (1) compute segment-level statistics from the hypothesis and reference SGML files for the two systems being compared, writing them to two files; and (2) read in the segment-level statistics for both systems and perform bootstrap resampling to test for significance.
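Because these metrics decompose into per-segment sufficient statistics, a document-level score can be recomputed cheaply for any resampled set of segments. As a rough illustration of this idea (a Python sketch, not the Perl scripts from the archive; the dictionary layout and field names are assumptions for illustration), the function below assembles corpus-level BLEU from per-segment clipped n-gram match counts:

```python
import math


def corpus_bleu(segment_stats, max_n=4):
    """Compute corpus-level BLEU from per-segment statistics.

    Each entry of segment_stats is assumed to hold, for n = 1..max_n, the
    clipped n-gram match count and the total hypothesis n-gram count, plus
    the hypothesis length and the (closest) reference length.
    """
    match = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for s in segment_stats:
        for n in range(max_n):
            match[n] += s["match"][n]
            total[n] += s["total"][n]
        hyp_len += s["hyp_len"]
        ref_len += s["ref_len"]
    # Geometric mean of the corpus-level modified n-gram precisions.
    log_prec = sum(math.log(match[n] / total[n]) for n in range(max_n)) / max_n
    # Brevity penalty for hypotheses shorter than the references.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

The key point for the bootstrap is that only the summed counts matter: resampling segments and rescoring requires no re-tokenization or n-gram matching.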

For step (1), we make minimal changes to the original evaluation metric scripts so that they print segment-level information. Step (2) is handled by a separate script that runs the test: it samples segments from the documents with replacement and computes document-level scores on each set of samples, reusing small sections of the original scripts for the document-level computation. Sampling proceeds for a user-specified number of iterations, and the winning system is recorded on each iteration for each metric. Given a p-value, significance is then assessed, for each metric, from the fraction of samples in which system 1 performed better than system 2.
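The resampling loop of step (2) can be sketched as follows. This is a schematic reimplementation of the test described above, not the distributed script; the function names are illustrative, and in the real scripts the score function would be the BLEU/NIST/METEOR document-level computation over aggregated segment statistics.

```python
import random


def paired_bootstrap(stats1, stats2, score_fn, num_samples=1000, seed=0):
    """Fraction of bootstrap samples in which system 1 scores above system 2.

    stats1, stats2: parallel lists of per-segment statistics (one entry per
    segment) for the two systems, computed on the same test set.
    score_fn: computes a document-level score from a list of segment statistics.
    """
    assert len(stats1) == len(stats2)
    rng = random.Random(seed)
    n = len(stats1)
    wins = 0
    for _ in range(num_samples):
        # Draw segment indices with replacement; use the SAME indices for
        # both systems so that the comparison is paired.
        idx = [rng.randrange(n) for _ in range(n)]
        if score_fn([stats1[i] for i in idx]) > score_fn([stats2[i] for i in idx]):
            wins += 1
    return wins / num_samples
```

System 1 would then be declared significantly better at level p when the returned fraction is at least 1 - p (e.g., at least 0.95 for p = 0.05).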

All scripts are available in the following tar.gz file:

paired_bootstrap.tar.gz

The archive contains the following files:

- mteval-v11b-sig.pl -- modified version of NIST BLEU v11b evaluation script that can print segment-level statistics
- meteor-sig.pl -- modified version of METEOR v0.6 evaluation script to do the same
- paired_bootstrap_resampling_nistbleu.pl -- performs paired bootstrap resampling for the NIST and BLEU metrics given two files containing segment-level statistics computed using the modified scripts above. The number of samples and the p-value for the test are given as command-line parameters.
- paired_bootstrap_resampling_meteor.pl -- does the same for METEOR (requires an additional parameter that indicates the target language)
- compute_nistbleu_from_stats.pl -- computes NIST and BLEU scores from a file containing segment-level statistics. Not used by the paired_bootstrap_resampling_nistbleu.pl script above, but useful as a check to ensure that the statistics were output correctly by the modified NIST BLEU script.
- compute_meteor_from_stats.pl -- does the same for METEOR

Below are links to the original scripts for computing these metrics.

- NIST/BLEU: www.nist.gov/speech/tests/mt/2008/scoring.html
- METEOR: www.cs.cmu.edu/~alavie/METEOR/

References:

- Banerjee, S. and A. Lavie. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. Proceedings of the Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at ACL 2005.
- Doddington, G. Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. HLT 2002.
- Koehn, P. Statistical Significance Tests for Machine Translation Evaluation. EMNLP 2004.
- Papineni, K., S. Roukos, T. Ward, and W. Zhu. Bleu: a Method for Automatic Evaluation of Machine Translation. ACL 2002.