ARK Machine Translation Research

This page contains resources for machine translation research developed by members of Noah's ARK in the Language Technologies Institute at Carnegie Mellon University.

Code for Statistical Significance Testing for MT Evaluation Metrics

IMPORTANT NOTE: A newer version of this code, compatible with mteval-v13a, is available here. The version below may produce errors because of a bug in mteval-v11b, which has been corrected as of mteval-v13a. Thanks to Alex Fraser for pointing this out.

This page contains scripts to perform paired bootstrap resampling (Koehn, 2004) for three commonly used automatic MT evaluation metrics: BLEU (Papineni et al., 2002), NIST (Doddington, 2002), and METEOR (Banerjee and Lavie, 2005). All of these metrics compute document-level scores from a small amount of information per segment, so once we have the segment-level information we can rapidly perform any number of paired-sample comparisons. We therefore proceed in two steps: (1) compute segment-level statistics from the hypothesis and reference SGML files for each of the two systems we wish to compare, writing one output file per system, and (2) read in the segment-level statistics for both systems and perform bootstrap resampling to test for significance.
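To illustrate why segment-level statistics are sufficient, here is a minimal Python sketch of how a BLEU-style document score could be computed by summing per-segment counts. It is not the code distributed on this page: the `SegStats` record and the `corpus_bleu` function are hypothetical names, and the choice of reference length is an assumption that differs between mteval versions.

```python
import math
from typing import List, NamedTuple


class SegStats(NamedTuple):
    """Hypothetical per-segment record; field names are illustrative."""
    matches: List[int]  # clipped n-gram matches for n = 1..4
    totals: List[int]   # n-gram counts in the hypothesis for n = 1..4
    hyp_len: int        # hypothesis length in tokens
    ref_len: int        # reference length in tokens (assumed precomputed)


def corpus_bleu(segments: List[SegStats], max_n: int = 4) -> float:
    """Document-level BLEU obtained by pooling segment-level counts."""
    matches = [sum(s.matches[n] for s in segments) for n in range(max_n)]
    totals = [sum(s.totals[n] for s in segments) for n in range(max_n)]
    hyp_len = sum(s.hyp_len for s in segments)
    ref_len = sum(s.ref_len for s in segments)

    # Unsmoothed BLEU is zero if any n-gram order has no matches.
    if hyp_len == 0 or min(matches) == 0:
        return 0.0

    # Geometric mean of the n-gram precisions, computed in log space.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    # Log of the brevity penalty: 0 when the hypothesis is longer than the reference.
    log_bp = min(0.0, 1.0 - ref_len / hyp_len)
    return math.exp(log_precision + log_bp)
```

Because the score depends only on sums over segments, rescoring a resampled document requires no access to the original text, which is what makes many resampling iterations cheap.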

We make minimal changes to the original evaluation metric scripts so that, in step (1), they print out segment-level information. Step (2) is performed by a separate script, which samples segments from the documents with replacement and computes document-level scores on each set of samples. Small sections of the original scripts are reused to compute these document-level scores. Sampling repeats for a user-specified number of samples, and the winning system is recorded for each metric on each iteration. Given a p-value, significance is then tested, for each metric, using the fraction of samples for which system 1 performed better than system 2.
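The resampling loop described above can be sketched in Python as follows. This is a minimal illustration rather than the distributed script: `paired_bootstrap`, `is_significant`, and `toy_ratio_score` are hypothetical names, the toy metric stands in for BLEU, NIST, or METEOR, and the decision rule (system 1 must win in at least a 1 - p fraction of samples) is an assumption in the spirit of Koehn (2004).

```python
import random
from typing import Callable, Sequence, Tuple, TypeVar

S = TypeVar("S")  # type of one segment-level statistics record


def paired_bootstrap(
    stats1: Sequence[S],
    stats2: Sequence[S],
    doc_score: Callable[[Sequence[S]], float],
    num_samples: int = 1000,
    seed: int = 0,
) -> float:
    """Return the fraction of resampled documents on which system 1 wins."""
    if len(stats1) != len(stats2):
        raise ValueError("both systems must be scored on the same segments")
    rng = random.Random(seed)
    n = len(stats1)
    wins1 = 0
    for _ in range(num_samples):
        # Draw n segment indices with replacement; the SAME indices are
        # used for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        score1 = doc_score([stats1[i] for i in idx])
        score2 = doc_score([stats2[i] for i in idx])
        if score1 > score2:
            wins1 += 1
    return wins1 / num_samples


def is_significant(win_fraction: float, p_value: float = 0.05) -> bool:
    """Assumed decision rule: system 1 wins in at least 1 - p of the samples."""
    return win_fraction >= 1.0 - p_value


def toy_ratio_score(segments: Sequence[Tuple[int, int]]) -> float:
    """Toy document-level metric: pooled matches divided by pooled totals."""
    total = sum(t for _, t in segments)
    return sum(m for m, _ in segments) / total if total else 0.0
```

With real metrics, `doc_score` would be replaced by a function that pools the segment-level statistics for that metric, and the loop would be run once per metric on the same sampled indices.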

All scripts are available in the following tar.gz file:

The files contained are listed below:

For questions, bug reports, etc., please contact Kevin Gimpel (kgimpel at

Below are links to the original scripts for computing these metrics.