Meteor 1.5: Automatic Machine Translation Evaluation System

1. About

Meteor consists of two major components: a flexible monolingual word aligner and a scorer. For machine translation evaluation, hypothesis sentences are aligned to reference sentences. Alignments are then scored to produce sentence and corpus level scores. Score and alignment information can also be used to visualize word alignments and score distributions using Meteor X-ray. For detailed information on Meteor word alignment and scoring, see Denkowski and Lavie, 2011. This paper also details the flexible matching support that allows Meteor to align words and phrases with differing surface forms.

This release includes the following software:

The Meteor MT evaluation metric
The standalone monolingual word aligner
Independently usable paraphrase tables for supported languages
The X-ray system for visualizing alignments and score distributions

Meteor is released under the GNU Lesser General Public License (LGPL) and includes some files subject to the (compatible) WordNet license. See the included COPYING files for details.

2. Supported Languages

Language support is divided into two groups. Fully supported languages include flexible word and phrase matching (at least one type of match other than exact) and language-specific parameters tuned to maximize correlation between Meteor scores and human judgments of translation quality. Partially supported languages include flexible word matching and use language-independent parameters chosen to generalize well across known languages.

Fully supported languages:

Language	Exact Match	Stem Match	Synonym Match	Paraphrase Match	Tuned Parameters
English	Yes	Yes	Yes	Yes	Yes
Arabic	Yes	No	No	Yes	Yes
Czech	Yes	No	No	Yes	Yes
French	Yes	Yes	No	Yes	Yes
German	Yes	Yes	No	Yes	Yes
Spanish	Yes	Yes	No	Yes	Yes

Partially supported languages:

Language	Exact Match	Stem Match	Synonym Match	Paraphrase Match	Tuned Parameters
Danish	Yes	Yes	No	No	LI
Dutch	Yes	Yes	No	No	LI
Finnish	Yes	Yes	No	No	LI
Hungarian	Yes	Yes	No	No	LI
Italian	Yes	Yes	No	No	LI
Norwegian	Yes	Yes	No	No	LI
Portuguese	Yes	Yes	No	No	LI
Romanian	Yes	Yes	No	No	LI
Russian	Yes	Yes	No	No	LI
Swedish	Yes	Yes	No	No	LI
Turkish	Yes	Yes	No	No	LI

Paraphrase capability can also be added to unsupported languages. If your MT system has a bilingual phrase table, you can use Parex to build paraphrase tables and use them with Meteor. For example, if you want to evaluate a system that translates into Danish and have built a paraphrase table named paraphrase-da.gz, you can use:

java -Xmx2G -jar meteor-*.jar test reference -l da \
-a paraphrase-da.gz -m 'exact stem paraphrase' -w '1.0 0.5 0.5'

This tells Meteor to use the paraphrase table (-a paraphrase-da.gz), add the paraphrase module (-m 'exact stem paraphrase'), and add a weight for paraphrases (-w '1.0 0.5 0.5').
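When the command above is generated from a script, it helps to build the argument list in one place. The following Python 3.8+ sketch does that using only the flags shown above. The function name meteor_command and the jar name meteor-1.5.jar are hypothetical, and the check that each module has exactly one weight is an assumption inferred from the example rather than a documented rule.

```python
import shlex

def meteor_command(jar, test, reference, lang, modules, weights, paraphrase_file=None):
    """Build an argument list for the Meteor scorer with explicit modules and weights."""
    # Assumption: one weight per module, as in the Danish example.
    if len(modules) != len(weights):
        raise ValueError("need exactly one weight per module")
    cmd = ["java", "-Xmx2G", "-jar", jar, test, reference, "-l", lang]
    if paraphrase_file is not None:
        cmd += ["-a", paraphrase_file]
    # -m and -w each take a single space-separated string.
    cmd += ["-m", " ".join(modules), "-w", " ".join(str(w) for w in weights)]
    return cmd

# "meteor-1.5.jar" is a placeholder; use the jar name from your download.
cmd = meteor_command("meteor-1.5.jar", "test", "reference", "da",
                     ["exact", "stem", "paraphrase"], [1.0, 0.5, 0.5],
                     paraphrase_file="paraphrase-da.gz")
print(shlex.join(cmd))
```

The returned list can be passed directly to subprocess.run, which avoids shell quoting problems with the multi-word -m and -w values.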

Other Languages:

Meteor is capable of scoring UTF-8 encoded data for any language. Specifying language "other" will automatically select exact matches only for alignment and language-independent scoring parameters. Remember to pre-segment, tokenize, and lowercase text as needed.

java -Xmx2G -jar meteor-*.jar test reference -l other

3. Running Meteor

To call Meteor, run the following:

java -Xmx2G -jar meteor-*.jar

Running Meteor with no arguments prints the following help message:

Meteor version 1.4

Usage: java -Xmx2G -jar meteor-*.jar <test> <reference> [options]

Options:
-l language                     Fully supported: en cz de es fr ar
                                Supported with language-independent parameters:
                                  da fi hu it nl no pt ro ru se tr
-t task                         One of: rank util adq hter li tune
                                  util implies -ch
-p 'alpha beta gamma delta'     Custom parameters (overrides default)
-m 'module1 module2 ...'        Specify modules (overrides default)
                                  Any of: exact stem synonym paraphrase
-w 'weight1 weight2 ...'        Specify module weights (overrides default)
-r refCount                     Number of references (plaintext only)
-x beamSize                     (default 40)
-s wordListDirectory            (if not default for language)
-d synonymDirectory             (if not default for language)
-a paraphraseFile               (if not default for language)
-f filePrefix                   Prefix for output files (default 'meteor')
-q                              Quiet: Segment scores to stderr, final to stdout,
                                  no additional output (plaintext only)
-ch                             Character-based precision and recall
-norm                           Tokenize / normalize punctuation and lowercase
                                  (Recommended unless scoring raw output with
                                   pretokenized references)
-lower                          Lowercase only (not required if -norm specified)
-noPunct                        Do not consider punctuation when scoring
                                  (Not recommended unless special case)
-sgml                           Input is in SGML format
-mira                           Input is in MIRA format
                                  (Use '-' for test and reference files)
-vOut                           Output verbose scores (P / R / frag / score)
-ssOut                          Output sufficient statistics instead of scores
-writeAlignments                Output alignments annotated with Meteor scores
                                  (written to <prefix>-align.out)

Sample options for plaintext: -l <lang> -norm
Sample options for SGML: -l <lang> -norm -sgml
Sample options for raw output / pretokenized references: -l <lang> -lower

See README file for additional information

The simplest way to run Meteor is as follows:

java -Xmx2G -jar meteor-*.jar test reference -l en -norm

This tells Meteor to score the file "test" against "reference", where test and reference are UTF-8 encoded files that contain one sentence per line. The "-l en" option tells Meteor to use settings for English. The -norm flag tells Meteor to apply language-specific text normalization before scoring. These are the ideal settings for which language-specific parameters are tuned.

Important note: If you are scoring text in a partially supported language, do not use the -norm flag, as Meteor has no normalization rules for these languages. Instead, use your own tools for segmenting, tokenizing, and lowercasing (if desired) the test and reference text prior to scoring. Meteor will warn if the -norm flag is used with unsupported languages. For example, to score Danish text, pre-tokenize the files and run:

java -Xmx2G -jar meteor-*.jar test.da.tok reference.da.tok -l da
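If no tokenizer is at hand, a deliberately naive stand-in can be sketched in a few lines of Python 3. This is not Meteor's normalizer and not a substitute for a proper language-specific tokenizer: it only separates punctuation from words and lowercases. The function name simple_tokenize is hypothetical.

```python
import re

# A word is a run of letters/digits/underscore; any other
# non-space character becomes its own token.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(line, lowercase=True):
    """Split punctuation from words and optionally lowercase one line of text."""
    text = " ".join(TOKEN_RE.findall(line))
    return text.lower() if lowercase else text

print(simple_tokenize("Hej, verden!"))  # hej , verden !
```

Apply the same preprocessing to both the test and reference files so that the two sides stay comparable.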

To score the example files included with Meteor, use the following:

java -Xmx2G -jar meteor-*.jar example/xray/system1.hyp example/xray/reference -l en -norm

You should see the following output:

Meteor version: 1.4

Eval ID:        meteor-1.4-wo-en-norm-0.85_0.2_0.6_0.75-ex_st_sy_pa-1.0_0.6_0.8_0.6

Language:       English
Format:         plaintext
Task:           Ranking
Modules:        exact stem synonym paraphrase
Weights:        1.0 0.6 0.8 0.6
Parameters:     0.85 0.2 0.6 0.75

Segment 1 score:        0.447752250844953
Segment 2 score:        0.4284116369815996
Segment 3 score:        0.2772888474043816
Segment 4 score:        0.39587671218995263
Segment 5 score:        0.34983532103052495
.
.
.
Segment 2485 score:     0.29553941444479426
Segment 2486 score:     0.27829272093582047
Segment 2487 score:     0.2825995999223381
Segment 2488 score:     0.32037812996981163
Segment 2489 score:     0.33120147321343485

System level statistics:


           Test Matches                  Reference Matches
Stage      Content  Function    Total    Content  Function    Total
1            16268     20842    37110      16268     20842    37110
2              485        26      511        489        22      511
3              820       119      939        845        94      939
4             3813      3162     6975       3954      2717     6671
Total        21386     24149    45535      21556     23675    45231

Test words:             61600
Reference words:        62469
Chunks:                 20118
Precision:              0.6767347074578696
Recall:                 0.6500539115850005
f1:                     0.663126043401952
fMean:                  0.6539211143997783
Fragmentation penalty:  0.5099053526424513

Final score:            0.3204832379614146

The output contains the following in order:

Meteor version
Eval ID, a string that uniquely identifies all version, setting, and parameter information to ensure that other data sets scored with Meteor can be scored consistently and comparably
Header describing settings and parameters
Segment (sentence) level scores, one per line
Match statistics
Summary statistics
Final score
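For scripts that post-process this report, the segment and final scores can be pulled out with two regular expressions. The Python 3 sketch below assumes only the "Segment N score:" and "Final score:" line shapes shown in the sample output above; the function name parse_meteor_output is hypothetical.

```python
import re

SEGMENT_RE = re.compile(r"^Segment\s+(\d+)\s+score:\s+(\S+)$")
FINAL_RE = re.compile(r"^Final score:\s+(\S+)$")

def parse_meteor_output(text):
    """Return ({segment_number: score}, final_score) from a plaintext Meteor report."""
    segments = {}
    final = None
    for line in text.splitlines():
        line = line.strip()
        match = SEGMENT_RE.match(line)
        if match:
            segments[int(match.group(1))] = float(match.group(2))
            continue
        match = FINAL_RE.match(line)
        if match:
            final = float(match.group(1))
    return segments, final

# Lines copied from the sample output above.
sample = """\
Segment 1 score:        0.447752250844953
Segment 2 score:        0.4284116369815996

Final score:            0.3204832379614146
"""
segments, final = parse_meteor_output(sample)
print(len(segments), final)
```

If only the scores are needed, the -q option described in the next section is usually simpler than parsing the full report.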

4. Meteor Options

For the majority of scoring scenarios, only the -l and -norm options should be used. For more advanced usage, the full list of options follows.

Language: -l lang

Use settings for specified language. Lang can be either the language name or two letter code. See the supported language list.

Task: -t task

Use a different pre-defined set of parameters for scoring (currently limited to English):

rank: parameters tuned to human rankings from WMT09 and WMT10
adq: parameters tuned to adequacy scores from NIST Open MT 2009
hter: parameters tuned to HTER scores from GALE P2 and P3
li: language-independent parameters

Parameters: -p 'alpha beta gamma delta'

Set parameters manually. Parameter string should be quoted.
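To see what alpha, beta, and gamma do, the summary statistics from the sample output in Section 3 can be recombined by hand. The Python 3 sketch below assumes the formulas from the published metric description (fMean = P*R / (alpha*P + (1-alpha)*R), penalty = gamma * frag^beta, score = fMean * (1 - penalty)) and assumes that frag is the number of chunks divided by the mean of the test and reference match totals. Delta weights content against function words inside precision and recall, which are taken as given here. The function name is hypothetical.

```python
def meteor_final_score(precision, recall, frag, alpha, beta, gamma):
    """Combine precision, recall, and fragmentation into (fMean, penalty, score)."""
    f_mean = precision * recall / (alpha * precision + (1 - alpha) * recall)
    penalty = gamma * frag ** beta
    return f_mean, penalty, f_mean * (1 - penalty)

# Values copied from the sample output in Section 3.
precision = 0.6767347074578696
recall = 0.6500539115850005
chunks = 20118
avg_matches = (45535 + 45231) / 2   # mean of test and reference match totals
alpha, beta, gamma = 0.85, 0.2, 0.6  # first three entries of "Parameters"

f_mean, penalty, score = meteor_final_score(
    precision, recall, chunks / avg_matches, alpha, beta, gamma)
print(f_mean, penalty, score)
```

Under these assumptions the three printed values agree with the reported fMean, fragmentation penalty, and final score to several decimal places. Raising alpha shifts fMean toward recall, while gamma caps the penalty and beta controls how quickly it grows with fragmentation.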

Modules: -m 'module1 module2 ...'

Set modules manually. Options are: exact stem synonym paraphrase. See supported languages. Module string should be quoted.

Weights: -w 'weight1 weight2 ...'

Set weights for each match type manually. Weight string should be quoted.

Reference Count: -r refCount

Specify N, the number of reference sentences for each hypothesis. For N references, it is assumed that the reference file will be N times the length of the test file, containing sets of N references in order. For example, if N=4, reference lines 1-4 will correspond to test line 1, 5-8 to line 2, etc.
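The expected layout of the reference file can be checked before scoring. This Python 3 sketch groups reference lines into consecutive sets of N, following the description above; the function name group_references is hypothetical.

```python
def group_references(reference_lines, ref_count):
    """Split reference lines into consecutive groups of ref_count, one group per test line."""
    if ref_count < 1:
        raise ValueError("ref_count must be at least 1")
    if len(reference_lines) % ref_count != 0:
        raise ValueError("reference file length is not a multiple of ref_count")
    return [reference_lines[i:i + ref_count]
            for i in range(0, len(reference_lines), ref_count)]

# Two test lines with N=4 references each: lines 1-4, then lines 5-8.
refs = ["ref line %d" % i for i in range(1, 9)]
groups = group_references(refs, 4)
print(len(groups))   # 2
print(groups[1][0])  # ref line 5
```

The number of groups should equal the number of lines in the test file.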

Beam Size: -x

This number, set to 40 by default, is used to limit the beam size when searching for the highest scoring alignment. As parameters are tuned for a beam size of 40, simply increasing this number does not necessarily produce more reliable scores.

Word List Directory: -s wordListDirectory

This option should only be used to test external function word lists. By default, the included function word lists will be used.

Synonymy Directory: -d synonymDirectory

This option should only be used to test external synonymy databases. By default, the included synonymy database will be used.

Paraphrase File: -a paraphraseFile

This option should only be used to test external paraphrase tables. By default, the included paraphrase tables will be used. To build your own paraphrase tables, use Parex.

File Prefix: -f filePrefix

If alignments are to be written, they are written to <prefix>-align.out. In SGML mode, files produced will be <filePrefix>-seg.scr, <filePrefix>-doc.scr, <filePrefix>-sys.scr. The default prefix is "meteor".

Quiet: -q

Sentence scores to stderr, one per line. Corpus score to stdout, one line total. No additional output.

Character-based: -ch

Calculate character-based precision and recall. Alignment is still word and phrase-level. Fragmentation penalty is still word and phrase-level.

Normalize: -norm

Tokenize and lowercase input lines and normalize punctuation to improve scoring accuracy. This option is highly recommended unless scoring raw system output against pretokenized references.

Lowercase: -lower

Lowercase input lines (not required if -norm also specified). This is most commonly used when scoring cased, tokenized outputs against pretokenized references.

Ignore Punctuation: -noPunct

If specified, punctuation symbols will be removed before scoring. This is generally not recommended as parameters are tuned with punctuation included.

SGML: -sgml

This specifies that input is in SGML format. In addition to summary output, the following files are produced:

meteor-seg.scr contains lines: testset system document segment score
meteor-doc.scr contains lines: testset system document score
meteor-sys.scr contains lines: testset system score

The prefix can be changed with the -f option.

Stdio Format: -stdio

Input is from stdin using the format described below. Stats and scores written to stdout. Use "-" for test and reference files. Input lines are of two types, SCORE and EVAL.

SCORE ||| reference 1 words ||| reference n words ||| hypothesis words

Scores hypothesis against one or more references and returns line of sufficient statistics.

EVAL ||| stats

Calculates final scores using output of SCORE lines. Meteor exits on end-of-file.

Verbose Output: -vOut

Output verbose scores (Precision, Recall, Fragmentation, Score) in place of regular scores.

Sufficient Statistics: -ssOut

This option outputs sufficient statistics in place of scores and omits all other output. Statistics for a single hypothesis/reference instance are:

tstLen refLen stage1tstTotalMatches stage1refTotalMatches
stage1tstWeightedMatches stage1refWeightedMatches s2tTM s2rTM s2tWM
s2rWM s3tTM s3rTM s3tWM s3rWM s4tTM s4rTM s4tWM s4rWM chunks lenCost
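A line of sufficient statistics is easier to work with once its values are attached to the field names listed above. The Python 3 sketch below assumes the values are whitespace-separated numbers in exactly that order; the function name parse_stats is hypothetical, and the input line is a placeholder (the numbers 0 to 19), not real Meteor output.

```python
# Field names in the order listed above.
STAT_FIELDS = (
    "tstLen refLen stage1tstTotalMatches stage1refTotalMatches "
    "stage1tstWeightedMatches stage1refWeightedMatches "
    "s2tTM s2rTM s2tWM s2rWM s3tTM s3rTM s3tWM s3rWM "
    "s4tTM s4rTM s4tWM s4rWM chunks lenCost"
).split()

def parse_stats(line):
    """Map one whitespace-separated statistics line onto the documented field names."""
    values = [float(v) for v in line.split()]
    if len(values) != len(STAT_FIELDS):
        raise ValueError("expected %d fields, got %d" % (len(STAT_FIELDS), len(values)))
    return dict(zip(STAT_FIELDS, values))

# Placeholder input: 20 dummy numbers standing in for a real statistics line.
stats = parse_stats(" ".join(str(i) for i in range(20)))
print(stats["tstLen"], stats["chunks"], stats["lenCost"])
```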

Write Alignments: -writeAlignments

Write alignments between hypotheses and references to meteor-align.out or <prefix>-align.out when file prefix is specified. Alignments are written in Meteor format, annotated with Meteor statistics:

Title precision recall fragmentation score
sentence1
sentence2
Line2Start:Length Line1Start:Length Module Score
...

5. Standalone Meteor Aligner

Meteor includes a monolingual word aligner that can be run independently of the scorer. To run the aligner, use:

java -Xmx2G -cp meteor-*.jar Matcher

Running the aligner with no arguments shows the help message:

Meteor Aligner version 1.4
Usage: java -Xmx2G -cp meteor-*.jar Matcher <test> <reference> [options]

Options:
-l language                     One of: en da de es fi fr hu it nl no pt ro ru se tr
-m 'module1 module2 ...'        Specify modules (overrides default)
                                  One of: exact stem synonym paraphrase
-t type                         Alignment type (coverage vs accuracy)
                                  One of: maxcov maxacc
-x beamSize                     Keep speed reasonable
-d synonymDirectory             (if not default)
-a paraphraseFile               (if not default)

See README file for examples

Most options are the same as in the Meteor scorer. The additional option is -t, which specifies whether alignments should maximize coverage (comparable to recall) or accuracy (comparable to precision).

Sentences are read from test and reference files, one per line, and alignments are written to stdout using the Meteor format:

Alignment <line N>
sentence1
sentence2
Line2Start:Length	Line1Start:Length	Module		Score
...

Important note: the Meteor Aligner does not apply any normalization to input text. Text should be segmented, tokenized, and lowercased as desired prior to Meteor alignment.
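The alignment lines in the format above can be read into a small record type for downstream processing. The Python 3 sketch below assumes each alignment line holds four whitespace-separated fields in the order Line2Start:Length, Line1Start:Length, Module, Score, with integer positions. The module field is kept as a string because its encoding is not specified here. The names AlignmentLink and parse_link are hypothetical, and the sample line is a made-up placeholder.

```python
from typing import NamedTuple

class AlignmentLink(NamedTuple):
    line2_start: int
    line2_length: int
    line1_start: int
    line1_length: int
    module: str
    score: float

def parse_link(line):
    """Parse one 'Line2Start:Length Line1Start:Length Module Score' line."""
    span2, span1, module, score = line.split()
    start2, length2 = (int(x) for x in span2.split(":"))
    start1, length1 = (int(x) for x in span1.split(":"))
    return AlignmentLink(start2, length2, start1, length1, module, float(score))

# Placeholder line, not taken from real aligner output.
link = parse_link("3:2\t4:1\t0\t1.0")
print(link.line2_start, link.line2_length, link.line1_start, link.line1_length)
```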

6. Standalone Word Stemmer

Meteor includes a standalone word stemmer for supported languages. To run the stemmer, use:

java -cp meteor-*.jar Stemmer

Running the stemmer with no arguments shows the help message:

Snowball stem some text in a supported language
Languages: en da de es fi fr hu it nl no pt ro ru se tr
Usage: Stemmer lang < in > out

The stemmer reads lines from stdin and writes to stdout. Each word in the input is stemmed using the Snowball stemmer for the specified language.

Important note: the Meteor Stemmer does not apply any normalization to input text. Text should be segmented, tokenized, and lowercased as desired prior to stemming.

7. Integrating Meteor with your Software

The simplest way to integrate Meteor with your software involves using the -stdio option:

java -Xmx2G -jar meteor-*.jar - - -l en -norm -stdio

This tells Meteor to use the English settings, normalize text, and use stdin/stdout. You can then write lines of the following form to Meteor's stdin:

SCORE ||| reference 1 words ||| reference n words ||| hypothesis words

This scores a hypothesis against one or more references and returns a line of sufficient statistics.

EVAL ||| stats

This reads a line of sufficient statistics and produces a final score. Meteor exits on end-of-file.

Languages such as C++, Python, and Perl can open an external process and communicate with its stdin and stdout. For more information, see the documentation for process control for your language.
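As an illustration of the external-process approach, the Python 3.7+ sketch below builds SCORE and EVAL lines in the form shown above and wraps a Meteor process started with the -stdio options. It assumes that Meteor answers each input line with exactly one output line and returns those lines unparsed. The names score_line, eval_line, and MeteorStdio are hypothetical, and the jar path must be supplied by the caller.

```python
import subprocess

def score_line(references, hypothesis):
    """Format a SCORE request: references first, hypothesis last."""
    return " ||| ".join(["SCORE", *references, hypothesis])

def eval_line(stats):
    """Format an EVAL request from a line of sufficient statistics."""
    return "EVAL ||| " + stats.strip()

class MeteorStdio:
    """Thin wrapper around a Meteor process running with -stdio."""

    def __init__(self, jar, lang="en"):
        # Same options as the command above; '-' stands for stdin/stdout.
        self.proc = subprocess.Popen(
            ["java", "-Xmx2G", "-jar", jar, "-", "-", "-l", lang, "-norm", "-stdio"],
            stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True, bufsize=1)

    def _ask(self, line):
        # Assumption: one line of output per line of input.
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().strip()

    def score(self, references, hypothesis):
        """Return Meteor's reply to EVAL for one hypothesis, as raw text."""
        stats = self._ask(score_line(references, hypothesis))
        return self._ask(eval_line(stats))

    def close(self):
        # Meteor exits on end-of-file.
        self.proc.stdin.close()
        self.proc.wait()

print(score_line(["the cat sat", "a cat sat"], "the cat sits"))
```

Keeping one process open and sending many SCORE lines avoids reloading Meteor's resources for every sentence.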

If your software is written in Java, you can use the Meteor API directly:

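import edu.cmu.meteor.scorer.MeteorConfiguration;
import edu.cmu.meteor.scorer.MeteorScorer;
import edu.cmu.meteor.util.Constants;

// MeteorExample is an arbitrary class name used to make the snippet compilable
public class MeteorExample {
    public static void main(String[] args) {
        // Configure the scorer for English, normalizing text but keeping punctuation
        MeteorConfiguration config = new MeteorConfiguration();
        config.setLanguage("en");
        config.setNormalization(Constants.NORMALIZE_KEEP_PUNCT);
        MeteorScorer scorer = new MeteorScorer(config);
        // Score one hypothesis against one reference
        double score = scorer.getMeteorStats("test string", "reference string").score;
        System.out.println(score);
    }
}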

Remember to add meteor-*.jar to your classpath. See the source files for MeteorConfiguration and MeteorScorer for additional information.

8. Meteor X-ray

X-ray visualizes alignments and scores of one or more MT systems against a set of reference translations. When scoring translation hypotheses with Meteor, use the -writeAlignments option to produce alignment files annotated with Meteor statistics. X-Ray uses these files to produce graphical representations of alignment matrices and score distributions via XeTeX and Gnuplot. Final output is in PDF form with intermediate LaTeX and plot files preserved for easy inclusion in reports and presentations.

Requirements:

Python 2.6 or later 2.x (http://www.python.org/)
XeTeX 2009 (http://www.tug.org/texlive/)
Gnuplot 4.4 or later (http://www.gnuplot.info/)
GNU Unifont (Optional, used for non-western languages) (http://unifoundry.com/unifont.html)

For example, on Ubuntu Linux, install the following packages:

sudo apt-get install python texlive-full gnuplot unifont

Setup:

If XeTeX and Gnuplot are installed somewhere other than /usr/bin, edit xray/Generation.py to include the correct locations:

xelatex_cmd = '/usr/bin/xelatex'
gnuplot_cmd = '/usr/bin/gnuplot'

Usage:

Run X-ray with the following:

python xray/xray.py

Running X-Ray with no arguments shows the help message:

MX: X-Ray your translation output
Usage: xray.py [options] <align.out> [align.out2 ...]

Options:
  -h, --help            show this help message and exit
  -c, --compare         compare alignments of two result sets (only first 2
                        input files used)
  -n, --no-align        do not visualize alignments
  -x MAX, --max=MAX     max alignments to sample (default use all)
  -p PRE, --prefix=PRE  prefix for output files (default mx)
  -l LBL, --label=LBL   optional system label list, comma separated:
                        label1,label2,...
  -u, --unifont         use unifont (use for non-western languages)

Example usage: score and visualize the hypotheses from system1 and system2 in the example/xray directory.

Score system1 with Meteor using the following options:

java -Xmx2G -jar meteor-*.jar example/xray/system1.hyp example/xray/reference \
-norm -writeAlignments -f system1

-norm: tokenize and normalize before scoring
-writeAlignments: write out sentence alignments used to calculate Meteor scores
-f system1: write alignments to system1-align.out

Visualize alignments and scores of system1 with Meteor X-Ray:

python xray/xray.py -p system1 system1-align.out

-p system1: prefix output files with 'system1'
system1-align.out: output from Meteor

Files produced:

system1-align-system-1.pdf: visualized Meteor alignments for each sentence
system1-score.pdf: visualized distributions of Meteor statistics
system1-files: LaTeX and gnuplot files used to produce PDFs

Score system2 with Meteor:

java -Xmx2G -jar meteor-*.jar example/xray/system2.hyp example/xray/reference \
-norm -writeAlignments -f system2

Compare performances of system1 and system2:

python xray/xray.py -c -p compare system1-align.out system2-align.out

-c: compare two Meteor outputs
-p compare: prefix output with 'compare'

Files produced:

compare-align.pdf: visualized alignments for both systems overlain
compare-score.pdf: score distributions for both systems
compare-files: LaTeX and gnuplot files

Additional systems:

To compare any number of systems, score each with Meteor (as above) and pass the align.out files to X-Ray. Without the -c flag, X-Ray will generate individual alignment matrices for each system and a single score PDF with score distributions for all systems. This is useful for comparing many configurations of the same system.

9. Training Meteor Parameters

Meteor parameters can be optimized to maximize agreement with human judgments of translation quality. The most frequently used evaluation task is ranking, where metrics should replicate human preferences between multiple translation hypotheses. Training Meteor to ranking data requires the following:

System outputs, plain text, one sentence per line: sys1, sys2, ... (typically named after site)
Reference translation: ref
A file of rankings: rank
- Lines are formatted as (tab delimited):
- This means that for segment id, system1 is ranked better than system2
- The same hypothesis pair can have multiple judgments.
- Ties should be discarded prior to training.

These files can exist for multiple language pairs with the same target language. For example, in WMT English, we have fr-en.sys1, fr-en.sys2, fr-en.ref, fr-en.rank, de-en.sys1, de-en.sys2, de-en.ref, de-en.rank, ... Meteor will consider all data files in the training directory when optimizing parameters. Meteor outputs the names of processed files to verify the data being used.

To prepare for training, convert the input files to SGML format where needed. (Plaintext would be fine in most cases, but datasets distributed as SGML aren't required to have segments in consistent order, which can create problems for older data.) The following example uses the included data in example/train:

mkdir my-train-dir
python scripts/sgmlize.py t < example/train/fr-en.sys1 > my-train-dir/fr-en.sys1.sgm
python scripts/sgmlize.py t < example/train/fr-en.sys2 > my-train-dir/fr-en.sys2.sgm
python scripts/sgmlize.py r < example/train/fr-en.ref > my-train-dir/fr-en.ref.sgm
cp example/train/fr-en.rank my-train-dir

Since parallel training requires loading several copies of Meteor into memory, filter the paraphrase table to minimize memory usage:

java -cp meteor-*.jar FilterParaphrase data/paraphrase-en.gz filtered.gz \
example/train/fr-en.ref

Meteor trains parameters by calculating sufficient statistics for all hypotheses and running an exhaustive grid search over rescorings. Every point explored is written out. To run the Trainer directly on one CPU:

java -cp meteor-*.jar Trainer rank my-train-dir -a filtered.gz > train.out

To find the best training point, sort the output:

sort -gr train.out > train.out.sort

The point with the highest correlation is the first line of the sorted file.

Running on multiple CPUs greatly improves the speed of Meteor training. To run the grid search in parallel, use the meteor_shower script:

python scripts/meteor_shower.py meteor-*.jar en 4 rank my-train-dir work-dir 8 -a `pwd`/filtered.gz

This will keep 8 trainers running in parallel. Make sure to specify an absolute path for the paraphrase file. The results will be written to work-dir, along with a script for sorting the results.

10. Meteor Universal: Evaluation for Any Language

Meteor now supports language-specific evaluation for any target language for which there is enough data to build a standard phrase-based machine translation system. To build language-specific resources (paraphrase table and function word list), run the new_language.py script with your parallel data and a Moses-format phrase table:

python scripts/new_language.py out-dir corpus.f corpus.e phrase-table.gz [target-corpus.e]

Paraphrases will be extracted matching the target corpus (this can be a collection of relevant dev sets). If no target corpus is provided, the first 10,000 lines of the English corpus will be used (in practice this works adequately). Meteor can then be run with these files:

java -Xmx2G -jar meteor-*.jar test reference -new out-dir/meteor-files

Data should be pre-tokenized. Meteor will lowercase all data for evaluation (-new implies -lower). A universal parameter set will be used. These parameters are tuned on over 100,000 binary ranking judgments across 8 language directions and encode the following general properties:

Preference for recall over precision
Preference for word choice over word order
Preference for correct content words over correct function words

11. Special Thanks

Authors of previous Meteor versions:

Abhaya Agarwal
Satanjeev "Bano" Banerjee
Alon Lavie

Contributors to previous Meteor versions:

Rachel Reynolds
Kenji Sagae
Jeremy Naman
Shyamsundar Jayaraman

Code by Michael Denkowski
