The METEOR MT Evaluation System, Version 1.2

Michael Denkowski (mdenkows at cs dot cmu dot edu)
Abhaya Agarwal (abhayaa at cs dot cmu dot edu)
Satanjeev "Bano" Banerjee (satanjeev at cmu dot edu)
Alon Lavie (alavie at cs dot cmu dot edu)

Carnegie Mellon University
Pittsburgh, PA, USA

Note: See xray/README for directions on using METEOR X-Ray

1. Introduction:
================

The METEOR metric evaluates a machine translation hypothesis against a
reference translation by calculating a similarity score based on an alignment
between the two strings. When multiple references are provided, the hypothesis
is scored against each, and the reference producing the highest score is used.

Alignments are formed according to the following types of matches between
strings:

Exact:      Words are matched if and only if their surface forms are
            identical.

Stem:       Words are stemmed using a language-appropriate Snowball stemmer
            and matched if the stems are identical.

Synonym:    Words are matched if they are both members of a synonym set
            according to the WordNet database.

Paraphrase: Phrases are matched if they are listed as paraphrases in the
            METEOR paraphrase tables.

Currently supported languages are English, Czech, German, French, and Spanish.

The system is written in pure Java with a full API to allow easy incorporation
of METEOR scoring into existing systems. This METEOR release also includes:

- a standalone version of the Aligner
- a standalone version of the Sufficient Statistics Scorer
- a Trainer which can tune the METEOR parameters for new data

2. Running METEOR:
==================

This section refers to the standalone METEOR scorer. For information about
building METEOR, see the INSTALL file. For information about the METEOR API,
see the JavaDoc in the doc directory.

The following can be seen by running the METEOR scorer with no arguments:

--------------------------------------------------------------------------------
METEOR version next-1.2

Usage: java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar <test> <reference> [options]

Options:
-l language              One of: en cz de es fr
-t task                  One of: rank adq hter tune
-p "alpha beta gamma"    Custom parameters (overrides default)
-m "module1 module2 ..." Specify modules (overrides default)
                           Any of: exact stem synonym paraphrase
-w "weight1 weight2 ..." Specify module weights (overrides default)
-r refCount              Number of references (plaintext only)
-x beamSize              (default 40)
-d synonymDirectory      (if not default for language)
-a paraphraseFile        (if not default for language)
-j jobs                  Number of jobs to run (nBest only)
-f filePrefix            Prefix for output files (default "meteor")
-normalize               Normalize punctuation and tokenize (plaintext only)
-keepPunctuation         Consider punctuation when aligning sentences
-sgml                    Input is in SGML format
-nBest                   Input is in nBest format
-oracle                  Output oracle translations (nBest only)
-vOut                    Output verbose scores (P / R / frag / score)
-ssOut                   Output sufficient statistics instead of scores
-writeAlignments         Output alignments annotated with METEOR scores
                           (written to <prefix>-align.out)

To filter paraphrase tables: java -cp meteor-*.jar FilterParaphrase

To convert SGML files to plain text: java -cp meteor-*.jar SGMtoPlaintext

See README file for additional information
--------------------------------------------------------------------------------

The simplest way to run METEOR is as follows:

$ java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar <test> <reference>

NOTE: Versions of Java before 1.6 update 14 do not have the
-XX:+UseCompressedOops option. When running on earlier versions of Java, omit
this option and be sure to filter the paraphrase table. It might also be
necessary to increase the memory limit (-Xmx3G).
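For example, a plaintext hypothesis file can be scored against a single
plaintext reference as below (the file names hyp.txt and ref.txt are
placeholders, not files shipped with this distribution):

$ java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar hyp.txt ref.txt \
    -l en -normalize

Segment scores and final system level statistics are written to standard out
(see the Input/Output section).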
If your input is in SGML format, use:

$ java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar <test> <reference> -sgml

For example, using the sample files included with this distribution, you can
run the following test. First, filter the English paraphrase table to the
example reference file:

$ java -cp meteor-*.jar FilterParaphrase data/paraphrase-en.gz \
    filtered.example.gz example/ref.sgm

Score the example test and reference files using the filtered paraphrase
table:

$ java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar \
    example/test.sgm example/ref.sgm -sgml -a filtered.example.gz

Score files for segment, document, and system level scores are produced,
prefixed with the system name. The output from the above should match the
example scores:

$ diff meteor-seg.scr example/meteor-seg.scr
$ diff meteor-doc.scr example/meteor-doc.scr
$ diff meteor-sys.scr example/meteor-sys.scr

The METEOR jar can be run from any directory as long as correct paths are
specified.

3. Options:
===========

Language: -l language
---------------------

English is assumed by default. METEOR also supports evaluation of MT output in
the following languages:

Language      Available Modules
English (en)  (exact, stem, synonym, paraphrase)
French (fr)   (exact, stem, paraphrase)
German (de)   (exact, stem, paraphrase)
Spanish (es)  (exact, stem, paraphrase)
Czech (cz)    (exact, paraphrase)

Task: -t task
-------------

Each task specifies the modules, module weights, and parameters (alpha, beta,
gamma) tuned to a specific type of human judgment data. These tasks and their
parameters are listed below:

rank: Tuned to human rankings of translations from WMT09.
------------------------------------------------------------------
          exact  stem  synonym  paraphrase  alpha  beta  gamma
English:  1.0    0.8   0.8      0.6         0.85   2.35  0.45
Czech:    1.0    n/a   n/a      0.4         0.95   2.15  0.35
French:   1.0    0.0   n/a      0.6         0.90   0.85  0.45
Spanish:  1.0    0.8   n/a      0.4         0.15   0.25  0.75
German:   1.0    0.2   n/a      0.8         0.75   0.80  0.90
------------------------------------------------------------------

adq: Tuned to adequacy scores from NIST OpenMT09.
------------------------------------------------------------------
          exact  stem  synonym  paraphrase  alpha  beta  gamma
English:  1.0    1.0   0.6      0.8         0.80   1.10  0.45
------------------------------------------------------------------

hter: Tuned to HTER scores from GALE P2.
------------------------------------------------------------------
          exact  stem  synonym  paraphrase  alpha  beta  gamma
English:  1.0    0.2   0.6      0.8         0.65   1.70  0.55
------------------------------------------------------------------

Parameters: -p "alpha beta gamma"
---------------------------------

Alternatively, the three parameters (alpha, beta, and gamma) can be specified
manually. This is most often used when tuning METEOR to new data.

Modules: -m "module1 module2 ..."
---------------------------------

METEOR supports 4 matcher modules:

exact       match using surface forms
stem        match using stems obtained from the included stemmers
synonym     match based on synonyms obtained from the included database
paraphrase  match based on paraphrases from a paraphrase database

See the Language section above to determine which modules are available for
each language.

Module Weights: -w "weight1 weight2 ..."
----------------------------------------

The module weights can also be specified manually. This is also primarily used
for tuning METEOR; see the example following this section.
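For instance, a hypothetical English run restricted to the exact and stem
modules might look as follows. The file names are placeholders, the weight and
parameter values are illustrative rather than tuned, and it is assumed here
that weights in -w pair with modules in -m in the order listed:

$ java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar hyp.txt ref.txt \
    -l en -m "exact stem" -w "1.0 0.5" -p "0.85 2.35 0.45"

Specifying -m, -w, or -p overrides the corresponding defaults for the selected
task.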
Reference Count: -r refCount
----------------------------

If the input is in plaintext, the number of references can be specified. For N
references, it is assumed that the reference file will be N times the length
of the test file, containing sets of N references in order. For example, if
N=4, reference lines 1-4 will correspond to test line 1, lines 5-8 to test
line 2, and so on.

Beam Size: -x beamSize
----------------------

This number, set to 40 by default, is used to limit the beam size when
searching for the highest scoring alignment. As parameters are tuned for a
beam size of 40, simply increasing this number does not necessarily produce
more accurate scores.

Synonymy Directory: -d synonymDirectory
---------------------------------------

This option should only be used to test external synonymy databases. By
default, the bundled synonymy database will be used.

Paraphrase File: -a paraphraseFile
----------------------------------

This option is used to specify an alternate paraphrase file. It is commonly
used after filtering one of the original paraphrase files to a reference set
(highly recommended).

Jobs: -j jobs
-------------

This option (nBest scoring only) sets the number of jobs to use for scoring.
It is generally a good idea to set this to the number of CPUs on the machine
running METEOR.

File Prefix: -f filePrefix
--------------------------

Specify the prefix of score files in SGML mode. Files produced will be
<prefix>-seg.scr, <prefix>-doc.scr, and <prefix>-sys.scr. The default prefix
is "meteor". If alignments are to be written, they are written to
<prefix>-align.out.

Normalize: -normalize
---------------------

This is only used for plaintext (SGML is normalized automatically). This
option tokenizes and lowercases the input lines, normalizes punctuation, and
converts any remaining markup language tags to their plaintext forms.

SGML: -sgml
-----------

This specifies that input is in SGML format. (See the Input/Output section.)

N-Best: -nBest
--------------

This specifies that input is in nBest format with multiple translations for
each segment. For each segment, a line containing a single number for the
count of translations is followed by one translation per line. For example, an
input file with translations for three segments might appear as follows:

1
This is a single translation.
3
This is hypothesis one.
This is hypothesis two.
This is hypothesis three.
2
This segment has two translations.
This is the second translation.

See the Input/Output section for the output format.

Keep Punctuation: -keepPunctuation
----------------------------------

If not specified, punctuation will be removed. If specified, punctuation
symbols will be treated as tokens by the matcher.

Verbose Output: -vOut
---------------------

Output verbose scores (Precision, Recall, Fragmentation, Score) in place of
regular scores.

Sufficient Statistics: -ssOut
-----------------------------

This option outputs sufficient statistics in place of scores and omits all
other output. The behavior differs slightly depending on the data format.

Plaintext: Space delimited lines are output, each having the following form:

tstLen refLen stage1tstTotalMatches stage1refTotalMatches
stage1tstWeightedMatches stage1refWeightedMatches s2tTM s2rTM s2tWM s2rWM
s3tTM s3rTM s3tWM s3rWM s4tTM s4rTM s4tWM s4rWM chunks lenCost

No system level score is output. The lines can be piped or otherwise passed to
the StatsScorer program to produce METEOR scores from the sufficient
statistics, as shown in the example below.

SGML: The output score files will contain space delimited sufficient
statistics in place of scores. Segment, document, and system level score files
are still produced.
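For example, sufficient statistics for a plaintext test set can be piped
directly to StatsScorer (the file names are placeholders; see the StatsScorer
section for its options):

$ java -XX:+UseCompressedOops -Xmx2G -jar meteor-*.jar hyp.txt ref.txt -ssOut \
    | java -cp meteor-*.jar StatsScorer -l en -t rank

This writes a METEOR score for each line of statistics to standard out.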
Write Alignments: -writeAlignments
----------------------------------

Write alignments between hypotheses and references to meteor-align.out, or to
<prefix>-align.out when a file prefix is specified. Alignments are written in
METEOR format, annotated with METEOR statistics:

Title precision recall fragmentation score
sentence1
sentence2
Line2Start:Length Line1Start:Length Module Score
...

4. Input/Output Format of METEOR:
=================================

Input can be in either plaintext with one segment per line (also see -r and
-nBest for multiple references or hypotheses), or in SGML.

For plaintext, output is to standard out with scores for each segment and
final system level statistics. If nBest is specified, a score is output for
each translation hypothesis, along with system level statistics for
first-sentence (first translation in each list) and best-choice (best scoring
translation in each list).

For SGML, output includes 3 files containing segment, document, and system
level scores for the systems and test sets:

meteor-seg.scr contains lines: <testset> <system> <document> <segment> <score>
meteor-doc.scr contains lines: <testset> <system> <document> <score>
meteor-sys.scr contains lines: <testset> <system> <score>

System level statistics will also be written to standard out for SGML scoring.

5. Aligner:
===========

The METEOR aligner can be run independently with the following command:

$ java -XX:+UseCompressedOops -Xmx2G -cp meteor-*.jar Matcher

Without any arguments, the following help text is printed:

--------------------------------------------------------------------------------
METEOR Aligner version next-1.2

Usage: java -XX:+UseCompressedOops -Xmx2G -cp meteor-*.jar Matcher <file1> <file2> [options]

Options:
-l language              One of: en cz de es fr hu
-m "module1 module2 ..." Specify modules (overrides default)
                           One of: exact stem synonym paraphrase
-t type                  Alignment type (coverage vs accuracy)
                           One of: maxcov maxacc
-x beamSize              Keep speed reasonable
-d synonymDirectory      (if not default)
-a paraphraseFile        (if not default)

See README file for examples
--------------------------------------------------------------------------------

The aligner reads in two plaintext files and outputs a detailed line-by-line
alignment between them. Only the options (outlined in previous sections) which
apply to the creation of alignments are available. The type option determines
whether the aligner prefers coverage (better for correlation with human
judgments in evaluation) or accuracy (better for tasks requiring high accuracy
for each alignment link).
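As a sketch, an accuracy-oriented English alignment of two files might be
produced as follows. The file names are placeholders, and it is assumed here
that the two input files are passed positionally, as in the usage line above:

$ java -XX:+UseCompressedOops -Xmx2G -cp meteor-*.jar Matcher \
    file1.txt file2.txt -l en -m "exact stem" -t maxacc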
6. StatsScorer:
===============

The METEOR sufficient statistics scorer can also be run independently:

$ java -cp meteor-*.jar StatsScorer

The --help option provides the following help text:

--------------------------------------------------------------------------------
METEOR Stats Scorer version next-1.2

Usage: java -cp meteor.jar StatsScorer [options]

Options:
-l language              One of: en cz de es fr
-t task                  One of: adq rank hter
-p "alpha beta gamma"    Custom parameters (overrides default)
-w "weight1 weight2 ..." Specify module weights (overrides default)
--------------------------------------------------------------------------------

The scorer reads lines of sufficient statistics from standard in and writes
METEOR scores to standard out.

7. Trainer:
===========

The METEOR trainer can be used to tune METEOR parameters for new data. The
"scripts" directory contains scripts for creating training sets from many
common data formats.

Without any arguments, the following help text is printed:

--------------------------------------------------------------------------------
METEOR Trainer version next-1.2

Usage: java -XX:+UseCompressedOops -Xmx2G -cp meteor-*.jar Trainer <task> <dataDir> [options]

Tasks: One of: segcor rank

Options:
-a paraphrase
-e epsilon
-l language
-i "p1 p2 p3 w1 w2 w3 w4" Initial parameters and weights
-f "p1 p2 p3 w1 w2 w3 w4" Final parameters and weights
-s "p1 p2 p3 w1 w2 w3 w4" Steps
--------------------------------------------------------------------------------

The Trainer will explore the parameter space bounded by the initial and final
weights using the given steps. Output should be piped to a file and sorted to
determine the best scoring point. The following tasks are available:

segcor: Segment-level correlation. The data dir can contain file triplets for
any number of systems in the form:

  <name>.tst - MT system output file (SGML)
  <name>.ref - Reference translation file (SGML)
  <name>.ter - Human score file for this system, containing space delimited
               lines of the form:

               <document> <segment> <score>

               example: newswire1 12 5

  example: sys1.tst sys1.ref sys1.ter

Human scores can be of any numerical measure (7 point adequacy scale, 0/1
correctness, HTER or other post-edit measure). For each point in the parameter
space, the segment-level length-weighted Pearson's correlation coefficient is
calculated across the scores for all segments in all files.

rank: Rank consistency. The data dir can contain file groups in the following
form:

  <lang-pair>.rank - rank file containing tab delimited lines of the form:

               <segment> <lang-pair-A> <system-A> <lang-pair-B> <system-B>

               example: 3 cz-en sysA cz-en sysB

               indicating that for the given segment, language pair A,
               system A is preferred (receives a higher score) over language
               pair B, system B. There can be multiple judgments for the same
               systems on the same segments.

  <target-lang>.ref.sgm    - Reference translation file for this target
                             language (SGML)
  <lang-pair>.<system>.sgm - MT system output for this language pair (SGML)
  <lang-pair>.<system>.sgm - another system
  <lang-pair>.<system>.sgm - another system
  ...additional systems...

  example: cz-en.rank en.ref.sgm cz-en.sysA.sgm cz-en.sysB.sgm cz-en.sysC.sgm ...

For each point in the parameter space, the rank consistency (proportion of
times preferred segments receive a higher metric score) is calculated.

8. SGML-to-Plaintext Converter:
===============================

This release also includes a program for reliably converting SGML test and
reference files to plain text. Resulting files are consistently ordered even
if the SGML files are not, and blank lines are appropriately added for empty
or missing segments. To run this program, use:

$ java -cp meteor-1.2.jar SGMtoPlaintext

9. Licensing:
=============

METEOR is released under the LGPL and includes some files subject to the
WordNet license. See the included COPYING files for details.

10. Acknowledgements:
=====================

The following researchers have contributed to the implementation of the METEOR
system (all at Carnegie Mellon University):

Rachel Reynolds
Kenji Sagae
Jeremy Naman
Shyamsundar Jayaraman