The METEOR MT Evaluation System, Version 0.8

Michael Denkowski (mdenkows at cs dot cmu dot edu)
Abhaya Agarwal (abhayaa at cs dot cmu dot edu)
Satanjeev "Bano" Banerjee (satanjeev at cmu dot edu)
Alon Lavie (alavie at cs dot cmu dot edu)

Carnegie Mellon University
Pittsburgh, PA, USA

1. Introduction:
================

METEOR is a system that automatically evaluates the output of machine
translation engines by comparing it to one or more reference
translations. For a given pair of hypothesis and reference strings,
the evaluation proceeds in a sequence of stages, with different
criteria being used at each stage to find and score unigram matches.
By default, all exact matches between the two strings are detected in
the first stage. In the second stage, all stem matches are detected
using the Snowball stemmers, and in the third stage, all synonym
matches are detected using data extracted from the WordNet 3 database.

The base system is written in C++ and includes APIs in C++, Java, and
Python to allow users to easily incorporate METEOR scoring into
existing systems. The sentence aligner can also function independently
of the scorer and thus be used in other systems that require
monolingual sentence alignment.

METEOR supports evaluation of MT output in languages other than
English. Currently supported languages are French, German, Spanish and
Czech. Details are provided under "Languages" in the Options section.

2. How to Run METEOR:
=====================

This section refers to the standalone meteor binary. For information
about building METEOR, see the INSTALL file. For information about
using the METEOR library, see the API file.

The following can be seen by running the meteor binary with no
arguments:

--------------------------------------------------------------------------------
Usage: ./meteor [options] [module1 module2 ...]
Options:
-l language                     One of: en cz de es fr
-t task                         One of: af rank
-p alpha beta gamma             Custom parameters (overrides task)
-s systemID                     Not usually required
-r refCount                     References per sentence (plaintext input only)
-x maxComputations              Keep speed reasonable
-w synonymDirectory             (if not ./share/meteor)
-normalize                      Convert symbols and tokenize (plaintext input only)
-sgml                           Input is in SGML format
-keepPunctuation                Consider punctuation when aligning sentences
-ssOut                          Output sufficient statistics only (plaintext input only)

Available modules:
exact stem syn

Default settings:
-l en -t af -r 1 -x 10000 exact stem syn

Note: processing SGML input or normalizing the lines can take
significantly longer than the scoring itself. Scoring tokenized
plain text is multiple times faster.
--------------------------------------------------------------------------------

The simplest way to run the system is to pass the test (hypothesis)
file followed by the reference file:

$ meteor TEST_FILE REFERENCE_FILE

This assumes plaintext input. If your input is in SGML format, use:

$ meteor TEST_FILE REFERENCE_FILE -sgml

For example, using the sample files included with this distribution,
you can run as follows:

$ meteor example/test.sgm example/ref.sgm -sgml

Score files for segment, document, and system level scores are
produced, prefixed with the system name. The output from the above
command should match the example output.

As seen above, the default settings specify the "adequacy and fluency"
task (explained under "Task" in the Options section), though the
"ranking" task can also be selected.

3. Options:
===========

The following is a more detailed description of each of the command
line arguments.

Languages: -l
----------

English is assumed by default. Meteor also supports evaluation of MT
output in the following languages:

Language        Available Modules

English (en)    (exact, stem, syn)
Czech   (cz)    (exact)
French  (fr)    (exact, stem)
German  (de)    (exact, stem)
Spanish (es)    (exact, stem)

Input is assumed to be in UTF-8 encoding.
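The language table above can be written down as a small lookup. The
following Python sketch is illustrative only and is not part of the
METEOR distribution or its API; the names SUPPORTED_MODULES and
unsupported_modules are hypothetical. It reports which requested
modules the table does not list for a given language code.

```python
# Language codes and matcher modules, transcribed from the table above.
# This mapping is an illustration, not something the meteor binary exposes.
SUPPORTED_MODULES = {
    "en": ("exact", "stem", "syn"),
    "cz": ("exact",),
    "fr": ("exact", "stem"),
    "de": ("exact", "stem"),
    "es": ("exact", "stem"),
}


def unsupported_modules(language, modules):
    """Return the requested modules that the table does not list for `language`."""
    if language not in SUPPORTED_MODULES:
        raise ValueError(f"unknown language code: {language!r}")
    available = SUPPORTED_MODULES[language]
    return [module for module in modules if module not in available]


if __name__ == "__main__":
    # German has no synonym data in the table, so "syn" is reported.
    print(unsupported_modules("de", ["exact", "stem", "syn"]))
```

A check like this can be run before invoking the binary, for instance
to drop "syn" from the module list when scoring non-English output.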
Task: -t
-----

Each task specifies parameters (alpha, beta, gamma) tuned to a specific
type of human judgment data. These tasks and their parameters are as
follows:

af      Tuned to human judgments of adequacy and fluency. This task
        uses the default parameter set from the Meteor-0.7 system.

        ---------------------------------
                  alpha   beta    gamma
        English:  0.8     2.5     0.4
        Czech:    0.8     0.83    0.28
        French:   0.76    0.5     1.0
        Spanish:  0.95    0.5     0.75
        German:   0.95    1.0     0.98
        ---------------------------------

rank    Tuned to human rankings of translations. This task uses the
        parameter set from the Meteor-Ranking system.

        ---------------------------------
                  alpha   beta    gamma
        English:  0.95    0.5     0.5
        Czech:    0.95    0.5     0.45
        French:   0.95    0.5     0.55
        Spanish:  0.95    0.5     0.55
        German:   0.9     3.0     0.15
        ---------------------------------

Parameters: -p
-----------

Alternatively, the three parameters (alpha, beta, gamma, discussed in
the cited Meteor papers) can be specified to run a custom evaluation
task. While this can be helpful for tuning Meteor to new data, the
task option above is preferred for regular scoring.

System ID: -s
----------

This is the ID of the system you would like to evaluate. It is not
required if the input file contains only one system.

Reference Count: -r
----------------

If the input is in plaintext, you can specify the number of references.
For N references, it is assumed that the reference file will be N times
the length of the test file, containing sets of N references in order.
For example, if N=4, reference lines 1-4 will correspond to test line 1,
lines 5-8 to test line 2, etc.

Maximum Computations: -x
---------------------

This number, set to 10,000 by default, is used to limit the search
depth for very long sentence pairs.

Synonymy Directory: -w
-------------------

This is only used if the meteor binary is moved to a different
directory and the wn directory is not moved with it. Use this option to
specify the location of the synonymy files used by the syn matcher.
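The reference file layout described under "Reference Count" above can
be sketched in Python as follows. This is a minimal illustration of the
stated line ordering, not code from the METEOR distribution; the
function name group_references and its arguments are hypothetical.

```python
def group_references(test_lines, reference_lines, ref_count):
    """Pair each test line with its block of `ref_count` reference lines.

    Follows the layout described above: with N references per sentence,
    reference lines 1..N belong to test line 1, N+1..2N to test line 2,
    and so on.
    """
    if ref_count < 1:
        raise ValueError("ref_count must be at least 1")
    if len(reference_lines) != ref_count * len(test_lines):
        raise ValueError(
            "reference file must have exactly ref_count lines per test line"
        )
    return [
        (test, reference_lines[i * ref_count:(i + 1) * ref_count])
        for i, test in enumerate(test_lines)
    ]


if __name__ == "__main__":
    tests = ["hyp one", "hyp two"]
    refs = ["ref 1a", "ref 1b", "ref 2a", "ref 2b"]
    for test, ref_block in group_references(tests, refs, ref_count=2):
        print(test, "->", ref_block)
```

A length check of this kind is a quick way to catch a reference file
that is out of step with the test file before running the scorer.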
Normalize: -normalize
----------

This is only used for plaintext input (SGML is normalized
automatically). This option tokenizes and lowercases the input lines,
normalizes punctuation, and converts any remaining markup tags to their
plaintext forms.

SGML: -sgml
-----

This specifies that the input is in SGML format.

Keep Punctuation: -keepPunctuation
-----------------

If not specified, punctuation will be removed. If specified,
punctuation symbols will be treated as tokens by the matcher.

Sufficient Statistics: -ssOut
----------------------

Output lines containing the tab-delimited sufficient statistics for
each test sentence. Each output line will have the following format:

numMatches testLength refLength numChunks lengthCost

No system level score or other output will be generated. This output
can be piped or otherwise passed to the "met-reduce" program to produce
Meteor scores from the sufficient statistics.

Modules:
--------

Meteor currently supports 3 modules:

exact   matching using surface forms
stem    matching using stems obtained from the included stemmers
syn     matching based on synonyms obtained from the included database

See "Languages" in the Options section for the modules available for
each language.

4. Input/Output Format of METEOR:
=================================

All input is assumed to be in UTF-8 encoding, and all output is also
generated in UTF-8 encoding.

Input can be in either plaintext or SGML. For plaintext, output is
written to standard out, with scores for each segment and a final
system level score. For SGML, output is in a 3-file format which
includes segment, document, and system level scores, prefixed by the
system name. For system "example":

example-seg.score contains lines: testset system document segment score
example-doc.score contains lines: testset system document score
example-sys.score contains line:  testset system score

5. Licensing:
=============

METEOR is distributed under the following license:

License Start:

Carnegie Mellon University
Copyright (c) 2004-2009
All Rights Reserved.

Permission is hereby granted, free of charge, to use and distribute
this software and its documentation without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of this work, and to
permit persons to whom this work is furnished to do so, subject to
the following conditions:

 1. The code must retain the above copyright notice, this list of
    conditions and the following disclaimer.
 2. Any modifications must be clearly marked as such.
 3. Original authors' names are not deleted.
 4. The authors' names are not used to endorse or promote products
    derived from this software without specific prior written
    permission.

CARNEGIE MELLON UNIVERSITY AND THE CONTRIBUTORS TO THIS WORK
DISCLAIM ALL WARRANTIES WITH REGARD TO THIS SOFTWARE, INCLUDING
ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS, IN NO EVENT
SHALL CARNEGIE MELLON UNIVERSITY NOR THE CONTRIBUTORS BE LIABLE
FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN
AN ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION,
ARISING OUT OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF
THIS SOFTWARE.

Author: Satanjeev "Bano" Banerjee   satanjeev AT cmu.edu
Author: Alon Lavie                  alavie AT cs.cmu.edu
Author: Abhaya Agarwal              abhayaa AT cs.cmu.edu
Author: Michael Denkowski           mdenkows AT cs.cmu.edu

License End.

6. Acknowledgements:
====================

The following researchers have contributed to the implementation of the
METEOR system (all at Carnegie Mellon University):

Rachel Reynolds
Kenji Sagae
Jeremy Naman
Shyamsundar Jayaraman

7. References:
==============

[Lavie & Agarwal, 2007] Lavie, A. and A. Agarwal. "METEOR: An Automatic
Metric for MT Evaluation with High Levels of Correlation with Human
Judgments". In Proceedings of the Workshop on Statistical Machine
Translation at the 45th Annual Meeting of the Association for
Computational Linguistics (ACL-2007), Prague, June 2007.

[Banerjee & Lavie, 2005] Banerjee, S. and A. Lavie. "METEOR: An
Automatic Metric for MT Evaluation with Improved Correlation with Human
Judgments". In Proceedings of the Workshop on Intrinsic and Extrinsic
Evaluation Measures for MT and/or Summarization at the 43rd Annual
Meeting of the Association for Computational Linguistics (ACL-2005),
Ann Arbor, Michigan, June 2005.

[Lavie et al., 2004] Lavie, A., K. Sagae and S. Jayaraman. "The
Significance of Recall in Automatic Metrics for MT Evaluation". In
Proceedings of the 6th Conference of the Association for Machine
Translation in the Americas (AMTA-2004), Washington, DC, September
2004.