This example scores and visualizes the sample machine translation output included with the Meteor distribution. First, run the Meteor jar file with no arguments:

$ java -jar meteor-*.jar

Meteor version 1.3

Usage: java -Xmx2G -jar meteor-*.jar <test> <reference> [options]

-l language                     One of: en cz de es fr ar
-t task                         One of: rank adq hter li tune
-p 'alpha beta gamma delta'     Custom parameters (overrides default)
-m 'module1 module2 ...'        Specify modules (overrides default)
                                  Any of: exact stem synonym paraphrase
-w 'weight1 weight2 ...'        Specify module weights (overrides default)
-r refCount                     Number of references (plaintext only)
-x beamSize                     (default 40)
-d synonymDirectory             (if not default for language)
-a paraphraseFile               (if not default for language)
-j jobs                         Number of jobs to run (nBest only)
-f filePrefix                   Prefix for output files (default 'meteor')
-norm                           Tokenize / normalize punctuation and lowercase
                                  (Recommended unless scoring raw output with
                                   pretokenized references)
-lower                          Lowercase only (not required if -norm specified)
-noPunct                        Do not consider punctuation when scoring
                                  (Not recommended unless special case)
-sgml                           Input is in SGML format
-mira                           Input is in MIRA format
                                  (Use '-' for test and reference files)
-nBest                          Input is in nBest format
-oracle                         Output oracle translations (nBest only)
-vOut                           Output verbose scores (P / R / frag / score)
-ssOut                          Output sufficient statistics instead of scores
-writeAlignments                Output alignments annotated with Meteor scores
                                  (written to <prefix>-align.out)

Sample options for plaintext: -l <lang> -norm
Sample options for SGML: -l <lang> -norm -sgml
Sample options for raw output / pretokenized references: -l <lang> -lower

See README file for additional information

Running Meteor without any arguments produces the above help message outlining the various options available for scoring. For detailed descriptions of each option, see the included README file.

Now use Meteor to score the output of System1:

$ java -Xmx2G -jar meteor-*.jar example/xray/system1.hyp example/xray/reference -norm -writeAlignments -f system1

Meteor version: 1.3

Eval ID:        meteor.1.3-en-norm-0.85_0.2_0.6_0.75-ex_st_sy_pa-1.0_0.6_0.8_0.6

Language:       English
Format:         plaintext
Task:           Ranking
Modules:        exact stem synonym paraphrase
Weights:        1.0 0.6 0.8 0.6
Parameters:     0.85 0.2 0.6 0.75

Segment 1 score:        0.447752250844953
Segment 2 score:        0.4284116369815996
Segment 3 score:        0.2772888474043816
...
Segment 2487 score:     0.2825995999223381
Segment 2488 score:     0.32037812996981163
Segment 2489 score:     0.33120147321343485

System level statistics:

           Test Matches                  Reference Matches
Stage      Content  Function    Total    Content  Function    Total
1            16268     20842    37110      16268     20842    37110
2              485        26      511        489        22      511
3              820       119      939        845        94      939
4             3813      3162     6975       3954      2717     6671
Total        21386     24149    45535      21556     23675    45231

Test words:             61600
Reference words:        62469
Chunks:                 20118
Precision:              0.6767347074578696
Recall:                 0.6500539115850005
f1:                     0.663126043401952
fMean:                  0.6539211143997783
Fragmentation penalty:  0.5099053526424513

Final score:            0.3204832379614146

This displays the segment-level scores for the translation hypotheses and summary statistics for System1. Meteor also writes the file system1-align.out, which contains human-readable alignments annotated with Meteor statistics. While the example uses plaintext input, Meteor can also score input in WMT SGML and NIST XML formats using the -sgml option.
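
The summary statistics can be checked by hand. The sketch below recomputes fMean, the fragmentation penalty, and the final score from the values printed above. It assumes the usual Meteor formulas: a harmonic mean of precision and recall parameterized by alpha, and a penalty of gamma times fragmentation raised to beta, with fragmentation taken as chunks divided by the average of the test and reference match totals. The function and variable names are illustrative and not part of Meteor. Delta, the fourth parameter, is assumed to enter only through precision and recall, which are used here as printed.

```python
# Values copied from the System1 output above.
ALPHA, BETA, GAMMA = 0.85, 0.2, 0.6   # first three entries of "Parameters"
PRECISION = 0.6767347074578696
RECALL = 0.6500539115850005
CHUNKS = 20118
TEST_MATCHES = 45535        # "Total" row, test side
REFERENCE_MATCHES = 45231   # "Total" row, reference side


def meteor_from_stats(p, r, chunks, test_matches, reference_matches,
                      alpha=ALPHA, beta=BETA, gamma=GAMMA):
    """Return (fMean, fragmentation penalty, final score).

    Assumed formulas, not taken from the Meteor source code:
      fMean   = P * R / (alpha * P + (1 - alpha) * R)
      frag    = chunks / average(test matches, reference matches)
      penalty = gamma * frag ** beta
      score   = fMean * (1 - penalty)
    """
    f_mean = p * r / (alpha * p + (1 - alpha) * r)
    frag = chunks / ((test_matches + reference_matches) / 2)
    penalty = gamma * frag ** beta
    return f_mean, penalty, f_mean * (1 - penalty)


f_mean, penalty, score = meteor_from_stats(
    PRECISION, RECALL, CHUNKS, TEST_MATCHES, REFERENCE_MATCHES)
print("fMean:                 ", f_mean)
print("Fragmentation penalty: ", penalty)
print("Final score:           ", score)
```

Under these assumptions the three recomputed values should agree with the printed fMean, fragmentation penalty, and final score to several decimal places.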

To visualize these alignments and statistics, run Meteor X-ray:

$ python xray/xray.py -p system1 system1-align.out

This produces system1-align-system-1.pdf containing scored alignment matrices.

The TeX source for these alignments can be found in system1-files/align-1.tex. The file system1-score.pdf is also produced, though the score distributions will be more interesting when comparing two systems.

Next, score the output of System2:

$ java -Xmx2G -jar meteor-*.jar example/xray/system2.hyp example/xray/reference -norm -writeAlignments -f system2
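
Since the two scoring commands differ only in the hypothesis file and the output prefix, they can be scripted. The following is a minimal sketch, not part of the Meteor distribution: the helper names are hypothetical, and the jar file name meteor-1.3.jar is an assumption standing in for whatever meteor-*.jar matches in your copy. It uses only the options shown in the commands above and prints the commands instead of running them.

```python
import subprocess

# Assumed jar name; replace with the actual file matched by meteor-*.jar.
JAR = "meteor-1.3.jar"
REFERENCE = "example/xray/reference"


def meteor_command(jar, hypothesis, reference, prefix):
    """Build the argument list for one scoring run, mirroring the commands above."""
    return ["java", "-Xmx2G", "-jar", jar, hypothesis, reference,
            "-norm", "-writeAlignments", "-f", prefix]


def run_meteor(command):
    """Run one scoring command and return Meteor's standard output as text."""
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


commands = [
    meteor_command(JAR, "example/xray/%s.hyp" % prefix, REFERENCE, prefix)
    for prefix in ("system1", "system2")
]
for command in commands:
    # Dry run: print each command. Call run_meteor(command) to execute it.
    print(" ".join(command))
```

Each run writes its alignments to the file named after its prefix (system1-align.out and system2-align.out).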

Run Meteor X-ray in comparison mode:

$ python xray/xray.py -c -p compare system1-align.out system2-align.out

This produces several useful files. The file compare-align.pdf contains matrices where the alignments of System2 (right, solid and hollow dots) are overlaid on the alignments of System1 (left, grey and yellow boxes).

The alignments are also annotated with the differences in Meteor statistics between the two translations, allowing a thorough inspection of the differences between the two hypotheses as compared to the reference.

The file compare-score.pdf contains graphs of score distributions for individual segments for both systems.

Scores are also broken down by Meteor statistic and sentence length.

All TeX and Gnuplot files used to create these visualizations are available in the created directory compare-files.
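
For a quick numeric comparison alongside the PDFs, the segment-level scores that Meteor prints can be parsed directly. The sketch below assumes the "Segment N score:" line format shown in the System1 output and that the text of each run has been captured as a string. The function names are hypothetical, and the System2 values in the sample are invented placeholders for illustration only, not real Meteor output.

```python
import re

# Matches lines such as "Segment 12 score:    0.4284116369815996".
SEGMENT_RE = re.compile(r"^Segment (\d+) score:[ \t]+(\S+)", re.MULTILINE)


def segment_scores(meteor_output):
    """Map segment number to score for every segment line in the output text."""
    return {int(n): float(s) for n, s in SEGMENT_RE.findall(meteor_output)}


def compare(scores_a, scores_b):
    """Count (wins for A, wins for B, ties) over segments present in both."""
    wins_a = wins_b = ties = 0
    for segment in scores_a.keys() & scores_b.keys():
        if scores_a[segment] > scores_b[segment]:
            wins_a += 1
        elif scores_a[segment] < scores_b[segment]:
            wins_b += 1
        else:
            ties += 1
    return wins_a, wins_b, ties


# First three System1 lines, copied from the output shown earlier.
system1_text = (
    "Segment 1 score:        0.447752250844953\n"
    "Segment 2 score:        0.4284116369815996\n"
    "Segment 3 score:        0.2772888474043816\n"
)
# Placeholder System2 values, invented for this example.
system2_text = (
    "Segment 1 score:\t0.5\n"
    "Segment 2 score:\t0.4\n"
    "Segment 3 score:\t0.2772888474043816\n"
)

scores1 = segment_scores(system1_text)
scores2 = segment_scores(system2_text)
wins1, wins2, ties = compare(scores1, scores2)
print("System1 better:", wins1, "System2 better:", wins2, "ties:", ties)
```

Counting per-segment wins is a deliberately simple summary; the score distribution graphs in compare-score.pdf give a fuller picture.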