MTNT: Machine Translation of Noisy Text


Break your MT models with MTNT, the Testbed for Machine Translation of Noisy Text! MTNT is a collection of comments from the Reddit discussion website in English, French and Japanese, translated to and from English. What sets this dataset apart is that the data consists of "noisy" text, exhibiting typos, grammatical errors, code switching and more. For more details, check out the paper.



You can download the data here: MTNT.1.1.tar.gz (md5sum: 8ce1831ac584979ba8cdcd9d4be43e1d)

After extraction with tar xvzf MTNT.1.1.tar.gz, the MTNT folder should have the following structure:

├── monolingual
│   ├── dev.en
│   ├── dev.fr
│   ├── dev.ja
│   ├── dev.tok.en
│   ├── dev.tok.fr
│   ├── dev.tok.ja
│   ├── train.en
│   ├── train.fr
│   ├── train.ja
│   ├── train.tok.en
│   ├── train.tok.fr
│   └── train.tok.ja
├── test
│   ├── test.en-fr.tsv
│   ├── test.en-ja.tsv
│   ├── test.fr-en.tsv
│   └── test.ja-en.tsv
├── train
│   ├── train.en-fr.tsv
│   ├── train.en-ja.tsv
│   ├── train.fr-en.tsv
│   └── train.ja-en.tsv
└── valid
    ├── valid.en-fr.tsv
    ├── valid.en-ja.tsv
    └── valid.ja-en.tsv

The monolingual data is distributed in raw text format, both with and without tokenization. The parallel data is split into training, validation and test sets. Each TSV file contains 3 columns:

  • Comment ID
  • Source sentence
  • Target sentence

Some source sentences come from the same original comment; you can use the comment ID to group them together and leverage the contextual information.
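For instance, grouping sentence pairs by comment ID takes only a few lines of Python (a minimal sketch; the three-column TSV layout is as described above):

```python
from collections import defaultdict

def group_by_comment(lines):
    """Group (source, target) sentence pairs by their comment ID.

    Each line is expected to hold three tab-separated fields:
    comment ID, source sentence, target sentence.
    """
    groups = defaultdict(list)
    for line in lines:
        comment_id, src, tgt = line.rstrip("\n").split("\t")
        groups[comment_id].append((src, tgt))
    return groups

# Usage (file name is just an example):
# with open("train.en-fr.tsv", encoding="utf-8") as f:
#     groups = group_by_comment(f)
```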

If you're only interested in the source and target sentences, you can run the script to split each file into separate source and target files.
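If the script isn't handy, an equivalent split can be done in a few lines of Python (a sketch assuming the three-column TSV format above; file names are illustrative):

```python
def split_tsv(tsv_path, src_path, tgt_path):
    """Write the source (2nd) and target (3rd) TSV columns to separate files."""
    with open(tsv_path, encoding="utf-8") as tsv, \
         open(src_path, "w", encoding="utf-8") as src, \
         open(tgt_path, "w", encoding="utf-8") as tgt:
        for line in tsv:
            _, src_sent, tgt_sent = line.rstrip("\n").split("\t")
            src.write(src_sent + "\n")
            tgt.write(tgt_sent + "\n")

# Example: split_tsv("train.en-fr.tsv", "train.en", "train.fr")
```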

I have made the data used for pretraining available here: clean-data-en-fr.tar.gz and clean-data-en-ja.tar.gz. This should save you some time if you want to reproduce the setup from the paper.


Language pair Source Target
en-fr Just got called into work tho so I won’t be in til tomorrow night Mais on vient de m'appeler pour le travail donc je n'y serai pas avant demain soir
fr-en je demande lazil politique pr janluk # Il ressuscitera ! I demand political asylum for jean luc # He will resurrect!
en-ja Sooooooo, he hasn’t had a day off in 36 years? ということは、36年間一度も休まなかったの?
ja-en もう「ネットの噂に反応する企業(団体)wwwwww」て時代じゃないんだよなあ It's not like it's the era of "companies (organizations) reacting to online rumors hahahaha".



This table lists all published results on the MTNT test set. If you want to appear in this table, send an email to pmichel1[at] (please include a link to or copy of your paper and code).

System en-fr fr-en en-ja ja-en
[Michel & Neubig, 2018] Base 21.77 23.27 9.02 6.65
[Michel & Neubig, 2018] Finetuned 29.73 30.29 12.45 9.82


The BLEU scores should be computed according to the guidelines given in the paper: using sacreBLEU on the detokenized output and reference with intl tokenization. Precisely, run:

cat out.detok | sacrebleu --tokenize=intl ref.detok

where {out,ref}.detok are the detokenized output and reference.

In the case of en-ja only, you should pre-segment the Japanese output with Kytea before running sacreBLEU:

kytea -m /path/to/kytea/share/kytea/model.bin -notags out.detok > out.seg
kytea -m /path/to/kytea/share/kytea/model.bin -notags ref.detok > ref.seg
cat out.seg | sacrebleu --tokenize=intl ref.seg


The code to reproduce the collection process and the machine translation experiments is available on GitHub.


If you use this dataset or the associated code, please cite:

@InProceedings{michel2018mtnt,
  author    = {Michel, Paul  and  Neubig, Graham},
  title     = {MTNT: A Testbed for Machine Translation of Noisy Text},
  booktitle = {Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing},
  year      = {2018}
}


If you have any issue with the data, please contact pmichel1[at] For any question regarding the code, please open an issue on Github.


This data is released under the terms of the Reddit API.