μtopia - Microblog Translated Posts Parallel Corpora

Release V1.0 - 15/05/2013


Overview

μtopia is a parallel corpus containing sentence pairs mined from microblogs. For more information, refer to our work:

Microblogs as Parallel Corpora, Wang Ling, Guang Xiang, Chris Dyer, Alan Black and Isabel Trancoso, ACL 2013, [pdf] [bib]

In a nutshell, this website contains the following resources:

  • Microblog Parallel Data - Automatically crawled parallel data from Microblogs. Available here (Sina Weibo) and here (Twitter).
  • Microblog Gold Standard Parallel Data - Manually aligned parallel data from our crawl. Download it here.

Our dataset is radically different from existing parallel corpora (which are normally in the news or parliament domain), since it contains characteristics present in microblogs (e.g., Twitter), social networks (e.g., Facebook) and chatrooms (e.g., MSN). Here are some parallel sentence examples:
Each example is given as source (English) - target (Mandarin):

  • Abbreviations: She love my tattoos ain't got no room for her name, but imma make room - 她喜欢我的纹身,那上面没有纹她名字的地方了,不过我会弄出空余空间
  • Orthographic errors: happy singles day in China - sorry I won't be celebratin witchu, I have my love... - 中国的粉丝们,光棍节快乐 - 抱歉我不能和你们一起庆祝节日了,我有我可爱的老婆S...
  • Syntactic errors: Less guilty of some wrong, will be able to talk less "I'm sorry" to themselves or others are worth celebrating. - 少犯一些错,就能少说“对不起”,这对自己或对别人都是值得庆幸的。
  • Emoticons: So excited to reveal the title of my new album on the new KellyRowland.com :) - 非常激动要在 KellyRowland.com 上揭晓我新专辑的名字了:)

This dataset was created to improve machine translation quality, with emphasis on informal and noisy data. We have shown (see our work here) that significant improvements can be obtained on in-domain test sets, and moderate improvements on out-of-domain test sets. However, you are free to use this dataset for any research purpose.

Also, if you use this dataset, we would appreciate it if you support our work by citing our paper. Bibtex available here.


Sina Weibo Parallel Corpora

Sina Weibo does not allow textual content to be republished (clause 5.1). So, once again, we will only publish the IDs of the messages and provide tools to crawl the data from the server.

Note that our crawler was built to prioritize crawling English-Mandarin sentence pairs, which is why the English-Mandarin corpus is so much larger than those for the other language pairs.

Data Format - Each corpus folder contains the following structure:

  • README - Instructions for this dataset, please read very carefully.
  • COPYING - Copyright for this dataset, please read even more carefully.
  • meta/meta.csv - post IDs and metadata (CSV format).
  • meta/meta.json - post IDs and metadata (JSON format).
  • data/data.(lang-pair).s - source sentences (should contain only some samples if you just downloaded this dataset).
  • data/data.(lang-pair).t - target sentences (should contain only some samples if you just downloaded this dataset).
  • data/data.(lang-pair).json - JSON describing the parallel sentences (should contain only some samples if you just downloaded this dataset).

To build the dataset, you must use the data in meta/meta.csv or meta/meta.json to extract the actual posts from the provider (Sina Weibo or Twitter). Each line of the metadata files contains a tweet ID, used to find the tweet, and the indexes of the parallel segments within that tweet. Using this information, it is possible to retrieve the tweet from Weibo and extract the parallel segments. Check the tools section.
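As a rough illustration, the reconstruction step could look like the sketch below. The exact column layout of meta.csv (tweet ID followed by source/target character offsets) and the fetch_tweet callback are assumptions for this example; the dataset's README describes the real format.

```python
import csv

def extract_pairs(meta_rows, fetch_tweet):
    """Rebuild (source, target) segment pairs from metadata rows.

    Assumed columns: tweet_id, src_start, src_end, tgt_start, tgt_end
    (character offsets into the original post). fetch_tweet maps a
    tweet ID to the post's text, e.g. via the provider's API client.
    """
    pairs = []
    for row in csv.reader(meta_rows):
        tweet_id, s0, s1, t0, t1 = row[:5]
        text = fetch_tweet(tweet_id)
        # Slice the parallel segments out of the original post using
        # the offsets given in the metadata.
        pairs.append((text[int(s0):int(s1)], text[int(t0):int(t1)]))
    return pairs
```

In practice fetch_tweet would call the Weibo or Twitter API (see the tools below) and should cache results, since posts may be fetched more than once.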

How to get it - You can download the corpora below:

Language Pair Num Sentences
Mandarin-English 800K [Sample] [Download]
Mandarin-Arabic 6K [Sample] [Download]
Mandarin-Russian 12K [Sample] [Download]
Mandarin-Korean 41K [Sample] [Download]
Mandarin-German 49K [Sample] [Download]
Mandarin-French 43K [Sample] [Download]
Mandarin-Spanish 36K [Sample] [Download]
Mandarin-Portuguese 25K [Sample] [Download]
Mandarin-Czech 21K [Sample] [Download]
Everything (includes Gold Standard) 1M [Download]

Tools and Resources - The following list contains the resources you will need to crawl Sina Weibo:

  • Open API Guide - You will first need to create and set up an Open API Weibo account for your crawler. This guide, written with non-native Mandarin speakers in mind, will show you how to do this.
  • Java API Client - Java helper API to access the Weibo server. Use it if you wish to write your own crawler. Note: We did not write this toolkit and credit goes to the original author of this API. We host it here, since it might be hard for non-native Mandarin speakers to find it.
  • Weibo Parallel Data Builder - Coming soon...

Sina Weibo Gold Standard

There are two reasons we created this gold standard.

  • There is no gold standard bitext for testing MT systems in the microblog domain. By releasing this corpus, we hope to promote MT research in this direction.
  • During the research period of this work, we needed to test the quality of our crawler and aligner.

How to get it - You can download the gold standard by itself here. It contains 4347 manually aligned tweets, out of which 2581 are parallel sentences in the Mandarin-English language pair.

Data Format - The structure of this corpus is described below:

  • README - Instructions for this dataset, please read very carefully.
  • COPYING - Copyright for this dataset, please read even more carefully.
  • annotations.json - manually annotated tweets, with the following information: (1) is there parallel data, (2) if so, what is the language pair of the data, (3) what are the spans of the parallel data.
  • bitext/(lang-pair)/gold.json - detailed description of each parallel pair.
  • bitext/(lang-pair)/tweet.txt - original tweets where the parallel data was extracted from.
  • bitext/(lang-pair)/gold.(s) and gold.(t) - the line-by-line aligned parallel sentences.


Twitter Parallel Corpora

Twitter only allows the IDs of tweets to be published (clause 4.a). Thus, we only publish the IDs of the messages, together with our metadata that describes how to obtain the parallel data, and we provide the tools to extract it.

Data Format - Each corpus folder contains the following structure:

  • README - Instructions for this dataset, please read very carefully.
  • COPYING - Copyright for this dataset, please read even more carefully.
  • meta/meta.csv - post IDs and metadata (CSV format).
  • meta/meta.json - post IDs and metadata (JSON format).
  • data/data.(lang-pair).s - source sentences (should contain only some samples if you just downloaded this dataset).
  • data/data.(lang-pair).t - target sentences (should contain only some samples if you just downloaded this dataset).
  • data/data.(lang-pair).json - JSON describing the parallel sentences (should contain only some samples if you just downloaded this dataset).

How to get it - You can download the corpora below:

Language Pair Num Sentences
English-Mandarin 113K [Sample] [Download]
English-Arabic 114K [Sample] [Download]
English-Russian 119K [Sample] [Download]
English-Korean 78K [Sample] [Download]
English-Japanese 75K [Sample] [Download]
Everything 500K [Download]

Tools - Coming soon...


Mining Method

In a nutshell, we are interested in users that parallel post (post tweets with translations in multiple languages). One example is Snoop Dogg and Psy posting in parallel on Sina Weibo.

Our extraction process is divided into two tasks:

  • Identification of parallel tweets - Find potential tweets that contain parallel sentences.
  • Extraction of the parallel segments - Try to find and align the parallel segments within a tweet.

Furthermore, we noticed that users commonly translate tweets as retweets, so the same extraction is also performed for tweet/retweet pairs. For more details, refer to our paper "Microblogs as Parallel Corpora".
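The identification step can be hinted at with a toy heuristic: a post that mixes a non-trivial amount of two different scripts (say, Latin and Han characters) is a candidate parallel tweet. This is only an illustration; the actual system described in the paper uses proper language identification and segment alignment rather than this script-based shortcut.

```python
def is_candidate_parallel(post, min_chars=5):
    """Toy heuristic for spotting candidate parallel tweets: flag a
    post if it contains at least min_chars Latin letters AND at least
    min_chars Han (CJK) characters, suggesting two languages coexist
    in the same message."""
    latin = sum(1 for c in post if c.isascii() and c.isalpha())
    han = sum(1 for c in post if "\u4e00" <= c <= "\u9fff")
    return latin >= min_chars and han >= min_chars
```

A post flagged by such a filter would then move on to the second task, where the actual parallel segments are located and aligned.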


Terms of Use

We are not aware of any copyright restrictions on the material we are publishing (IDs and metadata). If you use these datasets and tools in your research, please cite our paper (bibtex available here).

Also, please keep in mind that after crawling the data from Twitter and Sina Weibo, you are subject to their terms and conditions. Twitter's terms can be found here. Sina Weibo's terms can be found here and here. We do not endorse and shall not be held responsible or liable for damages resulting from the improper use of any content downloaded from this site.


Acknowledgments

I, Wang Ling, would like to give thanks (Brace yourselves for the lamest acknowledgment section of your lives! I certainly do not recommend reading this to any living being):
To FCT
I would like to thank FCT (Fundação para a Ciência e a Tecnologia) from the bottom of my heart for funding my PhD. The time I spent as a PhD student at Carnegie Mellon University and Instituto Superior Tecnico was without doubt one of the best of my life, and I am eternally grateful for being granted such a chance.
To My Advisors
I would like to thank my three advisors Alan Black, Isabel Trancoso and Chris Dyer for making me understand how D'Artagnan felt in "The Three Musketeers". Working with three mentors with distinct backgrounds meant tons of additional work, as each of them would pull me in a different direction. Yet, this forced me to look at the problem from different perspectives and pushed me to grow as a researcher. To put it simply, I feel that this work would not have been possible without the inspiration and insight they instilled in me.
To My Collaborators
To my collaborator and friend Guang Xiang, I would like to thank you for your help and support throughout this work. I am aware that you were busy with your thesis proposal/defense during the implementation of this work, so I appreciate that you found the time to help me under such conditions.
To Others
  • Many thanks to Brendan O'Connor for providing his huuuuge Twitter data, which was essential to extract the parallel data from the Twitter domain.
  • Thanks to Justin Chiu for occasionally posting on his Facebook wall in parallel. Otherwise, I would never have even dreamed that people would post in parallel.


The content in this website is provided by Wang Ling. Feel free to mail me ideas, feedback or suggestions and I will do my best to accommodate them.