μtopia - Microblog Translated Posts Parallel Corpora


Overview

μtopia is a parallel corpus that contains parallel sentence pairs mined from microblogs. Currently, we provide data crawled from Twitter and Sina Weibo.

Our dataset is radically different from existing parallel corpora (which are normally in the news or parliament domain), since it contains characteristics present in microblogs (e.g., Twitter), social networks (e.g., Facebook) and chatrooms (e.g., MSN). Here are some parallel sentence examples:
Abbreviations
Source (English): She love my tattoos ain't got no room for her name, but imma make room
Target (Mandarin): 她喜欢我的纹身,那上面没有纹她名字的地方了,不过我会弄出空余空间

Orthographic Errors
Source (English): happy singles day in China - sorry I won't be celebratin witchu, I have my love...
Target (Mandarin): 中国的粉丝们,光棍节快乐 - 抱歉我不能和你们一起庆祝节日了,我有我可爱的老婆S...

Syntactic Errors
Source (English): Less guilty of some wrong, will be able to talk less "I'm sorry " to themselves or others are worth celebrating.
Target (Mandarin): 少犯一些错,就能少说“对不起”,这对自己或对别人都是值得庆幸的。

Emoticons
Source (English): So excited to reveal the title of my new album on the new KellyRowland.com :)
Target (Mandarin): 非常激动要在 KellyRowland.com 上揭晓我新专辑的名字了:)

This dataset was created to improve machine translation quality, with an emphasis on informal and noisy data. It has been shown (see our work here) that vast improvements can be obtained on in-domain test sets, and moderate improvements on out-of-domain test sets. However, you are free to use this dataset for research purposes other than MT.

If you use this dataset, we would appreciate it if you supported our work by citing our paper. BibTeX available here.


Twitter Parallel Corpora

Twitter only allows the IDs of tweets to be published (clause 4.a). Thus, we will publish only the IDs of the messages, along with our metadata describing how to obtain the parallel data. We will also provide tools to extract this data.

Data Format -

More to come...


Sina Weibo Parallel Corpora

Sina Weibo does not allow textual information to be published (clause 5.1). So, once again, we will publish only the IDs of the messages and provide tools to crawl the data from the server.

More to come...


Mining Method

The data available on this website was mined using the method described in this paper. In short, our method attempts to find posts that contain a translation within the post itself, or post/repost pairs that are parallel.
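To give a rough feel for the first case (a translation inside a single post), here is a minimal Python sketch. It is not the method from the paper: it is only a simple script-based heuristic, assumed here for illustration, that splits a post into runs of Latin letters and CJK ideographs and flags posts containing both. The function names (`split_by_script`, `is_translation_candidate`) are hypothetical, and a real miner would still need to score whether the two segments are actually translations of each other.

```python
def is_cjk(ch: str) -> bool:
    """True for characters in the main CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def is_latin(ch: str) -> bool:
    """True for unaccented ASCII letters (a deliberate simplification)."""
    return ch.isascii() and ch.isalpha()


def split_by_script(post: str) -> list[tuple[str, str]]:
    """Split a post into maximal runs tagged 'latin' or 'cjk'.

    Neutral characters (spaces, digits, punctuation, emoticons) are
    attached to the run they appear in and never start a new run.
    """
    segments = []
    current_tag = None
    current_chars = []
    for ch in post:
        if is_cjk(ch):
            tag = "cjk"
        elif is_latin(ch):
            tag = "latin"
        else:
            tag = current_tag  # neutral: stay in the current run
        if current_tag is not None and tag != current_tag:
            # The script changed, so close the run collected so far.
            segments.append((current_tag, "".join(current_chars).strip()))
            current_chars = []
        current_tag = tag
        current_chars.append(ch)
    if current_tag is not None:
        segments.append((current_tag, "".join(current_chars).strip()))
    return segments


def is_translation_candidate(post: str) -> bool:
    """Flag posts that contain both a Latin-script and a CJK run."""
    tags = {tag for tag, _ in split_by_script(post)}
    return {"latin", "cjk"} <= tags


if __name__ == "__main__":
    for tag, text in split_by_script("I love you 我爱你"):
        print(tag, text)
```

Note that this heuristic over-generates: a Mandarin post that merely mentions an English name or URL is also flagged, so the output should be treated as a list of candidates to be filtered, not as parallel data.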

More to come...


Terms of Use

Please keep in mind that after crawling the data from Twitter and Sina Weibo, you are subject to their terms and conditions. Twitter's terms can be found here. Sina Weibo's terms can be found here and here. We do not endorse, and shall not be held responsible or liable for, damages resulting from the improper use of any content downloaded from this site.


Acknowledgments

I, Wang Ling, would like to give thanks (Also, I tend to be rather lame in these things so read at your own risk):
To FCT
I would like to thank FCT (Fundação para a Ciência e a Tecnologia) from the bottom of my heart for funding my PhD. The time I spent as a PhD student at Carnegie Mellon University and Instituto Superior Tecnico was without doubt one of the best of my life, and I am eternally grateful for being granted such a chance.
To My Advisors
I would like to thank my three advisors Alan Black, Isabel Trancoso and Chris Dyer for making me understand how D'Artagnan felt in "The Three Musketeers". Working with three mentors with distinct backgrounds meant tons of additional work, as each of them would pull me in a different direction. Yet, this forced me to look at the problem from different perspectives and pushed me to grow as a researcher. To put it simply, I feel that this work would not have been possible without the inspiration and insight that was instilled in me.
To My Collaborators
To my collaborator and friend Guang Xiang, I would like to thank you for your help and support throughout this work. I am aware that you were busy with your thesis proposal/defense during the implementation of this work, so I appreciate that you found the time to help me under such conditions.


The content on this website is provided by Wang Ling. Feel free to send me ideas, feedback or suggestions and I will do my best to accommodate them.