Multilingual Embeddings

Introduction

This page links to the multilingual word embeddings described in paper [1], listed below.

The currently supported languages are: Arabic, Brazilian Portuguese, Dutch, English, French, German, Italian, Polish, Romanian, Russian, Spanish, and Turkish.

The embeddings were obtained by combining parallel data from the TED Corpus with pre-trained English GloVe embeddings.

Code

The code is publicly available here.

Download pre-trained word vectors

The multilingual word vectors can be downloaded here (1.3 GB tar.gz file). The file contains one word per line, and the suffix of each word indicates its language.
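
As a rough illustration of how such a file might be read, here is a minimal Python sketch. It rests on assumptions this page does not confirm: that each line holds the suffixed word followed by whitespace-separated vector components, and that the language suffix is joined to the word by an underscore (for example "casa_es"). The separator and the names parse_line and load_vectors are hypothetical; adjust them after inspecting the actual file.

```python
from typing import Dict, Iterable, List, Tuple


def parse_line(line: str, sep: str = "_") -> Tuple[str, str, List[float]]:
    """Split one line into (word, language suffix, vector).

    Assumes "<word><sep><lang> v1 v2 ... vn"; the separator is a guess.
    """
    fields = line.split()
    token, values = fields[0], fields[1:]
    # Split on the LAST separator so words that contain it stay intact.
    word, found, lang = token.rpartition(sep)
    if not found:
        # No separator present: keep the whole token, leave language empty.
        word, lang = token, ""
    return word, lang, [float(v) for v in values]


def load_vectors(lines: Iterable[str], sep: str = "_") -> Dict[str, Dict[str, List[float]]]:
    """Group vectors by language: result[lang][word] -> vector."""
    vectors: Dict[str, Dict[str, List[float]]] = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word, lang, vec = parse_line(line, sep)
        vectors.setdefault(lang, {})[word] = vec
    return vectors


if __name__ == "__main__":
    # Made-up sample lines, not taken from the real download.
    sample = ["house_en 0.1 0.2", "casa_es 0.1 0.3"]
    by_lang = load_vectors(sample)
    print(sorted(by_lang))
```

To read a real file, an open text handle can be passed directly, for example load_vectors(open(path, encoding="utf-8")), where path is whatever file the extracted archive contains. Because the full download is large, filtering for the needed languages while reading is likely preferable to holding every vector in memory.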

As an alternative, you can download each language's vectors separately:

Arabic
Brazilian Portuguese
Dutch
English
French
German
Italian
Polish
Romanian
Russian
Spanish
Turkish

Reference

[1] Daniel C. Ferreira, André F. T. Martins, and Mariana S. C. Almeida.
"Jointly Learning to Embed and Predict with Multiple Languages."
Annual Meeting of the Association for Computational Linguistics (ACL'16), Berlin, Germany, August 2016.
To appear.