Multilingual Embeddings

Introduction

This page links to the multilingual word embeddings described in the paper [1] below.

Currently supported languages are: Arabic, Brazilian Portuguese, Dutch, English, French, German, Italian, Polish, Romanian, Russian, Spanish, and Turkish.

The embeddings were obtained by combining parallel data from the TED Corpus with pre-trained English GloVe embeddings.

Code

The code is publicly available here.

Download pre-trained word vectors

The multilingual word vectors can be downloaded here (1.3 GB tar.gz file). The file lists one word per line, and a suffix on each word indicates its language.
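
As a rough illustration of reading such a file, the sketch below groups vectors by language. It rests on assumptions this page does not state: that each line is a whitespace-separated text record of the form "word<sep>lang v1 v2 ...", and that the language suffix is separated from the word by an underscore. The function names and the "_" separator are hypothetical, so check a few lines of the extracted file and adjust before relying on it.

```python
from typing import Dict, Iterable, List, Tuple


def parse_line(line: str, sep: str = "_") -> Tuple[str, str, List[float]]:
    """Split one line into (word, language_suffix, vector).

    Assumes the layout 'word<sep>lang v1 v2 ...'. The separator is a guess,
    not something documented on this page.
    """
    token, *values = line.split()
    if sep in token:
        # Split on the last separator so words containing it stay intact.
        word, _, lang = token.rpartition(sep)
    else:
        # No suffix found: keep the token as the word, with an empty language.
        word, lang = token, ""
    return word, lang, [float(v) for v in values]


def load_vectors(
    lines: Iterable[str], sep: str = "_"
) -> Dict[str, Dict[str, List[float]]]:
    """Group vectors as {language: {word: vector}}, skipping blank lines."""
    vectors: Dict[str, Dict[str, List[float]]] = {}
    for line in lines:
        if not line.strip():
            continue
        word, lang, vec = parse_line(line, sep)
        vectors.setdefault(lang, {})[word] = vec
    return vectors
```

Because load_vectors accepts any iterable of lines, an open file handle can be passed directly, for example the handle from open(path, encoding="utf-8"), where path points to the extracted vectors file.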

As an alternative, you can download each language's vectors separately:

Reference: