Language Identification
Language identification is the process of deciding which (human)
language a particular bit of text is written in; for example, the word
"Sprachidentifikation" is in German while the rest of this sentence is
in English. Typical approaches are based on statistics of the most
frequent n-grams in each language, with a smaller but still
substantial number of attempts using the most common words in a
language or some combination of the two.
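To make the n-gram approach concrete, here is a minimal sketch in
Python of the classic rank-order method of Cavnar and Trenkle: each
language is profiled by its most frequent character n-grams, and a
string is assigned to the language whose profile orders those n-grams
most similarly. The toy training strings and parameter values below
are invented for illustration only.

from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    # Rank the most frequent character 1..n_max-grams in the text.
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    # Cavnar-Trenkle distance: sum of rank differences, with a fixed
    # penalty for n-grams missing from the language profile.
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy profiles; a real system trains on far more text per language.
profiles = {lang: ngram_profile(sample) for lang, sample in {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune Fuchs springt ueber den faulen Hund",
}.items()}
print(identify("Sprachidentifikation", profiles))  # expected: 'de'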
Given a paragraph or two of purely monolingual text and a dozen or so
distinct languages to choose from, this problem is essentially solved,
with virtually error-free performance. My work has focused on short
texts (equivalent to a single line of printed text in a paperback
book) and very large numbers of languages, many of which are extremely
similar or even dialects of a common language. Even with these
additional complications, it is possible to classify the language with
about 98% accuracy, i.e. only about 2% of the test strings in a set of
1000 or more languages are assigned an incorrect language.
Recent work by others has started to focus on distinguishing those
extremely similar language pairs as well as regional variants of
languages (such as British versus American English or Mexican versus
Argentine Spanish). Performance on these close pairs has historically
been much worse than on distinct pairs like English versus Spanish.
Another open problem is handling texts that contain passages in multiple
languages, such as the first paragraph above. If the admixture of a
second language is only a minor part of the entire text, it is usually
correctly identified as the majority language. But what if the text
contains nearly equal amounts of text in multiple languages, or we
wish to identify the secondary language(s) as well as the primary
language of a text? In the former case, any statistics over the
entire text will be skewed, and there is a good chance that it will be
identified as none of its constituent languages!
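One common workaround, sketched below in Python (reusing the
hypothetical identify() and profiles from the sketch above), is to
classify short overlapping windows of the text independently, so that
each constituent language can win some windows instead of being
averaged away in the whole-text statistics.

def identify_segments(text, identify, profiles, window=65, step=32):
    # Label overlapping windows so a multilingual text yields a set of
    # languages rather than a single skewed whole-text decision.
    votes = {}
    for start in range(0, max(len(text) - window + 1, 1), step):
        lang = identify(text[start:start + window], profiles)
        votes[lang] = votes.get(lang, 0) + 1
    # Majority language first; minority languages reveal the admixture.
    return sorted(votes.items(), key=lambda kv: -kv[1])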
For a comprehensive survey of language identification research as of
early 2018, see Jauhiainen et al.,
"Automatic Language
Identification in Texts: A Survey" (PDF link).
LTI LangID Corpus
- LTI LangID Corpus Release 4 download
Contains training data for 1152 languages, and some (possibly very
tiny) amount of text for a total of 1547 languages. (669 MB)
Erratum: the 00README in the archive accidentally omitted
counting one of the Wikipedia languages.
- LTI LangID Corpus Release 5 download
Contains training data for 1266 languages, and some (possibly very
tiny) amount of text for a total of 1706 languages. (753 MB)
Also includes scripts to download non-redistributable data
for 1000+ additional languages, approximately doubling the total
data.
EMNLP-2014
To replicate the experiments reported in
Ralf D. Brown, "Non-linear Mapping for Improved Identification of
1300+ Languages." In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP-2014),
download both the LTI LangID Corpus (Release 1) and the above source
code, then follow these steps:
- Ensure that Java, Perl, 'tcsh', 'xargs', GNU 'parallel', and GCC/G++ are installed.
- Unpack the source code in a directory of your choice; call the top-level dir of the archive $EMNLP
- Unpack the LangID corpus in a directory of your choice; call the top-level dir $LANGID
- Install the corpus data where the replication script can find it using the command
$LANGID/code/install.sh -devtrain $EMNLP/corpus
You may optionally delete the unpacked corpus after the install
script completes to save space.
- Compile the identifiers with
$EMNLP/bin/build.sh
- Now run the replication script with
$EMNLP/replicate.sh
and wait.
Better yet, come back tomorrow....
(Seriously -- this takes 22 hours on a hex-core 4.1 GHz Intel i7 "Sandy Bridge" with 16GB RAM.)
Summary results of the evaluation runs will be placed in
$EMNLP/results. If 'gnuplot' is installed, graphs of the
results (some of which appear in the paper) will be generated in that
directory as well. Full evaluation outputs can be reviewed in
$EMNLP/corpus/eval-dev-results and $EMNLP/corpus/eval-results
(devtest/tuning and test sets, respectively).
My Language Identification Papers
Ralf D. Brown. "Finding and Identifying Text in 900+ Languages".
In Digital Investigation, Volume 9 (2012), pp. S34-S43.
(Proceedings of the Twelfth Annual DFRWS Conference, Washington DC, August 6-8, 2012)
DOI: 10.1016/j.diin.2012.05.004
Available in
PDF.
Also available:
slides from my presentation at DFRWS.
Abstract:
This paper presents a trainable open-source utility to extract text
from arbitrary data files and disk images which uses language models
to automatically detect character encodings prior to extracting
strings and for automatic language identification and filtering of
non-textual strings after extraction. With a test set containing 923
languages, consisting of strings of at most 65 characters, an overall
language identification error rate of less than 0.4% is achieved.
False alarm rates on random data are 0.34% when filtering thresholds
are set for high recall and 0.012% when set for high precision, with
corresponding miss rates of 0.002% and 0.009% in running text.
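The filtering step can be pictured as simple thresholding: score each
extracted string under its best-matching language model and keep only
strings that clear the threshold. The sketch below is an invented
illustration (score_string is a hypothetical stand-in for the
utility's actual language-model scoring); raising the threshold trades
missed text for fewer false alarms on random data, exactly the
recall/precision trade-off quantified in the abstract.

def filter_extracted(strings, score_string, threshold):
    # score_string(s) -> (best_language, score) is a hypothetical
    # stand-in for scoring a string under the trained language models.
    kept = []
    for s in strings:
        lang, score = score_string(s)
        if score >= threshold:      # higher threshold: higher precision
            kept.append((s, lang))  # lower threshold: higher recall
    return kept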
Ralf D. Brown, "Selecting and Weighting N-Grams to Identify 1100 Languages",
In Proceedings of Text, Speech, and Dialogue 2013.
Plzeň, Czech Republic, September 2013.
Available in PDF.
Also available: slides from my presentation at TSD
Abstract:
This paper presents a language identification algorithm using cosine
similarity against a filtered and weighted subset of the most frequent
n-grams in training data with optional inter-string score smoothing,
and its implementation in an open-source program. When applied to a
collection of strings in 1100 languages containing at most 65
characters each, an average classification accuracy of over 99.2% is
achieved with smoothing and 98.2% without. Compared to three other
open-source language identification programs, the new program is both
much more accurate and much faster at classifying short strings given
such a large collection of languages.
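A minimal sketch of the cosine-similarity scoring described in the
abstract (the simple frequency weighting and top-k filtering below are
placeholders, not the paper's actual selection and weighting scheme):

import math
from collections import Counter

def weighted_vector(text, n=4, top_k=500):
    # Frequency vector over the most frequent character n-grams; a
    # real system filters and weights these much more carefully.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(counts.most_common(top_k))

def cosine(u, v):
    # Cosine similarity between two sparse n-gram vectors.
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def best_language(string, models):
    vec = weighted_vector(string)
    return max(models, key=lambda lang: cosine(vec, models[lang]))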
Ralf D. Brown, "Non-linear Mapping for Improved Identification of
1300+ Languages." In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP-2014).
Preprint PDF available.
Abstract:
Non-linear mappings of the form P(ngram)^γ and
log(1 + τ·P(ngram))/log(1 + τ) are applied to the n-gram
probabilities in five trainable open-source language identifiers. The
first mapping reduces classification errors by 4.0% to 83.9% over a
test set of more than one million 65-character strings in 1366
languages, and by 2.6% to 76.7% over a subset of 781 languages. The
second mapping improves four of the five identifiers by 10.6% to
83.8% on the larger corpus and 14.4% to 76.7% on the smaller
corpus. The subset corpus and the modified programs are made freely
available for download at http://www.cs.cmu.edu/~ralf/langid.html.
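Both mappings are easy to state in code. The sketch below uses
arbitrary illustrative values of γ and τ, not the tuned settings from
the paper; the point is that both curves compress the range of n-gram
probabilities so that rare n-grams carry relatively more weight.

import math

def power_mapping(p, gamma=0.3):
    # First mapping: P(ngram)^gamma with gamma < 1 boosts the relative
    # weight of low-probability n-grams.
    return p ** gamma

def log_mapping(p, tau=1000.0):
    # Second mapping: log(1 + tau*p) / log(1 + tau), a compressive
    # curve normalized so that p = 1 still maps to 1.
    return math.log1p(tau * p) / math.log1p(tau)

for p in (0.0001, 0.01, 0.5):
    print(p, power_mapping(p), log_mapping(p))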
(Last updated 25-Jul-2023)