Language Identification
Language identification is the process of deciding which (human)
language a particular bit of text is written in; for example, the word
"Sprachidentifikation" is in German while the rest of this sentence is
in English. Typical approaches are based on statistics of the most
frequent n-grams in each language, with a smaller but still
substantial number of attempts using the most common words in a
language or some combination of the two.
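To make the n-gram approach concrete, here is a minimal sketch in
Python of the classic rank-order method of Cavnar and Trenkle: each
language is profiled by its most frequent character n-grams, and a
string is assigned to the language whose profile orders those n-grams
most similarly. The toy training strings and parameter values below
are invented for illustration only.

from collections import Counter

def ngram_profile(text, n_max=3, top_k=300):
    # Rank the most frequent character 1..n_max-grams in the text.
    counts = Counter()
    for n in range(1, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {gram: rank for rank, (gram, _) in enumerate(counts.most_common(top_k))}

def out_of_place(doc_profile, lang_profile, penalty=300):
    # Cavnar-Trenkle distance: sum of rank differences, with a fixed
    # penalty for n-grams missing from the language profile.
    return sum(abs(rank - lang_profile.get(gram, penalty))
               for gram, rank in doc_profile.items())

def identify(text, profiles):
    doc = ngram_profile(text)
    return min(profiles, key=lambda lang: out_of_place(doc, profiles[lang]))

# Toy profiles; a real system trains on far more text per language.
profiles = {lang: ngram_profile(sample) for lang, sample in {
    "en": "the quick brown fox jumps over the lazy dog",
    "de": "der schnelle braune Fuchs springt ueber den faulen Hund",
}.items()}
print(identify("Sprachidentifikation", profiles))  # expected: 'de'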
Given a paragraph or two of purely monolingual text and a dozen or so
distinct languages to choose from, this problem is essentially solved,
with virtually error-free performance. My work has focused on short
texts (equivalent to a single line of printed text in a paperback
book) and very large numbers of languages, many of which are extremely
similar or even dialects of a common language. Even with these
additional complications, it is possible to classify the language with
about 98% accuracy, i.e. only about 2% of the test strings in a set of
1000 or more languages are assigned an incorrect language.
Recent work by others has started to focus on distinguishing those
extremely similar language pairs as well as regional variants of
languages (such as British versus American English or Mexican versus
Argentine Spanish). Performance on these close pairs has historically
been much worse than on distinct pairs like English versus Spanish.
Another open problem is handling texts that contain passages in multiple
languages, such as the first paragraph above. If the admixture of a
second language is only a minor part of the entire text, it is usually
correctly identified as the majority language. But what if the text
contains nearly equal amounts of text in multiple languages, or we
wish to identify the secondary language(s) as well as the primary
language of a text? In the former case, any statistics over the
entire text will be skewed, and there is a good chance that it will be
identified as none of its constituent languages!
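One common workaround, sketched below in Python (reusing the
hypothetical identify() and profiles from the sketch above), is to
classify short overlapping windows of the text independently, so that
each constituent language can win some windows instead of being
averaged away in the whole-text statistics.

def identify_segments(text, identify, profiles, window=65, step=32):
    # Label overlapping windows so a multilingual text yields a set of
    # languages rather than a single skewed whole-text decision.
    votes = {}
    for start in range(0, max(len(text) - window + 1, 1), step):
        lang = identify(text[start:start + window], profiles)
        votes[lang] = votes.get(lang, 0) + 1
    # Majority language first; minority languages reveal the admixture.
    return sorted(votes.items(), key=lambda kv: -kv[1])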
For a comprehensive survey of language identification research as of
early 2018, see Jauhiainen et al.,
"Automatic Language
Identification in Texts: A Survey" (PDF link).
LTI LangID Corpus
- LTI LangID Corpus Release 4 download
Contains training data for 1152 languages, and some (possibly very
tiny) amount of text for a total of 1547 languages. (669 MB)
Erratum: the 00README in the archive accidentally omitted
counting one of the Wikipedia languages.
- LTI LangID Corpus Release 5 download
Contains training data for 1266 languages, and some (possibly very
tiny) amount of text for a total of 1706 languages. (753 MB)
Also includes scripts to download non-redistributable data
for 1000+ additional languages, approximately doubling the total
data.
EMNLP-2014
To replicate the experiments reported in
Ralf D. Brown, "Non-linear Mapping for Improved Identification of
1300+ Languages." In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP-2014),
download both the LTI LangID Corpus (Release 1) and the above source
code, then follow these steps:
- Ensure that Java, Perl, 'tcsh', 'xargs', GNU 'parallel', and GCC/G++ are installed.
- Unpack the source code in a directory of your choice; call the top-level dir of the archive $EMNLP
- Unpack the LangID corpus in a directory of your choice; call the top-level dir $LANGID
- Install the corpus data where the replication script can find it using the command
$LANGID/code/install.sh -devtrain $EMNLP/corpus
You may optionally delete the unpacked corpus after the install
script completes to save space.
- Compile the identifiers with
$EMNLP/bin/build.sh
- Now run the replication script with
$EMNLP/replicate.sh
and wait.
Better yet, come back tomorrow....
(Seriously -- this takes 22 hours on a hex-core 4.1 GHz Intel i7 "Sandy Bridge" with 16GB RAM.)
Summary results of the evaluation runs will be placed in
$EMNLP/results. If 'gnuplot' is installed, graphs of the
results (some of which appear in the paper) will be generated in that
directory as well. Full evaluation outputs can be reviewed in
$EMNLP/corpus/eval-dev-results and $EMNLP/corpus/eval-results
(devtest/tuning and test sets, respectively).
My Language Identification Papers
Ralf D. Brown. "Finding and Identifying Text in 900+ Languages".
In Digital Investigation, Volume 9 (2012), pp. S34-S43.
(Proceedings of the Twelfth Annual DFRWS Conference, Washington DC, August 6-8, 2012)
DOI: 10.1016/j.diin.2012.05.004
Available in
PDF.
Also available:
slides from my presentation at DFRWS.
Abstract:
This paper presents a trainable open-source utility to extract text
from arbitrary data files and disk images which uses language models
to automatically detect character encodings prior to extracting
strings and for automatic language identification and filtering of
non-textual strings after extraction. With a test set containing 923
languages, consisting of strings of at most 65 characters, an overall
language identification error rate of less than 0.4% is achieved.
False alarm rates on random data are 0.34% when filtering thresholds
are set for high recall and 0.012% when set for high precision, with
corresponding miss rates of 0.002% and 0.009% in running text.
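The filtering step can be pictured as simple thresholding: score each
extracted string under its best-matching language model and keep only
strings that clear the threshold. The sketch below is an invented
illustration (score_string is a hypothetical stand-in for the
utility's actual language-model scoring); raising the threshold trades
missed text for fewer false alarms on random data, exactly the
recall/precision trade-off quantified in the abstract.

def filter_extracted(strings, score_string, threshold):
    # score_string(s) -> (best_language, score) is a hypothetical
    # stand-in for scoring a string under the trained language models.
    kept = []
    for s in strings:
        lang, score = score_string(s)
        if score >= threshold:      # higher threshold: higher precision
            kept.append((s, lang))  # lower threshold: higher recall
    return kept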
Ralf D. Brown, "Selecting and Weighting N-Grams to Identify 1100 Languages",
In Proceedings of Text, Speech, and Dialogue 2013.
Plzeň, Czech Republic, September 2013.
Available in PDF.
Also available: slides from my presentation at TSD
Abstract:
This paper presents a language identification algorithm using cosine
similarity against a filtered and weighted subset of the most frequent
n-grams in training data with optional inter-string score smoothing,
and its implementation in an open-source program. When applied to a
collection of strings in 1100 languages containing at most 65
characters each, an average classification accuracy of over 99.2% is
achieved with smoothing and 98.2% without. Compared to three other
open-source language identification programs, the new program is both
much more accurate and much faster at classifying short strings given
such a large collection of languages.
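A minimal sketch of the cosine-similarity scoring described in the
abstract (the simple frequency weighting and top-k filtering below are
placeholders, not the paper's actual selection and weighting scheme):

import math
from collections import Counter

def weighted_vector(text, n=4, top_k=500):
    # Frequency vector over the most frequent character n-grams; a
    # real system filters and weights these much more carefully.
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return dict(counts.most_common(top_k))

def cosine(u, v):
    # Cosine similarity between two sparse n-gram vectors.
    dot = sum(w * v[g] for g, w in u.items() if g in v)
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

def best_language(string, models):
    vec = weighted_vector(string)
    return max(models, key=lambda lang: cosine(vec, models[lang]))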
Ralf D. Brown, "Non-linear Mapping for Improved Identification of
1300+ Languages." In Proceedings of the Conference on Empirical
Methods in Natural Language Processing (EMNLP-2014).
Preprint PDF available.
Abstract:
Non-linear mappings of the form P(ngram)^γ and
log(1 + τ·P(ngram))/log(1 + τ) are applied to the n-gram
probabilities in five trainable open-source language identifiers. The
first mapping reduces classification errors by 4.0% to 83.9% over a
test set of more than one million 65-character strings in 1366
languages, and by 2.6% to 76.7% over a subset of 781 languages. The
second mapping improves four of the five identifiers by 10.6% to
83.8% on the larger corpus and 14.4% to 76.7% on the smaller
corpus. The subset corpus and the modified programs are made freely
available for download at http://www.cs.cmu.edu/~ralf/langid.html.
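Both mappings are easy to state in code. The sketch below uses
arbitrary illustrative values of γ and τ, not the tuned settings from
the paper; the point is that both curves compress the range of n-gram
probabilities so that rare n-grams carry relatively more weight.

import math

def power_mapping(p, gamma=0.3):
    # First mapping: P(ngram)^gamma with gamma < 1 boosts the relative
    # weight of low-probability n-grams.
    return p ** gamma

def log_mapping(p, tau=1000.0):
    # Second mapping: log(1 + tau*p) / log(1 + tau), a compressive
    # curve normalized so that p = 1 still maps to 1.
    return math.log1p(tau * p) / math.log1p(tau)

for p in (0.0001, 0.01, 0.5):
    print(p, power_mapping(p), log_mapping(p))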
(Last updated 25-Jul-2023)