This file contains documentation on the Chinese-English Translation Lexicon Version 3.0, Linguistic Data Consortium (LDC) catalog number LDC2002L27 and ISBN 1-58563-238-4.
In 1999, responding to urgent demand for a Chinese-English bilingual wordlist to support various projects, the LDC quickly solicited entries from both in-house and Internet resources and compiled two versions of Chinese-English wordlists, "ldc_ce_dict.1.0.gb" (henceforth Version 1) and "ldc_ce_dict.2.0.txt" (henceforth Version 2), available for free to the general public at http://www.ldc.upenn.edu/Projects/Chinese/.
Version 1 took as its point of departure the CEDICT resource initiated by Paul Denisowski (more information about this project and its development can be found at http://www.mandarintools.com/cedict.html). The hastily compiled LDC Version 1, with its 24,298 entries, was relatively small and had unbalanced coverage. Research sites reported that some definitions were unsuitable for applications such as machine translation, and showed great interest in an updated version. Version 2, created as an experiment, has proven impractical for translingual information processing. Many of its entries were created by simple tricks such as reversing the source and target language fields of various English-to-Chinese wordlists; as a result, many entries are not really words. The increasing demand for richer lexical resources led to the present release, "ldc_cedict.gb.Version 3" (henceforth Version 3).
The total number of Chinese headwords in this release is 54,170.
In terms of coverage, Version 3 is, with the exceptions noted below, a superset of Version 1 and the LDC's Mandarin pronunciation lexicon (Version 3/Version 4). The pronunciation lexicon has a total of 44,404 entries, or 43,968 unique Chinese character strings (i.e. with pronunciations removed). There are still 553 entries from the pronunciation lexicon that are not found in Version 3. We were unable to provide accurate translations for these headwords for various reasons: some are very technical; some do not make sense unless their source is re-examined; some may contain segmentation errors; and some are rare words for which appropriate translations could not be found with the limited time and resources available.
Version 3 also leaves out fewer than 40 entries from Version 1. Most of these are rare single-character words whose translations could not be verified for accuracy.
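The coverage figures above amount to a set comparison between headword lists. The following is a minimal sketch of such a check, not the procedure the LDC actually used. The function names, the (headword, pronunciation) pair layout, and the toy data are all assumptions made for illustration.

```python
def unique_strings(pron_entries):
    """Collapse (headword, pronunciation) pairs to unique headword strings."""
    return {headword for headword, _pron in pron_entries}


def missing_headwords(source_headwords, lexicon_headwords):
    """Return source headwords absent from the lexicon, sorted for stable output."""
    return sorted(set(source_headwords) - set(lexicon_headwords))


# Hypothetical toy data: one headword appears with two pronunciations,
# so four entries collapse to three unique character strings.
pron_entries = [
    ("中文", "zhong1 wen2"),
    ("行", "xing2"),
    ("行", "hang2"),
    ("英文", "ying1 wen2"),
]
version3_headwords = {"中文", "英文"}

unique = unique_strings(pron_entries)
missing = missing_headwords(unique, version3_headwords)

print(len(pron_entries), len(unique))  # 4 3
print(missing)                         # ['行']
```

Applied to the real files, the same comparison would show which pronunciation-lexicon headwords have no entry in Version 3.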
Efforts have been made to assure:
There are exceptions to (4)-(6). For example, some grammatical words may not have simple English "equivalents". Often, such words are given as many glosses as could be found. Sometimes, however, the only option is to explain how the word in question is used; such explanations are enclosed in brackets. Bracketed explanations usually appear later in the gloss sequence, unless no regular gloss can be defined.
The following table shows the distribution of alternate glosses per Chinese headword; for example, 38,373 headwords have 1 gloss, 9,573 have 2 glosses, and so on (four Chinese words have 20 or more glosses):
#words #glosses
38373 1
9573 2
3445 3
1404 4
664 5
313 6
173 7
96 8
54 9
21 10
17 11
9 12
7 13
3 14
6 15
3 16
2 17
3 18
1 20
1 22
1 23
1 37
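As a consistency check, the table can be summed and compared with the headword total stated earlier (54,170) and with the remark that four words have 20 or more glosses. The sketch below copies the figures from the table; the variable names are illustrative only.

```python
# (number of headwords, glosses per headword), copied from the table above
DISTRIBUTION = [
    (38373, 1), (9573, 2), (3445, 3), (1404, 4), (664, 5), (313, 6),
    (173, 7), (96, 8), (54, 9), (21, 10), (17, 11), (9, 12), (7, 13),
    (3, 14), (6, 15), (3, 16), (2, 17), (3, 18), (1, 20), (1, 22),
    (1, 23), (1, 37),
]

# Total headwords: sum of the first column.
total_headwords = sum(words for words, _glosses in DISTRIBUTION)

# Headwords with 20 or more glosses.
many_glosses = sum(words for words, glosses in DISTRIBUTION if glosses >= 20)

print(total_headwords)  # 54170
print(many_glosses)     # 4
```

Both values agree with the figures given in the text.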
Creation Procedures
We extracted from the pronunciation lexicon a list of headwords that did not have glosses in Version 1. We first ran machine translation software on the list to get initial translations. The list was then divided into two sub-lists. The first contained headwords with full English translations; these translations were checked for accuracy, and any problems were corrected by hand. The second contained headwords with partial translations or none at all; for these, we used several hardcopy dictionaries as well as on-line resources to produce appropriate translations by hand.
The two lists with proper translations were then merged with Version 1 and quality control was performed on this master list. See above, "What's New in Version 3", for issues that were addressed during quality control.
Remaining Issues
Not every entry from Version 1 has been checked. Problems with Version 1 entries were corrected only if they were spotted during global quality control steps such as spell-checking. Some entries are therefore likely to still have lengthy explanations, or to list a less common translation before a more common one. We hope to outline a more consistent set of quality control guidelines for the next version.
Format
There is one data file, the lexicon itself. Within the lexicon, each entry is in this format:
head_word_in_Chinese_characters <tab> /gloss 1/gloss 2/.../gloss n/ <newline>
For example:
/Chinese language/Chinese/
/English language/English/
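The entry format above can be read with a few lines of code. The following is a minimal sketch, not an official LDC tool. The function names are hypothetical, the "gb2312" codec is an assumption based on the ".gb" in the file name, and HEADWORD is a placeholder standing in for the Chinese characters.

```python
def parse_entry(line):
    """Split one lexicon line into (headword, [glosses])."""
    headword, _tab, gloss_field = line.rstrip("\r\n").partition("\t")
    # Glosses are delimited by "/" with a leading and a trailing slash,
    # so split() yields empty strings at both ends; drop them.
    glosses = [g.strip() for g in gloss_field.split("/") if g.strip()]
    return headword.strip(), glosses


def load_lexicon(path, encoding="gb2312"):
    """Read the lexicon into a dict mapping headword -> list of glosses.

    The encoding is an assumption; adjust it if the file decodes badly.
    Repeated headwords, if any, have their glosses concatenated.
    """
    lexicon = {}
    with open(path, encoding=encoding) as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            headword, glosses = parse_entry(line)
            lexicon.setdefault(headword, []).extend(glosses)
    return lexicon


# HEADWORD is a placeholder, not a real entry from the lexicon.
entry = parse_entry("HEADWORD\t/Chinese language/Chinese/\n")
print(entry)  # ('HEADWORD', ['Chinese language', 'Chinese'])
```

Note that this sketch treats every "/" as a delimiter, so a gloss that itself contained a slash would be split in two.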
Please see file.tbl for the directory structure of this publication, as well as a complete list of files.
Please see ldc_cedict.gb.v3 for the lexicon.
A text copy of the author's documentation is available at readme.txt.
Additional information, updates and bug fixes may be available in the LDC catalog entry for this corpus: LDC2002L27.
Portions © 2002 Trustees of the University of Pennsylvania.