Generalized Example-Based Machine Translation

[This page is under construction -- please check back.]

4. Effectiveness of Generalization

Graphs 1 and 2 show how coverage increases as more text is added to the example base. Coverage is the percentage of words in the input for which the EBMT system is able to generate at least one candidate translation. The three conditions compared in each graph are:

  1. example base indexed without tokenization (other than numbers)
  2. example base indexed using only the tokenization file for replacements
  3. example base indexed using full recursive matching with both tagged entries and the tokenization file

For the French-English system using full recursive matching, there are some 76,000 morphological entries and 550 grammar rules, for a total of 224,000 words of linguistic data that must be counted toward the example base; this is why the top curve starts well to the right of the other curves. The Spanish-English system has much less linguistic data: some 13,000 morphological entries and 450 grammar rules, for a total of 43,000 words of overhead. For both systems, the left-most point on the top curve represents the performance with only the linguistic data and no actual example sentences.
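
The coverage measure defined above can be illustrated with a small sketch. It assumes, purely for illustration, that each match is reported as a (start, end) pair of word offsets into the input; this span representation and all function names are hypothetical and are not taken from the EBMT system itself. The overhead figures are the ones quoted above for the two language pairs.

```python
# Illustrative sketch only: the span representation and all names are assumptions.

# Words of linguistic data (morphological entries + grammar rules) counted
# toward the example base under full recursive matching, as quoted in the text.
LINGUISTIC_OVERHEAD_WORDS = {"french": 224_000, "spanish": 43_000}


def coverage(num_words, matched_spans):
    """Percentage of input words covered by at least one candidate translation.

    matched_spans holds (start, end) word offsets, with end exclusive.
    """
    if num_words == 0:
        return 0.0
    covered = [False] * num_words
    for start, end in matched_spans:
        # Clamp each span to the input so stray offsets cannot raise IndexError.
        for i in range(max(start, 0), min(end, num_words)):
            covered[i] = True
    return 100.0 * sum(covered) / num_words


def effective_corpus_size(example_words, language):
    """Example text plus linguistic overhead, in words."""
    return example_words + LINGUISTIC_OVERHEAD_WORDS[language]
```

With a ten-word input and matches over words 0-2, 2-4, and 8-9, seven words are covered, so coverage(10, [(0, 3), (2, 5), (8, 10)]) gives 70.0. Calling effective_corpus_size(0, "french") gives 224000, which corresponds to the left-most point on the top French curve.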

[Graph of corpus size vs. coverage for French]
Graph 1: French coverage

[Graph of corpus size vs. coverage for Spanish]
Graph 2: Spanish coverage

Coverage alone is not the only measure of the system's performance. Another important measure is the average match length: the size of the pieces for which the system generates translations. Provided that other parameters are not changed, a larger match will (in general) be of higher quality because it takes more context into account -- and is thus less likely to pick an incorrect word sense, make an egregious alignment error, etc. This approximate quality measure is important because producing manual judgements of translation quality is tedious, time-consuming, and expensive. Graphs 3 and 4 show how the average match length increases as more examples are added to the system, with the three curves representing the same conditions as in Graphs 1 and 2.
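
As a rough sketch of the match-length measure, the function below averages the length in words over all matches found for an input. The text does not say exactly how the average is taken (per match, per input word, or otherwise), so the per-match mean used here is an assumption, as are the function name and the (start, end) span representation.

```python
# Illustrative sketch only: per-match averaging is an assumption, not the
# system's documented definition.

def average_match_length(matched_spans):
    """Mean length, in words, of the matched pieces.

    matched_spans holds (start, end) word offsets, with end exclusive.
    Empty or inverted spans are ignored.
    """
    lengths = [end - start for start, end in matched_spans if end > start]
    if not lengths:
        return 0.0
    return sum(lengths) / len(lengths)
```

For matches of three, three, and two words, the result is 8/3, or roughly 2.67 words per match.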

[Graph of corpus size vs. match length for French]
Graph 3: French match length

[Graph of corpus size vs. match length for Spanish]
Graph 4: Spanish match length

As can be seen from these graphs, simple tokenization adds a few percent to the coverage and average match length, while full recursive matching substantially increases both once the initial overhead of morphological entries and grammar rules has been accounted for. A more important measure in practice, however, is how much text is required to reach a given coverage of unrestricted text; here, recursive matching with a grammar makes a dramatic difference. Achieving 80% coverage of French inputs requires about 1.4 million words of text (French + English) without generalizations, but less than 280,000 words with full recursive matching -- a factor of five, and most of the text in the latter case is linguistic information rather than example sentences. Because the performance curve flattens out, the benefit increases as the coverage level goes up, reaching a reduction by a factor of eleven at 90% coverage. A similar, though somewhat less pronounced (due to the smaller amount of linguistic knowledge), reduction in the required amount of text is seen in the Spanish system.
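
The arithmetic behind the French figures at 80% coverage can be checked with a short sketch. The function names are hypothetical; the inputs are the approximate word counts quoted above, with 280,000 treated as an upper bound on the text needed under full recursive matching.

```python
# Illustrative sketch only: names are assumptions; figures come from the text.

def reduction_factor(words_without, words_with):
    """How many times less text is needed with generalization."""
    if words_with <= 0:
        raise ValueError("words_with must be positive")
    return words_without / words_with


def overhead_share(total_words, overhead_words):
    """Percentage of the total text that is linguistic data."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return 100.0 * overhead_words / total_words


# French system at 80% coverage.
french_factor = reduction_factor(1_400_000, 280_000)
french_overhead_share = overhead_share(280_000, 224_000)
french_example_words = 280_000 - 224_000
```

This yields a factor of 5.0, with 80% of the 280,000 words being linguistic data and at most 56,000 words being actual example text. Because the requirement is stated as less than 280,000 words, the factor is at least five and the overhead share at least 80%.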

(Last updated 04-Aug-99)