Prev: Results of Automated Generalization

Generalized Example-Based Machine Translation

7. Word-Level Alignment

The general discussion of the EBMT system outlined the word-level alignment algorithm used to find the corresponding translation for a partial match. The algorithm has two basic steps: creating a correspondence table, and applying a set of heuristic scoring functions to each candidate alignment using the correspondence table and the full translation example.

To create a correspondence table, first look up each word in the source-language half of a translation example in a bilingual dictionary, and mark each occurrence of any of the translations in the target-language half as a possible correspondence. This yields a matrix such as the one in Figure 1. Note that there can be considerable ambiguity, especially for more common words such as determiners and prepositions.
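The table construction just described can be sketched as follows. This is an illustrative sketch only; the dictionary, the toy sentences, and the boolean-matrix layout are assumptions, not the original implementation.

```python
def build_correspondence_table(src_words, tgt_words, bilingual_dict):
    """Return a boolean matrix: table[i][j] is True when tgt_words[j]
    is listed in the dictionary as a translation of src_words[i]."""
    table = [[False] * len(tgt_words) for _ in src_words]
    for i, src in enumerate(src_words):
        translations = bilingual_dict.get(src, set())
        for j, tgt in enumerate(tgt_words):
            if tgt in translations:
                table[i][j] = True
    return table

# Toy Spanish-English example (hypothetical dictionary entries):
toy_dict = {"el": {"the"}, "gato": {"cat"}, "negro": {"black"}}
table = build_correspondence_table(
    ["el", "gato", "negro"], ["the", "black", "cat"], toy_dict)
```

A common word with several dictionary translations would simply turn on several cells in its row, producing the ambiguity noted above.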

Before actually using the correspondence table, we first prune it, removing those correspondences that are unlikely to be correct. To perform the pruning, we take advantage of the fact that there is usually local coherence between the two languages -- even if the word order is not exactly the same, a particular unit such as a noun phrase in one language will usually correspond to a single unit in the other language, rather than being scattered throughout the translation. Thus, for each word, we compute an expected range for its translation, and remove those correspondences which lie outside that range (ensuring that we never remove the last correspondence for any word in either language). The expected range for a word's translation is simply the range between the earliest and latest possible translation of its neighbors, expanded by a given number of words to allow for minor variations in word order. Should one of the neighboring words have no known correspondences, the expected range is set to be the entire sentence.
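One pruning pass over a row can be sketched as below, following the description above. The function names, the `slack` parameter, and the boolean-matrix representation are assumptions for illustration, not the original code.

```python
def expected_range(table, i, slack=2):
    """Expected column range for row i's translation: the span covered
    by the neighbors' correspondences, widened by `slack` words on each
    side. A neighbor with no correspondences yields the whole sentence."""
    n_cols = len(table[0])
    cols = []
    for nb in (i - 1, i + 1):
        if 0 <= nb < len(table):
            nb_cols = [j for j, on in enumerate(table[nb]) if on]
            if not nb_cols:               # neighbor unknown
                return 0, n_cols - 1
            cols.extend(nb_cols)
    if not cols:
        return 0, n_cols - 1
    return max(0, min(cols) - slack), min(n_cols - 1, max(cols) + slack)

def prune_row(table, i, slack=2):
    """Drop row i's correspondences outside the expected range, never
    removing the last correspondence in its row or in its column."""
    lo, hi = expected_range(table, i, slack)
    row_count = sum(table[i])
    changed = False
    for j, on in enumerate(table[i]):
        if on and not (lo <= j <= hi):
            col_count = sum(row[j] for row in table)
            if row_count > 1 and col_count > 1:
                table[i][j] = False
                row_count -= 1
                changed = True
    return changed
```

A full pruning would alternate such passes over rows and columns until no pass removes anything, as the next paragraph describes for Figure 1.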

In Figure 1, we would prune the table by first proceeding downward, trimming each row as we go, and then proceeding left-to-right, trimming each column as we go; this process repeats until no more correspondences can be removed. Consider the two green rectangles. These indicate the expected ranges for the words pointed at by the arrows on the left. For the upper rectangle, we can remove the two possibilities to its right, since neither is the only correspondence in its column, and we will still keep the third possibility in that row. For the lower green rectangle, nothing can be removed -- yet. When we switch to moving across the table, we eventually reach the red rectangle. Notice how we can now remove the two possibilities above the rectangle, which were unaffected during the scan down the table.

This example also demonstrates how the method is able to deal with large-scale rearrangement of the units in the sentence. Within each such unit, we have the same situation we would have had without the rearrangement. Only at the boundaries is there a change, which has the effect of making it harder for possible correspondences to be removed, since the expected range becomes larger. Harder, but not impossible -- as shown by the removal of the two correspondences above the red rectangle.

[Picture illustrating local coherence]
Figure 1: Local Coherence in Correspondence Table

Figures 2 and 3 show an actual example encountered in the Spanish-English system. In Figure 2, both the shaded and black boxes were initially listed as possible correspondences from the dictionary lookup. The shaded boxes were removed by the pruning method (using a two-word expansion of the range for word order variations). Note, however, the four remaining black squares in the top right-hand corner. To a human reader these are clearly out of place, but the pruning method could not touch them because the column to the left has no known correspondences.

[Picture of Pruned Correspondence Table]
Figure 2: Pruned Correspondence Table

To deal with such cases, we make an assumption: if there are a small number of words on each side with no known correspondences, then they probably correspond to some permutation of one another, but were simply not present in the dictionary. So we mark them as possible correspondences for each other (the four circles in Figure 2), and repeat the pruning. This yields Figure 3, which shows how the four squares mentioned before have now been removed, as well as one of the four correspondences that we added. Figure 3 also nicely illustrates that Spanish and English have very nearly the same word order, at least for this particular sentence.
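The fallback just described -- pairing up the uncovered words on each side and re-running the pruning -- can be sketched as follows. The function name and data layout are assumptions for illustration.

```python
def add_unknown_correspondences(table):
    """For each source word (row) and target word (column) with no
    known correspondences, mark every such (row, column) pair as a
    possible correspondence, on the assumption that the uncovered
    words translate some permutation of one another."""
    empty_rows = [i for i, row in enumerate(table) if not any(row)]
    n_cols = len(table[0])
    empty_cols = [j for j in range(n_cols)
                  if not any(row[j] for row in table)]
    for i in empty_rows:
        for j in empty_cols:
            table[i][j] = True
    return empty_rows, empty_cols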

[Picture of Final Correspondence Table]
Figure 3: Final Correspondence Table

Since generating the pruned correspondence table consumes the bulk of the computation involved in the word-level alignment algorithm, we normally build the table when the example base is indexed, and store the table along with each translation example.
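This precompute-and-store design might be organized as in the minimal sketch below; the class and method names are assumptions, and the pruning routine is passed in rather than reimplemented here.

```python
class ExampleBase:
    """Stores translation examples together with their pruned
    correspondence tables, computed once at indexing time."""

    def __init__(self, build_pruned_table):
        self._build = build_pruned_table   # the table-building/pruning step
        self._entries = []                 # (src, tgt, table) triples

    def add(self, src_words, tgt_words):
        # Pay the pruning cost once, when the example is indexed.
        table = self._build(src_words, tgt_words)
        self._entries.append((src_words, tgt_words, table))

    def get(self, idx):
        """Translation-time alignment just looks the table up."""
        return self._entries[idx]
```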

Next: Statistical Dictionaries

(Last updated 04-Aug-99)