To make the examples more effective, we can generalize them, so that more than one string can match any given part of the example. Suppose that we have the following translation example for English to German:

    John Hancock was in Philadelphia on July 4th.
    John Hancock war am 4. Juli in Philadelphia.

Now, if we knew that "John Hancock" is a person, Philadelphia is a city, and July 4th is a date, we could save this example in our database as

    <PERSON> was in <CITY> on <DATE>.
    <PERSON> war am <DATE> in <CITY>.

where <PERSON>, <CITY>, and <DATE> are special tokens naming the equivalence class whose members can be "plugged in" at that location. Notice how we can now immediately match many other strings that have this pattern, and are not restricted to matching just "John Hancock", "Philadelphia", or "July 4th". Any other member of the equivalence class <PERSON> can match what was originally "John Hancock".
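The effect of generalization can be illustrated with a minimal sketch. It is not the system's actual code: the names CLASSES, EXAMPLES, tokenize, and translate are hypothetical, the class members are hand-written, and the example base holds only the single generalized pair shown above.

```python
import re

# Hypothetical equivalence classes: token -> {source phrase: translation}.
CLASSES = {
    "<PERSON>": {"John Hancock": "John Hancock",
                 "Benjamin Franklin": "Benjamin Franklin"},
    "<CITY>": {"Philadelphia": "Philadelphia", "Boston": "Boston"},
    "<DATE>": {"July 4th": "4. Juli"},
}

# Hypothetical example base containing only the generalized example.
EXAMPLES = {
    "<PERSON> was in <CITY> on <DATE>.": "<PERSON> war am <DATE> in <CITY>.",
}


def tokenize(sentence, classes=CLASSES):
    """Replace known phrases by class tokens and remember what was replaced."""
    lookup = {phrase: (token, translation)
              for token, members in classes.items()
              for phrase, translation in members.items()}
    # Longest phrases first, so a longer phrase wins over a shorter one.
    alternatives = "|".join(re.escape(p)
                            for p in sorted(lookup, key=len, reverse=True))
    pattern = re.compile(r"(?<!\w)(?:" + alternatives + r")(?!\w)")
    bindings = []

    def replace(match):
        token, translation = lookup[match.group(0)]
        bindings.append((token, match.group(0), translation))
        return token

    return pattern.sub(replace, sentence), bindings


def translate(sentence):
    """Look up the tokenized sentence and plug the translations back in."""
    generalized, bindings = tokenize(sentence)
    target = EXAMPLES.get(generalized)
    if target is None:
        return None
    # Simplification: assumes each class token occurs at most once per example.
    for token, _, translation in bindings:
        target = target.replace(token, translation, 1)
    return target


print(translate("Benjamin Franklin was in Boston on July 4th."))
# prints: Benjamin Franklin war am 4. Juli in Boston.
```

The new sentence shares no names with the stored example, yet it matches in full because both reduce to the same token sequence.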
How does the system know that "John Hancock" is a person? We tell it through specialized entries in its knowledge base. When indexing the example base, and before matching a new input against the database, the system tokenizes the input by searching the sentence for words and phrases that are listed in these specialized entries and replacing each occurrence with the appropriate token. The original word or phrase, along with its translation, is remembered for later use. Partly because of the way the system evolved and partly for efficiency reasons, there are in fact two types of special entries with which words and phrases can be assigned to equivalence classes.
The first, and older, type is a separate file of tokenizations which simply lists all the members of a class in a group, along with the corresponding translation(s) for each. Any particular word or phrase may have multiple translations in its equivalence class, all of which will be recognized while indexing the translation examples, but only one of which (the preferred one) will actually be produced while translating. A word or phrase is allowed to be in multiple equivalence classes provided that all of the translations are unique -- if two equivalence classes were to contain the same translation, the indexer would not know which equivalence class to use.
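The uniqueness requirement can be made concrete with a short sketch. The in-memory layout below is an assumption, not the real file format: each class maps a source phrase to a list of translations, with the preferred one first by convention. The names TOKENIZATIONS, preferred, and build_target_index are hypothetical.

```python
# Hypothetical layout: class -> {source phrase: [translations, preferred first]}.
TOKENIZATIONS = {
    "<CITY>": {"Munich": ["München", "Muenchen"]},
    # "March" appears in two classes, which is fine: its translations differ.
    "<MONTH>": {"March": ["März"], "July": ["Juli"]},
    "<PERSON>": {"March": ["March"]},
}


def preferred(tokenizations, cls, phrase):
    """Return the translation that would be produced when translating."""
    return tokenizations[cls][phrase][0]


def build_target_index(tokenizations):
    """Map every translation to its class, as needed when indexing examples.

    Raises ValueError if two classes list the same translation, because the
    class of that translation could then not be determined.
    """
    index = {}
    for cls, members in tokenizations.items():
        for translations in members.values():
            for translation in translations:
                if index.get(translation, cls) != cls:
                    raise ValueError(
                        f"{translation!r} is listed under both "
                        f"{index[translation]} and {cls}")
                index[translation] = cls
    return index


index = build_target_index(TOKENIZATIONS)
print(index["Muenchen"], index["März"], index["March"])
# prints: <CITY> <MONTH> <PERSON>
```

Both "München" and "Muenchen" are recognized as a <CITY> during indexing, while only "München" is produced. Listing "Washington" with the translation "Washington" under both <CITY> and <PERSON> would raise the error.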
The members of an equivalence class may themselves contain tokens, allowing entire patterns to be defined, such as the following definition for a <DATE>:

    <MONTH> <NUMBER>1 , <NUMBER>2    [English]
    <NUMBER>1 . <MONTH> <NUMBER>2    [German]

which allows "July 4, 1776" to be translated into German as "4. Juli 1776".
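As a rough illustration of how such a pattern reorders its parts, the sketch below binds each numbered token on the English side and emits the bindings in the German order. MONTHS, SOURCE, TARGET, and translate_date are hypothetical names, and the word splitting and punctuation spacing are simplifying assumptions.

```python
import re

MONTHS = {"June": "Juni", "July": "Juli"}  # hypothetical <MONTH> members

SOURCE = ["<MONTH>", "<NUMBER>1", ",", "<NUMBER>2"]  # English side
TARGET = ["<NUMBER>1", ".", "<MONTH>", "<NUMBER>2"]  # German side


def translate_date(text):
    """Match text against SOURCE and rebuild it in TARGET order."""
    words = re.findall(r"\w+|[^\w\s]", text)  # split off punctuation
    if len(words) != len(SOURCE):
        return None
    slots = {}
    for slot, word in zip(SOURCE, words):
        if slot == "<MONTH>":
            if word not in MONTHS:
                return None
            slots[slot] = MONTHS[word]
        elif slot.startswith("<NUMBER>"):
            if not word.isdigit():
                return None
            slots[slot] = word  # numbers are copied unchanged
        elif slot != word:  # literal pattern element such as ","
            return None
    output = " ".join(slots.get(element, element) for element in TARGET)
    return re.sub(r"\s+([.,])", r"\1", output)  # reattach punctuation


print(translate_date("July 4, 1776"))
# prints: 4. Juli 1776
```

The numeric suffixes are what keep the two <NUMBER> slots apart, so that the day and the year end up in the right positions on the German side.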
The second type of special entry consists of tagged entries within the example base itself. These entries differ from those in the tokenization file in that they may be ambiguous, and they are not applied unconditionally to every input, but only when the appropriate context is present. Because tagged entries are stored in the same database as regular entries, they can take part in normal partial matches after tokenization is complete, as well as in full matches for tokenization. For computational efficiency, tagged entries are also stored in a secondary index containing only tagged entries.
As with entries in the tokenization file, tagged entries in the example base may themselves contain tokens. To support this feature, the indexer (as well as the runtime lookup preprocessing) performs recursive matching: tokenize using the tokenization file, tokenize by looking up tagged entries in the database, and repeat until no more replacements are possible.
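The repeat-until-stable loop can be sketched as follows. This is an illustration under simplifying assumptions: both sources are modeled as plain tables from word sequences to class tokens, the context conditions on tagged entries are ignored, and the names replace_once, tokenize_recursively, TOKEN_FILE, and TAGGED, as well as the <TIME> class, are hypothetical.

```python
def replace_once(tokens, table):
    """One left-to-right pass, preferring the longest match at each position."""
    longest = max(map(len, table), default=0)
    output, i, changed = [], 0, False
    while i < len(tokens):
        for n in range(min(longest, len(tokens) - i), 0, -1):
            cls = table.get(tuple(tokens[i:i + n]))
            if cls is not None:
                output.append(cls)
                i += n
                changed = True
                break
        else:  # nothing matched at position i
            output.append(tokens[i])
            i += 1
    return output, changed


def tokenize_recursively(tokens, token_file, tagged, max_rounds=20):
    """Alternate both passes until neither makes a replacement.

    max_rounds guards against cyclic single-token entries, which this
    sketch assumes do not occur in practice.
    """
    for _ in range(max_rounds):
        tokens, changed_file = replace_once(tokens, token_file)
        tokens, changed_tagged = replace_once(tokens, tagged)
        if not (changed_file or changed_tagged):
            break
    return tokens


TOKEN_FILE = {
    ("July",): "<MONTH>",
    ("4",): "<NUMBER>",
    ("1776",): "<NUMBER>",
    ("<MONTH>", "<NUMBER>", ",", "<NUMBER>"): "<DATE>",
}
TAGGED = {("on", "<DATE>"): "<TIME>"}  # hypothetical tagged entry

print(tokenize_recursively(["signed", "on", "July", "4", ",", "1776"],
                           TOKEN_FILE, TAGGED))
# prints: ['signed', '<TIME>']
```

The first round produces only <MONTH> and <NUMBER>. The <DATE> pattern and the tagged entry that contains <DATE> can match only in the second round, which is why a single pass would not be enough.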