@make{slides}
@libraryfile{accents}
@libraryfile{mathematics10}
@style{fontfamily=timesroman}
@pagefooting{right="@-{@-{RDB 31Jul97 SIGIR-97 CLIR (@value{page})}}"}
@form{drawing=<@begin{center,leftmargin 0,rightmargin 0, below 2 lines}
@comment[@graphic MUST start in first column]
@graphic{postscript=@parm(file), magnify=@parm(magnify,default 1.0),
         boundingbox=@parm(boundingbox,default file)}
@string{drawingtagname=[@parm{tag,default Notag}]}
@case{drawingtagname, Notag [@comment{No tag given}],
                      Else [@tag{@parm(tag,default [])}]}
@end{center}>}
@comment{----------------------------------------------------------------}
@begin{center}
@heading{Corpus-Based Query Translation for}
@heading{Cross-Lingual Information Retrieval}
@blankspace(1in)
Ralf D. Brown
Language Technologies Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3890 USA
@blankspace(1line)
ralf+@@cs.cmu.edu
@end{center}
@newpage{}
@comment{----------------------------------------------------------------}
@heading{Overview}

@begin{enumerate}
Translation Methods:
@begin{itemize}
Dictionary-based Term Translation

Example-based Term Translation

EBMT
@end{itemize}

Automated Dictionary Extraction

Evaluation Results
@end{enumerate}
@newpage{}
@comment{----------------------------------------------------------------}
@heading{Dictionary-Based Term Translation}

For each word in the query:
@begin{enumerate}
look up the word in a general-purpose bilingual dictionary

add all translations of the input word to the translated query
@end{enumerate}

@heading{Example-Based Term Translation}

For each word in the query:
@begin{enumerate}
look up the word in a corpus-derived bilingual dictionary

for each translation of the input word,
@begin{itemize}
compute its relative frequency from the frequency information in the dictionary

add the translation to the translated query 20*freq times (but at least once)
@end{itemize}
@end{enumerate}
@newpage{}
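The example-based term translation steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the system's actual code: the function name, the dictionary layout (source word mapped to target words with counts), the toy counts, the use of rounding, and the choice to drop words missing from the dictionary are all assumptions made for the example.

```python
from collections import Counter


def translate_query(query, dictionary, scale=20):
    """Translate a query word by word, repeating each translation in
    proportion to its relative frequency in the dictionary."""
    translated = []
    for word in query.split():
        counts = dictionary.get(word)
        if not counts:
            # Assumption: words missing from the dictionary are dropped.
            continue
        total = sum(counts.values())
        for target, count in counts.items():
            # scale * relative frequency, but at least once.
            # Rounding to the nearest integer is an assumption.
            repeats = max(1, round(scale * count / total))
            translated.extend([target] * repeats)
    return translated


# Hypothetical dictionary entries: target word -> frequency count.
toy_dictionary = {
    "water": {"agua": 70, "abastecimiento": 30},
    "health": {"salud": 50},
}

result = Counter(translate_query("water health", toy_dictionary))
print(result)
```

Repeating a term is a simple way to pass a weight to a retrieval engine that only sees bags of words: the more probable translation contributes a proportionally higher term frequency to the translated query.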
@comment{----------------------------------------------------------------}
@heading{Example-Based Machine Translation}

For each @i{line} in the query:
@begin{enumerate}
find all phrases which match against the source-language half of the example base

for each matching phrase:
@begin{itemize}
attempt to find the translation by performing subsentential alignment

if the alignment is successful, add the translation of the phrase to the translated query
@end{itemize}
@end{enumerate}
@newpage{}
@comment{----------------------------------------------------------------}
@heading{Automated Dictionary Extraction}

The extraction process is a simple count-and-filter algorithm, consisting of the following steps:
@begin{itemize}
creating a table of bilingual co-occurrence counts

filtering the table using a set of threshold tests
@end{itemize}

Two additional refinements are available:
@begin{itemize}
a positional bias for languages with similar word orders

a minor modification for handling the highest-frequency words
@end{itemize}

@heading{The Co-Occurrence Table}

The co-occurrence table is simply a two-dimensional matrix indexed by source-language words in one dimension and target-language words in the other. For each sentence pair in the corpus, increment the counts in the cells for the cross-product of the words in the two sentences. Thus, the sentence pair
@begin{verbatim, size -1}
the report on the
el informe sobre el
@end{verbatim}
would cause the following elements in the co-occurrence table to be incremented:
@begin{verbatim,size -2.5}
c[the,el]       c[report,el]       c[on,el]
c[the,informe]  c[report,informe]  c[on,informe]
c[the,sobre]    c[report,sobre]    c[on,sobre]
@end{verbatim}
@newpage{}
@comment{----------------------------------------------------------------}
@heading{Filtering}

After constructing the co-occurrence table, all entries which do not pass at least one of the threshold functions are zeroed, and any remaining non-zero entries are output as probable translations.
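As a rough illustration of the table that this filtering step operates on, the counting pass could be sketched in Python as below. The function name and the whitespace tokenization are hypothetical. Counting each distinct word once per sentence, and taking the count of a word to be the number of sentence pairs containing it, are assumptions chosen to match the nine cells listed for the sample sentence pair.

```python
from collections import Counter
from itertools import product


def build_cooccurrence(sentence_pairs):
    """Count, for every (source word, target word) pair, the number of
    sentence pairs in which both words occur."""
    cooc = Counter()          # c[S,T]
    source_count = Counter()  # count[S]
    target_count = Counter()  # count[T]
    for source, target in sentence_pairs:
        # Assumption: each distinct word counts once per sentence.
        source_words = set(source.split())
        target_words = set(target.split())
        source_count.update(source_words)
        target_count.update(target_words)
        # Increment every cell in the cross-product of the two word sets.
        cooc.update(product(source_words, target_words))
    return cooc, source_count, target_count


cooc, source_count, target_count = build_cooccurrence(
    [("the report on the", "el informe sobre el")]
)
print(len(cooc))
print(cooc[("the", "el")])
```

A sparse counter keyed by word pairs stands in for the two-dimensional matrix here, since most cells of the full matrix would stay at zero.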
The symmetric threshold is passed whenever
@begin{mathdisplay,size -1}
C[S,T] @gte threshold[C] * count[S] @* @ @i{and} @* C[S,T] @gte threshold[C] * count[T],
@end{mathdisplay}
where @math{C[S,T]} is the number of times source-language word @math{S} co-occurs with target-language word @math{T} and @math{threshold[C]} is the threshold value selected by that co-occurrence count.

The asymmetric threshold is passed whenever
@begin{mathdisplay}
C[S,T] @gte thresh1[C] * count[S] @i{and} C[S,T] @gte thresh2[C] * count[T]
@center{@i{or}}
C[S,T] @gte thresh1[C] * count[T] @i{and} C[S,T] @gte thresh2[C] * count[S],
@end{mathdisplay}
where @math{thresh1[C]} and @math{thresh2[C]} are the two separate limits of the asymmetric threshold.
@newpage{}
@comment{----------------------------------------------------------------}
@heading{Variable Thresholds}

As shown on the previous slide, the threshold values are a function of the total co-occurrence count. This permits adjustments for very low-frequency and very high-frequency words:
@begin{itemize}
very low-frequency words may have a fairly high co-occurrence ratio purely by chance (e.g., a word which appears only twice in the entire corpus will have a co-occurrence ratio of at least 0.50 with @b{any} word with which it co-occurs).

very high-frequency words may have a high co-occurrence ratio with other very high-frequency words.
@end{itemize}

Sample Threshold Function:
@tabclear{}
@begin{format,leftmargin +1.5in}
@tabset{1in}
C @\ Threshold
1 @\ 1.000
2 @\ 0.667
3 @\ 0.667
4 @\ 0.500
5 @\ 0.400
6 @\ 0.333
7 @\ 0.286
8 @\ 0.270
...
1000 @\ 0.300
...
2000 @\ 0.400
...
3000 @\ 0.500
@end{format}
@newpage{}
@comment{----------------------------------------------------------------}
@heading{High-Frequency Words}

The highest-frequency words occur so often that they correlate with each other sufficiently to pass the filtering tests. Thus, a slightly modified algorithm is needed.
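For reference, the filtering tests mentioned above (the symmetric and asymmetric thresholds from the Filtering slide) might be sketched in Python as below. The function names, the toy counts, and the constant threshold values 0.5, 0.8 and 0.2 are hypothetical. The thresholds are passed in as functions of the co-occurrence count, so a variable threshold function could be substituted for the constants.

```python
def passes_symmetric(c, count_s, count_t, threshold):
    """C[S,T] >= threshold[C] * count[S] and C[S,T] >= threshold[C] * count[T]."""
    t = threshold(c)
    return c >= t * count_s and c >= t * count_t


def passes_asymmetric(c, count_s, count_t, thresh1, thresh2):
    """The two limits may apply to the two words in either order."""
    t1, t2 = thresh1(c), thresh2(c)
    return ((c >= t1 * count_s and c >= t2 * count_t) or
            (c >= t1 * count_t and c >= t2 * count_s))


def filter_table(cooc, source_count, target_count, threshold, thresh1, thresh2):
    """Keep only the entries that pass at least one of the two tests."""
    kept = {}
    for (s, t), c in cooc.items():
        cs, ct = source_count[s], target_count[t]
        if (passes_symmetric(c, cs, ct, threshold)
                or passes_asymmetric(c, cs, ct, thresh1, thresh2)):
            kept[(s, t)] = c
    return kept


# Hypothetical counts, for illustration only.
toy_cooc = {("report", "informe"): 8, ("the", "el"): 40, ("report", "el"): 8}
toy_source_count = {"report": 10, "the": 50}
toy_target_count = {"informe": 9, "el": 60}

kept = filter_table(
    toy_cooc, toy_source_count, toy_target_count,
    threshold=lambda c: 0.5,  # hypothetical constant thresholds
    thresh1=lambda c: 0.8,
    thresh2=lambda c: 0.2,
)
print(sorted(kept))
```

In this toy data the pair (report, el) is rejected because the co-occurrence accounts for too small a share of the occurrences of "el", while the two frequent words "the" and "el" pass easily.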
To extract definitions for the highest-frequency words, a second pass is used, with the following differences: @begin{itemize} All but the highest-frequency words are omitted from the correspondence table. All sentences in the corpus which contain more than three high-frequency words are ignored. @end{itemize} When run for the 16 words occurring in 20% or more of all sentences, this modified algorithm extracts definitions for seven of them with no incorrect definitions. @newpage{} @comment{----------------------------------------------------------------} @heading{Induced Definitions} @begin{format,size -5.75} (ASTUDILLO (ASTUDILLO 1)) (COMPLICACI@uoac{}N (COMPLICATION 9)) (REFUTA (REFUTES 3)) (OCTUBRE (OCTOBER 4350)(NOVEMBER 419)(SEPTEMBER 314)(DATED 314)) (ACOPIAR (OLS 4)) (CENTRALIZANDO (CENTRALIZING 4)) (K1N (OTTAWA 6)(ONTARIO 4)(K1N 2)) (CART@uoac{}GRAFOS (CARTOGRAPHERS 2)) (CIEGOS (BLIND 18)) (CMINU (JUNIC 45)) (P@uuac{}AS (BARBED 8)(WIRE 8)) (DESMOTADO (COTTON 17)(GINNED 8)) (NUDO (IDICT 3)(NODE 3)(GORDIAN 3)(KNOT 3)) (TIEMPO (TIME 4723)(SAME 1357)(WHILE 1019)) (TRANSFORMAR@uaac{}N (APPEASEMENT 2)) (PRISMA (PRISM 2)) (RESURGIMIENTO (RESURGENCE 36)(FASCISM 6)) (COEXISTIR (COEXIST 3)) (MA@uiac{}Z (MAIZE 31)(RICE 15)(METRIC 14)(CORN 10)(SORGHUM 9)(WHEAT 8) (MILLET 6)(INSECTS 5)) (FECALES (FC 6)(COLIFORMS 3)(ML 3)(COLIFORM 3)(FECAL 3)) (ERR@uoac{}NEAMENTE (MISINTERPRETED 7)(WRONGLY 5)) (HORTALIZAS (VEGETABLES 31)(VEGETABLE 18)(FRUITS 15)(FRUIT 9)) (NOVIEMBRE (NOVEMBER 4810)(INTRODUCED 511)(OCTOBER 375)) (GANTCHEV (GANTCHEV 5)) (MASDIT (MASDIT 9)) (AGHRYMET (AGHRYMET 4)) (SABE (KNOWN 95)(KNEW 25)) (ABSOLUTAMENTE (ABSOLUTELY 65)) (O.O. (SP@uoac{}LKA 4)(AKCYJNA 2)(OGRANICZA 2)(ODPOWIEDZIALNOSCIA 2) (SP.Z 2)(O.O. 
2)) (ITINERARIOS (ITINERARIES 5)(PREDETERMINED 5)) (LAMENTAMOS (REGRET 34)) (DESPLAZADOS (DISPLACED 143)(EXTERNALLY 13)(INTERNALLY 12) (AMERICANS 9)) (BARROCO (BAROQUE 3)(VIDEODISC 2)) (PRESUME (PRESUMED 7)) (DECLINAN (CAPTIVES 3)) (SHAHI (SHAHI 3)(M@uuum{}LLERSON 2)) (TIENDEN (TEND 96)(TENDED 24)) (SECTORIALISMO (SECTORALISM 2)) (MILOSEVIC (MILOSEVIC 11)(TUDJMAN 4)(KADIJEVIC 4)) (CINEMATECAS (PRINTS 2)) (DECIMALES (DECIMALS 2)) (|21B| (|21B| 2)) (BATOR (BATOR 7)(ULAN 4)) @end{format} @newpage{} @comment{----------------------------------------------------------------} @heading{Size-Accuracy Tradeoffs} By setting the thresholds used in filtering to different values, a tradeoff between yield and accuracy may be tuned: @begin{itemize} Raising the thresholds reduces the number of incorrect/spurious translations generated, but reduces the size of the dictionary. Lowering the thresholds yields more definitions, but also increases the error rate. @end{itemize} The size-accuracy tradeoff needs to be tuned for each application. For EBMT subsentential alignments, a very low threshold (large dictionary, but also high error rate) proved best; for CLIR, a higher threshold (smaller dictionary, but considerably lower error rate) proved best. 
@heading{Advantages of a Corpus-Derived Dictionary} @begin{itemize} tuned to the way the corpus translates terms provides frequency information which is not available in general-purpose dictionaries provides some query expansion because monolingually-collocated terms are also included as translations can be tuned for best trade-off between accuracy and coverage @end{itemize} @newpage{} @comment{----------------------------------------------------------------} @heading{Evaluation Setup} Corpus: @begin{itemize} UN documents referring repeatedly to UNICEF (at least 10 times) ``UNICEF'' documents were subdivided along internal section boundaries to increase size of document collection human relevance judgements collected for each query over the collection @end{itemize} Metrics: @begin{itemize} Standard 11-point average precision Degradation from MLIR to CLIR @end{itemize} Baseline: @begin{itemize} SMART without relevance feedback (SMART.basic), optimized on the UNICEF collection @end{itemize} @newpage{} @comment{----------------------------------------------------------------} @heading{Sample Query Translations} Query: @begin{format,size -4} water purification sanitation water supply project clean water personal hygiene health sanitation @end{format} Translated: @begin{format,size -4} abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento agua agua agua agua agua agua agua agua agua agua agua agua agua agua abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento agua agua agua agua agua agua agua agua agua agua agua agua agua agua abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento 
abastecimiento abastecimiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento agua agua agua agua agua agua agua agua agua agua agua agua agua agua abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento salud salud salud salud salud salud salud salud salud salud salud salud salud salud salud salud salud salud salud salud higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene higiene personales personales personales personales personales personales personales personales personales personales personales personales personales personales personales personales personales personales personales personales @end{format} @newpage{} @comment{----------------------------------------------------------------} @heading{Results} @tabclear{} @tabset{1.3in,2.7in,3.85in,5.0in} @begin{format} @bar{} Method @\ Weighting @\ MIR @\ TIR @\ TIR/MIR @bar{} DICT @\ ntc.ntc @\ .4721 @\ .2898 @\ 61% DICT @\ ltc.ltc @\ .4306 @\ .2340 @\ 54% DICT @\ atc.atc @\ .3492 @\ .1749 @\ 50% @bar{} EB-Term @\ ntc.ntc @\ .4721 @\ .4318 @\ 91% EB-Term @\ ltc.ltc @\ .4306 @\ .3723 @\ 86% EB-Term @\ atc.atc @\ .3492 @\ .2803 @\ 80% @bar{} @end{format} 
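As a quick check on the table above, the TIR/MIR column can be recomputed from the two average-precision columns. The sketch below simply divides TIR by MIR for each row and rounds to a whole percentage. The variable names are hypothetical, and rounding to the nearest integer is an assumption about how the column was produced.

```python
# Rows copied from the results table: method, weighting, MIR, TIR.
results = [
    ("DICT", "ntc.ntc", 0.4721, 0.2898),
    ("DICT", "ltc.ltc", 0.4306, 0.2340),
    ("DICT", "atc.atc", 0.3492, 0.1749),
    ("EB-Term", "ntc.ntc", 0.4721, 0.4318),
    ("EB-Term", "ltc.ltc", 0.4306, 0.3723),
    ("EB-Term", "atc.atc", 0.3492, 0.2803),
]

# Share of the MIR average precision retained by TIR, in percent.
ratios = [round(100 * tir / mir) for _, _, mir, tir in results]

for (method, weighting, _, _), ratio in zip(results, ratios):
    print(f"{method:8s} {weighting}  {ratio}%")
```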
@blankspace(0.5in)
@heading{Comparison of EBT Dictionaries}
@tabclear{}
@begin{format,leftmargin 1.9in,rightmargin 1.9in}
@tabset{+1.5in}
Dictionary @\ TIR avgp
@ux{Threshold @\ (ntc.ntc)}
0.10 @\ 0.3886
0.15 @\ 0.3936
0.20 @\ 0.4097
0.22 @\ 0.4235
0.25 @\ 0.4111
0.27 @\ 0.4318
@ux{0.30 @\ 0.3988}
@end{format}
@newpage{}
@comment{----------------------------------------------------------------}
@comment{EXTRA SLIDES -- FOR Q&A}
@heading{PanEBMT Corpus}

Spanish-to-English corpus:
@begin{itemize}
~685,000 sentence pairs -- ~265 megabytes

UN Multilingual Corpus supplies most of the pairs

10,250 sentence pairs from Pan American Health Organization

552 sentence pairs from prior ARPA evaluations (newswire text)
@end{itemize}

Example:
@begin{verbatim}
Las fuentes de esos comentarios y recomendaciones son las suguientes:
The sources of these comments and recommendations are:
@end{verbatim}
@newpage{}
@comment{----------------------------------------------------------------}
@heading{PanEBMT Architecture}
@blankspace(0.75in)
@drawing{file="ebmt-arch.pic",magnify 0.8,tag ebmt-arch}
@newpage{}
@comment{----------------------------------------------------------------}
@comment{ --- End of File --- }