@make{slides}
@libraryfile{accents}
@libraryfile{mathematics10}
@style{fontfamily=timesroman}
@pagefooting{right="@-{@-{RDB 31Jul97 SIGIR-97 CLIR (@value{page})}}"}

@form{drawing=<@begin{center,leftmargin 0,rightmargin 0, below 2 lines}
                @comment[@graphic MUST start in first column]
@graphic{postscript=@parm(file), magnify=@parm(magnify,default 1.0),
                        boundingbox=@parm(boundingbox,default file)}
                @string{drawingtagname=[@parm{tag,default Notag}]}
                @case{drawingtagname,
                        Notag [@comment{No tag given}],
                        Else [@tag{@parm(tag,default [])}]}
                @end{center}>}
@comment{----------------------------------------------------------------}

@begin{center}
@heading{Corpus-Based Query Translation for}
@heading{Cross-Lingual Information Retrieval}
@blankspace(1in)

Ralf D. Brown
Language Technologies Institute
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213-3890
USA

@blankspace(1line)
ralf+@@cs.cmu.edu
@end{center}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Overview}
@begin{enumerate}
Translation Methods:
@begin{itemize}
Dictionary-based Term Translation

Example-based Term Translation

EBMT
@end{itemize}

Automated Dictionary Extraction

Evaluation

Results
@end{enumerate}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Dictionary-Based Term Translation}

For each word in the query:
@begin{enumerate}
look up the word in a general-purpose bilingual dictionary

add all translations of the input word to the translated query
@end{enumerate}
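
The two steps above can be sketched in Python roughly as follows. This is an illustrative sketch only: the function name translate_query_dict, the toy dictionary entries, and the choice to skip words with no dictionary entry are assumptions, not details of the original system.

```python
def translate_query_dict(query_words, bilingual_dict):
    """Replace each query word with all of its dictionary translations."""
    translated = []
    for word in query_words:
        # Assumption: words missing from the dictionary are simply skipped.
        translated.extend(bilingual_dict.get(word, []))
    return translated


# Toy English-to-Spanish dictionary (hypothetical entries).
toy_dict = {
    "water": ["agua"],
    "supply": ["abastecimiento", "suministro"],
}
result = translate_query_dict(["water", "supply", "project"], toy_dict)
```

Every translation enters the query with equal weight, which is the main contrast with the example-based method.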


@heading{Example-Based Term Translation}

For each word in the query:
@begin{enumerate}
look up the word in a corpus-derived bilingual dictionary

for each translation of the input word,
@begin{itemize}
compute its relative frequency from the frequency information in the
dictionary

add the translation to the translated query 20*freq times (but at
least once)
@end{itemize}

@end{enumerate}
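
A minimal Python sketch of the frequency-weighted steps above. The function name, the nested-dictionary layout, and rounding 20*freq to the nearest integer are assumptions; the slide only specifies "20*freq times (but at least once)". The sample entry reuses counts from the Induced Definitions slide purely for illustration.

```python
REPEAT_FACTOR = 20  # the multiplier in "20*freq"


def translate_query_weighted(query_words, freq_dict):
    """Repeat each translation in proportion to its relative frequency."""
    translated = []
    for word in query_words:
        entries = freq_dict.get(word, {})
        total = sum(entries.values())
        for translation, count in entries.items():
            rel_freq = count / total
            # Assumption: round to the nearest integer, never below one copy.
            copies = max(1, round(REPEAT_FACTOR * rel_freq))
            translated.extend([translation] * copies)
    return translated


# Counts taken from the HORTALIZAS entry of the induced dictionary.
sample_dict = {
    "HORTALIZAS": {"VEGETABLES": 31, "VEGETABLE": 18, "FRUITS": 15, "FRUIT": 9},
}
result = translate_query_weighted(["HORTALIZAS"], sample_dict)
```

Repeating a term is a simple way to pass a weight to a retrieval engine that only sees a bag of query words.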

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Example-Based Machine Translation}

For each @i{line} in the query:
@begin{enumerate}
find all phrases which match against the source-language half of the
example base

for each matching phrase:
@begin{itemize}
attempt to find the translation by performing subsentential alignment

if the alignment is successful, add the translation of the phrase
to the translated query
@end{itemize}

@end{enumerate}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Automated Dictionary Extraction}

The extraction process is a simple count-and-filter algorithm,
consisting of the following steps:
@begin{itemize}
creating a table of bilingual co-occurrence counts

filtering the table using a set of threshold tests
@end{itemize}

Two additional refinements are available:
@begin{itemize}
a positional bias for languages with similar word orders

a minor modification for handling the highest-frequency words
@end{itemize}

@heading{The Co-Occurrence Table}

The co-occurrence table is simply a two-dimensional matrix indexed by
source-language words in one dimension and target-language words in the
other.  For each sentence pair in the corpus, increment the counts in
the cells for the cross-product of the words in the two sentences.
Thus, the sentence pair
@begin{verbatim, size -1}
 the report on the
 el informe sobre el
@end{verbatim}
would cause the following elements in the co-occurrence table to be
incremented:
@begin{verbatim,size -2.5}
 c[the,el]       c[report,el]       c[on,el]
 c[the,informe]  c[report,informe]  c[on,informe]
 c[the,sobre]    c[report,sobre]    c[on,sobre]
@end{verbatim}
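
The counting step can be sketched in Python as below. The function name and whitespace tokenization are assumptions, as is counting each distinct word once per sentence pair (which matches the nine cells listed above even though "the" and "el" each occur twice).

```python
from collections import Counter
from itertools import product


def count_cooccurrences(sentence_pairs):
    """Build the co-occurrence table c[source_word, target_word]."""
    table = Counter()
    for source_sentence, target_sentence in sentence_pairs:
        # Assumption: a repeated word counts once per sentence pair.
        source_words = set(source_sentence.split())
        target_words = set(target_sentence.split())
        # Increment every cell in the cross-product of the two word sets.
        for cell in product(source_words, target_words):
            table[cell] += 1
    return table


table = count_cooccurrences([("the report on the", "el informe sobre el")])
```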

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Filtering}

After constructing the co-occurrence table, all entries which do not
pass at least one of the threshold functions are zeroed, and any
remaining non-zero entries are output as probable translations.

The symmetric threshold is passed whenever
@begin{mathdisplay,size -1}
C[S,T] @gte threshold[C] * count[S] @*
@ @i{and} @*
C[S,T] @gte threshold[C] * count[T],
@end{mathdisplay}
where @math{C[S,T]} is the number of times source-language word @math{S}
co-occurs with target-language word @math{T} and @math{threshold[C]} is the threshold
value selected by that co-occurrence count.

The asymmetric threshold is passed whenever
@begin{mathdisplay}
C[S,T] @gte thresh1[C] * count[S]
@i{and}
C[S,T] @gte thresh2[C] * count[T]
@center{@i{or}}
C[S,T] @gte thresh1[C] * count[T]
@i{and}
C[S,T] @gte thresh2[C] * count[S],
@end{mathdisplay}
where @math{thresh1[C]} and @math{thresh2[C]} are the two separate
limits of the asymmetric threshold.
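
The two tests above translate fairly directly into Python. In this sketch the function names are hypothetical, and the threshold functions are passed in as callables of the co-occurrence count; the constant thresholds in the example are made-up values, not those used in the actual system.

```python
def passes_symmetric(c_st, count_s, count_t, threshold):
    """C[S,T] must reach the same fraction of both word counts."""
    limit = threshold(c_st)
    return c_st >= limit * count_s and c_st >= limit * count_t


def passes_asymmetric(c_st, count_s, count_t, thresh1, thresh2):
    """One word must meet thresh1 and the other thresh2, in either order."""
    t1, t2 = thresh1(c_st), thresh2(c_st)
    one_way = c_st >= t1 * count_s and c_st >= t2 * count_t
    other_way = c_st >= t1 * count_t and c_st >= t2 * count_s
    return one_way or other_way


def filter_table(table, source_counts, target_counts,
                 threshold, thresh1, thresh2):
    """Keep entries that pass at least one of the two threshold tests."""
    kept = {}
    for (s, t), c_st in table.items():
        count_s, count_t = source_counts[s], target_counts[t]
        if (passes_symmetric(c_st, count_s, count_t, threshold)
                or passes_asymmetric(c_st, count_s, count_t,
                                     thresh1, thresh2)):
            kept[(s, t)] = c_st
    return kept


# Hypothetical constant thresholds, for illustration only.
kept = filter_table(
    {("a", "x"): 6, ("b", "y"): 3, ("c", "z"): 1},
    {"a": 10, "b": 10, "c": 10},
    {"x": 8, "y": 3, "z": 10},
    threshold=lambda c: 0.5,
    thresh1=lambda c: 0.8,
    thresh2=lambda c: 0.2,
)
```

Here the first entry passes the symmetric test, the second passes only the asymmetric test, and the third passes neither and is dropped.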

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Variable Thresholds}

As shown on the previous slide, the threshold values are a function
of the total co-occurrence count.  This permits adjustments for very
low-frequency and very high-frequency words:
@begin{itemize}
very low-frequency words may have a fairly high co-occurrence ratio
purely by chance (e.g., a word which appears only twice in the entire
corpus will have a co-occurrence ratio of at least 0.50 with @b{any}
word with which it co-occurs).

very high-frequency words may have a high co-occurrence ratio with
other very high-frequency words, whether or not they are translations
of each other.
@end{itemize}

Sample Threshold Function:
@tabclear{}
@begin{format,leftmargin +1.5in}
@tabset{1in}
C @\ Threshold
1 @\ 1.000
2 @\ 0.667
3 @\ 0.667
4 @\ 0.500
5 @\ 0.400
6 @\ 0.333
7 @\ 0.286
8 @\ 0.270
...
1000 @\ 0.300
...
2000 @\ 0.400
...
3000 @\ 0.500
@end{format}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{High-Frequency Words}

The highest-frequency words occur so often that they correlate with
each other sufficiently to pass the filtering tests.  Thus, a slightly
modified algorithm is needed.

To extract definitions for the highest-frequency words, a second pass
is used, with the following differences:
@begin{itemize}
All but the highest-frequency words are omitted from the co-occurrence
table.

All sentences in the corpus which contain more than three
high-frequency words are ignored.
@end{itemize}

When run for the 16 words occurring in 20% or more of all sentences,
this modified algorithm extracts definitions for seven of them with no
incorrect definitions.
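
The corpus preparation for the second pass might be sketched in Python as follows. The function name is hypothetical, and two details are assumptions the slide does not settle: the limit of three is applied to each side of a sentence pair separately, and repeated tokens are counted individually.

```python
def second_pass_pairs(sentence_pairs, high_freq_source, high_freq_target,
                      max_high_freq=3):
    """Prepare sentence pairs for the high-frequency-word pass."""
    kept = []
    for source_sentence, target_sentence in sentence_pairs:
        # Omit everything except the highest-frequency words.
        src = [w for w in source_sentence.split() if w in high_freq_source]
        tgt = [w for w in target_sentence.split() if w in high_freq_target]
        # Assumption: ignore the pair if either side exceeds the limit.
        if len(src) > max_high_freq or len(tgt) > max_high_freq:
            continue
        kept.append((" ".join(src), " ".join(tgt)))
    return kept


pairs = [
    ("the report on the", "el informe sobre el"),
    ("the the the the cat", "el gato"),
]
reduced = second_pass_pairs(pairs, {"the", "on"}, {"el"})
```

The reduced pairs can then go through the same count-and-filter steps as the first pass.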

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Induced Definitions}

@begin{format,size -5.75}
(ASTUDILLO (ASTUDILLO 1))
(COMPLICACI@uoac{}N (COMPLICATION 9))
(REFUTA (REFUTES 3))
(OCTUBRE (OCTOBER 4350)(NOVEMBER 419)(SEPTEMBER 314)(DATED 314))
(ACOPIAR (OLS 4))
(CENTRALIZANDO (CENTRALIZING 4))
(K1N (OTTAWA 6)(ONTARIO 4)(K1N 2))
(CART@uoac{}GRAFOS (CARTOGRAPHERS 2))
(CIEGOS (BLIND 18))
(CMINU (JUNIC 45))
(P@uuac{}AS (BARBED 8)(WIRE 8))
(DESMOTADO (COTTON 17)(GINNED 8))
(NUDO (IDICT 3)(NODE 3)(GORDIAN 3)(KNOT 3))
(TIEMPO (TIME 4723)(SAME 1357)(WHILE 1019))
(TRANSFORMAR@uaac{}N (APPEASEMENT 2))
(PRISMA (PRISM 2))
(RESURGIMIENTO (RESURGENCE 36)(FASCISM 6))
(COEXISTIR (COEXIST 3))
(MA@uiac{}Z (MAIZE 31)(RICE 15)(METRIC 14)(CORN 10)(SORGHUM 9)(WHEAT 8)
    (MILLET 6)(INSECTS 5))
(FECALES (FC 6)(COLIFORMS 3)(ML 3)(COLIFORM 3)(FECAL 3))
(ERR@uoac{}NEAMENTE (MISINTERPRETED 7)(WRONGLY 5))
(HORTALIZAS (VEGETABLES 31)(VEGETABLE 18)(FRUITS 15)(FRUIT 9))
(NOVIEMBRE (NOVEMBER 4810)(INTRODUCED 511)(OCTOBER 375))
(GANTCHEV (GANTCHEV 5))
(MASDIT (MASDIT 9))
(AGHRYMET (AGHRYMET 4))
(SABE (KNOWN 95)(KNEW 25))
(ABSOLUTAMENTE (ABSOLUTELY 65))
(O.O. (SP@uoac{}LKA 4)(AKCYJNA 2)(OGRANICZA 2)(ODPOWIEDZIALNOSCIA 2)
    (SP.Z 2)(O.O. 2))
(ITINERARIOS (ITINERARIES 5)(PREDETERMINED 5))
(LAMENTAMOS (REGRET 34))
(DESPLAZADOS (DISPLACED 143)(EXTERNALLY 13)(INTERNALLY 12)
    (AMERICANS 9))
(BARROCO (BAROQUE 3)(VIDEODISC 2))
(PRESUME (PRESUMED 7))
(DECLINAN (CAPTIVES 3))
(SHAHI (SHAHI 3)(M@uuum{}LLERSON 2))
(TIENDEN (TEND 96)(TENDED 24))
(SECTORIALISMO (SECTORALISM 2))
(MILOSEVIC (MILOSEVIC 11)(TUDJMAN 4)(KADIJEVIC 4))
(CINEMATECAS (PRINTS 2))
(DECIMALES (DECIMALS 2))
(|21B| (|21B| 2))
(BATOR (BATOR 7)(ULAN 4))
@end{format}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Size-Accuracy Tradeoffs}

Varying the thresholds used in filtering trades yield against
accuracy:
@begin{itemize}
Raising the thresholds reduces the number of incorrect/spurious
translations generated, but reduces the size of the dictionary.

Lowering the thresholds yields more definitions, but also increases the
error rate.
@end{itemize}

The size-accuracy tradeoff needs to be tuned for each application.
For EBMT subsentential alignments, a very low threshold (large
dictionary, but also high error rate) proved best; for CLIR, a higher
threshold (smaller dictionary, but considerably lower error rate)
proved best.

@heading{Advantages of a Corpus-Derived Dictionary}

@begin{itemize}
tuned to the way the corpus translates terms

provides frequency information which is not available in general-purpose
dictionaries

provides some query expansion because monolingually-collocated terms are
also included as translations

can be tuned for best tradeoff between accuracy and coverage
@end{itemize}


@newpage{}
@comment{----------------------------------------------------------------}

@heading{Evaluation Setup}

Corpus:
@begin{itemize}
UN documents referring repeatedly to UNICEF (at least 10 times)

``UNICEF'' documents were subdivided along internal section boundaries
to increase size of document collection

human relevance judgements collected for each query over the collection
@end{itemize}

Metrics:
@begin{itemize}
Standard 11-point average precision

Degradation from MLIR to CLIR
@end{itemize}

Baseline:
@begin{itemize}
SMART without relevance feedback (SMART.basic), optimized on
the UNICEF collection
@end{itemize}
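
For reference, the 11-point average precision named under Metrics is conventionally computed by averaging interpolated precision at recall levels 0.0, 0.1, ..., 1.0. The Python sketch below follows that standard definition; the function name and document identifiers are hypothetical, and it is not taken from the evaluation code actually used.

```python
def eleven_point_average_precision(ranked_ids, relevant_ids):
    """Average interpolated precision over recall 0.0, 0.1, ..., 1.0."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("query has no relevant documents")
    # Record (recall, precision) at each relevant document retrieved.
    points = []
    hits = 0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            points.append((hits / len(relevant), hits / rank))
    levels = [i / 10 for i in range(11)]
    interpolated = []
    for level in levels:
        # Interpolated precision: best precision at any recall >= level.
        candidates = [p for r, p in points if r >= level]
        interpolated.append(max(candidates, default=0.0))
    return sum(interpolated) / len(levels)


score = eleven_point_average_precision(
    ["d1", "d2", "d3", "d4"], {"d1", "d3"})
```

Per-query scores are then averaged over all queries to give a single figure per method.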

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Sample Query Translations}

Query:
@begin{format,size -4}
water purification sanitation water supply project clean water
personal hygiene health sanitation
@end{format}

Translated:
@begin{format,size -4}
abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento
abastecimiento agua agua agua agua agua agua agua agua agua agua agua agua
agua agua abastecimiento abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento agua agua agua agua agua agua agua agua agua
agua agua agua agua agua abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento abastecimiento saneamiento
saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento
saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento
purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n
purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n
purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n purificaci@oac{}n
purificaci@oac{}n purificaci@oac{}n abastecimiento abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento agua agua agua agua agua agua
agua agua agua agua agua agua agua agua abastecimiento abastecimiento
abastecimiento abastecimiento abastecimiento abastecimiento abastecimiento
saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento
saneamiento saneamiento saneamiento saneamiento saneamiento saneamiento
saneamiento salud salud salud salud salud salud salud salud salud salud salud
salud salud salud salud salud salud salud salud salud higiene higiene higiene
higiene higiene higiene higiene higiene higiene higiene higiene higiene
higiene higiene higiene higiene higiene higiene higiene higiene personales
personales personales personales personales personales personales personales
personales personales personales personales personales personales personales
personales personales personales personales personales
@end{format}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{Results}

@tabclear{}
@tabset{1.3in,2.7in,3.85in,5.0in}
@begin{format}
@bar{}
Method  @\ Weighting @\ MIR     @\ TIR          @\ TIR/MIR
@bar{}
DICT    @\ ntc.ntc @\ .4721     @\ .2898        @\ 61%
DICT    @\ ltc.ltc @\ .4306     @\ .2340        @\ 54%
DICT    @\ atc.atc @\ .3492     @\ .1749        @\ 50%
@bar{}
EB-Term @\ ntc.ntc @\ .4721     @\ .4318        @\ 91%
EB-Term @\ ltc.ltc @\ .4306     @\ .3723        @\ 86%
EB-Term @\ atc.atc @\ .3492     @\ .2803        @\ 80%
@bar{}
@end{format}

@blankspace(0.5in)
@heading{Comparison of EBT Dictionaries}
@tabclear{}
@begin{format,leftmargin 1.9in,rightmargin 1.9in}
@tabset{+1.5in}
Dictionary	@\ TIR avgp
@ux{Threshold	@\ (ntc.ntc)}
0.10		@\ 0.3886
0.15		@\ 0.3936
0.20		@\ 0.4097
0.22		@\ 0.4235
0.25		@\ 0.4111
0.27		@\ 0.4318
@ux{0.30	@\ 0.3988}
@end{format}
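
As a quick check of the arithmetic, the TIR/MIR column and the best-scoring dictionary threshold can be recomputed from the figures in the two tables above. The variable names in this Python sketch are hypothetical; the numbers are copied from the tables.

```python
# (method, weighting, MIR, TIR) rows from the Results table.
results = [
    ("DICT", "ntc.ntc", 0.4721, 0.2898),
    ("DICT", "ltc.ltc", 0.4306, 0.2340),
    ("DICT", "atc.atc", 0.3492, 0.1749),
    ("EB-Term", "ntc.ntc", 0.4721, 0.4318),
    ("EB-Term", "ltc.ltc", 0.4306, 0.3723),
    ("EB-Term", "atc.atc", 0.3492, 0.2803),
]
# Percentage of monolingual performance retained after translation.
percents = [round(100 * tir / mir) for _, _, mir, tir in results]

# Dictionary threshold -> TIR average precision (ntc.ntc).
threshold_scores = {
    0.10: 0.3886, 0.15: 0.3936, 0.20: 0.4097, 0.22: 0.4235,
    0.25: 0.4111, 0.27: 0.4318, 0.30: 0.3988,
}
best_threshold = max(threshold_scores, key=threshold_scores.get)
```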

@newpage{}
@comment{----------------------------------------------------------------}
@comment{EXTRA SLIDES -- FOR Q&A}

@heading{PanEBMT Corpus}

Spanish-to-English corpus:
@begin{itemize}
~685,000 sentence pairs -- ~265 megabytes

UN Multilingual Corpus supplies most of the pairs

10,250 sentence pairs from Pan American Health Organization

552 sentence pairs from prior ARPA evaluations (newswire text)
@end{itemize}

Example:
@begin{verbatim}
Las fuentes de esos comentarios y
  recomendaciones son las siguientes:
The sources of these comments and
  recommendations are:
@end{verbatim}

@newpage{}
@comment{----------------------------------------------------------------}

@heading{PanEBMT Architecture}
 
@blankspace(0.75in)
@drawing{file="ebmt-arch.pic",magnify 0.8,tag ebmt-arch}

@newpage{}
@comment{----------------------------------------------------------------}

@comment{ --- End of File --- }