\mysection{Spoken Language Translation Research} \label{translation} The core of the {\sc Nespole!} project will be the development of new state-of-the-art speech translation methods, which will be integrated into a complete STST system. While we plan to build upon the technology and scientific advances achieved by previous major STST efforts such as C-STAR \cite{} and Verbmobil \cite{}, our main research goal will be to significantly advance the technological capabilities of STST systems in terms of robustness, scalability and portability to new domains. The partners in this consortium have significant experience in building STST systems using a variety of rule-based, corpus-based, and statistical approaches to translation. For {\sc Nespole!}, our intention is to allow partners to investigate a variety of approaches to STST, some rule-based, some corpus-based, and some a combination of both. The IF will be the main point of interface between partners. Each partner will develop its own analysis module to map sentences in its language onto the IF, and its own synthesis module to map IFs onto its language. All-ways translation between all pairs of languages is achieved by combining any analysis module with any synthesis module, mediated by the IF. This mode of collaboration between partners has been effective in C-STAR. The work described here corresponds to Workpackage~5 in the European proposal. This workpackage will be led by UKA with help from all the other research partners (CMU, IRST and CLIPS). We focus here primarily on the planned research on the development of new spoken language translation components at CMU. The research work at CMU will focus on the development of a number of different translation approaches that will be combined into a multi-engine system, in order to exploit the strengths, and minimize the weaknesses, of each individual approach. Within this multi-engine context, we will experiment with approaches that use the IF as well as approaches that do not.
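The multi-engine idea just described can be sketched very simply: each engine returns a translation hypothesis with a confidence score, and the combiner selects the best-scoring hypothesis. The engine names, return values, and scoring below are illustrative assumptions, not the actual system design.

```python
# Hypothetical sketch of multi-engine combination. Each engine maps an input
# utterance to a (hypothesis, confidence) pair; the combiner keeps the best.
# Engines and scores here are invented placeholders for illustration only.

def grammar_engine(utterance):
    # stand-in for an IF-based semantic-grammar analysis/generation chain
    return ("if-based hypothesis", 0.7)

def statistical_engine(utterance):
    # stand-in for a direct (non-IF) statistical translation engine
    return ("direct hypothesis", 0.5)

def combine(utterance, engines):
    """Run every engine on the utterance and return the highest-confidence pair."""
    hypotheses = [engine(utterance) for engine in engines]
    return max(hypotheses, key=lambda pair: pair[1])

best, score = combine("I would like to book a flight",
                      [grammar_engine, statistical_engine])
```

In a real system the confidence scores would of course come from engine-internal quality estimates rather than constants; the point of the sketch is only the selection interface.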
The approaches that do not use the IF will be developed for English and German only, since they must be specifically trained and developed for each pair of languages separately. Our goal in trying different approaches is to provide comparative evidence concerning the respective merits and weaknesses of IF-based versus non-IF-based approaches with respect to robustness. The individual approaches and our proposed research into multi-engine combination are described in greater detail in the following subsections. \mysubsection{IF-based Analysis Engines} We intend to experiment with the following IF-based approaches, comparing and assessing their respective merits and weaknesses. \begin{enumerate} \item {\bf Semantic Grammar-based approach:} manually developed semantic grammars for the various sub-domains are combined together and parsed with a robust parser, producing IF representations. \item {\bf Concept Classification approach:} this approach uses an underlying grammar for identifying/parsing basic argument-level phrases (e.g., time expressions, locations) and a statistical classifier to map the sequence of identified arguments into complete domain actions. The final outputs are IF representations. \item {\bf Shallow syntax-based analysis:} identification of predicate-argument structures using simple and shallow methods, with the goal of then mapping these into predicate-argument IF representations.
\end{enumerate} \subsubsection{Semantic Grammar-based Approach} \begin{figure}[t] %\centerline{\psfig{file=grammars.eps,width=10cm}} \epsfxsize=5.0in \centerline{\epsffile{grammars.eps}} \caption{Combining multiple sub-domain grammars with shared and cross-domain grammars.} \label{grammars} \end{figure} The semantic grammar-based approach has been the focus of our previous STST research within the context of the C-STAR project, and has proven effective for large yet limited domains such as travel planning, which can be broken down into several natural sub-domains. We have been using semantic grammars for both analysis and generation. Rather than focusing on the syntactic structure of the input, semantic grammars directly describe how surface expressions reflect the underlying semantic concepts being conveyed by the speaker. For example, the rules indicate that the concept of something being available can be expressed with the phrases {\it we have \ldots\/} or {\it there are \ldots\/}. Because they focus on identifying a set of predefined semantic concepts, semantic grammars are relatively well suited to handling the types of meaningful but ungrammatical disfluencies that are typical of spoken language, and are also less sensitive to speech recognition errors. Semantic grammars are also relatively fast to develop for limited domains, where the set of concepts being described is relatively small. However, they are usually hard to expand to cover new domains: new rules are required for each new semantic concept, since syntactic generalities cannot usually be fully exploited. In the current version of the {\sc Janus} system we have developed a way to combine modular grammars in order to overcome the problems associated with expanding semantic grammars to new domains. The parser \cite{gavalda} applies multiple sub-grammars in parallel and stores the outputs in a parse tree lattice.
(In the process of building the lattice, the parser also segments long utterances into sentences, so that this does not need to be done as a separate process outside of the parser.) A number of heuristics are used to rank the paths through the lattice, including the likelihood of a string of words belonging to a particular sub-domain module \cite{amta-98}. Figure~\ref{grammars} illustrates this modular approach. For the {\sc Nespole!} project, we will incorporate our grammar-based approach as one of the multiple analysis engines. We expect to leverage coverage from the domain-independent portions of our current grammars (labelled as ``cross-domain'' and ``shared'' in Figure~\ref{grammars}). We will also limit the amount of grammar writing that needs to be done by using our new Concept Classification approach to parsing (described below). With a Concept Classification parser we only need to write mini-grammars for IF arguments; the amount of grammar writing for speech acts and concepts in the IF is significantly reduced.
%% We do not plan to construct full
%% grammars to cover the new and expanded domains proposed for NESPOLE!
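The core intuition behind a semantic grammar rule, namely that surface phrases map directly to semantic concepts rather than to syntactic categories, can be illustrated with a toy sketch. The rule table, concept names, and matching strategy below are invented for illustration and are far simpler than the actual grammar formalism.

```python
# Toy illustration of semantic-grammar rules: trigger phrases are associated
# directly with semantic concepts (cf. "we have ..." / "there are ..." for
# availability). Concept names and phrases here are invented examples.

RULES = {
    "give-information+availability": ["we have", "there are"],
    "request-information+price":     ["how much is", "what is the price of"],
}

def match_concept(utterance):
    """Return the first concept whose trigger phrase occurs in the utterance,
    or None if no rule fires."""
    text = utterance.lower()
    for concept, phrases in RULES.items():
        if any(phrase in text for phrase in phrases):
            return concept
    return None

match_concept("We have three double rooms available")
```

A real semantic grammar would of course use structured rules with nonterminals and robust parsing rather than substring matching; the sketch only shows why such grammars tolerate disfluent input, since any utterance containing a trigger phrase is mapped to its concept regardless of surrounding material.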
\subsubsection{The Concept Classification Approach} \begin{figure}[t] \centering \fbox{ \centering \parbox[c][1.6\totalheight][c]{130mm}{ \centering \begin{math} \underbrace{ \underbrace{\rm{Hello.}}_{a_1}}_{\tt{\scriptsize greeting}} \underbrace{ \underbrace{\rm{This\ is} }_{a_2} \underbrace{\rm{Bob.}}_{\tt{\scriptsize \parbox{11mm}{\centering person\\-name}}}}_{\tt{\scriptsize introduce-self}} \underbrace{ \underbrace{\rm{I}}_{\tt{\scriptsize super\_who}} \underbrace{\rm{would\ like\ to\ book}}_{a_5} \underbrace{\rm{a\ flight}}_{\tt{\scriptsize \parbox{18mm}{\centering super\_flight\\-type}}} \underbrace{\rm{to\ Frankfurt.}}_{\tt{\scriptsize super\_destination}}}_{\tt{\scriptsize request-action+reservation+features+flight}} \end{math}}} \caption{Example: Multi-level Analysis for an Input Utterance} \label{cc-example} \end{figure} We propose to construct a new Concept Classification parser for analyzing task-oriented speech utterances. The goal of the parser is to analyze utterances directly into our dialogue-act-based IF representation (see section~\ref{irf}). Complete dialogue-act (DA) representations consist of a speech-act, a domain concept and a list of analyzed arguments. The new Concept Classification parser that we propose will operate in two stages. In the first stage, the parser uses an underlying phrase-level semantic grammar to analyze the input into a sequence of arguments. In the second stage, the parser identifies the speech-act and domain concept based on the sequence of detected arguments and the words in the utterance. An example utterance and its levels of analysis are shown in Figure~\ref{cc-example}. The utterance {\tt I would like to book a flight to Frankfurt} is analyzed first as a sequence of phrase arguments (such as {\tt to Frankfurt}, analyzed as a destination). The sequence of arguments is then mapped to a speech-act, in this case {\tt request-action}, and a domain concept, {\tt reservation+features+flight}.
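The complete dialogue act for the utterance in Figure~\ref{cc-example} can be rendered in memory roughly as follows. This is a hedged sketch: the field names and the linearized string syntax are our assumptions for illustration, not the project's actual IF specification.

```python
# Illustrative in-memory rendering of the dialogue act from the figure:
# a speech-act, a domain concept, and the analyzed arguments.
# Field names and the linearization format are assumptions, not the real IF.
dialogue_act = {
    "speech_act": "request-action",
    "concept":    "reservation+features+flight",
    "arguments":  {
        "super_who":         "i",
        "super_flight-type": "a flight",
        "super_destination": "frankfurt",
    },
}

def to_if_string(da):
    """Linearize the dialogue act into an IF-like string (illustrative only)."""
    args = ", ".join(f"{k}={v}" for k, v in da["arguments"].items())
    return f"{da['speech_act']}+{da['concept']} ({args})"
```

The linearization simply concatenates speech-act and domain concept with {\tt +} and appends the argument list, mirroring the multi-level structure shown in the figure.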
The parsing of argument-level phrases will be done using the robust SOUP parser \cite{}. The necessary phrase-level semantic grammars will be developed manually, but we expect to leverage coverage from portions of our current semantic grammars. We expect the phrase-level semantic grammars to be far less domain dependent (and thus far more portable) than complete semantic grammars. The second-stage speech-act and concept classification will be based on data-trainable classification technology. We will experiment with hidden Markov models (HMMs), neural nets, and decision trees. These techniques are by design more robust and portable than complete semantic grammars, but their accuracy depends on the availability of adequate, accurate training data. We have already conducted a pilot study of the above approach. The pilot study was performed on our current C-STAR travel planning domain, with the goal of demonstrating the feasibility of the general approach. In the pilot study, we used a multi-level HMM to model argument-level concepts and speech-acts. The models were trained on a labelled corpus of utterances parsed by our full semantic grammar system. The trained HMM was then used to determine the segmentation and classification of utterances into speech-acts and argument-level segments. The argument-level segments were then parsed (when possible) using the portion of the semantic grammar that corresponds to each of the argument labels. A single neural net was trained to map the sequence of argument labels to a domain concept. In the above example, the sequence of arguments {\tt [super\_who a5 super\_flight-type super\_destination]} should be mapped to the domain concept {\tt reservation+features+flight}. The speech-act, domain concept and arguments were then combined together to form a complete IF dialogue act. A preliminary evaluation of this pilot system showed promising results.
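The interface of the second-stage classifier can be illustrated with a minimal stand-in: a lookup table trained on (argument-sequence, domain-concept) pairs that returns the most frequent concept seen for a sequence. A real implementation would use an HMM, neural net, or decision tree as described above; the training data below are invented.

```python
# Minimal stand-in for the second-stage concept classifier: map a sequence of
# argument labels to a domain concept. A real system would generalize with an
# HMM, neural net, or decision tree; this toy table only shows the interface.
# The example pairs are invented.

from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (argument_label_sequence, domain_concept) pairs."""
    table = defaultdict(Counter)
    for args, concept in examples:
        table[tuple(args)][concept] += 1
    return table

def classify(table, args):
    """Return the most frequent concept for this argument sequence, else None."""
    counts = table.get(tuple(args))
    return counts.most_common(1)[0][0] if counts else None

examples = [
    (["super_who", "a5", "super_flight-type", "super_destination"],
     "reservation+features+flight"),
    (["super_who", "a5", "super_room-type"],
     "reservation+features+room"),
]
table = train(examples)
```

Unlike this exact-match table, a trained statistical classifier would also assign concepts to argument sequences never seen in training, which is precisely the robustness property motivating the approach.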
We compared the performance of the pilot system with that of our full semantic grammar analysis system, via an English translation ``paraphrase'' test. Using a test set of 200 English utterances, we analyzed the input into IF representations using both analysis systems. The resulting IFs were then generated back into English (using our grammar-based generation component). The paraphrases were then graded manually by two human graders, and categorized as ``perfect'' (if the paraphrase fluently conveyed the complete content of the input), ``OK'' (if the content was conveyed, but minor details were missing, or with some disfluency), or ``bad'' (if the paraphrase was considered unacceptable). ``Perfect'' and ``OK'' scores are also summed together as ``acceptable''. Whereas the system that used the semantic grammar analysis component achieved a score of 73.7\% acceptable paraphrases, the system that used our pilot concept classification analyzer achieved 57.3\% acceptable paraphrases. All aspects of the pilot concept classification analyzer require significant further research and development. The underlying argument grammars will need to be further developed, to support better detection and disambiguation of correct argument-level segments. The second-stage detection and classification of speech-acts and domain concepts must also be significantly improved. We will investigate a variety of classification approaches beyond those already tested in the pilot study. \subsubsection{Shallow Syntax-based Analysis} We also plan to develop a shallow syntax-based analysis component for the analysis of descriptive sentences. Descriptive sentences will require an IF representation that is based on predicate-argument structure rather than on domain actions (see section~\ref{irf}).
To analyze descriptive sentences into their IF representations we plan to use a shallow, lexically driven grammar for identifying syntactic chunks in the input, and then to attach these chunks to the main predicate using information encoded in the lexicon. Parsing will be performed using the {\bf LCFlex} parser \cite{}, a robust left-corner chart parser for unification-augmented context-free grammars that supports a variety of flexible parsing behaviors (such as skipping portions of the input, and limited relaxation of unification constraints). \mysubsection{Generation} In the {\sc Nespole!} project, the task of generation from the IF will be more complex than in previous STST systems. Because our planned system will explicitly be designed to handle multimodal information, our generation components must be able to produce both sensible output utterances and synchronisation markers linking pointing gestures to referring expressions. To this end, the output of the generation engines will be a representation in some XML-like language for synchronised multimedia - e.g., SMIL. Such a language must be capable of encoding the language part (e.g., a string), pointer descriptors, and temporal links between the referring expressions of the output sentence and the pointers. Concerning the linguistic part, we shall build upon the experience of such STST projects as C-STAR and VERBMOBIL, but shall also consider relevant results of projects targeting text generation: RAGS (general architectures for generation systems) \cite{}, SAX (hybrid generation systems) \cite{}, ILEX \cite{} and GIST (multilingual generation) \cite{}. The purpose is to improve the quality of translation by addressing such questions as discourse-level phenomena - e.g., the cognitive state of referents - and information packaging (topic/focus).
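The kind of synchronised multimodal output described above can be sketched as follows: the generated sentence and a pointer descriptor are emitted as siblings in an XML fragment, with a shared identifier providing the temporal link between the referring expression and the pointing gesture. The element and attribute names below are illustrative assumptions, not the actual SMIL schema the project would adopt.

```python
# Sketch of synchronised multimodal generation output: a sentence plus a
# pointer descriptor linked to it by a shared id, in an SMIL-like XML
# fragment. Element/attribute names are invented for illustration.

import xml.etree.ElementTree as ET

def build_output(sentence, ref_expr, pointer_target):
    root = ET.Element("par")                 # children are played in parallel
    text = ET.SubElement(root, "text", id="utt1")
    text.text = sentence
    ET.SubElement(root, "pointer", {
        "target": pointer_target,            # the object being pointed at
        "sync": "utt1",                      # temporal link to the utterance
        "refexpr": ref_expr,                 # the referring expression
    })
    return ET.tostring(root, encoding="unicode")

build_output("The hotel is here.", "here", "map-region-17")
```

A display client consuming this fragment would render the sentence (or synthesize it) while highlighting the pointed-at object, using the shared {\tt sync} identifier to align the two channels in time.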
%% MEMT and Direct Approaches - from Bob Frederking
\input{ref-translation} These direct approaches depend on the availability of bilingual data, and we propose to initially limit experimentation with them to translation between English and German. The direct translation approaches will most likely not be fully integrated into the showcase systems, given that the absence of the IF would make it more difficult to account for multimodality in a principled way. We believe, nevertheless, that extending the multi-engine approach to such techniques will provide information concerning robustness, scalability and domain portability which will prove valuable for future projects. Our intention is to investigate the relative contribution of the direct translation approaches to each of the above questions, as compared with the IF-based approaches. \mysubsection{Information Extraction} We plan to exploit information extraction (IE) techniques to enhance scalability and cross-domain portability, taking advantage of the considerable amount of textual data (describing places, cultural and sporting events, parts of equipment, and so on) that will be made available by our users. {\it Named Entity Recognition\/} techniques will provide STST systems with the relevant information about proper names, city and place names, company names, etc., which is needed for both the analysis and synthesis chain modules to work properly. {\it Template Filling\/} techniques will provide structured descriptions of relevant events (including, e.g., the agent, objects, date, and location). \mysubsection{Corpora Acquisition and Annotation} Each {\sc Nespole!} partner will collect, transcribe, and annotate (with IF representations) data for its home language. This data will be used to develop and test the language knowledge sources of the translation components (grammars, lexica, IF, etc.), to conduct evaluations, and to train the statistical and corpus-based components for speech recognition and translation.
We have many years of experience in data collection, transcription, and annotation for C-STAR and Verbmobil. Data collection scenarios usually involve face-to-face role-playing --- e.g., one person pretending to be a traveller and one pretending to be a travel agent. In the later stages of C-STAR we began to collect data through user studies, in which the traveller and agent are not face-to-face, but communicate only through our MT system and interface. {\sc Nespole!} will begin development with this user study data. However, user studies can only be carried out once the system has reasonable coverage of the desired domain. Therefore, each time we expand or change domains we will have to bootstrap with a small amount of face-to-face data until user studies become feasible.