\mysection{Spoken Language Translation Research} \label{translation} The core of the {\sc Nespole!} project will be the development of new state-of-the-art speech translation methods, which will be integrated into a complete STST system. While we plan to build upon the technology and scientific advances achieved by previous major STST efforts such as C-STAR \cite{} and Verbmobil \cite{}, our main research goal will be to significantly advance the technological capabilities of STST systems in terms of robustness, scalability and portability to new domains. The partners in this consortium have significant experience in building STST systems using a variety of rule-based, corpus-based, and statistical approaches to translation. For {\sc Nespole!}, our intention is to allow partners to investigate a variety of approaches to STST, some rule-based, some corpus-based, and some a combination of both. The IF will be the main point of interface between partners. Each partner will develop its own analysis module to map sentences in its language onto the IF, and its own synthesis module to map IFs onto its language. All-ways translation between all pairs of languages is achieved by combining any analysis module with any synthesis module, mediated by the IF. This mode of collaboration between partners has been effective in C-STAR. The work described here corresponds to Workpackage~5 in the European proposal. This workpackage will be led by UKA with help from all the other research partners (CMU, IRST and CLIPS). We focus here primarily on the planned research on the development of new spoken language translation components at CMU. The research work at CMU will focus on the development of a number of different translation approaches that will be combined into a multi-engine system, in order to exploit the strengths, and minimize the weaknesses, of each individual approach. Within this multi-engine context, we will experiment with approaches that use the IF as well as approaches that do not.
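The multi-engine idea just described can be sketched very simply: each engine returns a translation hypothesis with a confidence score, and the combiner selects the best-scoring hypothesis. The engine names, return values, and scoring below are illustrative assumptions, not the actual system design.

```python
# Hypothetical sketch of multi-engine combination. Each engine maps an input
# utterance to a (hypothesis, confidence) pair; the combiner keeps the best.
# Engines and scores here are invented placeholders for illustration only.

def grammar_engine(utterance):
    # stand-in for an IF-based semantic-grammar analysis/generation chain
    return ("if-based hypothesis", 0.7)

def statistical_engine(utterance):
    # stand-in for a direct (non-IF) statistical translation engine
    return ("direct hypothesis", 0.5)

def combine(utterance, engines):
    """Run every engine on the utterance and return the highest-confidence pair."""
    hypotheses = [engine(utterance) for engine in engines]
    return max(hypotheses, key=lambda pair: pair[1])

best, score = combine("I would like to book a flight",
                      [grammar_engine, statistical_engine])
```

In a real system the confidence scores would of course come from engine-internal quality estimates rather than constants; the point of the sketch is only the selection interface.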
The approaches that do not use the IF will be developed for English and German only, since they must be specifically trained and developed for each pair of languages separately. Our goal in trying different approaches is to provide comparative evidence concerning the respective merits and weaknesses of IF-based versus non-IF-based approaches with respect to robustness. The individual approaches and our proposed research into multi-engine combination are described in greater detail in the following subsections. \mysubsection{IF-based Analysis Engines} We intend to experiment with the following IF-based approaches, comparing and assessing their respective merits and weaknesses. \begin{enumerate} \item {\bf Semantic Grammar-based approach:} manually developed semantic grammars for the various sub-domains are combined together and parsed with a robust parser, producing IF representations. \item {\bf Concept Classification approach:} this approach uses an underlying grammar for identifying/parsing basic argument-level phrases (e.g., time expressions, locations) and a statistical classifier to map the sequence of identified arguments into complete domain actions. The final outputs are IF representations. \item {\bf Shallow syntax-based analysis:} identification of predicate-argument structures using simple and shallow methods, with the goal of then mapping these into predicate-argument IF representations.
\end{enumerate} \subsubsection{Semantic Grammar-based Approach} \begin{figure}[t] %\centerline{\psfig{file=grammars.eps,width=10cm}} \epsfxsize=5.0in \centerline{\epsffile{grammars.eps}} \caption{Combining multiple sub-domain grammars with shared and cross-domain grammars.} \label{grammars} \end{figure} The semantic grammar-based approach has been the focus of our previous STST research within the context of the C-STAR project, and has proven effective for large yet limited domains such as travel planning, which can be broken down into several natural sub-domains. We have been using semantic grammars for both analysis and generation. Rather than focusing on the syntactic structure of the input, semantic grammars directly describe how surface expressions reflect the underlying semantic concepts being conveyed by the speaker. For example, the rules indicate that the concept of something being available can be expressed with the phrases {\it we have \ldots\/} or {\it there are \ldots\/}. Because they focus on identifying a set of predefined semantic concepts, semantic grammars are relatively well suited to handling the types of meaningful but ungrammatical disfluencies that are typical of spoken language, and are also less sensitive to speech recognition errors. Semantic grammars are also relatively fast to develop for limited domains, where the set of concepts being described is relatively small. However, they are usually hard to expand to cover new domains: new rules are required for each new semantic concept, since syntactic generalities cannot usually be fully exploited. In the current version of the {\sc Janus} system we have developed a way to combine modular grammars in order to overcome the problems associated with expanding semantic grammars to new domains. The parser \cite{gavalda} applies multiple sub-grammars in parallel and stores the outputs in a parse tree lattice.
(In the process of building the lattice, the parser also segments long utterances into sentences, so that this does not need to be done as a separate process outside of the parser.) A number of heuristics are used to rank the paths through the lattice, including the likelihood of a string of words belonging to a particular sub-domain module \cite{amta-98}. Figure~\ref{grammars} illustrates this modular approach. For the {\sc Nespole!} project, we will incorporate our grammar-based approach as one of the multiple analysis engines. We expect to leverage coverage from the domain-independent portions of our current grammars (labelled as ``cross-domain'' and ``shared'' in Figure~\ref{grammars}). We will also limit the amount of grammar writing that needs to be done by using our new Concept Classification approach to parsing (described below). With a Concept Classification parser we only need to write mini-grammars for IF arguments; the amount of grammar writing for speech acts and concepts in the IF is significantly reduced.
%% We do not plan to construct full
%% grammars to cover the new and expanded domains proposed for NESPOLE!
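The core intuition behind a semantic grammar rule, namely that surface phrases map directly to semantic concepts rather than to syntactic categories, can be illustrated with a toy sketch. The rule table, concept names, and matching strategy below are invented for illustration and are far simpler than the actual grammar formalism.

```python
# Toy illustration of semantic-grammar rules: trigger phrases are associated
# directly with semantic concepts (cf. "we have ..." / "there are ..." for
# availability). Concept names and phrases here are invented examples.

RULES = {
    "give-information+availability": ["we have", "there are"],
    "request-information+price":     ["how much is", "what is the price of"],
}

def match_concept(utterance):
    """Return the first concept whose trigger phrase occurs in the utterance,
    or None if no rule fires."""
    text = utterance.lower()
    for concept, phrases in RULES.items():
        if any(phrase in text for phrase in phrases):
            return concept
    return None

match_concept("We have three double rooms available")
```

A real semantic grammar would of course use structured rules with nonterminals and robust parsing rather than substring matching; the sketch only shows why such grammars tolerate disfluent input, since any utterance containing a trigger phrase is mapped to its concept regardless of surrounding material.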
\subsubsection{The Concept Classification Approach} \begin{figure}[t] \centering \fbox{ \centering \parbox[c][1.6\totalheight][c]{130mm}{ \centering \begin{math} \underbrace{ \underbrace{\rm{Hello.}}_{a_1}}_{\tt{\scriptsize greeting}} \underbrace{ \underbrace{\rm{This\ is} }_{a_2} \underbrace{\rm{Bob.}}_{\tt{\scriptsize \parbox{11mm}{\centering person\\-name}}}}_{\tt{\scriptsize introduce-self}} \underbrace{ \underbrace{\rm{I}}_{\tt{\scriptsize super\_who}} \underbrace{\rm{would\ like\ to\ book}}_{a_5} \underbrace{\rm{a\ flight}}_{\tt{\scriptsize \parbox{18mm}{\centering super\_flight\\-type}}} \underbrace{\rm{to\ Frankfurt.}}_{\tt{\scriptsize super\_destination}}}_{\tt{\scriptsize request-action+reservation+features+flight}} \end{math}}} \caption{Example: Multi-level Analysis for an Input Utterance} \label{cc-example} \end{figure} We propose to construct a new Concept Classification parser for analyzing task-oriented speech utterances. The goal of the parser is to analyze utterances directly into our dialogue-act-based IF representation (see section~\ref{irf}). Complete dialogue-act (DA) representations consist of a speech-act, a domain concept and a list of analyzed arguments. The new Concept Classification parser that we propose will operate in two stages. In the first stage, the parser uses an underlying phrase-level semantic grammar to analyze the input into a sequence of arguments. In the second stage, the parser identifies the speech-act and domain concept based on the sequence of detected arguments and the words in the utterance. An example utterance and its levels of analysis are shown in Figure~\ref{cc-example}. The utterance {\tt I would like to book a flight to Frankfurt} is analyzed first as a sequence of phrase arguments (such as {\tt to Frankfurt}, analyzed as a destination). The sequence of arguments is then mapped to a speech-act, in this case {\tt request-action}, and a domain concept, {\tt reservation+features+flight}.
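The complete dialogue act for the utterance in Figure~\ref{cc-example} can be rendered in memory roughly as follows. This is a hedged sketch: the field names and the linearized string syntax are our assumptions for illustration, not the project's actual IF specification.

```python
# Illustrative in-memory rendering of the dialogue act from the figure:
# a speech-act, a domain concept, and the analyzed arguments.
# Field names and the linearization format are assumptions, not the real IF.
dialogue_act = {
    "speech_act": "request-action",
    "concept":    "reservation+features+flight",
    "arguments":  {
        "super_who":         "i",
        "super_flight-type": "a flight",
        "super_destination": "frankfurt",
    },
}

def to_if_string(da):
    """Linearize the dialogue act into an IF-like string (illustrative only)."""
    args = ", ".join(f"{k}={v}" for k, v in da["arguments"].items())
    return f"{da['speech_act']}+{da['concept']} ({args})"
```

The linearization simply concatenates speech-act and domain concept with {\tt +} and appends the argument list, mirroring the multi-level structure shown in the figure.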
The parsing of argument-level phrases will be done using the robust SOUP parser \cite{}. The necessary phrase-level semantic grammars will be developed manually, but we expect to leverage coverage from portions of our current semantic grammars. We expect the phrase-level semantic grammars to be far less domain dependent (and thus far more portable) than complete semantic grammars. The second-stage speech-act and concept classification will be based on data-trainable classification technology. We will experiment with hidden Markov models (HMMs), neural nets, and decision trees. These techniques are by design more robust and portable than complete semantic grammars, but their accuracy depends on the availability of adequate, accurate training data. We have already conducted a pilot study of the above approach. The pilot study was performed on our current C-STAR travel planning domain, with the goal of demonstrating the feasibility of the general approach. In the pilot study, we used a multi-level HMM to model argument-level concepts and speech-acts. The models were trained on a labelled corpus of utterances parsed by our full semantic grammar system. The trained HMM was then used to determine the segmentation and classification of utterances into speech-acts and argument-level segments. The argument-level segments were then parsed (when possible) using the portion of the semantic grammar that corresponds to each of the argument labels. A single neural net was trained to map the sequence of argument labels to a domain concept. In the above example, the sequence of arguments {\tt [super\_who a5 super\_flight-type super\_destination]} should be mapped to the domain concept {\tt reservation+features+flight}. The speech-act, domain concept and arguments were then combined together to form a complete IF dialogue act. A preliminary evaluation of this pilot system showed promising results.
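The interface of the second-stage classifier can be illustrated with a minimal stand-in: a lookup table trained on (argument-sequence, domain-concept) pairs that returns the most frequent concept seen for a sequence. A real implementation would use an HMM, neural net, or decision tree as described above; the training data below are invented.

```python
# Minimal stand-in for the second-stage concept classifier: map a sequence of
# argument labels to a domain concept. A real system would generalize with an
# HMM, neural net, or decision tree; this toy table only shows the interface.
# The example pairs are invented.

from collections import Counter, defaultdict

def train(examples):
    """examples: iterable of (argument_label_sequence, domain_concept) pairs."""
    table = defaultdict(Counter)
    for args, concept in examples:
        table[tuple(args)][concept] += 1
    return table

def classify(table, args):
    """Return the most frequent concept for this argument sequence, else None."""
    counts = table.get(tuple(args))
    return counts.most_common(1)[0][0] if counts else None

examples = [
    (["super_who", "a5", "super_flight-type", "super_destination"],
     "reservation+features+flight"),
    (["super_who", "a5", "super_room-type"],
     "reservation+features+room"),
]
table = train(examples)
```

Unlike this exact-match table, a trained statistical classifier would also assign concepts to argument sequences never seen in training, which is precisely the robustness property motivating the approach.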
We compared the performance of the pilot system with that of our full semantic grammar analysis system, via an English translation ``paraphrase'' test. Using a test set of 200 English utterances, we analyzed the input into IF representations using both analysis systems. The resulting IFs were then generated back into English (using our grammar-based generation component). The paraphrases were then graded manually by two human graders, and categorized as ``perfect'' (if the paraphrase fluently conveyed the complete content of the input), ``OK'' (if the content was conveyed, but minor details were missing, or with some disfluency), or ``bad'' (if the paraphrase was considered unacceptable). ``Perfect'' and ``OK'' scores are also summed together as ``acceptable''. Whereas the system that used the semantic grammar analysis component achieved a score of 73.7\% acceptable paraphrases, the system that used our pilot concept classification analyzer achieved 57.3\% acceptable paraphrases. All aspects of the pilot concept classification analyzer require significant further research and development. The underlying argument grammars will need to be further developed, to support better detection and disambiguation of correct argument-level segments. The second-stage detection and classification of speech-acts and domain concepts must also be significantly improved. We will investigate a variety of classification approaches beyond those already tested in the pilot study. \subsubsection{Shallow Syntax-based Analysis} We also plan to develop a shallow syntax-based analysis component for the analysis of descriptive sentences. Descriptive sentences will require an IF representation that is based on predicate-argument structure rather than on domain actions (see section~\ref{irf}).
To analyze descriptive sentences into their IF representations we plan to use a shallow, lexically driven grammar for identifying syntactic chunks in the input, and then to attach these chunks to the main predicate using information encoded in the lexicon. Parsing will be performed using the {\bf LCFlex} parser \cite{}, a robust left-corner chart parser for unification-augmented context-free grammars that supports a variety of flexible parsing behaviors (such as skipping portions of the input, and limited relaxation of unification constraints). \mysubsection{Generation} In the {\sc Nespole!} project, the task of generation from the IF will be more complex than in previous STST systems. Because our planned system will explicitly be designed to handle multimodal information, our generation components must be able to produce both sensible output utterances and synchronisation markers linking pointing gestures to referring expressions. To this end, the output of the generation engines will be a representation in some XML-like language for synchronised multimedia - e.g., SMIL. Such a language must be capable of encoding the language part (e.g., a string), pointer descriptors, and temporal links between the referring expressions of the output sentence and the pointers. Concerning the linguistic part, we shall build upon the experience of such STST projects as C-STAR and VERBMOBIL, but shall also consider relevant results of projects targeting text generation: RAGS (general architectures for generation systems) \cite{}, SAX (hybrid generation systems) \cite{}, ILEX \cite{} and GIST (multilingual generation) \cite{}. The purpose is to improve the quality of translation by addressing such questions as discourse-level phenomena - e.g., the cognitive state of referents - and information packaging (topic/focus).
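The kind of synchronised multimodal output described above can be sketched as follows: the generated sentence and a pointer descriptor are emitted as siblings in an XML fragment, with a shared identifier providing the temporal link between the referring expression and the pointing gesture. The element and attribute names below are illustrative assumptions, not the actual SMIL schema the project would adopt.

```python
# Sketch of synchronised multimodal generation output: a sentence plus a
# pointer descriptor linked to it by a shared id, in an SMIL-like XML
# fragment. Element/attribute names are invented for illustration.

import xml.etree.ElementTree as ET

def build_output(sentence, ref_expr, pointer_target):
    root = ET.Element("par")                 # children are played in parallel
    text = ET.SubElement(root, "text", id="utt1")
    text.text = sentence
    ET.SubElement(root, "pointer", {
        "target": pointer_target,            # the object being pointed at
        "sync": "utt1",                      # temporal link to the utterance
        "refexpr": ref_expr,                 # the referring expression
    })
    return ET.tostring(root, encoding="unicode")

build_output("The hotel is here.", "here", "map-region-17")
```

A display client consuming this fragment would render the sentence (or synthesize it) while highlighting the pointed-at object, using the shared {\tt sync} identifier to align the two channels in time.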
%% MEMT and Direct Approaches - from Bob Frederking
\input{ref-translation} These direct approaches depend on the availability of bilingual data, and we propose to initially limit experimentation with them to translation between English and German. The direct translation approaches will most likely not be fully integrated into the showcase systems, given that the absence of the IF would make it more difficult to account for multimodality in a principled way. We believe, nevertheless, that extending the multi-engine approach to such techniques will provide information concerning robustness, scalability and domain portability which will prove valuable for future projects. Our intention is to investigate the relative contribution of the direct translation approaches to each of the above questions, as compared with the IF-based approaches. \mysubsection{Information Extraction} We plan to exploit information extraction (IE) techniques to enhance scalability and cross-domain portability, taking advantage of the considerable amount of textual data (describing places, cultural and sporting events, parts of equipment, and so on) that will be made available by our users. {\it Named Entity Recognition\/} techniques will provide STST systems with the relevant information about proper names, city and place names, company names, etc., which is needed for both the analysis and synthesis chain modules to work properly. {\it Template Filling\/} techniques will provide structured descriptions of relevant events (including, e.g., the agent, objects, date, and location). \mysubsection{Corpora Acquisition and Annotation} Each {\sc Nespole!} partner will collect, transcribe, and annotate (with IF representations) data for its home language. This data will be used to develop and test the language knowledge sources of the translation components (grammars, lexica, IF, etc.), to conduct evaluations, and to train the statistical and corpus-based components for speech recognition and translation.
We have many years of experience in data collection, transcription, and annotation for C-STAR and Verbmobil. Data collection scenarios usually involve face-to-face role-playing --- e.g., one person pretending to be a traveller and one pretending to be a travel agent. In the later stages of C-STAR we began to collect data through user studies, in which the traveller and agent are not face-to-face, but communicate only through our MT system and interface. {\sc Nespole!} will begin development with this user study data. However, user studies can only be carried out once the system has reasonable coverage of the desired domain. Therefore, each time we expand or change domains we will have to bootstrap with a small amount of face-to-face data until user studies become feasible.