%% Section: INTRO: Research Goals and Objectives With the internet clearly established as a modern tool for business and communication in the developed nations of this world, much attention has recently been given to the gap between developed connected nations and those that are disenfranchised by the lack of information and computing technology. This gap has often been referred to in recent years as the ``digital divide''. However, beyond the concerns about computing and communication infrastructures, a problem of even greater difficulty, magnitude and persistence emerges: a separation from the online world due to language barriers---a ``linguistic divide''. Speakers of minority and endangered languages must either abandon their cultural and linguistic heritage or forever remain economically and intellectually marginalized. Despite the obstacles, the need for better information and communication infrastructures is recognized and pursued vigorously in many developing nations, and improvement over the next decade appears likely. Transcending the language divide, by contrast, appears to be a daunting task: There are 6000 languages in the world, some of which are in danger of extinction, while many more are spoken only by small communities, who are disenfranchised not just for their lack of communication infrastructure, but due to their inability to communicate in any major language, including their own country's major language. Examples of such groups include Quechua speakers in Peru, Nahuatl speakers in Mexico, and many other native minority language speakers in North and South America and in Africa. A common language or lingua franca such as English cannot be expected to bridge this gap, as language and culture go hand in hand. Of further hindrance is the fact that in many of the economically and linguistically disenfranchised communities, oral language communication is more prevalent than written form. 
% THE FOLLOWING IS CORRECT BUT DILUTES THE MAIN POINT
% Moreover, content is increasingly presented in
% major languages other than English. The year 2000, in fact, marks the
% first year in which the
% number of non-English Web sites has exceeded those that are in English.
% These communities remain unreached,
% unconnected, uneducated and unsupported. How can such communities benefit
% from the Information Society?; how should benefits such as health, education,
% trade and commerce expand into these communities?; and how can free speech,
% democracy, knowledge and commerce spread and empower the disenfranchised
% people of the world, without first overcoming the language barriers among
% them?

In the {\sc Avenue} (Adaptable Voice-Enabled Natural-translator for Universal Empowerment) Project, we propose to develop solutions to bridge the language divide and address the underlying science challenges. Our goal is to develop a prototype voice-enabled translating communicator that delivers information services across the language divide. The contemplated voice translation portal will allow a remote user to (1) communicate directly with internet content and databases in other languages by voice, and more importantly (2) communicate with others speaking a different language from their own -- such as health workers, educators, agricultural advisors, and peers in other linguistically isolated communities. In order to achieve this ambitious goal, robust multilingual speech recognition and minority-language translation methods will be created at much lower per-language development costs than present-day technology, and for any relatively unexplored language.

At Carnegie Mellon University, we have already developed a number of speech and language prototype systems that deliver the precursors of such capabilities.
These prototypes, however, have been realized for fairly well-studied languages under considerable development effort, using language-specific knowledge and data resources, and we are now starting to reach out to minority languages. But true success for the minority and endangered languages of the world is impossible to achieve unless the development efforts and data requirements of our systems are reduced by about two orders of magnitude. In the following section we describe a scientific program that will allow us to build speech and MT technology considerably faster than the current state of the art, based on novel machine learning strategies that economize on data requirements and development effort. We envision a technology that is sufficiently flexible and adaptive that it could allow deployment of human services in a new community and new language or dialect within months.

%% Section: Scientific Approach

Given the objective of universal linguistic access as a means of empowering disadvantaged communities, we need to focus our research primarily on new language-adaptable speech recognition methods and on radical advances in machine-translation approaches based on machine learning from miserly amounts of training data. In this section we outline such high-risk/high-payoff methods, which are nonetheless based on promising precursor results at CMU and elsewhere.

%% Section: Universal Learning and Data Miser

Whereas Machine Translation systems \cite{hutc86} are gradually improving in accuracy and web-based reach, there has not been a corresponding broadening of their linguistic coverage base. Traditional transfer-rule-based MT requires on the order of a person-century to build and perfect a new language pair. More modern and flexible interlingual MT systems also require vast development efforts \cite{nire91mtbook}. Incremental advances, such as modular software techniques and better rule encodings \cite{carb87tomita}, have marginally reduced development efforts.
However, without a radical advance, the only commercially justifiable MT applications involve the major European languages, Japanese, Chinese and perhaps Korean and Arabic. The vast majority of human languages, including endangered languages, minority indigenous languages, and the languages of the economically disadvantaged, are unfortunately relegated to the proverbial MT dust heap. We propose radically new MT approaches based on extended and new machine learning methods in order to reduce new-language MT development to several months, rather than several years or decades. Limited forms of machine learning in MT have already proven successful in data-intensive approaches such as statistical MT \cite{Brown:90a} and Example-Based MT \cite{atr-ebmt}. However, these approaches substitute decades of human coding effort by vast amounts of linguistic training data, usually in the form of hundreds of megabytes of bilingually-aligned parallel corpora. For all but a few language pairs (such as English-French or English-Spanish), such prodigious amounts of translated text are prohibitively expensive to obtain, if they are obtainable {\em at all}. Hence, prohibitive quantities of human effort are replaced by equally prohibitive quantities of parallel text, and MT for minority languages remains on the far side of the economic divide. The methods we propose -- automated acquisition of transfer rules via seeded version-space learning, and maximum-entropy knowledge-seeded direct statistical translation -- reduce data requirements by up to two orders of magnitude, opening the door to the linguistic democratization of MT. A first step in this more miserly use of data has been taken with the Generalized EBMT project under NSF {\sc Stimulate} funding, which has achieved an order-of-magnitude reduction through the use of manually-created generalization knowledge \cite{Brown:1999}, or a factor of five reduction using automatically-learned generalizations \cite{Brown:2000}. 
These results establish the feasibility of data-miserly MT learning methods, but they only scratch the surface of what is possible. The new learning MT methods promise another order of magnitude improvement.

The situation is similar for speech recognition systems, where it is often difficult or infeasible to find a sufficient number of native speakers to provide training data for acoustic recognition models and language models. The simplistic approach of using English acoustic models for other languages must be rejected as it leads to prohibitively high error rates. Yet, most languages share the bulk of their phonetics (because speech production is constrained by the human vocal tract), with multiple variations. It proves far more data-efficient to build language-adaptive recognizers than to train a new recognizer from scratch for each new language. Machine learning methods have already been proposed for language adaptation of speech systems \cite{SWicassp00} that reduce data requirements for acoustic models considerably. Further advances will permit reduction in required adaptation data to well under an hour of recorded transcribed speech in each language, while reducing error rates in the new language to competitive values. Automatic and/or semi-automatic methods to derive dictionaries and language models will also be developed.

%\documentclass{article}
%\begin{document}

%% Section: Language Adaptive Speech Recognition

The state-of-the-art in large vocabulary continuous speech recognition (LVCSR) has advanced substantially in recent years. Recognition systems developed originally for one language have been successfully ported to several languages, including systems developed by IBM \cite{CDGasru97}, Dragon \cite{BCGicslp96}, Cambridge \cite{YAAcsl97}, TI \cite{WKAMicassp94}, LIMSI \cite{LAGeuro95}, and by our group at the Language Technologies Institute and Karlsruhe University \cite{OAMicassp92,SWeuro97}.
The transformation of English systems to diverse languages such as Arabic, Chinese, German, French, or Japanese implies that speech technology generalizes across languages and that similar modeling assumptions hold for various languages. However, extensions have only been performed for about 10--20 of the most widespread languages from the most extensively studied language groups. In those languages huge amounts of spoken and written data have been accumulated over time. Even though the costs of collecting and transcribing these data were extremely high (transcribing speech usually takes up to 20 times real time), the collection of large databases was at least possible, since those languages offer many potential speakers and written text resources. Commonly used techniques for building a high-performance LVCSR engine in a new language presuppose (1) dozens of hours of recorded and transcribed speech to train the acoustic models, (2) a large vocabulary on the order of 100k words, together with a pronunciation dictionary covering that vocabulary to guide the decoding procedure, and (3) a huge amount of text data for language modeling, containing more words than a human being hears during his or her entire life. Unfortunately, the majority of languages are spoken by only 100 to 10000 speakers; only 150 (about 3\% of the world's languages) have more than 1 million speakers. Furthermore, the number of languages providing large amounts of text data is on the order of 30, and a large number of languages have no written form at all. Therefore, dealing with minority languages from the speech recognition point of view is mainly an issue of very limited audio data, very little or even no text data (and/or no writing system), and the complete lack of pronunciation dictionaries.
For languages known and/or spoken by very few speakers it is even plausible that no native experts are available to serve as transcribers or to provide the community with knowledge about pronunciation rules, syntax, morphology, and other relevant information.

Therefore, our objective in language adaptive speech recognition is to develop solutions for minority languages, i.e. in situations where only very limited or even no data are available. We address three main problems:
\begin{enumerate}
\item Rapid acoustic model adaptation
\item Generating pronunciations for large vocabularies
\item Language modeling with limited text data
\end{enumerate}
Our goal is to develop solutions for all three problems. More specifically, we plan to build language independent acoustic models by combining acoustic models across various languages for which such models have already been trained properly. These language independent acoustic models will then be adapted using only very limited amounts of adaptation data from the target language. To automatically generate pronunciations for words of the new target language, we plan to build a consensus among the votes of phoneme recognition engines that apply language dependent and language independent acoustic models. For language modeling with very limited data, we propose to apply the transfer grammars resulting from the approaches referred to in part XY of this proposal to automatically generate in-domain texts. These texts can be further extended by logging human interaction with the systems and on-line interactive help provided by native users. The lack of training data might be partly counterbalanced by using class-based language models. The following sections describe our proposed solutions in more detail.

%%===========
\noindent {\bf Rapid acoustic model adaptation}
%% the extensions to former work is along the lines:
%% - more languages
%% - new language groups
%% - less wide-spread meaning less knowledge ...
%% - fewer data

In the area of acoustic models we plan to estimate acoustic models for new target languages by borrowing data from various source languages for which such data is more plentiful, while using only very limited amounts of adaptation data from the target language. We propose to perform this language adaptation in three main steps. First, we define a language independent (universal) phoneme set \cite{SWeuro97} by applying data-driven methods and building upon the definitions of similar sounds across languages as documented in international phonetic inventories like SAMPA and IPA \cite{IPAjipa93}. Second, we combine the acoustic models of those sounds to create language independent acoustic models \cite{SWdarpa98}\cite{SWicslp98}. Third, we adapt those language independent acoustic models to new target languages \cite{SWicassp00} using very limited data.

These three steps require extensive multilingual speech recognition experience, access to multilingual text and speech, and a set of monolingual speech recognition engines for a variety of languages. Over the course of the last ten years, our group at the Interactive Systems Lab has developed speech recognizers in more than 15 languages \cite{WGMSWieee00}. The {\sf GlobalPhone} database provides us with multilingual speech and text data in 17 widespread languages \cite{SWeuro97}\cite{SWicslp98}\cite{SWicassp00}. We thus have the extensive resources and experience required to undertake the challenging task of language-adaptive speech recognition as outlined above.

%% NOW IT'S STARTS GETTING DRAFT
%%===========
\noindent {\bf Generating pronunciations for large vocabularies}

A pronunciation dictionary describes the pronunciation of each word in the recognizer's vocabulary and guides the recognizer during decoding. Building a dictionary is time-consuming and requires native experts.
For languages with a close letter-to-sound relation, rule-based approaches yield good quality; however, languages with no writing system, ideographic writing systems, or weak letter-to-sound relations have so far relied on hand-editing. In spontaneous or dialectal speech, pronunciation variants are common and hard to cover by rules. So far, pronunciation dictionaries are available for only very few languages, on the order of 10.

Earlier research in our group \cite{Sloicassp95,Sloicslp96} showed that accurate acoustic models of one language can successfully be used to generate pronunciation variants of words in that language. Once acoustic models from many different languages, and even language independent acoustic models, are available, these models might also be useful for generating initial pronunciations for words of a new language. In this case several phoneme recognition engines trained on different languages can be applied to decode utterances from the new language in question. IPA mappings from the source languages to the target language can then be applied to find a consensus among the recognizers. This consensus is likely to be close to the actual pronunciation of the target word. The approach requires only knowledge of the word boundaries in the spoken utterances. This knowledge can either be provided by humans or derived from prior knowledge about word length distributions in combination with anchor points.

%%===========
\noindent {\bf Language modeling with limited text data}

A large amount of text data is necessary for language modeling. A language model has usually seen more words in its training corpus than an average human being is likely to hear during his or her entire life. For widespread languages, large amounts of text data can be derived, for example, from web archives and other text resources that are already available. This does not hold for most of the world's languages.
Especially for languages without writing systems, it is very difficult to access any resources. Moreover, large text resources are mainly available for tasks like newspaper or broadcast news. Highly specialized data or spontaneous spoken-domain data are hard to get. If only little text data is available or, even worse, the language is spoken but not written, we could train the language model incrementally on the actual users' input (enrollment, logging of a user's session) and, in cooperation with the translation part, try to generate text from the IF (talk to Alon). Class-based language models are another known option.

%\subsection{Rapid Development of Synthetic Voices in New Languages}

In addition to speech recognition, we also wish to generate synthetic speech in the languages we will be working on. The CMU Festvox project \cite{festvox00} brings together the basic research and appropriate tools and engines for building synthetic voices in new languages; it has already been used within CMU for Nepali, Croatian, Chinese and Japanese, and by a host of other groups throughout the world for various languages including Turkish, Swedish, Japanese, Italian, Basque, Scots Gaelic and others. This work builds on Edinburgh University's Festival Speech Synthesis System \cite{festival98} and the earlier CMU unit selection synthesizer Phonebox \cite{lenzo98}. A new voice in a new language requires some form of text analysis \cite{sproat01}, a lexicon or letter-to-sound rule set \cite{black98b}, a prosody model (providing phrasing, intonation and duration models) and a waveform synthesis method (e.g. diphones \cite{lenzo00a} or more general unit selection \cite{ldom00a} and \cite{black97c}).
Although it is relatively easy to get something to talk badly and very difficult to get something perfect (we are still a long way from perfect English synthesis), this research will address the issue of obtaining reasonable, recognizable speech synthesis in a relatively short amount of time, with a limited amount of training data, and without requiring the person building the system to be an expert in speech synthesis. Specifically, we will address the learning of prosodic models (phrasing, intonation and duration) from very small amounts of data that can be automatically labelled, thus yielding probably conservative but understandable speech synthesis. Building automatic prosody models for new languages is currently the weakest part of the build process and the one whose improvement would raise quality most. Two techniques will be investigated. The first is cross-language modeling, where a model is taken from an existing model in a similar language. This has been shown to be not unreasonable between English and German \cite{sproat98}; we will also investigate adaptation of the model with a small amount of data in the target language. The second is the use of modeling techniques that have been shown to work on small amounts of data, e.g. \cite{maghbouleh96}. Defining a phoneme set and lexicon for a synthetic voice in a new language is a similar problem to developing one for recognition, and we will share resources in such a development. In minority languages there is sometimes not even an accepted phoneme set. Current unit selection techniques used for speech synthesis \cite{black97c} can go some way to aid this. Approximate phoneme sets can be used, and acoustic data can be used to further refine distinctions. Thus accents and dialects of larger languages can be more readily captured.
Lexical specification may actually be easier for minority languages: the more recent the development of an orthography, the closer the relation between letters and sounds \cite{sproat00}, and the easier it is to define that relation functionally.

% \documentstyle[11pt]{article}
% \setlength{\textwidth}{163mm}
% \setlength{\textheight}{225mm}
% \setlength{\oddsidemargin}{1mm}
% \setlength{\topmargin}{-10mm}
% \setlength{\headsep}{5mm}
% \begin{document}
\def\given{\,|\,}
% \section{Learning Direct Statistical Translation Models}

The idea of statistical machine translation can be traced back to Warren Weaver in the late 1940s \cite{Weaver}. The approach was first seriously developed beginning in the late 1980s in a project at the IBM Watson Research Center \cite{Brown:90a}. The statistical approach has proven to be very successful, achieving levels of performance comparable with some knowledge-engineered systems, but with orders of magnitude less effort and expense---when a sufficiently large bilingual corpus is available for training.

We propose to develop statistical methods that can learn from orders of magnitude less training data, and that can more effectively incorporate prior linguistic information, including dictionaries, word classes, and rule-based systems for pre-processing the input. As the amount of available bilingual text increases, the system is expected to become more robust and accurate, but it will not be solely dependent upon an extremely large parallel corpus for training.

The new ingredients in our approach are (1) the use of exponential (maximum-entropy) models as the basic statistical tools; (2) the (3) the use of joint source-channel modeling (vs. artificially factored models); (4) incorporation of bilingual dictionaries into the statistical models; (5) the use of word classes and other annotations to parameterize models; and (6) the use of syntactic/semantic parse information from the source language, when available.
In this approach we will completely reformulate the statistical translation framework in terms of a model that {\it directly} parameterizes the target sentence in terms of the source, making use of recent advances in statistical learning techniques based upon exponential models. This research is motivated by our experience with statistical translation systems, most recently through the results of the 1999 workshop on statistical machine translation held at Johns Hopkins University, under NSF support, from which we conclude that statistical translation will significantly improve by coupling the estimation of translation models together with the language model and, more importantly, by conditioning on greater linguistic structure and contextual information in the source. The fundamental object in the {\it direct statistical transfer} approach is a conditional model $p(T\given S)$ of the target sentence given the source. Any linguistic information designed into the models is {\it conditioned\/} on, rather than generated, removing the need to specify the explicit relationship between the various sources of information. In a sense, this is a return to the most natural linguistic approaches to machine translation. But it will only be possible by using the recent advances in statistical modeling and machine learning. The direct translation approach uses the {\it Analysis--Statistical Transfer--Synthesis} paradigm \cite{Brown:92a}, but without reversing the direction in the statistical transfer stage. This reformulation of the problem represents a significant departure from the framework for statistical translation pioneered more than 12 years ago, and presents several new scientific challenges. There are two fundamental difficulties with the standard approach that factors the language model from the translation model. 
First, the source-channel paradigm is typically implemented by building the language model $p(T)$ independent of the channel model $p(S\given T)$, often using different training data, and combining the models in an {\it ad hoc\/} manner when decoding. This is justified by an idealization originating in information theory (the joint source-channel coding theorem) which is quite unrealistic for natural language. The second difficulty---the more relevant one for our research---is that it requires statistical models that {\it predict\/} all of the linguistic annotations in the input sentence. As the linguistic knowledge used grows in complexity, the probabilistic models become extremely unwieldy. Instead, we would like to {\it condition} on available linguistic annotations for the input, and use them to predict words and structures in the target language. This presents an apparent difficulty. A significant source of the strength of the statistical approach to speech and machine translation comes from the power of the trigram language model to guide the search for an intelligible hypothesis. Moreover, the flexibility of the EM algorithm for ``bootstrapping'' training alignment models and related hidden Markov models affords an extremely powerful set of tools. Fortunately, it is not necessary to throw out these strengths, but to build on them. Our approach will be to construct exponential models of the translation process directly, and to incorporate trigrams into the default distributions. In order to build these models, we will adopt a bootstrapping approach using simpler models to seed the estimation of more powerful context-sensitive exponential models. Since the translation model is estimated together with the language model, this is called {\it joint source-channel\/} modeling. 
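
To make the contrast concrete, the two formulations can be written side by side. This is an illustrative sketch only: the feature functions $f_i$, weights $\lambda_i$, default distribution $q$, and normalizer $Z$ are notation assumed here for exposition and are not a committed model design. The standard source-channel decoder searches for
\[
  \hat{T} \;=\; \arg\max_{T} \; p(T)\, p(S \given T),
\]
with $p(T)$ and $p(S\given T)$ estimated separately. A direct exponential model of the kind discussed above could instead take the form
\[
  p(T \given S) \;=\; \frac{1}{Z(S)}\; q(T \given S)\,
  \exp\Bigl(\sum_i \lambda_i f_i(S,T)\Bigr),
  \qquad
  Z(S) \;=\; \sum_{T'} q(T' \given S)\,
  \exp\Bigl(\sum_i \lambda_i f_i(S,T')\Bigr),
\]
where $q$ is a default distribution into which trigram statistics can be incorporated. Because $S$ appears only on the conditioning side, each $f_i$ may inspect any available annotation of the source (word classes, dictionary entries, parse information) without the model having to predict that annotation.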
% \end{document}
%%\documentstyle[fullpage,psfig]{article}
%%\begin{document}
%% New version for ITR-00 Proposal 4/01/01
%% Chris Edited 5/7/99
%% Alon Done Editing 5/5/99
%% Alon Editing 5/2/99
%% Lori edited on 5/1/99.
%% Alon: General outline of the section --- 4/29/99
%%
%% Learning Transfer Rules from Elicited Data:
%% - Translation using transfer rules
%% - The Elicitation Process
%% - The Learning Process
%% - Implementation within the context of a KBMT engine/module
%% \section{Learning Transfer Rules from Elicited Data}
%% \label{kbmt}

Knowledge-Based Machine Translation (KBMT) uses knowledge of both syntax and semantics to analyze the source language text in order to produce high-quality translations, typically better than those produced by other MT engines. However, to achieve such high performance, KBMT systems developed to date have required person-decades of detailed development by experts trained in linguistic analysis. The challenge we face in applying knowledge-based MT methods to minority languages is how to drastically reduce the dependence on the availability of human experts (computational linguists) and the requisite development time. Our goal is to develop a novel approach that will address this challenge by combining effective knowledge elicitation with powerful machine learning capabilities.

The method proposed in this section is designed to acquire high-quality MT transfer rules, based on translation learning examples that are elicited from native informants who are bilingual speakers but not linguistic experts. These translation examples are then generalized into transfer rules via a new locally-constrained version space learning method. The goal is to learn {\it transfer rules\/} \cite{Alshawi+al:91,Dorna:96} that express both compositional syntax and non-compositional ways of performing functions such as certainty, obligation or modality.
\mysubsubsection{Elicitation of Transfer Rule Learning Examples}

\paragraph{The elicitation scenario:} Elicitation is carried out in a way that does not require the user to know linguistic terminology. One sentence is displayed at a time in English or Spanish (L1) and the user is asked to translate this sentence into his or her native language (L2). After the translation is given, the user is asked to give information about how the words line up. For each L1 word, he or she highlights the corresponding words in the L2 sentence. For example, for the English sentence {\it This is Juan\/} and its translation into Mapudungun (spoken in southern Chile) {\it T\"{u}fa Kuan\/}, the user might align {\it this is\/} with {\it t\"{u}fa\/} and {\it Juan\/} with {\it Kuan\/}. (Alternatively, the word alignment could be automatic, depending on the availability of corpora, glossaries, and other tools.) The system then stores the sentence with its translation and the word alignment. Word alignment has to allow for one-to-one correspondences of words as well as zero-to-one, one-to-many, and many-to-many correspondences. For example, {\it t\"{u}fa} could be aligned with {\it this\/} and {\it is\/} could be aligned with nothing. Ideally, we would not give the users guidelines about what to do when alternative alignments are possible, but simply expect that different rules would be learned for different alignments. Figure~\ref{el-tool} shows how the alignment may look for a typical learning instance (LI).

\begin{figure}[t]
\centerline{\psfig{figure=elicitationtool.ps,width=5in}}
\vspace{-2mm}
\caption{\label{el-tool} {\bf Translation and Alignment of a Simple Sentence using the Elicitation Interface}}
\end{figure}

\paragraph{The design of the corpus:} There are three design criteria for the corpus. First, it should be dynamically adaptable in that different sentences will be presented depending on the properties of previously translated sentences.
Second, it should be compositional, with smaller phrases forming the components of larger ones. This is a requirement of the version space learning algorithm we intend to use for learning transfer rules. Third, it should be as typologically complete as we can make it. As a guideline for typological completeness, we are following checklists used by linguistic field workers such as~\cite{lingua-checklist}.

We need to learn two types of information from the corpus --- how the constituent structure and word order of L1 correspond to the constituent structure and word order of L2, and what grammatical distinctions (case, number, gender, etc.) are made by L2. The corpus has four main parts (basic vocabulary, basic sentences, basic noun phrases, and complex constructions). Each part of the corpus (except for the basic vocabulary) targets constituent structure and word order as well as grammatical features.

The main mechanism for detection of grammatical features is to structure the corpus as a collection of minimal pairs --- two sentences that differ in only one feature, for example whether the subject of the sentence is singular or plural. The following minimal pair, for example, determines whether the relative pronoun in L2 (corresponding to English {\it who\/}) agrees in number with the head noun (the L2 equivalent of {\it the man\/} or {\it the two men\/}).
\begin{verbatim}
The man who lived in the small house died last year.
The two men who lived in the small house died last year.
\end{verbatim}

\paragraph{Basic Vocabulary:} Following the tradition of the Swadesh List (named after the linguist Morris Swadesh) used by linguists, we start by asking for things that even very remote languages have words for, such as tree, cloud, child, and so on. At this point we also elicit a few kinship terms (mother, father, etc.), a few body part terms (hand, arm, etc.), and pronouns.
In anticipation of languages that have gender (masculine, feminine, neuter) and other types of noun classes (such as the 18 Bantu noun classes), we include nouns that may have inherent gender (man, woman, etc.) and a variety of animate and inanimate objects, natural and man-made objects, and abstract concepts.

\paragraph{Basic Sentences:} The next section of the corpus elicits basic sentences with transitive and intransitive verbs. In addition to identifying basic word order of subject, object, and verb, this part of the corpus diagnoses a few other properties of the language, such as whether there is special treatment of indefinite or inanimate noun phrases acting as subjects of sentences. The diagnoses will influence the selection of sentences in the remainder of the corpus. Each diagnosis can be made by comparing a few minimal pairs. For example, to determine whether the language has an aversion to indefinite subjects (preferring something like {\it There is a man who left\/} over {\it A man left\/}), compare the translations of the minimal pair {\it The man left\/} and {\it A man left\/}. The reason for eliciting simple sentences before simple noun phrases is that the noun phrases may have to be elicited in the context of a sentence, so even before we start eliciting noun phrases, we need to know of any precautions (e.g., relating to definiteness and animacy) we have to take.

\paragraph{Basic Noun Phrases:} Basic noun phrases may contain determiners, adjectives, cardinal and ordinal numerals, quantifiers, and possessors, but do not contain prepositional phrases and relative clauses.
There are many phenomena to detect at this level: the order of the basic elements, the expression of definiteness, whether the language has numerical classifiers (as in Japanese where it is necessary to say {\it one volume of book\/} and {\it one stick of pencil\/} instead of {\it one book\/} and {\it one pencil\/}), whether possessive pronouns agree with the head noun (as in the Romance languages), whether the language has possessor ascension (preferring structures like {\it I hit him on the hand\/} over {\it I hit his hand\/}), etc.

\paragraph{Complex Constructions:} Having established basic vocabulary, noun phrases, and sentence structure, we can move on to more complex constructions such as relative clauses, comparatives, embedded clauses, adjunct clauses, questions, deeper exploration of the tense and aspect system, etc.

%Using a Graphical User Interface (GUI), LETO will present the non-expert
%native informant of the low-density source language with a sequence of {\it
%learning instances\/} (LIs). Each LI will consist of an English phrase or
%sentence example and its translation into the speaker's native language. The
%translation will either be done by the user or by some other native
%speaker (and stored in a bilingual corpus) or will be done on the fly as part
%of the elicitation process. The LI is then analyzed at the lexical and
%morphological level (Section~\ref{morphology}), which involves tokenizing
%words into their root forms and features and identifying part of speech tags.
%The user will then be asked to indicate word alignments between the two
%languages wherever possible. In cases where the GEBMT component
%for the language is already available, the GEBMT system will produce a word
%alignment hypothesis that will be presented to the informant for verification
%or modification.
%
%No correspondences of structure on any higher levels will be requested from
%the informant. Figure~\ref{gait-np} shows how the alignment may look for a
%typical LI. There will generally not be one-to-one correspondences of
%morphemes or words in the source and target languages. Zero-to-one,
%zero-to-many, many-to-one and many-to-many correspondences will be common.
%
%The knowledge of which types of structures and meanings should be elicited,
%and the order in which to elicit them, will be encoded in LETO, based on a
%principled structural language typology, refined from established methodology
%in field linguistics. The types of structures and meanings will be designed
%to be linguistically comprehensive (see detailed description in
%Section~\ref{data}). Additionally, in order to support the learning of
%transfer rules, the elicitation will be conducted in a hierarchical fashion.
%LIs for lower-level phrase structures will be presented first, to allow the
%early learning of transfer rules for such phrases, that can later be used as
%building blocks for transfer rules at higher levels of phrase structure.
%While this elicitation ordering will be based on an underlying general phrase
%structure grammar of one language (English in our case), we do not expect the
%resulting learned transfer rules to necessarily match the hierarchical
%structure of the underlying English grammar. Rather, the English grammar will
%serve as a knowledge source for suggesting possible units of structure for
%transfer rules.
%
%The word-aligned bilingual LIs will then serve as the input to our
%learning module, which will identify transfer rules of appropriate
%generalization that can account for the various translation examples
%that have been elicited. This learning process is described in detail
%next, in Section~\ref{learning}.
%% Chris added:
%%All learned rules will be stored in the LETO database for use by other
%%components.
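
Returning to the alignment step of the elicitation scenario, the stored word alignment can be written down compactly. The notation here is only an illustrative sketch, not a committed storage format: the symbols $e_i$, $f_j$, $A_1$ and $A_2$ are introduced for exposition. Number the L1 words $e_1, e_2, e_3$ ({\it this}, {\it is}, {\it Juan\/}) and the L2 words $f_1, f_2$ ({\it t\"{u}fa}, {\it Kuan\/}), and let an alignment be a set of pairs of word groups, either of which may be empty. The two alternative alignments mentioned for this sentence are then
\[
A_1 = \{\, (\{e_1, e_2\}, \{f_1\}),\; (\{e_3\}, \{f_2\}) \,\}
\]
\[
A_2 = \{\, (\{e_1\}, \{f_1\}),\; (\{e_2\}, \emptyset),\; (\{e_3\}, \{f_2\}) \,\}
\]
$A_1$ contains a many-to-one link ({\it this is\/} with {\it t\"{u}fa\/}), while $A_2$ replaces it with a one-to-one link plus a link to nothing. Both are legitimate inputs to rule learning, and each would be expected to give rise to a different transfer rule.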
\mysubsubsection{Automated Abstraction of Linguistic Transfer Rules}

\begin{figure}[t]
{\bf Learning Instance:} \newline
{\bf English:} {\tt the big boy}~~~~~~~~~{\bf Hebrew:} {\tt ha-yeled ha-gadol} \newline
{\bf Acquired Transfer Rule:}
\begin{verbatim}
Hebrew: NP: N ADJ <==> English: NP: "the" ADJ N
where: (Hebrew:N <=> English:N)
       (Hebrew:ADJ <=> English:ADJ)
       (Hebrew:N has ((def +)))
       (Hebrew:ADJ has ((def +)))
\end{verbatim}
\vspace{-3mm}
\caption{Example: A Noun Phrase LI and Transfer Rule}
\label{NP-example}
\end{figure}

Another major scientific challenge within the iKBMT approach is how to accomplish effective generalization from the translation examples that are aligned by the source language informant. This is a crucial task, since the translation applicability of the acquired linguistic knowledge greatly depends on correct and effective generalization.
%% actually, Doug Jones' work is very similar to ours
% Contrary to the approach pursued by the Boas Project \cite{nirenburg:98},
% which relies on a native informant with advanced linguistic expertise
% to identify appropriate transfer rule abstractions, our approach is
% designed to elicit data from informants with little to no linguistic
% training.
Our premise is that it would be impractical to expect a non-linguist informant of the low-density language to have the skills required to identify the appropriate levels of generalization, in contrast to the assumption of the Boas Project \cite{nirenburg:98}. Our intention is therefore to use powerful machine learning techniques to infer appropriate abstractions and generalizations, using the evidence available from the translation examples and prior encoded knowledge. The goal of the learning process is to match every translation example LI in the elicited bilingual corpus with a transfer rule that accounts for the translation and is of an appropriate level of abstraction.
Figure~\ref{NP-example} shows a simple example of a noun phrase LI and the transfer rule with the abstraction which we would like to acquire from such LIs. The transfer rule indicates that a Hebrew noun phrase of the form ``{\tt N ADJ\/}'' transfers to an English noun phrase of the form ``{\tt the ADJ N\/}'', with appropriate correspondences between the ``{\tt ADJ}'' and ``{\tt N}'' in both languages, and restrictions on the Hebrew noun and adjective that require them both to be marked as definite. The abstraction of ``{\tt ADJ}'' and ``{\tt N}'' within the transfer rule implies that the rule would also apply if the ``{\tt ADJ}'' mapping {\it ``gadol/big''} in the instance were replaced with any other adjective (such as {\it ``qatan/small''}). Generalization to the level of basic categories (parts-of-speech) will serve as the fundamental step in the rule abstraction learning process. After the system observes the translations of several example phrases of similar structure but different nouns and adjectives, it should be capable of automatically inferring that the transfer rules of these examples can be collapsed into a single transfer rule with the above indicated level of abstraction. Abstractions that are learned automatically will be verified against the aligned bilingual corpus and, when necessary, interactively with the informant.

We are developing a new method, based on the Version Space (VS) hypothesis formation technique \cite{Mitchell:82,Hirsh:92}, for the task of automatically learning appropriate transfer rule abstractions. The VS method assumes a hypothesis space with an abstraction partial order relation between the hypotheses. Using both positive and negative learning instances, the new locally-constrained VS algorithm efficiently narrows down the Version Space (the space of hypotheses that are consistent with all of the instances seen so far) until the hypothesis with the correct level of abstraction is identified.
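
To make the abstraction partial order concrete, consider the noun phrase LI of Figure~\ref{NP-example}. What follows is an illustrative sketch in notation introduced only for exposition: the symbols $h_0$, $h_1$, $h_2$, $\preceq$, $H$, $D^{+}$ and $D^{-}$ are assumptions of this sketch and not part of the transfer rule formalism, and the definiteness constraints are omitted for brevity. Three candidate rules for this LI are
\[
\begin{array}{lll}
h_0: & \mbox{\tt ha-yeled ha-gadol} \Leftrightarrow \mbox{\tt the big boy} & \mbox{(fully lexicalized)} \\
h_1: & \mbox{\tt N ha-gadol} \Leftrightarrow \mbox{\tt the big N} & \mbox{(noun abstracted)} \\
h_2: & \mbox{\tt N ADJ} \Leftrightarrow \mbox{\tt the ADJ N} & \mbox{(noun and adjective abstracted)}
\end{array}
\]
Writing $h \preceq h'$ when $h'$ is at least as general as $h$, these form the chain
\[
h_0 \preceq h_1 \preceq h_2 .
\]
With $D^{+}$ the positive and $D^{-}$ the negative instances seen so far, the version space in the standard formulation of \cite{Mitchell:82} is
\[
VS = \{\, h \in H : h \mbox{ covers every } x \in D^{+} \mbox{ and no } x \in D^{-} \,\} .
\]
For instance, a further positive instance that differs only in its adjective, such as the {\it qatan/small\/} variant of the same phrase, is covered by $h_2$ but by neither $h_0$ nor $h_1$, so observing it removes $h_0$ and $h_1$ from $VS$.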
The partial order between the hypotheses allows the VS at each stage to be efficiently represented by a set $G$ of most-general hypotheses and a set $S$ of most-specific hypotheses. Each positive instance allows $S$ to be modified to consist of more general hypotheses, while each negative instance modifies the set $G$ to consist of more specific hypotheses. The difficulties of worst-case exponential $S$ and $G$ sets are unlikely to pose a problem in practice because: (1) we envision relatively shallow version spaces, presenting much less opportunity for exponential blowup of either $S$ or $G$; (2) the refinements (below) maximally reduce the VS; and (3) our previous work in the Prodigy planning and learning system \cite{prodigy} has indicated that typical performance is much better than worst case even for deeper version spaces.

In order to effectively learn appropriate abstractions for translation transfer rules, we will investigate extending the VS technique by:
\vspace{-2mm}
\begin{itemize1}
\item Developing a strategy for selecting candidate generalizations that maximally reduce the candidate hypothesis space. In other words, the new VS method is an active learner, selecting training instances that maximize VS reduction, and testing these against the bilingual corpus (if it contains the correct constructions) and otherwise the native informant, who serves as an ORACLE (the final arbiter to classify instances as correct or not). For version spaces, the maximal-information instance is a point in the generalization lattice that corresponds to one-half the average distance between the $S$ and $G$ boundaries, weighted by the sizes of $S$ and $G$.
\item Developing a {\em Dependency-linked VS} method that can keep a trail of roll-back operations in case the native informant errs or changes his or her mind.
The same mechanism must be able to identify the set of candidate (dis)confirmed hypotheses potentially responsible in case the VS collapses to the null set, signaling a logical inconsistency. Such improvements are necessary to render practical the otherwise logically elegant method of VS.
\item Developing {\em Rule order precedence} relations that will be needed if more than one hypothesis is generated by the VS and the simple maxim of ``specific-over-general'' proves insufficient. We currently envision a $k$-fold cross-validation process on which to test transfer-rule precedence relations if we fail to discover an analytic solution for precedence ordering.
\end{itemize1}

\begin{figure}[t]
{\bf Learning Instance:} \newline
{\bf English:} {\tt I saw the big boy}~~~{\bf Hebrew:} {\tt ra'iti et ha-yeled ha-gadol} \newline
{\bf Acquired Transfer Rule:}
\begin{verbatim}
Hebrew: S: V "et" NP <==> English: S: "I" V NP
where: (Hebrew:V <=> English:V)
       (Hebrew:NP <=> English:NP)
       (Hebrew:V has ((tense past) (agr 1s) (gender M)))
       (Hebrew:NP has ((det +)))
\end{verbatim}
\vspace{-3mm}
\caption{Example: A Simple Sentence-level LI and Transfer Rule}
\label{S-example}
\end{figure}

\begin{figure}[t]
\centerline{\psfig{file=S-ex-VS.ps,width=4in,angle=270}}
%\smallskip
\caption{Abstraction Lattice for a Sentence-level Transfer Rule}
\label{S-ex-VS}
\end{figure}

The space of possible abstractions for a given translation LI is a function of its linguistic complexity. For example, Figure~\ref{S-example} shows a simple sentence-level LI and desired transfer rule. Figure~\ref{S-ex-VS} depicts a portion of the space of all possible transfer rule abstractions for this LI. In this example, the desired transfer rule captures the correspondence between the object NPs in both languages, using an NP-level transfer rule.
Our system will be able to infer such abstractions from LIs in cases where: (1) The underlying English phrase structure grammar licenses a dominated constituent (such as the object NP in the example); (2) the word-level correspondences within the LI suggest that a similar constituent exists in the other language; and (3) a transfer rule which can account for the correspondence has already been acquired for the dominated lower-level constituent. These three constraints act to limit the space of possible abstractions that the VS learning algorithm must consider. The third constraint requires that the transfer-rule learning process be conducted in an order that supports learning transfer rules for lower-level phrase constituents first. This simplifies the learning process on one hand, but also further limits the search space of possible abstractions, thus limiting the abstraction capability of the proposed method. Alternatively, conditions (1) and (2) may be used to hypothesize a lower-level phrase transfer rule (such as the object NP in our example), even when such a transfer rule has not previously been acquired. We will start with the more restrictive conditions, and investigate the quality and generality of the learned transfer rules. After gaining some insight and experience, we will look into the possibility of relaxing the third constraint, expanding the search space of possible abstractions. In the third year of the project, we propose to investigate some advanced questions related to the representation and abstraction power of our proposed formalism: \vspace{-2mm} \begin{itemize1} \item The formalism as described above will result in multiple separate transfer rules being created for LIs that are structurally similar but differ in grammatical markings such as case, person or gender. 
We would like to investigate how to identify such collections of transfer rules, so that they can be collapsed into a single more abstract representative rule, with the grammatical markings passed up to the phrase level.
\item We are aware of the fact that the formalism we propose will not be optimal for generating abstract rules for languages that have a very free word order. To address this, we will investigate methods for relaxing the strict word order that our current rules impose. This can be accomplished by extending the context-free phrase structure rules used in the current account to IDLP rules.\footnote{Immediate Dominance/Linear Precedence (IDLP)~\cite{gazdar+pullum:82,shieber:84} is a formalism which simplifies expression of grammars for languages with freer constituent orderings. While a context-free rule, such as $S \rightarrow NP\ VP$, expresses both constituency (that an $S$ is composed of both an $NP$ and a $VP$) and the relative order of the daughter constituents, the equivalent statement in IDLP, $S \rightarrow \{NP,VP\}$, $NP \prec VP$, separates these two notions.}
\end{itemize1}

\begin{figure}[t]
\centerline{\psfig{file=iKBMT-diagram.ps,width=6.0in,angle=270}}
\vspace{-2mm}
\caption{Data flow within the iKBMT module}
\label{iKBMT-diagram}
\end{figure}

\mysubsubsection{Translation Using the iKBMT Engine}
\label{kbmt-translation}

The transfer rule learning process will result in a large collection of transfer rules which together constitute a transfer grammar for the language pair (source language and English). The transfer grammar will then be utilized at runtime to support a ``transfer-based'' translation, similar in principle to that described in \cite{Alshawi+al:91,Dorna:96}, in the following way:
\vspace{-2mm}
\begin{enumerate1}
\item {\bf Parsing:} The source language input is analyzed using the source-language portion of the transfer grammar into one or more parse trees.
This will be done using the existing robust GLR* parser \cite{ENTH:alon-phd,ENTH:esslli-96}, or the LCFlex parser, an improved version of GLR*, currently under development under separate funding. The parser will be adapted to handle the specific types of robustness required for the task, particularly with respect to ambiguity and fragmentation.
\item {\bf Transfer:} Each analysis, consisting of one or more parse trees for the source language, will be converted into a corresponding tree (or set of trees) for English, the target language. This can be done in a straightforward top-down and recursive fashion using the corresponding English version of each source-language rule that appears in the tree.
\item {\bf Generation:} The GenKit \cite{tomita88} generation system will then be used to render the English structural representations into an English sentence.
\end{enumerate1}

Figure~\ref{iKBMT-diagram} contains a schematic diagram of the data flow within the envisioned iKBMT module. Research on the iKBMT engine will be primarily led by Dr. Lori Levin and Dr. Alon Lavie, with substantial participation by Dr. Jaime Carbonell.

%\end{document}
%\section{Computer Assisted Language Learning}
In order to learn to pronounce the sounds of the language, it is not sufficient to tell the user that she was wrong or right in the pronunciation of a sentence. People need to be told exactly which sound was incorrect and they need help knowing how to make the sound correct. We propose to use the speech recognition system that has been adapted to each of the new languages in order to teach people to pronounce those languages. In order to detect erroneous sounds in the language, we will leverage the fact that we know what the user is saying (the user will be prompted as to what to say by the system) (Eskenazi et al., 2000).
We can then send both the text of what the user said and the speech signal to the speech recognition system for the new language (L2), which can thus operate in forced alignment mode rather than performing recognition of an unknown speech signal. This gives us more precision in the time domain, and the phone score gives us an idea of the fit of each incoming sound to the one that is expected at that time. We will then use this information from the recognizer to determine which sounds have been pronounced correctly and which are erroneous. This technique has been used to teach the pronunciation of English. The algorithms here are language-independent and will therefore be easy to adapt to a new L2. After pinpointing the erroneous sounds, we will create filters that limit the number of errors shown per utterance, since it is poor pedagogical practice to correct all errors. Rather, a system should show only errors on one sound at a time, the sound that is being studied. The system will also offer a choice of native voices to imitate. The imitation of voices having characteristics that are close to one's own in terms of spectral qualities, timing and pitch has been shown to help people learn better (Probst et al., 2001). A system based on the above scheme that teaches the pronunciation of the English TH sound to non-native speakers of a variety of native languages (L1s) has proven to be as effective as a human teacher (Mayfield Tomokiyo et al., 2000).

At the same time that we develop the system to pinpoint errors in the pronunciation of the new languages, we will develop the pedagogical materials for the system. We will develop an automated system that uses the newly developed pronunciation lexicons for each new language to search for appropriate words and phrases to learn each sound that is requested.
Given the phoneset of the new language and the phoneset of the user's L1, we will automatically determine which sounds of L2 need to be learned and which are close enough to the native language that they will be comprehensibly pronounced without training. Using this system, we will also automatically determine what contrasting sounds from the user's L1 should be shown when giving correction help. For example, to learn the TH in the word birthday, a native speaker of Japanese would hear (and be shown) a contrast with an S (birsday - sin/thin), a native speaker of Russian would see and hear the contrast with F (birfday - fin/thin), and a native Serbo-Croatian speaker would have the contrast with T (birtday - tin/thin). Work in the past few years (Best et al., 2001) implies that we can use the phonological structure of L2 to predict which L1 phones a speaker will use instead of an unknown phone in an L2 word. We will compare this contrastive information to correction information that only describes how to produce the L2 sound, without creating contrasts with the L1 sound first. In other words, as well as offering instruction in the pronunciation of a sound as a movement from a sound in L1 to a sound in L2, we will offer instruction in pronouncing the sound as a new unit to be learned with no link to an L1 sound. Neither the L1 $\rightarrow$ L2 method nor the L2-alone method has been proven to be superior, and there has been evidence in the literature to support both strategies. If neither strategy proves to be superior according to tests using our system, we will offer both possibilities to users.

%% Section: Envisioned Scenarios
Since our work is motivated by the desire to build practical, portable systems that could be deployed in remote communities of the world, our efforts need to be grounded in the practical realities of the user's environment. The scientific advances proposed above will therefore be paralleled by our interaction with a showcase in a field situation.
Evaluation can then be carried out using well-known performance metrics, as well as measures of usability in the field.

During the course of the project, our language processing systems will be connected with actual health care services aiming to reach the linguistically and geographically isolated. We expect to work with one or more human services that already operate or are working with such remote communities. Dr. Peter Wilkniss (Director, Arctic and Antarctic Institute) has agreed to collaborate with our project on Arctic Telemedicine. The eight-nation Arctic Council, including the respective native organizations, has established circumpolar remote healthcare as a high priority. Telemedicine, including remote health-sensing and translingual physician-patient teleconferencing, is expected to play a major role in overcoming the difficulties of healthcare delivery in the remote Arctic for different indigenous groups speaking different minority languages. Bodymedia Inc. is open to providing internet-enabled bio/health-sensing technology, which we would combine with the proposed voice/text translation for field studies as appropriate.

We expect that the field study will include one or more remote kiosks that can accept direct requests by spoken language dialog and translate such requests and any ensuing dialogs with a remotely-located physician or health-service provider. We intend to evaluate system performance as well as usability in the field, to determine the effectiveness of the human service for actual users, when mediated by our speech and translation technologies.

%% Section: Evaluation
Evaluation of machine translation focuses on three aspects of system quality. First, and most traditional, is the evaluation of pure translation accuracy \cite{gates97}.
Second, since we are concerned with reduction of training data size, we evaluate accuracy and coverage as a function of the amount of training data \cite{Brown:2000}. Finally, we evaluate usability of the system.

\paragraph{Evaluation of accuracy:} Accuracy of MT is rated on a sentence-by-sentence basis and is typically scored on a three-point scale: perfect (correct meaning and grammar), acceptable (correct meaning, comprehensible but imperfect grammar), and unacceptable (incorrect meaning or incomprehensible grammar). Accuracy of speech recognition systems is evaluated primarily as word accuracy over decoding speed.

\paragraph{Accuracy as a function of training data size:} The fundamental scientific question concerning language adaptive speech recognition is: which adaptation approaches perform best, with minimal development time and costs, with minimal amount of adaptation data, and at what level of accuracy. We propose to extend the evaluation criterion of accuracy to include accuracy as a function of adaptation data, because we crucially need to measure the amount of training data needed for a specific adaptation approach. We will also evaluate accuracy of MT as a function of training data size in order to see changes in accuracy as data is added or removed.

\paragraph{User studies:} One important measure of usability is the rate of task completion. For instance, what percentage of the time do patient and healthcare provider successfully complete their objectives (e.g., diagnosing a condition, or seeking advice) in the translingual case vs. the monolingual one? However, going beyond our previous work on user studies and task-based evaluation \cite{lrec-00,amta-98}, we will also evaluate the usability of the system for people in remote locations who may be unaccustomed to computers. Our contacts among such communities are described in Section~\ref{}.
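
The quantitative measures discussed in this section can be stated compactly. The definitions below are a sketch in notation assumed here for illustration only: apart from the conventional definition of word accuracy, the symbols and summary measures are introduced for exposition and do not fix a scoring protocol. For a reference transcript of $N$ words, with $S$ substitutions, $D$ deletions and $I$ insertions in the best alignment between the recognizer output and the reference, the conventional word accuracy is
\[
\mbox{WA} = 1 - \frac{S + D + I}{N} .
\]
For example, $N = 20$, $S = 2$, $D = 1$ and $I = 1$ give $\mbox{WA} = 1 - 4/20 = 0.80$. For MT, if $n_p$, $n_a$ and $n_u$ sentences are rated perfect, acceptable and unacceptable, one possible summary (an assumption of this sketch) is the proportion of sentences rated acceptable or better:
\[
\mbox{Acc}_{MT} = \frac{n_p + n_a}{n_p + n_a + n_u} .
\]
Accuracy as a function of data size can then be expressed as a curve $\mbox{Acc}(d)$, the accuracy reached with an amount $d$ of training or adaptation data, and the data requirement of an approach for a target accuracy $\alpha$ as
\[
d^{*}(\alpha) = \min \{\, d : \mbox{Acc}(d) \geq \alpha \,\} ,
\]
so that two adaptation approaches can be compared by their values of $d^{*}(\alpha)$ at the same $\alpha$. Similarly, the task completion rate is the number of successfully completed tasks divided by the number of attempted tasks, measured separately for the translingual and the monolingual conditions.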
%% Section: Technological Advances and Impact
We envision scientific, technological, societal, and educational benefits from {\sc Avenue}. The primary scientific advances foreseen in the proposed research are (1) automated induction of semantically-situated MT transfer rules via a new P-time symbolic ML technique (seeded version spaces); (2) significantly improved statistical MT via exponential models and seeded priors in the new joint-channel model; and (3) ML methods for effective rapid adaptation of speech recognition to new languages with scarce data resources. The confluence of these advances should produce a two-orders-of-magnitude improvement in initial development time to reach a given level of accuracy for speech-based and text-based MT over the current state of the art.

The major proposed technological advances are (1) an improved multi-engine architecture, supporting the integration of diverse MT methods, including centrally the new statistical MT and induced-transfer MT engines; and (2) a complete prototype system for translation and information access that is demonstrably adaptable in months to new minority languages with scarce linguistic data resources.
%% Has multi-engine been mentioned anywhere earlier in the proposal?

The primary societal impact is a significant contribution to the global democratization of information, a process which began with the creation of the Web and must continue by bridging current linguistic barriers, especially for low-density or economically-disadvantaged languages. If successful, {\sc Avenue} will be the prototype of the MT system that will empower world-wide access to multilingual information. A second benefit is to open linguistic minority communities to scholars and government agencies by providing a means for understanding their culture and their needs and concerns.
The educational impacts of our work include training of multiple graduate and undergraduate students as researchers in speech recognition, machine translation, machine learning, computational linguistics, and statistical methods. We also propose to host faculty and students from other institutions as summer interns and short or mid-term visitors, and involve them directly in appropriate components of {\sc Avenue}. We plan to distribute components of our project of benefit to external students and researchers. These include the language-adaptive speech recognition engine, our data elicitation tool, the data we collect and align, and the new MT engines.