Each sentence is annotated ahead of time with a feature structure (see Section~\ref{EC}) which indicates only communicative functions: the action takes place before the time of speech and is represented as completed; the undergoer is animate or inanimate, male or female, specific/identifiable or non-specific/identifiable, and is the speaker, hearer or a third party. The surface realizations of these functional features are what the {\sc Avenue} system must discover automatically. If we want to find out how to translate English definite NPs into another language, we could take a morphosyntactic approach or a functional approach. A morphosyntactic approach might be to find out how English determiners are translated into the target language, or to see how determiners in the target language are translated into English. The first morphosyntactic approach, observing the translation of English articles in the target language, is problematic in cases where English determiners are used to express something other than definiteness. For example, in \pref{defArta} a definite article is used to refer to a species and in \pref{defArtb} an indefinite determiner marks a predicate nominal referring to a profession. Since other languages may not use articles in these ways (e.g., \pref{defArtFr}~\citep{croft}), the translation of determiners will not give us clear information about the translation of definiteness. Put differently, we could ask how English determiners (morphosyntactic entities) are translated, or we could vary identifiability, specificity, and uniqueness (functional notions) and see what effects they have on the sentence. We have taken the latter approach because morphosyntactic entities in one language typically do not map one-to-one onto morphosyntactic entities of another language. In general, in looking at any one functional notion across languages we may find a range of different morphosyntactic realizations.
The examples below show the effects of changing definiteness of subjects and objects in Hebrew and Chinese. In Hebrew, definiteness is marked on both subjects and objects with the prefix/clitic {\it ha-\/}. However, there is a difference between subjects and objects in that a preposition/particle {\it et\/} is used with definite direct objects but not subjects. In Chinese, definite NPs are often unmarked, but indefinite NPs (at least singular ones) might be marked with a classifier. Also, in Chinese, definite subjects are unmarked, but indefinite subjects may be more naturally expressed with an existential construction. Also, indefinite objects are typically postverbal in Chinese, whereas definite objects may be expressed with topicalization/fronting or other kinds of word order changes. Hebrew examples are given in \pref{HebrewDef} and Chinese examples in \pref{ChineseDef}.

\eenumsentence{\label{HebrewDef}
\item{\shortex{10}{} % Hebrew, columns indicated with &'s
{} % Interlinear, columns indicated with &'s
{}} % Free translation
\item{\shortex{10}{}
{}
{}}
}

\eenumsentence{\label{ChineseDef}
\item{\shortex{10}{}
{}
{}}
\item{\shortex{10}{}
{}
{}}
\item{\shortex{10}{}
{}
{}}
}

{\sc Avenue} is an ongoing research project. As we describe the components of the system we will indicate the degree of completion of each component, which languages it has been tested on, and the degree to which the research challenges have been addressed.

\section{Components of AVENUE \label{Components}}

Automatic learning of grammatical encoding is just one component of {\sc Avenue}. In this section, we present a larger picture of the Machine Translation framework developed under the {\sc Avenue} project. The {\sc Avenue} framework was designed as a comprehensive approach for rapid prototyping of MT for scenarios in which very limited language processing resources are available for the source language.
The framework supports transfer-based MT using the unification-based formalism described in Section~\ref{formalism}. The core translation engine consists of a transfer-rule engine and a decoder. These components are designed to produce translations using partial or incomplete grammars that can be either manually written by linguistic experts or automatically learned from elicited data, or may even consist of a mixture of rules of both kinds. The Elicitation and Rule Learning components that precede the core translation engine support automatic learning. A Rule Refinement component supports automatic grammar correction and refinement using feedback from bilingual informants. The main components of the {\sc Avenue} framework are shown in Figure~\ref{SystemArchitecture} along with their sub-parts.

\begin{figure}[t]
\begin{center}
\scalebox{.4}{\includegraphics{SystemDiagram.pdf}}
\end{center}
\caption{Architecture of the {\sc Avenue} MT System and its Major Components}
\label{SystemArchitecture}
\end{figure}

In the process of elicitation, an Elicitation Corpus is presented to an informant using an Elicitation Tool. This process is described in Section~\ref{elicitation}. The translated elicitation corpus is input to the rule learning component. It can also serve as a data source for manual grammar development by linguists. The output of rule learning is a set of transfer rules for machine translation. Rule learning has been applied so far to translation into English from Hebrew and Hindi, with results described in~\citep{Probst:2005} and~\citep{LavieEtAl:2003}.

Two additional types of learning are feature detection (the learning of grammatical encodings), which is the topic of this paper, and the learning of morphology. Currently, morphology is learned from an untagged monolingual corpus. Morpheme boundaries as well as sets of morphemes that may be paradigms are learned. Morphology learning does not at this time include learning the functions of the morphemes.
For example, it could learn that {\it -es\/} is a suffix in Spanish, but not that it marks plurality on nouns. Morphological learning has so far been applied to Spanish~\citep{MonsonEtAl:2004}. In order to identify functions for the hypothesized morphemes, morphological learning will have to be integrated with the learning of grammatical encoding, which identifies the functional notions that are manifested morphosyntactically. This integration of feature detection and morphology learning is planned, but has not yet been tackled.

The resources used by the translation engine include a grammar consisting of translation rules, a bilingual translation lexicon, and possibly a morphological analyzer if one is available. These resources can be automatically learned or hand-built. We have so far conducted experiments on five language pairs: Mapudungun-to-Spanish, Quechua-to-Spanish, Hindi-to-English, Hebrew-to-English, and Dutch-to-English.\footnote{We would like to acknowledge the following people for their work on these translation systems: Eliseo Ca\~{n}ulef (Mapudungun lexicon), Rosendo Huisca (Mapudungun lexicon), Flor Ca\~{n}upil (Mapudungun lexicon), Roberto Aranovich (Mapudungun rules), Carlos Fasola (Mapudungun morphology), Rodolfo Vega (Mapudungun data collection), the Chilean Ministry of Education, Program in Bilingual Education (Mapudungun funding), Richard Cohen and Pranjali Kanade (Hindi-to-English system), Shuly Wintner and Yaniv Eytani (Hebrew morphology and lexicons), Simon Zwarts (Dutch-to-English system), Irene Gomez (Translation of the Elicitation Corpus into Quechua and segmentation and translation of Quechua words), Yenny Ccolque (Translation of the Elicitation Corpus and part of the OCR correction of Quechua texts), and the {\sc Avenue} project members (not including the authors of this paper) Jaime Carbonell, Ralf Brown, Katharina Probst, Ariadna Font Llitj\'{o}s, and Christian Monson.}
\begin{itemize}
\item We are working on hand-built resources for
Mapudungun-to-Spanish translation, including a translation lexicon, a morphological analyzer for Mapudungun, and translation rules between Mapudungun and Spanish. This system produces good translations for simple sentences that are covered by the lexicon and grammar. Inflected Spanish words are generated by a pre-existing morphological generator~\citep{SpanishMorphGenerator}.
\item Our Hindi-to-English translation system~\citep{LavieEtAl:2003} uses a pre-existing morphological analyzer~\citep{iiit}; a translation lexicon that was integrated from several sources (including a lexicon provided by the Linguistic Data Consortium, a statistically trained translation lexicon and a small manually developed lexicon); and automatically learned translation rules. This system produced output that is typical for systems that are built without extensive human involvement: highly robust over any type of input, but with a quality that is at best suitable for getting the gist of what a document is about.
\item The Hebrew-to-English system~\citep{lavie-hebrew} uses an adapted pre-existing morphological analyzer, a translation lexicon that was integrated from several different sources, and automatically learned translation rules. The automatically learned translation rules were compared to handwritten translation rules that reflect about one week of human effort, with comparable results~\citep{ProbstEtAl:2001, lavie-hebrew}.
\item The Dutch-to-English system uses a translation lexicon primarily extracted from a large sentence-parallel corpus that was automatically word-aligned, and that was augmented with manual translations for frequent words. It also uses a handwritten transfer grammar that targets several syntactic differences between the two languages.
\item We are also working on a Quechua-to-Spanish system, which currently has stem and suffix translation lexicons and a set of handwritten transfer rules intended to cover the first two hundred sentences in our basic elicitation corpus. The stem translation lexicon was generated semi-automatically from a list of words segmented and translated by a bilingual speaker. Preliminary work has been done to integrate a morphological analyzer for Quechua with the existing MT system. As in the Mapudungun-to-Spanish system, inflected Spanish words are generated by a pre-existing morphological generator~\citep{SpanMorphGenerator}.
\end{itemize}

The output of the transfer engine component is not a single target language sentence, but a large lattice containing all possible rule outputs for substrings of the sentence~\citep{LavieEtAl:2003}. This allows the engine to operate with partial, incomplete or even noisy and inconsistent transfer grammars, typical of our current automatically learned grammars. The translation engine itself does not do disambiguation, and complete translations of newspaper-style sentences are rarely found. The transfer process must therefore be followed by a decoding process that walks the lattice to find the most probable and highest quality concatenation of target language substrings. For decoding, we use a language model (probability of 1-grams, 2-grams, and 3-grams) for the target language~\citep{Probst:2005, LavieEtAl:2003}, and are also investigating ways to score confidence in the rules that have produced each arc in the lattice. This is similar in concept to standard practice in statistical machine translation (SMT) and example-based machine translation (EBMT), which use a similar type of decoder for piecing together complete sentence translations from large collections of translation fragments.

The final component of the {\sc Avenue} framework is translation correction and refinement~\citep{Font-LlitjosEtAl:2005}.
This component, currently in preliminary stages of development, is designed to automatically detect and correct erroneous transfer rules in the underlying transfer grammar, based on feedback from bilingual informants on incorrect translations produced by the system.

\section{The Rule Formalism \label{formalism}}

The transfer rules are coded in a unification-based formalism which is like LFG in that it is a co-description of c-structure and f-structure. Figure~\ref{TransForm} shows the rules and lexical entries for translating a Mapudungun sentence with second person acting on first person, as shown in~\pref{Mapud1}, into a Spanish sentence with a first person direct object clitic and the verb agreeing with a second person subject. The Mapudungun suffix {\it -en\/}, glossed here as 2 acting on 1, could be broken down into {\it -e\/} (inverse) and {\it -n\/} (first person singular).

\enumsentence{\shortex{2}{pe & -en}
{see & 2$>$1}
{{\it Me viste. (You saw me.)}}
\label{Mapud1}}

Each rule or lexical entry consists of several parts. The first is the category, such as VBar::VBar, which means that a Mapudungun V-bar is translated into a Spanish V-bar. In this example, the Mapudungun and Spanish categories are always the same, although this does not always have to be the case. The second part is a translation pattern such as {\tt [V VSuffG] $\rightarrow$ [PRON V]}, which states that a verb followed by a verb suffix group in Mapudungun can be translated into a pronoun followed by a verb in Spanish. The category and translation pattern are similar to a synchronous context-free grammar in which each rule specifies two parallel expansions of a non-terminal symbol~\citep{SynchContFreeGram}, for example, V-bar expanding into a verb and verb suffix group and V-bar expanding into a clitic pronoun and a verb. One expansion is for the source language, in this case Mapudungun, and the other for the target language, in this case Spanish.
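Written as a paired production in the style of a synchronous grammar, the two expansions of the VBar rule would look roughly as follows (the angle-bracket notation is only an illustrative convention assumed here, not part of the {\sc Avenue} formalism):
\[
\mathrm{VBar} \;\rightarrow\; \langle\, \mathrm{V}\ \ \mathrm{VSuffG},\ \ \mathrm{PRON}\ \ \mathrm{V} \,\rangle
\]
The first member of the pair is the Mapudungun expansion and the second is the Spanish expansion.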
The third part of the rule is an alignment such as {\tt [X1::Y2]}, which states that the first element of the Mapudungun translation pattern corresponds to the second element of the Spanish translation pattern. The remaining parts of the rule are equations that specify unifications and constraints. In the equations, indices such as {\tt xn} and {\tt yn} are used instead of up and down arrows~\citep{patr, glr}. {\tt x0} refers to the f-structure of the mother node on the source language side and {\tt y0} refers to the f-structure of the mother node on the target language side. {\tt x1} refers to the f-structure of the first child on the source language side, and so on. Aside from this, the equations are interpreted much as they are in Lexical Functional Grammar~\citep{Bresnan:2001}, with a few exceptions: {\tt (f a) = *UNDEFINED*} is roughly equivalent to LFG {\tt $\tilde{}$(f a)} and {\tt (f a) = (*NOT* v)} is roughly equivalent to LFG {\tt (f a) ${\tilde{}}$= v}. Our formalism is also implemented differently from the pure LFG formalism in that we use a short-cut version of unification called pseudo-unification~\citep{PseudoUnification}, in which f-structures are copied instead of being truly unified.

The output of translation is a target language sentence that is read off the target language c-structure tree. In order for translation to succeed, the rules do not have to produce complete f-structures for the source and target languages. The f-structures can be used simply for checking and constraining features.

\begin{figure}[ht!]
\begin{small}
\begin{center}
\begin{minipage}[c]{3in}
\begin{verbatim}
V::V |: [pe] -> ["ver"]
(
 (X1::Y1)
)

VSuff::VSuff |: [en] -> [""]
(
 (X1::Y1)
 ((x0 voice) = inv)
 ((x0 person) = 1)
 ((x0 number) = sg)
 ((x0 object person) = 2)
 ((x0 object number) = sg)
 ((x0 mood) = ind)
)

VSuffG::VSuffG : [VSuff] -> []
(
 (X0 = X1)
)

VBar::VBar : [V VSuffG] -> [PRON V]
(
 (X1::Y2)
 ((X2 number) =c (*OR* sg pl))
 ((X2 voice) =c inv)
 ((X2 object number) = *UNDEFINED*)
 ((X2 negation) = *UNDEFINED*)
 ((X2 tense) = *UNDEFINED*)
 ((X1 lexicalaspect) = *UNDEFINED*)
 ((X2 aspect) = (*NOT* habitual))
 ((X0 person) = (X2 person))
 ((X0 number) = (X2 number))
 ((X0 mood) = (X2 mood))
 ((X0 voice) = (X2 voice))
 ((X0 object person) = (X2 object person))
 ((X0 reportative) = (X2 reportative))
 ((Y0 person) = (X0 object person))
 ((Y0 number) = (Y2 number))
 ((Y0 mood) = (X0 mood))
 ((Y0 object person) = (X0 person))
 ((Y0 object number) = (X0 number))
 ((Y2 person) = (Y0 person))
 ((Y2 number) = (Y0 number))
 ((Y2 mood) = (Y0 mood))
 ((Y2 tense) = past)
 ((Y1 person) = (Y0 object person))
 ((Y1 number) = (Y0 object number))
 ((Y1 type) =c personal)
 ((Y1 case) =c acc)
 ((Y1 morph) =c clitic)
)
\end{verbatim}
\end{minipage}
\end{center}
\end{small}
\caption{An {\sc Avenue} Transfer Rule \label{TransForm}}
\end{figure}

When confronted with a new resource-poor language, one of the things {\sc Avenue} must learn is the encoding properties of subjects, objects, and other grammatical relations in terms of word order, case marking, and verb inflection. This paper describes the {\it Feature Detection\/} components of {\sc Avenue}, which are involved in learning these encodings.

Collection of resources includes assembling whatever corpora and lexicons are available, or creating new ones. New lexicons can be built by hand, or acquired from parallel corpora by statistical word alignment algorithms~\citep{talip, Hebrew}. New corpora may be recorded and transcribed~\citep{lrec}, or may be elicited from bilingual native speakers.
The learning of grammatical encodings, the topic of this paper, is part of the process of eliciting data. Because we are assuming that no linguist is available, {\sc Avenue} must elicit information from non-linguists. The bilingual informant cannot answer questions about surface encodings such as case and agreement markers, auxiliary verbs, determiners, etc.;\footnote{But see the Boas system, which does train informants to use linguistic terminology.} s/he can only translate from the resource-rich language to the resource-poor language and line up corresponding words in the two languages. The informant is presented with an {\it elicitation corpus\/}, which is a list of sentences, similar to a fieldwork questionnaire, in the resource-rich language. The sentences have been annotated ahead of time with f-structures, but the f-structures contain only information relating to communicative function, such as whether the actor is the speaker, hearer, or a third party; the cardinality of the set of actors; whether the actor is male or female; identifiable, unique, or specific, etc. The job of {\sc Avenue} is to examine the informant's translations of the elicitation corpus and identify which morphosyntactic mechanisms have been used for each communicative function in the resource-poor language.

The romanization is vowelless, with {\it i\/} and {\it a\/} representing high front glide ({\it yod\/}) and glottal stop ({\it aleph\/}).

The simplified example in~\pref{multiplicationIntro} specifies that feature structures should be created to cover all values of lexical aspect combined with all values of grammatical aspect in three tenses. Multiplications will be discussed further in Section~\ref{FSCont}.
\enumsentence{\label{multiplicationIntro} (lexical-aspect \#all) $\times$ \\ (grammatical-aspect \#all) $\times$ \\ (absolute-tense past, non-past, present, future) } \eenumsentence{\label{featuresvaluesIntro} \item{{\bf Feature:} Causer intentionality \\ {\bf Values:} intentional, unintentional} \item{{\bf Feature:} Causee control \\ {\bf Values:} in control, not in control} \item{{\bf Feature:} Causee volitionality \\ {\bf Values:} willing, unwilling} \item{{\bf Feature:} Causation type \\ {\bf Values:} direct, indirect} } Informants may not initially be consistent about word alignments, especially for closed class items, but they usually become consistent after some feedback and correction (assuming that a human linguist is available to detect inconsistencies).
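
Read as a Cartesian product, the multiplication in \pref{multiplicationIntro} yields one feature structure for each combination of values. As a rough sketch, write $L$, $G$, and $T$ for the sets of lexical-aspect, grammatical-aspect, and listed absolute-tense values (these symbols are introduced here for illustration only), and assume that {\tt \#all} expands to every value of its feature and that no combinations are filtered out. The number of feature structures generated is then
\[
|L \times G \times T| \;=\; |L| \cdot |G| \cdot |T| ,
\]
so each additional feature in a multiplication scales the number of generated feature structures by that feature's number of values.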