Ariadna Font Llitjós RESEARCH page

Research interests:

Machine Translation

Human Computer Interaction, especially through the Internet

Developing technology for developing communities
Improving speech synthesis accuracy
Natural Language Processing
Machine Learning

My Research Blog


Dissertation Research

Ph.D. Thesis: Automatic Improvement of Machine Translation Systems

Interactive and Automatic Refinement of Translation Rules for a Transfer-based MT System
OR
Can the Internet help improve MT?
Achieving high translation quality remains the biggest challenge for Machine Translation (MT) systems. To address this challenge, researchers have explored a variety of methods to include user feedback in the MT loop. However, most MT systems have failed to incorporate post-editing efforts beyond adding corrected translations to the parallel training data of Statistical and Example-Based systems or to a translation memory database. My research centers on developing a largely automated approach that uses online post-editing feedback from non-experts to refine translation rules. Precise error correction information that is relevant to the system allows the Automatic Rule Refiner to trace errors back to the incorrect lexical and grammar rules responsible for them and to propose concrete fixes to those rules. Since this approach attacks the problem at its core, it generalizes beyond the input sentences corrected by bilingual speakers and allows for correct translation of unseen data. The reach of the Internet further enhances the relevance of this work. We envision turning the product of my research into an online game with a purpose. This game will allow bilingual speakers to correct MT output, get rewards for making good corrections, and compare their scores and speed to those of other players. For the MT community, this game will provide a free and easy way to get feedback for MT system improvement.
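To make the idea of tracing a correction back to a rule more concrete, here is a minimal toy sketch in Python. It is not the Automatic Rule Refiner itself: the names LexicalRule, Correction, translate, blame and propose_fix are hypothetical, the "MT system" is a word-by-word lookup, and the lexicon entries are invented for illustration. It only shows the general pattern of recording which rule produced each output word, using a user's correction to find the responsible rule, and proposing a refined rule that then also applies to sentences nobody corrected.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LexicalRule:
    """Hypothetical lexical rule: one source word maps to one target word."""
    rule_id: str
    source: str
    target: str


@dataclass(frozen=True)
class Correction:
    """Hypothetical user feedback: the word at `position` should be `new_word`."""
    position: int
    new_word: str


def translate(sentence, lexicon):
    """Toy word-by-word translation that records which rule produced each word."""
    return [(lexicon[w].target, lexicon[w].rule_id) for w in sentence.split()]


def blame(output, correction):
    """Return the id of the rule that produced the corrected word."""
    return output[correction.position][1]


def propose_fix(lexicon, output, correction):
    """Propose a refined copy of the blamed rule with the corrected target word."""
    rule_id = blame(output, correction)
    old = next(r for r in lexicon.values() if r.rule_id == rule_id)
    return LexicalRule(rule_id + "_refined", old.source, correction.new_word)


# Invented English-Spanish lexicon; the entry for "dog" is deliberately wrong.
lexicon = {
    "the": LexicalRule("L1", "the", "el"),
    "a": LexicalRule("L2", "a", "un"),
    "dog": LexicalRule("L3", "dog", "gato"),
}

output = translate("the dog", lexicon)           # "el gato" (wrong)
correction = Correction(position=1, new_word="perro")
fix = propose_fix(lexicon, output, correction)

# Replace the blamed rule, then translate a sentence nobody corrected.
refined = {**lexicon, fix.source: fix}
unseen = [word for word, _ in translate("a dog", refined)]
print(blame(output, correction), fix.target, " ".join(unseen))
```

Because the fix is applied to the rule rather than to the single corrected sentence, the unseen sentence "a dog" also benefits, which is the kind of generalization the paragraph above describes. A real transfer-based system would of course also have grammar rules and feature constraints, which this sketch leaves out.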

Short paper describing my dissertation research in more detail: "Can the Internet help improve Machine Translation". Doctoral Consortium at HLT-NAACL, June 4, 2006, New York, USA. pdf

Other papers can be found here.

I proposed my thesis in November 2004. If you would like me to send you the proposal document or presentation slides, send me an email (aria AT cs.cmu.edu).

AVENUE project

Since May 2002, I have been working on the AVENUE project (CMU blackboard for the AVENUE project), namely on developing Automatic Machine Translation Systems for resource-poor languages.

I am currently in charge of developing the resources for Quechua. During the summer of 2005, I spent three months in Cusco building the resources and infrastructure to implement a Quechua-Spanish MT prototype system, as part of the V-Unit (Vision Unit). The V-Unit is part of the TechBridgeWorld Initiative at CMU. Currently the Quechua-Spanish MT prototype system has 25 translation rules and 683 lexical entries (40 manually and 643 semi-automatically created).

From May 2002 until January 2005, I was in charge of Mapudungun. In April and November of 2002, I travelled to Temuco, Chile, and worked with the local team at the Instituto de Estudios Indígenas (Universidad de la Frontera) to develop resources and NLP tools for Mapudungun-Spanish (transcribed spoken corpus, dictionary, morphological analyzer). I also provided technical assistance in setting up and using software developed at CMU.

For my thesis, I am working on automatically refining translation rules, by using minimal corrections from non-expert bilingual users. More

Quechua

Quechua or runasimi, which means language of the people, is the indigenous language of a large portion of the South American highlands, and there are about 10 million speakers today. However, we know of no electronic resources in Quechua, let alone any information and communication technologies in Quechua.

The term Quechua covers a variety of distinct languages and dialects. The Ethnologue database shows 46 dialects of Quechuan, 32 of them spoken in Peru. Quechua is also spoken in Bolivia, Ecuador, southern Colombia and northern Argentina. The most important dialect is that spoken in Cuzco, the seat of the former Inca Empire. Quechua spread by means of conquests carried out before and during that empire. It displaced several earlier languages, only to find itself increasingly displaced today by Spanish. In spite of this intense competition, Quechua in its various forms remains a vital language in Peru and elsewhere.

A piece of good news for us computational linguists is that the endless battle over which of the two competing orthographies, the pentavocalic or the trivocalic, should be the official one has finally ended in favor of the pentavocalic system, which has the closest correspondence with the Quechuan letter-to-sound rules.

In the Spring 2005 semester, I audited Quechua II at the University of Pittsburgh, taught by Salome Gutierrez. During my time in Cusco (June-August 2005), I studied both the Quechua language and culture at Centro Bartolome de las Casas, where I enjoyed daily classes taught by native speaker and educator Gina Maldonado.

Mapudungun

Mapudungun is an American Indigenous language spoken in Chile and Argentina by about half a million Mapuche people.

Our Chilean local team is located in the Instituto de Estudios Indígenas, Universidad de la Frontera, in Temuco, Chile.

The first AVENUE Mapudungun-Spanish Machine Translation system is now available online.

For some information on the Mapuche people and their language, Mapudungun, you can visit these links:

About their history

About their language

To see their flags

PhD research related publications

Some recent papers and presentations

"A Walk on the Other Side: Adding Statistical Components to a Transfer-Based Translation System" with Stephan Vogel. To appear in Syntax and Structure in Statistical Translation (SSST) Workshop at HLT-NAACL, 26 April 2007, Rochester, New York, USA.

pdf

"The Inner Works of an Automatic Rule Refiner for Machine Translation" with William Ridmann. METIS-II Workshop, January 11, 2007, Leuven, Belgium.

pdf

"Automating Post-Editing to Improve MT Systems" with Jaime Carbonell. Automated Post-Editing Workshop at AMTA, August 12, 2006, Boston, USA.

pdf

"Giving the Power to Bilingual Speakers". Position Paper for the Automated Post-Editing Workshop at AMTA, August 12, 2006, Boston, USA.

pdf

"Can the Internet help improve Machine Translation". Doctoral Consortium at HLT-NAACL, June 4, 2006, New York, USA.

pdf [slides] [poster]

"A Framework for Interactive and Automatic Refinement of Transfer-based Machine Translation" with Jaime Carbonell and Alon Lavie. EAMT 10th Annual Conference 30-31 May 2005, Budapest, Hungary.

pdf

"Building Machine translation systems for indigenous languages" with Roberto Aranovich and Lori Levin. Second Conference on the Indigenous Languages of Latin America (CILLA II), 27-29 October 2005, Texas, USA.

pdf

"Error Analysis of Two Types of Grammar for the Purpose of Automatic Rule Refinement" with Katharina Probst and Jaime Carbonell. Forthcoming at AMTA, 2004.

postscript pdf

"The Translation Correction Tool: English-Spanish user studies" with Jaime Carbonell. LREC, 2004. Lisbon, Portugal.

postscript pdf

Lavie, A., S. Vogel, L. Levin, E. Peterson, K. Probst, A. Font Llitjos, R. Reynolds, J. Carbonell, and R. Cohen, "Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario". ACM Transactions on Asian Language Information Processing (TALIP), to appear in 2(2), June 2003.

postscript pdf

Two HCI project proposals presented December 11, 2002 powerpoint presentation

My 2 cents to the panel "From Bits to Bots: Women Everywhere, Leading the Way": AVENUE, Automatic Machine Translation for low-density languages. Grace Hopper Celebration, 2002, Vancouver, Canada. With Lenore Blum, Anastassia Ailamaki, Manuela Veloso, Sonya Allin and M. Bernardine Dias. powerpoint presentation

Here is a brief overview of the Catalan language (català) for MT, which I presented February 17, 2003, during one of our surprise language exercise meetings: powerpoint
[To my grandfather, Joan Llitjós Armengou, a generous and positive man who worked all his life to give us a better life.] And here is a very nice and thorough description by the Gran Enciclopedia Catalana (GREC) Word

Past research: Speech

CMU Speech group (only from CMU)

A couple of years ago, I worked on:

Investigating, with Alan Black, the least amount of data needed to build a voice
Evaluating our pronunciation models online: www.pronounce-names.org
Unsupervised Clustering of words guided by their pronunciation

In September 2001, I defended my Master's thesis on:
Improving Pronunciation Accuracy of Proper Names with Language Origin Classes
postscript   pdf    slides
For less detailed versions, you can take a look at the Eurospeech '01 paper or at some of the related talks I've given:
       - Eurospeech presentation (September 7, 2001)   powerpoint (~15 minutes)
       - presentation in Catalan (July 18, 2001)   powerpoint (~20 minutes)
       - talk at the Sphinx lunch (June 21, 2001)    postscript    powerpoint (~45 minutes)

Please take a minute to participate in the evaluation of our pronunciation models by going to the
                   PRONUNCIATION OF PROPER NAMES SITE

In the past, I have worked on:

VXML Dialog Systems (BusLine)
Writing Spanish generation grammars for multilingual Machine Translation systems (JANUS and NLPWIN)
Writing a Catalan Constraint Grammar
Writing LFG-based Catalan analysis and generation grammars in LEKTA

Ariadna Font Llitjós

Last modified: Mon May 13 21:16:17 EDT 2002