Alon's Home Page

Dr. Alon Lavie
Research Professor
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA USA

Phone: +1-412-268-5655
Email: alavie AT cs DOT cmu DOT edu (anti-spam notation).



Teaching

I'm the main instructor of the Algorithms for NLP (11-711) course at the LTI. Algorithms for NLP is an introductory graduate-level course on the computational properties of natural languages and the fundamental algorithms for processing natural languages. The course provides an in-depth presentation of the major algorithms used in NLP, including Lexical, Morphological, Syntactic and Semantic analysis, with the primary focus on parsing algorithms and their analysis.

I am also a co-instructor of the Machine Translation (11-731) course, and co-supervise the NLP Lab (11-712) and the MT Lab (11-732) courses.


Research

My main areas of research are Machine Translation (MT) of both text and speech, and Spoken Language Understanding (SLU). My most active current research is on developing a general framework for syntax-driven Machine Translation, applicable to a variety of data scenarios. This framework is being applied to developing MT prototype systems for languages with limited amounts of electronic resources. It is also being applied to data-rich scenarios. One main focus of this work is the development of novel syntax-based methods for acquisition of the resources that are necessary for MT. I am also actively working on frameworks for Multi-Engine Machine Translation (MEMT) and on developing automatic metrics for MT evaluation. Another current research project is developing parsing approaches for accurate annotation of Grammatical Relations (GRs) in spoken language data. I have worked extensively on the design and development of Speech-to-Speech Machine Translation systems and on robust parsing algorithms for analysis of spoken language.

Current Research Projects

The AVENUE and LETRAS Projects:

I am a co-PI of the AVENUE and LETRAS projects (funded by NSF). AVENUE is concerned with the design and rapid development of new Machine Translation methods for languages for which only scarce resources are available. Our goal in AVENUE is to apply these new MT methods to minority languages, with a specific focus on native languages of North and Latin America. We worked on developing MT systems between Spanish and Mapudungun, a native language spoken in southern Chile, and have started working on Quechua, a native language spoken mainly in Peru, Ecuador and Bolivia. The LETRAS project is a follow-on project to AVENUE, where we are focusing on further development of the underlying general MT framework and expanding its application to new languages, including Inupiaq (a native Alaskan language), and native languages in Bolivia and Brazil. Together with Jaime Carbonell, Lori Levin, and a team of several graduate students, the primary research topics I am working on include: the design and implementation of a transfer-based MT framework specifically suitable for learning from data and for rapid prototyping of MT systems (work with Erik Peterson); automatic learning of MT transfer-rules for languages with limited amounts of data resources (work with Kathrin Probst); automatic rule refinement based on feedback from users (work with Ariadna Font-Llitjos); and unsupervised learning of morphological inflection classes from monolingual data (work with Christian Monson).
Select Publications:
  • 2003, Lavie, A., S. Vogel, L. Levin, E. Peterson, K. Probst, A. Font Llitjos, R. Reynolds, J. Carbonell, and R. Cohen, "Experiments with a Hindi-to-English Transfer-based MT System under a Miserly Data Scenario". ACM Transactions on Asian Language Information Processing (TALIP), 2(2).
  • 2002, Probst, K., L. Levin, E. Peterson, A. Lavie, and J. Carbonell, "MT for Minority Languages Using Elicitation-Based Learning of Syntactic Transfer Rules". Machine Translation, 17(4).
The Hebrew-English MT Project:

    As a direct follow-up to our AVENUE project work and in collaboration with Shuly Wintner and his Computational Linguistics Group at the University of Haifa in Israel, we are developing a prototype Hebrew-to-English Machine Translation system that is based on the framework developed under AVENUE. This work is being supported by a small grant from the Caesarea Rothschild Institute at the University of Haifa.
    Select Publications:
  • 2004, Lavie, A., S. Wintner, Y. Eytani, E. Peterson and K. Probst. "Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System". In Proceedings of the 10th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI-2004), Baltimore, MD, October 2004.
The MEMT Project:

    I am the lead PI of a project on a new approach to Multi-Engine Machine Translation (MEMT). The goal of MEMT is to synthesize the output of multiple MT systems into a new output that is of higher accuracy than all of the contributing systems. The new approach involves two main stages. An explicit word matcher is first used to identify the words that are common between the MT engine outputs. A decoding algorithm then uses this information, in conjunction with confidence estimates for the various engines and a language model, to score and rank a collection of sentence hypotheses that are synthetic combinations of words from the various original engines. The highest scoring sentence hypothesis is selected as the final output of our system. The project is currently being funded by the DARPA GALE program, where our MEMT system serves as an essential component for combining the output from multiple MT engines within the Interoperability Demonstration system (IOD). The MEMT system has been made available for experimentation to other research groups. Contact me by email to obtain a copy.
    Select Publications:
  • 2005, Jayaraman, S. and A. Lavie. "Multi-Engine Machine Translation Guided by Explicit Word Matching". In Proceedings of the 10th Annual Conference of the European Association for Machine Translation (EAMT-2005), Budapest, Hungary, May 2005.
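
    As a rough illustration of the first stage described above (explicit word matching), the following minimal Python sketch pairs identical words across two tokenized MT outputs. It is a simplified assumption for illustration only, not the MEMT system's actual matcher: the function name match_words is hypothetical, matching is by exact case-insensitive string identity, and the decoding stage (engine confidences and language model scoring) is not shown.

```python
from typing import List, Tuple


def match_words(output_a: List[str], output_b: List[str]) -> List[Tuple[int, int]]:
    """Pair identical words across two tokenized MT outputs.

    Hypothetical helper: scans output_a left to right and pairs each word
    with the first not-yet-used identical word in output_b, so every word
    takes part in at most one match. Returns (index_in_a, index_in_b) pairs.
    """
    used_b = set()  # positions in output_b that are already matched
    pairs = []
    for i, word in enumerate(output_a):
        for j, other in enumerate(output_b):
            if j not in used_b and word.lower() == other.lower():
                used_b.add(j)
                pairs.append((i, j))
                break
    return pairs


if __name__ == "__main__":
    engine_1 = "the house is small".split()
    engine_2 = "the small house".split()
    print(match_words(engine_1, engine_2))
```

    The matched positions are the kind of information a later decoding step could use when assembling and ranking combined sentence hypotheses.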
The METEOR Project:

    METEOR is an automatic metric for MT evaluation that we have been developing at CMU for the past couple of years. METEOR is designed to address a number of weaknesses in the widely used BLEU and NIST metrics. The metric relies heavily on an algorithm for finding an optimal word-to-word matching between a candidate MT translation and a human-produced reference translation for the same input sentence. METEOR produces normalized scores (in the range [0,1]), and has been demonstrated to have significantly higher levels of correlation with human judgments of MT quality than BLEU and NIST. METEOR is freely available for download.
    Select Publications:
  • 2007, Lavie, A. and A. Agarwal, "METEOR: An Automatic Metric for MT Evaluation with High Levels of Correlation with Human Judgments". In Proceedings of the Second Workshop on Statistical Machine Translation at the 45th Meeting of the Association for Computational Linguistics (ACL-2007), Prague, Czech Republic, June 2007. Pages 228-231.
  • 2005, Banerjee, S. and A. Lavie, "METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments". In Proceedings of Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization at the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005.
  • 2004, Lavie, A., K. Sagae and S. Jayaraman. "The Significance of Recall in Automatic Metrics for MT Evaluation". In Proceedings of the 6th Conference of the Association for Machine Translation in the Americas (AMTA-2004), Washington, DC, September 2004.
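
    To make the idea of a matching-based score in [0,1] concrete, here is a minimal Python sketch of a unigram-matching score between a candidate and a reference translation. It is an illustrative simplification under stated assumptions, not the METEOR implementation: it counts only exact word matches, ignores word order, and combines unigram precision and recall in a recall-weighted harmonic mean. The function name and the default recall_weight value are assumptions made for this example.

```python
from collections import Counter


def unigram_match_score(candidate: str, reference: str,
                        recall_weight: float = 0.9) -> float:
    """Return a score in [0, 1] based on exact unigram matches.

    Hypothetical helper: recall_weight (assumed default 0.9) controls how
    strongly recall dominates precision in the weighted harmonic mean.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection: each reference word is matched at most once.
    matches = sum((Counter(cand) & Counter(ref)).values())
    if matches == 0:
        return 0.0
    precision = matches / len(cand)
    recall = matches / len(ref)
    # Weighted harmonic mean of precision and recall.
    return (precision * recall) / (
        recall_weight * precision + (1.0 - recall_weight) * recall
    )


if __name__ == "__main__":
    reference = "the cat sat on the mat"
    print(unigram_match_score("the cat sat on the mat", reference))
    print(unigram_match_score("the cat", reference))
```

    In this sketch a short candidate that covers only part of the reference is penalized mainly through its low recall, even when every word it contains is correct.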
The GRASP Project:

    I am PI of the GRASP Project (funded by NSF), where I am working together with Brian MacWhinney (co-PI) and Kenji Sagae on developing a framework for robust high-accuracy parsing of grammatical relations in spoken language data. Our goal is to automatically annotate the CHILDES database (a large database of child-parent conversations) with grammatical relations, in order to support advanced corpus-based research of child language acquisition.
    Select Publications:
  • 2007, Sagae, K., E. Davis, A. Lavie, B. MacWhinney and S. Wintner, "High-accuracy Annotation and Parsing of CHILDES Transcripts". In Proceedings of the Workshop on Cognitive Aspects of Computational Language Acquisition at the 45th Meeting of the Association for Computational Linguistics (ACL-2007), Prague, Czech Republic, June 2007. Pages 25-32.
  • 2005, Sagae, K., A. Lavie and B. MacWhinney, "Automatic Measurement of Syntactic Development in Child Language". In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-2005), Ann Arbor, Michigan, June 2005.
  • 2004, Sagae, K., B. MacWhinney and A. Lavie, "Adding Syntactic Annotations to Transcripts of Parent-Child Dialogs". In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004), Lisbon, Portugal, May 2004.
Previous Research Projects

    I was a co-PI of the Nespole! and C-STAR speech translation projects and of the LingWear and Babylon mobile speech translation projects.

  • The Nespole! project (2000-2003) was funded jointly by the European Commission and the US NSF. The main goal of the project was to advance the state of the art in speech-to-speech translation in a real-world setting of common users involved in e-commerce applications. The project was a collaboration between three European research labs (ITC-irst in Trento, Italy; ISL at the University of Karlsruhe in Germany; CLIPS at UJF in Grenoble, France), our research group at CMU, and two industrial partners (APT, the Trentino provincial tourism bureau, and AETHRA, an Italian commercial telecommunications company).
  • The C-STAR project is part of an ongoing joint collaboration between research labs in seven different countries (Japan, Korea, China, Italy, France, Germany and USA) on construction of robust spoken language translation systems for dedicated applications. I was primarily involved in phases C-STAR-II (1996-1999) and C-STAR-III (1999-2002) of the project.
  • The LingWear (2000-2001) and Babylon (2002-2004) projects were concerned with the development of mobile, hand-held Speech Translation applications in support of military and civilian users.
  • I was the lead PI of the AMTEXT project (2003-2005, funded by DoD), a small pilot project that investigated the feasibility of a rapid development approach to Machine Translation based on Information Extraction. The approach builds upon the MT transfer framework developed in the AVENUE project and on Fei Huang's work on translation of Named Entities. The main idea is to use a small elicitation corpus of translated and word-aligned sentences to semi-automatically learn pattern transfer-rules that can then be used both to extract the information of interest in the source language and to translate this information into the target language.

    I was a co-PI of the Clarity project (1997-1999, funded by DoD) on the automatic detection and classification of the discourse structure of spoken language.

    Other Research Interests

    I have a general interest in parsing algorithms for natural and programming languages and in theoretical problems related to parsing. My own research has primarily focused on the area of robust analysis and understanding of spoken language. In my PhD work, I developed GLR*, one of the first robust parsers for spoken language analysis, and a key component in the earlier versions of the JANUS speech translation system.


    My Students

  • Greg Hanneman (PhD)
  • Jonathan Clark (PhD)
  • Kenneth Heafield (PhD)
  • Michael Denkowski (PhD)
  • Austin Matthews (MS)
    My Students that have Graduated

  • Christian Monson (PhD, 2008) (co-advised with Jaime Carbonell)
  • Kenji Sagae (PhD, 2006) (co-advised with Brian MacWhinney)
  • Kathrin Probst (PhD, 2005) (co-advised with Jaime Carbonell and Lori Levin)
  • Chad Langley (PhD, 2003)
  • Hassan Al-Haj (MS, 2011)
  • Alok Parlikar (MS, 2009) (co-advised with Stephan Vogel)
  • Danny Rashid (MS, 2009)
  • Abhaya Agarwal (MS, 2008)
  • Erik Peterson (MS, 2008)
  • Eric Davis (MS, 2008)
  • Shyamsundar Jayaraman (MS, 2005)
  • Matthew Broadhead (MS, 1997)
  • Cortis Clark (MS, 1997)

    Recent Talks and Presentations

  • Evaluating the Output of Machine Translation Systems. Tutorial Presented at the 13th MT Summit, Xiamen, China. September 19, 2011.
  • Statistical MT with Syntax and Morphology: Challenges and Some Solutions. Presentation at LTI Colloquium. September 2, 2011.
  • Machine Translation Overview. Presentation at LTI Immigration Course. August 22, 2011.
  • Evaluation of Machine Translation Systems: Metrics and Methodology. Invited Presentation at the 56th IFIP WG 10.4 Meeting, Obidos, Portugal. July 3, 2009.
  • Stat-XFER: A General Search-based Syntax-driven Framework for MT. Research talk at MT Marathon, Prague. January 26, 2009.
  • Syntax-driven Learning of Sub-sentential Translation Equivalents and Translation Rules from Parsed Parallel Corpora. Presented at the Second Workshop on Syntax and Structure in Statistical Translation at the 46th Meeting of the Association for Computational Linguistics (ACL-2008), Columbus, OH. June 20, 2008.
  • Rapid Prototyping of a Transfer-based Hebrew-to-English Machine Translation System. Presented at ISCOL workshop at the 9th Bar-Ilan Symposium on the Foundations of AI (BISFAI-2007), Bar-Ilan, Israel. June 20, 2007.

  • My Full Publication List


    Miscellaneous Information

  • Full CV
  • Plan File

    Contact Information

    Office:
    5715 Gates-Hillman Complex
    +1-412-268-5655
    Fax: +1-412-268-6298

    Administrative Assistant:
    Mary Jo Bensasi
    65xx Gates-Hillman Complex
    maryjob AT cs DOT cmu DOT edu
    +1-412-268-7517

    Mailing Address:
    Dr. Alon Lavie
    Language Technologies Institute
    School of Computer Science
    Carnegie Mellon University
    5000 Forbes Avenue
    Pittsburgh, PA 15213-3891

    Email:
    alavie AT cs DOT cmu DOT edu (anti-spam notation)

    Home:
    5124 Beeler St.
    Pittsburgh, PA 15217
    +1-412-621-0933