Nguyen Bach (Bạch Hưng Nguyên)
Graduate Research Assistant

CONTACT INFORMATION

Language Technologies Institute (Affiliated with interACT Research)
School of Computer Science
Carnegie Mellon University
407 South Craig Street, Pittsburgh, PA 15213, USA
Email: x@y where x=nbach; y=cs.cmu.edu
Phone: gif-uks-pihu

RESEARCH INTERESTS

  • Machine Translation
  • Natural language processing
  • Speech recognition and synthesis
  • Machine learning
  • Information Retrieval


I am a PhD student working under the supervision of Prof. Alex Waibel and Prof. Stephan Vogel in the CMU Statistical Machine Translation group. My current research exploits dependency structures from text and speech for statistical machine translation. I am working on the GALE and TransTac projects.

PUBLICATIONS

2009

  • Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints
    Nguyen Bach, Qin Gao and Stephan Vogel
    In Proceedings of the 12th Machine Translation Summit (MT Summit XII), August 2009, Ottawa, Ontario, Canada. [Abstract], [Slides], [Bib].
    We propose a novel source-side dependency tree reordering model for statistical machine translation, in which subtree movements and constraints are represented as reordering events associated with the widely used lexicalized reordering models. This model allows us not only to capture efficiently the statistical distribution of the subtree-to-subtree transitions in training data, but also to use it directly at decoding time to guide the search process. Using subtree movements and constraints as features in a log-linear model, we are able to help the reordering models make better selections. It also allows the subtle importance of monolingual syntactic movements to be learned alongside other reordering features. We show improvements in translation quality in English-Spanish and English-Iraqi translation tasks.

  • Virtual Babel: Towards Context-Aware Machine Translation in Virtual Worlds
    Ying Zhang and Nguyen Bach
    In Proceedings of the 12th Machine Translation Summit (MT Summit XII), August 2009, Ottawa, Ontario, Canada. [Abstract], [Slides], [Bib].
    In this paper, we describe our ongoing research project, Virtual Babel, a context-aware machine translation system for Second Life, one of the most popular virtual worlds. We augment the Second Life viewer to intercept incoming and outgoing chat messages and reroute them to a statistical machine translation server. The returned translations are appended to the original text message to help users understand the foreign language. Virtual Babel provides a platform to study cross-lingual conversations facilitated by machine translation in virtual worlds, and we observe interesting phenomena that are not present in document translations. Virtual Babel is aware of the non-verbal context of the conversation. Language and translation models are trained from collected conversations and are used to generate translations according to the observed non-verbal context of the conversation.

  • Cohesive Constraints in A Beam Search Phrase-based Decoder
    Nguyen Bach, Stephan Vogel and Colin Cherry
    In Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference (NAACL-HLT 2009), May/June 2009, Boulder, CO, USA. [Abstract], [Slides], [Bib].
    Cohesive constraints allow the phrase-based decoder to employ arbitrary, non-syntactic phrases, and encourage it to translate those phrases in an order that respects the source dependency tree structure. We present extensions of the cohesive constraints, such as exhaustive interruption count and rich interruption check. Furthermore, we present analyses related to the impact of cohesive constraints across language pairs with different reordering models and dependency parsers. Our experiments show that the cohesion-enhanced decoder performs statistically significantly better than the standard phrase-based decoder on English-Spanish. Improvements between 0.4 and 1.8 BLEU points are also obtained on English-Iraqi, Arabic-English and Chinese-English systems.

  • Incremental Adaptation of Speech-to-Speech Translation
    Nguyen Bach, Roger Hsiao, Matthias Eck, Paisarn Charoenpornsawat, Stephan Vogel, Tanja Schultz, Ian Lane, Alex Waibel and Alan W. Black
    In Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference (NAACL-HLT 2009), May/June 2009, Boulder, CO, USA. [Abstract], [Slides], [Bib].
    In building practical two-way speech-to-speech translation systems, the end user will always wish to use the system in an environment different from the original training data. As with all speech systems, it is important to allow the system to adapt to the actual usage situations. This paper investigates how a speech-to-speech translation system can adapt day-to-day from collected data on day one to improve performance on day two. The platform is the CMU Iraqi-English portable two-way speech-to-speech system as developed under the DARPA TransTac program. We show how machine translation, speech recognition and overall system performance can be improved on day two after adapting from day one in both a supervised and unsupervised way.

2008

2007

  • A Log-linear Block Transliteration Model based on Bi-Stream HMMs
    Bing Zhao, Nguyen Bach, Ian Lane and Stephan Vogel
    In Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference (NAACL-HLT 2007), pp. 364-371, April 2007, Rochester, NY, USA. [Abstract], [Slides], [Bib], [Test set].
    We propose a novel HMM-based framework to accurately transliterate unseen named entities. The framework leverages features in letter alignment and letter n-gram pairs learned from available bilingual dictionaries. Letter classes, such as vowels/non-vowels, are integrated to further improve transliteration accuracy. The proposed transliteration system is applied to out-of-vocabulary named entities in statistical machine translation (SMT), and a significant improvement over a traditional transliteration approach is obtained. Furthermore, by incorporating an automatic spell-checker based on statistics collected from web search engines, transliteration accuracy is further improved. The proposed system is implemented within our SMT system and applied to a real translation scenario from Arabic to English.

  • The CMU TransTac 2007 Eyes-free and Hands-free Two-way Speech-to-Speech Translation System
    Nguyen Bach, Matthias Eck, Paisarn Charoenpornsawat, Thilo Kohler, Sebastian Stuker, ThuyLinh Nguyen, Roger Hsiao, Alex Waibel, Stephan Vogel, Tanja Schultz and Alan W. Black
    In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2007), October 2007, Trento, Italy. [Abstract], [Slides], [Bib].
    The paper describes our portable two-way speech-to-speech translation system using a completely eyes-free/hands-free user interface. This system translates between the language pair English and Iraqi Arabic as well as between English and Farsi, and was built within the framework of the DARPA TransTac program. The Farsi language support was developed within a 90-day period, testing our ability to rapidly support new languages. The paper gives an overview of the system’s components along with the individual component objective measures and a discussion of issues relevant for the overall usage of the system. We found that usability, flexibility, and robustness serve as severe constraints on system architecture and design.

  • Handling OOV Words in Arabic ASR Via Flexible Morphological Constraints
    Nguyen Bach, Mohamed Noamany, Ian Lane and Tanja Schultz
    In Proceedings of INTERSPEECH (Interspeech-2007), August 2007, Antwerp, Belgium. [Abstract], [Slides], [Bib].
    We propose a novel framework to detect and recognize out-of-vocabulary (OOV) words in automatic speech recognition (ASR). In the proposed framework, a hybrid language model combining words and sub-word units is incorporated during ASR decoding, and then three different OOV word recognition methods are applied to generate OOV word hypotheses: dictionary lookup, morphological composition, and direct phoneme-to-grapheme conversion. The proposed approach reduced WER by 1.9% and 1.6% for ASR systems with recognition vocabularies of 30K and 219K, respectively. Moreover, it correctly recognized 5% of OOV words.

  • The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007
    Ian Lane, Andreas Zollmann, ThuyLinh Nguyen, Nguyen Bach, Ashish Venugopal, Stephan Vogel, Kay Rottmann, Ying Zhang and Alex Waibel
    In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2007), October 2007, Trento, Italy. [Abstract], [Slides], [Bib].
    This paper describes the CMU-UKA statistical machine translation systems submitted to the IWSLT 2007 evaluation campaign. Systems were submitted for three language-pairs: Japanese-English, Chinese-English and Arabic-English. All systems were based on a common phrase-based SMT (statistical machine translation) framework but for each language-pair a specific research problem was tackled. For Japanese-English we focused on two problems: first, punctuation recovery, and second, how to incorporate topic-knowledge into the translation framework. Our Chinese-English submission focused on syntax augmented SMT and for the Arabic-English task we focused on incorporating morphological-decomposition into the SMT framework. This research strategy enabled us to evaluate a wide variety of approaches which proved effective for the language pairs they were evaluated on.

2006

Before 2005


TALKS

IMPLEMENTATIONS – UNPUBLISHED REPORTS


Nguyen Bach

Last modified: August 18, 2009
