Nguyen Bach (Bạch Hưng Nguyên)
Graduate Research Assistant
- Machine Translation
- Natural language processing
- Speech recognition and synthesis
- Machine learning
- Information Retrieval
I am a PhD student working under the supervision of Prof. Alex Waibel and Prof. Stephan Vogel in the CMU Statistical Machine Translation group. My current research exploits dependency structures from text and speech for statistical machine translation. I am working on the GALE and TransTac projects.
2009
- Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints
Nguyen Bach, Qin Gao and Stephan Vogel
In Proceedings of the 12th Machine Translation Summit (MT Summit XII), August 2009, Ottawa, Ontario, Canada.
[Abstract], [Slides], [Bib].
We propose a novel source-side dependency tree reordering model for statistical machine translation, in which subtree movements and constraints are represented as reordering events associated with the widely used lexicalized reordering models. This model allows us to not only efficiently capture the statistical distribution of the subtree-to-subtree transitions in training data, but also utilize it directly at the decoding time to guide the search process. Using subtree movements and constraints as features in a log-linear model, we are able to help the reordering models make better selections. It also allows the subtle importance of monolingual syntactic movements to be learned alongside other reordering features. We show improvements in translation quality in English-Spanish and English-Iraqi translation tasks.
- Virtual Babel: Towards Context-Aware Machine Translation in Virtual Worlds
Ying Zhang and Nguyen Bach
In Proceedings of the 12th Machine Translation Summit (MT Summit XII), August 2009, Ottawa, Ontario, Canada.
[Abstract], [Slides], [Bib].
In this paper, we describe our ongoing research project of Virtual Babel, a context-aware machine translation system for Second Life, one of the most popular virtual worlds. We augment the Second Life viewer to intercept the incoming/outgoing chat messages and reroute the message to a statistical machine translation server. The returned translations are appended to the original text message to help users to understand the foreign language. Virtual Babel provides a platform to study cross-lingual conversations facilitated by machine translation in virtual worlds and we observe interesting phenomena that are not present in document translations. Virtual Babel is aware of the non-verbal context of the conversation. Language model and translation models are trained from collected conversations and are used to generate translations according to observed non-verbal context of the conversation.
- Cohesive Constraints in A Beam Search Phrase-based Decoder
Nguyen Bach, Stephan Vogel and Colin Cherry
In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies Conference (NAACL-HLT 2009), Boulder, CO, USA, May/June 2009.
[Abstract], [Slides], [Bib].
Cohesive constraints allow the phrase-based decoder to employ arbitrary, non-syntactic phrases, and encourage it to translate those phrases in an order that respects the source dependency tree structure. We present extensions of the cohesive constraints, such as exhaustive interruption count and rich interruption check. Furthermore, we present analyses related to the impact of cohesive constraints across language pairs with different reordering models and dependency parsers. Our experiments show that the cohesion-enhanced decoder performs statistically significantly better than the standard phrase-based decoder on English-Spanish. Improvements between 0.4 and 1.8 BLEU points are also obtained on English-Iraqi, Arabic-English and Chinese-English systems.
- Incremental Adaptation of Speech-to-Speech Translation
Nguyen Bach, Roger Hsiao, Matthias Eck, Paisarn Charoenpornsawat, Stephan Vogel, Tanja Schultz, Ian Lane, Alex Waibel and Alan W. Black
In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies Conference (NAACL-HLT 2009), Boulder, CO, USA, May/June 2009.
[Abstract], [Slides], [Bib].
In building practical two-way speech-to-speech translation systems, the end user will always wish to use the system in an environment different from the original training data. As with all speech systems, it is important to allow the system to adapt to the actual usage situations. This paper investigates how a speech-to-speech translation system can adapt day-to-day from collected data on day one to improve performance on day two. The platform is the CMU Iraqi-English portable two-way speech-to-speech system as developed under the DARPA TransTac program. We show how machine translation, speech recognition and overall system performance can be improved on day 2 after adapting from day 1 in both a supervised and unsupervised way.
2008
- Improving Word Alignment with Language Model Based Confidence Scores
Nguyen Bach, Qin Gao and Stephan Vogel
In Proceedings of the ACL 2008 Third Workshop on Statistical Machine Translation (ACL-08:HLT, WSMT), June 2008, Columbus, Ohio, USA.
[Abstract], [Slides], [Bib], [MGIZA++].
This paper describes the statistical machine translation systems submitted to the ACL-WMT 2008 shared translation task. Systems were submitted for two translation directions: English-Spanish and Spanish-English. Using sentence pair confidence scores estimated with source and target language models, improvements are observed on the News-Commentary test sets. Genre-dependent sentence pair confidence score and integration of sentence pair confidence score into phrase table are also investigated.
- Recent Improvements in the CMU Large Scale Chinese-English SMT System
Almut Silja Hildebrand, Kay Rottmann, Mohamed Noamany, Qin Gao, Sanjika Hewavitharana, Nguyen Bach and Stephan Vogel
In Proceedings of the Annual Meeting of the Association for Computational Linguistics with the Human Language Technology Conference (ACL-08:HLT), June 2008, Columbus, Ohio, USA.
[Abstract], [Slides], [Bib].
In this paper we describe recent improvements to components and methods used in our statistical machine translation system for Chinese-English used in the January 2008 GALE evaluation. Main improvements are results of consistent data processing, larger statistical models and a POS-based word reordering approach.
2007
- A Log-linear Block Transliteration Model based on Bi-Stream HMMs
Bing Zhao, Nguyen Bach, Ian Lane and Stephan Vogel
In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies Conference (NAACL-HLT 2007), pp. 364-371, Rochester, NY, USA, April 2007.
[Abstract], [Slides], [Bib], [Test set].
We propose a novel HMM-based framework to accurately transliterate unseen named entities. The framework leverages features in letter alignment and letter n-gram pairs learned from available bilingual dictionaries. Letter classes, such as vowels/non-vowels, are integrated to further improve transliteration accuracy. The proposed transliteration system is applied to out-of-vocabulary named entities in statistical machine translation (SMT), and a significant improvement over a traditional transliteration approach is obtained. Furthermore, by incorporating an automatic spell-checker based on statistics collected from web search engines, transliteration accuracy is further improved. The proposed system is implemented within our SMT system and applied to a real translation scenario from Arabic to English.
- The CMU TransTac 2007 Eyes-free and Hands-free Two-way Speech-to-Speech Translation System
Nguyen Bach, Matthias Eck, Paisarn Charoenpornsawat, Thilo Kohler, Sebastian Stuker, ThuyLinh Nguyen, Roger Hsiao, Alex Waibel, Stephan Vogel, Tanja Schultz and Alan W. Black
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2007), October 2007, Trento, Italy.
[Abstract], [Slides], [Bib].
The paper describes our portable two-way speech-to-speech translation system using a completely eyes-free/hands-free user interface. This system translates between the language pair English and Iraqi Arabic as well as between English and Farsi, and was built within the framework of the DARPA TransTac program. The Farsi language support was developed within a 90-day period, testing our ability to rapidly support new languages. The paper gives an overview of the system’s components along with the individual component objective measures and a discussion of issues relevant for the overall usage of the system. We found that usability, flexibility, and robustness serve as severe constraints on system architecture and design.
- Handling OOV Words in Arabic ASR Via Flexible Morphological Constraints
Nguyen Bach, Mohamed Noamany, Ian Lane and Tanja Schultz
In Proceedings of the INTERSPEECH (Interspeech-2007), August 2007, Antwerp, Belgium.
[Abstract], [Slides], [Bib].
We propose a novel framework to detect and recognize out-of-vocabulary (OOV) words in automated speech recognition (ASR). In the proposed framework, a hybrid language model combining words and sub-word units is incorporated during ASR decoding; three different OOV word recognition methods are then applied to generate OOV word hypotheses: dictionary lookup, morphological composition, and direct phoneme-to-grapheme conversion. The proposed approach successfully reduced WER by 1.9% and 1.6% for ASR systems with recognition vocabularies of 30K and 219K, respectively. Moreover, the proposed approach correctly recognized 5% of OOV words.
- The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007
Ian Lane, Andreas Zollmann, ThuyLinh Nguyen, Nguyen Bach, Ashish Venugopal, Stephan Vogel, Kay Rottmann, Ying Zhang and Alex Waibel
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2007), October 2007, Trento, Italy.
[Abstract], [Slides], [Bib].
This paper describes the CMU-UKA statistical machine translation systems submitted to the IWSLT 2007 evaluation campaign. Systems were submitted for three language-pairs: Japanese-English, Chinese-English and Arabic-English. All systems were based on a common phrase-based SMT (statistical machine translation) framework but for each language-pair a specific research problem was tackled. For Japanese-English we focused on two problems: first, punctuation recovery, and second, how to incorporate topic-knowledge into the translation framework. Our Chinese-English submission focused on syntax augmented SMT and for the Arabic-English task we focused on incorporating morphological-decomposition into the SMT framework. This research strategy enabled us to evaluate a wide variety of approaches which proved effective for the language pairs they were evaluated on.
2006
Before 2005
- Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese
Hansjoerg Mixdorff, Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong
In Proceedings of the EUROSPEECH (Eurospeech-2003), pp. 177-180, Sep 2003, Geneva, Switzerland.
- Analysis F0 Contours Using the Fujisaki model for Vietnamese Tones
Bach Hung Nguyen and Nguyen Tien Dung
In Proceedings of the National Informatics Conference, Thai Nguyen, Vietnam, 2003.
- Application of Dynamic Time Warping Algorithm for the Recognition of Vietnamese Isolated Words
Bach Hung Nguyen and Luong Chi Mai
In Journal of Science and Technology, N.5, 2002, Vietnam.
- Qin Gao, Alok Parlikar, Nguyen Bach, and Stephan Vogel, 'Statistical Machine Translation: Parallel Processing for Large Data Situations,' Intel Research Pittsburgh Open House 2008, October 2008, Pittsburgh, PA, USA. [PDF]
- Simulating Sentence Pairs Sampling Process via Source and Target Language Models, MT Lunch, April 2008, Carnegie Mellon University
- Translating Words You've Never Seen, Student Research Symposium 2006, Language Technologies Institute, Carnegie Mellon University
IMPLEMENTATIONS – UNPUBLISHED REPORTS
- Translate Arabic OOV words by Transformation Transliteration Rules, March 2006, Carnegie Mellon University [available inside CMU]
- N. Bach, 'MetaShopper - a preliminary study and implementation', May 2004, Johns Hopkins University
You can try the implementation here: VeryNaiveBookCrawler
- N. Bach, S. Reddy, 'A preliminary quantitative study on the characteristics of Vietnamese vowels and English vowels', May 2004, Johns Hopkins University
- A random sentence generator. Each time you run the generator, it reads a context-free grammar from a file and prints one or more random sentences. I wrote this small program in September 2003 and updated it in June 2004. You can try it here: 10 English sentences or 10 Vietnamese sentences in Nguyen_Binh's style
- A text classifier. The program is trained on two corpora, for example spam and not-spam, or English and Spanish. Given an email, it assigns the email to one of the two training classes: as a spam detector, it decides whether the email is spam or not-spam; for language identification, it decides whether the email is written in English or Spanish. Smoothing sharply decreases the error rate. I tried uniform, add-lambda, add-lambda backoff, and Witten-Bell backoff smoothing.
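To illustrate the add-lambda idea behind the text classifier above, here is a minimal Python sketch. It is not the original program: the unigram models, the equal class priors, the toy training sentences, and the names `train_unigram`, `log_prob`, and `classify` are all assumptions made for this example.

```python
import math
from collections import Counter


def train_unigram(docs):
    """Count word tokens over a list of training documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def log_prob(text, counts, vocab_size, lam=0.1):
    """Add-lambda smoothed unigram log-probability of `text` (requires lam > 0)."""
    total = sum(counts.values())
    lp = 0.0
    for word in text.lower().split():
        # Every word type, seen or unseen, receives lam extra pseudo-counts.
        lp += math.log((counts[word] + lam) / (total + lam * vocab_size))
    return lp


def classify(text, models, lam=0.1):
    """Return the label whose model gives `text` the highest log-probability.

    Assumes equal prior probability for every class.
    """
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    vocab_size = len(vocab) + 1  # one extra type standing in for unseen words
    return max(models, key=lambda label: log_prob(text, models[label], vocab_size, lam))


# Toy training data, invented for this illustration.
models = {
    "english": train_unigram(["the cat sat on the mat", "I like the book"]),
    "spanish": train_unigram(["el gato está en la casa", "me gusta el libro"]),
}
print(classify("the book is on the mat", models))
print(classify("el libro está en la casa", models))
```

Without smoothing, a single unseen word such as "is" would give the whole message zero probability under both models; the pseudo-counts avoid that, which is one reason smoothing lowers the error rate.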
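The random sentence generator described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the original program: the grammar format (`LHS -> symbols | symbols`), the toy grammar, and the names `parse_grammar` and `generate` are hypothetical. The grammar is read from an iterable of lines, so an open file object would work in place of the inline string.

```python
import random


def parse_grammar(lines):
    """Parse lines of the (assumed) form 'LHS -> a b | c' into a rule table."""
    rules = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        lhs, rhs = line.split("->", 1)
        for alternative in rhs.split("|"):
            rules.setdefault(lhs.strip(), []).append(alternative.split())
    return rules


def generate(rules, symbol="S", rng=random):
    """Expand `symbol` recursively, picking one production uniformly at random."""
    if symbol not in rules:
        return [symbol]  # a symbol with no rule is treated as a terminal word
    words = []
    for child in rng.choice(rules[symbol]):
        words.extend(generate(rules, child, rng))
    return words


# Toy grammar, invented for this illustration.
GRAMMAR = """
S -> NP VP
NP -> Det N
VP -> V NP
Det -> the | a
N -> student | translator
V -> sees | helps
"""

rules = parse_grammar(GRAMMAR.splitlines())
for _ in range(3):
    print(" ".join(generate(rules)))
```

The toy grammar is non-recursive, so every expansion terminates; a recursive grammar would need a depth limit or production weights to keep sentences finite in practice.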
Nguyen Bach
Last modified: August 18, 2009