Nguyen Bach's Homepage

Nguyen Bach (Bạch Hưng Nguyên)
Graduate Research Assistant

CONTACT INFORMATION

Language Technologies Institute (Affiliated with interACT Research)
School of Computer Science
Carnegie Mellon University
407 South Craig Street, Pittsburgh, PA 15213, USA
Email: x@y where x=nbach; y=cs.cmu.edu | LinkedIn
Phone: gif-uks-pihu

RESEARCH INTERESTS

Machine Translation
Information Extraction
Speech Recognition and Synthesis
Machine Learning
Information Retrieval

I successfully defended my dissertation on January 3, 2012.

I am a PhD candidate working under the supervision of Prof. Alex Waibel and Prof. Stephan Vogel in the CMU Statistical Machine Translation group. My current research exploits dependency structures from text and speech for statistical machine translation. I am working on the GALE and TransTac projects. I was at the IBM T.J. Watson Research Center for summer and fall 2010, where my project was on estimating machine translation quality automatically.

[Figure: word cloud of the top words appearing in my paper abstracts]

PUBLICATIONS

2012

The SDL Language Weaver Systems in the WMT12 Quality Estimation Shared Task
Radu Soricut, Nguyen Bach, and Ziyuan Wang
In Proceedings of the Seventh Workshop on Statistical Machine Translation (WMT 2012), June 2012, Montreal, Quebec, Canada. [Abstract], [Slides], [Bib].
We present in this paper the system submissions of the SDL Language Weaver team in the WMT 2012 Quality Estimation shared-task. Our MT quality-prediction systems use machine learning techniques (M5P regression-tree and SVM-regression models) and a feature-selection algorithm that has been designed to directly optimize towards the official metrics used in this shared-task. The resulting submissions placed 1st (the M5P model) and 2nd (the SVM model), respectively, on both the Ranking task and the Scoring task, out of 11 participating teams.
```
@InProceedings{soricut-bach-wang:2012:WMT,
  author    = {Soricut, Radu  and  Bach, Nguyen  and  Wang, Ziyuan},
  title     = {The SDL Language Weaver Systems in the WMT12 Quality Estimation Shared Task},
  booktitle = {Proceedings of the Seventh Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2012},
  address   = {Montr{\'e}al, Canada},
  publisher = {Association for Computational Linguistics},
  pages     = {145--151},
  url       = {http://www.aclweb.org/anthology/W12-3118}
}
```

2011

TriS: A Statistical Sentence Simplifier with Log-linear Models and Margin-based Discriminative Training
Nguyen Bach, Qin Gao, Stephan Vogel and Alex Waibel
In Proceedings of the 5th International Joint Conference on Natural Language Processing (IJCNLP 2011), November 2011, Chiang Mai, Thailand. [Abstract], [Slides], [Data], [Binary], [Bib].
We propose a statistical sentence simplification system with log-linear models. In contrast to state-of-the-art methods that drive the sentence simplification process by hand-written linguistic rules, our method uses a margin-based discriminative learning algorithm that operates on a feature set. The feature set is defined on statistics of surface form as well as syntactic and dependency structures of the sentences. A stack decoding algorithm is used, which allows us to efficiently generate and search simplification hypotheses. Experimental results show that the simplified text produced by the proposed system reduces the Flesch-Kincaid grade level by 1.7 when compared with the original text. A comparison of a state-of-the-art rule-based system (Heilman and Smith, 2010) with the proposed system shows an improvement of 0.2, 0.6, and 4.5 points in ROUGE-2, ROUGE-4, and AveF10, respectively.
```
@InProceedings{bach-EtAl:2011:IJCNLP-2011,
  author    = {Bach, Nguyen  and  Gao, Qin  and  Vogel, Stephan  and  Waibel, Alex},
  title     = {TriS: A Statistical Sentence Simplifier with Log-linear Models and Margin-based Discriminative Training},
  booktitle = {Proceedings of 5th International Joint Conference on Natural Language Processing},
  month     = {November},
  year      = {2011},
  address   = {Chiang Mai, Thailand},
  publisher = {Asian Federation of Natural Language Processing},
  pages     = {474--482},
  url       = {http://www.aclweb.org/anthology/I11-1053}
}
```

Goodness: A Method for Measuring Machine Translation Confidence
Nguyen Bach, Fei Huang and Yaser Al-Onaizan
In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL-HLT 2011), June 2011, Portland, Oregon, USA. [Abstract], [Slides], [Visualization], [Bib].
State-of-the-art statistical machine translation (MT) systems have made significant progress towards producing user-acceptable translation output. However, there is still no efficient way for MT systems to inform users which words are likely translated correctly and how confident they are about the whole sentence. We propose a novel framework to predict word-level and sentence-level MT errors with a large number of novel features. Experimental results show that the MT error prediction accuracy is increased from 69.1 to 72.2 in F-score. The Pearson correlation between the proposed confidence measure and the human-targeted translation edit rate (HTER) is 0.6. Improvements between 0.4 and 0.9 TER reduction are obtained with the n-best list reranking task using the proposed confidence measure. Also, we present a visualization prototype of MT errors at the word and sentence levels with the objective of improving post-editor productivity.
```
@InProceedings{bach-huang-alonaizan:2011:ACL-HLT2011,
  author    = {Bach, Nguyen  and  Huang, Fei  and  Al-Onaizan, Yaser},
  title     = {Goodness: A Method for Measuring Machine Translation Confidence},
  booktitle = {Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies},
  month     = {June},
  year      = {2011},
  address   = {Portland, Oregon, USA},
  publisher = {Association for Computational Linguistics},
  pages     = {211--219},
  url       = {http://www.aclweb.org/anthology/P11-1022}
}
```

CMU Haitian Creole-English Translation System for WMT 2011
Sanjika Hewavitharana, Nguyen Bach, Qin Gao, Vamshi Ambati and Stephan Vogel
In Proceedings of the Sixth Workshop on Statistical Machine Translation (WMT 2011), July 2011, Edinburgh, Scotland. [Abstract], [Slides], [Bib].
This paper describes the statistical machine translation system submitted to the WMT11 Featured Translation Task, which involves translating Haitian Creole SMS messages into English. In our experiments we try to address the issue of noisy training data, as well as the lack of parallel training data. Spelling normalization is applied to reduce out-of-vocabulary words in the corpus. Using Semantic Role Labeling rules we expand the available training corpus. We also investigate extracting parallel sentences from comparable corpora to enhance the available parallel data.
```
@InProceedings{hewavitharana-EtAl:2011:WMT,
  author    = {Hewavitharana, Sanjika  and  Bach, Nguyen  and  Gao, Qin  and  Ambati, Vamshi  and  Vogel, Stephan},
  title     = {CMU Haitian Creole-English Translation System for WMT 2011},
  booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
  month     = {July},
  year      = {2011},
  address   = {Edinburgh, Scotland},
  publisher = {Association for Computational Linguistics},
  pages     = {386--392},
  url       = {http://www.aclweb.org/anthology/W11-2146}
}
```

2010

A Semi-Supervised Word Alignment Algorithm with Partial Manual Alignments
Qin Gao, Nguyen Bach and Stephan Vogel
In Proceedings of the ACL 2010 Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR (ACL-2010 WMT), July 2010, Uppsala, Sweden. [Abstract], [Slides], [Bib].
We present a word alignment framework that can incorporate partial manual alignments. The core of the approach is a novel semi-supervised algorithm extending the widely used IBM Models with a constrained EM algorithm. The partial manual alignments can be obtained by human labelling or automatically by high-precision-low-recall heuristics. We demonstrate the use of both methods by selecting alignment links from a manually aligned corpus and by applying links generated from a bilingual dictionary to unlabelled data. For the first method, we conduct controlled experiments on Chinese-English and Arabic-English translation tasks to compare the quality of word alignment, and to measure the effects of two different methods of selecting alignment links from a manually aligned corpus. For the second method, we experimented with a moderate-scale Chinese-English translation task. The experimental results show an average improvement of 0.33 BLEU points across 8 test sets.
```
@InProceedings{gao-bach-vogel:2010:WMT,
  author    = {Gao, Qin  and  Bach, Nguyen  and  Vogel, Stephan},
  title     = {A Semi-Supervised Word Alignment Algorithm with Partial Manual Alignments},
  booktitle = {Proceedings of the Joint Fifth Workshop on Statistical Machine Translation and MetricsMATR},
  month     = {July},
  year      = {2010},
  address   = {Uppsala, Sweden},
  publisher = {Association for Computational Linguistics},
  pages     = {1--10},
  url       = {http://www.aclweb.org/anthology/W10-1701}
}
```

2009

Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints
Nguyen Bach, Qin Gao and Stephan Vogel
In Proceedings of the 12th Machine Translation Summit (MT Summit XII), August 2009, Ottawa, Ontario, Canada. [Abstract], [Slides], [Bib].
We propose a novel source-side dependency tree reordering model for statistical machine translation, in which subtree movements and constraints are represented as reordering events associated with the widely used lexicalized reordering models. This model allows us to not only efficiently capture the statistical distribution of the subtree-to-subtree transitions in training data, but also utilize it directly at the decoding time to guide the search process. Using subtree movements and constraints as features in a log-linear model, we are able to help the reordering models make better selections. It also allows the subtle importance of monolingual syntactic movements to be learned alongside other reordering features. We show improvements in translation quality in English-Spanish and English-Iraqi translation tasks.
```
@InProceedings{bach-gao-vogel:2009:MTSummit,
  author    = {Bach, Nguyen and Gao, Qin and  Vogel, Stephan},
  title     = {Source-side Dependency Tree Reordering Models with Subtree Movements and Constraints},
  booktitle = {Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII)},
  month     = {August},
  year      = {2009},
  address   = {Ottawa, Canada},
  publisher = {International Association for Machine Translation}
}
```

Virtual Babel: Towards Context-Aware Machine Translation in Virtual Worlds
Ying Zhang and Nguyen Bach
In Proceedings of the 12th Machine Translation Summit (MT Summit XII), August 2009, Ottawa, Ontario, Canada. [Abstract], [Slides], [Bib].
In this paper, we describe our ongoing research project of Virtual Babel, a context-aware machine translation system for Second Life, one of the most popular virtual worlds. We augment the Second Life viewer to intercept the incoming/outgoing chat messages and reroute the message to a statistical machine translation server. The returned translations are appended to the original text message to help users to understand the foreign language. Virtual Babel provides a platform to study cross-lingual conversations facilitated by machine translation in virtual worlds and we observe interesting phenomena that are not present in document translations. Virtual Babel is aware of the non-verbal context of the conversation. Language model and translation models are trained from collected conversations and are used to generate translations according to observed non-verbal context of the conversation.
```
@InProceedings{zhang-bach:2009:MTSummit,
  author    = {Zhang, Ying and Bach, Nguyen},
  title     = {Virtual Babel: Towards Context-Aware Machine Translation in Virtual Worlds},
  booktitle = {Proceedings of the Twelfth Machine Translation Summit (MTSummit-XII)},
  month     = {August},
  year      = {2009},
  address   = {Ottawa, Canada},
  publisher = {International Association for Machine Translation}
}
```

Cohesive Constraints in A Beam Search Phrase-based Decoder
Nguyen Bach, Stephan Vogel and Colin Cherry
In Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference (NAACL-HLT 2009), May/June 2009, Boulder, CO, USA. [Abstract], [Slides], [Bib].
Cohesive constraints allow the phrase-based decoder to employ arbitrary, non-syntactic phrases, and encourage it to translate those phrases in an order that respects the source dependency tree structure. We present extensions of the cohesive constraints, such as exhaustive interruption count and rich interruption check. Furthermore, we present analyses related to the impact of cohesive constraints across language pairs with different reordering models and dependency parsers. Our experiments show that the cohesion-enhanced decoder performs statistically significantly better than the standard phrase-based decoder on English-Spanish. Improvements between 0.4 and 1.8 BLEU points are also obtained on English-Iraqi, Arabic-English and Chinese-English systems.
```
@InProceedings{bach-vogel-cherry:2009:NAACLHLT09-Short,
  author    = {Bach, Nguyen  and  Vogel, Stephan  and  Cherry, Colin},
  title     = {Cohesive Constraints in A Beam Search Phrase-based Decoder},
  booktitle = {Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers},
  month     = {June},
  year      = {2009},
  address   = {Boulder, Colorado},
  publisher = {Association for Computational Linguistics},
  pages     = {1--4},
  url       = {http://www.aclweb.org/anthology/N/N09/N09-2001}
}
```

Incremental Adaptation of Speech-to-Speech Translation
Nguyen Bach, Roger Hsiao, Matthias Eck, Paisarn Charoenpornsawat, Stephan Vogel, Tanja Schultz, Ian Lane, Alex Waibel and Alan W. Black
In Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference (NAACL-HLT 2009), May/June 2009, Boulder, CO, USA. [Abstract], [Slides], [Bib].
In building practical two-way speech-to-speech translation systems, the end user will always wish to use the system in an environment different from the original training data. As with all speech systems, it is important to allow the system to adapt to the actual usage situations. This paper investigates how a speech-to-speech translation system can adapt day-to-day from collected data on day one to improve performance on day two. The platform is the CMU Iraqi-English portable two-way speech-to-speech system as developed under the DARPA TransTac program. We show how machine translation, speech recognition and overall system performance can be improved on day 2 after adapting from day 1 in both a supervised and unsupervised way.
```
@InProceedings{bach-EtAl:2009:NAACLHLT09-Short,
  author    = {Bach, Nguyen  and  Hsiao, Roger  and  Eck, Matthias  and  Charoenpornsawat, Paisarn  and  Vogel, Stephan  and  Schultz, Tanja  and  Lane, Ian  and  Waibel, Alex  and  Black, Alan},
  title     = {Incremental Adaptation of Speech-to-Speech Translation},
  booktitle = {Proceedings of Human Language Technologies: The 2009 Annual Conference of the North American Chapter of the Association for Computational Linguistics, Companion Volume: Short Papers},
  month     = {June},
  year      = {2009},
  address   = {Boulder, Colorado},
  publisher = {Association for Computational Linguistics},
  pages     = {149--152},
  url       = {http://www.aclweb.org/anthology/N/N09/N09-2038}
}
```

2008

Improving Word Alignment with Language Model Based Confidence Scores
Nguyen Bach, Qin Gao and Stephan Vogel
In Proceedings of the ACL 2008 Third Workshop on Statistical Machine Translation (ACL-08:HLT, WSMT), June 2008, Columbus, Ohio, USA. [Abstract], [Slides], [Bib], [MGIZA++].
This paper describes the statistical machine translation systems submitted to the ACL-WMT 2008 shared translation task. Systems were submitted for two translation directions: English-Spanish and Spanish-English. Using sentence pair confidence scores estimated with source and target language models, improvements are observed on the News-Commentary test sets. Genre-dependent sentence pair confidence score and integration of sentence pair confidence score into phrase table are also investigated.
```
@InProceedings{bach-gao-vogel:2008:WMT,
  author    = {Bach, Nguyen  and  Gao, Qin  and  Vogel, Stephan},
  title     = {Improving Word Alignment with Language Model Based Confidence Scores},
  booktitle = {Proceedings of the Third Workshop on Statistical Machine Translation},
  month     = {June},
  year      = {2008},
  address   = {Columbus, Ohio},
  publisher = {Association for Computational Linguistics},
  pages     = {151--154},
  url       = {http://www.aclweb.org/anthology/W/W08/W08-0321}
}
```

Recent Improvements in the CMU Large Scale Chinese-English SMT System
Almut Silja Hildebrand, Kay Rottmann, Mohamed Noamany, Qin Gao, Sanjika Hewavitharana, Nguyen Bach and Stephan Vogel
In Proceedings of the Annual Meeting of the Association for Computational Linguistics with the Human Language Technology Conference (ACL-08:HLT), June 2008, Columbus, Ohio, USA. [Abstract], [Slides], [Bib].
In this paper we describe recent improvements to components and methods used in our statistical machine translation system for Chinese-English used in the January 2008 GALE evaluation. Main improvements are results of consistent data processing, larger statistical models and a POS-based word reordering approach.
```
@InProceedings{hildebrand-EtAl:2008:ACLShort,
  author    = {Hildebrand, Almut Silja  and  Rottmann, Kay  and  Noamany, Mohamed  and  Gao, Qin  and  Hewavitharana, Sanjika  and  Bach, Nguyen  and  Vogel, Stephan},
  title     = {Recent Improvements in the CMU Large Scale Chinese-English SMT System},
  booktitle = {Proceedings of ACL-08: HLT, Short Papers},
  month     = {June},
  year      = {2008},
  address   = {Columbus, Ohio},
  publisher = {Association for Computational Linguistics},
  pages     = {77--80},
  url       = {http://www.aclweb.org/anthology/P/P08/P08-2020}
}
```

2007

A Log-linear Block Transliteration Model based on Bi-Stream HMMs
Bing Zhao, Nguyen Bach, Ian Lane and Stephan Vogel
In Proceedings of the North American Association for Computational Linguistics Human Language Technologies Conference (NAACL-HLT 2007), pp 364-371, April 2007, Rochester, NY, USA. [Abstract], [Slides], [Bib], [Test set].
We propose a novel HMM-based framework to accurately transliterate unseen named entities. The framework leverages features in letter alignment and letter n-gram pairs learned from available bilingual dictionaries. Letter-classes, such as vowels/non-vowels, are integrated to further improve transliteration accuracy. The proposed transliteration system is applied to out-of-vocabulary named entities in statistical machine translation (SMT), and a significant improvement over a traditional transliteration approach is obtained. Furthermore, by incorporating an automatic spell-checker based on statistics collected from web search engines, transliteration accuracy is further improved. The proposed system is implemented within our SMT system and applied to a real translation scenario from Arabic to English.
```
@InProceedings{zhao-EtAl:2007:main,
  author    = {Zhao, Bing  and  Bach, Nguyen  and  Lane, Ian  and  Vogel, Stephan},
  title     = {A Log-Linear Block Transliteration Model based on Bi-Stream {HMM}s},
  booktitle = {Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference},
  month     = {April},
  year      = {2007},
  address   = {Rochester, New York},
  publisher = {Association for Computational Linguistics},
  pages     = {364--371},
  url       = {http://www.aclweb.org/anthology/N/N07/N07-1046}
}
```

The CMU TransTac 2007 Eyes-free and Hands-free Two-way Speech-to-Speech Translation System
Nguyen Bach, Matthias Eck, Paisarn Charoenpornsawat, Thilo Köhler, Sebastian Stüker, ThuyLinh Nguyen, Roger Hsiao, Alex Waibel, Stephan Vogel, Tanja Schultz and Alan W. Black
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2007), October 2007, Trento, Italy. [Abstract], [Slides], [Bib].
The paper describes our portable two-way speech-to-speech translation system using a completely eyes-free/hands-free user interface. This system translates between the language pair English and Iraqi Arabic as well as between English and Farsi, and was built within the framework of the DARPA TransTac program. The Farsi language support was developed within a 90-day period, testing our ability to rapidly support new languages. The paper gives an overview of the system’s components along with the individual component objective measures and a discussion of issues relevant for the overall usage of the system. We found that usability, flexibility, and robustness serve as severe constraints on system architecture and design.
```
@InProceedings{iwslt07:Transtac,
  author= {Nguyen Bach and Matthias Eck and Paisarn Charoenpornsawat and Thilo K{\"o}hler and Sebastian St{\"u}ker and ThuyLinh Nguyen and Roger Hsiao and Alex Waibel and Stephan Vogel and Tanja Schultz and Alan Black},
  title= {{The CMU TransTac 2007 Eyes-free and Hands-free Two-way Speech-to-Speech Translation System}},
  year= {2007},
  booktitle= {Proc. of the International Workshop on Spoken Language Translation},
  address= {Trento, Italy}
}
```

Handling OOV Words in Arabic ASR Via Flexible Morphological Constraints
Nguyen Bach, Mohamed Noamany, Ian Lane and Tanja Schultz
In Proceedings of INTERSPEECH (Interspeech-2007), August 2007, Antwerp, Belgium. [Abstract], [Slides], [Bib].
We propose a novel framework to detect and recognize out-of-vocabulary (OOV) words in automated speech recognition (ASR). In the proposed framework, a hybrid language model combining words and sub-word units is incorporated during ASR decoding; three different OOV word recognition methods are then applied to generate OOV word hypotheses: dictionary lookup, morphological composition, and direct phoneme-to-grapheme conversion. The proposed approach successfully reduced WER by 1.9% and 1.6% for ASR systems with recognition vocabularies of 30K and 219K, respectively. Moreover, the proposed approach correctly recognized 5% of OOV words.
```
@InProceedings{Bach2007, 
  author = "Bach, Nguyen and Noamany, Mohamed and Lane, Ian and Schultz, Tanja",
  title = "Handling OOV Words in Arabic ASR Via Flexible Morphological Constraints",
  booktitle = "Proceedings of the Interspeech2007",
  year = "2007"
}
```

The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007
Ian Lane, Andreas Zollmann, ThuyLinh Nguyen, Nguyen Bach, Ashish Venugopal, Stephan Vogel, Kay Rottmann, Ying Zhang and Alex Waibel
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2007), October 2007, Trento, Italy. [Abstract], [Slides], [Bib].
This paper describes the CMU-UKA statistical machine translation systems submitted to the IWSLT 2007 evaluation campaign. Systems were submitted for three language-pairs: Japanese-English, Chinese-English and Arabic-English. All systems were based on a common phrase-based SMT (statistical machine translation) framework but for each language-pair a specific research problem was tackled. For Japanese-English we focused on two problems: first, punctuation recovery, and second, how to incorporate topic-knowledge into the translation framework. Our Chinese-English submission focused on syntax augmented SMT and for the Arabic-English task we focused on incorporating morphological-decomposition into the SMT framework. This research strategy enabled us to evaluate a wide variety of approaches which proved effective for the language pairs they were evaluated on.
```
@InProceedings{iwslt07:UKACMU_SMT,
  author= {Ian Lane and Andreas Zollmann and Thuy Linh Nguyen and Nguyen Bach and Ashish Venugopal and Stephan Vogel and Kay Rottmann and Ying Zhang and Alex Waibel},
  title= {{The CMU-UKA Statistical Machine Translation Systems for IWSLT 2007}},
  year= {2007},
  booktitle= {Proc. of the International Workshop on Spoken Language Translation},
  address= {Trento, Italy}
}
```

2006

A Log-linear Block Transliteration Model based on Bi-Stream HMMs
Bing Zhao, Nguyen Bach, Ian Lane and Stephan Vogel
T.R. CMU-LTI-06-007, Carnegie Mellon University, Pittsburgh, PA, Fall 2006.

The UKA/CMU Statistical Machine Translation System for IWSLT 2006
Matthias Eck, Ian Lane, Nguyen Bach, Sanjika Hewavitharana, Muntsin Kolss, Bing Zhao, Almut Silja Hildebrand, Stephan Vogel and Alex Waibel
In Proceedings of the International Workshop on Spoken Language Translation (IWSLT-2006), pp 130-137, November 2006, Kyoto, Japan. [Abstract], [Slides], [Bib].
This paper describes the UKA/CMU statistical machine translation system used in the IWSLT 2006 evaluation campaign. The system is based on phrase-to-phrase translations extracted from a bilingual corpus. We compare two different phrase alignment techniques, both based on word alignment probabilities. The system was used for all language pairs and data conditions in the evaluation campaign, translating both the ASR output (as 1-best) and the correct recognition results.
```
@InProceedings{iwslt06:EC:UKACMU_SMT,
  author= {Matthias Eck and Ian Lane and Nguyen Bach and Sanjika Hewavitharana and Muntsin Kolss and Bing Zhao and Almut Silja Hildebrand and Stephan Vogel and Alex Waibel},
  title= {{The UKA/CMU Statistical Machine Translation System for IWSLT 2006}},
  year= {2006},
  booktitle= {Proc. of the International Workshop on Spoken Language Translation},
  address= {Kyoto, Japan},
  pages= {130--137},
}
```

Before 2005

Quantitative Analysis and Synthesis of Syllabic Tones in Vietnamese
Hansjoerg Mixdorff, Nguyen Hung Bach, Hiroya Fujisaki and Mai Chi Luong
In Proceedings of the EUROSPEECH (Eurospeech-2003), pp 177-180, Sep 2003, Geneva, Switzerland.

Analysis F0 Contours Using the Fujisaki model for Vietnamese Tones
Bach Hung Nguyen and Nguyen Tien Dung
In Proceedings of the National Informatics Conference, Thai Nguyen, Vietnam, 2003.

Application of Dynamic Time Warping Algorithm for the Recognition of Vietnamese Isolated Words
Bach Hung Nguyen and Luong Chi Mai
In Journal of Science and Technology, N.5, 2002, Vietnam.

CODE

TriS: A Statistical Sentence Simplifier. It simplifies a long and complicated sentence into shorter sentences with stack decoding and margin-based discriminative training. The simplification engine is written in Java; the server side is in CGI; the web interface is in AJAX and CSS; the Windows interface is in C#; about 6,000 lines of code in total.

SentSimpTurk: A web interface to collect sentence simplification data from Amazon Mechanical Turk. It is equipped with an edit-distance algorithm to filter out bad Turkers. In GWT and CSS.
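
For illustration, here is a minimal Python sketch of a word-level edit distance (Levenshtein distance) of the kind such a check could rely on. It is not the SentSimpTurk code, which is in GWT; the function names, the `min_edits` threshold, and the "unedited submission" rule are assumptions made only for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))      # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]                      # distance from a[:i] to the empty prefix of b
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def looks_unedited(original, submission, min_edits=1):
    """Hypothetical filter: flag a submission that changes fewer than min_edits words."""
    return edit_distance(original.split(), submission.split()) < min_edits
```

A submission identical to the input sentence has distance 0 and would be flagged under this assumed rule; a real filter would probably also need an upper bound to catch unrelated text.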

WPP: a reimplementation of Ueffing-Ney word posterior probability. It is useful for the n-best list reranking task in MT. In Perl.

Coh: cohesive constraints for phrase-based SMT decoders. These constraints are an extension of Colin Cherry's cohesive constraints. They tie the source-side dependency trees to phrase movements at decoding time. It is integrated into the CMU SMT decoder. In C++ STL.

DepRe: a dependency reordering model for phrase-based SMT decoders. It extracts well-formed dependency structures (see the definition in Libin Shen's paper) from the source side, and learns reordering patterns, i.e., moving inside or outside of a subtree. In C++ STL.

HMM POS tagger: an implementation of a smoothed bigram HMM tagger with EM training (Baum-Welch) and Viterbi decoding in Perl.
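
As a rough illustration of the decoding step, the following Python sketch runs Viterbi search over a bigram HMM in log space. It is not the Perl implementation: the dictionary-based probability tables, the function name, and the constant `floor` used in place of real smoothing are all assumptions for this example, and Baum-Welch training is not shown.

```python
import math


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most likely tag sequence for a non-empty word list under a bigram HMM."""
    def lp(table, key):
        # Unseen events get a tiny floor probability (a stand-in for smoothing).
        return math.log(table.get(key, floor))

    # best[i][t] = (log score of the best path ending in tag t at word i, back-pointer)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, words[0])), None) for t in tags}]
    for w in words[1:]:
        column = {}
        for t in tags:
            score, prev = max(
                (best[-1][p][0] + lp(trans_p, (p, t)), p) for p in tags
            )
            column[t] = (score + lp(emit_p, (t, w)), prev)
        best.append(column)

    # Follow back-pointers from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for column in reversed(best[1:]):
        tag = column[tag][1]
        path.append(tag)
    return path[::-1]
```

Here `trans_p` is keyed by `(previous_tag, tag)` and `emit_p` by `(tag, word)`; both layouts are choices made for the sketch.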

ParserServer: a server for parsing sentences. This program opens a socket and listens for parsing requests from clients, so that sentences can be parsed quickly without reloading models. In Java.

AraMorph: an Arabic rule-based morphological analyzer. Given an Arabic word, this program returns different variants of splitting the word into linguistic units. It is often used to analyze the out-of-vocabulary words of an Arabic-English translation system. In Perl.

AutoGoogle: automatically abuse :-) the Google search engine with lots of queries. It used to send 30K queries per day without its IP address being blocked. This program was used to get statistics, spell checks, and related snippets for a given query. In Perl.

NPP: a noun phrase paraphrasing script. Paraphrasing rules are learned from syntactic trees. It was used in an English-Spanish translation system. In Perl.

StatSpell: a reimplementation of Peter Norvig's statistical spelling checker. It has been used for cleaning corpora of Haitian-Creole, Pashtu, and Iraqi. In Perl.
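
The core idea of Norvig's checker can be sketched in a few lines of Python. This is a simplified illustration rather than the StatSpell code: it only considers candidates one edit away, and the `a-z` alphabet and the function names are assumptions (corpora in other languages would need their own alphabet).

```python
import re
from collections import Counter


def train(text):
    """Count word frequencies in a training text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings one deletion, transposition, replacement or insertion away."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    replaces = [l + c + r[1:] for l, r in splits if r for c in alphabet]
    inserts = [l + c + r for l, r in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)


def correct(word, counts):
    """Keep known words; otherwise pick the most frequent known word one edit away."""
    if word in counts:
        return word
    candidates = [w for w in edits1(word) if w in counts]
    return max(candidates, key=counts.get) if candidates else word
```

Words with no known neighbour are returned unchanged, which keeps the sketch conservative when used for corpus cleaning.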

MetaShopper: a preliminary study and implementation of a meta-search engine for book-hungry readers; it may save you a few bucks. You can try the implementation, VeryNaiveBookCrawler. In Perl CGI.

A random sentence generator: a Perl script that generates text or trees from a probabilistic context-free grammar (PCFG). You can try it here: 10 English sentences or 10 Vietnamese sentences in Nguyen Binh's style (Nguyen Binh is a famous Vietnamese poet).
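
Sampling from a PCFG can be sketched as a short recursive function. The Python below is an illustration only, not the Perl script: the toy grammar, its probabilities, and the names `GRAMMAR` and `generate` are invented for the example, and it emits flat text rather than trees.

```python
import random

# Hypothetical toy grammar: nonterminal -> list of (probability, right-hand side).
GRAMMAR = {
    "S": [(1.0, ["NP", "VP"])],
    "NP": [(0.7, ["the", "N"]), (0.3, ["the", "Adj", "N"])],
    "VP": [(0.6, ["V", "NP"]), (0.4, ["V"])],
    "N": [(0.5, ["poet"]), (0.5, ["river"])],
    "Adj": [(1.0, ["quiet"])],
    "V": [(0.5, ["sees"]), (0.5, ["sleeps"])],
}


def generate(symbol, grammar, rng):
    """Expand symbol top-down, sampling each rule according to its probability."""
    if symbol not in grammar:           # terminal: emit the word itself
        return [symbol]
    probs, rules = zip(*grammar[symbol])
    rhs = rng.choices(rules, weights=probs, k=1)[0]
    words = []
    for child in rhs:
        words.extend(generate(child, grammar, rng))
    return words
```

Calling `" ".join(generate("S", GRAMMAR, random.Random(0)))` yields one sampled sentence. The toy grammar has no recursive rules, so expansion always terminates; a recursive grammar would need rule probabilities that keep the expected sentence length finite.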

A text classifier: the program uses two training corpora, for example spam and not-spam, or English and Spanish. Given an email, it assigns the email to one of the two training groups: for spam detection, it decides whether the email is spam or not-spam; for language identification, whether it is written in English or Spanish. Smoothing sharply decreases the error rate; I tried uniform, add-lambda, add-lambda backoff, and Witten-Bell backoff. In C.
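
To make the add-lambda case concrete, here is a minimal Python sketch that classifies a text by comparing its likelihood under two smoothed language models. It is not the C program: it assumes unigram models, equal class priors, whitespace tokenization, and a single shared out-of-vocabulary type, and all names are hypothetical.

```python
import math
from collections import Counter


class AddLambdaUnigram:
    """Unigram language model with add-lambda smoothing."""

    def __init__(self, tokens, vocab, lam=0.1):
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(vocab) + 1    # +1: unseen words share one OOV type
        self.lam = lam

    def logprob(self, tokens):
        denom = self.total + self.lam * self.vocab_size
        return sum(math.log((self.counts[t] + self.lam) / denom) for t in tokens)


def classify(text, corpora, lam=0.1):
    """Return the label whose model gives text the highest log-likelihood."""
    vocab = {t for toks in corpora.values() for t in toks}
    models = {label: AddLambdaUnigram(toks, vocab, lam)
              for label, toks in corpora.items()}
    tokens = text.lower().split()
    return max(models, key=lambda label: models[label].logprob(tokens))
```

Without smoothing (`lam = 0`), a single unseen word would give a text zero probability under a model, which is one way to see why smoothing lowers the error rate.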

Shell scripts and aliases, things you might want to type at a Unix command line:
checkEmptyLine: report which lines of a corpus are empty.
BleuSent: compute smoothed sentence-level BLEU score.
fstr: find a string in the files of a directory. It is useful for command line debugging.
rmsvndir: remove all hidden SVN directories.
sorttab: sort a tab-delimited text file according to a column.
Kappa statistics: compute inter-rater agreement with Fleiss' Kappa, including random sampling.
html2text: removes html tags and returns plain text documents. It is helpful when preprocessing Web collected data.
VocabCoverage: check the vocabulary coverage for a given vocabulary and a text.
TypeToken: compute type token ratio given a corpus.
shuffle: given a corpus, shuffle generates a new corpus with random sentence order.
getReadability: compute readability scores, such as Flesch, Fog, Flesch-Kincaid, of a given text.
ngram-overlap: compare two n-gram lists and return statistics of the entries they share. It is useful when you want to know how different two corpora are.
...
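
Two of the statistics above are easy to state exactly. The Python sketch below computes Fleiss' kappa from a table of rating counts and the type-token ratio of a token list. It is an illustration, not the original scripts: the input layouts and function names are assumptions, and the random sampling mentioned for the kappa script is not included.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa; ratings[i][j] = number of raters who put item i in category j.

    Every item must be rated by the same number of raters (at least two).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, averaged over items.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    p_exp = sum(p * p for p in p_cat)   # agreement expected by chance
    return (p_obs - p_exp) / (1 - p_exp)


def type_token_ratio(tokens):
    """Number of distinct word types divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)
```

Kappa is undefined when all ratings fall into a single category (`p_exp` is 1); the sketch does not guard against that case.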

TALKS

Qin Gao, Alok Parlikar, Nguyen Bach, and Stephan Vogel, 'Statistical Machine Translation: Parallel Processing for Large Data Situations,' Intel Research Pittsburgh Open House 2008, October 2008, Pittsburgh, PA, USA.
Simulating Sentence Pairs Sampling Process via Source and Target Language Models, MT Lunch, April 2008, Carnegie Mellon University
Translating Words You've Never Seen, Student Research Symposium 2006, Language Technologies Institute, Carnegie Mellon University

UNPUBLISHED REPORTS

Nguyen Bach, A comparison between IBM Watson DeepQA and Statistical Machine Translation, May 2011, Carnegie Mellon University.
Nguyen Bach and Sameer Badaskar, A Survey on Relation Extraction, literature review for Language and Statistics II, 2007. [Abstract], [Slides].
Many applications in information extraction, natural language understanding, and information retrieval require an understanding of the semantic relations between entities. We present a comprehensive review of various aspects of the entity relation extraction task. Some of the most important supervised and semi-supervised classification approaches to the relation extraction task are covered in sufficient detail along with critical analyses. We also discuss extensions to higher-order relations. Evaluation methodologies for both supervised and semi-supervised methods are described along with pointers to the commonly used performance evaluation datasets. Finally, we also give short descriptions of two important applications of relation extraction, namely question answering and biotext mining.
N. Bach, S. Reddy, 'A preliminary quantitative study on the characteristics of Vietnamese vowels and English vowels', May 2004, Johns Hopkins University

Nguyen Bach

Last modified: January 3, 2012