The Language Technologies Institute

at Carnegie Mellon University


Jonathan Clark
PhD Student
School of Computer Science
(NLP Scientist @ Microsoft Research)




Broadly, I am interested in how we can use linguistics and statistics to improve computational models of human language. Currently, I work with Alon Lavie on statistical machine translation. I also frequently collaborate with Chris Dyer. My dissertation, which I'm currently working on, is entitled "Locally Non-Linear Learning via Feature Induction in Statistical Machine Translation".

I also spent some time building very large language models. Before that, I developed discriminant syntactic features that help a translation system choose better translations for both resource-rich and resource-poor languages; this work covered phrase structure and dependency structure features, and how best to model these structures statistically to capture the behavior of the language pair being translated.

Previously, I worked with Lori Levin and Robert Frederking on a year-long pilot project (also a part of AVENUE) investigating active learning techniques for presenting a bilingual person with examples from a linguistically structured corpus. The goal was to tap such people as an efficient and cost-effective resource for improving the quality of machine translation for languages that have few alternatives for acquiring the data needed to train modern machine translation systems.


Jonathan Clark
Microsoft Research, Building 99
One Microsoft Way
Redmond, WA 98052-6399

Phone: (412) 254-4566


ducttape: HyperWorkflow Manager

MultEval: Easy Bootstrap Resampling and Approximate Randomization for BLEU, METEOR, and TER using Multiple Optimizer Runs


J. Clark, A. Lavie, C. Dyer "One System, Many Domains: Open-Domain Statistical Machine Translation via Feature Augmentation", Association for Machine Translation in the Americas (AMTA) October 2012. San Diego, California, USA [PDF]

Thesis Proposal: "Locally Non-Linear Learning via Feature Induction in Statistical Machine Translation", April 2012.

J. Clark, C. Dyer, A. Lavie, N. Smith "Better Hypothesis Testing for Statistical Machine Translation: Controlling for Optimizer Instability", Association for Computational Linguistics (ACL) July 2011. Portland, Oregon, USA [PDF] [ACL Slides] [Software] [YouTube Presentation]

C. Dyer, J. Clark, A. Lavie, N. Smith "Unsupervised Word Alignment with Arbitrary Features", Association for Computational Linguistics (ACL) July 2011. Portland, Oregon, USA [PDF]

C. Dyer, K. Gimpel, J. Clark, N. Smith "The CMU-ARK German-English Translation System", Workshop on Statistical Machine Translation (WMT11) July 2011. Edinburgh, UK [PDF]

G. Hanneman, J. Clark, A. Lavie, "Improved Features and Grammar Selection for Syntax-Based MT", Workshop on Statistical Machine Translation (WMT10) at the Association for Computational Linguistics (ACL) July 2010. Uppsala, Sweden [PDF]

J. Clark, J. Weese, B. Ahn, A. Zollmann, Q. Gao, K. Heafield, A. Lavie, "The Machine Translation Toolpack for LoonyBin: Automated Management of Experimental Machine Translation HyperWorkflows", Prague Bulletin of Mathematical Linguistics (Presented at the Fourth Machine Translation Marathon) January 2010. Dublin, Ireland [PDF] [MT Lunch Slides] [MT Marathon Slides] [Software]

J. Clark, A. Lavie, "LoonyBin: Keeping Language Technologists Sane through Automated Management of Experimental (Hyper)Workflows", LREC 2010. Malta. [PDF] [Software]

G. Hanneman, V. Ambati, J. Clark, A. Parlikar, A. Lavie, "An Improved Statistical Transfer System for French–English Machine Translation", The Fourth Workshop on Statistical Machine Translation (WMT09) at the European Association for Computational Linguistics (EACL), March 2009. Athens, Greece. [PDF]

J. Clark, R. Frederking, L. Levin "Inductive Detection of Language Features via Clustering Minimal Pairs: Toward Feature-Rich Grammars in Machine Translation", The Second Workshop on Syntax and Structure in Translation (SSST) at the Association for Computational Linguistics (ACL), June 2008. Columbus, Ohio. [PDF] [Slides]

J. Clark, R. Frederking, L. Levin "Toward Active Learning in Corpus Creation: Automatic Discovery of Language Features During Elicitation", The Sixth Language Resources and Evaluation Conference (LREC), May 2008. Marrakech, Morocco. [PDF] [Slides]

J. Clark, C. Hannon, "A Classifier System for Author Recognition Using Synonym-Based Features", Sixth Mexican International Conference on Artificial Intelligence, November 2007. Aguascalientes, Mexico. [PDF]

J. Clark, C. Hannon, "An Algorithm for Identifying Authors Using Synonyms", ENC 2007, September 2007. Morelia, Mexico.

M. Bowden, M. Olteanu, P. Suriyentrakorn, J. Clark, D. Moldovan, "LCC's PowerAnswer at QA@CLEF 2006," CLEF 2006 Working Notes, September, 2006. Alicante, Spain. [PDF]

C. Hannon, J. Clark, "A Cognitive-Based Approach to Learning Integrated Language Components", The Third International Workshop on Natural Language Understanding and Cognitive Science, May 2006. Paphos, Cyprus


J. Clark, "Treegraft: A Stochastic Transduction Chart Parser", NLP Lab Self-Defined Project Final Report, Spring 2008. [PDF] [Google Code Project page]

J. Clark, J. Gonzalez, "Coreference: Current Trends and Future Directions", Language and Statistics II Literature Review, Fall 2008. [PDF]

The Initial

With apologies to Noah A. Smith, I also feel the need to explain the pretentious middle initial on all my publications: The name Jon Clark is only slightly less common than John Smith. Other Jonathan Clarks include the 2007 CMU MBA class co-president, the songwriter, the photographer, the woodworker, the journalist, the comedian, the cameraman, the actor, the teacher, the pilot, the athlete, the golfer, the biker, the boxing champion, the lighting designer, the British artist, the sculptor, the architect, the health technologist, the computational biology professor, the personal trainer, the wellness professional, the history professor, the chief counsel for Morgan Stanley, the finance professor, the attorney, the founder of Thinstall (virtualization software), the senior VP at Sallie Mae, the real estate agent, the university president, the music professor, the post-hardcore band singer, the founder of Business Writing Solutions, the 18th century general, the basketball player, the NLP trainer (NeuroLinguistic Programming), the telecommunications consultant, the IT professional, the search marketing specialist, the computer engineering student, the polymer research engineer, the physician, another physician, another still, the surgeon, the zoology professor, the biomedical robotics professor (who, incidentally, published a paper with Jorge Cham), and the former CTO of LionBridge (large language engineering company that made this translation software... talk about hard to be unique). Even the initial doesn't always work; the other Jonathan H Clark is a Texas lawyer.


Spring 2009

Fall 2008

Spring 2008

Fall 2007


When I'm not knee-deep in code, I enjoy going to Pittsburgh Pirates baseball games with my wife Libby (while eating nachos topped with obscene amounts of jalapeños), playing drums (jazz, hand percussion, metal, it's all good stuff), and learning bits of random languages. And of course, reading Jorge Cham's wonderful PhD comics (follow the link for more laughs):


Simple, but Brilliant Java Programming Advice

Choosing a Ph.D. Program in Computer Science (Berkeley)
Advice on Applying (and whether to apply) for a Ph.D. in Computer Science (CMU)
Advice on Applying for Ph.D., Fellowships, and Other Such (Stanford)
Advice for Writing Personal Statements

A Few Favorite Applications

Remember the Milk - Advanced Todo List
Pros: Implements Getting Things Done and most of Randy Pausch's Time Management lecture
Cons: Doesn't integrate time tracking

Google Calendar - Tells me when to be places
Pros: Easy to use interface and support for sharing calendars

FindBugs - Finds bugs in Java programs