Sep 2

Alon Lavie



Statistical MT with Syntax and Morphology: Challenges and Some Solutions

Phrase-based Statistical Machine Translation has become the dominant approach to MT in recent years.  Its linguistic shallowness, however, limits its capabilities when applied to morphologically-rich languages and to language-pairs with highly divergent syntax.  Integration of morphological analysis and syntactic modeling within statistical MT is currently at the forefront of MT research.  This talk will overview recent work by my research group and our collaborators on hybrid MT frameworks that incorporate syntax and morphology into statistical translation.


The talk will focus on three main lines of work: (1) Learning of syntax-based synchronous context-free grammars from large volumes of parsed parallel corpora; (2) Morphological segmentation of Arabic and its impact on English-to-Arabic phrase-based SMT; and (3) MT between Hebrew and Arabic - two morphologically-rich related languages with very limited parallel data resources.

Dr. Alon Lavie is an Associate Research Professor at the Language Technologies Institute at Carnegie Mellon University, where he directs a research group in the area of Machine Translation (MT).  His current main research projects focus on the design of syntax-based data-driven approaches to Machine Translation, multi-engine Machine Translation system combination, and MT evaluation. Dr. Lavie is also the co-founder and President of Safaba Translation Solutions - a CMU spin-off company that develops Machine Translation solutions for commercial enterprises and Language Service Providers.  Dr. Lavie is currently serving as President of the Association for Machine Translation in the Americas (AMTA).

Sep 16

Robert Frederking


Going beyond Identifinder


Extraction of common named entities (Named Entity Recognition, or NER) from normal English documents is largely a solved problem.  As part of an ongoing research project here, we are attempting to go beyond standard NER in several dimensions.  We have investigated ensemble approaches to improving NER quality, rapid development of NER in other languages, and NER co-reference both within and across documents.  We have also made an initial investigation into NER in short informal messages (Twitter "tweets").  I will describe the context of this work, our approaches and results so far, and then mention future directions that we are considering.  This talk presents joint work with Anatole Gershman, Rushin Shah, and Bo Lin.


Robert Frederking is a Senior Systems Scientist in the Language Technologies Institute (LTI) at Carnegie Mellon, where he is also SCS Associate Dean for Graduate Education, and the Chair of the LTI's graduate programs.   His research in information extraction began with the Radar project.  His research interests also include machine translation and other areas of language technologies.  He received his PhD in Computer Science/Artificial Intelligence from Carnegie Mellon in 1986.  He received his BS in Computer Engineering from CWRU in 1977, and worked on CWRUNET, an early computer network.  He has also worked at Carnegie Group Inc., the Robotics Institute, and Siemens Research in Munich, Germany.


Sep 23

Lori Levin

Modal Constructions in Machine Translation:
beyond propositional semantics


This talk describes an experiment in which semantic information about modality (might, could, can, etc.) was added to a syntactically augmented machine translation system.  The work is conducted in the frameworks of Semantically Informed MT (SIMT) and Linguistic Core MT (LCMT), which enable us to add interlingua-style semantic information to statistical MT while avoiding some of the shortcomings of old-style interlingua systems.  An example will be given from the SIMT framework on modality in Urdu-English MT (NIST 2009 task).  Functional information about modality was added to a syntax-based MT system by relabeling VP nodes in English parse trees with more specific categories such as VP-Require, resulting in a small increase in BLEU score.  I will discuss the implications of this experiment more generally with respect to the role of semantics in machine translation and the goal of language technologies for low-resource languages.



Lori Levin has a B.A. in Linguistics from the University of Pennsylvania (1979) and a Ph.D. in Linguistics from MIT (1986).  She has been involved in machine translation since 1986, starting in the Center for Machine Translation (precursor of the LTI).  She has worked on MT and other language technologies for major European and Asian languages as well as Native American languages and African languages.


Sep 30

Joseph Keshet

Loss Minimization for Voice Onset Time (VOT) Measurement,
Phoneme Alignment, and Phoneme Recognition


In discriminative learning one is interested in training a system to optimize a certain desired measure of performance, or task loss. In binary classification one typically tries to minimize the error rate. But in prediction for more complex tasks, such as phoneme recognition or voice onset time (VOT) measurement, each task has its own loss. Phoneme recognition performance is measured in terms of phoneme error rate (edit distance) and VOT measurement is quantitatively assessed by the mean deviation from the manually labeled VOT. In the talk I will present two algorithms applied to VOT measurement, phoneme alignment, and phoneme recognition, where the goal is to minimize the specific loss for each task.
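The two task losses named above can be made concrete with a minimal sketch. It is illustrative only and is not the speaker's implementation: the function names are hypothetical, the phoneme error rate is assumed to be the edit distance normalized by the reference length, and "mean deviation" is assumed to mean the mean absolute deviation in milliseconds.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # prev[j] is the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # delete r from ref
                            curr[j - 1] + 1,       # insert h into ref
                            prev[j - 1] + cost))   # match or substitute
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by reference length (assumed normalization)."""
    if not ref:
        raise ValueError("reference sequence must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


def mean_vot_deviation(predicted_ms: Sequence[float],
                       manual_ms: Sequence[float]) -> float:
    """Mean absolute deviation of predicted VOT from manual labels (assumed)."""
    if len(predicted_ms) != len(manual_ms) or not manual_ms:
        raise ValueError("need two non-empty lists of equal length")
    total = sum(abs(p - m) for p, m in zip(predicted_ms, manual_ms))
    return total / len(manual_ms)
```

For example, `phoneme_error_rate(["k", "ae", "t"], ["k", "ah", "t"])` counts one substitution against a reference of three phonemes, and `mean_vot_deviation([50.0, 62.0], [48.0, 65.0])` averages deviations of 2 ms and 3 ms.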


In the first part of the talk I will present the problem of automatic VOT measurement and define its loss. I will describe an algorithm which is based on structural support vector machines (SVMs) to minimize this loss. Applied to initial voiceless stops from four corpora (read and conversational speech), the agreement between automatic and manual measurements was found to be near human inter-judge agreement. The experimental results also show that this algorithm provides an accurate and efficient technique for large-scale phonetic analysis.


While algorithms based on structural SVMs are aimed at minimizing the task loss, they actually minimize a surrogate to the task loss, and there is no guarantee about the actual task loss. In the second part of the talk, I will describe a new theorem stating that a general learning update rule directly corresponds to the gradient of the task loss. Based on this update rule I will present a new algorithm for minimizing the unique task loss of phoneme alignment. I will present empirical results on phoneme alignment of a standard test set from the TIMIT corpus, which surpass all previously reported results on this problem. I will show how this update rule can be applied to continuous-density HMMs and will present empirical results on phoneme recognition of TIMIT, showing our approach outperforms previous results on large-margin training of HMMs.


This is joint work with Chih-Chieh Cheng, Tamir Hazan, David McAllester, Morgan Sonderegger, and Mark Stoehr.



Dr. Keshet received his B.Sc. and M.Sc. degrees in Electrical Engineering in 1994 and 2002, respectively, from Tel Aviv University. He received his Ph.D. in Computer Science from The School of Computer Science and Engineering at The Hebrew University of Jerusalem in 2007. From 1995 to 2002 he was a researcher at IDF, and won the prestigious Israeli award, "Israel Defense Prize", for outstanding research and development achievements. From 2007 to 2009 he was a post-doctoral researcher at IDIAP Research Institute in Switzerland. Since 2009 he has been a research assistant professor at TTI-Chicago, a philanthropically endowed academic computer science institute on the campus of the University of Chicago. Dr. Keshet's research interests are in speech and language processing and machine learning. His current research focuses on the design, analysis and implementation of machine learning algorithms for the domain of speech and language processing.


Oct 7

Huan Liu

When Connected, Are We Still in Control of Our Destinies?


The prevalence of social media offers a new kind of laboratory for behavioral study. In the era of the social Web, we are presented with unparalleled opportunities and novel challenges. In this talk, we will introduce some of our recent studies of collective behavior with social media. In particular, we will discuss some projects that illustrate our endeavors to improve the understanding of collective behaviors in social media. We look into user migration patterns in the presence of seemingly unlimited choices of social media services, and investigate ways of exploiting vulnerability to protect user privacy on a social networking site. We benefit from sociological theories and methodologies in carrying out interdisciplinary research that sheds light on our study of collective behavior in social media. The improved understanding of collective behavior can help develop social media services that encourage more user participation with better experience in social media activities.


Joint work with former and current DMML members



Dr. Huan Liu is a professor of Computer Science and Engineering at Arizona State University. He obtained his Ph.D. in Computer Science at University of Southern California and B.Eng. in Computer Science and Electrical Engineering at Shanghai JiaoTong University. He is recognized for excellence in teaching and research in Computer Science and Engineering at Arizona State University. His research interests are data mining, machine learning, social computing, and artificial intelligence.  His research focus is centered on investigating problems that arise in many real-world applications with high-dimensional data of disparate forms such as analyzing social media, group interaction and modeling, data preprocessing (feature selection), and text/web mining. His well-cited publications include books, book chapters, encyclopedia entries as well as conference and journal papers. He serves on journal editorial boards and numerous conference program committees, and is a founding organizer of the International Conference Series on Social Computing, Behavioral-Cultural Modeling, and Prediction. He is an ACM Distinguished Scientist.


Oct 14

Eduard Hovy
Information Sciences Institute
University of Southern California



Toward a New Semantics:
Merging Propositional and Distributional Information


Despite hundreds of years of study on semantics, theories and representations of semantic content—the actual meaning of the symbols used in semantic propositions—remain impoverished.  The traditional extensional and intensional models of semantics are difficult to actually flesh out in practice, and no large-scale models of this kind exist.  Recently, researchers in Natural Language Processing (NLP) have increasingly treated word distributions (also called ‘context vectors’, ‘topic models’, ‘language models’, etc.) as a de facto placeholder for semantics at various levels of granularity.  This talk explores a new kind of semantics that combines traditional symbolic logic-based proposition-style semantics (of the kind used in older NLP) with (computation-based) statistical word distribution information (what is being called Distributional Semantics in modern NLP).  The core resource is a single lexico-semantic ‘lexicon’ that can be used for a variety of tasks.  I outline how to define and build such a lexicon and how to use it for various tasks. Combining the two views of semantics opens many fascinating questions that beg study, including the operation of logical operators such as negation and modalities over word(sense) distributions, the nature of ontological facets required to define concepts, and the action of compositionality over statistical concepts.



Eduard Hovy directs the Human Language Technology Group at the Information Sciences Institute of the University of Southern California, and holds several adjunct professorships at universities in China, Korea, and Canada.  He is also a research associate professor of USC’s Computer Science Department, and is co-Director of Research for the Center for Command, Control, and Interoperability Data Analytics, funded by DHS.  Dr. Hovy completed a Ph.D. in Computer Science (Artificial Intelligence) at Yale University in 1987.  His research addresses many areas in Natural Language Processing, including machine reading of text, question answering, information extraction, automated text summarization, the semi-automated construction of large lexicons and ontologies, and machine translation.  His work combines statistical machine learning methods with insights from Linguistics, Sociology, and other disciplines to develop models of language-based phenomena that go beyond simple statistical word correspondences. Dr. Hovy is the author or co-editor of six books and over 300 technical articles and is a popular invited speaker.  In 2001 Dr. Hovy served as President of the Association for Computational Linguistics (ACL) and in 2001–03 as President of the International Association of Machine Translation (IAMT); for 2007–2009 he served as President of the Digital Government Society of North America (DGSNA). Dr. Hovy has also worked on various aspects of Digital Government.  Dr. Hovy regularly co-teaches a specialized course in the Computer Science Department of the University of Southern California, as well as occasional short courses on machine translation, ontologies, text annotation, information extraction, and other topics at universities and conferences.  He serves on Advisory Boards for institutes and funding organizations in Germany, Italy, the Netherlands, and the USA. 




Oct 28

Martha Palmer
University of Colorado

Beyond Shallow Semantics


Shallow semantic analyzers, such as semantic role labelers and sense taggers, are increasing in accuracy and becoming commonplace. However, they only provide limited and local representations of words and individual predicate-argument structures. This talk will address some of the current opportunities and challenges in producing deeper, richer representations of coherent eventualities. Available resources, such as VerbNet, that can assist in this process will also be discussed, as well as some of their limitations.


Martha Palmer is a Full Professor at the University of Colorado with joint appointments in Linguistics and Computer Science and is an Institute of Cognitive Science Faculty Fellow. She recently won a Boulder Faculty Assembly 2010 Research Award. Her research has been focused on trying to capture elements of the meanings of words that can comprise automatic representations of complex sentences and documents. Supervised machine learning techniques rely on vast amounts of annotated training data so she and her students are engaged in providing data with word sense tags and semantic role labels for English, Chinese, Arabic, Hindi, and Urdu, funded by DARPA and NSF. They also train automatic sense taggers and semantic role labelers, and extract bilingual lexicons from parallel corpora. A more recent focus is the application of these methods to biomedical journal articles and clinical notes, funded by NIH. She is a co-editor for the Journal of Natural Language Engineering and for LiLT, Linguistic Issues in Language Technology, and on the CLJ Editorial Board. She is a past President of the Association for Computational Linguistics, past Chair of SIGLEX and SIGHAN, and was the Director of the 2011 Linguistics Institute held in Boulder, Colorado.


Nov 4

Sarah Cohen
Duke Univ.

Computation and Watchdog Journalism:
Investigative Reporting Methods in the Digital Age


While Twitter and social media have already fundamentally altered news that happens in the public square, uncovering information that powerful institutions wish to keep secret is as hard as ever. But journalism is on the cusp of a new revolution in reporting methods spurred by advances in unstructured data analysis and cheap access to computing resources. The changes could be as significant as advances spurred by the copy machine in the 1970s and the widespread use of relational databases in the 1990s to find and document stories. This talk will review some of the methods used and challenges faced among 21st century investigative reporters.



Sarah Cohen is the Knight Professor of the Practice of Journalism at Duke University's DeWitt Wallace Center for Media and Democracy. She joined Duke in 2009 after nearly 20 years as a beat and investigative reporter, including more than a decade as a member of investigative units at The Washington Post. Her awards include the Pulitzer Prize in Investigative Reporting, the Goldsmith Prize in Investigative Reporting and the Robert F. Kennedy Public Service Award for Journalism. Cohen serves as an officer of the 4,200-member Investigative Reporters and Editors Inc., to whose board of directors she was elected in 2010.  At Duke, she is the founder of the new Duke Project for the Advancement of Public Affairs Reporting (or the Reporter's Lab for short), which curates, develops, deploys and adapts free and open source software and tools for public affairs reporting.


Nov 11

Yiming Yang

Modeling Novelty in Multi-session Retrieval

An open challenge in information retrieval is to detect novel information in sequences of ranked lists, and to optimize the system’s utility with respect to both relevance and novelty.

Modeling novelty is difficult because novelty depends on user browsing history, and user behavior over ranked lists is non-deterministic.  We propose a new probabilistic framework for stochastic modeling of user interactions with multi-session ranked lists, an algorithmic solution (based on sub-modularity) for efficient approximation of expected utility (an NP-hard problem), and new search strategies for retrieval optimization based on nugget detection and nugget-level relevance/novelty estimation.  Our framework provides a strong foundation for new methodologies both in retrieval evaluation and in retrieval optimization.  It allows significant utility improvements by leveraging realistic stochastic assumptions about user behavior, without requiring cost-intensive and time-consuming studies with human subjects. Our evaluations on benchmark datasets (TDT and TREC) show significant performance improvements with the proposed approach, over the results of other state-of-the-art methods.



Yiming Yang is a professor in the Language Technologies Institute and the Machine Learning Department in the School of Computer Science at Carnegie Mellon University.  Her research has centered on statistical learning methods and their applications to a broad range of challenging problems, including large-scale text categorization, utility (relevance and novelty) based retrieval and adaptive filtering, personalization and active learning for recommendation systems, social network analysis for personalized email prioritization, etc.


Nov 18

Michael "Fuzzy" Mauldin
Former LTI faculty

Turning your ideas into money by taking your company public


Between 1994 and 1996 I created the Lycos search engine while a faculty member at the LTI at Carnegie Mellon. Under the auspices of the CMU Technology Transfer Office, the University and I successfully sold Lycos to a venture capital firm. Less than nine months later, Lycos was the fastest company ever to go public in the United States. Using my experiences with Lycos as an example, I will discuss company ownership from the entrepreneur's point of view. I will also describe the process of taking a private company public through the IPO process.



Michael "Fuzzy" Mauldin earned his BA in CompSci from Rice University in 1981 and his PhD in CompSci from Carnegie Mellon in 1989. While a junior faculty member of the Language Technologies Institute, Dr. Mauldin created the Lycos search engine. Today he is retired and raises beef cattle at his Austin, Texas ranch. He also builds combat robots for the BattleBots and RoboGames competitions.


Dec 2

Larry Birnbaum

Knight News Innovation Laboratory
Northwestern University

From Contextual Search to Automatic Content Generation:
Scaling Human Editorial Judgment


Systems that present people with information inescapably make editorial judgments in determining what information to show and how to show it.  However, the editorial values used to make these determinations are generally invisible to users and in many cases even to the engineers who design the systems.  This talk describes some of the problems that this creates, and presents some approaches to providing explicit and visible editorial control in news and media information systems.


I’ll also talk about our recent work on automatically generating stories from data using human-authored editorial models.  A system based on this technology is already generating more than 10,000 stories weekly in areas ranging from sports to business.  This system is the nation’s most prolific and most widely published author of, among other things, women’s collegiate softball stories.  The stories compare favorably to those written by human beings.



Larry Birnbaum is Associate Professor of Electrical Engineering and Computer Science and of Journalism at Northwestern University, Director of the Medill-McCormick Center for Innovation in Technology, Media and Journalism at Northwestern, and serves on the management committee of its Knight News Innovation Laboratory.  He is also the Chief Scientific Advisor of Narrative Science, a Chicago-based startup.  Larry received his B.S., M.S., and Ph.D. in Computer Science from Yale.


Dec 9

Kevin Knight


Language Translation and Code-Breaking


In 1949, Warren Weaver suggested applying cryptanalysis methods to the problem of automatic language translation.  He said: "When I look at an article in Russian, I say: this is really written in English, but it has been coded in some strange symbols. I will now proceed to decode".


Weaver's inspiration has borne fruit in this century, as statistical techniques have enabled us to build translation systems for many languages, with increasing accuracy.  But other fruitful connections between code-breaking and translation are only starting to emerge.  This talk will examine some: estimating the amount of data required to break a cipher, building translation systems without parallel data, and solving a previously undeciphered manuscript from the 1730s.



Kevin Knight is a Senior Research Scientist and Fellow at the University of Southern California's Information Sciences Institute, and a Research Professor in the Computer Science Department at USC.  He received a Ph.D. in computer science from Carnegie Mellon University and a bachelor's degree from Harvard University.  His research interests include natural language processing, statistical modeling, machine translation, language generation, and decipherment.  He currently serves as president of the Association for Computational Linguistics.