Language Technologies Institute
Carnegie Mellon University
School of Computer Science

Jan 20

Dipanjan Das


Multilingual Guidance for Unsupervised Linguistic Structure Prediction

Learning linguistic analyzers from unannotated data remains a major challenge; can multilingual text help? In this talk, I will describe learning methods that use unannotated data in a target language along with annotated data in more resource-rich "helper" languages. I will focus on two lines of work. First, I will describe a graph-based semi-supervised learning approach that uses parallel data to learn part-of-speech tag sequences through type-level lexical transfer from a helper language. Second, I will examine a more ambitious goal of learning part-of-speech sequences and dependency trees from raw text, leveraging parameter-level transfer from helper languages, but without any parallel data. Both approaches result in significant improvements over strong state-of-the-art monolingual unsupervised baselines.

Bio: Dipanjan Das is a Ph.D. student at the Language Technologies Institute, School of Computer Science at Carnegie Mellon University. He works on statistical natural language processing under the mentorship of Noah Smith. He finished his M.S. at the same institute in 2008, conducting research on language generation with Alexander Rudnicky. Das completed his undergraduate degree in 2005 from the Indian Institute of Technology, Kharagpur, where he received the best undergraduate thesis award in Computer Science and Engineering and the Dr. B.C. Roy Memorial Gold Medal for best all-round performance in academics and co-curricular activities. He worked at Google Research, New York as an intern in 2010 and received the best paper award at the ACL 2011 conference.

Feb 3

Pedro Moreno


Google's Speech Internationalization Project:
From 1 to 300 Languages and Beyond

The speech team at Google has built speech recognition systems in more than 30 languages in little more than 2 years. In this talk we will describe the history of this project and, more interestingly, the technologies that have been developed to achieve this goal. I'll explore some of the acoustic modeling, lexicon, language modeling, and infrastructure techniques, and even the social engineering techniques, used to pursue our ultimate goal: to build speech recognition systems in the top 300 languages of the planet.

Bio: Dr. Pedro J. Moreno leads the speech global engineering group at the Android division of Google. His team is in charge of deploying speech recognition services in as many languages as possible. He joined Google 7 years ago after working as a research scientist at HP Labs, where he worked mostly on audio indexing systems. Dr. Moreno completed his Ph.D. studies at Carnegie Mellon University under the direction of Prof. Richard Stern. His work there was focused on noise robustness in speech recognition systems. His Ph.D. studies were sponsored by a Fulbright scholarship. Before that he completed an Electrical Engineering degree at Universidad Politecnica de Madrid, Spain.

Feb 10

William Cohen


Fast Effective Clustering for Graphs and Documents

We describe two new methods for clustering nodes in a graph. The first method is simple to implement, easily parallelized, and very fast: on a single machine, it runs in time linear in the number of edges in the graph. Experimentally, the method leads to clusterings that are comparable in quality to those produced by widely used spectral methods (e.g., the Normalized Cut algorithm), even though it is much faster. The second method is based on building a probabilistic model of the graph, and has a complementary set of advantages: while not as amenable to parallelization, it also typically runs in time linear in the number of graph edges, and is well-suited to extensions that incorporate differing clustering criteria or outside information about node similarities. We also discuss extensions to the methods for graphs associated with text corpora. This is joint work with Frank Lin and Ramnath Balasubramanyan.

Bio: William Cohen received his bachelor's degree in Computer Science from Duke University in 1984, and a PhD in Computer Science from Rutgers University in 1990. From 1990 to 2000 Dr. Cohen worked at AT&T Bell Labs and later AT&T Labs-Research, and from April 2000 to May 2002 Dr. Cohen worked at Whizbang Labs, a company specializing in extracting information from the web. Dr. Cohen is President of the International Machine Learning Society, an Action Editor for the Journal of Machine Learning Research, and an Action Editor for the journal ACM Transactions on Knowledge Discovery from Data. He is also an editor, with Ron Brachman, of the AI and Machine Learning series of books published by Morgan & Claypool. In the past he has also served as an action editor for the journal Machine Learning, the journal Artificial Intelligence, and the Journal of Artificial Intelligence Research. He was General Chair for the 2008 International Machine Learning Conference, held July 6-9 at the University of Helsinki, in Finland; Program Co-Chair of the 2006 International Machine Learning Conference; and Co-Chair of the 1994 International Machine Learning Conference. Dr. Cohen was also the co-Chair for the 3rd Int'l AAAI Conference on Weblogs and Social Media, which was held May 17-20, 2009 in San Jose, and was the co-Program Chair for the 4th Int'l AAAI Conference on Weblogs and Social Media, which will be held May 23-26 at George Washington University in Washington, D.C. He is a AAAI Fellow, and in 2008, he won the SIGMOD "Test of Time" Award for the most influential SIGMOD paper of 1998.

Dr. Cohen's research interests include information integration and machine learning, particularly information extraction, text categorization and learning from large datasets. He holds seven patents related to learning, discovery, information retrieval, and data integration, and is the author of more than 180 publications.

Feb 17

Benjamin Snyder

University of Wisconsin-Madison

Harnessing Dozens of Languages for Robust Language Technology

The written word plays a greater role in human communication than at any point in world history. As modern technology infrastructure spreads throughout the world, the quantity of electronic text, written in hundreds of different languages, continues to grow in size and diversity. While language processing technologies have been steadily maturing for English, progress on most languages has been slow, due to the paucity of data and research.
In this talk I will present my work on multilingual NLP. The key idea is that by jointly modeling a broad array of languages, apparent ambiguities can be resolved by building generic and universally plausible models of human language. I will talk about the application of this idea to several longstanding problems in NLP, including part-of-speech induction, computational decipherment of lost languages, and morphological induction.
I will also present the new task of unsupervised grapheme-to-phoneme prediction (as a first step towards robust and general decipherment methods). In this task, we are given an unknown language written using a Latin alphabet, and must predict the set of phonemes associated with each letter. By harnessing data from over a hundred languages, we build a model which relates patterns of symbols in text to plausible phonetic interpretations with high accuracy.
If time permits, I will describe some current work on childhood grammar and language development using tools from machine translation.

Bio: Benjamin Snyder is an Assistant Professor at the University of Wisconsin-Madison in the Department of Computer Sciences. His research interests include natural language processing, machine learning, and cognitive science. Ben received a B.A. in philosophy from the University of Pennsylvania in 2003, and a Ph.D. in computer science from MIT in 2010. His dissertation, which focuses on multilingual statistical models and the computational decipherment of lost languages, received the ACM 2010 Dissertation Award honorable mention.

Mar 23

Tie-Yan Liu

Microsoft Research Asia

Computational Advertising: Challenges and Opportunities

Computational advertising is a newly emerged research discipline, which studies the algorithms and theories for online advertising. Computational advertising lies in the intersection of information retrieval, machine learning, and game theory. However, due to its unique properties, the conventional technologies in the aforementioned areas might not be sufficient to handle the new problems in computational advertising. New principles, models, and theories need to be developed. In this talk, I will first give a brief introduction to online advertising (mainly from a business perspective) and computational advertising (from a research perspective). Then I will discuss the key differences between computational advertising and information retrieval, machine learning, as well as game theory, followed by the proposal of several new research directions, like game-theoretic machine learning and statistical game theory. After that, I will introduce several on-going projects in my group along these directions, including attractiveness-based ad click prediction, learning to auction, and data-driven advertiser modeling. At the end of the talk, I will discuss the future evolution of computational advertising as a research discipline, and online advertising as a business model.

Bio: Tie-Yan Liu is a lead researcher of Microsoft Research Asia, leading the Internet Economics & Computational Advertising group. His research interests include learning to rank, large-scale graph ranking, and Internet economics. So far, he has authored two books, more than 70 journal and conference papers, and nearly 30 granted US / international patents. He is the co-author of the best student paper for SIGIR (2008) and the most cited paper for the Journal of Visual Communication and Image Representation (2004~2006). He is a program committee co-chair of RIAO (2010), a demo/exhibit co-chair of KDD (2012), a track chair of WWW (2011), an area chair of SIGIR (2008~2011) and AIRS (2009~2011), and a co-chair of several workshops at SIGIR, ICML, and NIPS. He is an associate editor of ACM Transactions on Information Systems (TOIS) and an editorial board member of several other journals including Information Retrieval and ISRN Artificial Intelligence. He is a keynote speaker at PCM (2010) and CCIR (2011), a plenary panelist of KDD (2011), and a tutorial speaker at several conferences including SIGIR and WWW. Prior to joining Microsoft, he obtained his Ph.D. in electronic engineering from Tsinghua University. He is a senior member of the IEEE and a member of the ACM.

Mar 23

Hans Uszkoreit

German Research Ctr. for AI

Learning Relation Extraction Rules from Massive Data

The talk will report on an information extraction platform that combines named-entity detection, generic parsing and statistical confidence estimation for learning large sets of rules that can extract instances of given n-ary relations from free texts. For precision-critical applications that do not need to recognize all mentions, supervised learning approaches often suffice. For recall-critical applications, supervised learning usually misses most of the notorious long tail of patterns. In order to improve recall, two methods have been employed. One of them is minimally supervised learning starting with a small set of examples as semantic seed. More instances and rules are then learned by bootstrapping. The other method is distantly supervised learning, starting with a large set of examples serving as a massive seed. In my talk I want to compare the two methods as alternatives on the same relation extraction platform. On the basis of our empirical findings, I will argue that at least for some relations, distant supervision learning on the Web provides a better basis for attacking the long tail. Both methods are faced with the problem of learning many wrong rules, seriously damaging precision. I will present three approaches to filtering incorrect rules: regular confidence estimation, implicit negative information through closed-world seed knowledge and negative information obtained by the parallel rule-learning for multiple relations.

Bio: Hans Uszkoreit is Scientific Director and Head of the Language Technology Lab at DFKI, and at the same time has been Professor of Computational Linguistics and Computer Science at Saarland U. since 1988. He received his PhD in 1984 from the U. of Texas at Austin. As a student he worked two years for the machine translation project METAL. He later held research positions at Stanford U., SRI in Menlo Park, and IBM Germany in Stuttgart. He is Past President of the European Association of Logic, Language and Information, Member of the European Academy of Sciences, the International Committee for Computational Linguistics, the ELRA Board and various advisory and editorial boards. Uszkoreit is also co-founder and board member of several LT spin-off companies. Since 2009, he has been Coordinator of the European Network of Excellence META-NET, which currently has 57 European research centers in 33 countries. His research is documented in more than 150 publications in computational linguistics, language technology and related fields. His current research interests are information extraction, machine translation and other language technology applications.

Mar 30

Joe Reisinger

University of Texas at Austin

Latent Variable Models of Distributional Lexical Semantics

In order to respond to increasing demand for natural language interfaces (and provide meaningful insight into user query intent), fast, scalable lexical semantic models with flexible representations are needed. Human concept organization is a rich phenomenon that has yet to be accounted for by a single coherent psychological framework: Concept generalization is captured by a mixture of prototype and exemplar models, and local taxonomic information is available through multiple overlapping organizational systems. Previous work in computational linguistics on extracting lexical semantic information from the Web does not provide adequate representational flexibility and hence fails to capture the full extent of human conceptual knowledge. In this talk I will outline two probabilistic models that can account for some of the rich organizational structure found in human language: (1) a background clustering model of polysemy and (2) a hierarchical LDA-based approach to modeling concept organization. These models can be used to predict contextual variation, selectional preference and feature-saliency norms to a much higher degree of accuracy than previous approaches, and have the potential for improving question answering, text classification, machine translation, and information retrieval.

Bio: Joe Reisinger is a PhD candidate in Computer Science at the University of Texas at Austin. His research interests include large-scale latent variable modeling, structured information extraction, lexical semantics and econometric modeling. Joe was the recipient of the 2010 Google Research Fellowship in NLP and previously held an NSF Graduate Research Fellowship. Prior to joining UT, he worked at IBM T.J. Watson Research Center and IBM Yamato, and more recently has completed several internships at Google Research in Mountain View.

Apr 6

John McDonough

LTI and Voci Technologies, Inc.

Distant Speech Recognition: No Black Boxes Allowed

A complete system for distant speech recognition (DSR) typically consists of several distinct components. While it is tempting to isolate and optimize each component individually, experience has proven that such an approach cannot lead to optimal performance. In this talk, I will discuss several examples of the interactions between the individual components of a DSR system. In addition, I will describe the synergies that become possible as soon as each component is no longer treated as a "black box". To wit, instead of treating each component as having solely an input and an output, it is necessary to "peel back the lid" and look inside. It is only then that it becomes apparent how the individual components of a DSR system can be jointly optimized to obtain the best possible performance.
Among the components I will discuss are:
1. The speaker tracking system used to estimate speakers' physical locations;
2. Beamforming required to combine several signals from a microphone array to emphasize desired speech while suppressing noise and interference;
3. Postfiltering applied to the output of the beamformer for further enhancement;
4. The recognition engine, which turns an enhanced signal into a set of word hypotheses;
5. The speaker adaptation component for adapting to the individual characteristics of a given speaker.
I will also briefly discuss other necessary components, such as those required for detecting focus of attention, and voice prompt suppression. All of these technologies will grow in importance as DSR systems are deployed in automotive, robotics, and manufacturing applications, where automation will be used to achieve cooperative, synergistic, man-machine interactions intended to accomplish shared goals.

Bio: John McDonough has been doing research in automatic speech recognition since 1993 when he joined BBN after completing his Master's at Rensselaer Polytechnic Institute in 1992. In 1997 he returned to graduate school, and received his PhD under Fred Jelinek at Johns Hopkins University in 2000. He then worked at the University of Karlsruhe and Saarland University, where he established courses on distant speech recognition. John supervised all speech and audio technologies research for the EU project CHIL, Computers in the Human Interaction Loop, and co-wrote a book on Distant Speech Recognition during that time. Beginning in February 2010, John spent a year at Disney Research Pittsburgh founding a research effort in distant speech recognition. Since January 2011, he has been a visiting scientist at the Language Technologies Institute at Carnegie Mellon University. He also works with Voci Technologies, Inc., a local CMU startup, where he applies finite-state transducer techniques to hardware-accelerated speech recognition applications.

Apr 13

Gerald Friedland


Cybercasing the Joint: Language Technologies, Multimedia Retrieval, and Online Privacy.

In this talk, I present recent case studies that highlight the potential for (multimedia) retrieval of online (social network) data to support real-world attacks. Both language-based and multimedia-based retrieval have rapidly emerged as fields with highly useful applications in many different domains. Researchers from different areas in signal processing and computer science have invested significant effort into the development of convenient and efficient retrieval mechanisms. While retrieval speed, flexibility, and accuracy are still research problems, this talk will demonstrate that they are not the only ones. This talk aims to raise awareness of a rapidly emerging privacy threat that we termed "cybercasing": leveraging information available online to mount real-world attacks. Based on the initial example of geo-tagging, I will show that while users typically realize that sharing information, e.g., on social networks, has some implications for their privacy, many users 1) are unaware of the full scope of the threat they face when doing so, and 2) often do not even realize when they publish such information. The threat is elevated by recent developments that make systematic search for information (either posted by humans or by sensors) and inference from multiple sources easier than ever before. However, even with relatively high error rates, retrieval techniques can be used effectively for different real-world attacks by using "lop-sided" tuning, for example, by favoring low false alarm rates over high hit rates when scanning for potential victims to attack. This talk presents a set of scenarios demonstrating how easy it is to correlate data [4], especially those based on location information, with corresponding publicly available information for compromising a victim's privacy.
[1] G. Friedland, O. Vinyals, T. Darrell: "Multimodal Location Estimation", Proceedings of ACM Multimedia 2010, pp. 1245-1251, Florence, Italy, October 2010.
[2] H. Lei, J. Choi, A. Janin, and G. Friedland: "Persona Linking: Matching Uploaders of Videos Across Accounts", IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Prague, May 2011.
[3] G. Friedland, R. Sommer: "Cybercasing the Joint: On the Privacy Implications of Geotagging", Usenix HotSec 2010 at the Usenix Security Conference, Washington DC, August 2010.
[4] Gerald Friedland, Gregor Maier, Robin Sommer, Nicholas Weaver: Sherlock Holmes's Evil Twin: On The Impact of Global Inference for Online Privacy, New Security Paradigms Workshop, Marin County, CA, 2011.

Bio: Dr. Gerald Friedland is a senior research scientist at the International Computer Science Institute, a private lab affiliated with the University of California, Berkeley, where he leads multimedia content analysis research, mostly focusing on ("non-speech, non-music") acoustic techniques as an aid for video analysis. He is currently leading a group of 6 multimedia researchers supported by NSF, DARPA, IARPA, and industry grants. Gerald has published more than 100 peer-reviewed articles in conferences, journals, and books and is currently authoring a new textbook on multimedia computing together with Dr. Ramesh Jain. Gerald co-founded the IEEE International Conference on Semantic Computing and is a proud founder and program director of the IEEE International Summer School on Semantic Computing at UC Berkeley. He is an associate editor for ACM Transactions on Multimedia Computing, Communications, and Applications, and is on the organizing committee of ACM Multimedia 2011, 2012, and 2014. He also serves as TPC Co-Chair of IEEE ICME 2012. He is the recipient of several research and industry recognitions, among them the European Academic Software Award and the Multimedia Entrepreneur Award by the German Federal Department of Economics. Most recently, he led the team that won the ACM Multimedia Grand Challenge in 2009. Gerald received his master's degree and doctorate (summa cum laude) in computer science from Freie Universitaet Berlin, Germany, in 2002 and 2006, respectively.

Apr 27

Kevyn Collins-Thompson

Microsoft Research Redmond

Not Just for Kids: Enriching Information Retrieval with Reading Level Metadata

A document isn't relevant - at least, not immediately - if you can't understand it, yet search engines have traditionally ignored the problem of finding content at the right level of difficulty as an aspect of relevance. Moreover, little is currently known about the nature of the Web, its users, and how users interact with content when seen through the lens of reading difficulty. I'll present our recent research progress in combining reading difficulty prediction with information retrieval, including models, algorithms and large-scale data analysis. Our results show how the availability of reading level metadata - especially in combination with topic metadata - opens up new and sometimes surprising possibilities for enriching search systems, from personalizing Web search results by reading level to predicting user and site expertise, improving result caption quality, and estimating searcher motivation.
This talk includes joint work with Paul N. Bennett, Ryen White, Susan Dumais, Jin Young Kim, Sebastian de la Chica, and David Sontag.

Bio: Kevyn Collins-Thompson is a Researcher in the Context, Learning and User Experience for Search (CLUES) group at Microsoft Research (Redmond). His research lies in an area combining information retrieval, machine learning, and computational linguistics, and focuses on models, algorithms, and evaluation methods for making search technology more reliable and effective. His recent work has explored algorithms and Web search applications for reading level prediction; optimization strategies that reduce the risk of applying risky retrieval algorithms like personalization and automatic query rewriting; and educational applications of IR such as intelligent tutoring systems. Kevyn received his Ph.D. and M.Sc. from the Language Technologies Institute at Carnegie Mellon University and B.Math from the University of Waterloo.

May 4

Eric Xing


Jointly Maximum Margin and Maximum Entropy Learning of Graphical Models

Graphical models (GMs) offer a powerful language to elegantly define expressive distributions, and a generic computational framework to support reasoning under uncertainty in a wide range of problems. Popular paradigms for training GMs include maximum likelihood estimation and, more recently, max-margin learning; each enjoys some advantages as well as weaknesses. For example, a maximum margin structured prediction model such as M3N lacks a straightforward probabilistic interpretation of the learning scheme and the prediction rule. Therefore, its unique advantages such as support vector sparsity and kernel tricks cannot be easily conjoined with the merits of a probabilistic model such as Bayesian regularization, model averaging, and the ability to model hidden variables.

In this talk, I present a new general framework called Maximum Entropy Discrimination Markov Networks (MEDN), which integrates the margin-based and likelihood-based approaches and combines and extends their merits. This new learning paradigm naturally facilitates integration of the generative and discriminative principles under a unified framework, and the basic strategies can be generalized to learn arbitrary GMs, such as generative Bayesian networks, models with structured hidden variables, and even nonparametric Bayesian models, with a desirable maximum margin effect on structured or unstructured predictions. I will discuss a number of theoretical properties of this approach, and show applications of MEDN to learning a wide range of GMs, including fully supervised structured i/o models, max-margin structured i/o models with hidden variables, and a max-margin LDA-style model for jointly discovering 'discriminative' latent topics and predicting the label/score of text documents, or the total scene and object categories in natural images. Our empirical results strongly suggest that, for any GM with structured or unstructured labels, MEDN always leads to a more accurate predictive GM than the one trained under either MLE or Max Margin.

Joint work with Jun Zhu.

Bio: Dr. Eric Xing is an associate professor in the School of Computer Science at Carnegie Mellon University. His principal research interests lie in the development of machine learning and statistical methodology, especially for solving problems involving automated learning, reasoning, and decision-making in high-dimensional and dynamic possible worlds, and for building quantitative models and predictive understandings of biological systems. Professor Xing received a Ph.D. in Molecular Biology from Rutgers University, and another Ph.D. in Computer Science from UC Berkeley. His current work involves 1) foundations of statistical learning, including theory and algorithms for estimating time/space varying-coefficient models, sparse structured input/output models, and nonparametric Bayesian models; 2) computational and statistical analysis of gene regulation, genetic variation, and disease associations; and 3) application of statistical learning in social networks, computer vision, and natural language processing. Professor Xing has published over 140 peer-reviewed papers, and is an associate editor of the Annals of Applied Statistics, the IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), and the PLoS Journal of Computational Biology, an Action Editor of the Machine Learning journal, and a member of the DARPA Information Science and Technology (ISAT) Advisory Group. He is a recipient of the NSF Career Award, the Alfred P. Sloan Research Fellowship in Computer Science, the United States Air Force Young Investigator Award, and best paper awards in a number of premier conferences including UAI, ACL, SDM, and ISMB.

Language Technologies Institute • 5000 Forbes Ave • Pittsburgh, PA 15213-3891 • (412) 268-6591