LTI

In this talk, I will give an overview of several research projects at MSR aimed at building an open-domain neural dialogue system. We group dialogue bots into three categories based on users' goals: task-completion bots, information-access bots, and social bots. We explore different neural network models and deep reinforcement learning techniques to build response generation engines for all three types of bots. I will review our experimental settings and recent results with both simulated and real users, share the lessons we have learned, and discuss future work.

Jianfeng Gao is a Partner Research Manager in the Deep Learning Technology Center (DLTC) at Microsoft Research, Redmond.  He works on deep learning for text and image processing and leads the development of AI systems for dialogue, machine reading comprehension, question answering, and enterprise applications.  He has developed a series of deep semantic similarity models (DSSM, a.k.a. Sent2Vec), which have been used for a wide range of text and image processing tasks.

From 2006 to 2014, he was a Principal Researcher in the Natural Language Processing Group at Microsoft Research, Redmond, where he worked on Web search, query understanding and reformulation, ads prediction, and statistical machine translation.  From 2005 to 2006, he was a Research Lead in the Natural Interactive Services Division at Microsoft, where he worked on Project X, an effort to develop a natural user interface for Windows.  From 1999 to 2005, he was a Research Lead in the Natural Language Computing Group at Microsoft Research Asia, where, together with his colleagues, he developed the first Chinese speech recognition system released with Microsoft Office, the Chinese/Japanese Input Method Editors (IME), which were the leading products in the market, and the natural language platform for Windows Vista.

While acoustic signals are continuous in nature, the ways that humans generate pitch in speech and music involve important discrete decisions.  As a result, models of pitch must resolve a tension between continuous and combinatorial structure.  Similarly, interpreting images of printed documents requires reasoning about both continuous pixels and discrete characters.  Focusing on several different tasks that involve human artifacts, I'll present probabilistic models with this goal in mind. 

First, I'll describe an approach to historical document recognition that uses a statistical model of the historical printing press to reason about images, and, as a result, is able to decipher historical documents in an unsupervised fashion.  Based on this approach, I'll also demonstrate a related model that accurately predicts compositor attribution in the First Folio of Shakespeare.  Next, I'll present an unsupervised system that transcribes acoustic piano music into a symbolic representation by jointly describing the discrete structure of sheet music and the continuous structure of piano sounds.  Finally, I'll present a supervised method for predicting prosodic intonation from text that treats discrete prosodic decisions as latent variables, but directly models pitch in a continuous fashion.

Taylor Berg-Kirkpatrick joined the Language Technologies Institute at Carnegie Mellon University as an Assistant Professor in Fall 2016.  Previously, he was a Research Scientist at Semantic Machines Inc and, before that, completed his Ph.D. in computer science at the University of California, Berkeley. Taylor's research focuses on using machine learning to understand structured human data, including language but also sources like music, document images, and other complex artifacts.

Faculty Host/Instructor: Alex Hauptmann

Information retrieval and machine learning approaches run in the background of most of the applications we use in our daily digital lives.  The assistance they provide is manifold, but it relies on a set of core content processing tasks that require compatible representation formalisms, a condition that rarely holds in real-world scenarios.  This talk is concerned with shared representation formalisms for information encoded in heterogeneous modalities.  The heterogeneity may result from intra-modal variety, such as text in different languages within the modality of natural language, or from the different modalities themselves, as when relating text to images or to knowledge graphs.  I will discuss three ways to obtain a joint representation of heterogeneously represented content.  The first is based on explicit semantics as encoded in knowledge graphs, the second extends this approach by adding implicit semantics extracted from large data sets, and the third relies on joint learning without utilizing explicit semantics.  The presented approaches contribute to the long-standing challenge of breaking the language and modality barriers in order to enable joint semantic processing of content in originally incompatible representation formalisms.

Achim Rettinger is a KIT Junior Research Group Leader at AIFB, where he heads the Adaptive Data Analytics team.  His research areas include Data Mining, Information Extraction, Knowledge Discovery, Ontology Learning, Machine Learning, Human Computer Systems, and Text Mining.

Joint video-language modeling has attracted increasing attention in recent years, signifying a return to early AI goals of cooperative cognitive systems.  However, many approaches fail to leverage the complementarity and structure across vision and language.  For example, they may rely on a fixed visual model or fail to exploit the compositional semantics inherent in language.  In this talk, I will discuss work that seeks to explicitly capture structure jointly across modalities, and to capture this structure at a low level.  The work explores sparse modeling as a means of bridging vision and language: low-level models that capture a joint, generative embedding using paired and compositional dictionary learning.  We also overcome a historical limitation of such sparse models by showing how they can be embedded directly within a deep artificial neural network.  Results for both of these lines of work will be presented and discussed in detail.
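
To make the dictionary-learning idea concrete, here is a minimal, hedged sketch of paired dictionary learning (not the speaker's actual model): a single sparse dictionary is learned over concatenated image and text features, and the per-modality blocks of that dictionary then allow cross-modal inference.  The feature dimensions, regularization values, and random stand-in data are illustrative assumptions.

    import numpy as np
    from sklearn.decomposition import DictionaryLearning
    from sklearn.linear_model import Lasso

    # Toy paired data: each row pairs an image feature with a text feature
    # (random stand-ins; real features would come from vision/language models).
    rng = np.random.default_rng(0)
    n_pairs, d_img, d_txt = 200, 64, 32
    X_img = rng.standard_normal((n_pairs, d_img))
    X_txt = rng.standard_normal((n_pairs, d_txt))
    X_joint = np.hstack([X_img, X_txt])         # concatenate modalities per pair

    # Learn one shared dictionary over the concatenated space (paired dictionary learning).
    dl = DictionaryLearning(n_components=48, alpha=1.0, max_iter=50, random_state=0)
    codes = dl.fit_transform(X_joint)           # sparse codes shared by both modalities
    D = dl.components_                          # shape (48, d_img + d_txt)
    D_img, D_txt = D[:, :d_img], D[:, d_img:]   # per-modality dictionary blocks

    # Cross-modal inference: given only an image feature, infer its sparse code
    # against the image block, then reconstruct the corresponding text feature.
    lasso = Lasso(alpha=0.1, max_iter=5000)
    lasso.fit(D_img.T, X_img[0])                # solve x_img ~= D_img.T @ code
    code = lasso.coef_
    txt_reconstruction = code @ D_txt           # predicted text-side feature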

Jason Corso is an associate professor of Electrical Engineering and Computer Science at the University of Michigan.  He received his PhD and MSE degrees at The Johns Hopkins University in 2005 and 2002, respectively, and the BS Degree with honors from Loyola College in Maryland in 2000, all in Computer Science.  He spent two years as a post-doctoral fellow at the University of California, Los Angeles. 

From 2007 to 2014 he was a member of the Computer Science and Engineering faculty at SUNY Buffalo.  He is the recipient of a Google Faculty Research Award (2015), the Army Research Office Young Investigator Award (2010), the NSF CAREER Award (2009), and the SUNY Buffalo Young Investigator Award (2011); he was a member of the 2009 DARPA Computer Science Study Group and received the Link Foundation Fellowship in Advanced Simulation and Training (2003).  Corso has authored more than one hundred peer-reviewed papers on topics including computer vision, robot perception, data science, and medical imaging.  He is a member of the AAAI, ACM, and MAA, and a senior member of the IEEE.

Faculty Host: Alexander Hauptmann

Interaction in rich natural language enables people to exchange thoughts efficiently and come to a shared understanding quickly. Modern personal intelligent assistants such as Apple's Siri and Amazon's Echo all use conversational interfaces as their primary communication channels, and they illustrate a future in which getting help from a computer is as easy as asking a friend. However, despite decades of research, modern conversational assistants are still limited in domain, expressiveness, and robustness. In this thesis, we take an alternative approach that blends real-time human computation with artificial intelligence to reliably engage in conversations. Instead of bootstrapping automation from the bottom up with only automatic components, we start with our crowd-powered conversational assistant, Chorus, and create a framework that enables Chorus to automate itself over time. Each of Chorus' responses is proposed and voted on by a group of crowd workers in real time.

Toward the goal of full automation, we (i) augmented Chorus' capabilities by connecting it with sensors and effectors on smartphones so that users can safely control them via conversation, and (ii) deployed Chorus to the public as a Google Hangouts chatbot to collect a large corpus of conversations to help speed automation. The deployed Chorus also provides a working system for experimenting with automated approaches. In the future, we will (iii) create a framework that enables Chorus to automate itself over time by automatically obtaining response candidates from multiple dialog systems and selecting appropriate responses based on the current conversation. Over time, the automated systems will take over more responsibility in Chorus, not only helping us deploy robust conversational assistants before we know how to automate everything, but also allowing us to drive down costs and gradually reduce reliance on the crowd.
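
As a rough illustration of the voting step described above (a minimal sketch, not the thesis implementation; the agreement threshold and example data are made up), a reply can be selected from a pool of candidates, whether proposed by workers or by automated dialog systems, by tallying worker votes and deferring when agreement is too low:

    from collections import Counter

    def select_response(candidates, votes, min_agreement=0.4):
        """Return the candidate with the most votes, or None if agreement is too low."""
        if not votes:
            return None
        tally = Counter(votes)                          # votes: indices into candidates
        best_idx, best_count = tally.most_common(1)[0]
        if best_count / len(votes) < min_agreement:
            return None                                 # defer: no candidate is trusted enough
        return candidates[best_idx]

    # Three candidate replies (crowd- or machine-proposed) and five worker votes.
    candidates = ["Sure, what time works for you?",
                  "I can look that up for you.",
                  "Sorry, could you rephrase that?"]
    print(select_response(candidates, votes=[0, 0, 1, 0, 2]))   # -> first candidate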

Thesis Committee:
Jeffrey P. Bigham (Chair)
Alexander Rudnicky
Niki Kittur
Walter S. Lasecki (University of Michigan)
Chris Callison-Burch (University of Pennsylvania)

Copy of Proposal Document

Adam Berger is an accomplished technology executive and team leader in the software arena. He founded and grew two companies, both of which extend the capabilities of mobile devices beyond voice into compelling, usable new data services.  He has also worked within two world-leading information-technology firms, Nokia and IBM.

Along with three other Ph.D. students from Carnegie Mellon, Berger co-founded Eizel Technologies Inc. in 2000.  Eizel was a venture-backed software firm that developed a corporate mobile email system.  The mobile phone company Nokia purchased Eizel in 2003 and made it a component of its newly-formed Enterprise Systems division.  Within Nokia, Berger and the team grew the Eizel product line into the Nokia One Business Server™ product.  At Nokia, Berger led an innovation team that interfaced between U.S. and European offices to bring product and intellectual property concepts from the research and venturing wing into the product units.  Berger left Nokia to co-found Penthera Technologies in 2005, where he served as chief technology officer, vice president, and director.  In 2007 Berger co-founded Penthera Partners and participated in the management buyout of the assets of Penthera Technologies.

During his years in the software industry, Berger has served as a public speaker at major venues and technology advisor and board member to startups and technology investment firms.  He holds degrees in physics and computer science from Harvard University, and a Ph.D. in computer science from Carnegie Mellon University.  He has been a recipient of an IBM Graduate Fellowship, a Harvard College Scholarship and a Thomas J. Watson Fellowship.  Berger has published more than 20 refereed papers and holds 10 U.S. patents.

Duolingo is a language education platform that teaches 20 languages to more than 150 million students worldwide.  Our free flagship learning app is the #1 way to learn a language online, and is the most-downloaded education app for both Android and iOS devices.  In this talk, I will describe the Duolingo system and several of our empirical research projects to date, which combine machine learning with computational linguistics and psychometrics to improve learning, engagement, and even language proficiency assessment through our products.

Burr Settles develops and studies statistical machine learning systems with applications in human language, biology, and social science. Currently, he is most excited about using these technologies to help people learn languages and make music.

Faculty Host: Alex Hauptmann

Language is socially situated:  both what we say and what we mean depend on our identities, our interlocutors, and the communicative setting.  The first generation of research in computational sociolinguistics focused on large-scale social categories, such as gender.  However, many of the most socially salient distinctions are locally defined.  Rather than attempt to annotate these social properties or extract them from metadata, we turn to social network analysis, which has been only lightly explored in traditional sociolinguistics.  I will describe three projects at the intersection of language and social networks.  

First, I will show how unsupervised learning over social network labelings and text enables the induction of social meanings for address terms, such as "Ms" and "dude".  Next, I will describe recent research that uses social network embeddings to induce personalized natural language processing systems for individual authors, improving performance on sentiment analysis and entity linking even for authors for whom no labeled data is available.  Finally, I will describe how the spread of linguistic innovations can serve as evidence for sociocultural influence, using a parametric Hawkes process to model the features that make dyads especially likely or unlikely to be conduits for language change.
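
For readers unfamiliar with Hawkes processes, a standard parametric form of the intensity function is sketched below (the notation is chosen here for illustration; the talk's exact parameterization may differ).  The rate at which speaker i adopts a linguistic innovation rises after each earlier use by speakers who influence i:

    \lambda_i(t) = \mu_i + \sum_{j : t_j < t} \alpha_{u_j \to i} \, \kappa(t - t_j), \qquad \kappa(\Delta) = \omega e^{-\omega \Delta}

Here \mu_i is speaker i's base rate, \alpha_{u_j \to i} is the influence of the author of the j-th prior event on i (which can be parameterized by dyad features such as social-network ties), and \kappa is an exponentially decaying kernel.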

Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech.   He works on statistical natural language processing, focusing on computational sociolinguistics, social media analysis, discourse, and machine learning.  He is a recipient of the NSF CAREER Award, a member of the Air Force Office of Scientific Research (AFOSR) Young Investigator Program, and was a SICSA Distinguished Visiting Fellow at the University of Edinburgh.  His work has also been supported by the National Institutes of Health, the National Endowment for the Humanities, and Google.  Jacob was a postdoctoral researcher at Carnegie Mellon and the University of Illinois.  He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.  Jacob's research has been featured in the New York Times, National Public Radio, and the BBC. Thanks to his brief appearance in If These Knishes Could Talk, Jacob has a Bacon number of 2.

Faculty Host: Alex Hauptmann

One of the desiderata for machine intelligence is that computers must be able to comprehend sounds as humans do.  They must know about various sounds; associate them with physical objects, entities, or events; be able to recognize and categorize them; know or discover relationships between them; and so on.  Successful solutions to these tasks are critical to, and can have an immediate effect on, a variety of applications, including content-based indexing and retrieval of multimedia data on the web, which has grown exponentially in the past few years.

Automated machine understanding of sounds in specific forms such as speech, language, and music has become fairly advanced, and has been successfully deployed in systems that are now part of daily life.  However, the same cannot be said about naturally occurring sounds in our environment.  The problem is exacerbated by the sheer number of sound types, their diversity and variability, the variations in their structure, and even in their interpretation.

This dissertation aims to expand the scale and scope of machine hearing capabilities by addressing challenges in the recognition of sound events, by cataloging the large number of word phrases that identify sounds and using them to categorize sounds and learn relationships between them, and by finding ways to efficiently evaluate trained models at large scale.  On the sound event recognition front, we address the major hindrance of a lack of labeled data by describing ways to effectively use the vast amount of data on the web.  We describe methods for audio event detection that use only weak labels in the learning process, methods that combine weakly supervised learning with fully supervised learning to leverage labeled data in both forms, and finally semi-supervised learning approaches that exploit the vast amount of unlabeled data available on the web.
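
As a concrete illustration of learning from weak (clip-level) labels, the following is a minimal sketch, not the proposal's actual architecture: frame-level event scores are max-pooled over time into a clip-level score, so training requires only labels for whole recordings.  The layer sizes and data shapes are illustrative assumptions.

    import torch
    import torch.nn as nn

    class WeakLabelSED(nn.Module):
        """Weak-label sound event detection sketch: frame scores pooled to clip scores."""
        def __init__(self, n_mels=64, n_events=10):
            super().__init__()
            self.frame_net = nn.Sequential(
                nn.Linear(n_mels, 128), nn.ReLU(),
                nn.Linear(128, n_events))

        def forward(self, x):                                # x: (batch, frames, n_mels)
            frame_scores = torch.sigmoid(self.frame_net(x))  # per-frame event probabilities
            clip_scores, _ = frame_scores.max(dim=1)         # max-pool over time -> weak prediction
            return clip_scores, frame_scores

    model = WeakLabelSED()
    clips = torch.randn(8, 500, 64)                          # 8 clips, 500 frames, 64 mel bands
    weak_labels = torch.randint(0, 2, (8, 10)).float()       # clip-level labels only
    clip_scores, _ = model(clips)
    loss = nn.functional.binary_cross_entropy(clip_scores, weak_labels)
    loss.backward()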

Further, we describe methods to automatically mine sound-related knowledge and relationships from the vast amount of information stored in textual data.  The third part once again addresses labeling challenges, but now during the evaluation phase.  Evaluating trained models at large scale once again requires data labeling.  We describe ways to precisely estimate the performance of a trained model under a restricted labeling budget.

In this proposal we describe the completed work in the above directions.  Empirical evaluation shows the effectiveness of the proposed methods.  For each component framework described, we discuss the expected research directions for successful completion of this dissertation.

Thesis Committee:
Bhiksha Raj (Chair)
Alex Hauptmann
Louis-Philippe Morency
Rita Singh
Dan Ellis (Google)

Copy of Proposal Document

The recent widespread use of interactive systems (or dialog systems) such as Apple's Siri has attracted a lot of attention.  The ultimate goal is to transform current systems into truly intelligent systems that can communicate with users effectively and naturally.  There are three major challenges on the way to this goal:  first, how to make systems cooperate with users in a natural manner; second, how to provide an adaptive and personalized experience to each user to achieve better communication efficiency; and last, how to make multi-task systems transition from one task to another fluidly to achieve overall conversational effectiveness and naturalness.  To address these challenges, I proposed a theoretical framework, Situated Intelligence (SI), and applied it to non-task-oriented, task-oriented, and implicit-task-oriented conversations.

In the SI framework, we argue that three capabilities are needed to achieve natural, high-quality conversations:  (1) systems need situation awareness; (2) systems need a rich repertoire of conversation strategies to regulate their situational contexts, to understand natural language, and to provide a personalized user experience; and (3) systems must have a global planning policy that optimally chooses among different conversation strategies at run time to achieve an overall natural conversation flow.  By applying the SI framework, we make a number of contributions across different types of conversational systems, in terms of both algorithm development and end-to-end system building.

Finally, we introduce the concept of an implicit-task-oriented system, which interleaves task conversation with everyday chatting.  We implemented a film-promotion system and ran a user study with it.  The results show that the system not only achieves the implicitly embedded goal but also keeps users engaged along the way.

Thesis Committee:
Alan Black (Chair)
Alexander Rudnicky
Louis-Philippe Morency
Dan Bohus (Microsoft Research)
David Suendermann-Oeft (Educational Testing Service)

Copy of Thesis Document
