Interaction in rich natural language enables people to exchange thoughts efficiently and come to a shared understanding quickly. Modern personal intelligent assistants such as Apple's Siri and Amazon's Echo all use conversational interfaces as their primary communication channels, and illustrate a future in which getting help from a computer is as easy as asking a friend. However, despite decades of research, modern conversational assistants are still limited in domain, expressiveness, and robustness. In this thesis, we take an alternative approach that blends real-time human computation with artificial intelligence to reliably engage in conversations. Instead of bootstrapping automation from the bottom up with only automatic components, we start with our crowd-powered conversational assistant, Chorus, and create a framework that enables Chorus to automate itself over time. Each of Chorus's responses is proposed and voted on by a group of crowd workers in real time.

Toward realizing the goal of full automation, we (i) augmented Chorus's capability by connecting it with sensors and effectors on smartphones so that users can safely control them via conversation, and (ii) deployed Chorus to the public as a Google Hangouts chatbot to collect a large corpus of conversations to help speed automation. The deployed Chorus also provides a working system for experimenting with automated approaches. In the future, we will (iii) create a framework that enables Chorus to automate itself over time by automatically obtaining response candidates from multiple dialog systems and selecting appropriate responses based on the current conversation. Over time, the automated systems will take over more responsibility in Chorus, not only helping us to deploy robust conversational assistants before we know how to automate everything, but also allowing us to drive down costs and gradually reduce reliance on the crowd.

Thesis Committee:
Jeffrey P. Bigham (Chair)
Alexander Rudnicky
Niki Kittur
Walter S. Lasecki (University of Michigan)
Chris Callison-Burch (University of Pennsylvania)

Copy of Proposal Document

Adam Berger is an accomplished technology executive and team leader in the software arena. He founded and grew two companies, both of which extend the capabilities of mobile devices beyond voice into compelling, usable new data services.  He has also worked within two world-leading information-technology firms, Nokia and IBM.

Along with three other Ph.D. students from Carnegie Mellon, Berger co-founded Eizel Technologies Inc. in 2000.  Eizel was a venture-backed software firm that developed a corporate mobile email system.  The mobile phone company Nokia purchased Eizel in 2003 and made it a component of its newly-formed Enterprise Systems division.  Within Nokia, Berger and the team grew the Eizel product line into the Nokia One Business Server™ product.  At Nokia, Berger led an innovation team that interfaced between U.S. and European offices to bring product and intellectual property concepts from the research and venturing wing into the product units.  Berger left Nokia to co-found Penthera Technologies in 2005, where he served as chief technology officer, vice president, and director.  In 2007 Berger co-founded Penthera Partners and participated in the management buyout of the assets of Penthera Technologies.

During his years in the software industry, Berger has served as a public speaker at major venues and technology advisor and board member to startups and technology investment firms.  He holds degrees in physics and computer science from Harvard University, and a Ph.D. in computer science from Carnegie Mellon University.  He has been a recipient of an IBM Graduate Fellowship, a Harvard College Scholarship and a Thomas J. Watson Fellowship.  Berger has published more than 20 refereed papers and holds 10 U.S. patents.

Duolingo is a language education platform that teaches 20 languages to more than 150 million students worldwide.  Our free flagship learning app is the #1 way to learn a language online, and is the most-downloaded education app for both Android and iOS devices.  In this talk, I will describe the Duolingo system and several of our empirical research projects to date, which combine machine learning with computational linguistics and psychometrics to improve learning, engagement, and even language proficiency assessment through our products.

Burr Settles develops and studies statistical machine learning systems with applications in human language, biology, and social science. Currently, he is most excited about using these technologies to help people learn languages and make music.

Faculty Host: Alex Hauptmann

Language is socially situated:  both what we say and what we mean depend on our identities, our interlocutors, and the communicative setting.  The first generation of research in computational sociolinguistics focused on large-scale social categories, such as gender.  However, many of the most socially salient distinctions are locally defined.  Rather than attempt to annotate these social properties or extract them from metadata, we turn to social network analysis, which has been only lightly explored in traditional sociolinguistics.  I will describe three projects at the intersection of language and social networks.  

First, I will show how unsupervised learning over social network labelings and text enables the induction of social meanings for address terms, such as "Ms" and "dude".  Next, I will describe recent research that uses social network embeddings to induce personalized natural language processing systems for individual authors, improving performance on sentiment analysis and entity linking even for authors for whom no labeled data is available.  Finally, I will describe how the spread of linguistic innovations can serve as evidence for sociocultural influence, using a parametric Hawkes process to model the features that make dyads especially likely or unlikely to be conduits for language change.
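The Hawkes process mentioned above is a self-exciting point process: each observed event temporarily raises the rate of future events, which makes it a natural model for influence in language change. As an illustrative sketch only (the exponential kernel and parameter names here are common textbook assumptions, not the speaker's exact parametric model), the conditional intensity can be computed as follows:

```python
import math

def hawkes_intensity(t, event_times, mu, alpha, beta):
    # lambda(t) = mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i)):
    # the base rate mu, plus a boost alpha after each earlier event that
    # decays exponentially at rate beta.
    return mu + sum(alpha * math.exp(-beta * (t - ti))
                    for ti in event_times if ti < t)

# With no prior events, the intensity is just the base rate mu.
print(hawkes_intensity(1.0, [], mu=0.1, alpha=0.5, beta=1.0))       # 0.1
# One event at t=0 raises the intensity at t=1 by alpha * exp(-beta).
print(hawkes_intensity(1.0, [0.0], mu=0.1, alpha=0.5, beta=1.0))
```

In the dyadic-influence setting described in the talk, one would additionally make the excitation parameters depend on features of the pair of speakers, so that the fitted parameters reveal which dyads are likely conduits for an innovation.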

Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech.   He works on statistical natural language processing, focusing on computational sociolinguistics, social media analysis, discourse, and machine learning.  He is a recipient of the NSF CAREER Award, a member of the Air Force Office of Scientific Research (AFOSR) Young Investigator Program, and was a SICSA Distinguished Visiting Fellow at the University of Edinburgh.  His work has also been supported by the National Institutes of Health, the National Endowment for the Humanities, and Google.  Jacob was a postdoctoral researcher at Carnegie Mellon and the University of Illinois.  He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award.  Jacob's research has been featured in the New York Times, National Public Radio, and the BBC. Thanks to his brief appearance in If These Knishes Could Talk, Jacob has a Bacon number of 2.

Faculty Host: Alex Hauptmann

One of the desiderata in machine intelligence is that computers must be able to comprehend sounds as humans do.  They must know about various sounds; associate them with physical objects, entities, or events; be able to recognize and categorize them; and know or discover relationships between them.  Successful solutions to these tasks are critical to, and can have an immediate effect on, a variety of applications, including content-based indexing and retrieval of multimedia data on the web, which has grown exponentially in the past few years.

Automated machine understanding of sounds in specific forms such as speech, language, and music has become fairly advanced, and has been successfully deployed in systems that are now part of daily life.  However, the same cannot be said about naturally occurring sounds in our environment.  The problem is exacerbated by the sheer number of sound types, their diversity and variability, the variations in their structure, and even in their interpretation.

This dissertation aims to expand the scale and scope of machine hearing capabilities by addressing challenges in the recognition of sound events, by cataloging a large number of potential word phrases that identify sounds and using them to categorize and learn relationships between sounds, and by finding ways to efficiently evaluate trained models at large scale.  On the sound event recognition front, we address the major hindrance of a lack of labeled data by describing ways to effectively use the vast amount of data on the web.  We describe methods for audio event detection that use only weak labels in the learning process, that combine weakly supervised learning with fully supervised learning to leverage labeled data in both forms, and finally semi-supervised learning approaches that exploit the vast amount of available unlabeled data on the web.
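To make the weak-label setting concrete: a weak label says only that an event occurs somewhere in a clip, not when or how often.  A common way to train with such labels is to pool segment-level predictions into one clip-level probability and score that against the weak label.  The following minimal numpy sketch illustrates the idea only; the function names are hypothetical, and max pooling is just one common aggregation choice, not necessarily the one used in this dissertation:

```python
import numpy as np

def clip_probability(segment_probs):
    # A weak label only says the event occurs somewhere in the clip,
    # so aggregate per-segment probabilities with max pooling: the clip
    # is positive if any single segment is confidently positive.
    return float(np.max(segment_probs))

def weak_label_loss(segment_probs, clip_label):
    # Binary cross-entropy between the pooled clip-level probability
    # and the weak (clip-level) 0/1 label.
    p = np.clip(clip_probability(segment_probs), 1e-7, 1 - 1e-7)
    return float(-(clip_label * np.log(p) + (1 - clip_label) * np.log(1 - p)))

# One segment fires strongly, so the clip is scored as positive.
print(clip_probability([0.1, 0.9, 0.2]))        # 0.9
print(weak_label_loss([0.1, 0.9, 0.2], 1))      # -log(0.9), a small loss
```

Gradients then flow back only through the pooled segment, which is how segment-level (strong) predictions can emerge from clip-level (weak) supervision.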

Further, we describe methods to automatically mine sound-related knowledge and relationships from the vast amount of information stored in textual data.  The third part once again addresses labeling challenges, but now during the evaluation phase.  Evaluating trained models at large scale once again requires data labeling.  We describe ways to precisely estimate the performance of a trained model under a restricted labeling budget.

In this proposal we describe the completed work in the above directions.  Empirical evaluation shows the effectiveness of the proposed methods.  For each component framework described, we discuss expected research directions for successful completion of this dissertation.

Thesis Committee:
Bhiksha Raj (Chair)
Alex Hauptmann
Louis-Philippe Morency
Rita Singh
Dan Ellis (Google)

Copy of Proposal Document

The recent wide adoption of interactive systems (or dialog systems) such as Apple's Siri has attracted a lot of attention.  The ultimate goal is to transform current systems into truly intelligent systems that can communicate with users effectively and naturally.  There are three major challenges on the path to this goal:  first, how to make systems that cooperate with users in a natural manner; second, how to provide an adaptive and personalized experience to each user to achieve better communication efficiency; and last, how to make multi-task systems transition from one task to another fluidly to achieve overall conversational effectiveness and naturalness.  To address these challenges, I proposed a theoretical framework, Situated Intelligence (SI), and applied it to non-task-oriented, task-oriented, and implicit-task-oriented conversations.

In the SI framework, we argue that three capabilities are needed to achieve natural, high-quality conversations:  (1) systems need situation awareness; (2) systems need a rich repertoire of conversation strategies to regulate their situated contexts, to understand natural language, and to provide a personalized user experience; and (3) systems must have a global planning policy that optimally chooses among different conversation strategies at run time to achieve an overall natural conversation flow.  We make a number of contributions across different types of conversational systems, in terms of both algorithm development and end-to-end system building, by applying the SI framework.

Finally, we introduce the concept of an implicit-task-oriented system, which interleaves task conversation with everyday chatting.  We implemented a film-promotion system and ran a user study with it.  The results show that the system not only achieves the implicitly embedded goal but also keeps users engaged along the way.

Thesis Committee:
Alan Black (Chair)
Alexander Rudnicky
Louis-Philippe Morency
Dan Bohus (Microsoft Research)
David Suendermann-Oeft (Educational Testing Service)

Copy of Thesis Document

Language is the most important channel for humans to communicate about what they see.  To allow an intelligent system to effectively communicate with humans it is thus important to enable it to relate information in words and sentences with the visual world.  One component in a successful communication is the ability to answer natural language questions about the visual world.  A second component is the ability of the system to explain in natural language, why it gave a certain answer, allowing a human to trust and understand it.

In my talk, I will show how we can build models which answer questions but at the same time are modular and expose their semantic reasoning structure.  To explain the answer with natural language, I will discuss how we can learn to generate explanations given only image captions as training data by introducing a discriminative loss and using reinforcement learning.

In his research, Marcus Rohrbach focuses on relating visual recognition and natural language understanding with machine learning.  Currently he is a Post-Doc with Trevor Darrell at UC Berkeley.  He and his collaborators received the NAACL 2016 best paper award for their work on Neural Module Networks and won the Visual Question Answering Challenge 2016.  During his Ph.D. he worked at the Max Planck Institute for Informatics, Germany, with Bernt Schiele and Manfred Pinkal.  He completed his Ph.D. summa cum laude at Saarland University in 2014, and received the DAGM MVTec Dissertation Award 2015 from the German Pattern Recognition Society for it.  His BSc and MSc degrees in Computer Science are from the University of Technology Darmstadt, Germany (2006 and 2009).  After his BSc, he spent one year at the University of British Columbia, Canada, as a visiting graduate student.

This talk provides an introduction to a sample of the breakthrough papers in deep learning that will be studied in the new Spring course 11-364, a hands-on, reading and research, independent study course. Deep learning is the most successful of the techniques that are being developed to deal with the enormous and rapidly growing amount of data.  Not only can deep learning process large amounts of data, it thrives on it. 

This talk will present a brief history of deep learning and a brief summary of some of the recent successes:

• Super-human performance reading street signs
• Beating a top human player in the game of Go
• Human parity recognizing conversational speech
• End-to-end training of state-of-the-art question answering in natural language
• Substantial improvement in naturalness of speech synthesis
• Approaching the accuracy of average human translators on some datasets

In the course, students will study papers describing these breakthroughs and the techniques they use.  The students will then implement these techniques and apply them to real problems.

In addition, the talk and the course will introduce the concept of deep learning based on domain-specific, communicable knowledge using Socratic coaches, a new paradigm that will help integrate deep learning with the world-leading research that Carnegie Mellon does in many areas of artificial intelligence and computer science.

Professor James K. Baker (PhD CMU) was the co-founder, Chairman, and CEO of Dragon Systems, Inc., the company that developed Dragon NaturallySpeaking, the first large-vocabulary, continuous-speech automatic dictation system, a capability that achieved a decades-old goal that had been called “the Holy Grail of speech recognition.” The recognition engine in Dragon NaturallySpeaking was also the basis for the speech recognition component in Apple’s Siri. Professor Baker led Dragon Systems from a pure bootstrap with only $30,000 to a valuation of over $500 million [1]. In recognition of the impact of his achievements, Professor Baker has been elected to membership in the National Academy of Engineering. He has patents pending on an approach to knowledge-based deep learning using Socratic coaches.

1. See New York Times article

We can obtain a great deal of useful information simply by watching television, e.g., details about events in Japan and around the world, current trends, economic activity, and so on.  In a few experimental studies, we have explored how this data can be used for automated social analysis through face detection and matching, fast commercial-film mining, and visual object retrieval tools.  In my lab, we developed and deployed key technologies for analyzing the NII TV-RECS video archive, containing 400,000 hours of broadcast video, to achieve this goal.  In this talk, I will present a selection of our work describing methods to automatically extract and analyze such information.

Shin'ichi Satoh received his BE degree in Electronics Engineering in 1987, and his ME and Ph.D. degrees in Information Engineering in 1989 and 1992, all from the University of Tokyo. He joined the National Center for Science Information Systems (NACSIS), Tokyo, in 1992, and has been a full professor at the National Institute of Informatics (NII), Tokyo, since 2004. He was a visiting scientist at the Robotics Institute, Carnegie Mellon University, from 1995 to 1997. His research interests include image processing, video content analysis, and multimedia databases. He currently leads the video processing project at NII.

Faculty Host: Alexander Hauptmann

Deep sequence-to-sequence (seq2seq) models have rapidly become an indispensable general-purpose tool for many applications in natural language processing, such as machine translation, summarization, and dialogue.  Many problems that once required careful domain-specific engineering can now be tackled using off-the-shelf systems by interested tinkerers.  However, even with the evident early success of these models, the seq2seq framework itself is still relatively unexplored. In this talk, I will discuss three questions we have been studying in the area of sequence-to-sequence NLP: (1) Can we interpret seq2seq's learned representations? [Strobelt et al., 2016], (2) How should a seq2seq model be trained? [Wiseman and Rush, 2016], (3) How many parameters are necessary for the models to work? [Kim and Rush, 2016]. Along the way, I will present applications in summarization, grammar correction, image-to-text, and machine translation (on your phone).

Alexander Rush is an Assistant Professor at Harvard University studying NLP, and formerly a post-doc at Facebook Artificial Intelligence Research (FAIR).  He is interested in machine learning and deep learning methods for large-scale natural language processing and understanding.  His past work has introduced novel methods for structured prediction with applications to syntactic parsing and machine translation.  Additional information is available on his group web page and his Twitter feed.

Faculty Host: Alexander Hauptmann

