LTI

One of the desiderata in machine intelligence is that computers must be able to comprehend sounds as humans do. They must know about various sounds, associate them with physical objects, entities or events, be able to recognize and categorize them, know or discover relationships between them, and so on. Successful solutions to these tasks are critical to, and can have an immediate effect on, a variety of applications, including content-based indexing and retrieval of multimedia data on the web, which has grown exponentially in the past few years.

Automated machine understanding of sounds in specific forms such as speech, language and music has become fairly advanced, and has successfully been deployed in systems that are now part of daily life. However, the same cannot be said about naturally occurring sounds in our environment. The problem is exacerbated by the sheer number of sound types, their diversity and variability, the variations in their structure, and even in their interpretation.

This dissertation aims to expand the scale and scope of machine hearing capabilities by addressing challenges in the recognition of sound events, by cataloging a large number of potential word phrases that identify sounds and using them to categorize sounds and learn relationships between them, and by finding ways to efficiently evaluate trained models at large scale. On the sound event recognition front, we address the major hindrance of the lack of labeled data by describing ways to effectively use the vast amount of data on the web. We describe methods for audio event detection that use only weak labels in the learning process, methods that combine weakly supervised learning with fully supervised learning to leverage labeled data in both forms, and finally semi-supervised learning approaches that exploit the vast amount of unlabeled data available on the web.
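
As a concrete illustration of learning from weak labels, the sketch below shows one common formulation, a multiple-instance-style model in which segment-level event scores are pooled into a single clip-level prediction, so the model can be trained with only clip-level tags. This is an assumed, minimal example rather than the exact architecture used in the dissertation; all names and dimensions are hypothetical.

```python
# Minimal sketch of weakly supervised audio event detection: segment-level
# scores are pooled into a clip-level prediction so that only clip-level
# ("weak") labels are needed during training. Illustrative only.
import torch
import torch.nn as nn

class WeakLabelEventDetector(nn.Module):
    def __init__(self, n_mels=64, n_events=10):
        super().__init__()
        # Scores each segment (e.g., a short window of log-mel features).
        self.segment_scorer = nn.Sequential(
            nn.Linear(n_mels, 128), nn.ReLU(),
            nn.Linear(128, n_events),
        )

    def forward(self, clip):
        # clip: (batch, n_segments, n_mels)
        segment_probs = torch.sigmoid(self.segment_scorer(clip))
        # Mean pooling aggregates segment evidence into one clip-level output.
        clip_probs = segment_probs.mean(dim=1)
        return clip_probs, segment_probs

model = WeakLabelEventDetector()
clips = torch.randn(8, 100, 64)                      # dummy features
weak_labels = torch.randint(0, 2, (8, 10)).float()   # clip-level tags only
clip_probs, _ = model(clips)
loss = nn.functional.binary_cross_entropy(clip_probs, weak_labels)
loss.backward()
```

At inference time, the retained segment-level probabilities can also give a rough temporal localization of the detected events, which is one appeal of this kind of weakly supervised formulation.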

Further, we describe methods to automatically mine sound-related knowledge and relationships from the vast amount of information stored in textual data. The third part once again addresses labeling challenges, but now during the evaluation phase. Evaluating trained models at large scale once again requires data labeling. We describe ways to precisely estimate the performance of a trained model under a restricted labeling budget.
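
One simple way to frame evaluation under a labeling budget, shown in the sketch below, is to stratify the unlabeled test pool by model confidence, spend the budget labeling a sample from each stratum, and combine the per-stratum accuracies with the stratum weights. This is an illustrative estimator under assumed design choices, not the specific method proposed in the dissertation; the function name and the oracle array are hypothetical stand-ins.

```python
# Illustrative sketch: estimating a classifier's accuracy when only `budget`
# items can be manually labeled. The pool is stratified by model confidence,
# a sample from each stratum is "labeled" (here via an oracle array standing
# in for the human annotator), and estimates are combined by stratum weight.
import numpy as np

def estimate_accuracy(confidences, is_correct_oracle, budget, n_strata=5, seed=0):
    rng = np.random.default_rng(seed)
    n = len(confidences)
    # Stratum boundaries from confidence quantiles.
    edges = np.quantile(confidences, np.linspace(0.0, 1.0, n_strata + 1))
    strata = np.clip(np.searchsorted(edges, confidences, side="right") - 1,
                     0, n_strata - 1)

    estimate = 0.0
    for s in range(n_strata):
        idx = np.flatnonzero(strata == s)
        if len(idx) == 0:
            continue
        weight = len(idx) / n
        k = min(len(idx), max(1, round(budget * weight)))  # per-stratum budget
        sample = rng.choice(idx, size=k, replace=False)
        estimate += weight * is_correct_oracle[sample].mean()
    return estimate
```

Allocating the budget proportionally to stratum size keeps the estimator simple; other allocations (e.g., toward low-confidence strata) trade variance differently, which is the kind of question such evaluation methods address.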

In this proposal we describe the completed work in the above directions. Empirical evaluation shows the effectiveness of the proposed methods. For each component framework described, we discuss expected research directions for the successful completion of this dissertation.

Thesis Committee:
Bhiksha Raj (Chair)
Alex Hauptmann
Louis-Philippe Morency
Rita Singh
Dan Ellis (Google)

Copy of Proposal Document

The recent wide adoption of interactive systems (or dialog systems), such as Apple's Siri, has attracted a lot of attention. The ultimate goal is to transform current systems into truly intelligent systems that can communicate with users effectively and naturally. There are three major challenges toward this goal: first, how to make systems cooperate with users in a natural manner; second, how to provide an adaptive and personalized experience to each user to achieve better communication efficiency; and last, how to make multi-task systems transition from one task to another fluidly to achieve overall conversational effectiveness and naturalness. To address these challenges, I proposed a theoretical framework, Situated Intelligence (SI), and applied it to non-task-oriented, task-oriented and implicit-task-oriented conversations.

In the SI framework, we argue that three capabilities are needed to achieve natural, high-quality conversations: (1) systems need situation awareness; (2) systems need a rich repertoire of conversation strategies to regulate their situational context, understand natural language, and provide a personalized user experience; and (3) systems must have a global planning policy that optimally chooses among different conversation strategies at run time to achieve an overall natural conversation flow. We make a number of contributions across different types of conversation systems, in terms of both algorithm development and end-to-end system building, by applying the SI framework.

In the end, we introduce the concept of an implicit-task-oriented system, which interleaves task conversation with everyday chatting. We implemented a film-promotion system and ran a user study with it. The results show that the system not only achieves the implicitly embedded goal but also keeps users engaged along the way.

Thesis Committee:
Alan Black (Chair)
Alexander Rudnicky
Louis-Philippe Morency
Dan Bohus (Microsoft Research)
David Suendermann-Oeft (Educational Testing Service)

Copy of Thesis Document

Language is the most important channel for humans to communicate about what they see.  To allow an intelligent system to effectively communicate with humans it is thus important to enable it to relate information in words and sentences with the visual world.  One component in a successful communication is the ability to answer natural language questions about the visual world.  A second component is the ability of the system to explain in natural language, why it gave a certain answer, allowing a human to trust and understand it.

In my talk, I will show how we can build models which answer questions but at the same time are modular and expose their semantic reasoning structure.  To explain the answer with natural language, I will discuss how we can learn to generate explanations given only image captions as training data by introducing a discriminative loss and using reinforcement learning.

In his research, Marcus Rohrbach focuses on relating visual recognition and natural language understanding with machine learning. Currently he is a Post-Doc with Trevor Darrell at UC Berkeley. He and his collaborators received the NAACL 2016 best paper award for their work on Neural Module Networks and won the Visual Question Answering Challenge 2016. During his Ph.D. he worked at the Max Planck Institute for Informatics, Germany, with Bernt Schiele and Manfred Pinkal. He completed it in 2014 summa cum laude at Saarland University and received the 2015 DAGM MVTec Dissertation Award from the German Pattern Recognition Society for it. His BSc and MSc degrees in Computer Science are from the University of Technology Darmstadt, Germany (2006 and 2009). After his BSc, he spent one year at the University of British Columbia, Canada, as a visiting graduate student.

This talk provides an introduction to a sample of the breakthrough papers in deep learning that will be studied in the new Spring course 11-364, a hands-on, reading-and-research, independent study course. Deep learning is the most successful of the techniques being developed to deal with the enormous and rapidly growing amount of data. Not only can deep learning process large amounts of data, it thrives on it.

This talk will present a brief history of deep learning and a brief summary of some of the recent successes:

• Super-human performance reading street signs
• Beating a top human player in the game of Go
• Human parity recognizing conversational speech
• End-to-end training of state-of-the-art question answering in natural language
• Substantial improvement in naturalness of speech synthesis
• Approaching the accuracy of average human translators on some datasets

In the course, students will study papers describing these breakthroughs and the techniques they use.  The students will then implement these techniques and apply them to real problems.

In addition, the talk and the course will introduce the concept of deep learning based on domain-specific, communicable knowledge using Socratic coaches, a new paradigm that will help integrate deep learning with the world-leading research that Carnegie Mellon does in many areas of artificial intelligence and computer science.

Professor James K. Baker (PhD CMU) was the co-founder, Chairman, and CEO of Dragon Systems, Inc., the company that developed Dragon NaturallySpeaking, the first large-vocabulary, continuous-speech, automatic dictation system, a capability that achieved a decades-old goal that had been called “the Holy Grail of speech recognition.” The recognition engine in Dragon NaturallySpeaking was also the basis for the speech recognition component in Apple’s Siri. Professor Baker led Dragon Systems from a pure bootstrap with only $30,000 to a valuation of over $500 million [1]. In recognition of the impact of his achievements, Professor Baker has been elected to membership in the National Academy of Engineering. He has patents pending on an approach to knowledge-based deep learning using Socratic coaches.

[1] See New York Times article.

We can obtain a large amount of useful information simply by watching television, e.g., details about events in Japan and around the world, current trends, economic activity, and so on. In a few experimental studies, we have explored how this data can be used for automated social analysis through face detection and matching, fast commercial-film mining, and visual object retrieval tools. In my lab, we developed and deployed key technologies for analyzing the NII TV-RECS video archive, which contains 400,000 hours of broadcast video, to achieve this goal. In this talk, I will present a selection of our work describing methods to automatically extract and analyze such information.

Shin'ichi Satoh received his BE degree in Electronics Engineering in 1987, and his ME and Ph.D. degrees in Information Engineering in 1989 and 1992, from the University of Tokyo. He joined the National Center for Science Information Systems (NACSIS), Tokyo, in 1992, and has been a full professor at the National Institute of Informatics (NII), Tokyo, since 2004. He was a visiting scientist at the Robotics Institute, Carnegie Mellon University, from 1995 to 1997. His research interests include image processing, video content analysis, and multimedia databases. He currently leads the video processing project at NII.

Faculty Host: Alexander Hauptmann

Deep Sequence-to-sequence models have rapidly become an indispensable general-purpose tool for many applications in natural language processing, such as machine translation, summarization, and dialogue.  Many problems that once required careful domain-specific engineering can now be tackled using off-the-shelf systems by interested tinkerers.  However, even with the evident early success of these models, the seq2seq framework itself is still relatively unexplored. In this talk, I will discuss three questions we have been studying in the area of sequence-to-sequence NLP: (1) Can we interpret seq2seq's learned representations? [Strobelt et al, 2016], (2) How should a seq2seq model be trained? [Wiseman and Rush, 2016], (3) How many parameters are necessary for the models to work? [Kim and Rush, 2016]. Along the way, I will present applications in summarization, grammar correction, image-to-text, and machine translation (on your phone).

Alexander Rush is an Assistant Professor at Harvard University studying NLP, and formerly a Post-doc at Facebook Artificial Intelligence Research (FAIR). He is interested in machine learning and deep learning methods for large-scale natural language processing and understanding. His past work has introduced novel methods for structured prediction with applications to syntactic parsing and machine translation. Additional information is available on his group web page and his Twitter feed.

Faculty Host: Alexander Hauptmann

Events are a core component in the semantic analysis of text. Given an overwhelming amount of text describing countless events in the world, automatic event-oriented text analysis is a key information extraction technology to help us take more sensible actions from a more holistic view. A salient property of events among other linguistic phenomena is that they compose rich semantic argument and discourse structures.

The central goal of this thesis is to devise a computational method that models the structural properties of events in a principled framework to achieve more informed event detection and event coreference resolution. To achieve this goal, we address five important problems in state-of-the-art work in these areas: (1) restricted annotation of events, (2) data sparseness in event detection, (3) lack of subevent detection, (4) event interdependencies via event coreference, and (5) limited applications of events. We propose several potential enhancements to structured neural models for event detection and event coreference resolution, and provide a novel solution to each of these problems.

At the core of our proposed method is document-level joint structured learning of two neural architectures for event detection and event coreference resolution within a learning-to-search framework, aimed at adequately capturing the interactions between the semantic argument and discourse structures of events. Our underlying assumption is that the learning-to-search framework enables the nonlinear model to learn the joint decisions more effectively and efficiently than traditional feature-based models.

Thesis Committee:
Teruko Mitamura (Chair)
Eduard Hovy
Graham Neubig
Luke Zettlemoyer (University of Washington)

Copy of Proposal Document

Text-to-speech synthesis has progressed to such a stage that, given a large, clean, phonetically balanced dataset from a single speaker, it can produce intelligible, almost natural-sounding speech. However, one is severely limited in building such systems for low-resource languages, where such data is lacking and there is no access to a native speaker.

Thus, the goal of this thesis is to use data that is freely available on the web, a.k.a. "found speech", to build TTS systems. However, since this data is collected from different sources, it is noisy and contains a lot of variation in speaker and channel characteristics. This presents a number of challenges in using the data to build TTS systems within the current pipeline.

In this thesis, we address some of these challenges. First, we look at data selection strategies to pick out good utterances from this noisy dataset that can produce intelligible speech.

Second, we investigate data augmentation techniques using cleaner external sources of data. Specifically, we study cross-lingual data augmentation techniques from high-resource languages. However, found audio data is often untranscribed; thus, we also look at methods of using this untranscribed audio, along with unrelated text data in the same language, to build a decoder for transcription. Finally, we address the issue of language, speaker and channel variation by training multi-language, multi-speaker models, which factor out these differences in a manner similar to a cluster-adaptive training framework.
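
As one way to picture the factorization described above, the sketch below conditions a shared acoustic model on learned speaker and language embeddings, loosely in the spirit of cluster-adaptive training. This is an assumed architecture for illustration only, not the proposal's actual model; all layer sizes and names are hypothetical.

```python
# Illustrative sketch: a multi-speaker, multi-language acoustic model that
# factors out speaker and language variation through learned embeddings.
# Not the proposal's actual model; dimensions and names are made up.
import torch
import torch.nn as nn

class MultiSpeakerAcousticModel(nn.Module):
    def __init__(self, n_speakers, n_languages, ling_dim=128, emb_dim=32, out_dim=80):
        super().__init__()
        self.speaker_emb = nn.Embedding(n_speakers, emb_dim)
        self.language_emb = nn.Embedding(n_languages, emb_dim)
        self.decoder = nn.Sequential(
            nn.Linear(ling_dim + 2 * emb_dim, 256), nn.ReLU(),
            nn.Linear(256, out_dim),      # e.g., mel-spectrogram frames
        )

    def forward(self, linguistic_features, speaker_id, language_id):
        # linguistic_features: (batch, frames, ling_dim)
        b, t, _ = linguistic_features.shape
        spk = self.speaker_emb(speaker_id).unsqueeze(1).expand(b, t, -1)
        lang = self.language_emb(language_id).unsqueeze(1).expand(b, t, -1)
        # The shared decoder sees the embeddings alongside the linguistic
        # input, so speaker- and language-specific variation can be absorbed
        # by the embeddings rather than by separate per-speaker models.
        return self.decoder(torch.cat([linguistic_features, spk, lang], dim=-1))
```

Conditioning a single shared model in this way is one common route to exploiting heterogeneous found data, since utterances from every speaker, language and channel contribute to the shared parameters.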

Thesis Committee:
Alan W. Black (Chair)
Florian Metze
Louis-Philippe Morency
Heiga Zen (Google Research, UK)

Copy of Proposal Document

Machine learning is ubiquitous, but most users treat it as a black box: a handy tool that suggests purchases, flags spam, or autocompletes text. I present qualities that ubiquitous machine learning should have to allow for a future filled with fruitful, natural interactions with humans: interpretability, interactivity, and an understanding of human qualities. After introducing these properties, I present machine learning applications that begin to fulfill them. I begin with a traditional information processing task, making sense of and categorizing large document collections, and show that machine learning methods can provide interpretable, efficient techniques for doing so with a human in the loop. From there, I turn to techniques that help computers understand and detect when texts reveal their writer's ideology or duplicity. Finally, I end with a setting combining all of these properties: language-based games and simultaneous machine translation.

Jordan Boyd-Graber is an assistant professor in the University of Colorado Boulder's Computer Science Department, formerly serving as an assistant professor at the University of Maryland.  Before joining Maryland in 2010, he did his Ph.D. thesis on "Linguistic Extensions of Topic Models" with David Blei at Princeton.  Jordan's research focus is in applying machine learning and Bayesian probabilistic models to problems that help us better understand social interaction or the human cognitive process.

He and his students have won "best of" awards at NIPS (2009, 2015), NAACL (2016), and CoNLL (2015), and Jordan won the British Computer Society's 2015 Karen Spärck Jones Award. His research has been funded by DARPA, IARPA, NSF, NCSES, ARL, OMO, NIH, and Lockheed Martin and has been featured by CNN, Huffington Post, New York Magazine, Talking Machines, and the Wall Street Journal.

Beginning with the philosophical and cognitive underpinnings of referring expression generation, and ending with theoretical, algorithmic and applied contributions in mainstream vision-to-language research, I will discuss some of my work through the years towards the ultimate goal of helping humans and computers communicate. This will be a multi-modal, multi-disciplinary talk (with pictures!), aimed to be interesting no matter what your background is.

Margaret Mitchell is a Senior Research Scientist in Google's Machine Intelligence group.  She works on vision-language and grounded language generation, focusing on how to help computers communicate based on what they can process.  Her work combines computer vision, natural language processing, social media, many statistical methods, and insights from cognitive science.

She was a founding researcher in Microsoft's Cognition Group, focusing on advancing artificial intelligence towards positive goals. Before MSR, she was a postdoctoral researcher at The Johns Hopkins University Center of Excellence, where she mainly focused on semantic role labeling and sentiment analysis using graphical models, working under Benjamin Van Durme. Before that, she was a postgraduate (Ph.D.) student in the natural language generation (NLG) group at the University of Aberdeen, where she focused on how to naturally refer to visible, everyday objects. She worked primarily with Kees van Deemter and Ehud Reiter.

She spent a good chunk of 2008 getting a Master's in Computational Linguistics at the University of Washington, studying under Emily Bender and Fei Xia.

Faculty Host: Florian Metze
Instructor: Alex Hauptmann
