Despite recent progress, AI is still far from understanding the physics of the world, and there is a large gap between the abilities of humans and the state-of-the-art AI methods. In this talk, I will focus on physics-based scene understanding and interactive visual reasoning, which are crucial next steps in computer vision and AI. The first part of the talk will describe our work on understanding preliminary physics from images, and the second part of the talk will be about our recent work on using Reinforcement Learning and Imitation Learning to perform tasks in the challenging AI2-THOR environment.
Roozbeh Mottaghi is a Research Scientist at the Allen Institute for Artificial Intelligence (AI2). Prior to joining AI2, he was a post-doctoral researcher in the Computer Science Department at Stanford University. He obtained his PhD in Computer Science in 2013 from UCLA. His research is mainly focused on computer vision and machine learning.
The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will discuss the challenges of scale in labeling, modeling, and inference behind this gap. I will then present three works addressing these challenges. The first is a method for efficient inference of action detection in videos. We formulate this method as a reinforcement learning-based agent that interacts with a video over time, and decides both where in the video to look next and when to emit a prediction, significantly reducing the total frames processed in the video.
The second work pushes dense, detailed understanding of actions in video. We introduce a dataset of dense, multilabel action annotations to enable research in this direction, and a model that increases temporal modeling capacity beyond that of the standard recurrent neural networks used for action recognition in order to target this task.
Finally, I will discuss an approach for leveraging noisy web videos to learn classifiers for new concepts without requiring manually labeled training videos. We propose a reinforcement learning-based formulation for selecting the right examples for training a classifier from noisy web search results. I will show that after learning a data labeling policy on a small labeled training set, we can then use this policy to automatically label noisy web data for new visual concepts.
Serena Yeung is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. Serena interned at Facebook AI Research in Summer 2016, and before starting her Ph.D., received a B.S. and M.S. in Electrical Engineering, both from Stanford.
Automated analysis of dense crowds is a challenging problem with far-reaching applications in crowd safety and management, as well as in gauging the political significance of protests and demonstrations. In this talk, I will first describe a counting approach which uses traditional computer vision techniques, and was recently applied to the Catalonia demonstrations in Spain in 2015 and 2016. An extension of this work using a convolutional neural network with hundreds of layers is presented next, made possible in part by a new dataset for counting with over one million humans, all marked with dot annotations. Next, I will discuss how context in the form of local consistency captures the similarity in scale in local neighborhoods in an image and is used to detect partially visible humans in dense crowds. Finally, for the task of re-identification in a multi-camera setup, spatio-temporal context in the form of personal, social and environmental constraints aids in eliminating incorrect hypotheses and significantly improves performance on correct re-acquisition of people across cameras, especially when appearance and visual features alone are insufficient.
Haroon Idrees is a postdoctoral researcher in the Center for Research in Computer Vision (CRCV) at the University of Central Florida (UCF). He is interested in machine vision and learning, with a focus on crowd analysis, action recognition, multi-camera and airborne surveillance, as well as deep learning and multimedia content analysis. He chaired the THUMOS challenge on Action Recognition (CVPR, 2015) and has been a program committee member of the Workshop on Applications for Aerial Video Exploitation (WACV, 2015), the Multiple Object Tracking Challenge (ECCV, 2016), and the upcoming BMTT-PETS Workshop on Tracking and Surveillance (CVPR, 2017) and Open Domain Action Recognition (CVPR, 2017).
He has published several papers in CVPR, ICCV, ECCV, Journal of Image and Vision Computing, and IEEE Transactions on Pattern Analysis and Machine Intelligence. He received the BSc (Honors) degree in computer engineering from the Lahore University of Management Sciences, Pakistan in 2007, and the PhD degree in computer science from the University of Central Florida in 2014.
Faculty Hosts: Kris Kitani, Yaser Sheikh
Sponsored in part by Disney Research.
Over the past 5 years the community has made significant strides in the field of Computer Vision. Thanks to large scale datasets, specialized computing in the form of GPUs, and many breakthroughs in modeling better convnet architectures, Computer Vision systems in the wild at scale are becoming a reality. At Facebook AI Research we want to embark on the journey of making breakthroughs in the field of AI and using them for the benefit of connecting people and helping remove barriers for communication. In that regard Computer Vision plays a significant role, as the media content coming to Facebook is ever increasing and building models that understand this content is crucial in achieving our mission of connecting everyone. In this talk I will give an overview of how we think about problems related to Computer Vision at Facebook and touch on various aspects related to supervised, semi-supervised, and unsupervised learning. I will jump between various research efforts involving representation learning. I will highlight some large scale applications that use the technology and talk about limitations of current systems.
Manohar Paluri is currently a Research Lead and manages the Computer Vision team in the Applied Machine Learning organization. He is passionate about Computer Vision and about the longer-term goal of building systems that can perceive the way humans do. Throughout his career he has spent considerable time looking at Computer Vision problems in Industry and Academia. He worked at renowned places like Google Research, IBM Watson Research Labs, and Stanford Research Institute before helping co-found Facebook AI Research, directed by Dr. Yann LeCun. He spent his formative years at IIIT Hyderabad, where he finished his undergraduate studies with Honors in Computer Vision, and joined Georgia Tech to pursue his Ph.D. For over a decade he has been working on various problems related to Computer Vision and, in general, Perception, and has made various contributions through his publications at CVPR, NIPS, ICCV, ECCV, ICLR, KDD, IROS, ACCV etc. He is passionate about building real world systems that are used by billions of people. Some of these systems are running at Facebook and already have tremendous impact on how people communicate using Facebook.
Sponsored in part by Disney Research.
As opposed to the traditional notion of actions and activities in computer vision, where the motion (e.g. jumping) or the goal (e.g. cooking) is the focus, I will argue for an object-centred perspective on actions and activities, during daily routine or as part of an industrial workflow. I will present approaches for the understanding of ‘what’ objects one interacts with, ‘how’ these objects have been used and ‘when’ interactions take place.
The talk will be divided into three parts. In the first part, I will present unsupervised approaches to automatic discovery of task-relevant objects and their modes of interaction, as well as automatically providing guidance on using novel objects through a real-time wearable setup. In the second part, I will introduce supervised approaches to two novel problems: action completion (when an action is attempted but not completed) and expertise determination (who is better in task performance and who is best). In the final part, I will discuss work in progress on uncovering labelling ambiguities in object interaction recognition, including ambiguities in defining the temporal boundaries for object interactions and ambiguities in verb semantics.
Dima Damen is a Lecturer (Assistant Professor) in Computer Vision at the University of Bristol. She received her Ph.D. from the University of Leeds (2009). Dima's research interests are in the automatic understanding of object interactions, actions and activities using static and wearable visual (and depth) sensors. Dima co-chaired BMVC 2013, is area chair for BMVC (2014-2017) and associate editor of IET Computer Vision. In 2016, Dima was selected as a Nokia Research collaborator. She currently supervises 7 Ph.D. students and 2 postdoctoral researchers.
Humans perform a wide range of complex tasks such as navigation, manipulation of diverse objects and planning their interaction with other humans. However, at birth humans are not yet adept at many of these tasks. When observing infants, one might conclude that they perform random actions such as flailing their limbs or manipulating objects without purpose. It is possible that while infants engage in such exploration of their motor abilities they learn a mapping between their sensory and motor systems that enables adults to plan and perform complex sensorimotor tasks. Taking inspiration from this hypothesis, I will present some initial results on how a robotic agent can learn via random interaction with its environment and its intrinsic curiosity to push objects, manipulate ropes and navigate in mazes. I will then show how these basic skills can be combined with imitation to perform more complex tasks. Finally, I will touch upon how models similar to object interaction can be used to reason about human behavior in sports games.
Pulkit Agrawal is a PhD Student in the department of Computer Science at UC Berkeley. His research focuses on computer vision, robotics and computational neuroscience. He is advised by Dr. Jitendra Malik. Pulkit completed his bachelor's in Electrical Engineering at IIT Kanpur and was awarded the Director’s Gold Medal. He is a recipient of the Fulbright Science and Technology Award, Goldman Sachs Global Leadership Award, OPJEMS, Sridhar Memorial Prize and IIT Kanpur’s Academic Excellence Awards among others. Pulkit served as the General Secretary of the Science and Technology Council and vice-captain of the water polo team at IIT Kanpur. Pulkit holds a “Sangeet Prabhakar” (equivalent to a bachelor's in Indian classical music) and occasionally performs in music concerts.
Assistive technology is the art of building tools, devices and services that can support activities of daily life of people with disabilities. In this talk, I will describe some recent projects from my UCSC group focusing on sensing technology for people who are blind and for people with low vision. These include: blind wayfinding using visual landmarks and inertial sensors; text and sign reading; and accessible public transportation. I will conclude with a few reflections on some critical requirements of accessible wayfinding systems.
Roberto Manduchi is a Professor of Computer Engineering at the University of California, Santa Cruz, where he conducts research in the areas of computer vision and sensor processing with applications to assistive technology. Prior to joining UCSC in 2001, he worked at the NASA Jet Propulsion Laboratory and at Apple. He is a consultant for Aquifi, Inc., and sits on the scientific advisory board of Aira, Inc. In 2013 he shared with Carlo Tomasi the Helmholtz Test-of-Time Award from the International Conference on Computer Vision for their article on Bilateral Filtering.
Host: Dragan Ahmetovic
Advances in sensor miniaturization, low-power computing, and battery life have enabled the first generation of mainstream wearable cameras. Millions of hours of videos have been captured by these devices, creating a record of our daily visual experiences at an unprecedented scale. This has created a major opportunity to develop new capabilities and products based on First Person Vision (FPV), the automatic analysis of videos captured from wearable cameras. Meanwhile, vision technology is at a tipping point. Major progress has been made over the last few years in both visual recognition and 3D reconstruction. The stage is set for a grand challenge of activity recognition in FPV. My research focuses on understanding naturalistic daily activities of the camera wearer in FPV to advance both computer vision and mobile health.
In the first part of this talk, I will demonstrate that first person video has the unique property of encoding the intentions and goals of the camera wearer. I will introduce a set of first person visual cues that captures the users' intent and can be used to predict their point of gaze and the actions they are performing during activities of daily living. Our methods are demonstrated using a benchmark dataset that I helped to create. In the second part, I will describe a novel approach to measure children’s social behaviors during naturalistic face-to-face interactions with an adult partner, who is wearing a camera. I will show that first person video can support fine-grained coding of gaze (differentiating looks to eyes vs. face), which is valuable for autism research. Going further, I will present a method for automatically detecting moments of eye contact. Finally, I will briefly cover my work on cross-modal learning using deep models.
This is joint work with Zhefan Ye, Sarah Edmunds, Dr. Alireza Fathi, Dr. Agata Rozga and Dr. Wendy Stone.
Yin Li is currently a doctoral candidate in the School of Interactive Computing at the Georgia Institute of Technology. His research interests lie at the intersection of computer vision and mobile health. Specifically, he creates methods and systems to automatically analyze first person videos, known as First Person Vision (FPV). He has particular interests in recognizing the person's activities and developing FPV for health care applications. He is the co-recipient of the best student paper awards at MobiHealth 2014 and IEEE Face & Gesture 2015. His work has been covered by MIT Tech Review, WIRED UK and New Scientist.
Visual recognition methods have made great strides in recent years by exploiting large manually curated and labeled datasets specialized to various tasks. My research focuses on asking: could we do better than this painstakingly manually supervised approach? In particular, could embodied visual agents teach themselves through interaction with and experimentation in their environments?
In this talk, I will present approaches that we have developed to model the learning and performance of visual tasks by agents that have the ability to act and move in their worlds. I will showcase results that indicate that computer vision systems could benefit greatly from action and motion in the world, with continuous self-acquired feedback. In particular, it is possible for embodied visual agents to learn generic image representations from unlabeled video, improve scene and object categorization performance through intelligent exploration, and even learn to direct their cameras to be effective videographers.
Dinesh Jayaraman is a PhD candidate in Kristen Grauman's group at UT Austin. His research interests are broadly in visual recognition and machine learning. In the last few years, Dinesh has worked on visual learning and active recognition in embodied agents, unsupervised representation learning from unlabeled video, visual attribute prediction, and zero-shot categorization. During his PhD, he has received the Best Application Paper Award at the Asian Conference on Computer Vision 2016 for work on automatic cinematography, the Samsung PhD Fellowship for 2016-17, a UT Austin Microelectronics and Computer Development Fellowship, and a Graduate Dean's Prestigious Fellowship Supplement Award for 2016-17. Before beginning graduate school, Dinesh graduated with a bachelor's degree in electrical engineering from the Indian Institute of Technology Madras (IITM), Chennai, India.
Sponsored in part by Disney Research.
Training object class detectors typically requires a large set of images with objects annotated by bounding boxes, which are very time-consuming to produce. In this talk I will present several schemes to reduce annotation time. These augment existing techniques for weakly supervised learning with a small amount of extra human annotation: (a) verifying bounding boxes produced by the learning algorithm; (b) clicking on the object center; (c) measuring response times during visual search. I will show that this extra annotation can go a long way: some of these schemes deliver detectors almost as good as those trained in a fully supervised setting, while reducing annotation time by about 10x. I will conclude with our effort to annotate part of the COCO dataset with a broad range of stuff classes. To this end we propose a specialized annotation protocol which leverages existing thing annotations to enable efficient labeling of all stuff pixels.
Vittorio Ferrari is a Professor at the School of Informatics of the University of Edinburgh and a Research Scientist at Google, leading a research group on visual learning in each institution. He received his PhD from ETH Zurich in 2004 and was a post-doctoral researcher at INRIA Grenoble in 2006-2007 and at the University of Oxford in 2007-2008. Between 2008 and 2012 he was Assistant Professor at ETH Zurich, funded by a Swiss National Science Foundation Professorship grant. He received the prestigious ERC Starting Grant, and the best paper award from the European Conference on Computer Vision, both in 2012. He is the author of over 90 technical publications. He regularly serves as an Area Chair for the major computer vision conferences, and he will be a Program Chair for ECCV 2018 and a General Chair for ECCV 2020. He is an Associate Editor of IEEE Transactions on Pattern Analysis and Machine Intelligence. His current research interests are in learning visual models with minimal human supervision, object detection, and semantic segmentation.
Host: Olga Russakovsky