In this talk I will present the use of motion cues, in particular long-range temporal interactions among objects, for computer vision tasks such as video segmentation, object tracking, pose estimation and semantic segmentation. The first part of the talk will present a method to capture such interactions and to construct an intermediate-level video representation. We also use them for tracking objects, and develop a tracking-by-detection approach that exploits occlusion and motion reasoning. This reasoning is based on long-term trajectories, which are labelled as object or background tracks with an energy-based formulation. We then show the use of temporal constraints for estimating articulated human poses, which is cast as an optimization problem. We present a new approximate scheme to solve it, with two steps dedicated to pose estimation.

The second part of the talk presents the use of motion cues for semantic segmentation. Fully convolutional neural networks (FCNNs) have become the new state of the art for this task recently, but rely on a large number of images with strong pixel-level annotations. To address this, we present motion-CNN (M-CNN), a novel FCNN framework which incorporates motion cues and is learned from video-level weak annotations. Our learning scheme to train the network uses motion segments as soft constraints, thereby handling noisy motion information. We demonstrate that the performance of M-CNN learned with 150 weak video annotations is on par with state-of-the-art weakly-supervised methods trained with thousands of images.

Karteek Alahari has been an Inria permanent researcher (chargé de recherche) since October 2015. He has been at Inria since 2010, initially as a postdoctoral fellow in the WILLOW team in Paris, and then in a starting research position in Grenoble from September 2013. Dr. Alahari's PhD from Oxford Brookes University, UK, was on efficient inference and learning algorithms. His work as a postdoc focused on new models for scene understanding problems defined on videos. His current research interests are models for human pose estimation, semantic segmentation and object tracking, and weakly supervised learning.

Host: Olga Russakovsky

Matrix completion is a generic framework aiming to recover a matrix from a limited number of (possibly noisy) entries. In this context, low-rank regularizers are often imposed so as to find matrix estimators that are robust to noise and outliers. In this talk I will discuss three recent advances in matrix completion, developed to solve three different vision applications: first, coupled matrix completion for joint head and body pose estimation; second, non-linear matrix completion to recognize emotions from abstract paintings; third, self-adaptive matrix completion for remote heart-rate estimation from videos.

Xavier Alameda-Pineda received the M.Sc. degree in mathematics and telecommunications engineering from the Universitat Politècnica de Catalunya – BarcelonaTech in 2008 and 2009 respectively, the M.Sc. degree in computer science from the Université Joseph Fourier and Grenoble INP in 2010, and the Ph.D. degree in mathematics/computer science from the Université Joseph Fourier in 2013. He worked towards his Ph.D. degree in the Perception Team at INRIA Grenoble Rhône-Alpes. He currently holds a postdoctoral position at the Multimodal Human Understanding Group at the University of Trento. His research interests are machine learning and signal processing for scene understanding, speaker diarization and tracking, sound source separation and behavior analysis.


In recent years, deep learning has begun to dominate computer vision research, with convolutional neural networks becoming the standard machine learning tool for a wide range of tasks. However, one of the requirements for these methods to work effectively is a rich source of training data. Therefore, parallel applications in "real-world" robotics, such as manipulation, are often still limited by the capacity to generate large-scale, high-quality data. In this talk, I will introduce some techniques I have developed to train robots using simulation, without the need to conduct costly real-world experiments. Specifically, I will talk about multi-view active object recognition, robotic grasping using physics simulation, and deep reinforcement learning for robotic arm control.

Ed Johns is a Dyson Fellow at Imperial College London, working on computer vision, robotics and machine learning. He received a BA and MEng from Cambridge University, followed by a PhD in visual recognition and localisation from Imperial College London. After post-doctoral work at University College London, he then took up a research fellowship and returned to Imperial to help set up the Dyson Robotics Lab with Professor Andrew Davison. He now works on visually-guided robot manipulation for domestic robotics.

Faculty Host: Michael Kaess


Deep learning has proven very successful in many applications that require advanced pattern matching, including computer vision. However, it is still unclear how deep learning could be applied to other tasks such as logical reasoning. In this talk, I introduce two of our recent works in this direction, Visual Question Answering and Computer Go. We show that with different architectures, we can achieve state-of-the-art performance compared with existing approaches.

Yuandong Tian is a Research Scientist at Facebook AI Research, working on Deep Learning and Computer Vision. Prior to that, he was a Software Engineer on the Google Self-Driving Car team in 2013-2014. He received his Ph.D. from the Robotics Institute, Carnegie Mellon University, in 2013, and his Bachelor's and Master's degrees in Computer Science from Shanghai Jiao Tong University. He is the recipient of a 2013 ICCV Marr Prize Honorable Mention for his work on a globally optimal solution to nonconvex optimization in image alignment.

Modern intelligent agents will need to learn the manipulation actions that humans perform. They will need to recognize these actions when they see them and they will need to perform these actions themselves. The lesson from the findings on mirror neurons is that the two processes of interpreting visually observed actions and generating actions should share the same underlying cognitive process. The talk will present a cognitive system that interprets human manipulation actions from perceptual information (image and depth data) and consists of perceptual modules and reasoning modules that interact with each other.

The talk focuses on two core problems at the heart of manipulation action understanding: a) the grounding of relevant information about actions in perception (the perception - action integration problem), and b) the organization of perceptual and high-level symbolic information for interpreting the actions (the sequencing problem). At the high level, actions are represented with the Manipulation Action Context-free Grammar (MACFG), a syntactic grammar and associated parsing algorithms, which organizes actions as a sequence of sub-events. Each sub-event is described by the hand (as well as grasp type), movements (actions) and the objects and tools involved, and the relevant information about these quantities is obtained from biologically inspired perception modules. These modules recognize the hand grasp, manipulation action consequences and object-wise spatial primitives. Furthermore, a probabilistic semantic parsing framework based on CCG (Combinatory Categorial Grammar) theory is adopted to model the semantic meaning of human manipulation actions.

Analogically, understanding manipulation actions is like understanding language, while executing them is like generating language. Experiments on two tasks, 1) a robot observing people performing manipulation actions, and 2) a robot then executing manipulation actions accordingly, are conducted to validate the formalism.

Dr. Yezhou Yang is a Postdoctoral Research Associate at the Computer Vision Lab and the Automation, Robotics and Cognition (ARC) Lab, with the University of Maryland Institute for Advanced Computer Studies, working with his PhD advisors: Prof. Yiannis Aloimonos and Dr. Cornelia Fermuller. His main interests lie in Cognitive Robotics, Computer Vision and Robot Vision, especially exploring visual primitives in human action understanding from visual input, grounding them by natural language as well as high-level reasoning over the primitives for intelligent robots. He was a recipient of the Qualcomm Innovation Fellowship 2011, the UMD CS Department Dean's Fellowship award and the Microsoft Research Asia Young Researcher Scholarship 2009. He received a B.A. in Computer Science from Zhejiang University in 2010, and a Ph.D. in Computer Science from the University of Maryland, College Park in 2015.

From the clink of a mug placed onto a saucer to the bustle of a busy café, our days are filled with visual experiences that are accompanied by characteristic sounds.  These sounds, when paired with their corresponding videos, can provide a rich training signal that allows us to learn visual representations of objects, materials, and scenes.  In this talk, I'll first address the material-understanding task of predicting what sound an object makes when it is hit or scratched.  I'll present an algorithm that learns to predict plausible soundtracks for silent videos of people striking objects with a drumstick. The sounds predicted by this model convey information about materials and physical interactions, and they frequently fool human subjects in "real or fake" psychophysical studies. I will then apply similar ideas to show that ambient audio -- e.g., crashing waves, people speaking in a crowd -- can be used to learn about objects and scenes.  By training a convolutional network to predict held-out sound for internet videos, we can learn image representations that perform well on object recognition tasks, and which contain units that are selective for sound-producing objects.

Andrew Owens is a graduate student at the MIT Computer Science and Artificial Intelligence Laboratory, working under the supervision of Bill Freeman and Antonio Torralba.  Before that, he obtained his B.A. in Computer Science at Cornell University in 2010.  He is a recipient of a Microsoft Research PhD Fellowship, an NDSEG Fellowship, and a Best Paper Honorable Mention Award at CVPR 2011.

A first-person camera placed at a person's head captures candid moments in our lives, providing detailed visual data of how we interact with people, objects, and scenes. It reveals our future intentions and momentary visual sensorimotor behaviors. With first-person vision, can we build a computational model for personalized intelligence that predicts what we see and how we act by "putting yourself in her/his shoes"?

We provide three examples. (1) At the physical level, we predict the wearer's intent in the form of the force and torque that control the movements. Our model integrates visual scene semantics, 3D reconstruction, and inverse optimal control to compute the active force (pedaling and braking while biking) and experienced passive force (gravity, air drag, and friction) in a first-person sport video. (2) At the spatial scene level, we predict plausible future trajectories of ego-motion. The predicted paths avoid obstacles, move between objects, and even turn around a corner into invisible space behind objects. (3) At the object level, we study the holistic correlation of visual attention with motor action by introducing "action-objects" associated with seeing and touching actions. Such action-objects exhibit characteristic 3D spatial distance and orientation with respect to the person, which allow us to build a predictive model using EgoNet. We demonstrate that we can predict momentary visual attention and motor actions without gaze tracking and tactile sensing for first-person videos.

This is joint work with Hyun Soo Park.

Jianbo Shi studied Computer Science and Mathematics as an undergraduate at Cornell University, where he received his B.A. in 1994. He received his Ph.D. degree in Computer Science from the University of California at Berkeley in 1998. He joined The Robotics Institute at Carnegie Mellon University in 1999 as research faculty, where he led the Human Identification at Distance (HumanID) project, developing vision techniques for human identification and activity inference. In 2003 he joined the University of Pennsylvania, where he is currently a Professor of Computer and Information Science. In 2007, he was awarded the Longuet-Higgins Prize for his work on Normalized Cuts.

His current research focuses on first-person human behavior analysis and image recognition-segmentation. His other research interests include image/video retrieval, 3D vision, and vision-based desktop computing. His long-term interests center around the broader area of machine intelligence: he wishes to develop a "visual thinking" module that allows computers not only to understand the environment around us, but also to achieve cognitive abilities such as machine memory and learning.

Sponsored in part by Disney Research

In recent years, commodity 3D sensors, such as the Microsoft Kinect, have become easily and widely available. These advances in sensing technology have inspired significant interest in using the captured data for mapping and understanding 3D environments. In this talk, I will present our current research in this fascinating field, show potential future research directions, and talk about long-term goals. More specifically, I will show how we can now easily obtain a 3D reconstruction of an environment, and how we can exploit these results in order to infer semantics of a scene.

Matthias Niessner is a visiting assistant professor at Stanford University. Prior to his appointment at Stanford, he earned his PhD from the University of Erlangen-Nuremberg, Germany, under the supervision of Günther Greiner. His research focuses on different fields of computer graphics and computer vision, including the reconstruction and semantic understanding of 3D scene environments.

Sponsored in part by Disney Research.

A successful autonomous system needs to not only understand the visual world but also communicate its understanding with humans. To make this possible, language can serve as a natural link between high level semantic concepts and low level visual perception. In this talk, I'll present our recent work in the interdisciplinary domain of vision and language. I’ll show how we can exploit the alignment between movies and books in order to build more descriptive captioning systems. I’ll also discuss our efforts towards automatic understanding of stories behind long and complex videos.

Sanja Fidler is an Assistant Professor at the Department of Computer Science, University of Toronto. Previously she was a Research Assistant Professor at TTI-Chicago, a philanthropically endowed academic institute located on the campus of the University of Chicago. She was a postdoctoral fellow at the University of Toronto during 2011-2012. She completed her PhD in computer science at the University of Ljubljana in 2010, and was a visiting student at UC Berkeley in the final year of her PhD. She is serving as an Area Chair for the upcoming CVPR’16, EMNLP’16 and ACCV’16, and as a Program Chair of 3DV’16. Her main research interests are object recognition, 3D scene understanding, and combining vision and language.

Sponsored in part by Disney Research.

The focus of this talk will be on detailed scene understanding from RGB-D images. We approach this problem by studying a variety of central vision problems, such as bottom-up grouping, object detection, instance segmentation, and pose estimation, in the context of RGB-D images, and finally aligning CAD models to objects in the scene. This results in a detailed output that goes beyond what most current computer vision algorithms produce (a bounding box or a segmentation mask for the object of interest) and is useful for a variety of real-world applications such as perceptual robotics and augmented reality.

A central question in this work is how to learn good features for depth images, given that labeled RGB-D datasets are much smaller than the labeled RGB datasets (such as ImageNet) typically used for feature learning. I will describe our "cross-modal distillation" technique, which allows us to leverage easily available annotations on RGB images to learn representations on depth images. In addition, I will briefly talk about some work on vision and language that I did during an internship at Microsoft Research.

Saurabh Gupta is a Ph.D. student at UC Berkeley, where he is advised by Jitendra Malik. His research interests include computer vision and machine learning. During his PhD he has studied the problem of scene understanding from RGB-D images. His work has been supported by the Berkeley Fellowship and the Google Fellowship in Computer Vision.

Sponsored in part by Disney Research

