VASC

Visual simulation of fluids has become an indispensable tool for computer graphics. Many fluid phenomena can be simulated by solving the Navier-Stokes (NS) equations. In computer graphics, the NS equations are mostly used for simulating smoke, water and fire, but they are useful for other purposes as well. In this talk, we show how we use the NS equations in several different applications, including inverse cloud simulation, aerodynamic sound simulation, modeling of fluids from images, and editing of simulated fluid data.
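For reference, the standard incompressible form of the NS equations, which fluid solvers in graphics typically discretize, is

    \frac{\partial \mathbf{u}}{\partial t} + (\mathbf{u} \cdot \nabla)\mathbf{u} = -\frac{1}{\rho}\nabla p + \nu \nabla^2 \mathbf{u} + \mathbf{f}, \qquad \nabla \cdot \mathbf{u} = 0,

where \mathbf{u} is the velocity field, p the pressure, \rho the density, \nu the kinematic viscosity, and \mathbf{f} the external forces (e.g. buoyancy for smoke).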

Yoshinori Dobashi has been an associate professor in the Graduate School of Engineering at Hokkaido University, Japan, since 2000. His research interests center on computer graphics, including lighting models. Dobashi received his BE, ME and Ph.D. in Engineering in 1992, 1994, and 1997, respectively, from Hiroshima University. He worked at Hiroshima City University from 1997 to 2000 as a research associate.

Sponsored in part by Disney Research

Boosting classifiers have been extensively used for learning multi-view single-object detectors (e.g. faces, cars or pedestrians) and multiple-object-category detectors. Object detection has been evolving from being specific to a given object category to multi-view detection, or even to detecting multiple categories at the same time. The usual framework for Boosting-based object detection uses binary classification (e.g. AdaBoost). Multi-class detection problems (e.g. multi-view face detection or car detection) have usually been solved with binary Boosting classifiers: either with a monolithic detector (i.e. object-vs-background) or with one detector per object view or positive class (i.e. training and executing K detectors). On the other hand, Boosting approaches are falling in popularity behind Deep Learning based detectors, as the latter are achieving impressive results. A consequence of Deep Learning becoming the mainstream is that Boosting-based detectors no longer benefit from the new ideas and techniques advancing the object detection field.
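For context, the sketch below shows the generic binary AdaBoost training loop that such detectors build on; it uses decision stumps as weak learners, the names are illustrative, and it is not the speaker's PI-Boost or BAdaCost code. A monolithic object-vs-background detector would apply this classifier to candidate windows, while the multi-view setting described above would train and run K such detectors.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    def adaboost_train(X, y, n_rounds=50):
        """Binary AdaBoost with labels y in {-1, +1}."""
        X, y = np.asarray(X), np.asarray(y)
        n = len(y)
        w = np.full(n, 1.0 / n)                      # sample weights
        learners, alphas = [], []
        for _ in range(n_rounds):
            stump = DecisionTreeClassifier(max_depth=1)
            stump.fit(X, y, sample_weight=w)         # weak learner on weighted data
            pred = stump.predict(X)
            err = np.clip(np.sum(w[pred != y]), 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)    # weight of this weak learner
            w *= np.exp(-alpha * y * pred)           # up-weight misclassified samples
            w /= w.sum()
            learners.append(stump)
            alphas.append(alpha)
        return learners, alphas

    def adaboost_predict(X, learners, alphas):
        score = sum(a * h.predict(np.asarray(X)) for h, a in zip(learners, alphas))
        return np.sign(score)                        # strong classifier H(x)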

In this talk we present our ongoing research on developing tools needed in the Boosting-based detection field. We present PI-Boost, a multi-class algorithm with binary weak learners that separate subsets of classes. We also present BAdaCost, a multi-class cost-sensitive classification algorithm. It combines a set of cost-sensitive multi-class weak learners to obtain a strong classification rule within the Boosting framework. Finally we show how to apply BAdaCost to the multi-view detection problem and what can be achieved with this new tool.

Jose M. Buenaposada received his BS and MS degrees in 1999 and his PhD in 2005, all in computer science, from the Technical University of Madrid (Universidad Politecnica de Madrid). Since 2003 he has been working at the Rey Juan Carlos University, Spain, where since 2008 he has been a Contratado Doctor (equivalent to Associate Professor). He is a member of the Computer Vision and Image Processing (CVIP) research group at Universidad Rey Juan Carlos, and an external member of the Computer Perception Group (PCR) at Universidad Politecnica de Madrid. His research interests in computer vision include image alignment, object tracking, face analysis and object detection.

Existing vision-based solutions for obtaining precise, dense, object-centric 3D reconstructions require expensive hardware such as high resolution cameras or integrated depth sensors, while lower resolutions and susceptibility to motion blur and rolling shutter effects have limited the potential of smart devices. With a focus on spatio-angular resolution, this work advances the view that the increasingly ubiquitous ability to capture high frame rate video is the key to enabling precise reconstructions on smartphone platforms. This is achieved by making use of both photometric and silhouette-based cues.

Recent advances in SLAM (DSO, LSD, and DVO) have demonstrated compelling benefits of photometric alignment over feature-based matching: less drift thanks to subpixel alignment, higher robustness to illumination change by modelling it explicitly, and fast runtime.

Drawing upon these discoveries, at the core of this work is an entirely photometric, object-centric bundle adjustment that focuses more on the recovery of the object's surface than on the odometry of the camera. By processing a video in batch, photometric methods can be pushed beyond the limits explored by SLAM algorithms. By exploiting the temporal consistency of high frame rate sequences, reconstructions can be obtained significantly more efficiently than with existing batch methods.
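As an illustration only (a generic formulation, not necessarily the exact cost used in this work), a photometric bundle adjustment jointly optimizes surface points X_i and camera poses T_j by minimizing a cost of the form

    E = \sum_{i} \sum_{j \in \mathcal{V}(i)} \left\| I_j\big(\pi(T_j X_i)\big) - I_{r(i)}\big(\pi(T_{r(i)} X_i)\big) \right\|_{\gamma},

where I_j are image intensities, \pi is the camera projection, r(i) is a reference frame for point i, \mathcal{V}(i) is the set of frames observing that point, and \|\cdot\|_{\gamma} is a robust (e.g. Huber) norm.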

Silhouettes are complementary to photometric cues, as they mark the regions where a surface can no longer be directly observed, and they allow for the reconstruction of non-Lambertian surfaces such as metal and glass. Instead of generating visual hulls by forcing pixels to be labelled as inside or outside the object, visual edges are treated as potential occlusion and silhouette boundaries, from which edge clouds with normals are reconstructed. This allows boundaries to be reconstructed at self-occlusions rather than only at foreground-background transitions.

Christopher Ham is a computer vision PhD candidate at the University of Queensland, Australia, with graduation just around the corner. He is part of the CI2CV Lab led by Prof. Simon Lucey. With a degree in Mechatronics Engineering, and having been taught how to work with wood from a young age, he has a very hands-on attitude. While designing and making something, be it physical or software, he is always considering how someone might approach it, what their first impressions will be, and how this affects their expectations and interactions. Through design, his goal is to shape these aspects.

Sponsored in part by Disney Research

Despite recent progress, AI is still far from understanding the physics of the world, and there is a large gap between the abilities of humans and the state-of-the-art AI methods. In this talk, I will focus on physics-based scene understanding and interactive visual reasoning, which are crucial next steps in computer vision and AI. The first part of the talk will describe our work on understanding preliminary physics from images, and the second part of the talk will be about our recent work on using Reinforcement Learning and Imitation Learning to perform tasks in the challenging AI2-THOR environment.

Roozbeh Mottaghi is a Research Scientist at Allen Institute for Artificial Intelligence (AI2). Prior to joining AI2, he was a post-doctoral researcher at the Computer Science Department at Stanford University. He obtained his PhD in Computer Science in 2013 from UCLA. His research is mainly focused on computer vision and machine learning.

The quantity of video data is vast, yet our capabilities for visual recognition and understanding in videos lag significantly behind those for images. In this talk, I will discuss the challenges of scale in labeling, modeling, and inference behind this gap. I will then present three works addressing these challenges. The first is a method for efficient inference in action detection in videos. We formulate this method as a reinforcement learning-based agent that interacts with a video over time and decides both where in the video to look next and when to emit a prediction, significantly reducing the total number of frames processed.
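Schematically, the agent's inference loop looks roughly like the sketch below (illustrative pseudocode, not the authors' implementation; the policy function and its outputs are assumptions for the example). Because only the glimpsed frames are ever observed, inference cost scales with the number of glimpses rather than the video length.

    import numpy as np

    def detect_actions(video_frames, policy, max_steps=20):
        """policy(observation, state) -> (next_frame_idx, emit, detection, new_state)."""
        t = 0                     # index of the frame currently being observed
        state = None              # recurrent state carried across glimpses
        detections = []
        for _ in range(max_steps):
            obs = video_frames[t]                  # the agent looks at a single frame
            t_next, emit, det, state = policy(obs, state)
            if emit:                               # the agent decides when to output
                detections.append(det)             # e.g. (start_frame, end_frame, score)
            t = int(np.clip(t_next, 0, len(video_frames) - 1))
        return detections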

The second work pushes towards dense, detailed understanding of actions in video. We introduce a dataset of dense, multi-label action annotations to enable research in this direction, and a model that increases the temporal modeling capacity of standard recurrent neural networks for action recognition to target this task.

Finally, I will discuss an approach for leveraging noisy web videos to learn classifiers for new concepts without requiring manually labeled training videos. We propose a reinforcement learning-based formulation for selecting the right examples for training a classifier from noisy web search results. I will show that after learning a data labeling policy on a small labeled training set, we can use this policy to automatically label noisy web data for new visual concepts.

Serena Yeung is a Ph.D. student in the Stanford Vision Lab, advised by Prof. Fei-Fei Li. Her research interests are in computer vision, machine learning, and deep learning. She is particularly interested in the areas of video understanding, human action recognition, and healthcare applications. Serena interned at Facebook AI Research in Summer 2016, and before starting her Ph.D., received a B.S. and M.S. in Electrical Engineering, both from Stanford.

Automated analysis of dense crowds is a challenging problem with far-reaching applications in crowd safety and management, as well as in gauging the political significance of protests and demonstrations. In this talk, I will first describe a counting approach that uses traditional computer vision techniques and was recently applied to the Catalonia demonstrations in Spain in 2015 and 2016. An extension of this work using a convolutional neural network with hundreds of layers is presented next, made possible in part by a new counting dataset with over one million humans, all marked with dot annotations. Next, I will discuss how context in the form of local consistency captures the similarity of scale in local neighborhoods of an image and is used to detect partially visible humans in dense crowds. Finally, for the task of re-identification in a multi-camera setup, spatio-temporal context in the form of personal, social and environmental constraints helps eliminate incorrect hypotheses and significantly improves correct re-acquisition of people across cameras, especially when appearance and visual features alone are insufficient.



Haroon Idrees is a postdoctoral researcher in the Center for Research in Computer Vision (CRCV) at the University of Central Florida (UCF). He is interested in machine vision and learning, with a focus on crowd analysis, action recognition, multi-camera and airborne surveillance, as well as deep learning and multimedia content analysis. He chaired the THUMOS Challenge on Action Recognition (CVPR 2015) and has been a program committee member of the Workshop on Applications for Aerial Video Exploitation (WACV 2015), the Multiple Object Tracking Challenge (ECCV 2016), and the upcoming BMTT-PETS Workshop on Tracking and Surveillance (CVPR 2017) and Open Domain Action Recognition (CVPR 2017).

He has published several papers in CVPR, ICCV, ECCV, the Journal of Image and Vision Computing, and IEEE Transactions on Pattern Analysis and Machine Intelligence. He received his BSc (Honors) degree in computer engineering from the Lahore University of Management Sciences, Pakistan, in 2007, and his PhD in computer science from the University of Central Florida in 2014.

Faculty Hosts: Kris Kitani, Yaser Sheikh

Sponsored in part by Disney Research.

Over the past 5 years the community has made significant strides in the field of Computer Vision. Thanks to large scale datasets, specialized computing in the form of GPUs, and many breakthroughs in designing better convnet architectures, Computer Vision systems operating in the wild at scale are becoming a reality. At Facebook AI Research we want to embark on the journey of making breakthroughs in the field of AI and using them for the benefit of connecting people and helping remove barriers to communication. In that regard Computer Vision plays a significant role, as the media content coming to Facebook is ever increasing, and building models that understand this content is crucial to achieving our mission of connecting everyone. In this talk I will give an overview of how we think about problems related to Computer Vision at Facebook and touch on various aspects of supervised, semi-supervised and unsupervised learning. I will move between various research efforts involving representation learning, highlight some large scale applications that use this technology, and discuss limitations of current systems.

Manohar Paluri is currently a Research Lead and manages the Computer Vision team in the Applied Machine Learning organization at Facebook. He is passionate about Computer Vision and about the longer-term goal of building systems that can perceive the way humans do. Throughout his career he has spent considerable time on Computer Vision problems in industry and academia, working at Google Research, IBM Watson Research Labs and the Stanford Research Institute before helping co-found Facebook AI Research, directed by Dr. Yann LeCun. He spent his formative years at IIIT Hyderabad, where he finished his undergraduate studies with Honors in Computer Vision, and then joined Georgia Tech to pursue his Ph.D. For over a decade he has been working on various problems related to Computer Vision and perception in general, and has published at CVPR, NIPS, ICCV, ECCV, ICLR, KDD, IROS, ACCV and other venues. He is passionate about building real world systems that are used by billions of people. Some of these systems are running at Facebook and already have a tremendous impact on how people communicate using Facebook.

Sponsored in part by Disney Research

As opposed to the traditional notion of actions and activities in computer vision, where the motion (e.g. jumping) or the goal (e.g. cooking) is the focus, I will argue for an object-centred perspective on actions and activities, during daily routine or as part of an industrial workflow. I will present approaches for understanding ‘what’ objects one interacts with, ‘how’ these objects have been used and ‘when’ interactions take place.

The talk will be divided into three parts. In the first part, I will present unsupervised approaches to the automatic discovery of task-relevant objects and their modes of interaction, as well as to automatically providing guidance on using novel objects through a real-time wearable setup. In the second part, I will introduce supervised approaches to two novel problems: action completion – when an action is attempted but not completed – and expertise determination – who is better in task performance and who is best. In the final part, I will discuss work in progress on uncovering labelling ambiguities in object interaction recognition, including ambiguities in defining the temporal boundaries of object interactions and ambiguities in verb semantics.

Dima Damen is a Lecturer (Assistant Professor) in Computer Vision at the University of Bristol. She received her Ph.D. from the University of Leeds (2009). Dima's research interests are in the automatic understanding of object interactions, actions and activities using static and wearable visual (and depth) sensors. Dima co-chaired BMVC 2013, is area chair for BMVC (2014-2017) and associate editor of IET Computer Vision. In 2016, Dima was selected as a Nokia Research collaborator. She currently supervises 7 Ph.D. students, and 2 postdoctoral researchers.

Humans perform a wide range of complex tasks such as navigation, manipulation of diverse objects and planning their interactions with other humans. However, at birth humans are not yet adept at many of these tasks. When observing infants, one might conclude that they perform random actions such as flailing their limbs or manipulating objects without purpose. It is possible that while infants engage in such exploration of their motor abilities, they learn a mapping between their sensory and motor systems that enables them, as adults, to plan and perform complex sensorimotor tasks. Taking inspiration from this hypothesis, I will present some initial results on how a robotic agent can learn, via random interaction with its environment and its intrinsic curiosity, to push objects, manipulate ropes and navigate mazes. I will then show how these basic skills can be combined with imitation to perform more complex tasks. Finally I will touch upon how similar models of object interaction can be used to reason about human behavior in sports games.

Pulkit Agrawal is a PhD Student in the department of Computer Science at UC Berkeley. His research focuses on computer vision, robotics and computational neuroscience. He is advised by Dr. Jitendra Malik. Pulkit completed his bachelors in Electrical Engineering from IIT Kanpur and was awarded the Director’s Gold Medal. He is a recipient of Fulbright Science and Technology Award, Goldman Sachs Global Leadership Award, OPJEMS, Sridhar Memorial Prize and IIT Kanpur’s Academic Excellence Awards among others. Pulkit served as the General Secretary of Science and Technology Council and vice-captain of water-polo team at IIT-Kanpur. Pulkit holds a “Sangeet Prabhakar” (equivalent to bachelors in Indian classical Music) and occasionally performs in music concerts.

Assistive technology is the art of building tools, devices and services that can support activities of daily life of people with disabilities. In this talk, I will describe some recent projects from my UCSC group focusing on sensing technology for people who are blind and for people with low vision. These include: blind wayfinding using visual landmarks and inertial sensors; text and sign reading; and accessible public transportation. I will conclude with a few reflections on some critical requirements of accessible wayfinding systems.

Roberto Manduchi is a Professor of Computer Engineering at the University of California, Santa Cruz, where he conducts research in the areas of computer vision and sensor processing with applications to assistive technology.  Prior to joining UCSC in 2001, he worked at the NASA Jet Propulsion Laboratory and at Apple.  He is a consultant for Aquifi, Inc., and sits on the scientific advisory board of Aira, Inc. In 2013 he shared with Carlo Tomasi the Helmholtz Test-of-Time Award from the International Conference on Computer Vision for their article on Bilateral Filtering.

Host: Dragan Ahmetovic

