Robotics Thesis Proposal
- Newell-Simon Hall
- Mauldin Auditorium 1305
- ROHIT GIRDHAR
- Ph.D. Student
- Robotics Institute
- Carnegie Mellon University
Spatiotemporal Understanding of People Using Scenes, Objects, and Poses
Humans are arguably one of the most important entities that AI systems would need to understand to be useful and ubiquitous. From autonomous cars observing pedestrians to assistive robots helping the elderly, a large part of this understanding is focused on recognizing human actions, and potentially, their intentions. Humans themselves are quite good at this task: we can look at a person and explain in great detail every action they are doing. Moreover, we can reason over those actions over time, and even predict what potential actions they may intend to do in the future. Computer vision algorithms, on the other hand, have lagged far behind on this task.
In this thesis, we explore techniques to improve human action understanding from visual input. Our key insight is that human actions are dependent on their own state (parameterized by their pose), as well as the state of their environment (parameterized by the scene and the objects in it). We exploit this dependence in three key ways: (1) Predicting a prior on human actions using affordances of the scenes and objects they interact with; (2) Attending to the person and their surroundings when classifying their actions; and (3) Building systems capable of learning from or aggregating this contextual knowledge over space and time to recognize actions. We propose to extend these methods to recognizing actions in complex multi-person videos, where multiple people are performing multiple different actions at any given time.
However, these methods still mostly look at short time scales. Tackling the goal of recognizing human intentions would require reasoning over long temporal horizons. We believe one reason for the limited progress in this direction is the lack of vision tasks that actually require such reasoning. Most video action classification problems are solved fairly well using our previously explored methods by looking at just a few frames. To remedy that, we propose a new benchmark dataset and tasks that by design require reasoning over time to be solved. We believe this would be a first step towards building truly intelligent video understanding systems.
Deva Ramanan (Chair)
Andrew Zisserman (University of Oxford)
Jitendra Malik (University of California, Berkeley)