Detecting Invisible People

We visualize an online tracking scenario from Argoverse that requires tracking a pedestrian through a complete occlusion. Such applications cannot wait for objects to re-appear (e.g., as re-identification approaches do): autonomous agents must properly react during the occlusion. We treat online detection of occluded people as a short-term forecasting challenge.


Monocular object detection and tracking have improved drastically in recent years, but they rely on a key assumption: that objects are visible to the camera. Many offline tracking approaches reason about occluded objects post hoc, linking tracklets together after the object re-appears by means of re-identification (ReID). However, online tracking in embodied robotic agents (such as a self-driving vehicle) fundamentally requires object permanence, the ability to reason about occluded objects before they re-appear. In this work, we re-purpose tracking benchmarks and propose new metrics for the task of detecting invisible objects, focusing on the illustrative case of people. We demonstrate that current detection and tracking systems perform dramatically worse on this task. We introduce two key innovations to recover much of this performance drop. First, we treat occluded object detection in temporal sequences as a short-term forecasting challenge, bringing to bear tools from dynamic sequence prediction. Second, we build dynamic models that explicitly reason in 3D, making use of observations produced by state-of-the-art monocular depth estimation networks. To our knowledge, ours is the first work to demonstrate the effectiveness of monocular depth estimation for the task of tracking and detecting occluded objects. In F1 score, our approach improves by 11.4% over the baseline in ablations and by 5.0% over the state of the art.
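To make the two ideas above concrete, the following is a minimal sketch of forecasting an occluded person's image location by reasoning in 3D. It is an illustration under stated assumptions, not the model used in this work: it assumes a pinhole camera with known intrinsics, a per-person depth value taken from a monocular depth estimate, and a simple constant-velocity motion model. The names `Camera`, `backproject`, `project`, and `forecast`, and all numeric values, are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Camera:
    """Pinhole intrinsics (assumed known; values below are made up)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(cam, u, v, depth):
    """Lift pixel (u, v) with an estimated depth to a 3D camera-frame point."""
    x = (u - cam.cx) * depth / cam.fx
    y = (v - cam.cy) * depth / cam.fy
    return (x, y, depth)


def project(cam, point):
    """Project a 3D camera-frame point back to pixel coordinates."""
    x, y, z = point
    return (cam.fx * x / z + cam.cx, cam.fy * y / z + cam.cy)


def forecast(p_prev, p_last, steps):
    """Constant-velocity extrapolation (an assumed motion model):
    p_last + steps * (p_last - p_prev), applied per coordinate."""
    return tuple(b + steps * (b - a) for a, b in zip(p_prev, p_last))


cam = Camera(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)

# Last two visible observations of a person: pixel centre plus estimated depth.
p_prev = backproject(cam, 860.0, 540.0, 10.0)
p_last = backproject(cam, 910.0, 540.0, 10.0)

# The person is now occluded; forecast two frames ahead in 3D, then reproject.
p_pred = forecast(p_prev, p_last, steps=2)
u_pred, v_pred = project(cam, p_pred)
```

Extrapolating in 3D rather than in pixel space means that motion towards or away from the camera changes the predicted depth, which a purely 2D motion model cannot represent directly.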

Fully occluded people (highlighted in dark on the right), who are not otherwise visible to the camera, are successfully recovered in the 3D top-down view of the scene.

Qualitative Analysis


We thank Gengshan Yang for his help with generating 3D visuals, Patrick Dendorfer for incorporating our metrics with the MOT challenge server, and Xueyang Wang for sharing the low-resolution version of the PANDA dataset. We thank Laura Leal-Taixé and Simon Lucey for insightful discussions, the participants of the human vision experiment (Adithya Murali, Jason Zhang, Jessica Lee, Kushagra Mahajan, Mehar Khurana, Radhika Kannan, Rashmi Salamani, Steve Yadlowsky, Vaishaal Shankar, and Vidhi Jain), and the internal reviewers at CMU (Alireza Golestaneh, David Held, Jack Li, Kangle Deng, and Yi-ting Chen) for reviewing early drafts. This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research, the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0051, and the National Science Foundation (NSF) under grant number IIS-1618903.


	author  = {Khurana, Tarasha and Dave, Achal and Ramanan, Deva},
	title   = {Detecting Invisible People},
	journal = {arXiv preprint arXiv:2012.08419},
	year    = {2020}