Towards Streaming Perception

Mengtian (Martin) Li

Carnegie Mellon University

Yu-Xiong Wang

Carnegie Mellon University
UIUC

Deva Ramanan

Carnegie Mellon University
Argo AI

(Formerly titled "Towards Streaming Image Understanding")

Abstract

Embodied perception refers to the ability of an autonomous agent to perceive its environment so that it can (re)act. The responsiveness of the agent is largely governed by latency of its processing pipeline. While past work has studied the algorithmic trade-off between latency and accuracy, there has not been a clear metric to compare different methods along the Pareto optimal latency-accuracy curve. We point out a discrepancy between standard offline evaluation and real-time applications: by the time an algorithm finishes processing a particular frame, the surrounding world has changed. To these ends, we present an approach that coherently integrates latency and accuracy into a single metric for real-time online perception, which we refer to as "streaming accuracy". The key insight behind this metric is to jointly evaluate the output of the entire perception stack at every time instant, forcing the stack to consider the amount of streaming data that should be ignored while computation is occurring. More broadly, building upon this metric, we introduce a meta-benchmark that systematically converts any single-frame task into a streaming perception task. We focus on the illustrative tasks of object detection and instance segmentation in urban video streams, and contribute a novel dataset with high-quality and temporally-dense annotations. Our proposed solutions and their empirical analysis demonstrate a number of surprising conclusions: (1) there exists an optimal "sweet spot" that maximizes streaming accuracy along the Pareto optimal latency-accuracy curve, (2) asynchronous tracking and future forecasting naturally emerge as internal representations that enable streaming perception, and (3) dynamic scheduling can be used to overcome temporal aliasing, yielding the paradoxical result that latency is sometimes minimized by sitting idle and "doing nothing".
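The idea of evaluating the stack "at every time instant" can be illustrated with a small sketch. It assumes (this is not spelled out in the abstract) that every output carries the time at which it finished, and that each ground-truth timestamp is compared against the most recent output available at that moment. The function name and the timing numbers are hypothetical and only illustrate the pairing step, not the full metric.

```python
from bisect import bisect_right

def pair_outputs_with_ground_truth(gt_times, output_times):
    """For each ground-truth timestamp, return the index of the most recent
    output that had finished by that time, or None if nothing was ready yet.

    Assumption: output_times is sorted in ascending order and holds the
    time (in seconds) at which each output became available.
    """
    paired = []
    for t in gt_times:
        # bisect_right counts the outputs whose finish time is <= t
        k = bisect_right(output_times, t)
        paired.append(k - 1 if k > 0 else None)
    return paired

# Hypothetical example: frames arrive roughly every 33 ms, while the
# algorithm finishes its first two outputs at 80 ms and 160 ms.
gt_times = [0.0, 0.033, 0.066, 0.1, 0.133]
output_times = [0.08, 0.16]
print(pair_outputs_with_ground_truth(gt_times, output_times))
# [None, None, None, 0, 0]
```

In this toy run the frames at 100 ms and 133 ms are both scored against the same stale output, which is the kind of mismatch a streaming metric is meant to expose.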

Watch on Youtube

Watch on Bilibili


M. Li, Y. Wang and D. Ramanan
Towards Streaming Perception
In ECCV, 2020.

Best Paper Honorable Mention

[Paper] [Code] [Bibtex]

Qualitative results can be found in A Visual Walkthrough of Streaming Perception Solutions.

Dataset — Argoverse-HD

Online Viewers: [Train] [Val]

Based upon the autonomous driving dataset Argoverse 1.1, we build a dataset with high-frame-rate annotations for streaming evaluation, which we name Argoverse-HD (High-frame-rate Detection). Although it was created for streaming evaluation, Argoverse-HD can also be used to study image/video object detection, multi-object tracking, and forecasting. One key feature is that our annotations follow MS COCO standards, allowing direct evaluation of COCO pre-trained models on this autonomous driving dataset. Since this dataset is primarily intended for evaluation, ~~we only annotated the validation set~~ (see below), but provide pseudo ground truth for the training set. We find that pseudo ground truth can be used to self-supervise the training of streaming algorithms. Additional details about the dataset itself can be found in Sections 4.1 and A.4 of the paper. Additional details about pseudo ground truth can be found in Sections 3.4 and A.2 of the paper.

Updated Mar 2021: we now have the train, val and test splits all annotated for the streaming perception challenge! Previously, only the annotations for the val split were provided. The test split annotations will be held out for ranking submissions on the challenge leaderboard. The table above contains updated numbers for the size of our dataset (the number of boxes is updated to 1.26M from the 250K reported in Table B of the paper).

We provide the download links to our dataset below. Our dataset is released under the MIT License. However, if you use the images from Argoverse, you should check out their terms of use. The annotations and pseudo ground truth are provided in COCO format with additional metadata, which means that they work directly with cocoapi. You can refer to our code for how to set up the image data and parse the annotations.
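As a rough illustration of what "COCO format" means in practice, the sketch below indexes a COCO-style annotation dictionary using only the standard library. It relies solely on the standard COCO keys (`images`, `annotations`, `categories`, `image_id`, `category_id`, `bbox`); the helper names and the toy values are hypothetical, and the dataset's additional metadata fields are not shown because they are not described here. With cocoapi installed, `pycocotools.coco.COCO(annotation_file)` is the usual way to load such a file.

```python
import json
from collections import defaultdict

def index_annotations(coco):
    """Group COCO-format annotations by image id and map category ids to names."""
    names = {c["id"]: c["name"] for c in coco["categories"]}
    by_image = defaultdict(list)
    for ann in coco["annotations"]:
        by_image[ann["image_id"]].append(ann)
    return names, by_image

def load_annotations(path):
    """Read a COCO-format JSON file from `path` (supplied by the caller)."""
    with open(path) as f:
        return index_annotations(json.load(f))

# Toy annotation dictionary with made-up values, for illustration only.
toy = {
    "images": [{"id": 1, "file_name": "frame_0001.jpg"}],
    "categories": [{"id": 3, "name": "car"}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3,
         "bbox": [100.0, 50.0, 40.0, 20.0]},  # COCO boxes are [x, y, width, height]
    ],
}
names, by_image = index_annotations(toy)
for ann in by_image[1]:
    x, y, w, h = ann["bbox"]
    print(names[ann["category_id"]], x, y, w, h)
```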

Full dataset (29GB on Amazon S3, North America)

Full dataset (29GB on Amazon S3, Asia Pacific)

Full dataset (29GB on Kaggle)

Annotations only (17MB)

Additional annotations with pickup truck labels* (17MB)

Legacy pseudo ground truth with instance masks (66MB)

AWS AMI with the full dataset and the deep learning environment

*By default, COCO classifies "pickup truck" as "truck", while appearance-wise, many "pickup trucks" resemble sedans, which are classified as "car" in COCO. These additional annotations, which are not included in the full dataset downloads, add one extra attribute, "is_pickup_truck", to each object. If used properly, this attribute can help to mitigate the confusion between "truck" and "car".
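One possible way to use this attribute, shown here only as a sketch, is to separate the flagged objects from the rest so that they can be handled differently during training or evaluation (for example, analyzed on their own). The helper name is hypothetical, and the code assumes the attribute is stored on each annotation as a boolean or a 0/1 value; what to do with the two groups is left to the user.

```python
def split_pickup_trucks(annotations):
    """Separate annotations flagged as pickup trucks from all other annotations.

    Assumption: "is_pickup_truck" is a truthy value (True or 1) on flagged
    objects and falsy or absent otherwise.
    """
    pickups = [a for a in annotations if a.get("is_pickup_truck")]
    others = [a for a in annotations if not a.get("is_pickup_truck")]
    return pickups, others

# Toy annotations with made-up ids, for illustration only.
anns = [
    {"id": 1, "category_id": 8, "is_pickup_truck": True},
    {"id": 2, "category_id": 8, "is_pickup_truck": False},
    {"id": 3, "category_id": 3},
]
pickups, others = split_pickup_trucks(anns)
print([a["id"] for a in pickups], [a["id"] for a in others])
# [1] [2, 3]
```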

Note that our dataset only contains images from the center ring camera in Argoverse 1.1. For LiDAR data and data from the other cameras, please check out the Argoverse website (links listed under "Argoverse 3D Tracking v1.1").

Workshop Challenge

We are proud to announce the 2021 Streaming Perception Challenge! This challenge is part of the CVPR 2021 Workshop on Autonomous Driving (WAD) and the Argoverse 2021 competition series.

Acknowledgements: This work was supported by the CMU Argo AI Center for Autonomous Vehicle Research, by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR001117C0051, and by NSF Grant 1618903. Annotations for Argoverse-HD were provided by Scale AI.