Current methods for object detection, segmentation and tracking fail in the presence of severe occlusions in busy urban environments. Labeled real data of occlusions is scarce (even in large datasets) and synthetic data leaves a domain gap, making it hard to explicitly model and learn occlusions. In this work, we present the best of both the real and synthetic worlds for automatic occlusion supervision using a large readily available source of data: time-lapse imagery from stationary webcams observing street intersections over weeks, months or even years. We introduce a new dataset, Watch and Learn Time-lapse (WALT), consisting of 12 (4K and 1080p) cameras capturing urban environments over a year. We exploit this real data in a novel way to first automatically mine a large set of unoccluded objects and then composite them in the same views to generate occlusion scenarios. This self-supervision is strong enough for an amodal network to learn the object-occluder-occluded layer representations. We show how to speed up discovery of unoccluded objects and relate the confidence in this discovery to the rate and accuracy of training of occluded objects. After watching and automatically learning for several days, this approach shows significant performance improvement in detecting and segmenting occluded people and vehicles, over human-supervised amodal approaches.
Given a sequence of images from a stationary camera, we compute a median image by taking the per-pixel median RGB value over a collection of images. Since images are captured throughout the day and under different weather, computing a single median image is unrealistic. To create realistic background images, we instead generate median images for varying imaging conditions, such as time of day or weather (e.g., sunny, rainy). Each is computed by sampling images captured under the corresponding condition, as shown below.
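The per-condition median computation can be sketched as follows. This is a minimal NumPy sketch; the function names and condition labels are illustrative, not from the released code:

```python
import numpy as np

def median_background(frames):
    """Per-pixel median over a stack of (H, W, 3) frames."""
    return np.median(np.stack(frames, axis=0), axis=0).astype(np.uint8)

def backgrounds_by_condition(frames, conditions):
    """Group frames by an imaging-condition label (e.g. 'sunny-noon',
    'rainy-evening') and compute one median background per condition."""
    groups = {}
    for frame, cond in zip(frames, conditions):
        groups.setdefault(cond, []).append(frame)
    return {cond: median_background(fs) for cond, fs in groups.items()}
```

Because the median is robust to transient foreground pixels, objects that pass through the scene are suppressed and only the static background survives in each condition's image.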
We exploit the time-lapse data in a novel way to mine a large dataset of real unoccluded objects over time. We develop a new method to classify unoccluded objects based on the idea that when objects on the same ground plane occlude one another, their bounding boxes overlap in a particular common configuration. We visualize the detected unoccluded objects on the time-lapse images.
We visualize the mined unoccluded objects at different locations and time instances. Observe the accurate segmentation and detection across different object poses and times.
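One simple instance of the ground-plane overlap idea can be sketched as below. The exact configuration test in the paper may differ; here we only assume that, for objects on the same ground plane, the object whose bounding box reaches lower in the image (larger bottom-edge y) is closer to the camera and therefore the occluder:

```python
def boxes_overlap(a, b):
    # Boxes are (x1, y1, x2, y2) in image coordinates, y increasing downward.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def is_unoccluded(box, others):
    """Illustrative sketch: a detection is flagged as occluded if any other
    box overlaps it while extending lower in the image (closer to camera
    under the shared ground-plane assumption)."""
    return not any(boxes_overlap(box, o) and o[3] > box[3] for o in others)
```

Detections that pass this test across many frames become candidate unoccluded objects for later compositing.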
Once unoccluded objects are discovered, they are composited in layers back into the same scene, as shown below. We also generate the ground-truth masks and bounding boxes of these composite images for training.
Visualization of the Clip-Art WALT dataset on a 4K WALT camera:
Visualization of the Clip-Art WALT dataset on multiple cameras. We generate diverse occlusions between people and vehicles using continuously detected unoccluded objects:
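The layered compositing that produces these images, together with amodal and visible ground truth, can be sketched as follows. This is a simplified NumPy sketch under our own conventions (hard pasting, pre-sorted depth order); the actual pipeline must also handle scale and appearance consistency:

```python
import numpy as np

def composite_layers(background, objects):
    """Paste object sprites into a background in back-to-front order.
    Each object is (rgb, mask, (y, x)): an (h, w, 3) patch, an (h, w)
    boolean mask, and the top-left paste position. Returns the composite
    image plus per-object amodal (full) and visible (modal) masks."""
    canvas = background.copy()
    H, W = background.shape[:2]
    amodal = []
    for rgb, mask, (y, x) in objects:          # assumed sorted back-to-front
        h, w = mask.shape
        full = np.zeros((H, W), dtype=bool)
        full[y:y + h, x:x + w] = mask
        canvas[full] = rgb[mask]               # paint only the masked pixels
        amodal.append(full)
    visible = []
    for i, m in enumerate(amodal):
        occluded = np.zeros_like(m)
        for front in amodal[i + 1:]:           # later layers sit in front of m
            occluded |= front
        visible.append(m & ~occluded)
    return canvas, amodal, visible
```

The amodal mask records each object's full extent even where it is covered, which is exactly the supervision the amodal network needs to learn object-occluder-occluded layering.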
Since human annotators can only hallucinate an object's extent in the occluded region, their labels may not be reliable. To circumvent this problem, we propose using the consistency of stationary-object segmentation and detection under occlusion as a metric to quantify the accuracy of the algorithm. From the WALT test set, we mine unoccluded stationary objects by clustering objects detected at the same location. We use the unoccluded bounding box and segmentation of a stationary object as ground truth to compare against predictions when the object is occluded by another object at a different time instance. Sample detected stationary objects with their masks and bounding boxes are shown below:
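The clustering of detections at the same location can be sketched as a greedy grouping over box IoU. This is an illustrative sketch; `cluster_stationary` and the IoU threshold are our assumptions, not the exact procedure from the paper:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def cluster_stationary(detections, thresh=0.8):
    """Greedily group detections (from different time instances) whose
    boxes stay nearly fixed, i.e. a likely parked vehicle seen repeatedly
    at the same location. The 0.8 threshold is illustrative."""
    clusters = []
    for box in detections:
        for cluster in clusters:
            if iou(box, cluster[0]) >= thresh:
                cluster.append(box)
                break
        else:
            clusters.append([box])
    return clusters
```

Within each cluster, any time instance where the object is unoccluded supplies reference ground truth for instances where it is occluded.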
We observe accurate amodal segmentation of vehicles even under severe occlusions.
Our method performs well under multiple layers of severe occlusion among people.
Results showcasing our amodal detection with multi-object interactions.
Results showcasing the generalization of our method to a camera captured with an iPhone at an intersection.
We further show the generalization of our amodal segmentation to multiple classes.
Below, we compare the generalization of our method trained on the 12 WALT cameras to a new, unseen camera against the results of retraining the pipeline with the new camera's data included.
We replicate the WALT dataset using computer graphics rendering and use this rendered data only to evaluate the individual components of our algorithm. We use a 3D model of a parking lot and simulate object trajectories similar to those in the real-world parking lot. We render 1000 time-lapse images of the scene from multiple viewpoints, with cameras placed on vehicle dashboards or on infrastructure around the parking lot. Sample rendered images from the dataset are shown below.
For an in-depth description of WALT, please refer to our paper.
"WALT: Watch And Learn 2D Amodal Representation using Time-lapse Imagery",N. Dinesh Reddy, Robert Tamburo, and Srinivasa Narasimhan