Deep Material-aware Cross-spectral Stereo Matching


Cross-spectral imaging provides strong benefits for recognition and detection tasks. Often, multiple cameras are used for cross-spectral imaging, thus requiring image alignment, or disparity estimation in a stereo setting. Increasingly, multi-camera cross-spectral systems are embedded in active RGBD devices (e.g. RGB-NIR cameras in Kinect and iPhone X). Hence, stereo matching also provides an opportunity to obtain depth without an active projector source. However, matching images from different spectral bands is challenging because of large appearance variations. We develop a novel deep learning framework to simultaneously transform images across spectral bands and estimate disparity. A material-aware loss function is incorporated within the disparity prediction network to handle regions with unreliable matching such as light sources, glass windshields and glossy surfaces. No depth supervision is required by our method. To evaluate our method, we used a vehicle-mounted RGB-NIR stereo system to collect 13.7 hours of video data across a range of areas in and around a city. Experiments show that our method achieves strong performance and reaches real-time speed.


"Deep Material-aware Cross-spectral Stereo Matching"
Tiancheng Zhi, Bernardo R. Pires, Martial Hebert and Srinivasa G. Narasimhan
IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2018.
[PDF] [Supp] [Video] [Dataset] [Poster] [Code(v2)]


PittsStereo-RGBNIR: A Large RGB-NIR Stereo Dataset Collected in Pittsburgh with Challenging Materials

Note: The images look dark because they are linear. Please brighten or gamma correct them for better visualization.
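A minimal sketch of the brightening step, assuming the images have already been loaded as floating-point arrays scaled to [0, 1]. The function name `gamma_correct`, the gamma of 2.2, and the optional `gain` are illustrative choices, not values specified by the dataset.

```python
import numpy as np

def gamma_correct(linear, gamma=2.2, gain=1.0):
    """Map linear intensities in [0, 1] to display-friendly values.

    `gamma` and `gain` are hypothetical defaults for visualization only.
    """
    # Apply an optional exposure gain, then clip back into [0, 1].
    x = np.clip(np.asarray(linear, dtype=np.float64) * gain, 0.0, 1.0)
    # Raising to 1/gamma lifts dark values while keeping 0 and 1 fixed.
    return x ** (1.0 / gamma)
```

For 8-bit display, the result can be multiplied by 255 and cast to `uint8`. This is for viewing only; any processing that relies on linear intensities should keep using the original data.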

[Stereo Matching Data]:
Download: 42000 Stereo Pairs (40000 Train + 2000 Test) + 5000 Sparse Disparity Annotations + Our Results + Evaluation Code
[Material Segmentation Data]:
Download: 3600 Stereo Pairs (2400 Train + 1200 Test) + Dense Material Annotations
[Video Frames]:
Download: 13.7 Hour RGB-NIR Stereo Video (resized to 582x429, middle exposure only)
[Raw Frames]:
Contact the author: Million Unresized Stereo Pairs (3 exposure levels) + GPS + Vehicle States


Model overview. The disparity prediction network (DPN) predicts left-right disparity for an RGB-NIR stereo input. The spectral translation network (STN) converts the left RGB image into a pseudo-NIR image. The two networks are trained jointly using reprojection error.
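To make the reprojection idea concrete, the sketch below computes one direction of such an error with NumPy. It assumes rectified images, a left-view disparity map in pixels, and a pseudo-NIR image that has already been produced by some translation step. The function names, the linear interpolation, and the mean absolute difference are illustrative assumptions, not the released implementation.

```python
import numpy as np

def warp_right_to_left(right, disp_left):
    """Resample the right image into the left view.

    Assumes rectified stereo, so left pixel (x, y) matches right pixel
    (x - d, y). Uses linear interpolation along each row.
    """
    h, w = right.shape
    # Source column in the right image for every left-view pixel.
    xs = np.arange(w, dtype=np.float64)[None, :] - disp_left
    xs = np.clip(xs, 0.0, w - 1.0)  # clamp samples that fall outside the image
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    frac = xs - x0
    rows = np.arange(h)[:, None]
    return (1.0 - frac) * right[rows, x0] + frac * right[rows, x1]

def reprojection_error(pseudo_nir_left, nir_right, disp_left):
    """Mean absolute difference between the pseudo-NIR left image and the
    right NIR image warped into the left view (hypothetical loss form)."""
    warped = warp_right_to_left(nir_right, disp_left)
    return float(np.mean(np.abs(pseudo_nir_left - warped)))
```

A correct disparity map makes the warped right image line up with the pseudo-NIR left image, so the error drops. This is the signal that lets both networks be trained without depth supervision.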

Comparison of smoothing with and without confidence. Without confidence, smoothing lets the unreliable matches on the glass corrupt the reliable matches along the sides of the car, so the predicted disparity (orange) is smaller than the correct one (red). Introducing confidence addresses this issue.
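The effect in this comparison can be illustrated on a single scanline with a toy example. The sketch below uses confidence-weighted local averaging (a normalized-convolution-style filter), which is a stand-in chosen for illustration and not the loss used in the paper. The function name, the 3-pixel window, and the confidence values are all assumptions.

```python
import numpy as np

def smooth_scanline(disp, conf, iters=1, eps=1e-8):
    """Replace each disparity with the confidence-weighted mean of itself
    and its two horizontal neighbors. Equal confidences give plain smoothing."""
    d = np.asarray(disp, dtype=np.float64).copy()
    c = np.asarray(conf, dtype=np.float64)
    for _ in range(iters):
        dp = np.pad(d, 1, mode="edge")  # repeat border values
        cp = np.pad(c, 1, mode="edge")
        num = cp[:-2] * dp[:-2] + cp[1:-1] * dp[1:-1] + cp[2:] * dp[2:]
        den = cp[:-2] + cp[1:-1] + cp[2:] + eps  # eps guards against all-zero weights
        d = num / den
    return d

# Toy scanline: car side (disparity 10) next to glass (disparity 4, too small).
disp = [10, 10, 10, 10, 4, 4, 4, 4]
uniform = smooth_scanline(disp, np.ones(8))
weighted = smooth_scanline(disp, [1, 1, 1, 1, 0.05, 0.05, 0.05, 0.05])
```

With equal weights, the car-side pixel next to the glass is pulled down toward the glass value. With low confidence on the glass, that pixel stays close to 10 and the glass pixels are pulled up toward the car-side disparity instead.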

Scenes transmitted through or reflected by glass appear farther away than the actual glass surface.



This work was supported in part by ChemImage Corporation, an ONR award N00014-15-1-2358, an NSF award CNS-1446601, and a University Transportation Center T-SET grant.