Jon A. Webb, Bill Ross

Real-Time Parallel Stereo Vision

This is a "sampler" page of the Robotics Institute, Carnegie Mellon University.

In stereo vision, the views from multiple cameras are used to compute range, that is, the distance from the cameras to points in the scene. The principle is very simple and has been known since (at least) the Middle Ages: points near an observer appear to move a lot when the viewpoint shifts, while points far away seem to move less. This apparent shift is known as the disparity, as shown here.

Thus, if we know the precise relationships between the cameras (position, focal length, etc.), we can determine range by comparing images and finding matching points. But while human vision does this very quickly, it is hard to do with computers. It requires a lot of computation (because so many points must be compared between the images), and it is fraught with error (because so much of each image looks alike, creating false matches).
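The relationship between disparity and range can be sketched for the simplest case. The example below assumes two identical pinhole cameras with parallel optical axes, where range equals focal length times baseline divided by disparity. The function name and the numbers (an 800-pixel focal length and a 0.1 m baseline) are illustrative assumptions, not values from this system.

```python
def range_from_disparity(disparity_px, focal_length_px, baseline_m):
    """Range in metres for a point with the given disparity in pixels.

    Assumes parallel, identical pinhole cameras (an idealization).
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive; zero means a point at infinity")
    return focal_length_px * baseline_m / disparity_px


# Hypothetical camera parameters, chosen only for illustration.
FOCAL_LENGTH_PX = 800.0
BASELINE_M = 0.1

# Nearby points shift a lot between views; distant points shift little.
for disparity in (40.0, 10.0, 2.0):
    rng = range_from_disparity(disparity, FOCAL_LENGTH_PX, BASELINE_M)
    print(f"disparity {disparity:5.1f} px -> range {rng:6.2f} m")
```

With these assumed parameters, disparities of 40, 10 and 2 pixels correspond to ranges of 2 m, 8 m and 40 m. The same one-pixel matching error therefore costs far more accuracy at long range than at short range.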

Computational problem

In dealing with the computational problem, we can apply constraints and use powerful computers. First, the possible matches for a point cannot lie just anywhere in the other image; instead, they fall on a line (called an epipolar line). Second, if we carefully align the cameras and lenses using an adjustable jig, we can make the epipolar lines fall on the rows of the image, which makes the computation much simpler. The jig illustrated here is used to do this alignment.
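Once the epipolar lines coincide with image rows, matching becomes a one-dimensional search along a single row. The sketch below illustrates that idea with a sum-of-squared-differences (SSD) window comparison. It is a minimal illustration, not the code that runs on iWarp; the function names, the window size and the convention that a point at column x in the left row appears at column x - d in the right row are all assumptions.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length windows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def match_along_row(left_row, right_row, x, half_window, max_disparity):
    """Disparity (pixels) whose right-row window best matches the left-row
    window centred on column x. Only this one row is searched."""
    lo, hi = x - half_window, x + half_window + 1
    if lo < 0 or hi > len(left_row):
        raise ValueError("window falls outside the row")
    patch = left_row[lo:hi]
    best_d, best_cost = 0, float("inf")
    for d in range(max_disparity + 1):
        if lo - d < 0:  # candidate window would leave the image
            break
        cost = ssd(patch, right_row[lo - d:hi - d])
        if cost < best_cost:
            best_d, best_cost = d, cost
    return best_d


# Synthetic rows: the same intensity bump, shifted by 3 pixels.
left_row = [10, 10, 10, 10, 10, 10, 10, 50, 90, 50, 10, 10, 10, 10, 10, 10]
right_row = [10, 10, 10, 10, 50, 90, 50, 10, 10, 10, 10, 10, 10, 10, 10, 10]
print(match_along_row(left_row, right_row, x=8, half_window=2, max_disparity=6))
```

For these synthetic rows the search recovers the 3-pixel shift. Without row alignment, every candidate would require tracing a slanted line through the other image and interpolating pixels along it.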

Even with these optimizations and others, we still need a powerful computer to execute the algorithm in real time. For this, we use the iWarp computer, illustrated here.

iWarp is a compact computer specifically designed to execute image and signal processing algorithms efficiently:

It is systolic. This means that data can flow directly from the processor to the interprocessor communications pathway without passing through memory. This effectively doubles the memory bandwidth. Memory bandwidth is important in image and signal processing because relatively little computation is done per datum.

It is synchronous. Both processor and pathway bandwidth can be coupled directly to the external video signal, eliminating the need for expensive and time-consuming buffering of video data.

It is powerful. Pathway bandwidth is 40 MB/s, more than enough for video. Processor power is 20 MFLOPS per processor, or 1.28 GFLOPS for a 64-processor array, which fits in a standard 19" rack.

Dealing with error

Reducing error is the reason why we use more than two cameras. Human vision uses a lot of high-level understanding to make do with only two eyes. But with computers, it is often best to find "brute force" ways to reduce error, rather than trying to be clever. With more than two cameras, we introduce redundancy, which reduces the chance of false matches and increases accuracy. This is the essence of the Kanade-Okutomi multibaseline stereo algorithm we use.
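The benefit of redundancy can be illustrated with a small sketch in the spirit of the multibaseline approach: the matching cost for each candidate depth is summed over several camera pairs, so a false match must look good in every pair at once. This is a simplified, one-dimensional illustration under stated assumptions, not the implementation used here. It assumes aligned cameras whose disparities are whole-number multiples (baseline_ratios) of a shared unit disparity, which is proportional to inverse distance. All names and the periodic test pattern are hypothetical.

```python
def ssd(a, b):
    """Sum of squared differences between two equal-length windows."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def summed_costs(ref_row, other_rows, baseline_ratios, x, half_window, max_unit):
    """Cost of each candidate unit disparity, summed over all camera pairs."""
    lo, hi = x - half_window, x + half_window + 1
    if lo - max(baseline_ratios) * max_unit < 0 or hi > len(ref_row):
        raise ValueError("search window falls outside the row")
    patch = ref_row[lo:hi]
    costs = []
    for k in range(max_unit + 1):
        total = 0
        for row, ratio in zip(other_rows, baseline_ratios):
            shift = ratio * k  # disparity grows in proportion to baseline
            total += ssd(patch, row[lo - shift:hi - shift])
        costs.append(total)
    return costs


def best_candidate(costs):
    """Index of the lowest cost (the first one if there is a tie)."""
    return min(range(len(costs)), key=costs.__getitem__)


# A repetitive surface (period 6 pixels) that defeats any single camera pair.
PATTERN = [0, 3, 9, 4, 1, 7]

def periodic_row(length, shift):
    return [PATTERN[(j + shift) % len(PATTERN)] for j in range(length)]

TRUE_UNIT = 4
ref = periodic_row(40, 0)
cam_a = periodic_row(40, 2 * TRUE_UNIT)  # baseline ratio 2
cam_b = periodic_row(40, 3 * TRUE_UNIT)  # baseline ratio 3

only_a = summed_costs(ref, [cam_a], [2], x=20, half_window=2, max_unit=5)
only_b = summed_costs(ref, [cam_b], [3], x=20, half_window=2, max_unit=5)
both = summed_costs(ref, [cam_a, cam_b], [2, 3], x=20, half_window=2, max_unit=5)
print(best_candidate(only_a), best_candidate(only_b), best_candidate(both))
```

In this synthetic case each pair alone has several perfect (zero-cost) candidates and picks the wrong one, while the summed cost is zero only at the true value of 4. The false matches of the two pairs fall at different candidates, so they do not reinforce each other.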

Another way of reducing error is to project a pattern on the scene. This is not always desirable, but when we can do it, it works very well. The pattern (almost any pattern will do; frequency-modulated sine waves appear to be best) provides sufficient texture for the matching algorithm to work almost anywhere in the image, except where the surface is not visible to all cameras.
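As a rough illustration of what a frequency-modulated sine pattern looks like, the sketch below generates one row of intensities using the textbook form of frequency modulation, a carrier sine whose phase is perturbed by a slower sine. The exact pattern projected in this system is not specified here, so the function name and every parameter value are assumptions.

```python
import math


def fm_sine_row(width, carrier_cycles, modulator_cycles, modulation_index):
    """One row of a frequency-modulated sine pattern, intensities in [0, 1].

    carrier_cycles and modulator_cycles count cycles across the full width.
    """
    row = []
    for x in range(width):
        t = x / width
        phase = (2 * math.pi * carrier_cycles * t
                 + modulation_index * math.sin(2 * math.pi * modulator_cycles * t))
        row.append(0.5 + 0.5 * math.sin(phase))
    return row


# Hypothetical parameters: 256 columns, 20 carrier cycles, 1 modulator cycle.
row = fm_sine_row(width=256, carrier_cycles=20, modulator_cycles=1, modulation_index=8.0)
print(len(row), min(row) >= 0.0, max(row) <= 1.0)
```

Because the local spatial frequency varies across the row, neighbouring stripes differ in width, which gives a matching window more to distinguish than a plain sine wave of constant period would.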

The result is that we are able to capture and process images from three cameras and turn them into a 240x256 range image with 16 depth levels at 15 Hz. This makes it one of the fastest depth ranging systems ever implemented, and the fastest outright among systems in which all processing is implemented in software.
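To put these figures in perspective, the short calculation below counts how many (pixel, depth level) candidates must be evaluated each second and compares that with the 1.28 GFLOPS peak rate quoted above. It assumes, purely for illustration, that every depth level is evaluated at every pixel and that the whole array runs at its peak rate; neither assumption is stated in the text.

```python
ROWS, COLS = 240, 256      # range image size
DEPTH_LEVELS = 16          # candidate depths per pixel
FRAME_RATE_HZ = 15         # range images per second
PEAK_FLOPS = 1.28e9        # quoted peak for the array (assumed fully used)

candidates_per_frame = ROWS * COLS * DEPTH_LEVELS
candidates_per_second = candidates_per_frame * FRAME_RATE_HZ
flops_per_candidate = PEAK_FLOPS / candidates_per_second

print(f"{candidates_per_second:,} candidates per second")
print(f"about {flops_per_candidate:.0f} peak floating-point operations per candidate")
```

Under these assumptions there are roughly 14.7 million candidates per second and fewer than 90 peak floating-point operations available for each, which has to cover the window comparisons for every camera pair.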