An Active Multibaseline Stereo System with Real-Time Image Acquisition

Sing Bing Kang, Jon A. Webb, C. Lawrence Zitnick, Takeo Kanade

School of Computer Science, Carnegie Mellon University

Pittsburgh, PA 15213-3891

CMU-CS-94-167

Acknowledgment

This research was partially supported by the Advanced Research Projects Agency of the Department of Defense under contract number F19628-93-C-0171, ARPA order number A655, "High Performance Computing Graphics," monitored by Hanscom Air Force Base. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency, the Department of Defense, or the U.S. government.

Abstract

We describe our four-camera multibaseline stereo system in a convergent configuration and our implementation of a parallel depth recovery scheme for this system. Our system is capable of image capture at video rate, which is critical in applications that require three-dimensional tracking. We obtain dense stereo depth data by projecting a light pattern of frequency-modulated, sinusoidally varying intensity onto the scene, thus increasing the local discriminability at each pixel and facilitating matches. In addition, we make the most of the camera view areas by converging them at a volume of interest. Results indicate that we are able to extract depth data that are, on average, less than 1 mm in error at distances between 1.5 and 3.5 m from the cameras.

1 Introduction

2 The active 4-camera system

2.1 The principle of multibaseline stereo

2.2 Why use a verged camera configuration?

2.3 Video-rate image acquisition system

3 Camera calibration

4 Image rectification and depth recovery

4.1 Direct approach for depth recovery

4.2 A computationally more efficient approach for depth recovery

4.3 An approximate depth recovery approach

5 Experimental results

6 Observations on accuracy

7 Summary

1 Introduction

Binocular stereo vision is a simple and flexible method by which three-dimensional (range) information of a scene can be obtained. Therefore, it is not surprising to find that stereo is a very active area of research [2]. The geometrical issues in stereo have also been well explored [6]. The primary drawback of stereo is the problem with image point correspondence (for a survey of correspondence techniques, see [5]). The trade-off between accuracy (which is aided by a wide baseline, or separation between the cameras) and ease of correspondence (which is simpler with a narrow baseline) has been mitigated using multiple cameras or camera locations. Such an approach has been termed multibaseline stereo [12].

Stereo vision is computationally intensive. Fortunately, the spatially repetitive nature of depth recovery lends itself to parallelization. This is especially critical in the case of multibaseline stereo with high image resolution and the practical requirement of timely extraction of data. A number of researchers have worked on fast implementation of stereo (e.g., [11], [13], [14]).

In this report, we describe our depth recovery scheme, implemented on iWarp, for a four-camera multibaseline stereo system in a convergent configuration. Our system is capable of image capture at video rate. This is critical in applications that require tracking in three dimensions (an example is [10]). One method to obtain dense stereo depth data is to interpolate between reliable pixel matches [8]. However, the interpolated values may not be accurate. We obtain accurate dense depth data by projecting a light pattern of sinusoidally varying intensity onto the scene, thus increasing the local discriminability at each pixel. In addition, we make the most of the camera view areas by converging them at a volume of interest. Experiments have indicated that we are able to extract stereo depth data that are, on average, less than 1 mm in error at distances between 1.5 and 3.5 m from the cameras.

We introduce the notion of an active multibaseline stereo for extraction of dense stereo range data in Section 2. The principle of multibaseline stereo is explained, and in addition, we justify our use of the camera system in a convergent configuration. In that section, we also briefly describe our image acquisition system, which enables us to capture intensity images at video rate (30 Hz). Before the camera system can be used, it must be calibrated; this procedure is described in Section 3.

Prior to depth recovery, we apply a warping operation called image rectification to the set of images as a preprocessing step for computational reasons; this warping operation is described in Section 4. Our implementation of the depth recovery algorithm is detailed in the same section.

Finally, we present results of our experiments in Section 5, analyze the sources of error in our system in Section 6, and summarize our work in Section 7.

2 The active 4-camera system

Our multibaseline camera system is shown in Fig. 1. It comprises four cameras mounted on a plain metal bar, which in turn is mounted on a sturdy tripod stand; each camera can be rotated about a vertical axis and fixed at discrete positions along the bar. The four camera video signals are all synchronized by ganging the genlock signals.

In addition to the cameras, we use a projector to cast a pattern of sinusoidally varying intensity (active lighting) onto the scene. This notion of an active multibaseline stereo allows a denser depth map as a result of improved local scene discrimination and hence correspondence.

2.1 The principle of multibaseline stereo

In binocular stereo where the two camera axes are parallel, depth can easily be calculated given the disparity (the shift in position for corresponding points between the images). If the focal length of both cameras is f, the baseline b and disparity d, then the depth z is given by z=fb/d (Fig. 2).

In multibaseline stereo, more than two cameras or camera locations are employed, yielding multiple images with different baselines [12]. In the parallel configuration, each camera is laterally displaced from the others. From Fig. 2, d=fb/z (we assume for illustration that the cameras have identical focal lengths).

For a given depth, we then calculate the respective expected disparities relative to a reference camera (say, the leftmost camera) as well as the sum of match errors over all the cameras. (An example of a match error is the image difference of image patches centered at corresponding points.) By iterating the calculations over a given resolution and interval of depths, the depth associated with a given pixel in the reference camera is taken to be the one with the lowest error.
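The search described above can be sketched for a single scan line of a parallel camera arrangement. This is a minimal illustration, not the report's iWarp implementation: the function name, the sum-of-squared-differences match error, the window size, and the sign convention (other cameras displaced to the right of the reference) are all assumptions.

```python
import numpy as np

def multibaseline_depth(ref, others, baselines, f, depths, half_win=2):
    """Per-pixel depth along one scan line by minimizing the summed match error.

    ref       : 1-D intensity array from the reference (leftmost) camera
    others    : list of 1-D intensity arrays from the other cameras
    baselines : baseline of each other camera relative to the reference
    f         : focal length in pixels (assumed identical for all cameras)
    depths    : candidate depths to test
    """
    n = ref.size
    x = np.arange(n, dtype=float)
    kernel = np.ones(2 * half_win + 1)
    errors = np.empty((len(depths), n))
    for k, z in enumerate(depths):
        total = np.zeros(n)
        for img, b in zip(others, baselines):
            d = f * b / z                       # expected disparity d = f b / z
            shifted = np.interp(x - d, x, img)  # sample the other image at x - d
            # sum of squared differences over a window centered at each pixel
            total += np.convolve((ref - shifted) ** 2, kernel, mode="same")
        errors[k] = total
    # for every pixel, keep the candidate depth with the lowest summed error
    return np.asarray(depths)[np.argmin(errors, axis=0)]
```

Because the errors from all baselines are summed before the minimum is taken, a depth that matches well in only one image pair is penalized by the others, which is the source of the reduced mismatch rate.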

The multibaseline approach has the advantage of reducing mismatches during correspondence because multiple baselines are used simultaneously. In addition, it produces a statistically more accurate depth value [12]. However, using multiple cameras alone does not solve the problem of matching ambiguity that occurs with smooth untextured object surfaces in the scene. This is why using active lighting in the form of a projected pattern on the scene is important. The projected pattern on object surfaces in the scene helps in disambiguating local matches in the camera images.

2.2 Why use a verged camera configuration?

The primary problem associated with a stereo arrangement of parallel camera locations is the limited overlap between the fields of view of all the cameras. The percentage of overlap increases with depth. The primary advantage is the simple and direct formula for extracting depth.

The parallel camera configuration is suitable for outdoor applications where speed matters more than accuracy (e.g., [13]). At the shorter working distances of an indoor scene, however, the low percentage of overlap in the fields of view becomes a significant limitation.

Verging the cameras at a specific volume in space is optimal in an indoor application where maximum utility of the camera visual range is desired and the workspace size is constrained and known a priori. Such a configuration is illustrated in Fig. 3. One such application is the tracking of objects in the Assembly Plan from Observation project [9]. The aim of the project is to enable a robot system to observe a human perform a task, understand the task, and replicate that task using a robotic manipulator. By continuously monitoring the human hand motion, motion breakpoints such as the point of grasping and ungrasping an object can be extracted [10]. The verged multibaseline camera system can extend the capability of the system to tracking the object being manipulated by the human. For this purpose, we require fast image acquisition (though processing is not as critical) and accurate depth recovery.

2.3 Video-rate image acquisition system

Our image acquisition system consists of the physical camera setup described earlier in this section, the video interface board, and the 8x8 matrix of iWarp cells (Fig. 4). Each iWarp component contains a 20 MFLOPS computation engine and a low-latency (100-150 ns) communication engine for interfacing with other iWarp cells [3]. The existing iWarp system is an 8x8 torus of iWarp cells, half of which have 16 MB of DRAM per cell. The video interface, which is described in detail elsewhere [17], is connected directly to the iWarp cell through the memory interface; the digitized video data is routed and distributed at video rate to the DRAMs by taking advantage of iWarp's systolic design [4].

3 Camera calibration

Before data images can be taken and the scene depth recovered, we must first calibrate the camera configuration. Calibrating the camera configuration refers to the determination of the extrinsic (relative pose) and intrinsic (optic center offset, focal length and aspect ratio) camera parameters. The pinhole camera model is assumed in the calibration process. The origin of the verged camera configuration coincides with that of the leftmost camera.

A printed planar dot pattern arranged in a 7x7 equally spaced grid is used in calibrating the cameras; images of this pattern are taken at known depth positions (five in our case). An example set of images taken by the camera system is shown in Fig. 5.

The dots of the calibration pattern are detected using a star-shaped template with the weight distribution decreasing towards the center. The entire pattern is extracted and tracked from one camera to the next by imposing structural constraints of each dot relative to its neighbors, namely by determining the nearest and second nearest distances to another dot. This filters out wrong dot candidates, as shown in Fig. 6.
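The neighbor-distance constraint can be illustrated with a small sketch. It assumes that a valid dot on an equally spaced grid has its nearest and second-nearest neighbors both at roughly the grid spacing; the function name, the median-based spacing estimate, and the tolerance value are hypothetical and are not taken from the report.

```python
import numpy as np

def filter_dot_candidates(points, tol=0.25):
    """Keep candidates whose two nearest-neighbor distances match the grid spacing.

    points : (N, 2) array of candidate dot centers in pixels
    tol    : allowed relative deviation from the typical spacing (assumed value)
    """
    pts = np.asarray(points, dtype=float)
    diff = pts[:, None, :] - pts[None, :, :]
    dist = np.linalg.norm(diff, axis=2)
    np.fill_diagonal(dist, np.inf)          # ignore each point's distance to itself
    two_nearest = np.sort(dist, axis=1)[:, :2]
    spacing = np.median(two_nearest[:, 0])  # typical spacing, robust to a few outliers
    ok = np.all(np.abs(two_nearest - spacing) <= tol * spacing, axis=1)
    return pts[ok]
```

Under perspective the spacing varies across the image, so a practical detector would likely need a locally adaptive tolerance; the global median here is only for illustration.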

The simultaneous recovery of the camera parameters of all four cameras can be done using the non-linear least-squares technique described by Szeliski and Kang [16]. The inputs and outputs to this module are shown in the simplified diagram in Fig. 7. An alternative would be to use the pairwise-stereo calibration approach proposed by Faugeras and Toscani [7].

4 Image rectification and depth recovery

If two camera axes are not parallel, their associated epipolar lines are not parallel to the scan lines. This introduces extra computation to extract depth from stereo. To simplify and reduce the amount of computation, rectification can be carried out first. The process of rectification for a pair of images (given the camera parameters, either through direct or weak calibration) transforms the original pair of image planes to another pair such that the resulting epipolar lines are parallel and equal along the new scan lines. Rectification is depicted in Fig. 8. Here c1 and c2 are the camera optical centers, Π1 and Π2 the original image planes, and Ω1 and Ω2 the rectified image planes. The condition of parallel and equal epipolar lines necessitates planes Ω1 and Ω2 to lie in the same plane, indicated as Ω12. A point q is projected to image points v1 and v2 on the same scan line in the rectified planes.

A simple rectification method is described in [1]. However, the rectification process described there is a direct function of the locations of the camera optical centers. It is not apparent how the desirable properties of minimal distortion and maximal inclusion can be achieved with their formalism. We have modified their formalism to simplify the rectification mapping and adapt it to our situation.

Let the original 3x4 perspective transforms of two cameras be P1 and P2, where

The original perspective transform Pj is constructed from known camera parameters of the form

where the tilde (~) above the vector indicates its homogeneous representation. q is the 3D point, uj the image coordinate vector, fj the focal length, aj the aspect ratio, and Rj and tj the extrinsic camera parameters. It is easy to see that the camera axis vector is rj3, and in the camera image coordinate system, the x- and y-directions are along rj1 and rj2, respectively.

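Also, let M and N be the rectified perspective transforms of the first and second cameras, respectively, where

$$M = \begin{bmatrix} m_1^T & m_{14} \\ m_2^T & m_{24} \\ m_3^T & m_{34} \end{bmatrix}, \qquad N = \begin{bmatrix} n_1^T & n_{14} \\ n_2^T & n_{24} \\ n_3^T & n_{34} \end{bmatrix}.$$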

Since perspective matrices are defined up to a scale factor, we can set both m34 and n34 to be unity. Accordingly, based on the analysis in [1], m3 = n3, m2 = n2, m24 = n24, and from the constraint that c1 and c2 remain the optical centers,

Let d12 = c1 - c2. In a departure from [1], we choose the common rectified camera axis direction not only to be perpendicular to d12, but also to point in the direction between those of the unrectified camera axes (i.e., r13 and r23). This is done by first calculating

g = r13 + r23.

We then find the nearest vector perpendicular to d12:

Thus,

Determining m2 (and hence m24) is similar, with the additional constraint that

Finally, m1 is determined from the relation m1 = τ (m2 × m3).

τ (and hence m1 and m14) is calculated based on the constraint

n1 and n14 are calculated in the same way, using the counterpart values of P2.
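The choice of the common rectified axis direction can be sketched numerically. The snippet below computes g = r13 + r23, removes its component along d12 to obtain the nearest direction perpendicular to the baseline, and completes an orthonormal frame. It returns unit vectors only: the scale factors of the rows m1, m2, m3, the sign of the x-axis, and the function name are assumptions of this sketch, not the report's exact construction.

```python
import numpy as np

def rectified_axes(r13, r23, c1, c2):
    """Unit axes of a common rectified frame for two verged cameras.

    r13, r23 : optical-axis directions of the two unrectified cameras
    c1, c2   : optical centers of the two cameras
    """
    d12 = np.asarray(c1, float) - np.asarray(c2, float)
    d_hat = d12 / np.linalg.norm(d12)
    g = np.asarray(r13, float) + np.asarray(r23, float)
    # Remove the component of g along the baseline: the result is the vector
    # perpendicular to d12 that is closest to g.
    z_axis = g - np.dot(g, d_hat) * d_hat
    z_axis /= np.linalg.norm(z_axis)
    # x-axis along the baseline (assumed to point from c1 to c2);
    # the y-axis completes a right-handed frame, so x = y cross z.
    x_axis = -d_hat
    y_axis = np.cross(z_axis, x_axis)
    return x_axis, y_axis, z_axis
```

Placing the x-axis along the baseline is what makes the epipolar lines run along scan lines, and the relation x = y × z mirrors the form m1 = τ (m2 × m3) up to scale.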

As in [1], the homographies (or linear projective correspondences) that map the unrectified image coordinates to the rectified image coordinates are

where v1 = H1u1.

u1 and v1 are the homogeneous unrectified and rectified image coordinates, respectively, and

with v2 = H2u2. u2 and v2 are similarly defined.

To recover depth from multibaseline stereo (specifically a 4-camera system) in a convergent configuration, we first rectify pairs of images as shown in Fig. 9.

There are two schemes that allow us to recover depth. The first uses all the homographies between the unrectified images and rectified images (namely H11, H12, H13, H21, H32, and H43 in Fig. 10).

4.1 Direct approach for depth recovery

Subsequent to rectification, to recover depth, we first determine the corresponding location in the rectified image plane for the three pairs of cameras (Fig. 10). We wish to recover the 3D location q of the point corresponding to u0. q can be specified as q = c1 + λd, where c1 is the optical center of the first ("reference") camera, d is the unit vector in the direction from c1 to q, and λ is the depth of q from the reference camera optical center. If

then

since P1c1 = [0 0 0]T. So

i.e.,

from which we get

To find the disparity, Δj = x'j - xj, as a function of the projection transform elements, we first find the expressions for the rectified image coordinates (noting that yj = y'j):

Hence

By varying λ within a specified interval and resolution, we can calculate the Δj for the pairs of rectified images, and hence calculate the sum of matching errors (as in [13] with multiple parallel cameras). The depth is recovered by picking the value of λ associated with the least matching error.
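The geometry behind this sweep can be sketched numerically rather than in closed form. The snippet back-projects a reference pixel to the ray q = c1 + lambda d and projects points along that ray into another camera; evaluating a match error at each projected position and keeping the lambda with the smallest error would complete the search. The function names are hypothetical, and the sketch works with generic 3x4 projection matrices instead of the report's rectified-coordinate disparity expressions.

```python
import numpy as np

def pixel_ray(P1, u):
    """Optical center c1 and unit direction d of the ray through pixel u = (x, y)."""
    B, b = P1[:, :3], P1[:, 3]
    c1 = -np.linalg.solve(B, b)                 # the center satisfies P1 [c1; 1] = 0
    d = np.linalg.solve(B, np.array([u[0], u[1], 1.0]))
    return c1, d / np.linalg.norm(d)

def project(P, q):
    """Project the 3D point q with the 3x4 matrix P and dehomogenize."""
    w = P @ np.append(q, 1.0)
    return w[:2] / w[2]

def sweep_ray(P1, Pj, u, lambdas):
    """Image positions in camera j of the points q = c1 + lambda * d."""
    c1, d = pixel_ray(P1, u)
    return np.array([project(Pj, c1 + lam * d) for lam in lambdas])
```

Each candidate lambda costs one full projective mapping per camera, including the division by the third homogeneous coordinate, which is the expense that Section 4.2 sets out to reduce.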

4.2 A computationally more efficient approach for depth recovery

The method described above implies that we must calculate, at each point and for each depth, the corresponding points in all images. This requires projective transformations of all images to be performed for each depth value. There is a more computationally efficient way to recover depth. This stems from the following properties:

1. The two rectified planes fall on the same plane.
2. The line joining the two projection centers is parallel to this common plane.
Properties 1 & 2 (which are the necessary conditions for rectification) give rise to

3. The homography between the two rectified planes cannot be projective (since the scan lines on the rectified images are parallel, i.e., the corresponding rows at both rectified images are equal). This is true since the "projection" lines (the corresponding scan lines) meet at infinity.
From 3, the homography between rectified planes must then be at most a 2D affine transform, i.e., the last row of the homography matrix must be (0 0 1). This dispenses with the additional division by the z-component in calculating the corresponding matched point for a particular depth.

The scheme now follows that in Fig. 11. The matching is done using the homographies between rectified images K1, K2 and K3 (which we term rectified homographies). The rectified homographies can be readily determined as follows:

For a known depth plane (z = d), we can "contract" the 3x4 perspective matrix M (to the rectified plane) to a 3x3 homography G. For camera l, we have

where plj is the jth column of Ml and (ul, vl)T is the projected image point in camera l. Similarly, for camera m,

Since the rectified planes are coplanar, sl = sm; hence

Note that, due to rectification, vm = vl, and as explained earlier in this subsection, the bottom row of Klm is (0 0 1). In other words, the projective transformations are reduced to affine transformations, reducing the amount of computation.
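The affine property can be checked numerically. The sketch below contracts a 3x4 matrix to a 3x3 homography for the plane z = d by combining its third and fourth columns, then forms the mapping between two rectified images. The function names are hypothetical, and the construction assumes only what the text states: the two rectified matrices share their second and third rows.

```python
import numpy as np

def contract(M, depth):
    """3x3 homography G mapping plane coordinates (x, y, 1) on z = depth to the image."""
    return np.column_stack([M[:, 0], M[:, 1], depth * M[:, 2] + M[:, 3]])

def rectified_homography(M_l, M_m, depth):
    """K_lm that maps rectified image l to rectified image m for points on z = depth."""
    return contract(M_m, depth) @ np.linalg.inv(contract(M_l, depth))
```

Since the last two rows of the two contracted matrices coincide, the last two rows of K_lm are (0 1 0) and (0 0 1): rows are preserved and no division by the third component is needed when mapping a point.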

Depth recovery then proceeds in a similar manner as the direct approach described in the previous subsection.

4.3 An approximate depth recovery approach

In both approaches described earlier, for each depth, each pixel in the unrectified reference image has to be mapped Ncameras − 1 times to the respective rectified images (corresponding to the homographies H11, H12, and H13 in Fig. 11). We can work in the rectified image coordinates (say M1), but this still requires mapping from M2 to M1 and M3 to M1 in the collection of match errors for each depth value. This means that we need to perform (Ncameras − 2) × Ndepth sets of bilinear interpolations associated with image warping (where Ndepth is the number of depth values and Ncameras is the number of cameras).

In order to avoid the warping operations, we use an approximate depth recovery method. The matching is done with respect to the rectified image of the first pair. However, the rectified images N2 and N3 will not be row preserved relative to M1 (Fig. 12). We warp rectified images N2 and N3 so as to preserve the rows as much as possible, resulting in N'2 and N'3 (Fig. 12). The errors should be tolerably small as long as the vergence angles are small. In addition, this effect should not pose a significant problem as we are using a local windowing technique in calculating the match error.

By comparing Fig. 12 with Fig. 11, we can see that the mapping from M1 to N2 is given by the homography L12 = K13 H12 H11^{-1}. Similarly, the mapping from M1 to N3 is given by L13 = K14 H13 H11^{-1}. The matrices A2 and A3 are constructed such that

i.e., the resulting overall mapping is row preserving (r and c are the row and column respectively). In general, this would not be possible unless all the camera centers are collinear; however, this is a good approximation for small vergence angles and approximately aligned cameras. A2 and A3 are calculated from the following overconstrained relation using the pseudoinverse calculation:

where L1dmin is associated with the minimum depth and L1dmax with the maximum depth, cmin and cmax are the minimum and maximum values of the image column, and rmin and rmax are the minimum and maximum values of the image row, respectively. Xi (i=1,...,8) are don't-care values. The symbol | is used to represent matrix augmentation.

This algorithm has been implemented in parallel using the Fx (parallel Fortran) language developed at Carnegie Mellon [15]. Fx, a variant of High Performance Fortran with optimizations for high-communication applications like signal and image processing, runs on the Carnegie Mellon-Intel Corporation iWarp, the Paragon XP/S, the Cray T3D, and the IBM SP2. The experiments reported in this paper were done on the iWarp.

5 Experimental results

In this section, we present results of our active multibaseline stereo system. As mentioned before, a pattern of sinusoidally varying intensity is projected onto the scenes to facilitate image point correspondence.

An example of a set of images (Scene 1) and the extracted depth image are shown in Fig. 13 and Fig. 14, respectively. The large peaks at the borders of the depth map are outliers due to mismatches in the background outside the depth range of interest.

Another example (Scene 2) is shown in Fig. 15 with the recovered elevation map in Fig. 16. As can be seen from the elevation map, except at the edges of the objects on the scene, the data looks very reasonable.

For Scene 3 (Fig. 17), subsequent to depth recovery (Fig. 18), we fit the known models onto the range data using Wheeler and Ikeuchi's 3D template matching algorithm [18] to yield results seen in Fig. 19. Again the data looks very reasonable.

We have also performed some error analysis on some of the range data that were extracted from Scene 2. Fig. 20 shows the areas for planar fit; Table 1 shows the numerical results of the planar fit. As can be seen, the average planar fit error is smaller than 1 mm (the furthest planar patch is about 1.7 m away from the camera system). Fig. 21 depicts the error distribution of the resulting planar fit across the image (only on pixels on planar surfaces in the scene). The darker pixels are associated with lower absolute error in planar fitting.
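A planar fit error of this kind can be computed as follows. The report does not state which fitting method was used; this sketch assumes a total least-squares plane fit (via the singular value decomposition) and reports the mean absolute point-to-plane distance. The function name is hypothetical.

```python
import numpy as np

def plane_fit_error(points):
    """Fit a plane to (N, 3) points by total least squares; return mean |distance|."""
    pts = np.asarray(points, dtype=float)
    centroid = pts.mean(axis=0)
    # The plane normal is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(pts - centroid)
    normal = vt[-1]
    residuals = (pts - centroid) @ normal   # signed distance of each point to the plane
    return np.mean(np.abs(residuals))
```

With range data expressed in millimeters, the returned value is directly comparable to the sub-millimeter figures quoted for the planar patches.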

We have also obtained stereo range data of a cylinder of known cross-sectional radius and calculated the fit error. In both scenes (with different camera settings), the cylinder is placed about 3.3 m away from the camera system.

As can be seen from Table 2, the mean absolute error of fit is less than 1 mm.

6 Observations on accuracy

We have achieved better than one-millimeter accuracy. Here we informally characterize the remaining sources of error in our system.

There are a number of sources of error in our system and in stereo generally:

1. The use of an active multibaseline approach reduces the chance of false matches, but they can still occur.
2. The fundamental assumptions of stereo are that the texture being viewed is unique over the search window, and that the surface is visible to and lies at the same angle to all camera optical axes. The former assumption is addressed by the active component of our system, but the latter is not and cannot be, except by placing the cameras as close together as possible (which reduces accuracy). The failure of this assumption is particularly evident at the boundaries of objects, where it is the cause of significant error.
3. Errors are possible during calibration, since the position of our calibration plate is adjusted by hand (limiting its accuracy in positioning to about 1 mm), and the dot pattern positions are not always found precisely.
4. We use a pinhole camera model, which will result in errors near the edge of the image, particularly with short focal lengths.
5. We make the approximation discussed in Section 4.3, which will result in errors when the camera optical centers are not collinear.
Of these, only the first seems to be a cause of significant error (the second also causes large error, but we deliberately omit it from our error analysis since it is fundamental to stereo). All of the large errors (more than 1 mm) are observed to be in regions where the projected pattern does not provide sufficient texture for a correct match.

We have attempted to reduce these errors by analysis and experimentation. Analysis shows that a frequency-modulated sine wave pattern, as used here, is a good choice since it does not require large dynamic range (our iWarp video interface has manually adjustable gain and offset controls, leading us to limit the dynamic range to avoid clipping). Also, a randomly frequency-modulated sine wave gives the best possible result, since the same pattern occurs twice in the search area with vanishingly small probability, theoretically eliminating the possibility of false matches. Experiments with randomly modulated patterns have shown that

1. The lowest frequency of the sine wave (as seen in the image) must be high enough that its period is shorter than the width of the correlation match window.
2. The highest usable frequency is constrained by the resolution of the camera and the focus control of the projector. Using a frequency above this maximum results in a gray blur and many false matches.

The trade-off between these two constraints involves optimizing the projector placement and focus, the camera resolution, the number of cameras, and the camera dynamic range.
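For illustration, a randomly frequency-modulated sine profile with bounded local period might be generated as follows. The report does not describe how its projected pattern was produced, so the function name, the parameters, and the knot-interpolation scheme are all assumptions of this sketch.

```python
import numpy as np

def fm_sine_pattern(width, min_period, max_period, seed=0):
    """1-D intensity profile in [0, 1] whose local period varies randomly.

    min_period, max_period : bounds on the local period in pixels (assumed parameters)
    """
    rng = np.random.default_rng(seed)
    # Draw a random period at a few knots and interpolate so the frequency varies smoothly.
    knots = np.linspace(0, width - 1, num=max(2, width // max_period))
    periods = rng.uniform(min_period, max_period, size=knots.size)
    period = np.interp(np.arange(width), knots, periods)
    phase = 2 * np.pi * np.cumsum(1.0 / period)   # integrate the instantaneous frequency
    return 0.5 + 0.5 * np.sin(phase)
```

In this sketch, max_period would be chosen with the match window width in mind and min_period with the camera resolution and projector focus in mind, reflecting the two constraints listed above.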

In addition, many of the false matches occur where the limited dynamic range of our video interface plays a role, particularly with dark surfaces or surfaces that lie at an oblique angle to the projector (so that no pattern appears in the image), or surfaces with specularities (so that clipping overwhelms the pattern). In these cases, we believe careful adjustment of the projector, including use of multiple projectors (since there is no particular constraint between the projector and camera in active stereo, this is easy to do), can serve to reduce these effects. The use of multiple patterns, either time-sequenced (taking advantage of our system's ability to capture images at high speed) or color-sequenced (using color cameras), is also promising.

7 Summary

We have briefly described a 4-camera system that is capable of video-rate image acquisition. It uses a software distribution approach which takes advantage of iWarp's systolic design. The four cameras are used in a converging configuration for more effective use of the camera view spaces. In addition, to recover dense stereo range data from each set of images, we project a sinusoidally varying pattern onto the scene to enhance local intensity discriminability. This results in the notion of an active multibaseline stereo system.

We have also described in detail our implementation of the depth recovery algorithm which involves the preprocessing stage of image rectification. Our approximate depth recovery implementation was designed for reduced computation.

The results that we have obtained from this system indicate that the mean errors (discounting object border areas) are less than a millimeter at distances varying from 1.5 m to 3.5 m from the camera system. The performance of the system is thus comparable to a good structured light system, while allowing data to be captured at full video rate.

Active multibaseline stereo appears to be a promising addition to structured light imaging systems. It allows images to be captured at high speed and still have high spatial resolution. It allows great freedom in the relationship between the camera, the surface, and the light source, making it possible to manipulate these so as to get high accuracy in a wide variety of circumstances.

Acknowledgments
Many thanks to Bill Ross for his helpful information on setting up a multibaseline camera system. Tom Warfel built the real-time video interface board to the iWarp that made this project possible. We are also grateful to Luc Robert for interesting discussions on image rectification. Mark Wheeler helped in the analysis that led to the decision to verge the cameras.