A Sequential Factorization Method for Recovering Shape and Motion from Image Streams

Toshihiko Morita and Takeo Kanade

School of Computer Science

Carnegie Mellon University

Pittsburgh, PA 15213

This research is sponsored by the Department of the Army, Army Research Office under Grant No. DAAH04-94-G-0006.

Abstract

We present a sequential factorization method for recovering the three-dimensional shape of an object and the motion of the camera from a sequence of images, using tracked features. The factorization method originally proposed by Tomasi and Kanade produces robust and accurate results incorporating the singular value decomposition. However, it is still difficult to apply the method to real-time applications since it is based on a batch-type operation and the cost of the singular value decomposition is large. We develop the factorization method into a sequential method by regarding the feature positions as a vector time series. The new method produces estimates of shape and motion at each frame. The singular value decomposition is replaced with an updating computation of only three dominant eigenvectors, which can be performed in time, while the complete singular value decomposition requires operations for an matrix. Also, the method is able to handle infinite sequences since it does not store any increasingly large matrices. Experiments using synthetic and real images illustrate that the method has nearly the same accuracy and robustness as the original method.

1. Introduction

Recovering both the 3D shape of an object and the motion of the camera simultaneously from a stream of images is an important task and has wide applicability in many tasks such as navigation and robot manipulation. Tomasi and Kanade[1] first developed a factorization method to recover shape and motion under an orthographic projection model, and obtained robust and accurate results. Poelman and Kanade[2] have extended the factorization method to scaled-orthographic projection and paraperspective projection. This method closely approximates perspective projection in most practical situations so that it can deal with image sequences which contain perspective distortions.

Although the factorization method is a useful technique, its applicability is so far limited to off-line computations for the following reasons. First, the method is based on a batch-type computation; that is, it recovers shape and motion after all the input images are given. Second, the singular value decomposition, which is the most important procedure in the method, requires operations for features in frames. Finally, it needs to store a large measurement matrix whose size increases with the number of frames. These drawbacks make it difficult to apply the factorization method to real-time applications.

This report presents a sequential factorization method that considers the input to the system as a vector time series of feature positions. The method produces estimates of shape and motion at each input frame. A covariance-like matrix is stored instead of feature positions, and its size remains constant as the number of frames increases. The singular value decomposition is replaced with a computation, updating only three dominant eigenvectors, which can be performed in time. Consequently, the method becomes recursive.

We first briefly review the factorization method by Tomasi and Kanade. We then present our sequential factorization method in Section 3. The algorithm's performance is tested using synthetic data and real images in Section 4.

2. Theory of the Factorization Method: Review

2.1 Formalization

The input to the factorization method is a measurement matrix , representing image positions of tracked features over multiple frames. Assuming that there are features over frames, and letting be the image position of feature at frame , is a matrix such that

(1)
.

Each column of contains all the observations for a single point, while each row contains all the observed x-coordinates or y-coordinates for a single frame.

Suppose that the camera orientation at frame is represented by orthonormal vectors , , and , where corresponds to the x-axis of the image plane and to the y-axis. The vectors and are collected over frames into a motion matrix such that

(2)
.

Let be the location of feature in a fixed world coordinate system, whose origin is set at the center-of-mass of all the feature points. These vectors are then collected into a shape matrix such that

(3)
.

Note that

(4)
.

With this notation, the following equation holds by assuming an orthographic projection.

(5)

Tomasi and Kanade[1] pointed out the simple fact that the rank of is at most 3 since it is the product of the motion matrix and the shape matrix . Based on this rank theory, they developed a factorization method that robustly recovers the matrices and from .

2.2 Subspace Computation

The actual procedure of the factorization method consists of two steps. First, the measurement matrix is factorized into two matrices of rank 3 using the singular value decomposition. Assume, without loss of generality, that . By computing the singular value decomposition of , we can obtain orthogonal matrices and such that

(6)
,

where = diag() and . In reality, the rank of W is not exactly 3, but approximately 3. is made from the first three columns of the left singular matrix of . Likewise, consists of the first three singular values and is made from the first three rows of the right singular matrix. By setting

(7)
and

we can factorize into

(8)
,

where the product is the best possible rank three approximation to .

It is well known that the left singular vectors span the column space of while the right singular vectors span its row space. The span of , namely motion space, determines the motion, and the span of , namely shape space, determines the shape. The rank theory claims that the dimension of each subspace is at most three, and the first step of the factorization method finds those subspaces in the high dimensional input spaces. Both spaces are said to be dual in the sense that one of them can be computed from the other. This observation helps us to further develop the sequential factorization method.

2.3 Metric Transformation

The decomposition of equation (8) is not completely unique: it is unique only up to an affine transformation. The second step of the method is necessary to find a non-singular matrix , which transforms and into the true solutions and as follows.

(9)
(10)

Observing that rows and of must satisfy the normalization constraints,

(11)
and ,

we obtain the system of overdetermined equations such that

(12)

where is a symmetric matrix

(13)

and, and are the rows of . By denoting , , and

(14)
,

the system (12) can be rewritten as

(15)
,

where , , and are defined by

(16)
, , ,

and

(17)

The simplest solution of the system is given by the pseudo-inverse method such that

(18)
.

The vector determines the symmetric matrix , whose eigendecomposition gives . As a result, the motion and the shape are derived according to equations (9) and (10).

The matrix is an affine transform which transforms into in the motion space, while the matrix transforms into in the shape space. Obtaining this transform is the main purpose of the second step of the factorization method, which we call metric transformation.

3. A Sequential Factorization Method

3.1 Overview

In the original factorization method, there was one measurement matrix containing tracked feature positions throughout the image sequence. After all the input images are given and the feature positions are collected into the matrix , the motion and shape are then computed. In real-time applications, however, it is not feasible to use this batch-type scheme. It is more desirable to obtain an estimate at each moment sequentially. The input to the system must be viewed as a vector time series. At frame , two vectors containing feature positions such that

(19)
and

are given. Immediately after receiving these vectors, the system must compute the estimates of the camera coordinates , and the shape at that frame. At the next frame, new samples and arrive and new camera coordinates and are to be computed as well as an updated shape estimate .

The key to developing such a sequential method is to observe that the shape does not change over time. The shape space is stationary, and thus, it should be possible to derive from without performing expensive computations.

More specifically, we store the feature vectors and in a covariance-type matrix defined recursively by

(20)
.

As shown later, the rank of is at most three and its three dominant eigenvectors span the shape space. Once is obtained, the camera coordinates at frame can be computed simply by multiplying the feature vectors and the eigenvectors as follows.

(21)
,

This framework makes it possible to estimate camera coordinates immediately after receiving feature vectors at each frame. All information obtained up to the frame is accumulated in and used to produce the estimates at that frame.

In equation (20), the size of is fixed to , which only depends on the number of feature points. Therefore, the algorithm does not need to store any matrices whose sizes increase over time.

The computational effort in the original factorization method is dominated by the cost of the singular value decomposition. In the framework above, we need to compute eigenvectors of . Note that, however, we only need the first three dominant eigenvectors. Fortunately, several methods exist to compute only the dominant eigenvectors with much less computation necessary to compute all the eigenvectors. Before describing the details of our algorithm, we briefly review these techniques in the following section.

3.2 Iterative Eigenvector Computation

Among the existing methods which can compute dominant eigenvectors of a square matrix, we introduce two methods, the power method and orthogonal iteration[3]. The power method is the simplest, which computes the most dominant eigenvector, i.e., an eigenvector associated with the largest eigenvalue. It provides the starting point for most other techniques, and is easy to understand. The method of orthogonal iteration, which we adopt in our method, is able to compute several dominant eigenvectors.

3.2.1 Power Method

Assume that we want to compute the most dominant eigenvectors of an matrix . Given a unit 2-norm , the power method iteratively computes a sequence of vectors :

for


end

The second step of the iteration is simply a normalization that prevents from becoming very large or very small. The vectors generated by the iteration converge to the most dominant eigenvector of . To examine the convergence property of the power method, suppose that is diagonalizable. That is, with an orthogonal matrix , and . If

(22)

and , then it follows that

(23)

where is a constant. Since , equation (23) shows that the vectors point more and more accurately toward the direction of the dominant eigenvector , and the convergence factor is the ratio .

3.2.2 Orthogonal Iteration

A straightforward generalization of the power method can be used to compute several dominant eigenvectors of a symmetric matrix. Assume that we want to compute dominant eigenvectors of a symmetric matrix , where . Starting with an matrix with orthonormal columns, the method of orthogonal iteration generates a sequence of matrices :

for

(QR factorization)
end

The second step of the above iteration is the QR factorization of , where is an orthogonal matrix and is an upper triangular matrix. The QR factorization can be achieved by the Gram-Schmidt process. This step is viewed as a normalization process that is similar to the normalization used in the power method.

Suppose that is the eigendecomposition of with an orthogonal matrix , and . It has been shown in [3] that the subspace generated by the iteration converges to at a rate proportional to , i.e.,

(24)
,

where and is a constant. The function dist represents the subspace distance defined by

(25)

The method offers an attractive alternative to the singular value decomposition in situations where is a large matrix and a few of its largest eigenvalues are needed. In our case, these conditions clearly hold. In addition, the rank theory of the factorization method[1] guarantees that the ratio is very small, and as a result, we should achieve fast convergence for computing the first three eigenvectors.

3.3 Sequential Factorization Algorithm

As in the original method, the sequential factorization method consists of two steps, sequential shape space computation and sequential metric transformation.

3.3.1 A Sequential Shape Space Computation

In the sequential factorization method, we consider the feature vectors, and , as a vector time series. Let us denote the measurement matrix in the original factorization method at frame by . Then, it grows in the following manner:

(26)
, , ....., , .....

Now let us define a matrix time series by

(27)
.

From the definition, it follows that

(28)
.

Since the rank of is at most three, the rank of is also at most three. If

(29)

is the singular value decomposition of , where and are orthogonal matrices, and , then

(30)
.

This means the eigenvectors of are equivalent to the right singular vectors of . Hence, it is possible to obtain the shape space by computing the eigenvectors of .

To compute , we combine orthogonal iteration with updating by equation (27). Given a matrix with orthonormal columns and a null matrix , the following algorithm generates a sequence of matrices :

[Algorithm (1)] for f = 1, 2, ....
(1)
(2)
(3) (QR factorization)
end

The index corresponds to the frame number and each iteration is performed frame by frame. The matrix generated by the algorithm is expected to converge to the eigenvectors of . While the original orthogonal iteration works with a fixed matrix, the above algorithm works with the matrix , which varies from iteration to iteration incorporating new features. In other words, the sequential factorization method folds the update of into the orthogonal iteration. If the randomly changes over time, no convergence is expected to appear. However, it can be shown that

(31)
, for all .

Therefore, is stationary and converges to as in the orthogonal iteration. Even when noise exists, if the noise is uncorrelated or the noise space is orthogonal to the signal space , then is still equal to and the convergence can be shown. The following convergence rate of the algorithm is deduced from the convergence rate of the orthogonal iteration.

(32)

3.3.2 Stationary Basis for the Shape Space

Algorithm (1) presented in the previous section produces the matrix , which converges to the matrix that spans the shape space. The true shape and motion are determined from the shape space by a metric transformation. It is not straightforward at this point, however, to apply the metric transformation sequentially. The problem is that, even though is stationary, the matrix itself changes as the number of frames increases. This is due to the nature of singular vectors. They are the basis for the row and column subspaces of a matrix, and the singular value decomposition chooses them in a special way. They are more than just orthonormal. As a result, they rotate in the 3D subspace . Recall that the matrix obtained in metric transformation (9) is a transform from (or ) to in the subspace . Since changes at each frame, also changes. Consequently, the matrix also changes frame by frame.

For clarity, let us denote an matrix at frame as . The fact that changes at each frame makes it difficult to estimate iteratively and efficiently. Thus we need to add an additional process to obtain stationary basis for the shape space to update matrix .

Let us define a projection matrix onto the by

(33)
,

where is the output from Algorithm (1). Needless to say, the rank of is at most three. Since (=) is stationary, the projection matrix must be stationary. It is thus possible to obtain the stationary basis for the shape space by replacing with the eigenvectors of .

An iterative method similar to Algorithm (1) can be used to reduce the amount of computation. Given a matrix with orthonormal columns, the iterative method below generates a matrix , which provides the stationary basis for the shape space.

[Algorithm (2)] for f = 1, 2, ....


(QR factorization)
end

3.3.3 Sequential Metric Transformation

In the previous section, we derived the shape space in terms of . Once is obtained, it is possible to compute camera coordinates and as

(34)
,

These coordinates are used to solve the overdetermined equations (12) and the true camera coordinates are recovered in the same way as in the original method. Doing so, however, requires storing all the coordinates and , the number of which may be very large. Instead, we use the following sequential algorithm.

[Algorithm (3)] for f = 1, 2, ....
,


end

Let and be the matrices and at frame , where and are defined in Section 2.3 From the definition, it follows that

(35)
(36)
.

Assigning equations (35) and (36) to equation (18), we have

(37)

which gives the symmetric matrix . The eigendecomposition of yields the affine transform and, as a result, the camera coordinates and the shape are obtained as follows:

(38)
(39)

Algorithm (3) followed by equations (37), (38), and (39) completes the sequential method. The size of matrices and are fixed to and , and the method does not store any matrices that grow, even in the sequential metric transformation.

4. Experiments

4.1 Synthetic Data

In this section we compare the accuracy of our sequential factorization method with that of the original factorization method. Since both methods are essentially based on the rank theory, we do not expect any difference in the results. Our purpose here is to confirm that the sequential method has the same accuracy of shape and motion recovery as the original method.

4.1.1 Data Generation

The object in this experiment consists of 100 random feature points. The sequences are created using a perspective projection of those points. The image coordinates of each point are perturbed by adding Gaussian noise, which we assume to simulate tracking error and image noise. The standard deviation of the Gaussian noise is set to two pixels of a pixel image. The distance of the object center from the camera is fixed to ten times the object size. The focal length is chosen so that the projection of the object covers the whole image. The camera is rotated as shown in Figure 1, while the object is translated to keep its image at the image center. Quantization errors are not added since we assume that we are able to track features with a subpixel resolution.

When discussing the accuracy of the sequential method, one needs to consider its dynamic property regarding the 3D recovery. The accuracy of the recovery at a particular frame by the sequential method depends on the total amount of motion up to that time, since the recovery is made only from the information obtained up to that time. At the beginning of an image sequence, for example, the motion is generally small, so high accuracy can not be expected. The accuracy generally improves as the motion becomes larger. The original method does not have this dynamic property, since it is based on a batch-type scheme and uses all the information throughout the sequence.

In order to compare both methods under the same conditions, we perform the following computations beforehand. First, we form a submatrix , which only contains the feature positions up to frame . The original factorization is applied to the submatrix, then the results are kept as solutions at frame . They are the best estimates given by the original method. Repeating this process for each frame, we derive the best estimates, with which our results are compared.

4.1.2 Accuracy of the Sequential Shape Space Computation

We first discuss the convergence property of the sequential shape space computation. The sequential factorization method starts with Algorithm (1) in Section 3.3.1, iteratively generating the matrix which is an estimate for the true shape space . Let us represent the estimation error with respect to the true shape space by

(40)

Recall that the function dist provides a notion of difference between two spaces. On the other hand, the original method produces the best estimate for the shape space by computing the right singular vectors of the submatrix , and its error with respect to the true shape space is also represented by

(41)

Comparing both errors, Figure 2 shows that they are almost identical. That is, the errors given by the sequential method are almost equal to those given by the original method.

At the beginning of the sequence, the amount of motion is small and both errors are relatively large. The ratio of the 4th to 3rd singular values, shown in Figure 3, also indicates that it is difficult to achieve good accuracy at the beginning. Both errors, however, quickly become smaller as the camera motion becomes larger. After about the 20th frame, constant errors of are observed in this experiment.

The solutions given by the two methods are so close that the graphs are completely overlapped. Thus, we also plot their difference defined by

(42)

in Figure 4. Although is relatively large at the beginning, it quickly becomes very small. In fact, after about the 30th frame, is less than , while and are both .

Figure 1: True camera motion

The camera roll, pitch, and yaw are varied as shown in this figure. The sequence consists of 150 frames.

Figure 2: Shape space errors

Shape space estimation errors by the sequential method (solid line) and the original method (dashed line) with respect to the true shape space. The errors are defined by subspace distance and plotted logarithmically.

The ratio of the 4th to 3rd singular values, that is .

Figure 4: Difference of shape space errors

The difference of the estimates by the sequential and original methods, versus the frame number. The difference is plotted logarithmically.

4.1.3 Accuracy of the Motion and Shape Recovery

The three plots of Figure 5 show errors in roll, pitch, and yaw in the recovered motion: the solid lines correspond to the sequential method, the dotted lines to the original method. The difference in motion errors between the original and sequential methods is quite small.

Both results are unstable for a short period at the beginning of the sequence. After that, they show two kinds of errors: random and structural. Random errors are due to Gaussian noise added to the feature positions. Structural errors are due to perspective distortion, and relate to the motion patterns. The structural errors show a negative peak at about the 60th frame and are almost constant between the 90th and 120th frames. Note the pattern corresponds to the motion pattern shown in Figure 1.

Of course, these intrinsic errors cannot be eliminated in the sequential method. The point to observe is that the differences between the two solutions are sufficiently smaller than the intrinsic errors.

Shape errors which are compared in Figure 6 also indicate the same results. Again, the differences between the two methods are quite small compared to the intrinsic errors which the original method possesses. Note that no Gaussian noise appears in the shape errors since they are averaged over all the feature points.

We conclude from these results that the sequential method is nearly as accurate as the original method except that some extra frames are required to converge.

4.2 Real Images

Experiments were performed on two sets of real images. The first set is an image sequence of a satellite rotating in space. Another experiment uses a long video recording (764 images) of a house taken with a hand-held camera. These experiments demonstrate the applicability of the sequential factorization method in real situations. In both experiments, features are selected and tracked using the method presented by Tomasi and Kanade[1].

4.2.1 Satellite Images

Figure 5: Motion errors

Errors of recovered camera roll (top), pitch (middle), and yaw (bottom). The errors given by the sequential method are plotted with solid lines, while the errors given by the original method are plotted with dotted lines.

This figure compares the shape errors given by the two method. The errors given by the sequential method are plotted with solid lines, while the errors given by the original method are plotted with dotted lines. The errors are computed as the root-mean-square errors of the recovered shape with respect to the true shape, at each frame.

The first frame of the satellite image sequence. The superimposed squares indicate the selected features.

Recovered camera roll (solid line), pitch (dashed line), and yaw (dotted line) for the satellite image sequence.

A side view of the recovered shape of the satellite. The features on the solar panel are shown with opaque squares and others with filled squares. Notice that the features on the solar panel correctly lie in a single plane.

The first frame of the house image sequence. The superimposed squares indicate the selected features.

Recovered camera roll (solid line), pitch (dashed line), and yaw (dotted line) for the house image sequence.

A view of the recovered shape of the house from above. The features on the two side walls are correctly recovered.

The processing time of the sequential method on a Sun4/10 (solid line) compared with that of the original method (dotted line), as a function of the number of features which is varied from 10 to 500. The number of frames is fixed at 120.

 

Figure 6: Shape error

Figure 7 shows an image of the satellite with selected features indicated by small squares. The image sequence was digitized from a video recording[4] actually taken by a space shuttle astronaut. The feature tracker automatically selected and tracked 32 features throughout the sequence of 101 images. Of these, five features on the astronaut maneuvering around the satellite were manually eliminated because they had a different motion. Thus, the remaining 27 features were processed. Figure 8 shows the recovered motion in terms of roll, pitch, and yaw. The side view of the recovered shape is displayed in Figure 9, where the features on the solar panel are marked with opaque squares and others with filled squares. No ground-truth is available for the shape or the motion in this experiment. Yet, it appears that the solutions are satisfactory, since the features on the solar panel almost lie in a single line in the side view.

4.2.2 House Images

Figure 10 shows the first image of the sequence used in the second experiment. Using a hand-held camera, one of the authors took this sequence while walking. It consists of 764 images which correspond to about 25 seconds. The feature tracker detected and tracked 62 features. The recovered motion and shape are shown in Figures 11 and 12. It is clearly seen that the shape is qualitatively correct. It is also reasonable to observe that only the camera yaw is increasing, because the camera is moving parallel to the ground. In addition, note that the computed roll motion reveals the pace of the recorder's steps, which is about 1 step per second.

Further evaluation of accuracy in these experiments is difficult. However, this qualitative analysis of the results with real images, and quantitative analysis of the results with synthetic data essentially shows that the sequential method works as well with real images as the original batch method.

4.3 Computational Time

Finally, we compare the processing time of the sequential method with the original method. The computational complexity of the original method is dominated by the cost of the singular value decomposition, which needs computations for a measurement matrix with [5]. Note that corresponds to the number of frames and to the number of features. On the other hand, the complexity of the sequential method is . Computing the solution for frame F, therefore, takes only using the sequential method, while the original method would require operations.

Figure 7: An image of a satellite

Figure 8: Recovered motion of satellite

Figure 10: An image of a house

Figure 11: Recovered motion of house

Figure 12: Top view of the recovered shape

Figure 13: Processing time

Figure 13 shows the actual processing time of the sequential method on a Sun4/10 compared together with that of the original method. The number of features varied from 10 to 500, while the number of frames was fixed at 120. The processing time for selecting and tracking features was not included. The singular value decomposition of the original method is based on a routine found in [6]. The results sufficiently agree with our analysis above. In addition, when the number of features is less than 40, the sequential method can be run in less than 1/30 of a second, which enables video-rate processing on a Sun4/10.

5. Conclusions

We have presented the sequential factorization method, which provides estimates of shape and motion at each frame from a sequence of images. The method produces as accurate and robust results as the original method, while significantly reducing the computational complexity. The reduction in complexity is important for applying the factorization method to real-time applications. Furthermore, the method does not require storing any growing matrices so that its implementation in VLSI or DSP is feasible.

Faster convergence in the shape space computation could be achieved using more sophisticated algorithms such as the orthogonal iteration with Ritz acceleration[3] instead of the basic orthogonal iteration. Also, it is possible to use scaled orthographic projection or paraperspective projection[2] to improve the accuracy of the sequential factorization method.

Acknowledgments

The authors wish to thank Conrad J. Poelman and Richard Madison for their helpful comments.

References

  1. Carlo Tomasi and Takeo Kanade, Shape and motion from image streams: a factorization method, Technical Report CMU-CS-91-172, Carnegie Mellon University, 1991. Later published as Carlo Tomasi and Takeo Kanade, "Shape and Motion from Image Streams Under Orthography: A Factorization Method," International Journal of Computer Vision, Vol. 9, No. 2, pp. 137-154, November 1992.

  2. Conrad J. Poelman and Takeo Kanade, A paraperspective factorization method for shape and motion recovery, Technical Report CMU-CS-92-208, Carnegie Mellon University, 1992. Revised and superceded as Technical Report CMU-CS-93-219, Carnegie Mellon University, 1993. Part of the latter technical report is presented at and included in the Proceedings of the Third European Conference on Computer Vision, Vol I, pp. 97-110, Stockholm, Sweden, May 1994.

  3. Gene H. Golub and Charles F. Van Loan, Matrix computations, second edition, The Johns Hopkins University Press, 1989.

  4. Satellite rescue in space: highlights of shuttle flights 41C & 51A, #V41, The Holiday Video Library, Finley-Holiday Film Corporation.

  5. Pierre Comon and Gene H. Golub, Tracking a few extreme singular values and vectors in signal processing, Proceedings of the IEEE, Vol. 78, no. 8, pp. 1327-1343, 1990.

  6. William H. Press, Brian P. Flannery, Saul A. Teukolsky, and William T. Vetterling, Numerical recipes in C: the art of scientific computing, Cambridge University Press, 1988.