Toshihiko Morita and Takeo Kanade
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213
This research is sponsored by the Department of the Army, Army Research Office under Grant No. DAAH04-94-G-0006.
We present a sequential factorization method for recovering the three-dimensional shape of an object and the motion of the camera from a sequence of images, using tracked features. The factorization method originally proposed by Tomasi and Kanade produces robust and accurate results incorporating the singular value decomposition. However, it is still difficult to apply the method to real-time applications since it is based on a batch-type operation and the cost of the singular value decomposition is large. We develop the factorization method into a sequential method by regarding the feature positions as a vector time series. The new method produces estimates of shape and motion at each frame. The singular value decomposition is replaced with an updating computation of only three dominant eigenvectors, which can be performed in
time, while the complete singular value decomposition requires
operations for an
matrix. Also, the method is able to handle infinite sequences since it does not store any increasingly large matrices. Experiments using synthetic and real images illustrate that the method has nearly the same accuracy and robustness as the original method.
Recovering both the 3D shape of an object and the motion of the camera simultaneously from a stream of images is an important task and has wide applicability in many tasks such as navigation and robot manipulation. Tomasi and Kanade[1] first developed a factorization method to recover shape and motion under an orthographic projection model, and obtained robust and accurate results. Poelman and Kanade[2] have extended the factorization method to scaled-orthographic projection and paraperspective projection. This method closely approximates perspective projection in most practical situations so that it can deal with image sequences which contain perspective distortions.
Although the factorization method is a useful technique, its applicability is so far limited to off-line computations for the following reasons. First, the method is based on a batch-type computation; that is, it recovers shape and motion after all the input images are given. Second, the singular value decomposition, which is the most important procedure in the method, requires
operations for
features in
frames. Finally, it needs to store a large measurement matrix whose size increases with the number of frames. These drawbacks make it difficult to apply the factorization method to real-time applications.
This report presents a sequential factorization method that considers the input to the system as a vector time series of feature positions. The method produces estimates of shape and motion at each input frame. A covariance-like matrix is stored instead of feature positions, and its size remains constant as the number of frames increases. The singular value decomposition is replaced with a computation, updating only three dominant eigenvectors, which can be performed in
time. Consequently, the method becomes recursive.
We first briefly review the factorization method by Tomasi and Kanade. We then present our sequential factorization method in Section 3. The algorithm's performance is tested using synthetic data and real images in Section 4.
The input to the factorization method is a measurement matrix
, representing image positions of tracked features over multiple frames. Assuming that there are
features over
frames, and letting
be the image position of feature
at frame
,
is a
matrix such that
.
Each column of
contains all the observations for a single point, while each row contains all the observed x-coordinates or y-coordinates for a single frame.
Suppose that the camera orientation at frame
is represented by orthonormal vectors
,
, and
, where
corresponds to the x-axis of the image plane and
to the y-axis. The vectors
and
are collected over
frames into a motion matrix
such that
.
Let
be the location of feature
in a fixed world coordinate system, whose origin is set at the center-of-mass of all the feature points. These vectors are then collected into a shape matrix
such that
.
Note that
.
With this notation, the following equation holds by assuming an orthographic projection.
Tomasi and Kanade[1] pointed out the simple fact that the rank of
is at most 3 since it is the product of the
motion matrix
and the
shape matrix
. Based on this rank theory, they developed a factorization method that robustly recovers the matrices
and
from
.
The actual procedure of the factorization method consists of two steps. First, the measurement matrix is factorized into two matrices of rank 3 using the singular value decomposition. Assume, without loss of generality, that
. By computing the singular value decomposition of
, we can obtain orthogonal matrices
and
such that
,
where
= diag(
) and
. In reality, the rank of W is not exactly 3, but approximately 3.
is made from the first three columns of the left singular matrix of
. Likewise,
consists of the first three singular values and
is made from the first three rows of the right singular matrix. By setting
and
,
where the product
is the best possible rank three approximation to
.
It is well known that the left singular vectors
span the column space of
while the right singular vectors
span its row space. The span of
, namely motion space, determines the motion, and the span of
, namely shape space, determines the shape. The rank theory claims that the dimension of each subspace is at most three, and the first step of the factorization method finds those subspaces in the high dimensional input spaces. Both spaces are said to be dual in the sense that one of them can be computed from the other. This observation helps us to further develop the sequential factorization method.
The decomposition of equation (8) is not completely unique: it is unique only up to an affine transformation. The second step of the method is necessary to find a
non-singular matrix
, which transforms
and
into the true solutions
and
as follows.
Observing that rows
and
of
must satisfy the normalization constraints,
and
,
we obtain the system of
overdetermined equations such that
and,
and
are the rows of
. By denoting
,
, and
,
the system (12) can be rewritten as
,
,
,
,
and
The simplest solution of the system is given by the pseudo-inverse method such that
.
The vector
determines the symmetric matrix
, whose eigendecomposition gives
. As a result, the motion
and the shape
are derived according to equations (9) and (10).
The matrix
is an affine transform which transforms
into
in the motion space, while the matrix
transforms
into
in the shape space. Obtaining this transform is the main purpose of the second step of the factorization method, which we call metric transformation.
In the original factorization method, there was one measurement matrix
containing tracked feature positions throughout the image sequence. After all the input images are given and the feature positions are collected into the matrix
, the motion and shape are then computed. In real-time applications, however, it is not feasible to use this batch-type scheme. It is more desirable to obtain an estimate at each moment sequentially. The input to the system must be viewed as a vector time series. At frame
, two vectors containing feature positions such that
and
are given. Immediately after receiving these vectors, the system must compute the estimates of the camera coordinates
,
and the shape
at that frame. At the next frame, new samples
and
arrive and new camera coordinates
and
are to be computed as well as an updated shape estimate
.
The key to developing such a sequential method is to observe that the shape does not change over time. The shape space is stationary, and thus, it should be possible to derive
from
without performing expensive computations.
More specifically, we store the feature vectors
and
in a covariance-type matrix
defined recursively by
.
As shown later, the rank of
is at most three and its three dominant eigenvectors
span the shape space. Once
is obtained, the camera coordinates at frame
can be computed simply by multiplying the feature vectors and the eigenvectors as follows.
,
This framework makes it possible to estimate camera coordinates immediately after receiving feature vectors at each frame. All information obtained up to the frame is accumulated in
and used to produce the estimates at that frame.
In equation (20), the size of
is fixed to
, which only depends on the number of feature points. Therefore, the algorithm does not need to store any matrices whose sizes increase over time.
The computational effort in the original factorization method is dominated by the cost of the singular value decomposition. In the framework above, we need to compute eigenvectors of
. Note that, however, we only need the first three dominant eigenvectors. Fortunately, several methods exist to compute only the dominant eigenvectors with much less computation necessary to compute all the eigenvectors. Before describing the details of our algorithm, we briefly review these techniques in the following section.
Among the existing methods which can compute dominant eigenvectors of a square matrix, we introduce two methods, the power method and orthogonal iteration[3]. The power method is the simplest, which computes the most dominant eigenvector, i.e., an eigenvector associated with the largest eigenvalue. It provides the starting point for most other techniques, and is easy to understand. The method of orthogonal iteration, which we adopt in our method, is able to compute several dominant eigenvectors.
Assume that we want to compute the most dominant eigenvectors of an
matrix
. Given a unit 2-norm
, the power method iteratively computes a sequence of vectors
:
The second step of the iteration is simply a normalization that prevents
from becoming very large or very small. The vectors
generated by the iteration converge to the most dominant eigenvector of
. To examine the convergence property of the power method, suppose that
is diagonalizable. That is,
with an orthogonal matrix
, and
. If
where
is a constant. Since
, equation (23) shows that the vectors
point more and more accurately toward the direction of the dominant eigenvector
, and the convergence factor is the ratio
.
A straightforward generalization of the power method can be used to compute several dominant eigenvectors of a symmetric matrix. Assume that we want to compute
dominant eigenvectors of a symmetric matrix
, where
. Starting with an
matrix
with orthonormal columns, the method of orthogonal iteration generates a sequence of matrices :
(QR factorization)
The second step of the above iteration is the QR factorization of
, where
is an orthogonal matrix and
is an upper triangular matrix. The QR factorization can be achieved by the Gram-Schmidt process. This step is viewed as a normalization process that is similar to the normalization used in the power method.
Suppose that
is the eigendecomposition of
with an orthogonal matrix
, and
. It has been shown in [3] that the subspace
generated by the iteration converges to
at a rate proportional to
, i.e.,
,
where
and
is a constant. The function dist represents the subspace distance defined by
The method offers an attractive alternative to the singular value decomposition in situations where
is a large matrix and a few of its largest eigenvalues are needed. In our case, these conditions clearly hold. In addition, the rank theory of the factorization method[1] guarantees that the ratio
is very small, and as a result, we should achieve fast convergence for computing the first three eigenvectors.
As in the original method, the sequential factorization method consists of two steps, sequential shape space computation and sequential metric transformation.
In the sequential factorization method, we consider the feature vectors,
and
, as a vector time series. Let us denote the measurement matrix in the original factorization method at frame
by
. Then, it grows in the following manner:
,
, .....,
, .....
Now let us define a matrix time series
by
.
From the definition, it follows that
.
Since the rank of
is at most three, the rank of
is also at most three. If
is the singular value decomposition of
, where
and
are orthogonal matrices, and
, then
.
This means the eigenvectors of
are equivalent to the right singular vectors
of
. Hence, it is possible to obtain the shape space by computing the eigenvectors of
.
To compute
, we combine orthogonal iteration with updating by equation (27). Given a
matrix
with orthonormal columns and a null matrix
, the following algorithm generates a sequence of matrices
:
(QR factorization)
The index
corresponds to the frame number and each iteration is performed frame by frame. The matrix
generated by the algorithm is expected to converge to the eigenvectors
of
. While the original orthogonal iteration works with a fixed matrix, the above algorithm works with the matrix
, which varies from iteration to iteration incorporating new features. In other words, the sequential factorization method folds the update of
into the orthogonal iteration. If the
randomly changes over time, no convergence is expected to appear. However, it can be shown that
, for all
.
Therefore,
is stationary and
converges to
as in the orthogonal iteration. Even when noise exists, if the noise is uncorrelated or the noise space is orthogonal to the signal space
, then
is still equal to
and the convergence can be shown. The following convergence rate of the algorithm is deduced from the convergence rate of the orthogonal iteration.
Algorithm (1) presented in the previous section produces the matrix
, which converges to the matrix
that spans the shape space. The true shape and motion are determined from the shape space by a metric transformation. It is not straightforward at this point, however, to apply the metric transformation sequentially. The problem is that, even though
is stationary, the matrix
itself changes as the number of frames increases. This is due to the nature of singular vectors. They are the basis for the row and column subspaces of a matrix, and the singular value decomposition chooses them in a special way. They are more than just orthonormal. As a result, they rotate in the 3D subspace
. Recall that the matrix
obtained in metric transformation (9) is a transform from
(or
) to
in the subspace
. Since
changes at each frame,
also changes. Consequently, the matrix
also changes frame by frame.
For clarity, let us denote an
matrix at frame
as
. The fact that
changes at each frame makes it difficult to estimate
iteratively and efficiently. Thus we need to add an additional process to obtain stationary basis for the shape space to update matrix
.
Let us define a projection matrix
onto the
by
,
where
is the output from Algorithm (1). Needless to say, the rank of
is at most three. Since
(=
) is stationary, the projection matrix
must be stationary. It is thus possible to obtain the stationary basis for the shape space by replacing
with the eigenvectors of
.
An iterative method similar to Algorithm (1) can be used to reduce the amount of computation. Given a
matrix
with orthonormal columns, the iterative method below generates a matrix
, which provides the stationary basis for the shape space.
(QR factorization)
In the previous section, we derived the shape space in terms of
. Once
is obtained, it is possible to compute camera coordinates
and
as
,
These coordinates are used to solve the overdetermined equations (12) and the true camera coordinates are recovered in the same way as in the original method. Doing so, however, requires storing all the coordinates
and
, the number of which may be very large. Instead, we use the following sequential algorithm.
,

Let
and
be the matrices
and
at frame
, where
and
are defined in Section 2.3 From the definition, it follows that
.
Assigning equations (35) and (36) to equation (18), we have
which gives the symmetric matrix
. The eigendecomposition of
yields the affine transform
and, as a result, the camera coordinates and the shape are obtained as follows:
Algorithm (3) followed by equations (37), (38), and (39) completes the sequential method. The size of matrices
and
are fixed to
and
, and the method does not store any matrices that grow, even in the sequential metric transformation.
In this section we compare the accuracy of our sequential factorization method with that of the original factorization method. Since both methods are essentially based on the rank theory, we do not expect any difference in the results. Our purpose here is to confirm that the sequential method has the same accuracy of shape and motion recovery as the original method.
The object in this experiment consists of 100 random feature points. The sequences are created using a perspective projection of those points. The image coordinates of each point are perturbed by adding Gaussian noise, which we assume to simulate tracking error and image noise. The standard deviation of the Gaussian noise is set to two pixels of a
pixel image. The distance of the object center from the camera is fixed to ten times the object size. The focal length is chosen so that the projection of the object covers the whole
image. The camera is rotated as shown in Figure 1, while the object is translated to keep its image at the image center. Quantization errors are not added since we assume that we are able to track features with a subpixel resolution.
When discussing the accuracy of the sequential method, one needs to consider its dynamic property regarding the 3D recovery. The accuracy of the recovery at a particular frame by the sequential method depends on the total amount of motion up to that time, since the recovery is made only from the information obtained up to that time. At the beginning of an image sequence, for example, the motion is generally small, so high accuracy can not be expected. The accuracy generally improves as the motion becomes larger. The original method does not have this dynamic property, since it is based on a batch-type scheme and uses all the information throughout the sequence.
In order to compare both methods under the same conditions, we perform the following computations beforehand. First, we form a submatrix
, which only contains the feature positions up to frame
. The original factorization is applied to the submatrix, then the results are kept as solutions at frame
. They are the best estimates given by the original method. Repeating this process for each frame, we derive the best estimates, with which our results are compared.
We first discuss the convergence property of the sequential shape space computation. The sequential factorization method starts with Algorithm (1) in Section 3.3.1, iteratively generating the matrix
which is an estimate for the true shape space
. Let us represent the estimation error with respect to the true shape space by
Recall that the function dist provides a notion of difference between two spaces. On the other hand, the original method produces the best estimate for the shape space by computing the right singular vectors
of the submatrix
, and its error with respect to the true shape space is also represented by
Comparing both errors, Figure 2 shows that they are almost identical. That is, the errors given by the sequential method are almost equal to those given by the original method.
At the beginning of the sequence, the amount of motion is small and both errors are relatively large. The ratio of the 4th to 3rd singular values, shown in Figure 3, also indicates that it is difficult to achieve good accuracy at the beginning. Both errors, however, quickly become smaller as the camera motion becomes larger. After about the 20th frame, constant errors of
are observed in this experiment.
The solutions given by the two methods are so close that the graphs are completely overlapped. Thus, we also plot their difference defined by
in Figure 4. Although
is relatively large at the beginning, it quickly becomes very small. In fact, after about the 30th frame,
is less than
, while
and
are both
.
Figure 1: True camera motion
The camera roll, pitch, and yaw are varied as shown in this figure. The sequence consists of 150 frames.
Figure 2: Shape space errors
Shape space estimation errors by the sequential method (solid line) and the original method (dashed line) with respect to the true shape space. The errors are defined by subspace distance and plotted logarithmically.
The ratio of the 4th to 3rd singular values, that is
.
Figure 4: Difference of shape space errors
The difference of the estimates by the sequential and original methods, versus the frame number. The difference is plotted logarithmically.
The three plots of Figure 5 show errors in roll, pitch, and yaw in the recovered motion: the solid lines correspond to the sequential method, the dotted lines to the original method. The difference in motion errors between the original and sequential methods is quite small.
Both results are unstable for a short period at the beginning of the sequence. After that, they show two kinds of errors: random and structural. Random errors are due to Gaussian noise added to the feature positions. Structural errors are due to perspective distortion, and relate to the motion patterns. The structural errors show a negative peak at about the 60th frame and are almost constant between the 90th and 120th frames. Note the pattern corresponds to the motion pattern shown in Figure 1.
Of course, these intrinsic errors cannot be eliminated in the sequential method. The point to observe is that the differences between the two solutions are sufficiently smaller than the intrinsic errors.
Shape errors which are compared in Figure 6 also indicate the same results. Again, the differences between the two methods are quite small compared to the intrinsic errors which the original method possesses. Note that no Gaussian noise appears in the shape errors since they are averaged over all the feature points.
We conclude from these results that the sequential method is nearly as accurate as the original method except that some extra frames are required to converge.
Experiments were performed on two sets of real images. The first set is an image sequence of a satellite rotating in space. Another experiment uses a long video recording (764 images) of a house taken with a hand-held camera. These experiments demonstrate the applicability of the sequential factorization method in real situations. In both experiments, features are selected and tracked using the method presented by Tomasi and Kanade[1].
Figure 5: Motion errors
Errors of recovered camera roll (top), pitch (middle), and yaw (bottom). The errors given by the sequential method are plotted with solid lines, while the errors given by the original method are plotted with dotted lines.
This figure compares the shape errors given by the two method. The errors given by the sequential method are plotted with solid lines, while the errors given by the original method are plotted with dotted lines. The errors are computed as the root-mean-square errors of the recovered shape with respect to the true shape, at each frame.
The first frame of the satellite image sequence. The superimposed squares indicate the selected features.
Recovered camera roll (solid line), pitch (dashed line), and yaw (dotted line) for the satellite image sequence.
A side view of the recovered shape of the satellite. The features on the solar panel are shown with opaque squares and others with filled squares. Notice that the features on the solar panel correctly lie in a single plane.
The first frame of the house image sequence. The superimposed squares indicate the selected features.
Recovered camera roll (solid line), pitch (dashed line), and yaw (dotted line) for the house image sequence.
A view of the recovered shape of the house from above. The features on the two side walls are correctly recovered.
The processing time of the sequential method on a Sun4/10 (solid line) compared with that of the original method (dotted line), as a function of the number of features which is varied from 10 to 500. The number of frames is fixed at 120.
Figure 6: Shape error
Figure 7 shows an image of the satellite with selected features indicated by small squares. The image sequence was digitized from a video recording[4] actually taken by a space shuttle astronaut. The feature tracker automatically selected and tracked 32 features throughout the sequence of 101 images. Of these, five features on the astronaut maneuvering around the satellite were manually eliminated because they had a different motion. Thus, the remaining 27 features were processed. Figure 8 shows the recovered motion in terms of roll, pitch, and yaw. The side view of the recovered shape is displayed in Figure 9, where the features on the solar panel are marked with opaque squares and others with filled squares. No ground-truth is available for the shape or the motion in this experiment. Yet, it appears that the solutions are satisfactory, since the features on the solar panel almost lie in a single line in the side view.
Figure 10 shows the first image of the sequence used in the second experiment. Using a hand-held camera, one of the authors took this sequence while walking. It consists of 764 images which correspond to about 25 seconds. The feature tracker detected and tracked 62 features. The recovered motion and shape are shown in Figures 11 and 12. It is clearly seen that the shape is qualitatively correct. It is also reasonable to observe that only the camera yaw is increasing, because the camera is moving parallel to the ground. In addition, note that the computed roll motion reveals the pace of the recorder's steps, which is about 1 step per second.
Further evaluation of accuracy in these experiments is difficult. However, this qualitative analysis of the results with real images, and quantitative analysis of the results with synthetic data essentially shows that the sequential method works as well with real images as the original batch method.
Finally, we compare the processing time of the sequential method with the original method. The computational complexity of the original method is dominated by the cost of the singular value decomposition, which needs
computations for a
measurement matrix with
[5]. Note that
corresponds to the number of frames and
to the number of features. On the other hand, the complexity of the sequential method is
. Computing the solution for frame F, therefore, takes only
using the sequential method, while the original method would require
operations.
Figure 7: An image of a satellite
Figure 8: Recovered motion of satellite
Figure 10: An image of a house
Figure 11: Recovered motion of house
Figure 12: Top view of the recovered shape
Figure 13: Processing time
Figure 13 shows the actual processing time of the sequential method on a Sun4/10 compared together with that of the original method. The number of features varied from 10 to 500, while the number of frames was fixed at 120. The processing time for selecting and tracking features was not included. The singular value decomposition of the original method is based on a routine found in [6]. The results sufficiently agree with our analysis above. In addition, when the number of features is less than 40, the sequential method can be run in less than 1/30 of a second, which enables video-rate processing on a Sun4/10.
We have presented the sequential factorization method, which provides estimates of shape and motion at each frame from a sequence of images. The method produces as accurate and robust results as the original method, while significantly reducing the computational complexity. The reduction in complexity is important for applying the factorization method to real-time applications. Furthermore, the method does not require storing any growing matrices so that its implementation in VLSI or DSP is feasible.
Faster convergence in the shape space computation could be achieved using more sophisticated algorithms such as the orthogonal iteration with Ritz acceleration[3] instead of the basic orthogonal iteration. Also, it is possible to use scaled orthographic projection or paraperspective projection[2] to improve the accuracy of the sequential factorization method.
The authors wish to thank Conrad J. Poelman and Richard Madison for their helpful comments.