Question: Why is it a mistake to converge stereo cameras whose images
will be viewed by people (vs. analysed by a computer)?
Short Answer: Any 3D scene point (that is not in the plane that
contains the horizontally converged lens axes) projects onto the
converged sensors at different heights. When the resulting images are
rotated onto the plane of the viewing screen, corresponding image
points on the screen are vertically displaced, so the lines from the
eyes through the corresponding screen points do not intersect, i.e.,
they no longer define a 3D scene point. The underlying reason is
that rotating both images onto the screen plane destroys knowledge of
the initial camera geometry that is crucial to reconstructing the 3D
scene.
Long Answer:
Let the plane V of the viewing screen be vertical and perpendicularly
equidistant from the centers-of-projection L and R of the viewer's
eyes, and let line LR be horizontal. Any plane containing LR
intersects V in a horizontal line [1].
Let P be a world point that is to be depicted stereoscopically by
drawing it on the screen for the left eye at L', the intersection of
line LP with the screen, and by drawing it on the screen for the right
eye at R', the intersection of line RP with the screen. Points L, R,
and P determine a plane that contains LR, so L'R' is horizontal [2].
Conversely, given purportedly corresponding points L" and R" with L"R"
not horizontal: LR and L"R" are not parallel [3] and do not
intersect [4], so LR and L"R" are not coplanar [5]; hence LL" and RR"
cannot intersect [6], so L" and R" cannot actually correspond to a
world point.
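The argument above is easy to check numerically. The sketch below
(with illustrative coordinates: screen in the plane z = 0, eyes at
z = d, all names hypothetical) projects a world point through both eye
positions onto the screen and confirms that the two screen points
differ only horizontally:

```python
import numpy as np

# Eye centers of projection: horizontally separated, equidistant from
# the screen plane z = 0. Coordinates are (x, y, z); y is height.
# All numbers are illustrative.
d = 0.6                               # viewing distance (m), assumed
eye_left  = np.array([-0.032, 0.0, d])
eye_right = np.array([+0.032, 0.0, d])

def screen_intersection(eye, p):
    """Intersect the line from `eye` through world point `p`
    with the screen plane z = 0."""
    t = eye[2] / (eye[2] - p[2])      # parameter where z reaches 0
    return eye + t * (p - eye)

# Any world point behind the screen (z < 0):
P = np.array([0.1, 0.25, -1.5])
Lp = screen_intersection(eye_left, P)
Rp = screen_intersection(eye_right, P)

# The two screen points share the same height y; only x differs.
print(Lp[:2], Rp[:2])
```

The equal heights follow from the rectangle argument in note [1]; the
code merely exercises the same geometry with concrete numbers.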
Notice I have said nothing about the "gaze directions" of the eyes,
and nothing about the relationship between left and right retinal
points that the brain can or cannot fuse. It is purely a matter of
geometry that only horizontally separated points on the screen
correspond to actual points in 3D space.
So what happens when cameras (i.e., the perpendiculars to their sensor
planes) are converged and their images are printed on a viewing
screen? Consider the isosceles triangle formed by the centers of the
two sensors and the point where the normals to the sensors through
their centers intersect. A rectangle centered on, and perpendicular
to, the altitude of this triangle is captured by the sensors as
oppositely tapered trapezoids [7]. When
these trapezoids are drawn on the screen, corresponding points (e.g.,
the actual rectangle corners) are vertically disparate. By the above
discussion, they do not correspond to a 3D-world point.
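The vertical disparity can be demonstrated with a toy pinhole-camera
model (all numbers are illustrative): two cameras toed in toward a
common convergence point record an off-axis rectangle corner at
different heights on their sensors.

```python
import numpy as np

def rot_y(theta):
    """Rotation about the vertical (y) axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c, 0, s],
                     [ 0, 1, 0],
                     [-s, 0, c]])

def project(P, C, R, f=0.035):
    """Pinhole projection of world point P for a camera at center C
    with orientation R (columns = camera axes in world coordinates)."""
    Pc = R.T @ (P - C)             # express P in the camera's frame
    return f * Pc[:2] / Pc[2]      # sensor coordinates (x, y); y is height

b, Z = 0.1, 2.0                    # baseline and convergence distance (assumed)
theta = np.arctan2(b / 2, Z)       # toe-in so the two axes meet at (0, 0, Z)
C_L, C_R = np.array([-b/2, 0.0, 0.0]), np.array([b/2, 0.0, 0.0])
R_L, R_R = rot_y(theta), rot_y(-theta)   # each camera rotated toward the middle

corner = np.array([0.5, 0.3, Z])   # off-axis corner of a fronto-parallel rectangle
y_left  = project(corner, C_L, R_L)[1]
y_right = project(corner, C_R, R_R)[1]
print(y_left, y_right)             # different heights: vertical disparity
```

Setting theta to zero (parallel axes) makes the two heights equal,
which is the point of the parallel-camera prescription below.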
How did this happen? Initially the sensors were not parallel. In
drawing both the images they recorded onto the same screen, crucial
knowledge about their actual relative orientation was discarded. It
is this discarding of information about the actual camera geometry
that is the origin of the problem.
Recognizing this suggests a way to repair the damage. Rather than
simply overlaying the two images, we can project
them back through an optical system that is converged exactly the same
way the sensors were converged originally, undoing the keystone
distortion. This arrangement is actually used for 3D slide shows and
movies, where left and right films are separately projected onto one
screen. Equivalently, a linear rectification algorithm can be used to
undo the keystone distortion. The optical and the algorithmic
solutions are both exact in the approximation of pinhole optics, but
for optics that use lenses they are inexact. One source of
inexactness is that even ideal Gaussian lenses exhibit depth-of-field
effects, which neither optics nor geometrical rectification corrects.
Another source of inexactness is that real lenses also have
aberrations, not all of which can be corrected by optical reversal and
which are generally impractical to correct algorithmically.
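In the pinhole approximation, the algorithmic rectification is a
homography built from the known toe-in rotation. A minimal sketch
(assuming the toe-in angle theta and focal length f are known; names
are hypothetical) maps a converged-sensor point to the point a
parallel camera at the same center would have recorded:

```python
import numpy as np

def rot_y(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[ c, 0, s],
                     [ 0, 1, 0],
                     [-s, 0, c]])

f = 0.035
theta = 0.025                       # toe-in angle of this camera (assumed known)
K = np.diag([f, f, 1.0])            # idealized pinhole calibration matrix
R = rot_y(theta)                    # orientation of the converged camera

# Homography sending a converged-sensor point to the point a parallel
# camera at the same center would record: x_par ~ K R K^(-1) x_conv.
H = K @ R @ np.linalg.inv(K)

X = np.array([0.55, 0.3, 2.0])      # a world point, in camera-center coordinates
Y = R.T @ X                         # the same point in the converged camera's frame
x_conv = f * Y[:2] / Y[2]           # what the converged sensor records
x_par  = f * X[:2] / X[2]           # what a parallel camera would record

v = H @ np.array([x_conv[0], x_conv[1], 1.0])
x_rect = v[:2] / v[2]               # rectified point: keystone undone
print(np.allclose(x_rect, x_par))
```

The homography is exact only because a pinhole model is assumed; as
noted above, lens effects fall outside what it can correct.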
By the way, a tempting but incorrect answer to the original question
is that the eye is spherical, so to a rough approximation it receives
and perceives a 3D->2D projection whose shape is independent of where
it is pointed; only the location of the image on the retina changes.
[The reason it is a rough approximation is that the center of
projection is on the surface of the sphere, not at its center. The
reason the eye points is, of course, the foveal structure of the
retina.] The camera, with its flat film or CCD, in contrast
records a 3D->2D projection whose shape depends strongly on where it
is pointed. But this pseudo-answer evades the real issue, which is
the destruction of information about relative camera orientation when
the images are both rotated into the viewing plane. The question
should correctly be answered with reference only to the locations of
the centers of projection of the eyes; it should not be necessary to
invoke the detailed engineering of the eye.
I have described only the special case of the eyes horizontal and
equidistant from the viewing screen. It should be apparent from this
discussion that this is in fact the only correct viewing geometry.
That is, to achieve strictly correct viewing geometry it is necessary
not only for the camera axes to be parallel (and the sensors displaced
if necessary to overlay their fields of view), but for the viewer's
eyes to be located in the original camera positions. However, it is
not necessary for the viewer's eyes to point in the initial camera
pointing directions; that is a matter of individual preference.
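The parallel-axis, shifted-sensor arrangement can be sketched
numerically (baseline, focal length, and zero-disparity distance are
illustrative): shifting each sensor sideways by f*(b/2)/Z puts points
at distance Z in the screen plane, while keystone distortion, and with
it vertical disparity, never arises because nothing is rotated.

```python
f, b, Z = 0.035, 0.065, 3.0   # focal length, baseline, zero-disparity distance (assumed)
h = f * (b / 2) / Z           # sideways shift of each sensor toward the other camera

def disparity(depth):
    """Horizontal screen disparity of a point straight ahead at `depth`,
    for parallel-axis cameras whose sensors are shifted by h."""
    x_left  = f * (b / 2) / depth - h    # left image x after the sensor shift
    x_right = -f * (b / 2) / depth + h   # right image x after the shift
    return x_left - x_right              # = f*b/depth - f*b/Z

print(disparity(Z))           # 0.0: points at distance Z appear in the screen plane
print(disparity(2 * Z) < 0)   # farther points recede behind the screen
```

The choice of Z only sets which depth lands in the screen plane; it
does not affect the correctness of the geometry.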
Notes:
[1] "Any plane containing LR" can be defined by parallel lines LL' and
RR' perpendicular to LR with L' and R' in V. The perpendicular
equidistance assumption makes LL'R'R a rectangle. Thus L'R' is
parallel to LR. LR is assumed horizontal, so L'R' must be horizontal.
[2] Conclusion of previous paragraph.
[3] LR is horizontal but L"R" is not.
[4] L"R" is in the screen plane and LR is not.
[5] Coplanar lines must either intersect or be parallel.
[6] Intersecting lines are coplanar.
[7] This is called "keystone distortion".