Research Overview
Shape is a powerful visual cue for recognizing objects in images, segmenting images into regions
corresponding to individual objects, and, more generally, understanding the 3D structures of scenes.
However, to be able to exploit shape information, we need reliable ways of detecting fragments of
object boundaries, a difficult problem in itself. We also need a way of incorporating the boundary
information into the image interpretation process. Thus, we are investigating possible solutions to
both problems in a two-part research project. In the first part, we explore ways to reliably detect
occluding boundaries. Starting with the large body of work in detecting meaningful contours by
using appearance cues from single images, we focus on methods of incorporating motion cues
in addition to appearance cues for better detection of occlusion boundaries. In the second part,
we explore different ways in which boundaries can be used in key vision tasks by investigating the
integration of boundary information in segmentation and category recognition.
Introduction
Image interpretation, e.g. the ability to recognize object categories in images, remains a
formidable challenge. While considerable progress has been made in using image descriptions
based on local appearance or texture, effective ways of extracting, representing, and using shape
information are not nearly as advanced. This is problematic since many object categories are
defined by their function, and it is typically the case that function dictates an object's shape
rather than its low-level surface appearance.
In viewing a 3D scene, much of the shape information is contained in the boundaries between
surfaces in the scene, such as the boundaries at which occlusions between two objects occur. These
occlusion boundaries are valuable sources of information about scene structure and object shape.
Consider the scene depicted in Figure 1. There are many overlapping objects and surfaces in this
scene. Almost everything in the scene is occluded by and/or occludes another object or surface.
Knowledge of the extents of those objects and surfaces can be a valuable source of information for
understanding the scene's overall structure and its content.
To be able to exploit shape information, we need reliable ways of detecting these boundary fragments,
a difficult problem in itself, and we need ways of incorporating the boundary information into the image interpretation process. We are investigating possible solutions to both problems in the
two parts of this project. In the first part, we explore ways to reliably detect occluding boundaries. Starting from the large body of work on detecting meaningful contours using appearance cues from single images, we focus on incorporating motion cues in addition to appearance cues for more robust detection of occlusion boundaries.
While some computer vision applications, such as image retrieval, necessarily limit the system to using a single image, it is quite reasonable to assume a temporal sequence of images is available for vision applications operating in the physical world. Why force a mobile robot, for example, to attempt to understand its surroundings from disconnected still snapshots? It has the ability to move itself or to manipulate its environment and observe the result as a continuous, connected event. In such a system, the additional temporal dimension provided by the image sequence yields an extra source of information that should be exploited.
Boundary Detection
We have therefore extended existing 2D patch-based, non-parametric approaches to appearance-based edge detection to the spatio-temporal domain. With this 3D detector (Figure 2), we can detect not only edge strength and orientation, but also edge speed in the direction normal to its orientation. We can also use the estimated motion of patches extracted from either side of a detected edge, aligned to its orientation and speed, as depicted in Figure 3.
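To make these quantities concrete, the following minimal sketch computes edge strength, orientation, and normal speed from spatio-temporal gradients of a grayscale video volume. It is a simple gradient-based stand-in for the patch-based, non-parametric detector described above, not the detector itself; the function name and interface are hypothetical.

    import numpy as np

    def edge_strength_orientation_speed(volume, eps=1e-6):
        # volume: T x H x W grayscale video, float.
        # Spatio-temporal gradients along time, rows, and columns.
        It, Iy, Ix = np.gradient(volume.astype(float))
        strength = np.sqrt(Ix**2 + Iy**2)    # spatial edge strength
        orientation = np.arctan2(Iy, Ix)     # direction of the edge normal
        # Normal flow: speed of the edge along its normal, from the
        # brightness-constancy constraint Ix*u + Iy*v + It = 0.
        speed = -It / (strength + eps)
        return strength, orientation, speed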
We can use inconsistencies in the motion estimates to determine at a pixel level which appearance edges are also occlusion boundaries, as shown in Figure 4. Note that the motion inconsistencies here are due only to very subtle parallax cues, not the large-scale independent object motions often used in motion segmentation work.
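One simple way to obtain such per-side motion estimates is phase correlation between patches taken from the same side of the edge at consecutive frames; the sketch below scores an edge by the disagreement between the motions of its two sides. This is an illustrative stand-in, not necessarily the motion estimator we use, and the function names are ours.

    import numpy as np

    def patch_shift(patch_t0, patch_t1):
        # Translation between two same-size patches via phase correlation.
        F = np.fft.fft2(patch_t0) * np.conj(np.fft.fft2(patch_t1))
        corr = np.fft.ifft2(F / (np.abs(F) + 1e-9)).real
        dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
        # Wrap shifts into the range [-size/2, size/2).
        h, w = patch_t0.shape
        if dy > h // 2: dy -= h
        if dx > w // 2: dx -= w
        return np.array([dy, dx], dtype=float)

    def motion_inconsistency(left_t0, left_t1, right_t0, right_t1):
        # Disagreement between the motions of the two sides of an edge;
        # a large value suggests the edge is an occlusion boundary.
        return np.linalg.norm(patch_shift(left_t0, left_t1)
                              - patch_shift(right_t0, right_t1))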
Using motion alone is likely insufficient for detecting object/occlusion boundaries. Thus, we are developing methods for learning to combine motion and appearance cues in our classifier. As seen in Figure 5, each of these cues provides information useful for differentiating boundaries from non-boundaries. In an experiment on a dataset of short image sequences, we show in Figure 6 that combining these cues does indeed improve pixel-wise precision-recall performance on the task of labeling edges as occlusion boundaries.
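The cue combination can be as simple as a logistic classifier over per-pixel cue vectors. Below is a minimal sketch, assuming a feature matrix whose columns might be appearance edge strength and the motion-inconsistency score; it is not the actual learner behind the results in Figure 6.

    import numpy as np

    def train_cue_classifier(X, y, lr=0.1, n_iter=2000):
        # X: n x d matrix of per-edge-pixel cues (e.g. appearance edge
        # strength, motion inconsistency); y: 0/1 occlusion labels.
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability
            w -= lr * X.T @ (p - y) / len(y)        # log-loss gradient step
            b -= lr * (p - y).mean()
        return w, b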
Using Boundaries for Segmentation & Recognition
Assuming that we can detect these boundaries, why might they be useful for higher-level vision
tasks? In the second part of the project, we are exploring different ways in which boundaries can be used in key vision tasks by investigating the integration of boundary information in segmentation and
category recognition. Since these are challenging research topics in their own right, this project does not attempt a complete research program involving the development of entirely new techniques in those areas. Our more modest goal is to design and demonstrate ways to incorporate boundary information into existing segmentation and recognition approaches, where possible.
Segmentation
Many scene segmentation approaches rely on some form of pairwise pixel affinity. One such
affinity between two pixels is computed from the number and location of edges, the "intervening
contour," between them; the larger the number, the lower the affinity. This approach is commonly
used with normalized cuts. Oversegmentation of objects with strong surface markings could
be prevented if, instead of using simple edges computed directly from the image itself, we were
to use only occlusion boundaries as the intervening contours. Consider, for example, the
scene in Figure 7.
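As a concrete illustration of the intervening-contour affinity (with illustrative parameter choices; the boundary map here would be our occlusion-boundary output rather than a raw edge map):

    import numpy as np

    def intervening_contour_affinity(boundary, p, q, sigma=0.1, n_samples=20):
        # boundary: H x W map of occlusion-boundary strength in [0, 1];
        # p, q: (row, col) pixel coordinates.
        rows = np.linspace(p[0], q[0], n_samples).round().astype(int)
        cols = np.linspace(p[1], q[1], n_samples).round().astype(int)
        max_crossed = boundary[rows, cols].max()  # strongest boundary crossed
        return np.exp(-max_crossed / sigma)       # strong boundary -> low affinity

With raw image edges as input, strong surface markings depress affinities inside a single object; restricting the map to occlusion boundaries removes that effect.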
Relying solely on boundary information for segmentation may be overly optimistic, particularly since
our boundaries will likely have gaps. Much research has suggested that combining boundaries
detected from the image with cues derived from the enclosed regions is more appropriate for segmentation. But how can we use both? Can we avoid prior (top-down) knowledge
and models of the appearance or shapes of the objects we wish to segment? Such models
could be learned directly from the image, using occlusion boundaries to bootstrap the process. In
a somewhat simplified framework with only one independently moving object and a static camera,
Ross and Kaelbling have explored using background subtraction to automatically generate an
appearance model of the foreground object. They tile the image into non-overlapping patches and
attempt to find a dividing boundary along with a foreground/background assignment for each patch that
is consistent with its neighbors. This is accomplished by treating potential local binary segmentations
of each patch as possible labels and using a Conditional Random Field to find a globally
consistent set of local labels (i.e. segmentations) that define the whole object. Using our occlusion
boundaries rather than background subtraction, we propose to follow a similar approach, but
with fewer restrictions. The main challenge here is to extend the approach to scenes with multiple
overlapping objects. Another possibility relies on normalized cuts with repulsive forces defined by
the occlusion boundaries, combined with the attractive forces computed from region-based affinity measures.
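A rough sketch of this second possibility, under the simplifying assumption that attraction and repulsion can be folded into a single signed affinity matrix (a proper treatment of negative affinities in normalized cuts is more subtle, so this is only a conceptual illustration):

    import numpy as np
    from scipy.linalg import eigh

    def two_way_cut(W_attract, W_repel, alpha=1.0):
        # W_attract, W_repel: symmetric n x n non-negative affinity matrices.
        W = W_attract - alpha * W_repel      # repulsion pushes pixels apart
        D = np.diag(np.abs(W).sum(axis=1))
        # Relaxed normalized cut: the second-smallest generalized eigenvector
        # of (D - W) y = lambda * D y gives a soft two-way partition.
        _, vecs = eigh(D - W, D)
        fiedler = vecs[:, 1]
        return fiedler > np.median(fiedler)  # threshold into two groups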
If we have some indication of which side of a boundary is object ("figure") and which side is
background or a different object ("ground"), which we may be able to estimate from local motion
cues, then in principle we can use the pixels near the edge to construct appearance models of
the foreground and background. The appearance models could be as simple as color histograms,
for example. Once such a model is obtained, we can leverage the numerous recent advances in
interactive image and video segmentation and matting (most
recently, the impressive results by Levin et al.). These methods use sparse user interactions,
which specify foreground, background, and unknown pixels in a "tri-map", to constrain a hard or
soft segmentation, and they have proven quite powerful.
The hand-labeled foreground and background pixels provided by the user specify the foreground
and background models to drive the segmentation. But if we provide those constraints in an automatic
fashion by using our detected occlusion boundaries and their associated notion of foreground
and background, the result would be a fully automatic segmentation of objects in the scene. We
will explore such an approach as outlined in Figure 8. In fact, a similar idea has been explored quite recently, nearly in parallel with our work, in promising research by Apostoloff and Fitzgibbon. To replace the sparse user
inputs, they use their own T-junction detector rather than elongated occlusion boundaries like
ours. We feel that the two (junctions and boundaries) are likely somewhat complementary in
nature, but that boundaries, which are far less sparse, could provide richer, more accurate models
with which to constrain the segmentation.
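A minimal sketch of this automatic-trimap idea, assuming we are given a boundary-strength map and a per-pixel figure/ground estimate (both hypothetical inputs here), with color histograms standing in for the appearance models:

    import numpy as np
    from scipy.ndimage import binary_dilation

    def side_histograms(image, fg_mask, bg_mask, bins=8):
        # image: H x W x 3 uint8; masks: boolean H x W picking pixels on
        # the figure and ground sides of a detected boundary.
        def hist(mask):
            h, _ = np.histogramdd(image[mask], bins=(bins,) * 3,
                                  range=((0, 256),) * 3)
            return h / max(h.sum(), 1)
        return hist(fg_mask), hist(bg_mask)

    def automatic_trimap(boundary, fg_side, band=5):
        # 0 = background, 1 = unknown (a band around the boundary),
        # 2 = foreground, in place of hand-labeled user strokes.
        unknown = binary_dilation(boundary > 0, iterations=band)
        trimap = np.where(fg_side, 2, 0)
        trimap[unknown] = 1
        return trimap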
Recognition
Object recognition is another area where the use of boundary information is crucial but not
yet fully exploited. Many object recognition approaches rely on appearance features computed by
aggregating image information within local patches. One issue with these approaches is that the
patches may cross object boundaries, resulting in many unusable large-scale features that mix
information from the object and the background. A more problematic issue is that, since the local
features are essentially convenient means of representing local texture, they are far less discriminative
for objects that are characterized primarily by their shape. This has recently been addressed
by recognition techniques that use contour fragments instead of regional descriptors.
That solves part of the problem, but a remaining issue is that many of the contour fragments
may be irrelevant if they correspond to spurious intra-category variations in the appearance of
the object, rather than capturing useful shape information. Using boundaries should, in principle,
force the model to focus on those fragments that capture shape. We are working to combine a category
recognition approach with our boundary detection techniques. Our proposed recognition approach
supports semi-supervised category learning and can operate directly from contour fragments.
Importantly, the recognition approach can also incorporate other regional features based on appearance.
Therefore, as before, we do not advocate that boundaries or contours alone are sufficient
for recognition. Our more limited goal is to show how they can be used effectively to exploit shape
information in a category recognition setting.
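For illustration, one standard way to score a contour fragment against a detected boundary map is a chamfer-style distance computed with a distance transform; we do not claim this exact matcher is the one in our approach, and the interface is hypothetical.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def chamfer_score(edge_map, fragment_points):
        # edge_map: boolean H x W, True on detected boundary pixels.
        # fragment_points: n x 2 integer array of (row, col) model points,
        # already placed in image coordinates.
        dist = distance_transform_edt(~edge_map)  # distance to nearest edge
        return dist[fragment_points[:, 0], fragment_points[:, 1]].mean()

A low score means the fragment lies along detected boundaries, making it evidence for the shape it came from.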
In addition, we are exploring the use of boundaries as a bridge between segmentation
and recognition for generating candidate object locations in an input image. Many recognition
approaches operate from a database of known categories and features on which they have been
trained. The system then functions in a top-down manner, trying to find model features and deciding
(via some spatial reasoning, for example) whether a particular object exists at a particular
location. On the other hand, a system that uses bottom-up cues from boundaries to reason about
the existence of an object (that is, any generic object) within the scene could first propose locations
of potential objects, as a cueing mechanism, thereby directing the recognition scheme to the most
fruitful locations within the scene and removing surrounding background clutter from consideration.
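A crude sketch of such a cueing mechanism, with illustrative thresholds and function names of our own choosing: close small gaps in the boundary map, fill the enclosed areas, and hand the resulting bounding boxes to the recognizer.

    import numpy as np
    from scipy.ndimage import (binary_closing, binary_fill_holes,
                               label, find_objects)

    def propose_object_regions(boundary, threshold=0.5, min_area=100):
        # Close small gaps in the boundary map, fill enclosed areas, and
        # return bounding-box slices of sufficiently large components.
        closed = binary_closing(boundary > threshold, iterations=2)
        interiors = binary_fill_holes(closed) & ~closed
        labeled, _ = label(interiors)
        boxes = []
        for sl in find_objects(labeled):
            h = sl[0].stop - sl[0].start
            w = sl[1].stop - sl[1].start
            if h * w >= min_area:
                boxes.append(sl)  # candidate location for the recognizer
        return boxes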
In addition, the ability to extract potential objects from a scene automatically may have
implications for unsupervised learning and discovery of novel objects, since each new object would
not necessarily need to be manually extracted from its environment. This could
also allow for simultaneous in situ learning of objects and their context.