Research Overview
Shape is a powerful visual cue for recognizing objects in images, segmenting images into regions
corresponding to individual objects, and, more generally, understanding the 3D structures of scenes.
However, to be able to exploit shape information, we need reliable ways of detecting fragments of
object boundaries, a difficult problem in itself. We also need a way of incorporating the boundary
information into the image interpretation process. Thus, we are investigating possible solutions to
both problems in a two-part research project. In the first part, we explore ways to reliably detect
occluding boundaries. Starting with the large body of work in detecting meaningful contours by
using appearance cues from single images, we focus on methods of incorporating motion cues
in addition to appearance cues for better detection of occlusion boundaries. In the second part,
we explore different ways in which boundaries can be used in key vision tasks by investigating the
integration of boundary information in segmentation and category recognition.
Introduction
Image interpretation, e.g. the ability to recognize object categories in images, remains a
formidable challenge. While considerable progress has been made in using image descriptions
based on local appearance or texture, effective ways of extracting, representing, and using shape
information are not nearly as advanced. This is problematic since many object categories are
defined by their function, and it is typically the case that function dictates an object's shape
rather than its low-level surface appearance.
In viewing a 3D scene, much of the shape information is contained in the boundaries between
surfaces in the scene, such as the boundaries at which occlusions between two objects occur. These
occlusion boundaries are valuable sources of information about scene structure and object shape.
Consider the scene depicted in Figure 1. There are many overlapping objects and surfaces in this
scene. Almost everything in the scene is occluded by and/or occludes another object or surface.
Knowledge of the extents of those objects and surfaces can be a valuable source of information for
understanding the scene's overall structure and its content.
To be able to exploit shape information, we need reliable ways of detecting these boundary fragments,
a difficult problem in itself, and we need ways of incorporating the boundary information into the image interpretation process. We are investigating possible solutions to both problems in the
two parts of this project. In the first part, we explore ways to reliably detect occluding boundaries. Starting from the large body of work on detecting meaningful contours using appearance cues from single images, we focus on incorporating motion cues in addition to appearance cues for more robust detection of occlusion boundaries.
While some computer vision applications, such as image retrieval, necessarily limit the system to using a single image, it is quite reasonable to assume a temporal sequence of images is available for vision applications operating in the physical world. Why force a mobile robot, for example, to attempt to understand its surroundings from disconnected still snapshots? It has the ability to move itself or to manipulate its environment and observe the result as a continuous, connected event. In such a system, the additional temporal dimension provided by the image sequence yields an extra source of information that should be exploited.
Boundary Detection
We have therefore extended existing 2D patch-based, non-parametric approaches to appearance-based edge detection to the spatio-temporal domain. With this 3D detector (Figure 2), we can detect not only edge strength and orientation, but also edge speed in the direction normal to its orientation. We can also use the estimated motion of patches extracted from either side of a detected edge, aligned to its orientation and speed, as depicted in Figure 3.
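To make these quantities concrete, the following minimal sketch computes edge strength, orientation, and normal speed from spatio-temporal gradients of a grayscale video volume. It is a simple gradient-based stand-in for the patch-based, non-parametric detector described above, not the detector itself; the function name and interface are hypothetical.

    import numpy as np

    def edge_strength_orientation_speed(volume, eps=1e-6):
        # volume: T x H x W grayscale video, float.
        # Spatio-temporal gradients along time, rows, and columns.
        It, Iy, Ix = np.gradient(volume.astype(float))
        strength = np.sqrt(Ix**2 + Iy**2)    # spatial edge strength
        orientation = np.arctan2(Iy, Ix)     # direction of the edge normal
        # Normal flow: speed of the edge along its normal, from the
        # brightness-constancy constraint Ix*u + Iy*v + It = 0.
        speed = -It / (strength + eps)
        return strength, orientation, speed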
We can use inconsistencies in the motion estimates to determine at a pixel level which appearance edges are also occlusion boundaries, as shown in Figure 4. Note that the motion inconsistencies here are due only to very subtle parallax cues, not the large-scale independent object motions often used in motion segmentation work.
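One simple way to obtain such per-side motion estimates is phase correlation between patches taken from the same side of the edge at consecutive frames; the sketch below scores an edge by the disagreement between the motions of its two sides. This is an illustrative stand-in, not necessarily the motion estimator we use, and the function names are ours.

    import numpy as np

    def patch_shift(patch_t0, patch_t1):
        # Translation between two same-size patches via phase correlation.
        F = np.fft.fft2(patch_t0) * np.conj(np.fft.fft2(patch_t1))
        corr = np.fft.ifft2(F / (np.abs(F) + 1e-9)).real
        dy, dx = np.unravel_index(np.argmax(corr), corr.shape)
        # Wrap shifts into the range [-size/2, size/2).
        h, w = patch_t0.shape
        if dy > h // 2: dy -= h
        if dx > w // 2: dx -= w
        return np.array([dy, dx], dtype=float)

    def motion_inconsistency(left_t0, left_t1, right_t0, right_t1):
        # Disagreement between the motions of the two sides of an edge;
        # a large value suggests the edge is an occlusion boundary.
        return np.linalg.norm(patch_shift(left_t0, left_t1)
                              - patch_shift(right_t0, right_t1))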
Using motion alone is likely insufficient for detecting object/occlusion boundaries. Thus, we are developing methods for learning to combine motion and appearance cues in our classifier. As seen in Figure 5, each of these cues provides information useful for differentiating boundaries from non-boundaries. In an experiment on a dataset of short image sequences, we show in Figure 6 that combining these cues does indeed improve pixel-wise precision-recall performance on the task of labeling edges as occlusion boundaries.
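The cue combination can be as simple as a logistic classifier over per-pixel cue vectors. Below is a minimal sketch, assuming a feature matrix whose columns might be appearance edge strength and the motion-inconsistency score; it is not the actual learner behind the results in Figure 6.

    import numpy as np

    def train_cue_classifier(X, y, lr=0.1, n_iter=2000):
        # X: n x d matrix of per-edge-pixel cues (e.g. appearance edge
        # strength, motion inconsistency); y: 0/1 occlusion labels.
        w, b = np.zeros(X.shape[1]), 0.0
        for _ in range(n_iter):
            p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted probability
            w -= lr * X.T @ (p - y) / len(y)        # log-loss gradient step
            b -= lr * (p - y).mean()
        return w, b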
Using Boundaries for Segmentation & Recognition
Assuming that we can detect these boundaries, why might they be useful for higher-level vision
tasks? In the second part of the project, we are exploring different ways in which boundaries can be used in key vision tasks by investigating the integration of boundary information in segmentation and
category recognition. Since these are challenging research topics in their own right, this project does not attempt a complete research program involving the development of entirely new techniques in those areas. Our more modest goal is to design and demonstrate ways to incorporate boundary information into existing segmentation and recognition approaches, where possible.
Segmentation
Many scene segmentation approaches rely on some form of pairwise pixel affinity. One such
affinity between two pixels is computed from the number and location of edges, the "intervening
contour," between them; the larger the number, the lower the affinity. This approach is commonly
used with normalized cuts. Oversegmentation of objects with strong surface markings could
be prevented if, instead of using simple edges computed directly from the image itself, we were
to use only occlusion boundaries as the intervening contours. Consider, for example, the
scene in Figure 7.
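As a concrete illustration of the intervening-contour affinity (with illustrative parameter choices; the boundary map here would be our occlusion-boundary output rather than a raw edge map):

    import numpy as np

    def intervening_contour_affinity(boundary, p, q, sigma=0.1, n_samples=20):
        # boundary: H x W map of occlusion-boundary strength in [0, 1];
        # p, q: (row, col) pixel coordinates.
        rows = np.linspace(p[0], q[0], n_samples).round().astype(int)
        cols = np.linspace(p[1], q[1], n_samples).round().astype(int)
        max_crossed = boundary[rows, cols].max()  # strongest boundary crossed
        return np.exp(-max_crossed / sigma)       # strong boundary -> low affinity

With raw image edges as input, strong surface markings depress affinities inside a single object; restricting the map to occlusion boundaries removes that effect.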
Relying solely on boundary information for segmentation may be overly optimistic, particularly since
our boundaries will likely have gaps. Much research has suggested that combining boundaries
detected from the image with cues derived from the enclosed regions is more appropriate for segmentation. But how can we use both? Can we avoid prior (top-down) knowledge
and models of the appearance or shapes of the objects we wish to segment? Such models
could be learned directly from the image, using occlusion boundaries to bootstrap the process. In
a somewhat simplified framework with only one independently moving object and a static camera,
Ross and Kaelbling have explored using background subtraction to automatically generate an
appearance model of the foreground object. They tile the image into non-overlapping patches and
attempt to find a dividing boundary along with a foreground/background assignment for each patch that
is consistent with its neighbors. This is accomplished by treating potential local binary segmentations
of each patch as possible labels and using a Conditional Random Field to find a globally
consistent set of local labels (i.e. segmentations) that define the whole object. Using our occlusion
boundaries rather than background subtraction, we propose to follow a similar approach, but
with fewer restrictions. The main challenge here is to extend the approach to scenes with multiple
overlapping objects. Another possibility relies on normalized cuts with repulsive forces defined by
the occlusion boundaries, combined with the attractive forces computed from region-based affinity measures.
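A rough sketch of this second possibility, under the simplifying assumption that attraction and repulsion can be folded into a single signed affinity matrix (a proper treatment of negative affinities in normalized cuts is more subtle, so this is only a conceptual illustration):

    import numpy as np
    from scipy.linalg import eigh

    def two_way_cut(W_attract, W_repel, alpha=1.0):
        # W_attract, W_repel: symmetric n x n non-negative affinity matrices.
        W = W_attract - alpha * W_repel      # repulsion pushes pixels apart
        D = np.diag(np.abs(W).sum(axis=1))
        # Relaxed normalized cut: the second-smallest generalized eigenvector
        # of (D - W) y = lambda * D y gives a soft two-way partition.
        _, vecs = eigh(D - W, D)
        fiedler = vecs[:, 1]
        return fiedler > np.median(fiedler)  # threshold into two groups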
If we have some indication of which side of a boundary is object ("figure") and which side is
background or a different object ("ground"), which we may be able to estimate from local motion
cues, then in principle we can use the pixels near the edge to construct appearance models of
the foreground and background. The appearance models could be as simple as color histograms,
for example. Once such a model is obtained, we can leverage the numerous recent advances in
interactive image and video segmentation and matting (most
recently, the impressive results by Levin et al.). These methods use sparse user interactions,
which specify foreground, background, and unknown pixels in a "tri-map", to constrain a hard or
soft segmentation, and they have proven quite powerful.
The hand-labeled foreground and background pixels provided by the user specify the foreground
and background models to drive the segmentation. But if we provide those constraints in an automatic
fashion by using our detected occlusion boundaries and their associated notion of foreground
and background, the result would be a fully automatic segmentation of objects in the scene. We
will explore such an approach as outlined in Figure 8. In fact, a similar idea has been explored quite recently, nearly in parallel with our work, in promising research by Apostoloff and Fitzgibbon. To replace the sparse user
inputs, they use their own T-junction detector rather than elongated occlusion boundaries like
ours. We feel that the two (junctions and boundaries) are likely somewhat complementary in
nature, but that boundaries, which are far less sparse, could provide richer, more accurate models
with which to constrain the segmentation.
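A minimal sketch of this automatic-trimap idea, assuming we are given a boundary-strength map and a per-pixel figure/ground estimate (both hypothetical inputs here), with color histograms standing in for the appearance models:

    import numpy as np
    from scipy.ndimage import binary_dilation

    def side_histograms(image, fg_mask, bg_mask, bins=8):
        # image: H x W x 3 uint8; masks: boolean H x W picking pixels on
        # the figure and ground sides of a detected boundary.
        def hist(mask):
            h, _ = np.histogramdd(image[mask], bins=(bins,) * 3,
                                  range=((0, 256),) * 3)
            return h / max(h.sum(), 1)
        return hist(fg_mask), hist(bg_mask)

    def automatic_trimap(boundary, fg_side, band=5):
        # 0 = background, 1 = unknown (a band around the boundary),
        # 2 = foreground, in place of hand-labeled user strokes.
        unknown = binary_dilation(boundary > 0, iterations=band)
        trimap = np.where(fg_side, 2, 0)
        trimap[unknown] = 1
        return trimap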
Recognition
Object recognition is another area where the use of boundary information is crucial but not
yet fully exploited. Many object recognition approaches rely on appearance features computed by
aggregating image information within local patches. One issue with these approaches is that the
patches may cross object boundaries, resulting in many unusable large-scale features that mix
information from the object and the background. A more problematic issue is that, since the local
features are essentially convenient means of representing local texture, they are far less discriminative
for objects that are characterized primarily by their shape. This has recently been addressed
by recognition techniques that use contour fragments instead of regional descriptors.
That solves part of the problem, but a remaining issue is that many of the contour fragments
may be irrelevant if they correspond to spurious intra-category variations in the appearance of
the object, rather than capturing useful shape information. Using boundaries should, in principle,
force the model to focus on those fragments that capture shape. We are working to combine a category
recognition approach with our boundary detection techniques. Our proposed recognition approach
supports semi-supervised category learning and can operate directly from contour fragments.
Importantly, the recognition approach can also incorporate other regional features based on appearance.
Therefore, as before, we do not advocate that boundaries or contours alone are sufficient
for recognition. Our more limited goal is to show how they can be used effectively to exploit shape
information in a category recognition setting.
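For illustration, one standard way to score a contour fragment against a detected boundary map is a chamfer-style distance computed with a distance transform; we do not claim this exact matcher is the one in our approach, and the interface is hypothetical.

    import numpy as np
    from scipy.ndimage import distance_transform_edt

    def chamfer_score(edge_map, fragment_points):
        # edge_map: boolean H x W, True on detected boundary pixels.
        # fragment_points: n x 2 integer array of (row, col) model points,
        # already placed in image coordinates.
        dist = distance_transform_edt(~edge_map)  # distance to nearest edge
        return dist[fragment_points[:, 0], fragment_points[:, 1]].mean()

A low score means the fragment lies along detected boundaries, making it evidence for the shape it came from.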
In addition, we are exploring the use of boundaries as a bridge between segmentation
and recognition for generating candidate object locations in an input image. Many recognition
approaches operate from a database of known categories and features on which they have been
trained. The system then functions in a top-down manner, trying to find model features and deciding
(via some spatial reasoning, for example) whether a particular object exists at a particular
location. On the other hand, a system that uses bottom-up cues from boundaries to reason about
the existence of an object (that is, any generic object) within the scene could first propose locations
of potential objects, as a cueing mechanism, thereby directing the recognition scheme to the most
fruitful locations within the scene and removing surrounding background clutter from consideration.
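A crude sketch of such a cueing mechanism, with illustrative thresholds and function names of our own choosing: close small gaps in the boundary map, fill the enclosed areas, and hand the resulting bounding boxes to the recognizer.

    import numpy as np
    from scipy.ndimage import (binary_closing, binary_fill_holes,
                               label, find_objects)

    def propose_object_regions(boundary, threshold=0.5, min_area=100):
        # Close small gaps in the boundary map, fill enclosed areas, and
        # return bounding-box slices of sufficiently large components.
        closed = binary_closing(boundary > threshold, iterations=2)
        interiors = binary_fill_holes(closed) & ~closed
        labeled, _ = label(interiors)
        boxes = []
        for sl in find_objects(labeled):
            h = sl[0].stop - sl[0].start
            w = sl[1].stop - sl[1].start
            if h * w >= min_area:
                boxes.append(sl)  # candidate location for the recognizer
        return boxes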
In addition, the ability to extract potential objects from a scene automatically may have
implications for unsupervised learning and discovery of novel objects, since each new object would
not necessarily need to be manually extracted from its environment. This could
also allow for simultaneous in situ learning of objects and their context.