Detecting Boundaries for Segmentation and Recognition

People

Description

Shape is a powerful visual cue for recognizing objects in images, segmenting images into regions corresponding to individual objects, and, more generally, understanding the 3D structure of scenes. However, to exploit shape information, we need reliable ways of detecting fragments of object boundaries, a difficult problem in itself. This project investigates possible solutions to these problems. Specifically, it explores how to combine motion cues with appearance cues to detect occlusion boundaries more reliably. In addition, the project explores how boundaries can be used in key vision tasks by investigating the integration of boundary information into segmentation and category recognition.

The project generates advances in two areas: 1) detecting boundary information from images as shape cues and 2) using boundary information in segmentation and recognition tasks. In the first area, it extends the current approaches for contour detection to include motion cues. This is motivated by the availability of temporal information in many practical applications. Assuming that we can extract boundary fragments, the issue of using them effectively in segmentation and recognition remains an open question. The project will make substantial contributions toward answering that question, including integrating boundary information with image segmentation and recognition algorithms.

The project will result in advances that are directly relevant to a wide range of applications. In particular, the ability to reliably detect boundaries from image sequences is crucial in all applications that involve the analysis of videos for object discovery, recognition, and segmentation. The project is also expected to contribute to the development of more reliable category recognition algorithms based on shape representations, an enabling technology in a wide variety of fields including defense, health care, human-computer interaction, image retrieval and data mining, industrial and personal robotics, manufacturing, scientific image analysis, surveillance and security, and transportation. Although the project focuses primarily on automatic segmentation and recognition, it will also contribute to human-assisted segmentation and image editing, a technology area that is becoming increasingly important with the advent of consumer video and image editing tools. An additional product of the project is a corpus of labeled data, including motion data.

The first step in the estimation of occluding boundaries is the computation of local cues that can be used for discriminating between occluding and non-occluding boundaries. Two cues are combined. The first is computed from local appearance only, using Martin's Pb detector. The second is computed from local motion information, i.e., features computed from the difference between motion statistics on either side of each tentative edge fragment. This estimation is "local" in that it does not link the individual contour elements and does not enforce any global consistency constraints.

Figure panels: Ground truth, Pb, Motion, Combined
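The local cue combination described above can be illustrated with a minimal sketch. It assumes the motion statistic is simply the mean optical-flow vector on each side of an edge fragment and that the two cues are fused with a logistic function. The function names, the weights, and the bias are all hypothetical placeholders chosen for illustration; they are not the features or the classifier used in the referenced papers.

```python
import math

def mean_flow(flows):
    """Average a non-empty list of (u, v) flow vectors."""
    n = len(flows)
    return (sum(u for u, _ in flows) / n, sum(v for _, v in flows) / n)

def motion_cue(left_flows, right_flows):
    """Assumed motion cue: magnitude of the difference between the
    mean flow on the two sides of a tentative edge fragment."""
    lu, lv = mean_flow(left_flows)
    ru, rv = mean_flow(right_flows)
    return math.hypot(lu - ru, lv - rv)

def combine_cues(pb, motion, w_pb=2.0, w_motion=3.0, bias=-2.5):
    """Logistic fusion of appearance (Pb) and motion cues.
    The weights are illustrative values, not learned parameters."""
    z = w_pb * pb + w_motion * motion + bias
    return 1.0 / (1.0 + math.exp(-z))

# A strong appearance edge with identical motion on both sides
# (for example, a surface marking).
same_motion = motion_cue([(1.0, 0.0)] * 4, [(1.0, 0.0)] * 4)
surface_marking = combine_cues(pb=0.8, motion=same_motion)

# The same appearance edge with opposing motion on the two sides
# (for example, an occlusion boundary).
diff_motion = motion_cue([(1.0, 0.0)] * 4, [(-1.0, 0.0)] * 4)
occlusion = combine_cues(pb=0.8, motion=diff_motion)
```

In this toy setting, the two fragments have the same Pb response, and only the motion difference separates them, which is the motivation for adding the second cue.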

The local motion cues are then combined with the appearance cues (the output of the Pb detector) and with region information from a coarse segmentation of the image to form connected boundaries. Two examples are shown below, each with its input video clip and output boundaries.


Movie	Movie
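One way to picture how region information can link local cues into connected boundaries is sketched here. It assumes a coarse segmentation is given as an integer label map and that each border between two adjacent regions is scored by averaging the local boundary probability along it. The function name `region_pair_scores` and this averaging rule are assumptions made for illustration, not the procedure from the referenced papers.

```python
from collections import defaultdict

def region_pair_scores(labels, edge_prob):
    """Average the local boundary probability along the shared border
    of every pair of 4-adjacent regions in a label map."""
    h, w = len(labels), len(labels[0])
    sums = defaultdict(float)
    counts = defaultdict(int)
    for r in range(h):
        for c in range(w):
            # Look only down and right so each neighbor pair is visited once.
            for nr, nc in ((r + 1, c), (r, c + 1)):
                if nr < h and nc < w and labels[r][c] != labels[nr][nc]:
                    pair = tuple(sorted((labels[r][c], labels[nr][nc])))
                    sums[pair] += 0.5 * (edge_prob[r][c] + edge_prob[nr][nc])
                    counts[pair] += 1
    return {pair: sums[pair] / counts[pair] for pair in sums}

# Toy example: two regions side by side, with high local boundary
# probability along their shared border.
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1]]
edge_prob = [[0.1, 0.8, 0.9, 0.1],
             [0.1, 0.6, 0.7, 0.1]]
scores = region_pair_scores(labels, edge_prob)
```

Scoring whole region borders rather than isolated pixels means that every retained boundary is a connected fragment by construction.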

One use of the estimated boundaries is in combination with segmentation algorithms. For example, the figure below compares the results of a standard segmentation algorithm when it is given raw contour data and when it is given the estimated boundaries. More generally, the boundaries can be used for segmenting foreground objects from the background.

After foreground/background segmentation, it is possible in principle to use the pixels near the edge to construct appearance models. The appearance models could be as simple as color histograms, for example. Once such a model is obtained, we can leverage the numerous recent advances in interactive image and video segmentation and matting. In these approaches, the foreground and background pixels hand-labeled by the user specify the foreground and background models that drive the segmentation. If we instead provide those constraints automatically, using our detected occlusion boundaries and their associated notion of foreground and background, the result would be a fully automatic segmentation of objects in the scene. Starting with the input scene in the upper left and moving to the right, we could first extract edges and then classify those edges into surface markings (black) and occlusion boundaries (white). In addition, it is possible, in principle, to detect which side of each occlusion boundary is foreground, as indicated by the blue arrows.
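The color-histogram appearance models mentioned above can be sketched as follows. The sketch assumes that seed pixels have already been sampled on the foreground and background sides of a detected occlusion boundary, and it scores a new pixel by the log-ratio of its smoothed likelihood under the two histograms. All function names, the bin count, and the add-one smoothing are illustrative assumptions. A real system would pass such scores to an interactive-segmentation or matting method in place of user strokes.

```python
import math
from collections import Counter

def quantize(rgb, bins=4):
    """Map an (r, g, b) pixel with 0-255 channels to a coarse bin index."""
    step = 256 // bins
    return tuple(channel // step for channel in rgb)

def build_histogram(pixels, bins=4):
    """Count seed pixels per color bin; returns (counts, n_pixels, bins)."""
    return Counter(quantize(p, bins) for p in pixels), len(pixels), bins

def likelihood(model, rgb):
    """Add-one smoothed probability of a pixel's color bin under a model."""
    counts, n, bins = model
    return (counts[quantize(rgb, bins)] + 1) / (n + bins ** 3)

def foreground_log_odds(fg_model, bg_model, rgb):
    """Positive values favor foreground, negative values favor background."""
    return math.log(likelihood(fg_model, rgb) / likelihood(bg_model, rgb))

# Hypothetical seeds sampled near an occlusion boundary: reddish pixels on
# the foreground side, bluish pixels on the background side.
fg_model = build_histogram([(200, 30, 30), (210, 40, 35), (220, 25, 20)])
bg_model = build_histogram([(20, 30, 200), (30, 40, 210), (25, 35, 220)])

reddish = foreground_log_odds(fg_model, bg_model, (205, 35, 30))
bluish = foreground_log_odds(fg_model, bg_model, (25, 35, 205))
```

Here `reddish` comes out positive and `bluish` negative, so the automatically sampled seeds play the role that hand-labeled strokes play in interactive methods.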

Object recognition is another area where the use of boundary information is crucial, but not fully exploited. Many object recognition approaches rely on appearance features computed by aggregating image information within local patches. One issue with these approaches is that the patches may cross object boundaries, resulting in many unusable large-scale features that mix information from objects and background. A more problematic issue is that, since the local features are essentially a convenient means of representing local texture, they are far less discriminative for objects that are characterized primarily by their shape. This has been addressed recently by recognition techniques that use contour fragments instead of regional descriptors. This addresses part of the problem, but a remaining issue is that many of the contour fragments may be irrelevant if they correspond to spurious intra-category variations in the appearance of the object, rather than capturing useful shape information. Using boundaries should, in principle, force the model to focus on those fragments that capture shape. We are working to combine a category recognition approach with the boundary detection techniques. Our proposed recognition approach supports semi-supervised category learning and can operate directly from contour fragments. Importantly, the recognition approach can also incorporate other regional features based on appearance. Therefore, as before, we do not advocate that boundaries or contours alone are sufficient for recognition. Our more limited goal is to show how they can be used effectively to exploit shape information in a category recognition setting.

In addition, we are exploring the use of boundaries as a bridge between segmentation and recognition for generating candidate object locations in an input image. Many recognition approaches operate from a database of known categories and features on which they have been trained. The system then functions in a top-down manner, trying to find model features and deciding (via some spatial reasoning, for example) whether a particular object exists at a particular location. On the other hand, a system that uses bottom-up cues from boundaries to reason about the existence of an object (that is, any generic object) within the scene could first propose locations of potential objects, as a cueing mechanism, thereby directing the recognition scheme to the most fruitful locations within the scene and removing surrounding background clutter from consideration. In addition, the ability to extract potential objects from a scene automatically may have implications for unsupervised learning and discovery of novel objects, since each new object would not necessarily need to be manually extracted from its environment. This could potentially also allow for simultaneous in situ learning of objects and their context.
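As a toy illustration of bottom-up cueing, the sketch below assumes a binary occlusion-boundary map is available and proposes a bounding box for every region that is fully enclosed by boundary pixels. The function name `propose_boxes` and the closed-contour assumption are hypothetical simplifications for illustration, not the project's method. Real boundaries are rarely closed, so a practical system would need to handle gaps.

```python
from collections import deque

def propose_boxes(boundary):
    """Return (top, left, bottom, right) boxes of 4-connected regions
    that are enclosed by boundary pixels (1s) and never reach the
    image border."""
    h, w = len(boundary), len(boundary[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for r0 in range(h):
        for c0 in range(w):
            if boundary[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first flood fill of one non-boundary region.
            queue = deque([(r0, c0)])
            seen[r0][c0] = True
            top, left, bottom, right = r0, c0, r0, c0
            touches_border = False
            while queue:
                r, c = queue.popleft()
                top, bottom = min(top, r), max(bottom, r)
                left, right = min(left, c), max(right, c)
                if r in (0, h - 1) or c in (0, w - 1):
                    touches_border = True
                for nr, nc in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
                    if (0 <= nr < h and 0 <= nc < w
                            and not boundary[nr][nc] and not seen[nr][nc]):
                        seen[nr][nc] = True
                        queue.append((nr, nc))
            if not touches_border:
                boxes.append((top, left, bottom, right))
    return boxes

# Toy boundary map: one closed contour around a 2x2 interior.
boundary = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
candidates = propose_boxes(boundary)
```

Each candidate box could then be handed to a category recognizer, so that recognition runs only at locations the boundaries suggest and ignores the surrounding clutter.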

Data Set

The data set used in the first two references below is now available, including video clips, ground truth labels, and reference frames. Feel free to use this data set. Please acknowledge the first reference below if you use the data set in published work.

References

A. Stein, D. Hoiem, and M. Hebert, Learning to Find Object Boundaries Using Motion Cues. IEEE International Conference on Computer Vision (ICCV), October, 2007.

A. Stein and M. Hebert, Combining Local Appearance and Motion Cues for Occlusion Boundary Detection. British Machine Vision Conference (BMVC), September, 2007.

A. Stein and M. Hebert, Local Detection of Occlusion Boundaries in Video. British Machine Vision Conference (BMVC), September, 2006.

A. Stein and M. Hebert, Using Spatio-Temporal Patches for Simultaneous Estimation of Edge Strength, Orientation, and Motion. Proc. Beyond Patches Workshop, IEEE Conference on Computer Vision and Pattern Recognition, June, 2006.

Funding

This research is supported by:

NSF Grant IIS-0713406
KIST