
ORWELL: Removal of Tracked Objects in Digital Video

Alfred Hong, Heji Kim, Lin-hsian Wang

Department of Computer Science
324 Upson Hall
Cornell University
Ithaca, NY 14853-7501 US

http://www.cs.cornell.edu/Info/Projects/zeno/rivl/rivl.html


Table of Contents

1. Introduction
2. Background Information
3. Specifics
4. Evaluation
5. Related Work and Extensions
6. References

1. Introduction

Object tracking in a sequence of images can provide a base for a multitude of digital video processing applications, such as the removal of an object from a scene. Although numerous video-processing editors are available, object-tracking and removal (OTR) remains a mostly manual process. Using the existing object-tracking feature in RiVL, we implement a semi-automated application that allows the user to specify and remove an object, then reconstructs the background to produce a new video sequence. Our work primarily focuses on algorithms for the domain of stationary backgrounds with a single moving object.

In addition to OTR, we also extend this work to segment the tracked object from the background; the resulting segmentation can be used for a variety of video processing effects, such as overlaying the tracked object on top of a different sequence. The resulting application is an ideal test bed for experimenting with various OTR and segmentation algorithms. We reconstruct the background and use different techniques for segmentation, as illustrated in the diagram below.


Figure 1:
Orwell OTR and Segmentation Overview

The rest of the paper is organized as follows: Section 2 provides background on RiVL and the Hausdorff tracker, Section 3 details our object-tracking, background reconstruction, and segmentation algorithms, Section 4 evaluates the results, and Section 5 discusses related work and possible extensions.

[Key words: object-tracking, Hausdorff distance, object-removal, segmentation, background reconstruction, image filtering]


2. Background Information

RiVL

RiVL is a resolution-independent video language that has video and audio as first-class data types. Jonathan Swartz implemented RiVL as a Tcl/Tk extension for multimedia processing. The high-level operators used in RiVL are independent of video format and resolution and provide the necessary infrastructure to test our ideas.

RiVL_GenC

RiVL_GenC generates the C code for RiVL functions that need low-level image processing routines not already included in the RiVL library. Our implementations of the median and mean filters use functions generated by RiVL_GenC for their pixel-level computations.

Hausdorff Tracker

The Hausdorff tracker is a feature-based object tracking system for a continuous sequence of images. The model of the tracked object is represented by a binary edge map, produced by applying a Canny edge operator to a smoothed, gray-level version of the input image. Taking advantage of the fact that the motion of the object is roughly an affine transformation between any two consecutive frames, the algorithm matches all possible translations and scales of the model against a specified search window (shown as a red dotted box in Figure 2). Generally, the best match is the transformation of the model whose points overlap the most edge points in the image. Since this best match becomes the model for the next image, once the tracker begins to wander, the results can deteriorate quickly.
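To make the matching step concrete, here is a minimal sketch in Python (not the RiVL tracker): it scores each candidate translation of the model's edge points by how many land on edge pixels of the current frame. The scale loop is omitted for brevity, and the window representation is an assumption of the sketch.

    import numpy as np

    def best_translation(model_pts, edge_map, window):
        """Sketch only, not the RiVL implementation.
        model_pts: (N, 2) int array of (row, col) model edge coordinates.
        edge_map:  binary 2-D array (Canny edges of the current frame).
        window:    (row_lo, row_hi, col_lo, col_hi) search window."""
        h, w = edge_map.shape
        best_score, best_shift = -1, (0, 0)
        for dr in range(window[0], window[1] + 1):
            for dc in range(window[2], window[3] + 1):
                pts = model_pts + np.array([dr, dc])
                # Keep only translated points that fall inside the image.
                inside = ((pts[:, 0] >= 0) & (pts[:, 0] < h) &
                          (pts[:, 1] >= 0) & (pts[:, 1] < w))
                p = pts[inside]
                # Score = number of model points landing on image edges.
                score = int(edge_map[p[:, 0], p[:, 1]].sum())
                if score > best_score:
                    best_score, best_shift = score, (dr, dc)
        return best_shift, best_score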


Figure 2:
Hausdorff tracking algorithm explained


3. Specifics

This section discusses the implementation of the algorithms accessible from the Orwell Editor: object tracking (3.1), background reconstruction assuming a stationary camera (3.2), and segmentation (3.3).

3.1 Object-Tracking: Hausdorff Tracker

The tracker in RiVL returns scale and translation coordinates for each image. Performance of the tracker depends on setting the correct parameters for the search, i.e. the size of the search window, the scaling factors, and the forward and backward distances, which limit the allowed dissimilarity in a match. We must trade off the laxness of these constraints against the processing time required to track the object. The Hausdorff tracker also works better for larger images.


3.2 Background Reconstruction

We need the background to replace the tracked object in the original sequence and, possibly, to segment the object. We experiment with three different approaches to background reconstruction: a temporal mean filter, a temporal median filter, and a physical space search.

Temporal Mean Filter (TMEF)

The TMEF technique computes, for each pixel location, the arithmetic mean over the whole frame sequence and assigns this result to the corresponding pixel in the background frame. This technique averages out the tracked object in the scene, with a possible blurring effect. We implement this filter by averaging each of the RGB values independently.
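As a minimal sketch of the idea outside RiVL (numpy stands in for the RiVL_GenC code; the (T, H, W, 3) frame layout is an assumption):

    import numpy as np

    def temporal_mean(frames):
        """Sketch, not the RiVL_GenC code.
        frames: (T, H, W, 3) uint8 array of RGB frames.
        Averages each RGB channel independently across time."""
        return frames.astype(np.float64).mean(axis=0).astype(np.uint8)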

Temporal Median Filter (TMDF)

TMDF builds the background frame by computing, for each pixel location, the median pixel value over the video sequence. This technique relies on the assumption that any portion of the tracked object occupies any one particular location in less than half of the image frames. We implement this filter by finding, for each pixel, the frame whose gray-level value is the median, and then reconstructing the background using that frame's corresponding RGB values.
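A corresponding sketch of TMDF, again assuming the (T, H, W, 3) layout; this illustrates the gray-level median-index idea, not the actual RiVL_GenC code:

    import numpy as np

    def temporal_median(frames):
        """Sketch, not the RiVL_GenC code.
        For each pixel, pick the frame whose gray level is the median,
        then copy that frame's RGB values, keeping channels consistent."""
        gray = frames.astype(np.float64).mean(axis=3)   # (T, H, W)
        order = np.argsort(gray, axis=0)                # sort frames per pixel
        median_idx = order[gray.shape[0] // 2]          # (H, W) frame indices
        rows, cols = np.indices(median_idx.shape)
        return frames[median_idx, rows, cols]           # (H, W, 3) background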

Both temporal filters are pixel-level operations we wrote with RiVL_GenC. RiVL_GenC allows at most twenty frames to be passed as input to a function, and because a median of medians is not the true median, we could not implement a true median function over the entire video sequence. Instead, we compute the median for several different samples -- each sample consisting of twenty frames spaced at equal intervals -- and allow the user to choose the best result.
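For illustration, equally spaced twenty-frame samples might be drawn as follows (the number of samples and their offsets are assumptions of the sketch); each candidate background is then temporal_median(frames[idx]) and the user eyeballs the results:

    import numpy as np

    def sample_indices(n_frames, sample_size=20, n_samples=3):
        """Sketch: n_samples index sets, each of sample_size frames
        spaced at equal intervals and offset from the previous set."""
        step = n_frames // sample_size
        return [np.arange(s, n_frames, step)[:sample_size]
                for s in range(n_samples)]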

Physical Space Search

Physical space search finds, for each frame, another frame in which the bounding box of the tracked object does not overlap the one in the currently processed frame; that is, the part of the background needed to replace the object comes from a frame in which that region is not occupied by the object. Using assumptions of motion continuity, we search for the background for the current image near the frame where we found the background for the previous frame; this way we avoid a comprehensive search. For the initial frame, we must search the entire sequence for all possible background replacements. Although we prefer the closest frame that contains the background, we also want to find multiple frames in which the background is visible, in case another moving object has moved into the background region. It is also possible to partition the bounding box into smaller blocks and search for the background in pieces.
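The following sketch illustrates the search, assuming the tracker supplies one bounding box per frame; the box format and the outward search order are our assumptions, not the Orwell code:

    def overlaps(a, b):
        """Boxes are (row0, col0, row1, col1); true if they intersect."""
        return not (a[2] <= b[0] or b[2] <= a[0] or
                    a[3] <= b[1] or b[3] <= a[1])

    def find_background_frame(boxes, t, start):
        """Sketch: find a frame whose object box does not overlap frame
        t's box, searching outward from `start` (the hit for the previous
        frame) to exploit motion continuity."""
        n = len(boxes)
        for d in range(n):
            for j in (start - d, start + d):
                if 0 <= j < n and j != t and not overlaps(boxes[j], boxes[t]):
                    return j
        return None  # no clean background exists in the sequence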

If we assume a single moving object in the sequence, then it is possible to use one frame, with the object removed and the background reconstructed, as the background for the entire sequence. However, due to shifting lighting levels, it is desirable to reconstruct the scene for every frame or every block of frames. Figure 3 shows the result of reconstructing the background covered by the subject's head.


Figure 3:
Sequence illustrating object-tracking and background reconstruction

3.3 Object Segmentation

Image Segmentation

Segmentation, or separating the tracked object from the background, is one of the core problems in vision that has yet to be adequately solved for unconstrained settings. We explore motion differencing, second differencing, and background subtraction for this classical problem.


Figure 4:
Segmentation methods

A. Image Differencing

Motion differencing thresholds the difference of two consecutive images to produce a binary image indicating the regions of motion. We extend motion differencing to use three consecutive frames: with second differencing, we perform a binary AND of the difference image of the first two frames and that of the last two frames to segment out the moving object in the middle frame. Moving objects are more clearly segmented when the object overlaps itself less between consecutive images, so we choose the three frames such that there has been sufficient motion.
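Both differencing variants are straightforward on grayscale frames; in this sketch the threshold value is an assumption:

    import numpy as np

    def motion_diff(f0, f1, thresh=20):
        """Binary motion mask from two consecutive grayscale frames.
        Cast to int16 to avoid uint8 wrap-around in the subtraction."""
        return np.abs(f1.astype(np.int16) - f0.astype(np.int16)) > thresh

    def second_diff(f0, f1, f2, thresh=20):
        """AND of the two pairwise difference images segments the
        moving object in the middle frame f1."""
        return motion_diff(f0, f1, thresh) & motion_diff(f1, f2, thresh)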



B. Background Subtraction

Background subtraction thresholds the difference between the reconstructed background and the image containing the moving object. This technique works well only when used with a faithful copy of the background.
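Given a reconstructed background, for example from the temporal median sketch above, the subtraction is a single thresholded difference; the threshold and the per-channel maximum are assumptions of the sketch:

    import numpy as np

    def background_subtract(frame, background, thresh=25):
        """Sketch: binary object mask of pixels differing from the
        background; frame and background are (H, W, 3) uint8."""
        diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
        return diff.max(axis=2) > thresh   # max over the RGB channels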


4. Evaluation


Figure 5: Input video sequence



We used the above video sequence of 200 frames as one of the inputs for our tests. Images were recorded as Motion JPEG with a Sun Microsystems camera using a Parallax board.



Figure 6:
Temporal filter results (left: mean filter; right: median filter)

The example images above are the results from the temporal filters. The reconstructed background image from the mean filter has a slightly visible blurring effect caused by the moving object. As the number of frames in the video increases, this effect becomes negligible. We can further process the mean-filter result with a smoothing function, and then a sharpening function, to further reduce the shadowing effect.
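One plausible realization of the smooth-then-sharpen step, with scipy standing in for the RiVL operators; the kernel width and sharpening amount are assumptions:

    import numpy as np
    from scipy.ndimage import gaussian_filter

    def smooth_then_sharpen(img, sigma=1.5, amount=0.7):
        """Sketch: Gaussian smoothing to suppress the blur artifact,
        then unsharp masking to restore edge contrast.
        img: (H, W, 3) uint8 image."""
        img = img.astype(np.float64)
        smoothed = gaussian_filter(img, sigma=(sigma, sigma, 0))
        blurred = gaussian_filter(smoothed, sigma=(sigma, sigma, 0))
        sharp = smoothed + amount * (smoothed - blurred)
        return np.clip(sharp, 0, 255).astype(np.uint8)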


Figure 7:
Segmentation results for background subtraction

We choose the "eyeball test", a metric commonly used by many vision researchers, to determine the quality of the segmentation. Background subtraction produced the best segmentation, with the smoothest edges and the fewest holes within the object. Motion differencing performed the worst, since it tends to give an irregular outline of the motion and includes portions of the background that belong to the object in the previous image but not in the current one; this effect appears as an undesirable white outline around the object in the right pair of images below. The second differencing method shows improved results over regular motion differencing, but is still not as solid as background subtraction. Second differencing does have an advantage over background subtraction in that reconstructing the background is not necessary. Some sort of post-filtering is necessary in all cases to fill in the holes and smooth the edges.
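As one standard choice for such post-filtering (not necessarily what Orwell uses), a morphological closing fills holes and an opening removes speckle; the structuring-element size is an assumption:

    import numpy as np
    from scipy.ndimage import binary_closing, binary_opening

    def clean_mask(mask, size=5):
        """Sketch: fill small holes, then smooth the mask boundary."""
        se = np.ones((size, size), dtype=bool)
        mask = binary_closing(mask, structure=se)   # fill interior holes
        return binary_opening(mask, structure=se)   # remove stray speckle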


Figure 8:
Segmentation results for second differencing and motion differencing

We set out with a two-fold goal: object removal and object segmentation. The overall quality of object removal depends on the accuracy of the Hausdorff tracker and the fidelity of the reconstructed background. We feel we have accomplished OTR as long as the background is visible somewhere in the sequence. We have had less success with segmentation, and leave much room for future improvements.


5. Related Work and Extensions

Multimedia and vision are highly experimental areas embodying numerous possibilities. Although tracking and object segmentation are active areas of research in vision, there appears to be virtually no established work on automating object removal using background reconstruction.

We can extend this project along several orthogonal directions, such as relaxing our assumptions of a stationary background and a single moving object.


6. References

[1]
Huttenlocher, D.P., Noh, J.J., and Rucklidge, W.J. Tracking Non-Rigid Objects in Complex Scenes. Proceedings of the Fourth International Conference on Computer Vision (1993), 93-101.
[2]
Jain, R., Kasturi, R., and Schunck, B.G. Machine Vision. McGraw-Hill, 1995.
[3]
Ousterhout, J.K. Tcl and the Tk Toolkit. Addison-Wesley, 1994.
[4]
Swartz, J. and Smith, B.C. RiVL: A Resolution Independent Video Language. Submitted to the 1995 Tcl/Tk Workshop, July 1995, Toronto, Canada.
[5]
Teodosio, L. and Bender, W. Salient Video Stills: Content and Context Preserved. Proceedings of ACM Multimedia (1993), 39-46.