http://www.cs.cornell.edu/Info/People/barber/potrivl/potrivl.html
Sicco Tans (stans@cs.cornell.edu)
Jonathan Barber (barber@cs.cornell.edu)
CS664 Final Project
Professor Ramin Zabih
Department of Computer Science
Cornell University
In recent years, considerable effort has gone into making image-processing and vision-related algorithms easier to program, by adding layers of abstraction between the image data, the image operations themselves, and the programmer/user interface. At the same time, these higher levels of abstraction should not add to the computational complexity of the operations.
This left researchers and developers with the extraordinarily difficult problem of making multimedia/vision operations fast, efficient, and easy to use. One such effort culminated in the construction of RivL (a Resolution-Independent Video Language) [1]. RivL is a multimedia software processing package that, given a set of images (or a sequence of images), can efficiently process these multimedia streams and generate an output image (or sequence of images). RivL is implemented as a Tcl extension capable of performing common image operations such as overlay, smoothing, clipping, and cropping. It also includes more complex vision-related image processing operations, such as object tracking, which has been the focus of our work. The Tcl interface simplifies the process of writing an image/vision processing script.
In recent months, several developers have improved RivL's performance via a fine-grained parallelization scheme using a shared-memory machine and a distributed computing environment [2]. The parallelization is independent of most of the image operations resident in the RivL library (e.g. im_clip, im_smooth, im_canny). Unfortunately, this scheme does not lend itself to more complicated computer vision applications. In particular, it does not work for Object Tracking.
Bearing this in mind, we established the project goal of developing a backwards-compatible parallel implementation of Object Tracking tailored for RivL.
In Section 3.0, we introduce RivL, and describe the generic parallelization scheme. In Section 4.0, we describe the Hausdorff-based Object-Tracking algorithm implemented in RivL. In Section 5.0, we introduce the scheme for parallelizing RivL's Object Tracking operation. In Section 6.0, we describe our implementation of a parallel Object-Tracking RivL operation. In Section 7.0 we present our performance results. In Section 8.0, we present some extensions for future work and improvements in the current implementation. In Section 9.0, we draw some conclusions.
To execute, RivL requires a set of multimedia input data and a controlling RivL script. The RivL script is a sequence of Tcl-RivL commands that specify which image processing operations should be applied to the input data. Once RivL is invoked, the script is translated into the RivL graph, as pictured above. Each node corresponds to an image operator (e.g. im_smooth, im_canny, etc.), and each edge, or signal, corresponds to the actual image data. The nodes lying inside the illustrated rectangle are true image operators. The nodes lying outside the rectangle are the RivL I/O nodes: those to the left of the rectangle are read nodes (one read node per image or stream), and the node to the right of the rectangle is the write node.
We want to emphasize that constructing the RivL graph does not itself operate on any multimedia data. The RivL graph is merely the control-flow structure through which each input sequence must propagate to generate the processed output image.
Once the RivL graph has been constructed, data is processed in two phases. The first phase is a graph traversal from right to left, and it is what makes RivL an efficient image processing mechanism. The first node evaluated is the write node (the right-most node). By traversing the graph in reverse order, RivL decides at each node exactly how much data the output signal requires from the input signal. The evaluation is reverse-propagated from the write node, through the graph, back to every read node. Once the reverse propagation completes, every node in the graph knows exactly how much data from each input signal is required to compute its output signal. The multimedia data is then processed in the second phase, a left-to-right traversal of the RivL graph that propagates the input data forward, operating only on data that is relevant to the final output image.
Instead of running RivL on a single processor, RivL spawns multiple processes on different processors, and has each process work towards computing a different segment of the output data. We define the notion of a single master RivL process, and multiple slave RivL processes. Each slave process should run on a different processor. Once started, the slave process sits idle, listening for instructions from the master. During the initial setup period, the master sends each slave process a logical ID#. In addition, each slave is aware of the total number of processes "available for work".
Following the control-setup period, the master sends each slave a copy of the RivL script. Once each slave (and the master) receives the RivL script, they each generate a copy of the RivL graph, and perform the right-to-left traversal independently.
The difference in the right-to-left traversal is that the process's logical ID# and the total number of processes now factor into how much computation each process performs.
According to the figure above, the amount of data fetched from each read node is no longer a function of the output of the write node alone, but is now a function of the requested output region, the process's logical ID#, and the total number of processes.
Actual data computation (the left-to-right graph traversal) occurs when the master says "go". Each slave, along with the master process, computes its allotted portion of the output image.
The Tracker operates as follows: it looks at adjacent images in a sequence, which we will call Prev (for previous) and Next. We want to determine where the object model moved from Prev to Next. For every adjacent pair of images, the Tracker performs the following sequence of operations. It first smooths Next (using the RivL im_smooth operation) and then edge-detects it (using the im_canny operator, a Canny edge detector [3]); Prev was smoothed and edge-detected in the previous iteration. The im_search command is then invoked, which actually performs the object tracking. im_search extracts the actual "object to be tracked" from Prev, as specified by model, and then searches for an instance of the object in Next. When im_search completes, it returns a new bounding-box model, which corresponds to the location of the tracked object in Next. By modifying the RivL script, we can generate an output sequence of images that illustrates the object being tracked.
The sequence of images above illustrates the output of RivL's Object Tracker. The tracked object appears highlighted, while the rest of the image is dimmed.
The search utilizes a multi-resolution approach [5]. The image Next is evenly divided into separate regions. Each region is then pre-searched to determine whether there is "anything interesting" in it. By interesting, we mean that there is a substantial clustering of edges, within some threshold U. Each region determined to be interesting is then recursively sub-divided and pre-searched.
By recursively dividing up the image and locating only "interesting" regions, the overall search space is decreased. The Hausdorff distance comparisons between the model and the regions of interest can then proceed only on the reduced search space.
In generic parallel RivL, the output write region is sub-divided based on the process's logical ID#, and the total number of processes willing to work. In this paradigm, each process is responsible for its own portion of the output region. Computation of each output region does not rely upon the output of any other regions. In generic RivL, there is no communication between different processes for a given traversal of the RivL graph. Each process is independent of one another.
Unlike the more general operations, the output region of im_search cannot simply be sub-divided into different regions computed independently of one another, because an object being tracked may overlap several write regions. Since there is no communication between processes for a given traversal of the RivL graph, im_search will not work under generic parallel RivL.
The alternative parallelization scheme works as follows: RivL is initially invoked in only one process, the master process. The master constructs the RivL graph and performs the right-to-left traversal by itself. im_search, like any other image operation, is constructed as a RivL node. When the image sequence to be tracked is loaded and ready, each image makes its left-to-right traversal through the RivL graph. When the data reaches the im_search node, the following sequence of events happens:
Essentially, each slave process performs the pre-searches in a hungry-puppy fashion, to narrow down the scope of the overall search region. The master process is responsible for maintaining the queues. It initially places work onto the high-priority queue for the slaves to fetch. It then clears new pre-search jobs specified by each slave process from the low-priority queue, and places them back onto the high-priority queue for the next level of recursion.
Once the pre-searches have concluded, the slaves have fulfilled their tasks for the current iteration. The master then computes the Hausdorff distances between the object and the "interesting regions", and looks for the best possible match, if any. If one is found, it outputs the new bounding-box of the object, based on the current image, Next.
When the left-to-right traversal of the RivL graph hits the im_search node, RivL attaches the high- and low-priority job queue data structure to shared memory, and generates UNIX IPC semaphores to govern access to this shared object, preventing race conditions and synchronizing the parallel computation. Once the shared memory and semaphores are set up, RivL forks n slave processes.
We want to emphasize that this implementation is SPMD. The only shared data is the job queue, which is simply a data structure containing pointers to different portions of Next. The object model and image data are completely replicated in each RivL process, and reside at exactly the same address in each process's address space. The parallel computation proceeds as described above. When the slave processes are done (i.e. all interesting regions have been found), the master kills each slave and de-allocates the shared-memory segment. The master then finishes the object tracking computation. On the next traversal of the RivL graph, the above sequence of events is repeated: the master again sets up the shared memory and semaphores, re-forks, and then re-kills the slaves.
We believe that this is a very wasteful implementation of im_search. At every iteration, expensive UNIX kernel system calls are generated to set up the shared-memory segment, create the semaphores, fork the slave processes, and then kill the slaves and de-allocate the segment.
This wastefulness led us to develop a smarter implementation of im_search that re-uses resources, and improves performance of the object tracker.
The remaining structures, in particular the images' distance transforms, have a consistent size that depends on the size of the images in the sequence. In other words, knowing the size of the first image in a sequence enables a single allocation that suffices for the entire sequence. Of course, this dependence on image size is the reason a particular shared-memory segment can only be kept around for one sequence of images. Making assumptions about the maximum size of the images in a sequence would allow shared-memory segments and child processes to persist across multiple tracked sequences, but we did not make this extension.
The diagram above illustrates the contents of the shared-memory segment. The segment contains the main job queue data structure (the high- and low-priority queues). It also contains vital model and image data, along with their corresponding distance transforms.
Shared memory added some significant hurdles due to the difficulty of tracing pointers into and out of it. Some data structures remained unchanged after initialization in the child processes and were explicitly left out of shared memory for that reason. Some of these structures, however, were pointed to by structures in shared memory. The invariant that had to be maintained was that the pointers in shared memory to these constant structures could not change. The easiest way to keep track of the structures in shared memory turned out to be placing them in the same order every time and maintaining some global information about each structure's location relative to the start of the shared-memory segment.
We tested our implementation on a 50 MHz, 4-processor SPARCstation running Solaris 2.5. We tested the performance of our implementation using a master process and 1 to 4 slave processes. For comparison, we also tested the first implementation of the RivL parallel object tracker on the same sequence with 1 to 4 processors. As a control, we also tested the sequential RivL object tracker on the same sequence, on the same machine. A graph of our results appears in the following diagram.
Unfortunately, our current performance results indicate not only that our implementation is slower than the first implementation, but that it is also slower than the sequential version. However, we believe these results are not truly indicative of the advantage of our implementation over the older one. Because we ran out of time, we were unable to fully iron out the bugs and inefficiencies of our implementation and fine-tune it to reach its full potential. We believe the results do not reflect on the soundness of our ideas.
However, it is notable that our implementation scales better from 1 to 4 processors than the previous implementation. This suggests that our implementation of the parallel object tracker does improve overall performance as we increase the number of slave processes.
We are confident that a little more work will yield the results we are looking for. Intuitively, it makes sense that our implementation should run faster than the previous implementation, for the simple reason that we have significantly reduced the overhead involved in setting up and running RivL in a multi-processor environment.