Jonathan Barber (barber@cs.cornell.edu)
Sugata Mukhopadhyay (sugata@cs.cornell.edu)
CS516 Final Project
Professor Thorsten von Eicken
Department of Computer Science
Cornell University
Recently, RIVL has been extended to process continuous streams of multimedia data and to generate a corresponding output stream of images. The extended RIVL package, called CM RIVL, was made possible by treating RIVL evaluation as a midpoint in a continuous media object. This work was facilitated by using CMT (the Continuous Media Toolkit).
Image processing over continuous streams of media in real time is a very hard problem given the current state of computer technology. Performing even a simple image operation over a single sequence of images, and outputting the resultant image(s) in real time, requires on the order of a million CPU cycles; at the standard frame rate for perceiving continuous motion, 30 frames per second, even one such operation consumes tens of millions of cycles every second. Approaching a real-time image-processing frame rate of 30 frames per second would therefore require one of the following to be true:
Bearing this in mind, we established the project goal: to develop an easy-to-use, fast, and inexpensive real-time multimedia processing application.
In Section 3.0, we describe a generic method for parallelizing most of the image operations in RIVL by exploiting the way RIVL processes an input set of images. In Section 4.0, we describe two implementations of Parallel CM RIVL (PRIVL): the first is designed to run on shared-memory machines; the second is designed to run over a cluster of workstations. In Section 5.0, we present an analysis of performance results. In Section 6.0, we describe some improvements to our implementations. Finally, in Section 7.0, we draw some conclusions and assess our progress.
To execute, RIVL requires a set of multimedia input data and a controlling RIVL script. The RIVL script is a sequence of tcl-rivl commands that specify what image-processing operations should be applied to the input data. Once RIVL is invoked, the RIVL script is translated into the RIVL graph, as pictured above. Each node corresponds to some image operator (e.g. im_smooth, im_canny, etc.), and each edge, or signal, corresponds to the actual image data. The nodes lying inside the illustrated rectangle correspond to true image operators. The nodes lying outside the rectangle are the RIVL I/O nodes: those outside and to the left of the rectangle are read nodes (one read node per input image or stream), and the node to the right of the rectangle is the write node.
We want to emphasize that constructing the RIVL graph does not itself compute on any multimedia data. The RIVL graph is merely the control-flow structure through which each input sequence of data must propagate to generate the processed output image.
Once the RIVL graph has been constructed, processing data with it occurs in two phases. The first phase is a graph traversal from right to left, and it is what makes RIVL an efficient image-processing mechanism. The first node evaluated is the Write node (the right-most node). By traversing the graph in reverse order, RIVL decides at each node exactly how much data the output signal requires from the input signal. The evaluation is reverse-propagated from the write node, through the graph, back to every read node. Once the reverse propagation completes, every node in the graph knows exactly how much data from each input signal is required to compute its output signal. The multimedia data is then processed in the second phase, a left-to-right traversal of the RIVL graph that propagates the input data forward, operating only on data that is relevant to the final output image.
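To make the two traversals concrete, here is a minimal sketch in C. It is illustrative only, not the actual RIVL source: the node structure, the operator names, and the simplification that demand propagates unchanged through each operator are all our own assumptions.

    #include <stdio.h>

    #define MAX_INPUTS 4

    typedef struct Node Node;
    struct Node {
        const char *op;            /* e.g. "im_read", "im_smooth", "im_write" */
        Node *inputs[MAX_INPUTS];
        int   n_inputs;
        int   frames_needed;       /* filled in by the reverse pass */
    };

    /* Phase 1: right-to-left traversal starting at the write node.  Each
     * node records how much data its output must supply and propagates
     * the demand to its inputs (a real operator may shrink or grow the
     * demand, e.g. a temporal filter needs extra surrounding frames). */
    static void reverse_propagate(Node *n, int frames_required)
    {
        n->frames_needed = frames_required;
        for (int i = 0; i < n->n_inputs; i++)
            reverse_propagate(n->inputs[i], frames_required);
    }

    /* Phase 2: left-to-right traversal.  Inputs are evaluated first, and
     * each node touches only the frames the reverse pass asked for. */
    static void evaluate(Node *n)
    {
        for (int i = 0; i < n->n_inputs; i++)
            evaluate(n->inputs[i]);
        printf("%-10s processes %d frame(s)\n", n->op, n->frames_needed);
    }

    int main(void)
    {
        Node read   = { "im_read",   { 0 },       0, 0 };
        Node smooth = { "im_smooth", { &read },   1, 0 };
        Node write  = { "im_write",  { &smooth }, 1, 0 };

        reverse_propagate(&write, 1);  /* the write node wants one frame */
        evaluate(&write);              /* ...so exactly one frame flows  */
        return 0;
    }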
Instead of running RIVL on a single processor, we spawn multiple RIVL processes on different processors and have each process work toward computing a different segment of the output data. We define the notion of a single Master RIVL process and multiple slave RIVL processes. Each slave process is started on a different processor. Once started, a slave process sits idle, listening for instructions from a Master process. After the slave processes have been started, a Master process is created. The Master process determines how many slaves are "available for work". Once a control connection is established between the Master and every slave, the Master assigns each slave a logical ID# (the Master's ID# is 0; the slaves' ID#s range from 1 to N). After each slave is assigned an ID#, the Master sends each slave the total number of processes "available for work", followed by a copy of the RIVL script. Once each slave (and the master) receives the RIVL script, each generates a copy of the RIVL graph and performs the right-to-left traversal independently.
The difference in the right-to-left traversal now is that the logical ID# of the current process and the total number of processes become factors in determining how much computation each process performs.
According to the figure above, the amount of data fetched from each read node is no longer a function of the write node's output alone; it is now a function of: (1) the output required at the write node; (2) the logical ID# of the current process; and (3) the total number of processes available for work. A sketch of one such division appears below.
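The following C sketch shows one way the output could be split among processes by scanline range. The function name and the ceiling-division scheme are our own illustration, not necessarily the exact division PRIVL uses.

    #include <stdio.h>

    /* Process `id` (0 = master, 1..n_procs-1 = slaves) computes the
     * scanline range [start, end) of a `height`-row output frame. */
    static void my_segment(int id, int n_procs, int height,
                           int *start, int *end)
    {
        int rows = (height + n_procs - 1) / n_procs;  /* ceiling division */
        *start = id * rows;
        *end   = (*start + rows < height) ? *start + rows : height;
    }

    int main(void)
    {
        int start, end;
        for (int id = 0; id < 4; id++) {          /* 4 processes, 480 rows */
            my_segment(id, 4, 480, &start, &end);
            printf("process %d computes rows [%d, %d)\n", id, start, end);
        }
        return 0;
    }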
Actual data computation (the left-to-right graph traversal) occurs when the master says "go". Each slave, and the master process itself, then computes its apportioned portion of the output image.
The model of parallelization just described for RIVL maps smoothly to CM RIVL. With CM RIVL, there is an initial setup phase for each slave process and the master process, as previously described (the Master process sends each slave its logical ID#, the total number of processes, and a copy of the RIVL script; each RIVL process then constructs the RIVL graph and performs the right-to-left traversal).
The image processing for computing each output frame in a continuous media stream occurs as follows: (1) the continuous media object captures all input data needed for one output frame; (2) the master signals every RIVL process to begin; (3) each process performs the left-to-right evaluation over its assigned segment of the output; and (4) once every process has finished, the completed frame is emitted and the cycle repeats for the next frame.
The shared-memory implementation is illustrated above. Each RIVL process resides on a different processor, but all of the processors reside in the same machine and have access to the same shared-memory segment.
This implementation mirrors the generic parallel model described in Section 3.
Implementation Details:
Following the initial setup phase, the CMO (the continuous media object) captures all the data necessary to compute a single RIVL output frame. Once the CMO has captured the data, it tells each RIVL process to begin processing by means of an entry semaphore. Each RIVL process then reads only the data relevant to its own output via a shared-memory read. Once the left-to-right evaluation of the RIVL graph completes, the RIVL process performs a shared-memory write to the memory region, accessible by the CMO, containing the output image. The RIVL process then blocks at an exit semaphore until all of the RIVL processes have completed computation for the same frame of data. Once every RIVL process has blocked, the master RIVL process resets the exit semaphore, and each RIVL process waits again at the entry semaphore until the CMO again releases it.
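A minimal sketch of this per-frame handshake follows, assuming System V semaphores (the text says only "UNIX semaphores"). The semaphore layout, the helper names, and the master-side barrier are illustrative assumptions, not the actual PRIVL code.

    #include <sys/types.h>
    #include <sys/ipc.h>
    #include <sys/sem.h>

    /* Semaphore 0 is the entry gate set by the CMO; semaphore 1 is the
     * exit barrier.  Both live in one System V semaphore set `semid`. */
    static void sem_change(int semid, int semnum, int delta)
    {
        struct sembuf op;
        op.sem_num = (unsigned short)semnum;
        op.sem_op  = (short)delta;     /* negative blocks until possible */
        op.sem_flg = 0;
        semop(semid, &op, 1);
    }

    /* Per-frame loop for one RIVL process (id 0 is the master). */
    void frame_loop(int semid, int my_id, int n_procs,
                    void (*evaluate_graph)(int))
    {
        for (;;) {
            sem_change(semid, 0, -1);   /* sleep at the entry semaphore
                                           until the CMO releases us     */
            evaluate_graph(my_id);      /* left-to-right pass over our
                                           segment, via shared memory    */
            sem_change(semid, 1, +1);   /* report completion             */
            if (my_id == 0)
                /* Master: block until all n_procs processes have
                 * reported, i.e. reset the exit semaphore to zero.  The
                 * CMO can then emit the frame and re-open the entry
                 * gate with n_procs increments. */
                sem_change(semid, 1, -n_procs);
        }
    }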
The networked implementation is illustrated above. Each RIVL process resides on a different processor, and each processor resides on a different machine.
This implementation also mirrors the generic parallel model described in Section 3.
Implementation Details:
Like its shared-memory counterpart, this model performs the initial setup, using IP multicast to establish the Active Message connections from the master to each slave RIVL process. The CMO captures all data necessary to compute a single RIVL output frame. This model differs from the generic model in that the master process knows exactly what portion of the input data each RIVL process needs to evaluate its RIVL graph. Once the CMO has captured the necessary data, it tells each RIVL process to begin processing by issuing a gam_store() to that process. When the message is received, a handler is invoked telling the RIVL process that it can begin evaluating its RIVL graph on the transferred data. Once the output data is computed, the RIVL process issues a gam_store() back to the Master process, specifying exactly where the data should be stored in the final output image buffer managed by the CMO. A handler routine in the Master process then updates a "received-from" list. Once the Master has received data from every RIVL process, the CMO outputs the computed frame and begins processing the next multimedia frame.
The process-synchronization mechanism is implicit in the data transfer itself: a RIVL process cannot begin evaluating its graph on a given frame segment until it receives an Active Message from the Master process. Similarly, the Master process cannot move on to the next multimedia image until it receives an Active Message from each slave process.
Another subtle point is that by having the Master determine how much of the input data each RIVL process requires, rather than having each RIVL process determine this for itself, we reduce round-trip communication between master and slave. Having each RIVL process compute its own region would require a gam_request(), followed by a gam_reply() from the Master process. Instead, the Master decides how much data each RIVL process needs and simply issues a single gam_store().
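The master's per-frame dispatch might look roughly like the sketch below. The gam_store() signature shown is an assumption (it is not reproduced here), the local stub only models the store-then-invoke-handler behaviour, and all names are illustrative.

    #include <stdio.h>
    #include <string.h>

    typedef void (*handler_t)(int src, void *arg);

    /* Stub modelling an assumed GAM-style bulk transfer: copy nbytes
     * from a local buffer to a remote address on node `dest`, then run
     * `handler` on the destination once the data has landed. */
    static void gam_store(int dest, void *local, void *remote,
                          int nbytes, handler_t handler, void *arg)
    {
        memcpy(remote, local, nbytes);   /* stands in for the network DMA */
        handler(0, arg);                 /* handler fires on the receiver */
        (void)dest;
    }

    /* Runs on the slave once its input segment has arrived: it may now
     * start the left-to-right evaluation of its RIVL graph. */
    static void begin_eval_handler(int src, void *arg)
    {
        printf("slave: segment from node %d arrived, evaluating graph\n", src);
        (void)arg;
    }

    /* Master: push each slave exactly the input region its sub-graph
     * needs.  Because the master computes the regions itself, one
     * gam_store() per slave suffices; no gam_request()/gam_reply()
     * round trip is needed. */
    static void dispatch_frame(unsigned char *frame, int n_slaves,
                               unsigned char *remote_inbuf[],
                               const int seg_off[], const int seg_len[])
    {
        for (int s = 1; s <= n_slaves; s++)
            gam_store(s, frame + seg_off[s], remote_inbuf[s],
                      seg_len[s], begin_eval_handler, NULL);
    }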
When the Master process received an Active Message from a slave process, the message attempted to invoke an AM handler at a virtual address that was valid in the slave's address space but not in the Master's. The situation was symmetric when a slave process received an Active Message from the Master.
We overcame this shortcoming by modifying the Active Messages source code. The modification allows an application to register a handler with the Active Messages library by calling a registration routine, sketched below.
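The original text does not preserve the exact name of the registration function, so gam_register_handler() and the table size below are stand-ins; the table-indexed dispatch itself follows the description that comes next.

    typedef void (*Handler_t)(int src, void *arg);

    #define MAX_HANDLERS 64
    static Handler_t handler_table[MAX_HANDLERS];
    static int       n_handlers;

    /* Register a handler; the returned "hid" is what travels on the
     * wire in place of a virtual address. */
    int gam_register_handler(Handler_t handler)
    {
        handler_table[n_handlers] = handler;
        return n_handlers++;               /* logical handler ID# */
    }

    /* On message arrival, the library maps the hid back to a handler
     * address that is valid in the receiving process. */
    void dispatch_handler(int hid, int src, void *arg)
    {
        handler_table[hid](src, arg);
    }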
"Handler_t handler" corresponds to the handler's virtual address. The process returns an "hid", which is an integer, but stands for "handler ID#". In our implementation, since only the Master executable and slave executable are different, the Master and each slave must register their handlers with the Active- Message's library. Now, when a process sends an Active Message (from slave to master and vice versa), it no longer ships the processes's virtual address of the handler, but rather, ships a logical ID#, corresponding to the handler to be invoked. The Active-Message's library maintains a look-up table that is indexed by the logical ID#. The logical ID# corresponds to a process handler's virtual address, which is then invoked from Active-Messages.
From our graphed results above, the shared-memory implementation performs somewhat better than the networked implementation. Both implementations, however, perform better than their serial counterparts (the green bars). One observation was that the networked implementation exhibited a large spread of timings across frames, which we attribute to our processes being preempted. This behavior was not visible in the shared-memory implementation because there a process sleeps while waiting for the semaphores to change, whereas a process in the networked implementation busy-waits. An interrupt-driven implementation of Active Messages should cure this.
Note: in all tests, the processors are of roughly equal speed.
Modifying the networked implementation should prove more troublesome: while improving the overall load balance, it will probably increase communication overhead, as more Active Messages will have to be sent and processed.
Modifying the shared-memory version should be easier. The current synchronization mechanism is implemented using UNIX semaphores: no RIVL process is allowed to begin executing the next frame until all RIVL processes have completed execution of the current frame, and the output image is divided up by the number of processes available for work. We could improve the load balance of this implementation in two ways (sketched below): (1) divide the output-image work regions into more numerous, smaller segments; and (2) within a frame, allow a RIVL process that has finished its output segment to grab another segment from a Still-Need-to-be-Computed Queue residing on the Master process. This would improve load balance by allowing less-busy processes to contribute more to the output image, while giving busier processors the time they need to compute their data without becoming a bottleneck for the entire frame.
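As a sketch of how segment-grabbing might work: for simplicity, a shared atomic counter stands in for the master's Still-Need-to-be-Computed Queue, and all names are hypothetical.

    #include <stdatomic.h>

    #define N_SEGMENTS 32          /* many more segments than processes */

    /* The counter would live in the shared-memory segment; a real
     * implementation could equally keep a queue on the Master guarded
     * by a semaphore. */
    static atomic_int next_segment;

    /* Each RIVL process repeatedly claims the next unclaimed segment of
     * the current frame until none remain. */
    void worker(int my_id, void (*compute_segment)(int owner, int seg))
    {
        for (;;) {
            int seg = atomic_fetch_add(&next_segment, 1);
            if (seg >= N_SEGMENTS)
                break;             /* this frame is fully claimed */
            compute_segment(my_id, seg);
        }
    }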
We do not have results for more than four processors. However, by examining our results, we can determine that under the current implementations, the processes running Parallel CM RIVL will not be load-balanced.
Unfortunately, we must conclude that our implementations, as they stand, will not scale to the upwards of 30 processors needed to achieve the desired frame rate. However, further work is under way to address this load-balancing problem. Furthermore, a "Hungry-Puppy" object-tracking algorithm is currently being incorporated into PRIVL; the experimental results from this should be available shortly.
We have, however, made significant progress in parallelizing CM RIVL. CM RIVL is a non-trivial application, and our parallelization scheme works for most of the standard RIVL image operations.