MIME-Version: 1.0 Server: CERN/3.0 Date: Sunday, 01-Dec-96 20:09:19 GMT Content-Type: text/html Content-Length: 12013 Last-Modified: Monday, 06-May-96 15:59:30 GMT Parallel Raytracing in CC++

Parallel Raytracing in CC++

CS516 Final Project - Spring '96

Vineet Ahuja
Amith Yamasani

Abstract

This project is a parallelisation of a public domain raytracer. It is implemented in CC++ (Compositional C++) on the SP2. Raytracing in itself is easily parallelised, as the screen can be split up into several areas and each area can be given to one processor. The problem arises when antialiasing needs to be performed to reduce the aliasing effects due to finite sampling. Due to this, boundaries between processors need to be kept at a minimum. One also needs to worry about balancing the load well between the processors, depending on the complexity of the scene. Transferring all results back to one processor to write to disk also becomes an issue. We have attempted to tackle these problems and come up with as efficient a solution as possible, given the constraints of the language and the size of the problem set.


1.Introduction

CC++

Compositional C++, is a language that supports structured parallel programming that is being developed at Caltech by the Computational Biology group. It provides structured parallelism in the form of par blocks and parfor loops and unstructured parallelism in the form of the spawn statement. Functions preceeded by the keyword atomic are used to implement mutual exclusion for functions that work on shared data. Sync variables are used for synchronization and the global keyword is used to modify pointers so that they can refer to local and remote memory. Sync variables work by suspending the thread that tries to read the variable till the variable is written to by another thread. Processor objects are objects that control the computation on a processor and they are defined in the regular manner except that the global keyword is used before the class definition.

Data transfer functions have to be defined while calling functions that lie on another processor object so that data like arrays and other user defined data structures can be copied and sent to the other processor.

Raytracing

We picked up a public domain implementation of a raytracer called RayLab, and worked on parallelising that.

Raytracing is a method of rendering graphics scene that considers the rays entering the viewers eye. The method traces the path of those rays (from the eye to the scene) and calculates the intensity and color of that ray, depending on the ray's path and reflections. As calculating the ray value is independent from the calculation of all other rays, parallelisation is trivial.

The difficulty in parallelisation arises from the need to perform antialiasing. Antialiasing is used to reduce the jaggies in the ouptut image caused by the finite sampling rate of the screen (i.e. a finite resolution). This is done for each pixel by considering the immediate pixels to the east, the south and the south-east.


2.Strategies for Parallelisation



Strategy 1 - Blocked row distribution

The easiest method is to divide the screen into n strips (where n is the number of processors) and allow each processor to ray trace its own strip. The problem with this method is load balancing. For example, if all the objects in the scene were to lie on strip n, then the nth procesor would have to do the most work leaving the first n-1 processors idle after they have completed their own strips.

Strategy 2 - Column and row cyclic distribution

Another static division method, this one divides the screen into small squares of n x n pixels (where n x n is the number of processors). Each of the pixels in that square goes to a different processor. This is basically the column and row cyclic division (see the figure above). The problem with this method is that doing anti-aliasing is no longer straight forward as all the current pixel's neighbours are on different processors, and performance becomes an issue if a processor has to go across the network (connecting the processors) to anti-alias each pixel.

Strategy 3 - Hungry Puppies

This method works by dividing up the screen into strips of m lines each. A processor is designated as the master and the rest of the nodes are slaves. Each slave node requests work from the master and goes back to ray trace its strip. On completion it returns the quantized ray traced strip and gets more work from the master (by quantization we mean converting the values of the RGB value of the pixel from float to char to throw on the display). This goes on till all the strips have been ray traced. (So each slave is a hungry puppy that runs to the master for work) The master node then writes the entire raytraced scene to disk.

Another problem is antialiasing. What does a processor do with the last row of its strip. It needs the next row (which is on the another processor) to antialias it. We have two different implementations to handle this problem. In one we compute the next row on this processor as well (so doing the computation twice - once on this one and once on the processors that actually owns the line). In the other, we send the unquantized last and first rows of each strip (in addition to the quantized values of all but the last row of each strip) so that the master node can antialias the last rows of all the strips therefore avoiding any redundant computation. At this time this version has a bug in it so all performance results discussed in the next section are with respect to the first implementaion.

The obvious problem with this method is scalablity. Since all the slaves are sending so much data to the master, there is a lot of contention for resources on the master. This obviously limits the number of processors that can be thrown at the problem.

3.Implementation Details


We define 2 types of processor objects (see section 1), a dispatcher and a tracer. A dispatcher is essentially the master processor that dishes out the work and the tracer does the actual raytracing of each strip.

Node 0 contains both a dispatcher and a tracer since it would be a waste of a processor to dedicate it to just sending strip numbers to each tracer, and disk writes. ( We timed the disk writes and ray tracing in the serial version and found raytracing a pixel is 600 times more expensive than writing it to disk).


We found a problem with CC++ in that when data was being sent to another processor by means of a data transfer functions there is unnecessary copying of the data before it is packaged in the data transfer function. We avoided this problem by defining copy constructors that do not actually allocate memory for data referred to by pointers. We just make the new object point to the data of the old object and do not define a destructor that releases the memory. This was the only way we could think of avoiding the extra copies.

Sync variables are used when the dispatcher is waiting for the tracers to complete their work, so no cycles are wasted in waiting in a dummy loop but the dispatcher's main loop is suspended till the sync variable is set by the function of the dispatcher that decides there are no more strips to be traced.

The function that dishes out work in the dispatcher (called NeedLines) is an atomic function since it has to increment a counter while giving out the next strip. Now if this function is called simultaneously by 2 tracers (or almost simultaneously) the value of the counter might not make any sense if both the copies of NeedLines access the counter together.



4.Performance


Speedup

The graph below shows speedup vs. number of processors, where speedup is defined as the ratio of the time taken to run the serial implementation to the time taken to run the parallel version.




The speedup is almost linear, but doesnt have a slope of 1. This is because of the bottleneck created on processor 1 which also needs to act as the dispatcher and thus the tracer on processor 1 isn't as efficient as the tracers on the other processors.

Communication vs. Computation time

The series of graphs shown below show breakup the time taken by the code to run into the computation time and the communication time.



As can be seen from the above graphs, the performance of the program is maximum at a granularity of about 30 lines per chunk. At lower granularity, the communication overhead becomes too high. At higher granularity again the efficiency goes down. This is due to load imbalance as some processors get more work and others get less work and stay idle for a longer time. This is shown to be true from the load balancing graph below.

Load Balancing

The graph below shows the ratio of maximum work done by a processor to the minimum work done by a processor (so showing how effectivley the load balancing policy is).



At low granularity the processor 1 ends up taking much longer to do its work because it is interrupted very often by the dispatcher which is invoked by the other tracers. At slightly higher granularities the load is much better balanced. Again at very high granularity the load imbalance is due to some processors getting more intensive work than others.

5.Experiences and Conclusions


CC++

We found CC++ to be overall a very effective language to work with in that it allowed us to think about parallelism in a structured manner. However, we also found the compiler we worked with to be very inefficient in code generation. To fairly compare our parallel implementation with the original serial C version, we had to recompile the serial version with the CC++ compiler; otherwise the output code of the C compiler for the ray tracing code (and therfore the computation intensive part of the program) was 50% faster than the CC++ code. Another problem is with the data transfer functions, which we mentioned earlier.

Raytracing

We feel that the best way to handle antialiasing is using shared memory multiprocessors so that processors can readily access the neighbouring rows that are handled by other processors. Also, since each processor needs only the first row of the next block to antialias its last row, it is almost certain that the neighbouring processor would have completed raytracing that row and the processor can therefore immediately use that data. (This statement is made under the assumption that all the processors carry the same work-load outside of the raytracing.)