MIME-Version: 1.0
Server: CERN/3.0
Date: Sunday, 01-Dec-96 20:09:19 GMT
Content-Type: text/html
Content-Length: 12013
Last-Modified: Monday, 06-May-96 15:59:30 GMT
Parallel Raytracing in CC++
Parallel Raytracing in CC++
CS516 Final Project - Spring '96
Vineet Ahuja
Amith Yamasani
Abstract
This project is a parallelisation of a public domain raytracer. It is
implemented in CC++ (Compositional C++) on the SP2. Raytracing in
itself is easily parallelised, as the screen can be split up into
several areas and each area can be given to one processor. The problem
arises when antialiasing needs to be performed to reduce the aliasing
effects due to finite sampling. Due to this, boundaries between
processors need to be kept at a minimum. One also needs to worry about
balancing the load well between the processors, depending on the
complexity of the scene. Transferring all results back to one
processor to write to disk also becomes an issue. We have attempted to
tackle these problems and come up with as efficient a solution as
possible, given the constraints of the language and the size of the
problem set.
1.Introduction
CC++
Compositional C++, is a language that supports structured parallel
programming that is being developed at Caltech by the Computational Biology
group. It provides structured parallelism in the form of par blocks and
parfor loops and unstructured parallelism in the form of the spawn
statement. Functions preceeded by the keyword atomic are used to
implement mutual exclusion for functions that work on shared data. Sync
variables are used for synchronization and the global keyword is used
to modify pointers so that they can refer to local and remote memory. Sync
variables work by suspending the thread that tries to read the variable till
the variable is written to by another thread. Processor objects are objects
that control the computation on a processor and they are defined in the
regular manner except that the global keyword is used before the class
definition.
Data transfer functions have to be defined while calling functions that lie
on another processor object so that data like arrays and other user defined
data structures can be copied and sent to the other processor.
Raytracing
We picked up a public domain implementation of a raytracer called RayLab,
and worked on parallelising that.
Raytracing is a method of rendering graphics scene that considers the rays
entering the viewers eye. The method traces the path of those rays (from the
eye to the scene) and calculates the intensity and color of that ray,
depending on the ray's path and reflections. As calculating the ray value is
independent from the calculation of all other rays, parallelisation is
trivial.
The difficulty in parallelisation arises from the need to perform
antialiasing. Antialiasing is used to reduce the jaggies in the ouptut image
caused by the finite sampling rate of the screen (i.e. a finite resolution).
This is done for each pixel by considering the immediate pixels to the east,
the south and the south-east.
2.Strategies for Parallelisation
Strategy 1 - Blocked row distribution
The easiest method is to divide the screen into n strips (where n is the
number of processors) and allow each processor to ray trace its own strip. The
problem with this method is load balancing. For example, if all the objects in
the scene were to lie on strip n, then the nth procesor would have to do the
most work leaving the first n-1 processors idle after they have completed
their own strips.
Strategy 2 - Column and row cyclic distribution
Another static division method, this one divides the screen into small squares
of n x n pixels (where n x n is the number of processors). Each of the pixels
in that square goes to a different processor. This is basically the column and
row cyclic division (see the figure above). The problem with this method is
that doing anti-aliasing is no longer straight forward as all the current
pixel's neighbours are on different processors, and performance becomes an
issue if a processor has to go across the network (connecting the processors)
to anti-alias each pixel.
Strategy 3 - Hungry Puppies
This method works by dividing up the screen into strips of m lines each. A
processor is designated as the master and the rest of the nodes are slaves.
Each slave node requests work from the master and goes back to ray trace its
strip. On completion it returns the quantized ray traced strip and gets more
work from the master (by quantization we mean converting the values of the RGB
value of the pixel from float to char to throw on the display). This goes on
till all the strips have been ray traced. (So each slave is a hungry puppy
that runs to the master for work) The master node then writes the entire
raytraced scene to disk.
Another problem is antialiasing. What does a processor do with the last row
of its strip. It needs the next row (which is on the another processor) to
antialias it. We have two different implementations to handle this problem. In
one we compute the next row on this processor as well (so doing the
computation twice - once on this one and once on the processors that actually
owns the line). In the other, we send the unquantized last and first rows of
each strip (in addition to the quantized values of all but the last row of
each strip) so that the master node can antialias the last rows of all the
strips therefore avoiding any redundant computation. At this time this version
has a bug in it so all performance results discussed in the next section are
with respect to the first implementaion.
The obvious problem with this method is scalablity. Since all the slaves
are sending so much data to the master, there is a lot of contention for
resources on the master. This obviously limits the number of processors that
can be thrown at the problem.
3.Implementation Details
We define 2 types of processor objects (see section 1), a dispatcher and a
tracer. A dispatcher is essentially the master processor that dishes out the
work and the tracer does the actual raytracing of each strip.
Node 0 contains both a dispatcher and a tracer since it would be a waste of
a processor to dedicate it to just sending strip numbers to each tracer, and
disk writes. ( We timed the disk writes and ray tracing in the serial version
and found raytracing a pixel is 600 times more expensive than writing it to
disk).
We found a problem with CC++ in that when data was being sent to another
processor by means of a data transfer functions there is unnecessary copying
of the data before it is packaged in the data transfer function. We avoided
this problem by defining copy constructors that do not actually allocate
memory for data referred to by pointers. We just make the new object point to
the data of the old object and do not define a destructor that releases the
memory. This was the only way we could think of avoiding the extra copies.
Sync variables are used when the dispatcher is waiting for the tracers to
complete their work, so no cycles are wasted in waiting in a dummy loop but
the dispatcher's main loop is suspended till the sync variable is set by the
function of the dispatcher that decides there are no more strips to be traced.
The function that dishes out work in the dispatcher (called NeedLines) is
an atomic function since it has to increment a counter while giving out the
next strip. Now if this function is called simultaneously by 2 tracers (or
almost simultaneously) the value of the counter might not make any sense if
both the copies of NeedLines access the counter together.
4.Performance
Speedup
The graph below shows speedup vs. number of processors, where speedup is
defined as the ratio of the time taken to run the serial implementation to the
time taken to run the parallel version.
The speedup is almost linear, but doesnt have a slope of 1. This is because
of the bottleneck created on processor 1 which also needs to act as the
dispatcher and thus the tracer on processor 1 isn't as efficient as the
tracers on the other processors.
Communication vs. Computation time
The series of graphs shown below show breakup the time taken by the code to
run into the computation time and the communication time.
As can be seen from the above graphs, the performance of the program is
maximum at a granularity of about 30 lines per chunk. At lower granularity,
the communication overhead becomes too high. At higher granularity again the
efficiency goes down. This is due to load imbalance as some processors get
more work and others get less work and stay idle for a longer time. This is
shown to be true from the load balancing graph below.
Load Balancing
The graph below shows the ratio of maximum work done by a processor to the
minimum work done by a processor (so showing how effectivley the load
balancing policy is).
At low granularity the processor 1 ends up taking much longer to do its
work because it is interrupted very often by the dispatcher which is invoked
by the other tracers. At slightly higher granularities the load is much better
balanced. Again at very high granularity the load imbalance is due to some
processors getting more intensive work than others.
5.Experiences and Conclusions
CC++
We found CC++ to be overall a very effective language to work with in that it
allowed us to think about parallelism in a structured manner. However, we also
found the compiler we worked with to be very inefficient in code generation.
To fairly compare our parallel implementation with the original serial C
version, we had to recompile the serial version with the CC++ compiler;
otherwise the output code of the C compiler for the ray tracing code (and
therfore the computation intensive part of the program) was 50% faster than
the CC++ code. Another problem is with the data transfer functions, which we
mentioned earlier.
Raytracing
We feel that the best way to handle antialiasing is using shared memory
multiprocessors so that processors can readily access the neighbouring rows
that are handled by other processors. Also, since each processor needs only
the first row of the next block to antialias its last row, it is almost
certain that the neighbouring processor would have completed raytracing that
row and the processor can therefore immediately use that data. (This statement
is made under the assumption that all the processors carry the same work-load
outside of the raytracing.)