CMU 15-418 (Spring 2012) Final Project Report
Parallel Subdivision on Parametric Surfaces
Jeffrey Liu, Albert Shih
Main Project Page

SUMMARY

We implemented parallel view-dependent subdivision of models built from Bézier surfaces in CUDA.

BACKGROUND

The program first loads an input file of patches. For each frame, our subdivision algorithm runs once. The general algorithm goes like this:

Data Structures (all in device memory):

  - A patch array for the original mesh of patches.
    The data is loaded once at the beginning and never changed.

  - An input and output buffer of patches.
    The input holds the current patches we need to process. The output holds the patches we need to process in the next subdivision pass. We swap these after each pass.

  - An array of decision bits
    There is one decision bit for each control point. A decision bit is either 1 or 0 depending on the error value calculated at that control point. We can reuse this array each subdivision pass.

  - An integer "todo" array and a "done" array
    We have one value for each current patch so we can run a prefix sum. The values in the todo array are either 0 or 4, depending on whether a patch needs subdivision (each subdivided patch produces four children). The values in the done array are 0 or 1, depending on whether a patch is finished and ready to be emitted.


Model-View Transformation: Apply the model-view transformation to all the control points in each patch and copy the resulting patches to the input buffer. We can't apply perspective division yet, because doing so would warp the control points.

Calculate Decision Bits: The decision bits are calculated by taking each original control point and finding the distance between it and its degree-elevated bilinear counterpart. The bilinear counterparts are found by linearly interpolating along the edge lines that the 4 corner points make.
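As a minimal sketch of this test (plain Python rather than the CUDA kernel; the threshold `eps` and the helper names are ours): degree-elevating the bilinear patch through the 4 corners to bicubic places its (i, j) control point at the bilinear interpolation of parameters (j/3, i/3), so the error is just the distance from each actual control point to that interpolated point.

```python
def lerp(a, b, t):
    return tuple(x + (y - x) * t for x, y in zip(a, b))

def bilinear_counterpart(corners, u, v):
    # corners = (P00, P03, P30, P33): the patch's 4 corner control points.
    p0 = lerp(corners[0], corners[1], u)
    p1 = lerp(corners[2], corners[3], u)
    return lerp(p0, p1, v)

def decision_bits(patch, eps=1e-3):
    """patch[i][j] is control point (i, j) of a bicubic Bezier patch.
    A bit is 1 if that point deviates from the flat bilinear patch by > eps."""
    corners = (patch[0][0], patch[0][3], patch[3][0], patch[3][3])
    bits = 0
    for i in range(4):
        for j in range(4):
            # Degree-elevating the bilinear patch to bicubic places its
            # (i, j) control point at parameters (j/3, i/3).
            q = bilinear_counterpart(corners, j / 3, i / 3)
            err = sum((a - b) ** 2 for a, b in zip(patch[i][j], q)) ** 0.5
            if err > eps:
                bits |= 1 << (i * 4 + j)
    return bits
```

A perfectly flat patch yields an all-zero bitfield, so it will be emitted as a quad rather than subdivided.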

Run Exclusive Prefix Sum: Runs a prefix sum on the "todo" array and the "done" array. We include one more element than the number of patches, so the last entry gives the total size of each output.
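The scan can be sketched sequentially (on the GPU it is a parallel scan, but the indexing it produces is the same): each patch reads its offset from the scanned array to know where to write its children or its quad.

```python
def exclusive_scan(values):
    """Exclusive prefix sum with one extra element appended, so the last
    entry is the total (i.e. the size of the buffer being indexed)."""
    out = [0] * (len(values) + 1)
    for i, v in enumerate(values):
        out[i + 1] = out[i] + v
    return out

# Example: patches 0 and 2 need subdivision (4 children each); patch 1 is done.
todo_offsets = exclusive_scan([4, 0, 4])  # [0, 4, 4, 8]: next pass has 8 patches
done_offsets = exclusive_scan([0, 1, 0])  # [0, 0, 1, 1]: 1 quad is emitted
```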

Process the Patches: If a patch still needs to be subdivided, the split kernel calls the subdivide function and pushes four new patches into the new patch queue. Otherwise, it calls the function that generates a quad approximation from the 4 corner points and the normal of the patch and writes the result into a vertex buffer.

Repeat: If not every patch has been turned into a quadrilateral and written to the vertex buffer, repeat the process from step two (calculating the decision bits) on the newly subdivided patches. To do this, we simply swap the input and output buffers.
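Putting the steps together, the per-frame loop looks roughly like this (a host-side sketch with hypothetical function names standing in for the kernels; `max_depth` is an assumed safety cap, not something the text specifies):

```python
def subdivide_frame(patches, needs_split, split, max_depth=8):
    """needs_split(patch) -> True if any decision bit is set;
    split(patch) -> the four child patches.  Returns the finished patches
    (each would be emitted as a quad from its corners and normal)."""
    inp, finished = list(patches), []
    for _ in range(max_depth):
        if not inp:
            break
        out = []
        for p in inp:
            if needs_split(p):
                out.extend(split(p))   # four children go to the output buffer
            else:
                finished.append(p)     # emit quad approximation
        inp = out                      # swap: output becomes next pass's input
    finished.extend(inp)               # depth cap reached: flush the rest
    return finished
```

Using an integer "depth budget" as a stand-in patch shows the fan-out: a patch that needs two rounds of splitting produces sixteen quads.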


APPROACH

Like the authors of the paper, we implemented the subdivision using CUDA. We rendered the results using OpenGL.

On CUDA Implementation:

There are a few embarrassingly parallel parts in this algorithm, and in those cases we simply let each CUDA thread process a control point or patch. However, the two main kernel calls, "oracle" and "split_kernel," assign different meanings to CUDA threads at different stages. For both functions, we let each CUDA block contain 16 patches and allocate 16 CUDA threads per patch.

The "oracle" function sets the decision bits for each patch. Each CUDA thread does perspective division on one control point and its planar counterpart and calculates their decision bit. The perspective-divided points are only used here (we can't store them in the patch data, because the patch might be subdivided again, and perspective division would warp the control points so that their bilinear degree-elevation counterparts, i.e. the corresponding points on the plane, would be wrong), so they are kept in shared memory. Then a single thread per patch sets the "todo" and "done" arrays based on the complete decision bitfield.
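That per-point work can be sketched as follows (names are ours; in the real kernel the divided points live in shared memory, and the decision bit is simply `screen_error(...) > eps` for some threshold):

```python
def perspective_divide(p):
    """Clip-space point (x, y, z, w) -> normalized device coordinates."""
    x, y, z, w = p
    return (x / w, y / w, z / w)

def screen_error(cp, planar_cp):
    """Screen-space (x, y) distance between a control point and its
    counterpart on the bilinear patch, both given in clip space."""
    a = perspective_divide(cp)
    b = perspective_divide(planar_cp)
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5
```

Keeping the division local to the oracle is what lets the undistorted control points survive for later subdivision passes.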

The "split_kernel" function looks at the decision bits for each patch and either subdivides it or generates primitives for rendering. Generating primitives just involves taking the corners of each patch and calculating the normals, so this is done by 1 thread. Subdivision involves matrix multiplication, so each of the 16 threads is in charge of calculating one point (including the x, y, z coordinates) in the final product of each 4×4 matrix multiplication. We do two matrix multiplications for each dimension of each subdivided patch. The intermediate results are stored in shared memory, while the final results are stored in global memory.
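For reference, the split math can be written with two constant matrices S_L and S_R: for a cubic curve with control points c, the halves at t = 1/2 are (S_L·c)/8 and (S_R·c)/8, and a patch quadrant is (S_v·P·S_uᵀ)/64 applied per coordinate. This is a plain-Python sketch of that math, not the CUDA kernel itself:

```python
SL = [[8, 0, 0, 0], [4, 4, 0, 0], [2, 4, 2, 0], [1, 3, 3, 1]]  # left half (x1/8)
SR = [[1, 3, 3, 1], [0, 2, 4, 2], [0, 0, 4, 4], [0, 0, 0, 8]]  # right half (x1/8)

def split_curve(c):
    """Split one coordinate of a cubic Bezier curve at t = 1/2."""
    left = [sum(SL[i][k] * c[k] for k in range(4)) / 8 for i in range(4)]
    right = [sum(SR[i][k] * c[k] for k in range(4)) / 8 for i in range(4)]
    return left, right

def split_patch(p):
    """p: 4x4 grid of one coordinate of a bicubic patch.
    Returns the four child quadrants; the patch corners are retained exactly."""
    kids = []
    for Sv in (SL, SR):
        for Su in (SL, SR):
            # child = (Sv @ p @ Su^T) / 64, done as two 4x4 multiplications
            tmp = [[sum(Sv[i][k] * p[k][j] for k in range(4))
                    for j in range(4)] for i in range(4)]
            kids.append([[sum(tmp[i][k] * Su[j][k] for k in range(4)) / 64
                          for j in range(4)] for i in range(4)])
    return kids
```

Retaining the four patch corners exactly is the property the reference we used got wrong; these matrices satisfy it, and neighboring children agree on their shared boundary points.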

We could have split both functions into two or more separate global kernel functions. However, this would have increased memory traffic dramatically. Instead of copying and reading all the patches from global memory on each kernel call, it is much better to tolerate some divergence in the code and reuse the data already in shared memory.

On Existing Codebase:

We started with an old computer graphics project (from 15-462) and added CUDA functionality to it. The graphics project contained basic camera, vector, and matrix classes, but we had to write the patch classes and patch-loading code ourselves. Unfortunately, none of these functions could be used in CUDA kernels, so we had to write device versions of many of them.

Integrating CUDA into an existing project was also a bit more difficult than we expected. Neither of us had much experience with makefiles, and the existing makefile for the graphics project was already fairly complicated.

Other Stuff:

It is worth noting that we calculate normals differently. Our patch input files don't include normals (or texture coordinates), so we calculate them on the fly. This works if we do just one subdivision on all the patches, but after dynamic subdivision the orientation of the patches changes, which results in some of our normals being flipped. That's why we use a uniform normal for all our dynamic subdivision examples.

We haven't looked deeply into this issue, but it might be fixable. We could pre-compute a normal for each patch, but then the normals at each corner would be the same and would not reflect the C1 continuity of the Bézier patches. We might also be able to fix this by changing our patch subdivision math. We could not find a standard way of subdividing patches; our best reference used matrix multiplication with a few constant matrices, but it was partially incorrect and we had to work out the math ourselves to get a correct subdivision. However, we don't know whether this caused the normals to behave abnormally.

Also, it is probably worth mentioning that there are sometimes weird artifacts floating around.

RESULTS

To be honest, we could not get very good analytical data. There are a few reasons we can't accurately compare our results to the results of the paper. The models we use are different (except for bigguy, though he has a different starting patch count), and our patch input files don't have normals or texture coordinates. Also, we don't have all the functionality yet, such as backface culling (which requires proper normals) and frustum checking.

Also, we ran out of time to implement parallel subdivision on the CPU with SIMD, and we do not have a sequential version of our subdivision. It's unfortunate that we spent so much time getting the subdivision algorithm to work correctly rather than optimizing and analyzing its parallelism. We would both have budgeted a lot more time if we had known how long the project was going to take.

But here are some results:

Model     Initial Patches    Final Patches    Subdivision Time
Plane           1                   1                1 ms
Bigguy       2900               34848               80 ms
Forest      55080              154979              180 ms

*Note: These timings are all taken at some distance away from the models, because we do not have frustum checking implemented yet. Unfortunately, the timings change drastically with distance.

Model     Oracle Time    Prefix Sum Time    Split Kernel Time
Plane         0 ms              1 ms                0 ms
Bigguy        3 ms             77 ms                2 ms
Forest       10 ms            164 ms                6 ms

*Note: These timings are all taken after calling cudaThreadSynchronize(); otherwise the kernel calls run asynchronously and the timings would be off. However, this synchronization adds a bit to the total time.

If we just compare the bigguy timings with the paper's implementation, ours is at least 9 times slower. We think the main reason is that our prefix sum function is extremely slow and takes up most of the time. We did not expect the prefix sum to take so long, so we only ran it on a single block. This means that even using all the GPU cores, we could at best decrease the time spent on the prefix sum by a factor of 15. Still, there is probably something inefficient about our prefix sum, because the paper's implementation of it took only a fraction of the oracle and split time.

Also, we did not include rendering time in our calculations because our rendering is horribly inefficient.

[Figure: After every patch is subdivided once.]
[Figure: Dynamic subdivision. Everything finishes before the fifth subdivision.]
[Figure: Because the patches were small to begin with, most patches were only subdivided once.]
[Figure: Bigguy's original butt.]
[Figure: Now much rounder.]


REFERENCES

http://research.microsoft.com/en-us/um/people/cloop/EisenEtAl2009.pdf

The paper whose method we implemented.


http://www.idav.ucdavis.edu/education/CAGDNotes/Matrix-Cubic-Bezier-Patch/Matrix-Cubic-Bezier-Patch.html#subdivision

Information on Bezier patches and subdivision. Note that the matrix form of the subdivision is partially wrong. The matrices appended to the final subdivision split are incorrect because the four patch corners are not retained.


http://www.cs.mtu.edu/~shene/COURSES/cs3621/NOTES/spline/Bezier/bezier-elev.html

More information on Bézier patches; specifically, how to find the degree elevation of a Bézier curve.

LIST OF WORK BY EACH STUDENT

We pair programmed a lot, but:

Jeffrey: Figuring out how the method actually works and related research. Wrote many of the functions used by the oracle kernel.

Albert: Wrote the code for patch classes, data structures, and mesh loading from input files. In charge of repurposing the old code base. Wrote most of the writeup.

Project Code

Download

[Have a nice day.]