CMU 15-418/618 Final Project: Checkpoint Writeup

Checkpoint Report:

Suffix Array Construction on GPU

Work completed so far

We have studied deep into the serial Skew Algorithm implementation and parallized code by Guy in PBBS.
PBBS skew benchmark Code

We identied the parts which could be paralleized in GPU:
- radix sort mod12 based on 3 chars
- generate rank of triplets
- move mod 1 suffixes to bottom half and and mod 2 suffixes to top
- place ranks for the mod12 elements in full length array
- radix sort mod 0 suffixes
- mergesort s0 and s12

The parallelzed Skew Algorithm pseudocode on GPU:
(Based on the paper "Parallel suffix array and least common prefix for the GPU" by Mrinal Deo and Sean Keely.)

        	
        		computeSA(int *T, int *SA, int length)
        		{
        		    initMod12();  
        		    radixSort(s12);  // lsb radix 1st Character                     kernel1
        		    radixSort(s12);  // lsb radix 2nd Character
        		    radixSort(s12);  // lsb radix 3rd Character
        		    lexicRankOfTriplets();                                          kernel2
        		    if (!allUniqueRanks)
    			        computeSA();                                                recursion
    			        storeUniqueRanks();                                         kernel3
    			    else
    			        computeSAFromUniqueRank();                                  kernel4
    			    radixSort(s0)                                                   kernel1
    			    mergeSort(s0, s12)                                              kernel5
        		}

Progress: We are trying hard to make this code work!

We studied the radixSort and mergeSort library in GPU, which is available to be adapted in our code.
Cuda doc on RadixSort and MergeSort

More detailed goals

First implement the Skew Algorithm on Cuda according to the pseudocode above with the CUDA RadixSort and MergeSort library.

Then optimized the RadixSort and MergeSort parallel part based on the Merrill10Revisiting and Satish09Designing paper about efficient sorting algorithms for manycore gpus

Demo plan in competition

We plan to show our speedup graph of Skew Algorithm based on the serial Skew Algorithm code and comparison with the PBBS parallel Skew Algorithm code. Moreover, we plan to compare the performance of the Suffix Array based application such as Longest Common Prefix(LCP) with PBBS benchmark.

Preliminary results

No preliminary results so far.

Remaining issues

We are condifient to implement the Skew Algorithm on GPU with the CUDA RadixSort and MergeSort library. Howerver, we are not sure about how far we can go to implement our own RadixSort and MergeSort part on GPU to make improvements. There are some existing research results on optimizing those on GPU. We will try our best to achieve them.

Project schedule

Week What We Plan To Do What We Actually Did

Apr 11-14 Get familiar with Suffix Array construction algorithms and code We did this!

Apr 15-20 Go deep into the Skew parallel code in Guy's benchmark We did this!

Apr 21-30 Implement Skew algorithms in CUDA We are half way there. With pseudocode ready we are implementing the algorithm.

May 1-9 Compare with Guy's parallel benchmark and make optimizations

May 10-13 Final report and demo

Week	What We Plan To Do	What We Actually Did
Apr 11-14	Get familiar with Suffix Array construction algorithms and code	We did this!
Apr 15-20	Go deep into the Skew parallel code in Guy's benchmark	We did this!
Apr 21-30	Implement Skew algorithms in CUDA	We are half way there. With pseudocode ready we are implementing the algorithm.
May 1-9	Compare with Guy's parallel benchmark and make optimizations
May 10-13	Final report and demo