CS 612 - Final Report

Parallel FFTs


1. Overview

FFTs are currently calculated in cycles of two phases each, a computation phase and a communication phase, with the number of cycles equal to the number of dimensions of the problem. The communication phase is a distributed matrix transpose, implemented as a black-box all-to-all personalized communication. A 2-dimensional FFT therefore needs 2 cycles (4 phases):

1d FFT
all-to-all (distributed transpose)
1d FFT
all-to-all (distributed transpose)

Data is initially distributed such that each processor has an n1 x m2 (rows x columns) block of the matrix, i.e. a column block distribution, where
n1, n2 - number of rows and columns of the problem
m1, m2 - n1/p and n2/p respectively
p - the number of processors

Each processor then computes the FFT of its local data and calls the distributed transpose function, which results in each processor having a matrix of size n2 x m1. The FFT of this matrix is computed, followed by another distributed transpose phase.
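As an illustration of this cycle structure, here is a minimal sketch in C; fft_1d_columns and distributed_transpose are our placeholder names for the local 1d FFT and the black-box all-to-all, not the actual routines.

    #include <complex.h>

    /* Placeholder prototypes (our names, not the actual code):
     * fft_1d_columns        - 1d FFTs down each of ncols local columns of length n
     * distributed_transpose - the black-box all-to-all personalized exchange */
    void fft_1d_columns(double complex *a, int n, int ncols);
    void distributed_transpose(double complex *a, int nrows, int ncols, int p);

    /* Two cycles (four phases) of a 2d FFT on an n1 x n2 matrix distributed
     * in column blocks: each processor starts with an n1 x m2 block. */
    void parallel_fft_2d(double complex *local, int n1, int n2, int p)
    {
        int m1 = n1 / p, m2 = n2 / p;

        fft_1d_columns(local, n1, m2);               /* cycle 1: computation   */
        distributed_transpose(local, n1, m2, p);     /* cycle 1: communication */
                                                     /* block is now n2 x m1   */
        fft_1d_columns(local, n2, m1);               /* cycle 2: computation   */
        distributed_transpose(local, n2, m1, p);     /* cycle 2: communication */
    }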

2. Problems in the communication phase - Motivations

Implementing the distributed transpose operation as a black box can be inefficient: processor cycles go unused while waiting for and sending data, and the floating point units sit idle during the sends and receives.

The function starts by collecting the data for all the processors so that it is stored linearly and can be sent directly to the other processors. The nth group of m1 rows (of m2 columns each) goes to the nth processor, so the collection is done trivially by block-copying a column at a time (the matrix is stored in column-major order). (B <- A)

Once that is done, the processor signals all other processors that it is ready to receive data, and sends its own data to each processor as it receives that processor's ready signal. It then waits for all its incoming data to arrive (A <- B) and does a transpose on the local buffer. (B <- At)
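A sketch of this black-box protocol follows; send_ready, wait_ready, send_block and wait_all_blocks stand in for the actual Active Messages calls, which are not shown, and the local transpose at the end is omitted.

    #include <complex.h>
    #include <string.h>

    /* Placeholders for the actual communication calls (assumed interface). */
    void send_ready(int dest);
    int  wait_ready(void);                    /* returns the rank of a ready peer */
    void send_block(int dest, const double complex *buf, int nelems);
    void wait_all_blocks(double complex *recvbuf, int nelems_per_block, int p);

    /* Black-box distributed transpose of an n1 x m2 column-major block A.
     * B is scratch space of the same size; m1 = n1/p. */
    void blackbox_transpose(double complex *A, double complex *B,
                            int n1, int m2, int p, int me)
    {
        int m1 = n1 / p;

        /* 1. Gather (B <- A): rows q*m1 .. (q+1)*m1-1 of every column are
         *    copied so the data bound for processor q is contiguous in B.
         *    Column-major storage makes each piece a contiguous run of m1. */
        for (int q = 0; q < p; q++)
            for (int j = 0; j < m2; j++)
                memcpy(B + (size_t)q*m1*m2 + (size_t)j*m1,
                       A + (size_t)j*n1 + (size_t)q*m1,
                       m1 * sizeof(double complex));

        /* 2. Tell every other processor we are ready to receive. */
        for (int q = 0; q < p; q++)
            if (q != me) send_ready(q);

        /* 3. Send our data to each peer as its ready signal arrives. */
        for (int k = 0; k < p - 1; k++) {
            int q = wait_ready();
            send_block(q, B + (size_t)q*m1*m2, m1*m2);
        }

        /* 4. Wait for all incoming blocks (A <- B), then transpose the
         *    received buffer locally (B <- At); the local transpose is
         *    omitted here for brevity. */
        wait_all_blocks(A, m1*m2, p);
    }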

a. The receiving end:
At the receiving end the processor wastes cycles waiting for the data from all processors to arrive before it can transpose the buffer (a synchronization wait).
It also wastes the FPUs (the SP2 has 2) while doing flow control and the load/store operations that move data from the network interface to memory, since these require only integer arithmetic.

b. The sending end
The sending end suffers a lot of overhead if the message being sent is longer than 8K (see section 4). The thrust here is to break one big message into a number of messages no larger than 8K.

3. Eliminating Synchronization Waits

We try to eliminate synchronization waits by performing operations on the small blocks that are exchanged between processors, thereby shifting from collective communication to point-to-point communication. For the computation to remain efficient, this requires changing the initial data distribution from column block to column cyclic.
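For concreteness, this is what the column-cyclic mapping looks like (illustrative helper functions with our names, not taken from the code):

    /* Column-cyclic distribution (illustrative): global column j of the
     * n1 x n2 matrix lives on processor j % p, as local column j / p.
     * e.g. with p = 4, processor 1 holds global columns 1, 5, 9, 13, ... */
    int owner_of_column(int j, int p)       { return j % p; }
    int local_index_of_column(int j, int p) { return j / p; }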

The distributed matrix transpose phase changes to collecting a block for a given processor and performing the last part of the computation of cycle 1 on that block before sending it. Because the distribution is column cyclic, the data has to be collected from rows with stride p. The processor that receives the block transposes it, performs the first part of the computation of cycle 2 on that block, and "scatters" the data with stride p across the rows.
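The following C sketch shows this per-block pipeline. The routines cycle1_last_stage, cycle2_first_stage and send_block_to are our placeholder names for the corresponding pieces of the computation and the point-to-point send; the index arithmetic assumes column-major storage and the column-cyclic layout described above.

    #include <complex.h>

    /* Placeholder prototypes (our names): the last FFT stage of cycle 1,
     * the first FFT stage of cycle 2, and a point-to-point block send. */
    void cycle1_last_stage(double complex *blk, int rows, int cols);
    void cycle2_first_stage(double complex *blk, int rows, int cols);
    void send_block_to(int dest, const double complex *blk, int nelems);

    /* Sender side: collect the m1 x m2 block destined for processor dest.
     * Under the column-cyclic distribution the required rows lie in the
     * local n1 x m2 array with stride p, so the copy is element-wise. */
    void gather_compute_send(const double complex *local, double complex *blk,
                             int n1, int m1, int m2, int p, int dest)
    {
        for (int j = 0; j < m2; j++)                /* local columns       */
            for (int i = 0; i < m1; i++)            /* rows dest + i*p     */
                blk[j*m1 + i] = local[j*n1 + dest + i*p];

        cycle1_last_stage(blk, m1, m2);             /* finish cycle 1      */
        send_block_to(dest, blk, m1*m2);            /* point-to-point send */
    }

    /* Receiver side: transpose the m1 x m2 block received from rank src,
     * apply the first stage of cycle 2, then scatter it into the local
     * n2 x m1 array with row stride p. */
    void receive_compute_scatter(double complex *local, const double complex *blk,
                                 double complex *blk_t, /* m2 x m1 scratch */
                                 int n2, int m1, int m2, int p, int src)
    {
        for (int j = 0; j < m2; j++)                /* transpose the block */
            for (int i = 0; i < m1; i++)
                blk_t[i*m2 + j] = blk[j*m1 + i];

        cycle2_first_stage(blk_t, m2, m1);          /* start cycle 2       */

        for (int i = 0; i < m1; i++)                /* strided scatter     */
            for (int j = 0; j < m2; j++)
                local[i*n2 + src + j*p] = blk_t[i*m2 + j];
    }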

This loop is unrolled once so that processors do not waste any time waiting for data. That is, every time a processor polls the network for an incoming packet it finds the data it needs and can operate on it. (see figure )

So 2 forms of waiting are eliminated:
1. The need to wait for all blocks, since we can operate on each block independently.
2. The need to poll the network repeatedly for each block, since we unroll the loop and schedule the communication so that a packet is there every time we poll the network (a sketch of this schedule follows).
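A rough sketch of the unrolled loop and the communication schedule, reusing the gather/scatter routines from the previous sketch; poll_block is a placeholder for the actual network poll, and the ring schedule shown (send to me+k, receive from me-k, mod p) is one plausible arrangement, not necessarily the one used.

    #include <complex.h>

    /* Routines from the previous sketch, plus a placeholder network poll
     * that returns the rank the arriving block came from. */
    void gather_compute_send(const double complex *local, double complex *blk,
                             int n1, int m1, int m2, int p, int dest);
    void receive_compute_scatter(double complex *local, const double complex *blk,
                                 double complex *blk_t,
                                 int n2, int m1, int m2, int p, int src);
    int  poll_block(double complex *blk, int nelems);

    /* The exchange loop, unrolled once: the block for step k+1 is gathered,
     * computed on and sent before we poll for the block of step k, so by the
     * time we poll, its packet should already have arrived.  Two send buffers
     * alternate so a buffer is not overwritten while still in flight. */
    void pipelined_transpose(const double complex *local_in, double complex *local_out,
                             double complex *sendblk[2], double complex *recvblk,
                             double complex *scratch,
                             int n1, int n2, int m1, int m2, int p, int me)
    {
        /* Prologue: send the step-1 block ahead of the loop. */
        gather_compute_send(local_in, sendblk[1], n1, m1, m2, p, (me + 1) % p);

        for (int k = 1; k < p; k++) {
            if (k + 1 < p)      /* send step k+1 while step k is in flight */
                gather_compute_send(local_in, sendblk[(k + 1) & 1],
                                    n1, m1, m2, p, (me + k + 1) % p);

            /* Poll for the step-k packet and process it immediately. */
            int src = poll_block(recvblk, m1 * m2);
            receive_compute_scatter(local_out, recvblk, scratch,
                                    n2, m1, m2, p, src);
        }

        /* The block a processor keeps for itself never crosses the network
         * and is handled locally (omitted here). */
    }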

Once a processor has received all its data, it can perform the main body of the computation.

Limitations

We find that the cost of copying the data into and out of the message packets is prohibitive. As the data has to be "gathered" before sending and "scattered" after receiving with stride p, we can only do an element-wise copy of the data, which is very inefficient.
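The difference is easy to see in code: with a block distribution each column piece can be moved with a single memcpy, whereas the stride-p access forced by the cyclic distribution needs one load/store pair per element (illustrative snippet):

    #include <complex.h>
    #include <string.h>

    /* Contiguous case (block distribution): one memcpy per column piece. */
    void copy_column(double complex *dst, const double complex *src, int m1)
    {
        memcpy(dst, src, m1 * sizeof(double complex));
    }

    /* Strided case (cyclic distribution): stride-p access defeats memcpy,
     * forcing an element-wise copy. */
    void copy_column_strided(double complex *dst, const double complex *src,
                             int m1, int p)
    {
        for (int i = 0; i < m1; i++)
            dst[i] = src[(size_t)i * p];
    }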

4. Limiting packet size

The implementation of Active Messages on the SP2 sends messages in 8K "chunks". Messages larger than 8K have to wait for an acknowledgement from the receiving processor before the next chunk can be sent, and therefore suffer the full round-trip latency of the network; the call to am_store_async does not return until all chunks have been sent. To avoid this overhead we send packets of size 8K, so that am_store_async can return immediately and the processor can get its next packet ready for sending. This allows some parallelism in the send operation.
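A sketch of this chunking, under the stated assumption that a packet of at most 8K fits in a single chunk; PACKET_BYTES, prepare_packet and send_8k_packet are our names, and send_8k_packet merely stands in for the am_store_async call, whose exact arguments are not reproduced here.

    #include <complex.h>

    #define PACKET_BYTES 8192                           /* SP2 AM chunk size */
    #define PACKET_ELEMS ((int)(PACKET_BYTES / sizeof(double complex)))

    /* Placeholders (our names): gather the k-th packet out of the m1 x m2
     * block, and hand a packet to the network; the latter stands in for
     * the am_store_async call. */
    void prepare_packet(const double complex *blk, double complex *pkt,
                        int m1, int m2, int k);
    void send_8k_packet(int dest, const double complex *pkt, int nelems);

    /* Send an m1 x m2 block as a sequence of <= 8K packets.  Because each
     * packet fits in one chunk, the asynchronous store can return at once
     * and the next packet can be prepared while the previous one is on the
     * wire, instead of stalling on a per-chunk acknowledgement. */
    void send_block_in_packets(const double complex *blk, double complex *pkt[2],
                               int m1, int m2, int dest)
    {
        int total   = m1 * m2;
        int npacket = (total + PACKET_ELEMS - 1) / PACKET_ELEMS;

        for (int k = 0; k < npacket; k++) {
            double complex *cur = pkt[k & 1];           /* alternate buffers  */
            int nelems = (k == npacket - 1) ? total - k * PACKET_ELEMS
                                            : PACKET_ELEMS;
            prepare_packet(blk, cur, m1, m2, k);        /* strided gather     */
            send_8k_packet(dest, cur, nelems);          /* returns at once    */
        }
    }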

Each 8K message is collected from the buffer (of size m1 x m2, collected from the matrix as described in the previous section) with row stride r1 and column stride r2, where
r1 = m1 / number of rows in an 8K packet
r2 = m2 / number of columns in an 8K packet

After a packet is collected, the last part of the computation of cycle 1 is applied to it before it is sent. At the receiving end each 8K packet is transposed and the first part of the computation of cycle 2 is applied to it, followed by a "scatter" operation into the m2 x m1 buffer with column stride r1. The elements of each column are copied contiguously.

This is repeated until the m2 x m1 buffer has received all its data (or until the sending side has finished sending the m1 x m2 buffer).
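A sketch of the receiving-side handling of each 8K packet; transpose_packet, cycle2_first_stage and scatter_packet are our placeholder names, and the exact index arithmetic of the strided scatter (column stride r1) is hidden behind scatter_packet here.

    #include <complex.h>

    /* Placeholder prototypes (our names): transpose one k1 x k2 packet,
     * apply the first stage of cycle 2 to it, and scatter the transposed
     * packet into the m2 x m1 receive buffer with column stride r1. */
    void transpose_packet(const double complex *pkt, double complex *pkt_t,
                          int k1, int k2);
    void cycle2_first_stage(double complex *tile, int rows, int cols);
    void scatter_packet(double complex *recvbuf, const double complex *pkt_t,
                        int m2, int m1, int r1, int r2, int pkt_index);

    /* Handler run for each incoming 8K packet.  Returns 1 once all r1*r2
     * packets of the m2 x m1 buffer have arrived. */
    int handle_packet(double complex *recvbuf, const double complex *pkt,
                      double complex *pkt_t, int m1, int m2, int r1, int r2,
                      int pkt_index, int *packets_seen)
    {
        int k1 = m1 / r1;                      /* rows in an 8K packet    */
        int k2 = m2 / r2;                      /* columns in an 8K packet */

        transpose_packet(pkt, pkt_t, k1, k2);  /* k1 x k2  ->  k2 x k1    */
        cycle2_first_stage(pkt_t, k2, k1);     /* start cycle 2 on it     */
        scatter_packet(recvbuf, pkt_t, m2, m1, r1, r2, pkt_index);

        return ++(*packets_seen) == r1 * r2;   /* buffer complete?        */
    }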

Note that, as in the previous case, the loop is unrolled so that each processor is sending data to 2 processors and receiving data from 2 processors, to avoid waiting for packets from the network.

Limitations

We again find the cost of the copying between buffers prohibitive to attaining good efficiency. Note that there is more copying in this case than in the earlier one: we have to collect data with stride p from the main array into the m1 x m2 buffer (as in the previous case) and then copy data from there into 8K packets, with row and column strides r1 and r2 respectively, to be sent through the network. At the receiving end the 8K packet is copied into the m2 x m1 buffer with column stride r1. After all the packets have arrived, the m2 x m1 buffer is merged into the main buffer with a contiguous copy of all columns. Much of this copying is element-wise (except for the copying of the 8K packet into the m2 x m1 buffer and of the m2 x m1 buffer into the main array, which is done column-wise), so we cannot take advantage of block-copying functions here.

5. Results

6. Conclusions and future work

We have seen that this approach will not work as is due to the copying and computation overhead it introduces. Our next step will be to look into how these overheads can be removed.