In this subsection, we evaluate the performance of our algorithm on several applications that contain indirect array references. Two of these cases (CG and SPARSPAK) are sparse-matrix codes, where the indirect references occur while mapping between the sparse matrix and the dense storage array. IS is an integer sort program that uses the ``bucket sort'' algorithm; in this case, the indirect references occur while computing the the number of occurrences of each integer in a dense histogram-style array. Finally, MP3D is a rarified-air particle simulator, where the indirect references occur when the (x,y,z) coordinates of a particle are used to look up the matching ``space cell'' in a 3-dimensional matrix.
Figure shows the results of our experiment. For each application, we show three performance bars: (i) the original case without prefetching (N), (ii) the case where we only attempt to prefetch dense (i.e. affine) references (D), and (iii) the case where we prefetch both dense and indirect references (B).
We see in Figure that prefetching significantly improves the performance of each application, with over two thirds of memory stall time eliminated, and speedups ranging from 10%to 100%. The interesting question, however, is how much of this gain came from prefetching indirect references. The answer to this varies across the applications. In the case of IS, roughly half of the benefit was from prefetching the indirect references. At the other extreme, SPARSPAK saw no additional benefit from prefetching indirect references. This may seem a bit surprising, given that SPARSPAK is a sparse-matrix application. However, in SPARSPAK the elements of the sparse matrix are actually stored in a dense matrix, and the indirect references are to pointers into the appropriate rows or columns of the matrix. Therefore the data set size of the indirect references is relatively small, and tends to fit into the primary cache.
Another interesting issue with prefetching indirect references is instruction overhead. In Table , we isolate the average instruction overhead for prefetching only the indirect references. Notice that these numbers are generally larger than the numbers we saw earlier in Table . The reason for this is that prefetching an affine reference can take as little as a single instruction (the prefetch instruction itself), since the prefetch address can often be generated by simply changing the constant offset with respect to another load or store. However, computing the prefetch address for an indirect reference requires a minimum of four instructions: (1) a load, to load the index value; (2) a shift, to convert the index to the number of bytes in an array element; (3) an add, to add the byte offset to the starting address of the array; and (4) the prefetch instruction itself. In the case of SPARSPAK, our compiler achieved this bare minimum. For IS and CG, the numbers are also respectable. The case that stands out is MP3D, where generating an indirection prefetch address requires an average of 25 instructions. The reason for this is that MP3D must dereference three index arrays (x, y, and z) to compute each indirect reference, and in addition these index values must be converted from floating-point to integer values.
In theory, it may be possible to reduce the prefetching instruction overheads further by saving the indirection address in a register between the time it is computed for the prefetch and the time it is used for the load or store. However, there are two complications with this optimization. First, the compiler must be absolutely certain that the index value will not change during this interval. Otherwise, not only will the wrong location be prefetched (which only hurts performance), but more importantly the wrong location will be used by the load or store, and therefore the program will produce incorrect results. Second, storing the address values between the prefetch and the load or store will consume additional registers, which may result in the large overheads of register spilling. To avoid both of these problems, we chose to simply re-compute the indirect addresses each time they are needed in our compiler implementation.
Overall, the gains in reduced memory stalls were large enough to outweigh the increased instruction count in all cases. Although the performance gains for MP3D are small in the uniprocessor case, we will see later in Section that the gains are large in the multiprocessor case. The bottom line is that through a simple extension of the core algorithm, our compiler is now able to prefetch both affine and indirect array references, which covers a large fraction of array-based scientific and engineering applications.