The MP3D application spends most of its time executing a loop in which each processor takes a particle and moves it through one time step. The overwhelming majority of cache misses are caused by references to two structures within this loop: (i) the particle that is being moved (36% of misses), and (ii) the space cell where the particle resides (53% of misses). Particles are statically assigned to processors and are allocated from the shared memory local to each processor, while the memory for the space cells is distributed uniformly among the processors.
We inserted prefetches into MP3D by hand as follows. Since a particle must be referenced to determine the space cell it occupies, we prefetch a particle record two iterations before its turn to be moved. In the iteration following the prefetch, the particle is read, and the associated space cell is determined and prefetched. As a result, when it is time for the particle to be moved, both the particle and space cell records are available in the cache. We also prefetch several other references that occur at time step boundaries. The end result is a coverage factor of 90% for our hand-insertion scheme. Exclusive-mode prefetches are used since the objects are modified during each iteration. Introducing these prefetches required adding 16 lines to the source code.
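The two-stage scheme described above can be illustrated with a minimal sketch in C. This is not the MP3D source: the `Particle` and `Cell` layouts, the `cell` index field, and the function `move_particles` are hypothetical stand-ins, and the GCC/Clang builtin `__builtin_prefetch` with its write hint (`rw = 1`) is assumed here as an approximation of an exclusive-mode prefetch.

```c
#include <stddef.h>

/* Hypothetical record layouts; the real MP3D structures differ. */
typedef struct { int cell; double pos; double vel; } Particle;
typedef struct { double momentum; int count; } Cell;

/* Assumption: rw = 1 (prefetch for write) stands in for an
 * exclusive-mode prefetch. Falls back to a no-op elsewhere. */
#if defined(__GNUC__)
#define PREFETCH_EX(p) __builtin_prefetch((p), 1, 3)
#else
#define PREFETCH_EX(p) ((void)(p))
#endif

void move_particles(Particle *parts, size_t n, Cell *cells)
{
    /* Prologue: start the pipeline for the first two particles
     * and for the space cell of the first particle. */
    if (n > 0) PREFETCH_EX(&parts[0]);
    if (n > 1) PREFETCH_EX(&parts[1]);
    if (n > 0) PREFETCH_EX(&cells[parts[0].cell]);

    for (size_t i = 0; i < n; i++) {
        /* Particle record: two iterations ahead. */
        if (i + 2 < n) PREFETCH_EX(&parts[i + 2]);

        /* Space cell: one iteration ahead. Particle i + 1 was
         * prefetched in the previous iteration, so reading its
         * cell index here should hit in the cache. */
        if (i + 1 < n) PREFETCH_EX(&cells[parts[i + 1].cell]);

        /* Move particle i; both records are modified. */
        Particle *p = &parts[i];
        Cell *c = &cells[p->cell];
        p->pos += p->vel;
        c->momentum += p->vel;
        c->count += 1;
    }
}
```

The bounds checks keep the sketch from reading past the end of the particle array when it computes the next cell address; prefetches themselves do not change program results, so the loop computes the same values with or without them.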
When our compiler inserted prefetches into MP3D, it recognized that the address of a space cell is computed based on the x, y, and z fields in a particle record (which represent the coordinates of the space cell). Since this is an indirect reference, the compiler used the algorithm described in Section to prefetch the particles two iterations ahead, and the space cells one iteration ahead. The scheduler determined that only a single iteration is needed to hide the memory latency, since the loop body is rather large. Therefore the compiler duplicated the hand-inserted approach to prefetching particles and space cells, resulting in a coverage factor of 89%. The compiler also prefetched a few other references at time step boundaries, but they turned out to be insignificant.
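The scheduling decision can be sketched as a small calculation. The text does not give the scheduler's formula, so the version below assumes a common choice: prefetch far enough ahead that the iterations in between cover the memory latency, rounding up. The function names and the cycle counts in the usage note are hypothetical.

```c
/* Assumed rule: iterations ahead = ceil(latency / loop body time),
 * and never fewer than one iteration. */
unsigned prefetch_distance(unsigned latency_cycles, unsigned body_cycles)
{
    if (body_cycles == 0)
        return 1; /* guard against division by zero */
    unsigned d = (latency_cycles + body_cycles - 1) / body_cycles;
    return d == 0 ? 1 : d;
}

/* For an indirect reference, the record that supplies the address
 * must itself arrive before the dependent prefetch is issued, so it
 * is prefetched one extra distance ahead. */
unsigned index_distance(unsigned d)
{
    return 2 * d;
}
```

With a loop body longer than the miss latency (say 100 cycles of latency against a 250-cycle body), `prefetch_distance` returns 1, and `index_distance(1)` returns 2, which matches the pattern of space cells one iteration ahead and particles two iterations ahead.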
Figure (a) shows the performance of both the compiler-based and the hand-inserted prefetching schemes for MP3D. As we see in this figure, they both do quite well. The hand-inserted case performs slightly better simply because the scalar optimizer was able to eliminate a few more instructions in that case, but this difference is basically in the "noise". Therefore the compiler-based scheme appears to be living up to its potential in this case.