In contrast with WATER, BARNES spends 38%of its time stalled for memory; therefore hiding latency can potentially result in significant performance improvements. However, since BARNES contains mainly pointers and recursive data structures, our compiler does not insert any prefetches since such references are beyond the scope of our algorithm. To evaluate how much the compiler might be able to improve performance if it could handle such cases, we inserted prefetches into BARNES by hand. We begin with a quick overview of BARNES and a description of how we inserted prefetches, followed by our experimental results.
In BARNES, the main computation consists of traversing the octree structure to compute the gravitational force exerted by the given body on all other bodies in the tree. This is repeated for each body in the system, and the bodies are statically assigned to processors for the duration of each time step. Cache misses occur whenever a processor visits a part of the octree that is not already in its cache, either due to replacements or communication.
We inserted prefetches into BARNES by hand as follows. The octree structure is traversed depth-first, and each internal node can have up to eight children. When we first arrive at a node, we issue prefetches for all immediate children before descending depth-first into the first child. By doing so, we succeeded in covering 78%of the misses, with only 21%of the prefetches being unnecessary. Roughly half of the prefetched primary misses became primary hits-the other half were found in the secondary cache. These prefetches turned out to be moderately expensive in terms of instruction overhead (on average 12 instructions per prefetch) because of the loop overhead and conditional tests at each node. Overall, we achieved a speedup of 9%, as shown in Figure .
Despite the many pointers in BARNES, it was relatively straightforward to pinpoint the cause of the misses. Through the profiling information provided by our simulator, we discovered that 81%of the miss latency was caused by only six references. In addition, the author of BARNES was readily able to predict these important references without the benefit of profiling information, which indicates that these misses are intuitively obvious to someone familiar with the algorithm. Only five lines of code were added to BARNES to insert the prefetches. The greater challenge in BARNES was scheduling the prefetches early enough to hide the latency, since the control flow consists of recursive procedure calls rather than loops. We were largely successful toward this goal because we understood the structure of the tree and the order in which it was traversed.
Could the compiler insert these same prefetches automatically? With the help of sophisticated pointer analysis to recognize the tree structures and the order in which they are traversed , and perhaps with profiling feedback information to indicate which pointer dereferences suffer misses (see Section ), it may be possible. It certainly will not be easy, since pointer analysis is a difficult problem, but it may be possible.