Several other researchers have also worked on software-controlled prefetching. This section summarizes their work and relates it to ours.
Porterfield [64][9] was the first to explore software-controlled prefetching for uniprocessors. He proposed a compiler algorithm for inserting prefetches into dense-matrix codes, and implemented it as a preprocessing pass that inserted prefetching into the source code. His initial algorithm prefetched all array references in inner loops one iteration ahead. Recognizing that this scheme issued too many unnecessary prefetches, he presented a more sophisticated scheme based on dependence vectors and overflow iterations (i.e., the predicted number of iterations after which a loop will have accessed more data than fits in the cache). Since the simulation occurred at a fairly abstract level, the prefetching overhead was estimated rather than measured, and overall performance numbers were not presented. Moreover, the more sophisticated scheme was not fully automated (the overflow iterations were calculated by hand), and it did not take cache-line reuse into account. Despite leaving many important questions unanswered, Porterfield's work demonstrated that software-controlled prefetching was a promising technique that warranted further exploration.
Klaiber and Levy [42] extended Porterfield's work by recognizing
the need to prefetch more than a single iteration ahead. They included several
memory system parameters in their equation for how many iterations ahead to
prefetch, and inserted prefetches by hand at the assembly-code level. The
results were presented in terms of average memory access latency rather than
overall performance. Also, in contrast with this dissertation, they proposed
prefetching into a separate fetchbuffer rather than directly into the
cache. However, as we will discuss in detail later in Section , using a separate fetchbuffer has a number of important disadvantages, including the sacrifice of chip area that could otherwise be used for a normal cache, and the fact that it becomes very difficult to make prefetches non-binding, which is a crucial property for multiprocessors.
Gornish, Granston and Veidenbaum [32][31] presented
an algorithm for determining the earliest time when it is safe to prefetch
shared data in a multiprocessor with software-controlled cache coherency.
Since the prefetches are binding, all control and data dependencies must be
carefully considered. Their work targets a block prefetch instruction, rather than the single-line prefetches considered in our study. The proposed schemes are evaluated using a large number of numerical
subroutines. Although the speedups predicted from static analysis are quite
high, over twofold, the speedups obtained using detailed simulations are
limited to 10-20%. The complexity of the compiler algorithm presented in
this work illustrates how difficult the compiler's job becomes when binding rather than non-binding prefetches are used. This is in
sharp contrast with the simplicity of our non-binding prefetching algorithm, presented later in Section .
Chen et al. [13] investigated prefetching for non-numerical codes. They attempted to move address generation back as far as possible before loads to hide a small cache miss latency (10 cycles), and found mixed results. Generating addresses early is difficult in non-numerical code because control and data dependencies tend to be tight, and the access patterns can be very irregular. Although these codes are difficult to prefetch, they also tend to make better use of the cache than numerical codes and therefore typically suffer less from memory latency. Because of the difficulty of prefetching highly irregular access patterns and the relatively small expected gains, we chose not to focus on this class of applications during this dissertation.
Tullsen and Eggers [83] post-processed reference
traces to evaluate the performance of an ``oracle'' prefetching scheme on a
bus-based multiprocessor architecture with limited bandwidth. They observed
that if an application is already bandwidth-limited, prefetching cannot
improve performance. However, because of their post-processing trace
methodology, they were not able to make proper use of exclusive-mode
prefetches, which can potentially eliminate up to half of the memory
bandwidth consumption. As we will show later in Section , exclusive-mode prefetching eliminates up to 27% of the memory requests in our applications, which would have translated directly into improved performance in their bus-bandwidth-limited architecture.
We now place this related work in the context of our own research. Porterfield's work on software-controlled prefetching for uniprocessors was done concurrently with our investigation of non-binding software-controlled prefetching for multiprocessors as part of the DASH project [54]. Our study, where we inserted prefetches by hand, was the first to consider non-binding prefetching for multiprocessors [61]. The compiler algorithm presented in this dissertation partially overlapped the work by Porterfield and by Klaiber and Levy, but has a number of key differences. First, our uniprocessor algorithm is more comprehensive in the types of locality it optimizes for, including temporal, spatial, and group locality. Second, unlike previous studies, we have implemented our algorithm in an optimizing compiler, and can therefore generate overall performance numbers for fully-functional codes. Finally, the scope of our complete algorithm is much broader since it also covers indirect references and multiprocessing.