Several other researchers have also worked on software-controlled prefetching. This section summarizes their work and relates it to ours.
Porterfield  was the first to explore software-controlled prefetching for uniprocessors. He proposed a compiler algorithm for inserting prefetches into dense-matrix codes. He implemented his algorithm as a preprocessing pass that inserted prefetching into the source code. His initial algorithm prefetched all array references in inner loops one iteration ahead. He recognized that this scheme was issuing too many unnecessary prefetches, and presented a more sophisticated scheme based on dependence vectors and overflow iterations (i.e. the predicted number of iterations where a loop will access more data than fit in the cache). Since the simulation occurred at a fairly abstract level, the prefetching overhead was estimated rather than measured. Overall performance numbers were not presented. Also, the more sophisticated scheme was not automated, since the overflow iterations were calculated by hand, and did not take cache line reuse into account. Despite leaving many important questions unanswered, Porterfield's work demonstrated that software-controlled prefetching was a promising technique that warranted further exploration.
Klaiber and Levy  extended Porterfield's work by recognizing the need to prefetch more than a single iteration ahead. They included several memory system parameters in their equation for how many iterations ahead to prefetch, and inserted prefetches by hand at the assembly-code level. The results were presented in terms of average memory access latency rather than overall performance. Also, in contrast with this dissertation, they proposed prefetching into a separate fetchbuffer rather than directly into the cache. However, as we will discuss in detail later in Section , using a separate fetchbuffer has a number of important disadvantages, including the sacrifice of chip area that could otherwise be used for a normal cache, and the fact that it becomes very difficult to makes prefetches non-binding, which is a crucial property for multiprocessors.
Gornish, Granston and Veidenbaum  presented an algorithm for determining the earliest time when it is safe to prefetch shared data in a multiprocessor with software-controlled cache coherency. Since the prefetches are binding, all control and data dependencies must be carefully considered. Their work is targeted for a block prefetch instruction, rather than the single-line prefetches considered in our study. The proposed schemes are evaluated using a large number of numerical subroutines. Although the speedups predicted from static analysis are quite high, over twofold, the speedups obtained using detailed simulations are limited to 10-20%. The complexity of the compiler algorithm presented in this work illustrates how difficult the compiler's job becomes when binding rather than non-binding prefetches are used. This is in sharp contrast with the simplicity of our non-binding prefetching algorithm, presented later in Section .
Chen et al.  investigated prefetching for non-numerical codes. They attempted to move address generation back as far as possible before loads to hide a small cache miss latency (10 cycles), and found mixed results. Generating addresses early is difficult in non-numerical code because control and data dependencies tend to be tight, and the access patterns can be very irregular. Although these codes are difficult to prefetch, they also tend to make better use of the cache than numerical codes and therefore typically suffer less from memory latency. Because of the difficulty of prefetching highly irregular access patterns and the relatively small expected gains, we chose not to focus on this class of applications during this dissertation.
Tullsen and Eggers  post-processed reference traces to evaluate the performance of an ``oracle'' prefetching scheme on a bus-based multiprocessor architecture with limited bandwidth. They observed that if an application is already bandwidth-limited, prefetching cannot improve performance. However, because of their post-processing trace methodology, they were not able to make proper use of exclusive-mode prefetches, which can potentially eliminate up to half of the memory bandwidth consumption. As we will show later in Section , exclusive-mode prefetching eliminates up to 27%of the memory requests in our applications, which would have translated directly into improved performance in their bus-bandwidth-limited architecture.
We now place this related work in the context of our own research. Porterfield's work on software-controlled prefetching for uniprocessors was done concurrently with our investigation of non-binding software-controlled prefetching for multiprocessors as part of the DASH project . Our study, where we inserted prefetches by hand, was the first to consider non-binding prefetching for multiprocessors . The compiler algorithm presented in this dissertation partially overlapped the work by Porterfield and by Klaiber and Levy, but has a number of key differences. First, our uniprocessor algorithm is more comprehensive in the types of locality it optimizes for, including temporal, spatial, and group locality. Second, unlike previous studies, we have implemented our algorithm in an optimizing compiler, and can therefore generate overall performance numbers for fully-functional codes. Finally, the scope of our complete algorithm is much broader since it also covers indirect references and multiprocessing.