The goal of the analysis phase of prefetching is to decide which references should be prefetched. With the software-based approach, this analysis is performed by considering all references in the code and deciding whether or not they need to be prefetched. If so, the prefetches are scheduled by moving them back in time within the code. The hardware, however, does not have the luxury of examining the entire program and then moving prefetches back in time. The main challenge of deciding what to prefetch in hardware is predicting what locations will be referenced in the future. In the various hardware-based prefetching schemes that have been proposed, this prediction is achieved in one of three ways: (i) assume that there is abundant spatial locality (as in the schemes Porterfield studied ); (ii) decode ahead in the instruction stream (as in Lee's proposal ); or (iii) maintain a history of past access patterns (as in Baer and Chen's proposal ). The idea behind the first technique is that whenever a cache line is accessed, neighboring cache lines will be accessed in the near future, so they should also be brought into the cache. The problem, of course, is that this is not always true, and fetching lines unnecessarily can hurt performance both by displacing useful data and by causing memory queueing delays. The second technique attempts to predict access patterns by simply ``getting ahead'' in the instruction stream. However, the limitations of buffering capacity and branch prediction accuracy make it difficult to decode far enough ahead to hide large latencies. Finally, by maintaining history information, the hardware attempts to recognize constant-stride access patterns and prefetches ahead whenever those same instructions are executed. This last scheme appears to be the best of the three hardware-based schemes since it can prefetch constant-stride access patterns with arbitrary stride lengths (unlike the ``long cache line'' schemes which only exploit unit-stride patterns) without the expense of decoding ahead in the instruction stream (as in Lee's scheme). However, the analysis capability of a history-based scheme has the following limitations: (i) it can only recognize constant-strides accesses (unlike our compiler which can prefetch indirect references); (ii) it can only react to historical behavior (unlike the compiler, which can anticipate future misses); and (iii) the detection scheme depends upon the history table being large enough to contain the entire working set of references, which may not be possible (particularly given that this structure is content-addressable in order to look up the address of the lookahead PC).
So far we have described how the hardware predicts which locations will be referenced in the future. In contrast with the software-based approaches, the hardware typically does not bother to predict whether or not the data is already in the cache. Instead, for each reference the hardware predicts will be referenced in the future, it simply checks the primary cache to see whether the data is already present. The obvious advantage of always probing the cache is that the hardware will never be fooled into predicting that data is in the cache (thereby not issuing a prefetch) when it actually is not. For example, if the operating system swaps out the process and later restarts it, a software-based scheme would have no idea that the entire working set had been flushed from the cache, but a hardware-based scheme would still prefetch the data correctly. Part of the reason why the hardware always probes the cache is that doing so involves no instruction overhead. However, it may cause cache contention overheads, as we will discuss later in this subsection. In general, the primary disadvantage of performing the analysis in the hardware is that it cannot detect complex access patterns such as indirect references.