The first step in executing a prefetch is translating the virtual data address to a physical address. Address translation is accelerated in modern RISC processors through a ``translation lookaside buffer'' (TLB), which is simply a cache of recent virtual-to-physical address mappings. Hence the first question is whether a prefetch should be dropped if its virtual address does not match an entry in the TLB; otherwise a TLB fault handler must be run, which is a relatively expensive operation.
The answer to this question is complicated by two conflicting goals. On the one hand, we would like to hide the latency in situations where we are legitimately suffering frequent TLB misses, and this is impossible if the prefetch is dropped. An example would be code that iterates across the outer dimensions of large matrices, in which case each reference may be to a unique page. On the other hand, one of the desirable properties of prefetch instructions (as mentioned earlier in Section ) is that they are free to reference invalid addresses, in which case we would like to drop the prefetch with minimal performance loss. Since TLBs do not contain invalid address mappings, an invalid address can only be detected by performing full address translation, hence suffering the cost of a TLB miss (which can potentially be hundreds of cycles). This second scenario may occur frequently in code containing pointers and other indirect references, in which case the TLB miss overhead may be prohibitively expensive.
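To make the second scenario concrete, the following sketch shows pointer-chasing code that prefetches one node ahead down a linked list. The final prefetch necessarily passes an invalid address (NULL), and in general the next pointer may reference an unmapped page; on a machine that performs full translation for prefetch TLB misses, each such address would incur the miss penalty described above. The use of the GCC/Clang `__builtin_prefetch` builtin here is illustrative, not part of the original text:

```c
#include <stddef.h>

struct node {
    struct node *next;
    int payload;
};

/* Illustrative sketch: software prefetching down a linked list.  The
   prefetched address (n->next) is NULL at the end of the list and may
   point to an unmapped page in general, so a design that walks the
   page table on every prefetch TLB miss would pay the full translation
   cost for these invalid addresses. */
int sum_list(struct node *n)
{
    int sum = 0;
    while (n != NULL) {
        /* Non-binding prefetch of the next node; the address need not
           be valid, which is exactly the property under discussion. */
        __builtin_prefetch(n->next);
        sum += n->payload;
        n = n->next;
    }
    return sum;
}
```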
Although choosing between these two goals is difficult, since each is important in its own scenario, we can start by comparing their expected frequencies. The case where legitimate TLB misses occur frequently is somewhat unlikely, for the following reasons. First, it can only be a sustained problem for applications having both very large data sizes and very large (at least page-sized) strides. Although both of these may occur in some scientific codes, it is far more common to see smaller strides as the code iterates through the inner dimensions of matrices. Smaller strides are advantageous since they exploit spatial locality by reusing cache lines, and we would expect locality optimizations such as loop interchange (as demonstrated in Section ) to continue enhancing this in the future. Second, since legitimate TLB misses would occur even without prefetching, presumably processor designers have already dealt with this problem by making the TLB sufficiently large. In contrast, invalid prefetch addresses may occur frequently in any code containing indirect references, and this frequency is independent of both the data size and the number of TLB entries. Also, given the inherent difficulty of prefetching code containing recursive data structures (as we encountered with PTHOR and BARNES in Section ), the additional burden of TLB miss penalties on invalid addresses is likely to make the task hopelessly frustrating. Therefore, if a single fixed policy must be chosen, it is probably better to drop prefetches on TLB misses.
An alternative to choosing a fixed policy is to allow the software to select the more appropriate policy by making use of the prefetch hint bits described earlier in Section . For example, there could be two types of prefetches: ``speculative'' prefetches, which should be dropped on TLB misses since the address may be invalid, and ``non-speculative'' prefetches, where it is better to suffer the TLB miss for the sake of hiding the latency. This approach satisfies both goals, and may lead to the best overall performance.
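The two-policy scheme can be sketched as the following decision procedure. This is a software model of hypothetical hardware behavior, not an actual implementation: the one-entry TLB, the page-table contents, and all function names are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hint carried by the prefetch instruction (a use of the hint bits). */
enum prefetch_kind { PF_SPECULATIVE, PF_NONSPECULATIVE };

/* Toy one-entry TLB: virtual page 0x1000 maps to physical page 0x8000. */
static bool tlb_lookup(uint64_t vpn, uint64_t *ppn)
{
    if (vpn == 0x1000) { *ppn = 0x8000; return true; }
    return false;
}

/* Full address translation (the expensive path in real hardware).  In
   this toy page table, only page 0x2000 is additionally mapped; all
   other pages are invalid. */
static bool page_walk(uint64_t vpn, uint64_t *ppn)
{
    if (vpn == 0x2000) { *ppn = 0x9000; return true; }
    return false;
}

/* Returns true if the prefetch should be issued (with *ppn filled in),
   or false if it should be dropped. */
bool handle_prefetch(uint64_t vpn, enum prefetch_kind kind, uint64_t *ppn)
{
    if (tlb_lookup(vpn, ppn))
        return true;              /* TLB hit: always issue */
    if (kind == PF_SPECULATIVE)
        return false;             /* miss: drop, address may be invalid */
    return page_walk(vpn, ppn);   /* pay for translation; drop only if
                                     the address is truly unmapped */
}
```

Note how the hint bit only matters on the TLB miss path: speculative prefetches never pay the translation cost, while non-speculative prefetches trade that cost for the chance to hide memory latency.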
Once a valid physical address has been computed for a prefetch, it is ready to be issued to the memory subsystem. The mechanics of how a prefetch normally proceeds through the memory subsystem will be discussed later in Section . However, even after a prefetch has been issued to the memory subsystem, it is still possible to abort it before it completes. The scenario where this might make sense is when the memory subsystem queues are already full and the prefetch cannot proceed without stalling the processor, as we will discuss next.