A lockup-free cache is a common requirement for most latency-hiding techniques, including prefetching, relaxed consistency models, non-blocking loads, and multithreading. The complexity of implementing a lockup-free cache depends on which of these techniques it is intended to support (as described in detail by Laudon [52]). For example, if the goal is simply to support multiple outstanding prefetches, then it is not strictly necessary for the processor to maintain state on outstanding transactions, as long as the cache is prepared to receive prefetch responses from outside the processor while the processor is simultaneously issuing new requests. In contrast, supporting multiple outstanding stores (as with relaxed consistency models) or loads (if they are non-blocking) does require that special state be maintained for each outstanding access. For stores, the stored data must be merged into the cache line when it returns. For loads, the requested data must be forwarded directly to a register (thus requiring state to associate each outstanding access with the register(s) waiting for the value), and any future uses of that register must interlock if the value has not yet returned.
Kroft [45] presented the original lockup-free cache design, which adds structures called ``miss information/status handling registers'' (MSHRs) to keep track of outstanding misses. Each MSHR contains enough state to handle one or more accesses of any type to a single memory line. Due to the generality of the MSHR mechanism, the amount of state involved is non-trivial, including the address, pointers to the cache entry and destination register, written data, and various other pieces of state. The majority of subsequent lockup-free cache proposals have been variations of this original MSHR scheme [46][79][70][63]. An alternative approach is to maintain the state of outstanding misses in the cache tag array itself [52][17], thereby permitting a larger number of outstanding misses.
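To make the amount of per-miss state concrete, the following sketch shows roughly what a single MSHR entry might hold; the field names and line size are illustrative assumptions, not Kroft's exact design.
\begin{verbatim}
#include <cstdint>

// Hypothetical sketch of the state one MSHR tracks for a single memory
// line; field names and line size are assumptions, not Kroft's design.
struct MSHREntry {
    static const int WordsPerLine = 8;   // assumed line size (in words)

    bool     valid;                      // entry is tracking an outstanding miss
    uint64_t lineAddress;                // address of the missing memory line
    int      cacheSlot;                  // where the line is placed on refill

    // Per-word state for the accesses merged into this miss:
    bool     written[WordsPerLine];      // words holding buffered store data
    uint32_t storeData[WordsPerLine];    // data to merge into the line on refill
    int      destRegister[WordsPerLine]; // register awaiting each load (-1 if none)
};
\end{verbatim}
A load miss records its destination register here so the returning data can be forwarded to it, which is exactly the interlock state discussed above; a store miss records the written data to be merged into the line on refill.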
For the uniprocessor architecture used in Chapter
, the lockup-free cache supports a single load or
store miss (since load and store misses directly stall the processor) and up to sixteen prefetch misses. The state of outstanding
prefetch misses is maintained in a sixteen-entry prefetch issue
buffer. This structure is simpler than a normal MSHR because it contains
only the prefetch addresses and any state necessary to track or control
outstanding misses. It does not contain any data, since the
prefetched data is placed directly into the cache hierarchy.
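A sketch of one such entry is shown below; the fields beyond the prefetch address are assumptions for illustration, since the structure only needs the address plus whatever state is required to track or control the outstanding miss.
\begin{verbatim}
#include <cstdint>

// Sketch of a prefetch issue buffer entry: unlike an MSHR, it holds no
// data, only the line address and minimal tracking state.
struct PrefetchIssueEntry {
    bool     valid;          // entry currently in use
    uint64_t lineAddress;    // line being prefetched
    bool     issuedToMSHR;   // miss has been handed to an MSHR (vs. still buffered)
};
\end{verbatim}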
To illustrate how the prefetch issue buffer is used, we will walk through
the steps of processing a prefetch. When the processor executes a prefetch
instruction, the primary data cache is checked during that cycle. If the
data is already present in the cache, then the prefetch is considered
completed at that point. Otherwise, the prefetch address is compared
against the addresses already in the prefetch issue buffer. If there is a
match, then the prefetch is dropped. Otherwise, a new entry is allocated in
the prefetch issue buffer for the new prefetch. At this point, the prefetch will proceed through the
memory hierarchy to first find the data and then place it in the primary
cache, as was discussed in Section
. An entry
in the prefetch issue buffer is deallocated as soon as the prefetch
completes (as opposed to waiting for the data to actually be referenced).
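The steps above can be summarized in the following sketch, which builds on the PrefetchIssueEntry structure sketched earlier; the sixteen-entry array and the helper names are assumptions made for illustration, not the simulator's interface.
\begin{verbatim}
#include <cstdint>

// Illustrative handling of a prefetch against a sixteen-entry issue buffer.
// A return value of true means a miss should be sent down the hierarchy.
const int BufferEntries = 16;
PrefetchIssueEntry buffer[BufferEntries];

bool onPrefetch(uint64_t lineAddr, bool hitInPrimaryCache) {
    if (hitInPrimaryCache)                       // step 1: cache hit -> done
        return false;
    for (int i = 0; i < BufferEntries; i++)      // step 2: match -> drop
        if (buffer[i].valid && buffer[i].lineAddress == lineAddr)
            return false;
    for (int i = 0; i < BufferEntries; i++) {    // step 3: allocate an entry
        if (!buffer[i].valid) {
            buffer[i] = {true, lineAddr, false};
            return true;                         // issue miss to the hierarchy
        }
    }
    return false;                                // buffer full (handled per policy)
}

// Step 4: deallocate the entry as soon as the prefetch completes, i.e.
// when the line has been placed in the primary cache.
void onPrefetchComplete(uint64_t lineAddr) {
    for (int i = 0; i < BufferEntries; i++)
        if (buffer[i].valid && buffer[i].lineAddress == lineAddr)
            buffer[i].valid = false;
}
\end{verbatim}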
Although a prefetch issue buffer is not strictly necessary, it offers the
following performance advantages. First, in cases where the prefetch miss
cannot be serviced immediately (perhaps because of a limited number of MSHRs),
it allows the processor to continue executing by buffering the prefetch
request. Second, by keeping track of the prefetches that are already
outstanding, it provides a mechanism for merging subsequent prefetch and
memory references to matching cache lines. Merging a prefetch with a previous
prefetch simply means dropping the subsequent prefetch. This helps to
avoid unnecessary bandwidth consumption further down the memory hierarchy,
similar to the benefit of checking the primary cache when searching for the
prefetched data (as discussed earlier in Section
). However, with selective prefetching, the
compiler is generally good at avoiding back-to-back redundant prefetches (at
least for these scientific codes). The more important benefit occurs when a
load is merged with a previous prefetch that was partially completed, since
this allows the load to benefit from whatever partial latency-hiding the
prefetch has been able to accomplish. This is especially important for cases
where data or control-flow dependencies make it impossible to issue prefetches
early enough to hide the entire memory latency, which can occur frequently in
pointer-based codes such as PTHOR (e.g., when attempting to prefetch a linked
list).
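As a rough sketch of this merging benefit, the effective latency seen by a load that matches an in-flight prefetch is only the remaining portion of that miss; the structure and names below are assumptions for illustration.
\begin{verbatim}
// If the line a load misses on already has an outstanding prefetch, the
// load is merged with that miss and waits only for its remaining latency.
struct InFlightPrefetch {
    bool outstanding;       // a prefetch to this line is still in progress
    int  cyclesRemaining;   // latency the prefetch has not yet hidden
};

int effectiveLoadMissLatency(const InFlightPrefetch &p, int fullMissLatency) {
    if (p.outstanding)
        return p.cyclesRemaining;   // partial latency hiding from the prefetch
    return fullMissLatency;         // no prefetch: pay the full memory latency
}
\end{verbatim}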
Figure shows how the prefetch issue buffer
is incorporated into our uniprocessor architecture. In this figure, we have
shown distinct MSHRs to allow for the possibility that misses are handled
separately from the buffering of prefetch requests. Note that in our model, a
prefetch remains in the prefetch issue buffer until it is completed by its
MSHR. In our experiments so far, we have assumed a sixteen-entry prefetch issue buffer and seventeen MSHRs (one for each prefetch, plus one for either a load or a store). We chose these large sizes to minimize their
effect on performance. We will now examine how many prefetch issue buffer
entries and MSHRs are actually needed for our benchmarks.
Before running these experiments with reduced prefetch buffering and MSHR
capacities, we can gain insight into what to expect by examining the number
of outstanding prefetches in the original ``full-capacity'' architecture.
Figure plots the distribution of the number of prefetch misses already outstanding at the time of each prefetch miss for the original architecture. As we see in Figure , the distributions vary considerably across applications. At one extreme is EP, where there is almost never more than one outstanding prefetch. At the other extreme are programs like CHOLSKY, where the distribution is spread out over larger numbers of previously outstanding prefetches (i.e., the distribution has a significant tail). We would expect these latter
types of applications to benefit from having multiple MSHRs and deep
prefetch buffers.
To measure the actual impact on performance, we first varied the number of
MSHRs while holding the depth of the prefetch issue buffer fixed at sixteen
entries. Figure shows the results of these
experiments. A single MSHR corresponds to a ``blocking'' cache (i.e., one that is not lockup-free), while seventeen MSHRs correspond to the original architecture. Note that with fewer than seventeen MSHRs, there may be more outstanding miss requests (up to seventeen) than can be serviced at a given time, in which case prefetch misses must compete with load and store misses for the MSHRs. We resolved such cases by giving load or store misses (which directly stall the processor) priority for the next available MSHR; otherwise, the oldest unserviced prefetch is awarded the next available MSHR.
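A minimal sketch of this arbitration policy is shown below, assuming a queue of buffered prefetches ordered oldest-first; the types and names are illustrative, not the simulator's.
\begin{verbatim}
#include <cstdint>
#include <deque>

enum MissKind { LoadOrStoreMiss, PrefetchMiss };
struct PendingMiss { MissKind kind; uint64_t lineAddress; };

// Pick the request that gets the next free MSHR: a stalling load/store
// miss always wins; otherwise the oldest unserviced prefetch is chosen.
bool pickNextForFreeMSHR(bool haveStalledLoadOrStore,
                         const PendingMiss &stalledLoadOrStore,
                         std::deque<PendingMiss> &bufferedPrefetches,
                         PendingMiss &chosen) {
    if (haveStalledLoadOrStore) {
        chosen = stalledLoadOrStore;           // processor-stalling miss first
        return true;
    }
    if (!bufferedPrefetches.empty()) {
        chosen = bufferedPrefetches.front();   // oldest buffered prefetch
        bufferedPrefetches.pop_front();
        return true;
    }
    return false;                              // nothing waiting for an MSHR
}
\end{verbatim}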
As we see in Figure , the benefits of a lockup-free
cache are often substantial. Seven of the thirteen applications showed at
least a 15% performance improvement when going from one to two MSHRs. In six of those seven cases, there was at least an additional 5% improvement
from having four MSHRs. The benefits of moving from four to seventeen MSHRs
were typically quite small. These performance improvements correspond to
what we would expect from the distributions shown in Figure
. In particular, the largest improvements occur in
the benchmarks with the largest tails in their distributions (e.g.,
CHOLSKY, GMTRY, VPENTA, and CG). Therefore we see that it is important to
have a lockup-free cache, and that four outstanding misses are sufficient to
capture most of the benefits for these applications.
Given that four MSHRs are sufficient, the next question is whether buffering
additional prefetch requests beyond that is advantageous, or whether the
prefetch issue buffer should also be reduced to four entries. To evaluate
this, we simulated the performance when the number of prefetch issue buffer
entries is reduced, while holding the number of MSHRs fixed at four. Figure
shows the two cases (CHOLSKY and VPENTA) where this
caused a noticeable difference in performance (in six of the other cases,
there was less than a 2% difference between four and eight buffer entries, and no difference between eight and sixteen buffer entries). CHOLSKY showed the largest improvement, with a 6% speedup when going from four to eight buffer entries, and an additional 5% speedup with sixteen entries. VPENTA showed a 5% speedup between four and eight entries, and an additional 1% speedup with sixteen entries. Therefore we see that additional
prefetch buffering is beneficial in a small number of cases. However,
given that this benefit is relatively small and infrequent, the
hardware cost would also have to be quite small for it to be justified.