Hardware support for a coherency mechanism is also needed in multiprocessors.

Lockup-free caches permit both buffering and pipelining. For buffering alone, write buffers are sufficient.

Explicit synchronization must be identified for multiprocessors.

The replicated thread state may include replicated register files.

If the primary cache is checked before prefetches proceed further through the memory hierarchy (we will discuss this in more detail later in Section ), then the additional memory subsystem loading caused by unnecessary prefetches will be limited to increased primary cache tag contention.

``Memory line'' refers to cache-line-sized blocks of contiguous memory, which are the unique elements that can be mapped into cache entries.

If the data are stored in column-major order, then accesses to the same cache line can only occur when the same column is accessed.

In reality, the alignment of the two references with respect to the cache line boundaries may also be a concern. In the worst case, adjacent references may always straddle cache line boundaries, potentially never reusing the same cache lines. In our implementation, we account for this probabilistically by assuming that two references within half a cache line width of each other probably fall within the same cache line.

In the general case, 0 would be replaced by the lower bound of the loop, and a constant offset would be added to the modulo prefetch predicate for spatial locality in Table .

Since scalar optimization is the most time-consuming part of compilation, we assume that running full scalar optimization twice is not a viable option.

Multiple store misses cannot be outstanding because the processor stalls on a store miss.

R0 is a reserved register that returns the value 0 when used as a source operand. A load to R0 is essentially a NOP, and therefore this code sequence is not normally generated by the compiler.

We chose explicitly-parallelized applications because we wanted to use highly-optimized, ``real'' applications for our study, and therefore did not want to be constrained by the limitations of today's automatic parallelizing compiler technology. The explicitly-parallelized cases are also interesting because the compiler has the least amount of information to work with, which makes getting good performance more challenging for the prefetching algorithm.

One technique that may help in this case is a ``producer prefetch'', as implemented in the DASH prototype. With a producer prefetch, the processor that generates the column would send it out to the processors waiting for it. This technique has better hot-spotting performance than the ``consumer prefetch'' we normally use, since only a single message is sent out from the hotspot, rather than having large numbers of request messages back up at the hotspot before the responses are sent out.

We observed this in practice. At first, our scalar optimizer was not taking advantage of the constant offset difference. The performance impact was devastating in many cases, once register spilling occurred.

If TLB refills (i.e. address translation) are handled by hardware, then the prefetch can potentially hide the latency of both the TLB refill and the memory access. On the other hand, if TLB refills are handled through software exceptions (as in the MIPS architecture), then only the memory latency can be hidden.

The two cases that are not shown are TOMCATV and OCEAN. These two applications suffered more from memory latency when the caches were not checked than any of the other applications, so much so that the simulations never ran to completion.

Note that this is a conservative approximation because a prefetch may have been displaced by another prefetch, in which case the displaced prefetch has no benefit, but also does not increase the number of misses by displacing data that would not have otherwise missed.

In an architecture that prefetches only into the secondary cache, prefetches would have a cost and no benefit in cases where the data is found in the secondary cache.

Note that there is even more to providing memory subsystem bandwidth than having a lockup-free cache. In addition, main memory must have sufficient bandwidth to service these multiple outstanding misses.

Stores stall the processor since the R4000 (the processor on which we base our uniprocessor architecture) has a write-back cache and no write buffer.

If the prefetch issue buffer is already full, then the processor must either stall until an entry becomes available or drop the prefetch. The tradeoff between these two choices was described earlier in Section , and for this study we will stall until an entry becomes available.

The one possible exception is when a read-exclusive prefetch follows a read-shared prefetch of the same line. In this case, if the read-shared prefetch has not been issued to the memory system yet, it should be converted to a read-exclusive prefetch. Otherwise, the options are to either drop the read-exclusive prefetch or else issue it separately. We chose the latter for our simulations, but it made little difference because our selective prefetching algorithm is good at avoiding such redundancies.

The reason for this difference is that our uniprocessor architecture is based on the MIPS R4000 processor, while the multiprocessor architecture is patterned after DASH, which contains R3000 processors.

An example of this is the pixie utility provided by MIPS Computer Systems, Inc. Pixified code generally runs roughly half as fast as the original code.

Scheduling prefetches the right amount of time in advance becomes much more difficult when software pipelining cannot be used (e.g., when traversing a linked list). In such cases, dependencies may make it quite difficult to move prefetches sufficiently far in advance to hide the latency.

The exception would be if the stride detection hardware only looked at references that suffered cache misses.

The RC model would even allow multiple read misses to occur simultaneously, but this does not occur in our architecture since it does not support non-blocking loads.

We show OCEAN with only two contexts because the problem size we use cannot be run with 64 processes.

The instructions category includes both ``useful'' instructions and instructions wasted while spinning on idle work queues. For our previous experiments, these latter instructions were included in the synchronization rather than the instruction category.
