As in the uniprocessor study, the effectiveness of the scheduling algorithm can be evaluated by how often prefetched misses are converted into primary cache hits. Figure breaks down what happened to the original primary cache misses, similar to Figure in the uniprocessor study. However, since there are more possible causes of ineffective prefetches in the multiprocessor architecture, we have broken the misses down into several more specific categories.
The topmost section of the bars in Figure (labeled nopf-miss) still represents the misses that are not prefetched and therefore obviously remain cache misses. These are the misses that are not covered by the coverage factor, which was discussed in the previous subsection. The bottom section of the bars (labeled pf-hit) includes the original misses that are prefetched and subsequently become primary hits, which corresponds to effective scheduling. The remaining four cases in the middle of the bars (labeled pf-miss) are those where prefetches are not effective. We discuss each of these cases in detail below.
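To make the taxonomy concrete, the decision procedure behind these categories can be sketched as follows. This is an illustrative classifier, not the simulator's actual code; the flag names (`prefetched`, `arrived_in_time`, and so on) are assumptions about what would be recorded for each original miss.

```c
/* Illustrative sketch of the miss categories in the figure.  The flags
   are hypothetical per-miss observations; the category names match the
   bar labels in the text. */
typedef enum {
    NOPF_MISS,      /* never prefetched: remains a cache miss          */
    PF_HIT,         /* prefetched and hit in the primary cache         */
    PF_TOO_LATE,    /* prefetched, but the data had not yet returned   */
    PF_INVALIDATED, /* prefetched, then invalidated to keep coherence  */
    PF_IN_SCACHE,   /* replaced from primary, but found in secondary   */
    PF_REPLACED     /* replaced from both primary and secondary caches */
} miss_cat;

miss_cat classify(int prefetched, int arrived_in_time,
                  int invalidated, int in_primary, int in_secondary)
{
    if (!prefetched)      return NOPF_MISS;      /* outside the coverage factor */
    if (in_primary)       return PF_HIT;         /* effective scheduling        */
    if (!arrived_in_time) return PF_TOO_LATE;    /* latency not fully hidden    */
    if (invalidated)      return PF_INVALIDATED; /* non-binding semantics       */
    if (in_secondary)     return PF_IN_SCACHE;   /* partial latency hiding      */
    return PF_REPLACED;                          /* evicted from both caches    */
}
```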
The ``pf-miss: too late'' category includes cases where the prefetches were not issued early enough to hide the latency, and therefore the data item had not returned to the cache before it was referenced. This case was only noticeable in one application: LU. The problem in LU is that prefetches are delayed due to ``hot-spotting'' in the network. The hot-spotting occurs because several processors are often waiting for the same column to be produced by a single processor. Once that processor signals that the column is ready, the other processors simultaneously send numerous prefetches to that processor, requesting the data. The prefetches subsequently exceed the bandwidth capabilities of that particular network node, resulting in queueing delays. This problem is not remedied by simply scheduling the software pipeline to hide a larger latency, since the prefetches cannot be issued before the synchronization point, and after that point the request rate simply exceeds the bandwidth capacity of the network node. In fact, when LU was recompiled for twice the normal target memory latency (i.e. 600 rather than 300 cycles), the performance was considerably worse. So to some extent these delays are unavoidable.
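The structure of the problem can be sketched in C. This is a hypothetical consumer loop in the spirit of LU, not the actual code; the names (`col_ready`, `column`) and the lookahead distance `PF_DIST` are illustrative, and `__builtin_prefetch` is the GCC/Clang intrinsic standing in for the architecture's prefetch instruction. The key point is that no prefetch can be issued until the spin loop falls through, which is why every waiting processor bursts its prefetches at the producer's node at the same instant.

```c
#include <stddef.h>

#define N 1024
#define PF_DIST 8  /* iterations of lookahead, roughly latency / work-per-iteration */

/* In this single-threaded sketch the producer has already signalled. */
static volatile int col_ready = 1;
static double column[N];  /* the shared column produced by one processor */

double consume_column(void)
{
    while (!col_ready)   /* the synchronization point: spin until the  */
        ;                /* producer signals that the column is ready  */

    /* No prefetch may be issued before this point, so all waiting
       processors fire their prefetch bursts simultaneously -- exceeding
       the bandwidth of the producer's network node (the hot-spot). */
    double sum = 0.0;
    for (size_t i = 0; i < N; i++) {
        if (i + PF_DIST < N)
            __builtin_prefetch(&column[i + PF_DIST], 0, 3);
        sum += column[i];
    }
    return sum;
}
```

Scheduling for a larger latency cannot help here: it would only increase `PF_DIST`, not move the burst of prefetches earlier than the spin loop allows.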
The ``pf-miss: invalidated'' category includes the cases where the prefetched data item is brought into the cache but is invalidated before it can be referenced. This situation arises because we use non-binding prefetches, and therefore sometimes the data must be invalidated to preserve correctness. This category was most prominent in MP3D, where it accounted for roughly half of the pf-miss cases. In MP3D, some of the objects that are prefetched (the ``space cells'') are very actively shared and modified amongst the processors. Therefore in many cases another processor wants to modify a location between the time it is prefetched and referenced. This effect is magnified to some extent by the fact that we use exclusive-mode prefetches, which means that a processor can invalidate other copies of the line when it expects to write the line in the near future, not only when it is actually modifying the line. The misses in the ``pf-miss: invalidated'' category tend to be expensive, since the data is typically found far from the processor (i.e. rather than being in the home memory, it is usually in a remote processor's cache). We see that this is true in Table , where the average read miss latency for MP3D is only reduced from 90.1 to 83.0 cycles, which is a meager reduction compared to the other applications. Although these invalidations are an interesting effect, the prefetches that are successful still manage to eliminate 82% of the miss stall time in MP3D. The fact that the invalidations allow the prefetches to be non-binding is a benefit that far outweighs the cost of prefetches that are occasionally unsuccessful due to invalidations.
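A minimal sketch of exclusive-mode, non-binding prefetching, using the GCC/Clang `__builtin_prefetch` intrinsic with write intent as a stand-in for the architecture's exclusive-mode prefetch; the `cell_t` layout and `PF_DIST` are illustrative assumptions in the spirit of MP3D's space cells.

```c
#include <stddef.h>

/* Hypothetical actively-shared object, loosely modeled on a space cell. */
typedef struct { double density; int count; } cell_t;

#define NCELLS  256
#define PF_DIST 4           /* illustrative lookahead distance */

static cell_t cells[NCELLS];

int update_cells(void)
{
    int total = 0;
    for (size_t i = 0; i < NCELLS; i++) {
        if (i + PF_DIST < NCELLS)
            /* Second argument 1 = write intent: fetch the line in
               exclusive state, invalidating other cached copies in
               anticipation of the store below.  Because the prefetch is
               non-binding, another processor may in turn invalidate the
               line before we reach the store; the store then simply
               misses again, so correctness is never at risk -- only the
               latency-hiding benefit is lost.                         */
            __builtin_prefetch(&cells[i + PF_DIST], 1, 3);
        cells[i].count += 1;
        total += cells[i].count;
    }
    return total;
}
```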
The ``pf-miss: replaced'' category includes cases where the prefetched data is brought into the primary cache but is subsequently replaced from both the primary and secondary caches before it can be referenced. This case did not occur often for any of the applications.
Finally, the ``pf-miss: in-scache'' category includes cases where prefetched data is replaced from the primary cache before it can be referenced, but is still found in the secondary cache. Fortunately this is the dominant case for all the applications, as it was in the uniprocessor study. This is the least harmful of the pf-miss categories, since the latency of a miss to the secondary cache is small relative to other types of misses; in many cases the latency of an expensive miss has been reduced to that of a relatively cheap one (i.e. it represents partial latency-hiding). For example, the majority of misses in this category for MP3D were formerly ``dirty remote'' misses, which have latencies of at least 132 cycles. Therefore the latency is reduced by roughly an order of magnitude when the data is found in the secondary cache.
Although the ``pf-miss: in-scache'' category is the least harmful of the pf-miss categories, it is still obviously less desirable than the pf-hit category. In the case of CHOLESKY, only 10% of the prefetched primary misses became primary hits; the other 90% were found in the secondary cache. Unfortunately, in this case nearly all of those latter misses were in the secondary cache to begin with, so there is no partial latency-hiding benefit. On the other hand, most of the misses that became primary hits were originally expensive remote misses. All of these prefetches are for the same structures: the panels. Once a panel is ``ready'', it is repeatedly applied to other panels through a procedure call. When a panel is first used, the prefetches within the procedure succeed in fetching it from a remote location into the primary cache. However, on subsequent calls to this procedure, the prefetches for that same panel fail to bring it from the secondary to the primary cache due to conflicts with other references. These conflicts are with other panel references in the key loop, where eight separate columns within the panel are referenced on each iteration. To some extent, the conflict problem is also aggravated because we schedule for the worst-case latency: although this allows prefetching to work for the expensive misses (which is important), it causes the data to arrive in the primary cache unnecessarily early when it is found in the secondary cache. Treating these cases differently is difficult, since they are simply different calls to the same procedure. It is unfortunate that scheduling is not more effective for CHOLESKY, considering how well locality analysis performs: only 15% of the prefetches are unnecessary and the coverage factor is 85%. The performance of LOCUS is also hindered by a similar conflict problem, but to a lesser extent.
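The early-arrival problem follows directly from how the lookahead is computed. A minimal sketch, assuming the usual rule of dividing the target latency by the work in one loop iteration and rounding up (the 300- and 600-cycle latencies are from the text; the 40-cycle iteration time below is an illustrative assumption):

```c
/* Iterations of software-pipeline lookahead needed to hide a given
   latency: ceil(target_latency / cycles_per_iter).  Both arguments are
   in processor cycles. */
int prefetch_iters(int target_latency, int cycles_per_iter)
{
    return (target_latency + cycles_per_iter - 1) / cycles_per_iter;
}
```

With a hypothetical 40-cycle loop body, scheduling for the worst-case 300-cycle latency gives 8 iterations of lookahead. When the line is actually sitting in the secondary cache, it arrives in the primary cache hundreds of cycles before its use and can be displaced in the meantime by the conflicting panel references.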
To summarize, we have seen that the scheduling algorithm is often quite successful, resulting in substantial speedups for OCEAN, LU, and MP3D. However, there is room for further improvement in CHOLESKY and LOCUS by reducing cache conflicts, which we will address later in Section . Despite these conflicts, which displace data from the primary cache into the secondary cache, the benefit of hiding remote latencies resulted in the elimination of at least 30% of memory stall cycles in all cases.