
Dealing with Cache Conflicts

Many commercial RISC microprocessors have direct-mapped primary caches. This is because direct-mapped caches have faster access times than set-associative caches, and the difference in speed often outweighs the difference in miss rate for general-purpose codes. General-purpose codes (e.g., the C compiler) have relatively few mapping conflicts on direct-mapped caches because their access patterns tend to be randomly distributed throughout the cache. In contrast, scientific codes that stride through matrices (particularly ones dimensioned to powers of two) can have pathologically bad mapping conflicts. In four of the applications we studied (MXM, CFFT2D, VPENTA, and TOMCATV), the cache conflicts in the original code were so bad that we manually realigned the data to help alleviate these problems.

The cache conflict problem can be addressed either in hardware or in software. One software-based approach is to place the burden of avoiding conflicts on the programmer. Once the programmer is aware that conflicts are a problem, the solution to the problem is often quite obvious. In the applications where we fixed conflicts by hand, we simply added thirteen (an arbitrary prime number) to the size of each matrix dimension that was a power of two. This was enough to prevent elements with similar access functions in adjacent rows or matrices from mapping to the same cache entries. Although this is fairly straightforward, ideally the programmer should not have to worry about this. A more appealing software-based approach would be for the compiler to automatically realign data to fix mapping problems. However, the difficulty here is that once the data are moved, every possible reference to the data must also be adjusted appropriately. This may be particularly difficult given explicit address arithmetic, pointers, aliasing, or compilation across separate files. So although eliminating the problem in software may sound like the ideal approach, it is unclear whether this is a practical solution.

Cache conflicts can be addressed in the hardware through associativity of some form. While associativity has the advantage of reducing conflicts by allowing locations to map to multiple cache entries, it has the disadvantage of slowing down cache access rates due to added complexity. Therefore minimizing the degree of associativity is an important concern. One option is to have a set-associative primary cache, where addresses are mapped to N-way associative sets. Another approach is to keep the cache direct-mapped but also add a ``victim cache'' [40] off to the side. A victim cache is a small buffer containing the last several lines replaced from the cache. While set-associative caches are the most common approach for adding associativity, the victim cache approach is appealing because it is tailored to cases where data are reused shortly after being displaced; this is precisely what happens with prefetching conflicts.

We evaluate both of these techniques in this subsection by incorporating them into our uniprocessor architecture. The applications we consider are the ones that suffered most noticeably from conflicts, including two cases where prefetches were often ineffective due to conflicts (CHOLSKY and TOMCATV, as discussed in Section ), and the four original codes before we manually realigned data (MXM, CFFT2D, VPENTA, and TOMCATV). The results of these experiments are shown in Figures through .

Each graph shows curves both with and without prefetching (PF and NOPF, respectively) for victim caches ranging from one to 256 entries, and for set-associative caches ranging from 2-way to 8-way associativity. Two curves are shown for the set-associative cases: one with random replacement (Random), and one with a least-recently-used (LRU) replacement policy. The LRU policy is slightly more effective, but is typically more expensive to implement.

The success of victim caching and set-associativity varied across the applications, and we discuss each case individually below.

We begin with CHOLSKY (the uniprocessor NASA7 SPEC benchmark), as shown in Figure . CHOLSKY is an interesting case because cache conflicts only occur once prefetching is added. The conflicts occur in two similar loop nests, one of which is shown in Figure . The references B(I,L,K+JJ) and B(I,L,K) are separated by 8032*JJ bytes (JJ is a loop index, and therefore its value changes). Since JJ is always greater than zero, the B references never directly conflict in an 8 Kbyte cache, and hence the original code does not have a cache conflict problem. As we see in Figure , the NOPF cases show little improvement with associativity or victim caching.

To insert prefetches, the compiler unrolls loop L in Figure four times for the spatial locality of the A reference, and prefetches are software-pipelined three iterations ahead. Therefore the B references are being prefetched twelve references ahead, which spans 384 bytes (i.e. twelve separate 32-byte cache lines). This region is large enough that the B references overlap when they are close enough together. What happens then is that the prefetch of B(I,L,K+JJ) displaces the prefetch of B(I,L,K), and the load of B(I,L,K) displaces the prefetch of B(I,L,K+JJ), thereby rendering both prefetches ineffective. These conflicts only occur for certain values of JJ, and they happen about 25% of the time.

As we see in Figure , a 2-way set-associative cache (with LRU replacement) is enough to eliminate this conflict problem. A victim cache only eliminates the problem when it is 32 entries deep. This is because we prefetch each reference twelve iterations ahead, and up to 24 conflicting misses must be captured. This type of conflict pattern is better suited to associativity.

TOMCATV is a much more complicated case. The key loop in TOMCATV contains seven distinct references that conflict with each other at various times. References are prefetched only three iterations ahead since the loop body is fairly large, and the conflict patterns are rather complex.

We begin by focusing on the original TOMCATV code, as shown in Figure . For each of these cases where we show original code before the arrays are manually realigned (i.e. TOMCATV, MXM, CFFT2D, and VPENTA), we also indicate the performance of the code after realignment using dashed and dotted horizontal lines, which correspond to the cases with and without prefetching, respectively, on the original direct-mapped architecture. We observe that either eight victim cache entries or an 8-way set-associative cache is needed to match the performance of simply realigning the arrays. Once the arrays are realigned, as shown in Figure , a 2-way set-associative cache is quite helpful, and there is steady improvement from increasing the victim cache size up to sixteen entries. Large degrees of both victim caching and set-associativity fare well in the case of TOMCATV, but clearly realigning the arrays is the most important optimization.

The original MXM code is a very interesting case, and the resulting performance is shown in Figure . This is a partially-blocked matrix multiply, and the key loop is shown in Figure . The conflicts in this code occur between the A and C matrices. For both matrices, the size of the inner dimension is 2 Kbytes (256 double-precision elements). In a single pass through the I loop, each of the four A references will sweep through their own quarter of the cache, never conflicting with each other. However, the C reference will line up directly with one of the A references (it rotates between each of them). Without these conflicts, the A and C references would suffer misses once every four references, as they cross cache line boundaries. These conflicts will cause the C reference and one of the A references to miss on each iteration. Increasing the associativity to 2 or 4 ways does not improve performance, since the C and the four A references will always map into the same set. In fact, the performance is dramatically worse with a 4-way LRU cache because the code cycles repeatedly through five references, so the least recently used reference is the worst one to throw out. The code finally improves with 8-way set-associativity since all five conflicting references can fit in the same set.

Unlike associativity, victim caching performs very well in this case, as we see in Figure . A single victim entry allows all five references to remain in the cache, thereby dramatically improving the NOPF case. The prefetching case is fetching data two iterations ahead, and does somewhat worse than NOPF with a single victim entry because it replaces the crucial victim entry with data displaced by prefetching. With eight or more victim cache entries, we finally see the full prefetching benefit, since both the current and the prefetched data sets can be captured. We also see that by restructuring the code by hand, cache conflicts can be avoided altogether, and this matches the NOPF performance with a single victim cache entry. The performance of the restructured code with prefetching is only exceeded by eight or more victim entries, where it is possible to avoid occasional conflicts with other references.

In the original CFFT2D code, the conflicts occur between references of the X matrix in two separate loop nests, shown in Figure . The inner dimension of the X matrix contains 128 complex numbers, which is 2 Kbytes of data. In the code without prefetching, conflicts occur only in the second loop nest (Figure (b)) whenever the difference between II and IM is a multiple of four (which occurs more than 50% of the time). Either a single victim cache entry or a 2-way set-associative cache fixes this problem, as we see at the top of Figure . This does not match the performance of the restructured code, because it still suffers from conflicts in the direct-mapped secondary cache (a 2-way set-associative secondary cache should fix this).

In the prefetching case, this second loop is unrolled by a factor of two and we prefetch three iterations ahead. When the references line up, obviously the prefetches will also interfere with each other. Consequently the prefetch of X(K,IM) was often replaced by the prefetch and reference of X(K,II) before it could be used. However, the more important conflict cases with prefetching are in the first loop nest (Figure (a)). Here we prefetch six iterations ahead along the outer dimension. Since each of these accesses is separated by 2 Kbytes, they map into only four entries in the 8 Kbyte direct-mapped cache, so each reference ends up displacing its own prefetches. Six or more victim cache entries are enough to capture these conflicts, and we see a noticeable performance improvement in Figure with eight entries. An eight entry victim cache does better than 8-way set-associativity in this case because all of the prefetches map into the same set, and the replacement policy is not perfect. The manually-realigned code with a direct-mapped cache performs better than an 8-way set-associative cache and as well as a victim cache with eight or more entries.

Finally, there is the VPENTA code shown in Figure . The critical loop nest for this application contains eight matrices which line up directly on top of each other in a direct-mapped cache, as shown in Figure . Each 2-D matrix occupies 128 Kbytes, and the 3-D F matrix behaves like three large 2-D matrices. So ten unique references line up in the same entry of the cache on each loop iteration. Consequently the performance of this code is limited almost entirely by the degree of associativity, up to an 8-way set-associative cache or an 8-entry victim cache. Prefetching offers little improvement for these configurations because there is nowhere for the prefetched data to go. Finally, with a victim cache of sixteen or more entries, some of the prefetched data are retained, and we see a widening performance gap. The performance levels off at 64 entries because we prefetch two iterations ahead, so the working set is slightly larger than 32 lines, including the various stack references in the loop. Even with a 256 entry victim cache, the original code with prefetching does not match the restructured code without prefetching. This is because of the large number of secondary cache conflicts (the 256 Kbyte secondary cache would need to be at least 5-way set-associative to avoid conflict misses). By simply increasing the values of the matrix dimension parameters from 128 to 141, the original code without prefetching runs nearly three times faster. When this is combined with prefetching, the code runs five times faster on this direct-mapped cache architecture.

To summarize these results, we have seen a mixture of effects. In cases like CHOLSKY where a small number of references conflict only because their prefetched data sets overlap, a small amount of set-associativity is the best approach. Victim caches do not perform as well in these cases where the depth of software pipelining is large, since the replaced data must be held across that many iterations (i.e. there is a multiplicative effect of degree of conflict times pipelining distance). In other cases such as MXM where the conflicting data maps into the same sets in a set-associative cache, victim caching is better suited to the problem. In still other cases, the same degree of associativity (either N-way set-associativity or an N-entry victim cache) works equally well.

In a few cases we noticed two dips in the performance curves. The first dip corresponds to holding the ``active'' working set in either the primary cache or the victim cache (e.g., two victim entries in MXM and CFFT2D), and the second dip occurs when the prefetched active set can also be held at the primary cache level (e.g., eight victim entries in MXM and CFFT2D). When conflicts already occur (e.g., the original VPENTA code), prefetching may be doomed since the prefetched data will also conflict. However, prefetching may cause new conflicts, especially when striding through arrays that are only slightly separated in the cache.

Because prefetch-specific conflicts occur within a certain window of time, the victim cache approach sounds appealing: it is tailored to capturing recently displaced data. However, with latencies of over a hundred cycles, more than sixteen victim cache entries may be needed, which is probably too large for an associative lookup.

In all cases, the software-based restructuring approach did quite well. Our restructuring algorithm was simply to add a prime number (thirteen) to the size of array dimensions that were a power of two. While care must be taken with this approach, it worked quite well and has the added benefit of reducing conflicts in the secondary cache. In one case (VPENTA), no reasonable amount of hardware could solve the cache conflict problems.

