The benchmarks evaluated in this study are all scientific and engineering applications drawn from several benchmark suites. The collection includes NASA7 and TOMCATV from the SPEC benchmarks, OCEAN (a uniprocessor version of a SPLASH benchmark), and CG (conjugate gradient), EP (``embarrassingly parallel'', a Monte Carlo simulation), IS (integer sort), and MG (multigrid) from the NAS Parallel Benchmarks. Since the NASA7 benchmark consists of seven independent kernels, we study each kernel separately (MXM, CFFT2D, CHOLSKY, BTRIX, GMTRY, EMIT, and VPENTA). In addition, for our study in Section on prefetching indirect references, we also evaluate MP3D (another uniprocessor version of a SPLASH benchmark) and SPARSPAK (a sparse matrix application), since these applications contain many indirect references. Table provides a brief summary of the applications, including their input data sets, and Table shows some general characteristics of the applications.
For four of the applications (MXM, CFFT2D, VPENTA, and TOMCATV), mapping conflicts in the direct-mapped cache occurred so frequently that we manually changed the alignment of some of the matrices to reduce them. The problematic matrices tend to have dimensions that are powers of two, so the cache size (also a power of two) evenly divides the size of a row or possibly of the entire matrix. As a result, adjacent elements in the same column (and sometimes elements with similar access functions in adjacent matrices) often mapped to the same cache entry, producing large numbers of conflicts within inner loops. We manually fixed this problem by adding 13 (an arbitrary prime number) to the size of each dimension of these problematic matrices, taking care that the changes affected only the data layout and not the actual computation. Later, in Section , we will examine these mapping conflicts in more detail and evaluate possible architectural enhancements to minimize their performance impact.
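The effect of the padding can be illustrated with a small sketch. It is not the tooling used in this study: the 8 KB cache size, 32-byte line size, 8-byte elements, row-major layout, and the helper names `cache_entry`, `column_entries`, and `distinct_entries` are all assumptions chosen for illustration. The sketch counts how many distinct direct-mapped cache entries are touched when walking down one column of a matrix, first with a power-of-two row length and then with 13 added to it.

```python
# Illustrative sketch only: all parameters below are assumptions,
# not the cache configuration or data layout used in the study.
CACHE_BYTES = 8 * 1024   # assumed direct-mapped cache size (power of two)
LINE_BYTES = 32          # assumed cache line size
ELEM_BYTES = 8           # assumed double-precision elements


def cache_entry(addr):
    """Index of the direct-mapped cache entry that byte address `addr` maps to."""
    return (addr // LINE_BYTES) % (CACHE_BYTES // LINE_BYTES)


def column_entries(n_cols, n_rows, col=0, base=0):
    """Cache entries touched when walking down one column of a
    row-major matrix whose rows hold `n_cols` elements."""
    return [
        cache_entry(base + (row * n_cols + col) * ELEM_BYTES)
        for row in range(n_rows)
    ]


def distinct_entries(n_cols, n_rows=64):
    """Number of distinct cache entries used by the first `n_rows` column elements."""
    return len(set(column_entries(n_cols, n_rows)))


if __name__ == "__main__":
    # Row of 1024 doubles = 8192 bytes, exactly the assumed cache size.
    print("unpadded:", distinct_entries(1024))
    # Same matrix with 13 added to the row length.
    print("padded:  ", distinct_entries(1024 + 13))
```

Under these assumptions, the unpadded column collapses onto a single cache entry (each row is exactly one cache size apart), whereas the padded column spreads its 64 elements over 64 different entries. The computation itself is unchanged; only the address of each element moves.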