
Cache Performance on Scientific Code

To illustrate the need for improving the cache performance of microprocessor-based systems, we present results below for a set of scientific programs. For the sake of concreteness, we pattern our memory subsystem after the MIPS R4000. The simulated architecture is a single-issue processor with a 100 MHz internal clock. The processor has an on-chip primary data cache of 8 Kbytes and a secondary cache of 256 Kbytes. Both caches are direct-mapped and use 32-byte lines. A primary cache miss that hits in the secondary cache incurs a penalty of 12 cycles, and a miss that goes all the way to memory incurs a total penalty of 75 cycles. To limit the complexity of the simulation, we assume that every instruction executes in a single cycle and that all instruction fetches hit in the primary instruction cache.
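The memory model described above can be illustrated with a small sketch. This is not the simulator used for the results in this section. It is a minimal Python approximation that takes only the cache sizes, line size, and miss penalties from the text. The names `DirectMappedCache` and `stall_cycles` are hypothetical, and the sketch assumes that loads and stores are treated identically and that a line is installed in both caches on a miss, neither of which is stated here.

```python
LINE_BYTES = 32           # line size of both caches
L1_BYTES = 8 * 1024       # primary data cache
L2_BYTES = 256 * 1024     # secondary cache
L2_HIT_PENALTY = 12       # cycles: primary miss that hits in the secondary cache
MEMORY_PENALTY = 75       # cycles: total penalty of a miss that goes to memory


class DirectMappedCache:
    """Direct-mapped cache that tracks only the tag stored in each line."""

    def __init__(self, size_bytes, line_bytes=LINE_BYTES):
        self.line_bytes = line_bytes
        self.num_lines = size_bytes // line_bytes
        self.tags = [None] * self.num_lines

    def access(self, address):
        """Return True on a hit; on a miss, install the line and return False."""
        block = address // self.line_bytes
        index = block % self.num_lines
        tag = block // self.num_lines
        if self.tags[index] == tag:
            return True
        self.tags[index] = tag
        return False


def stall_cycles(addresses):
    """Sum the memory stall cycles for a sequence of data addresses.

    Assumption (not from the text): reads and writes cost the same, and
    every miss fills the line in each cache level that missed.
    """
    l1 = DirectMappedCache(L1_BYTES)
    l2 = DirectMappedCache(L2_BYTES)
    stalls = 0
    for address in addresses:
        if l1.access(address):
            continue                      # primary hit: no stall
        if l2.access(address):
            stalls += L2_HIT_PENALTY      # served by the secondary cache
        else:
            stalls += MEMORY_PENALTY      # served by main memory
    return stalls
```

For example, addresses 0 and 8192 map to the same line of the 8 Kbyte primary cache but to different lines of the secondary cache, so re-referencing address 0 after touching address 8192 costs 12 cycles rather than 75.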

We present results for a collection of scientific programs drawn from several benchmark suites. This collection includes NASA7 and TOMCATV from the SPEC benchmarks[27], OCEAN (a uniprocessor version of a SPLASH benchmark[25]), and CG (conjugate gradient), EP (embarrassingly parallel), IS (integer sort), and MG (multigrid) from the NAS Parallel Benchmarks[3]. Since the NASA7 benchmark consists of seven independent kernels, we study each kernel separately (MXM, CFFT2D, CHOLSKY, BTRIX, GMTRY, EMIT and VPENTA). The performance of the benchmarks was simulated by instrumenting the MIPS object code using pixie[26] and piping the resulting trace into our cache simulator.

Figure 1 breaks down the total execution time of each program into instruction execution and stalls due to memory accesses. Many of the programs spend a significant fraction of their time on memory accesses. In fact, 8 of the 13 programs spend more than half of their time stalled for memory accesses.
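Under the single-cycle instruction assumption stated earlier, the breakdown of execution time reduces to simple arithmetic: total cycles are the instruction count plus the memory stall cycles. The following sketch shows that calculation. The function name `time_breakdown` and the input numbers in the usage note are hypothetical and are not measurements from this study.

```python
def time_breakdown(instruction_count, memory_stall_cycles):
    """Split execution time into instruction and memory-stall fractions.

    Assumes each instruction takes exactly one cycle, so the busy time
    in cycles equals the instruction count.
    """
    total_cycles = instruction_count + memory_stall_cycles
    if total_cycles <= 0:
        raise ValueError("total cycle count must be positive")
    busy_fraction = instruction_count / total_cycles
    stall_fraction = memory_stall_cycles / total_cycles
    return busy_fraction, stall_fraction
```

With made-up inputs of 1000 instructions and 1500 stall cycles, the stall fraction is 1500 / 2500 = 0.6, which would place such a program among those that spend more than half of their time stalled.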





Robert French