Next: Coping with Memory Up: Introduction Previous: Introduction

Cache Performance on Scientific and Engineering Codes

To illustrate the need for improving the cache performance of microprocessor-based uniprocessor and multiprocessor systems, we present results in this subsection for a set of scientific and engineering applications. We begin with the uniprocessor architecture. For the sake of concreteness, we pattern our memory subsystem after a typical MIPS R4000-based workstation. The architecture consists of a single-issue processor running at a 100 MHz internal clock. The processor has an on-chip primary data cache of 8 Kbytes, and a secondary cache of 256 Kbytes. Both caches are direct-mapped and use 32 byte lines. The penalty of a primary cache miss that hits in the secondary cache is 12 cycles, and the total penalty of a miss that goes all the way to main memory is 75 cycles. In this simple model, we assume that all instructions execute in a single cycle and that all instructions hit in the primary instruction cache. The performance of the benchmarks was simulated by instrumenting the MIPS object code using pixie [74] and piping the resulting trace into our detailed cache simulator.

Figure breaks down the total program execution time into instruction execution and stalls due to memory accesses for 13 uniprocessor programs taken from the SPEC [77], SPLASH [72], and NAS Parallel [8] benchmark suites. Many of the programs spend a significant amount of time on memory accesses. In fact, 8 out of the 13 programs spend more than half of their time stalled for memory accesses.

We conducted a similar experiment to evaluate the impact of memory latency on large-scale shared-memory multiprocessors by simulating the entire SPLASH [72] parallel application suite on an architecture resembling the Stanford DASH multiprocessor [54]. The architecture we model includes 16 R3000 processors running at 33 MHz, two-level cache hierarchies (64 Kbytes/256 Kbytes), and a low-latency interconnection network. Miss latencies for loads range from 15 cycles to the secondary cache to over 130 cycles for ``remote dirty'' lines. Further details on this architecture and the parallel applications will be presented later in Chapter . Figure shows the results. Execution time is now broken into three categories: time spent executing instructions, time spent stalled for memory, and time spent stalled for synchronization (such as locks and barriers). Once again, memory stalls are significant, with 7 of the 8 applications spending more than 35%of their time stalled waiting for memory accesses to complete.

Next: Coping with Memory Up: Introduction Previous: Introduction

tcm@
Sat Jun 25 15:13:04 PDT 1994