# 15-213 "The course that gives CMU its Zip!" # **Cache Memories** February 26, 2008 #### **Topics** - Generic cache memory organization Direct mapped caches - Set associative caches - Impact of caches on performance - The memory mountain # **Synchronization** # First exam this evening - If you have not received mail with a Subject line like "15-213 exam: conflict session C2" then we expect you at the main - Room split by Andrew username (not first/last/middle name!) - a-c Wean 7500 d-z McConomy Auditorium in University Center - Bring with you Your TA's name and/or 15-213 section le - If you want your test to be returned in recitation Book and notes, if you wish - Suggested: know your powers of 2 No calculators 15-213, S'08 # Synchronization - 2 # Computer Club movie night - . "Colossus, The Forbin Project" - Wednesday evening - Wean 7500 - 19:00 Computer Club Intro, co-op pizza order - 19:30 Movie # **Determinant** Theorem 5. If A and B are square matrices of the same size, then $det(AB) = det(A)^* det(B)$ . The elegant simplicity of this result contrasted with the complex nature of both matrix multiplication and the determinant definition is both refreshing and surprising. We shall omit the proof. Anton, Elementary Linear Algebra, 4th ed., p. 72. 15-213, S'08 # **Cache Memories** Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in L1, then in L2, then in main #### Inserting an L1 Cache Between the CPU and Main Memory The tiny, very fast CPU register file has room for four The transfer unit between the CPU register file and the cache is a 4-byte block. 4-byte words. The small fast L1 cache has room for two 4-word blocks. line 1 The transfer unit between the cache and main memory is block 10 abcd a 4-word block (16 The big slow main memory has room for many 4-word bytes). block 21 pgrs blocks block 30 w×yz 15-213, S'08 # **Maintaining an Associative Cache** - · How to decide which cache line to use in a set? - Least Recently Used (LRU), Requires $\lceil Ig_2(E!) \rceil$ extra bits - Not recently Used (NRU) - Random 21 - · Virtual vs. Physical addresses: - The memory system works with physical addresses, but it takes time to translate a virtual to a physical address. So most L1 caches are virtually indexed, but physically tagged. # What about writes? Multiple copies of data exist: - L1 - L2 - Main Memory - Disk What to do when we write? - Write-through - Write-back - need a dirty bit - What to do on a write-miss? What to do on a replacement? - Depends on whether it is write through or write back ``` Cache Performance Metrics Fraction of memory references not found in cache (misses / references) Typical numbers: 3-10% for L1 • can be quite small (e.g., < 1%) for L2, depending on size, etc. Hit Time Time to deliver a line in the cache to the processor (includes time to determine whether the line is in the cache) *Tunical numbers: Aside for architects: Typical numbers: -Increasing cache size? • 1-2 clock cycle for L1 • 5-20 clock cycles for L2 -Increasing block size? Miss Penalty -Increasing associativity? Additional time required because of a miss Typically 50-200 cycles for main memory (Trend: increasing!) ``` # **Writing Cache Friendly Code** - Repeated references to variables are good (temporal locality) - Stride-1 reference patterns are good (spatial locality) - Examples: 28 - cold cache, 4-byte words, 4-word cache blocks ``` int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum;</pre> ``` nt sum\_array\_rows(int a[M][N]) int sum\_array\_cols(int a[M][N]) int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum;</pre> Miss rate = 1/4 = 25% Miss rate = 100% 15-213, S'08 # **The Memory Mountain** # Read throughput (read bandwidth) Number of bytes read from memory per second (MB/s) #### Memory mountain - Measured read throughput as a function of spatial and temporal locality. - Compact way to characterize memory system performance. 27 ``` Memory Mountain Test Function ``` ``` /* The test function */ woid test(int elems, int stride) { int i, result = 0; volatile int sink; for (i = 0; i < elems; i += stride) result += data[i]; sink = result; /* So compiler doesn't optimize away the loop */</pre> double cycles; int elems = size / sizeof(int); test(elems, stride); /* warm up the cache */ cycles = fcyc2(test, elems, stride, 0); /* call test(elems,stride) */ return (size / stride) / (cycles / Mhz); /* convert cycles to MB/s */ ``` **Memory Mountain Main Routine** ``` /* mountain.c - Generate the memory mountain. */ #define MINEYTES (1 << 10) /* Working set size ranges from 1 KB */ #define MAXISTRIDE 16 /* Strides range from 1 to 16 */ #define MAXISTRIDE 15 /* Strides range from 1 to 16 */ #define MAXISTRIDE 15 /* #define MAXISTRIDE 15 /* int data[MAXELEMS]; /* The array we'll be traversing */ int main() int size; /* Working set size (in bytes) */ int stride; /* Stride (in array elements) */ double Mhz; /* Clock frequency */ init_data(data, MAXELEMS); /* Initialize each element in data to 1 */ Nhn = mhx(0); NTTES; size >> HINGTES; striber+) { for (size = MAXELEMS); /* Estimate the clock frequency */ for (stribe = 1; stribe <= MAXELEMS; striber+) print("A.Stribe", size >> HINGTES; striber+) print("A.Stribe", striber, striber, striber) } exit(0); ``` 29 15-213, S'08 ``` Pentium Matrix Multiply Performance Miss rates are helpful but not perfect predictors. Code scheduling matters, too. ``` ``` Improving Temporal Locality by Blocking Example: Blocked matrix multiplication "block" (in this context) does not mean "cache block". Instead, it mean a sub-block within the matrix. Example: N = 8; sub-block size = 4 \begin{bmatrix} A_1 & A_2 \\ A_2 & A_2 \end{bmatrix} \times \begin{bmatrix} B_2 & B_2 \\ B_3 & B_2 \end{bmatrix} = \begin{bmatrix} C_2 & C_2 \\ C_3 & C_2 \end{bmatrix} Key idea: Sub-blocks (i.e., A_{rr}) can be treated just like scalars. C_{11} = A_{11}B_{11} + A_{12}B_{21} \qquad C_{22} = A_{11}B_{12} + A_{22}B_{22} C_{21} = A_{21}B_{31} + A_{22}B_{21} \qquad C_{22} = A_{31}B_{12} + A_{22}B_{22} ``` ``` Equation Rewriting The math C_{11} = A_{11}B_{11} + A_{12}B_{21} \qquad C_{12} = A_{11}B_{12} + A_{12}B_{22} \qquad C_{21} = A_{21}B_{11} + A_{22}B_{22} Straightforward conversion to present ecode C = 0 C_{11} + A_{11}B_{11} \quad C_{11} + A_{11}B_{21} \qquad C_{12} + A_{12}B_{22} Straightforward conversion to C_{11} + A_{11}B_{12} \quad C_{12} + A_{12}B_{12} A_ ``` ``` Blocked Matrix Multiply (bijk) for (jj=0; jj<n; jj+*bsize) { for (i=0; i<n; i++) for (j=j); j < min(jj+bsize,n); j++) c(i=j); n < min(jj+bsize,n); j++) for (k*0; kk<n; kk**bsize) { for (i*0; i<n; i++) { for (j=j); j < min(jj+bsize,n); j++) { sum = 0.0 for (k*kk; k < min(kk*bsize,n); k++) { sum += a[i][k] * b[k][j]; } c[i][j] += sum; } } 48 ``` ``` Blocked Matrix Multiply Analysis • Innermost loop pair multiplies a 1 X bsize sliver of A by a bsize X bsize block of B and accumulates into 1 X bsize sliver of C • Loop over i steps through n row slivers of A & C, using same B for (i=0; i:n; i:+) { for (j=jj; j < min(jj+bsize,n); j++) { sum = 0.0 for (k=kk; k < min(kk+bsize,n); k++) { sum += a[i][k] * b[k][j]; } }</pre> } c[i][j] += sum; Update successive row sliver accessed block reused n bsize times times in succession ``` # **Concluding Observations** # Programmer can optimize for cache performance - How data structures are organized - · How data are accessed - Nested loop structure Blocking is a general technique #### All systems favor "cache friendly code" - Getting absolute optimum performance is very platform specific Cache sizes, line sizes, associativities, etc. - Can get most of the advantage with generic code Keep working set reasonably small (temporal locality) Use small strides (spatial locality) 51