Class 14: Caches II Review basic idea of caches from last time: CPU regs <----> L1 <----> L2 <----> Memory Nacona Xeon: 16 regs 12Kuop i-cache 1M unified Up to 16K d-cache 256 TB both 8-way 8-way (2^48 B) 64B blocks 64B 1-2 cycles 10 cycles 50-100 cycles 10ns (50-100ns) Todays topics: 1. Performance impact of caches 2. Writing cache-friendly code ******************************* 1. Performance impact of caches ******************************* * Demo: Run example summary.c program by rows, then cols. Analyze miss rate: 1/8=12% (row-wise) vs 100% (col-wise) * Review memory mountain: Read throughput as a 2-d function of temporal and spatial locality * Careful walk-through of memory mountain code slides * Demo: Show P3 and Nocona Xeon slides ****************************** 2. Writing cache-friendly code ****************************** ************************************************** 2a: Rearranging loops to increase spatial locality ************************************************** Example: nxn matrix multiplication (explain on board) Analysis technique: Assume large N (1/N ~ 0) Assume single cache, won't hold multiple rows. Assume 8-byte elements, 64-byte block size, Look at access pattern of inner loop /* Six different versions of matrix multiply */ typedef double array[MAXN][MAXN]; ******************* Case 1: ijk and jik ******************* Misses per inner loop: A: 0.125 B: 1.0 C: 0 Total: 1.125 2 loads void ijk(array A, array B, array C, int n) { int i, j, k; double sum; for (i = 0; i < n; i++) for (j = 0; j < n; j++) { sum = 0.0; for (k = 0; k < n; k++) sum += A[i][k]*B[k][j]; C[i][j] += sum; } } void jik(array A, array B, array C, int n) { int i, j, k; double sum; for (j = 0; j < n; j++) for (i = 0; i < n; i++) { sum = 0.0; for (k = 0; k < n; k++) sum += A[i][k]*B[k][j]; C[i][j] += sum; } } ******************* Case 2: ikj and kij ******************* Misses: A: 0 B: .125 C: .125 Total: 0.25 2 load, 1 store void ikj(array A, array B, array C, int n) { int i, j, k; double r; for (i = 0; i < n; i++) for (k = 0; k < n; k++) { r = A[i][k]; for (j = 0; j < n; j++) C[i][j] += r*B[k][j]; } } void kij(array A, array B, array C, int n) { int i, j, k; double r; for (k = 0; k < n; k++) for (i = 0; i < n; i++) { r = A[i][k]; for (j = 0; j < n; j++) C[i][j] += r*B[k][j]; } } ******************* Case 3: kji and jki ******************* Misses: A: 1 B: 0 C: 1 Total: 2 2 loads, 1 store void kji(array A, array B, array C, int n) { int i, j, k; double r; for (k = 0; k < n; k++) for (j = 0; j < n; j++) { r = B[k][j]; for (i = 0; i < n; i++) C[i][j] += A[i][k]*r; } } void jki(array A, array B, array C, int n) { int i, j, k; double r; for (j = 0; j < n; j++) for (k = 0; k < n; k++) { r = B[k][j]; for (i = 0; i < n; i++) C[i][j] += A[i][k]*r; } } Resuls on Saltwater Fish Machines: matmult cycles/loop iteration n ijk jik jki kji kij ikj 25 15.63 15.20 18.68 18.50 18.32 18.81 50 15.83 15.72 36.38 37.04 18.86 18.33 75 15.99 15.60 44.73 44.72 18.48 18.26 100 22.52 22.74 44.93 44.89 18.41 18.33 125 23.01 23.41 45.97 45.94 18.31 18.25 150 23.24 23.71 46.72 46.64 18.28 18.21 175 23.31 23.86 46.86 46.66 18.27 18.22 200 23.39 24.15 46.95 46.65 18.32 18.36 225 23.62 24.37 47.27 46.82 18.79 18.63 250 24.04 24.56 47.60 46.99 19.18 19.01 275 24.37 24.86 47.83 47.32 19.97 19.19 300 25.31 24.91 48.50 48.30 20.36 19.09 325 25.82 24.85 49.06 48.98 20.40 19.02 350 26.99 24.89 49.88 51.51 20.74 19.11 375 27.85 24.81 50.40 52.86 20.57 19.09 400 28.25 24.81 50.76 54.44 20.80 18.97 ***************************************** 2b: Blocking to improve temporal locality **************************************** Example: blocked matmult (explain on board) * Innermost loop pair j,k multiplies a 1 x bsize sliver of A by a bsize x bsize block of B and accumulates into 1 x bsize sliver of C * i loop step through n row slivers of A and C, using same B * At that point, B block is used up, throw away and go the next block /* Blocked matrix multiply */ void bijk(array A, array B, array C, int n, int bsize) { int i, j, k, kk, jj; double sum; int en = bsize * (n/bsize); /* Amount that fits evenly */ for (i = 0; i < n; i++) for (j = 0; j < n; j++) C[i][j] = 0.0; for (kk = 0; kk < en; kk += bsize) { for (jj = 0; jj < en; jj += bsize) { for (i = 0; i < n; i++) { for (j = jj; j < jj + bsize; j++) { sum = C[i][j]; for (k = kk; k < kk + bsize; k++) { sum += A[i][k]*B[k][j]; } C[i][j] = sum; } } } /* Now finish off rest of j values (not shown in textbook) */ for (i = 0; i < n; i++) { for (j = en; j < n; j++) { sum = C[i][j]; for (k = kk; k < kk + bsize; k++) { sum += A[i][k]*B[k][j]; } C[i][j] = sum; } } } /* Now finish remaining k values (not shown in textbook) */ for (jj = 0; jj < en; jj += bsize) { for (i = 0; i < n; i++) { for (j = jj; j < jj + bsize; j++) { sum = C[i][j]; for (k = en; k < n; k++) { sum += A[i][k]*B[k][j]; } C[i][j] = sum; } } } /* Now finish off rest of j values (not shown in textbook) */ for (i = 0; i < n; i++) { for (j = en; j < n; j++) { sum = C[i][j]; for (k = en; k < n; k++) { sum += A[i][k]*B[k][j]; } C[i][j] = sum; } } } Results on the Saltwater Fish Machines; blocked mm cycles/iteration n b ijk 25 25 15.03 50 25 14.78 75 25 14.76 100 25 14.75 125 25 14.73 150 25 14.64 175 25 14.66 200 25 14.73 225 25 14.79 250 25 14.80 275 25 14.93 300 25 14.93 325 25 14.98 350 25 15.01 375 25 15.05 400 25 15.07 In-class Problem The heart of the recent hit game {\em SimAquarium} is a tight loop that calculates the average position of 256 algae. You are evaluating its cache performance on a machine with a 1024-byte direct-mapped data cache with 16-byte blocks ($B=16$). You are given the following definitions: struct algae_position { int x; int y; }; struct algae_position grid[16][16]; int total_x = 0, total_y = 0; int i, j; You should also assume the following: * sizeof(int) == 4 * grid begins at memory address 0. * The cache is initially empty. * Memory reference to grid only A. Determine the miss rates for the following codes: Code 1: for (i = 0; i < 16; i++) { for (j = 0; j < 16; j++) { total_x += grid[i][j].x; Miss rate: 50% } } for (i = 0; i < 16; i++) { for (j = 0; j < 16; j++) { total_y += grid[i][j].y; } } Code 2: for (i = 0; i < 16; i++){ for (j = 0; j < 16; j++) { total_x += grid[j][i].x; Miss rate: 50% total_y += grid[j][i].y; Miss rate(2x cache): 25% } } Code 3: for (i = 0; i < 16; i++){ for (j = 0; j < 16; j++) { total_x += grid[i][j].x; Miss rate: 25% total_y += grid[i][j].y; Miss rate (2x cache): 25% } } Summary: 1. All systems favor cache-friendly code 2. Getting optimimum is system-specific (cache size, line size etc) 3. Can get most advantage with generic code - use small strides (spatial locality) - keep working sets small (temporal locality) blocking 4. Code optimization matters too