Class 11. Storage and Locality So far we have concentrated on the processor, with a simple model of memory as an array of bytes. While valid, this model is not quite right. Memory is actually implemented as a hierarchy of storage devices that can have a big impact on the performance of your programs. The design of this hierarchy is a fundamental consequence of the properties of storage technologies and application programs. Today's topics: 1. Storage technologies 2. Program locality *********************** 1. Storage technologies *********************** ************************** Random-access memory (RAM) ************************** Trans Access Needs Needs per bit time refresh EDC Cost Apps SRAM 4-6 1x No Maybe 400x Cache mems DRAM 1 5-10x Yes Yes 1x Main mems, frame bufs DRAM organization: d x w DRAM chip: dw total bits: d w-bit supercells Example: 16 x 8 DRAM chip Sketch out picture (CS:APP Fig 6.3): * CPU * memory controller * 2 bit address (-->) * 8 bit data (<-->) * Memory chip * 4x4 array * internal row buffer Accessing supercell (i,j): * RAS (Row access strobe): selects row i and copies to row buffer * CAS (Col access strobe): selects col j from row buffer Combining DRAM chips to form modules (banks): * Example: 64 MB memory module consisting of 8 8M x 8 DRAM chips * Sketch out picture (CS:APP Fig 6.5): * (RAS,CAS) concurrently to each chip selects 64 bits * DRAM0 -> bits 0-7 * DRAM1 -> bits 8-15 * ... * DRAM6 -> bits 48-55 * DRAM7 -> bits 56-63 At a higher level, this is the same technique used by the high-performance banked memory systems of supercomputers. Enhanced DRAMS: Same core with better interface (exploit row buffer, sync) * SDRAM (Synch DRAM): single-edge clocking (1 bit/pin/clock) * DDR SDRAM (Double data-rate SDRAM) double edge clocking (2 bits/pin/clock) * DDR2 SDRAM: incremental improvement to DDR SDRAM *********************** How CPU accesses memory *********************** Sketch on a separate board and then save: *****CPU chip***** register <--> ALU file | bus interface <--system bus--> I/O bridge <--mem bus--> main mem *****CPU chip***** Memory read transaction (movl A, %eax): 1. CPU places A on memory bus 2. Memory retrieves data and places it on bus 3. CPU reads data from bus and copies to %eax Key idea: It takes time to read and write memory. ***** Disks ***** Disk geometry (sketch on board: * platters * surfaces * tracks * sectors and gaps (Aligned tracks form a cylinder) Disk capacity: * 512 bytes/sector * 300 sectors/track (on average) * 20,000 tracks/surface * 2 surfaces/platter * 5 platters/disk Capacity= 512 x 300 x 20,000 x 2 x 5 = 30 GB Note: To disk vendors, 1 GB = 10^9 bytes! (instead of 2^30 bytes) Disk operation (sketch out): * arm with read/write head moves (seeks) radially * platter spins at fixed rotational rate Typical parameters: * rotational rate: 7,200 RPMs * seek time: 5 ms Access time = T_avg_seek + T_avg_rotation + T_avg_transfer * T_avg_seek: ~5 ms * T_avg_rotation: 1/2 x 1/RPMs x 60s/1m (~4 ms) * T_avg_transfer: 1/RPM x 1/avg sectors/track x 60s/1m (~0.02 ms) Key ideas: 1. First bit per sector is expensive, remaining bits free 2. Twice seek time is a reasonable guess for access time 3. Disk much slower than memory * 40,000x (SRAM) * 2,500x (DRAM) ********************* How CPU accesses disk ********************* Add sketch of I/O bus onto system picture. Disk controllers present abstraction of a sequence of logical blocks to the CPU: LBN0, LBN1, LBN2, LBN3, .... Reading the disk: 1. CPU writes command, LBN, and dest memory address to port (address) associated with disk controller. 2. Disk controller reads sector and performs direct memory access (DMA) into main memory. 3. When DMA completes, disk controller notifies CPU with interrupt (asserts an interrupt pin on CPU). ****************** Key Storage Trends ****************** 1. Fast storage devices are more expensive per byte, have less capacity (Show table of storage devices 1980-2005) 2. The gap between CPU and memory speeds is widening (Show cpu-memory gap graph 1980-2005) *********** 2. Locality *********** Principle of locality: programs tend to use data items near other recently used items, or that were recently used themselves. One of the big ideas of computer science, because it has such enormous impact on the design of hardware and software systems. If a program references some data item, then it is likely to: 1. reference that data item in the near future (temporal locality). 2. reference a nearby item soon (spatial locality). ************************************** Locality of references to program data: ************************************** Example of good locality: int sumvec(int v[N]) { int i, sum = 0; for (i = 0; i < N; i++) sum += v[i]; return sum; } Example of good locality: int sumarrayrows(int a[M][N]) { int i, j, sum = 0; for (i = 0; i < M; i++) for (j = 0; j < N; j++) sum += a[i][j]; return sum; } Example of bad locality: int sumarraycols(int a[M][N]) { int i, j, sum = 0; for (j = 0; j < N; j++) for (i = 0; i < M; i++) sum += a[i][j]; return sum; } ******************************* Locality of instruction fetches ******************************* Example of good instruction locality: int sumvec(int v[N]) { int i, sum = 0; for (i = 0; i < N; i++) sum += v[i]; return sum; } ******************* Summary of locality ******************* Qualitatively evaluating programs for locality: 1. Referencing the same variable -> good temporal locality. 2. smaller strides -> good spatial locality. 3. Stride-1 (sequential) -> best spatial locality. 3. Loops -> good instruction locality. ****************** Locality Exercises ****************** Permute the loops to give good locality: a[i][j][k] int sumarray3d(int a[N][N][N]) { int i, j, k, sum = 0; for (i = 0; i < N; i++) { for (j = 0; j < N; j++) { for (k = 0; k < N; k++) { sum += a[k][i][j]; } } } return sum; } Rank the following examples: Solution: best to worst: clear1 (stride 1 in memory), clear2 (semi-sequential) clear3 (stride-6 access) #define N 1000 typedef struct { int vel[3]; int acc[3]; } point; point p[N]; void clear1(point *p, int n) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < 3; j++) p[i].vel[j] = 0; for (j = 0; j < 3; j++) p[i].acc[j] = 0; } } void clear2(point *p, int n) { int i, j; for (i = 0; i < n; i++) { for (j = 0; j < 3; j++) { p[i].vel[j] = 0; p[i].acc[j] = 0; } } } void clear3(point *p, int n) { int i, j; for (j = 0; j < 3; j++) { for (i = 0; i < n; i++) p[i].vel[j] = 0; for (i = 0; i < n; i++) p[i].acc[j] = 0; } }