Class 12: Caches I

**************************************
Review of locality from previous class
**************************************

Rules for good locality:
1. Referencing same variable good (TL-data)
2. Small strides good (SL-data)
3. Stride-1 (sequential) best (SL-data)
4. Loops good (TL and SL-instr)

Example: clear1, clear2, and clear3 functions

*****
Intro
*****

Fundamental properties of storage and programs:
1. Fast storage devices are more expensive per byte and have less capacity
2. The gap between CPU and memory speeds is widening
3. Well-written programs have good locality

Punch line: These properties suggest a beautiful idea: organize storage
as a memory hierarchy (MH).

Today's topics:
1. Caching in the memory hierarchy
2. Cache memories

**********************************
1. Caching in the Memory Hierarchy
**********************************

Sketch memory hierarchy pyramid:

    smaller, faster, costlier per byte
        L0: registers (flip-flops)
        L1: on-chip L1 cache (SRAM)
        L2: off-chip L2 cache (SRAM)
        L3: main memory (DRAM)
        L4: local disk
        L5: remote storage
    larger, slower, cheaper per byte

Key idea: For each k, the faster, smaller device at level k serves as a
cache for the larger, slower device at level k+1.

Why MHs work:
1. By locality, programs access data at level k more often than at
   level k+1
2. Thus, storage at level k+1 can be slower, and thus larger and cheaper
   per byte
3. Net effect: a large pool of storage that costs like the bottom of the
   pyramid but performs like the top.

*********************
General caching ideas
*********************

Cache: A smaller, faster storage device that acts as a staging area for a
subset of the data in a larger, slower device.

Common in real life:
* Backpack acts as a cache for your bookshelf at home.
* Desktop acts as a cache for the contents of your backpack.
Divide memory into 16 blocks of fixed size (e.g., 64 bytes)

k:    12   9  14   3       (block b stored at slot b mod 4)

k+1:   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15

Program needs object d, which is stored in block b
* Cache hit: block b is already at level k
* Cache miss: block b is not at level k (e.g., block 12), so level k
  must fetch it from level k+1
  * Placement policy: where can b go? (e.g., slot b mod 4)
  * Replacement policy: which block should be evicted? (e.g., LRU)

Types of misses:
* Cold (compulsory) miss: empty cache
* Conflict miss: multiple blocks map to the same slot,
  e.g., 0, 8, 0, 8, 0, 8, ...
* Capacity miss: set of active cache blocks (working set) is larger than
  the cache, e.g., 0, 1, 2, ..., 8, 0, 1, 2, ..., 8

*******************************************
Examples of caching in the memory hierarchy
*******************************************

Go over table from CS:APP Fig 6.23

*******
Summary
*******

1. The MH is a fundamental consequence of storage and program properties.
2. It gives the capacity of the cheapest storage at close to the
   performance of the most expensive.
3. Caching is the key.

*****************
2. Cache memories
*****************

Sketch on board, adding L1 and L2 caches to system picture:

  CPU chip
  --------
  register file   ALU
  L1 cache
  L2 cache <--backside bus--> bus interface
                                  <--sys bus--> I/O bridge <--mem bus--> Mem

Main idea: CPU hardware looks for a memory item in L1, then L2, then
main memory.

Sketch cache organization picture:

  t tag bits, B = 2^b data bytes per block, one block per cache line

  set 0:    valid | tag | 0 1 2 ... B-1    \
            ...                             } E lines per set
            valid | tag | 0 1 2 ... B-1    /
  set 1:    valid | tag | 0 1 2 ... B-1
            ...
            valid | tag | 0 1 2 ... B-1
  ...                                        S = 2^s sets
  set S-1:  valid | tag | 0 1 2 ... B-1
            ...
            valid | tag | 0 1 2 ... B-1

Cache size: C = B x E x S data bytes

Key ideas:
Cache is an array of sets.
Each set contains one or more lines.
Each line holds a block of data + a valid bit + a tag
* block: contents of a chunk of memory
* valid bit: indicates whether the line holds meaningful data
* tag: memory address bits that distinguish the blocks that can map to
  this set

Blocks are not lines (although the terms are often used as synonyms)

*****************
Addressing caches
*****************

(Draw to the right of big cache picture):

Address A (m address bits):

  m-1                                           0
  |--t bits--|-----s bits-----|-----b bits------|
      tag        set index        block offset

* CPU: movl A, %eax
* CPU to cache: fetch word at A
* Cache:
  * Locate set using set index
  * Locate line in set using tag (might not match!)
  * Check that line is valid (might not be!)
  * Hit:
    * Locate data word within block using block offset
  * Miss:
    * Load block from memory (overwriting an existing line)
    * Locate data word within block using block offset
  * Return word to CPU
* CPU: copy word to register

***************
Types of caches
***************

Direct mapped:          E = 1 line per set
E-way set associative:  1 < E < C/B lines per set
Fully associative:      E = C/B lines per set (i.e., 1 set)

***********************
Cache management issues
***********************

1. Line replacement on set-associative misses:
   LRU (expensive), NRU (not recently used), random
2. Write hits:
   * Suppose CPU writes word w that is in the cache.
   * After updating w in the cache, what to do with the memory copy of w?
   * write-through: update memory immediately
   * write-back: update memory only when the block is evicted
3. Write misses:
   * Suppose CPU writes word w that is not in the cache
   * write-allocate: load the memory block into the cache, then write w
     to the cache block
   * no-write-allocate: write w directly to memory

***********************
In-class cache problems
***********************

1.
Cache organization

  m: # bits per address
  C: cache data size in bytes
  B: block size in bytes
  E: # lines per set
  S: # sets
  t: # tag bits
  s: # set index bits
  b: # block offset bits

Assume: m = 32 bits, C = 1024 bytes

  ---------------------------------------
     B      E      S      t      s     b
  ---------------------------------------
     8      1    128     22      7     3
  ---------------------------------------
     8    128      1     29      0     3
  ---------------------------------------
    32      1     32     22      5     5
  ---------------------------------------
    32      4      8     24      3     5
  ---------------------------------------

2. Low-level cache simulation

  Byte-addressable memory
  Accesses are to 1-byte words
  m = 12 bits per address
  E = 2 (2-way set associative)
  B = 4 bytes per block
  S = 4 sets
  Assume write-through and write-allocate

2a. Show the portions of an address that make up the tag, set index, and
    block offset:

    11 10  9  8  7  6  5  4  3  2  1  0
     t  t  t  t  t  t  t  t  s  s  b  b

2b. Consider the following cache state (all values in hex):

  ----------------------------------------------------------
  Set index   tag   val   byte 0   byte 1   byte 2   byte 3
  ----------------------------------------------------------
      0        00    1      40       41       42       43
               83    1      fe       97       cc       d0
  ----------------------------------------------------------
      1        00    1      44       45       46       47
               83    0      --       --       --       --
  ----------------------------------------------------------
      2        00    1      48       49       4a       4b
               40    0      --       --       --       --
  ----------------------------------------------------------
      3        ff    1      9a       c0       03       ff
               00    0      --       --       --       --
  ----------------------------------------------------------

Fill in the table for the following operations, when carried out
***in the order shown***:

  ---------------------------------------------------------------
  Operation    Address    Hit?
                                 Read value (or unknown)
  ---------------------------------------------------------------
  Read         0x834      No     Unknown
  ---------------------------------------------------------------
  Write        0x836      Yes    --
  ---------------------------------------------------------------
  Read         0xffd      Yes    0xc0
  ---------------------------------------------------------------