Class 12: Caches I

**************************************
Review of locality from previous class
**************************************

Rules for good locality:
1. Referencing same variable good (TL-data)
2. Small strides good (SL-data)
3. Stride-1 (sequential) best (SL-data)
4. Loops good (TL and SL-instr)

Example: clear1, clear2, and clear3 functions

*****
Intro
*****

Fundamental properties of storage and programs:
1. Fast storage devices are more expensive per byte and have less capacity
2. The gap between CPU and memory speeds is widening
3. Well-written programs have good locality

Punch line: These properties suggest a beautiful idea: organize storage
as a memory hierarchy (MH).

Today's topics:
1. Caching in the memory hierarchy
2. Cache memories

**********************************
1. Caching in the Memory Hierarchy
**********************************

Sketch memory hierarchy pyramid:

    smaller, faster, costlier per byte
        L0: registers (flip-flops)
        L1: on-chip L1 cache (SRAM)
        L2: off-chip L2 cache (SRAM)
        L3: main memory (DRAM)
        L4: local disk
        L5: remote storage
    larger, slower, cheaper per byte

Key idea: For each k, the faster, smaller device at level k serves as a
cache for the larger, slower device at level k+1.

Why MHs work:
1. By locality, programs access data at level k more often than at
   level k+1
2. Thus, storage at level k+1 can be slower, and thus larger and cheaper
   per byte
3. Net effect: a large pool of storage that costs like the bottom of the
   pyramid but performs like the top.

*********************
General caching ideas
*********************

Cache: A smaller, faster storage device that acts as a staging area for a
subset of the data in a larger, slower device.

Common in real life:
* Backpack acts as a cache for your bookshelf at home.
* Desktop acts as a cache for the contents of your backpack.
Divide memory into 16 blocks of fixed size (e.g., 64 bytes)

k:    12   9  14   3       (block b stored at slot b mod 4)

k+1:   0  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15

Program needs object d, which is stored in block b
* Cache hit: block b is already at level k
* Cache miss: block b is not at level k (e.g., block 12), so level k
  must fetch it from level k+1
  * Placement policy: where can b go? (e.g., slot b mod 4)
  * Replacement policy: which block should be evicted? (e.g., LRU)

Types of misses:
* Cold (compulsory) miss: empty cache
* Conflict miss: multiple blocks map to the same slot,
  e.g., 0, 8, 0, 8, 0, 8, ...
* Capacity miss: set of active cache blocks (working set) is larger than
  the cache, e.g., 0, 1, 2, ..., 8, 0, 1, 2, ..., 8

*******************************************
Examples of caching in the memory hierarchy
*******************************************

Go over table from CS:APP Fig 6.23

*******
Summary
*******

1. The MH is a fundamental consequence of storage and program properties.
2. It gives the capacity of the cheapest storage at close to the
   performance of the most expensive.
3. Caching is the key.

*****************
2. Cache memories
*****************

Sketch on board, adding L1 and L2 caches to system picture:

  CPU chip
  --------
  register file   ALU
  L1 cache
  L2 cache <--backside bus--> bus interface
                                  <--sys bus--> I/O bridge <--mem bus--> Mem

Main idea: CPU hardware looks for a memory item in L1, then L2, then
main memory.

Sketch cache organization picture:

  t tag bits, B = 2^b data bytes per block, one block per cache line

  set 0:    valid | tag | 0 1 2 ... B-1    \
            ...                             } E lines per set
            valid | tag | 0 1 2 ... B-1    /
  set 1:    valid | tag | 0 1 2 ... B-1
            ...
            valid | tag | 0 1 2 ... B-1
  ...                                        S = 2^s sets
  set S-1:  valid | tag | 0 1 2 ... B-1
            ...
            valid | tag | 0 1 2 ... B-1

Cache size: C = B x E x S data bytes

Key ideas:
Cache is an array of sets.
Each set contains one or more lines.
Each line holds a block of data + a valid bit + a tag
* block: contents of a chunk of memory
* valid bit: indicates whether the line holds meaningful data
* tag: memory address bits that distinguish the blocks that can map to
  this set

Blocks are not lines (although the terms are often used as synonyms)

*****************
Addressing caches
*****************

(Draw to the right of big cache picture):

Address A (m address bits):

  m-1                                           0
  |--t bits--|-----s bits-----|-----b bits------|
      tag        set index        block offset

* CPU: movl A, %eax
* CPU to cache: fetch word at A
* Cache:
  * Locate set using set index
  * Locate line in set using tag (might not match!)
  * Check that line is valid (might not be!)
  * Hit:
    * Locate data word within block using block offset
  * Miss:
    * Load block from memory (overwriting an existing line)
    * Locate data word within block using block offset
  * Return word to CPU
* CPU: copy word to register

***************
Types of caches
***************

Direct mapped:          E = 1 line per set
E-way set associative:  1 < E < C/B lines per set
Fully associative:      E = C/B lines per set (i.e., 1 set)

***********************
Cache management issues
***********************

1. Line replacement on set-associative misses:
   LRU (expensive), NRU (not recently used), random
2. Write hits:
   * Suppose CPU writes word w that is in the cache.
   * After updating w in the cache, what to do with the memory copy of w?
   * write-through: update memory immediately
   * write-back: update memory only when the block is evicted
3. Write misses:
   * Suppose CPU writes word w that is not in the cache
   * write-allocate: load the memory block into the cache, then write w
     to the cache block
   * no-write-allocate: write w directly to memory

***********************
In-class cache problems
***********************

1.
Cache organization

  m: # bits per address
  C: cache data size in bytes
  B: block size in bytes
  E: # lines per set
  S: # sets
  t: # tag bits
  s: # set index bits
  b: # block offset bits

Assume: m = 32 bits, C = 1024 bytes

  ---------------------------------------
     B      E      S      t      s     b
  ---------------------------------------
     8      1    128     22      7     3
  ---------------------------------------
     8    128      1     29      0     3
  ---------------------------------------
    32      1     32     22      5     5
  ---------------------------------------
    32      4      8     24      3     5
  ---------------------------------------

2. Low-level cache simulation

  Byte-addressable memory
  Accesses are to 1-byte words
  m = 12 bits per address
  E = 2 (2-way set associative)
  B = 4 bytes per block
  S = 4 sets
  Assume write-through and write-allocate

2a. Show the portions of an address that make up the tag, set index, and
    block offset:

    11 10  9  8  7  6  5  4  3  2  1  0
     t  t  t  t  t  t  t  t  s  s  b  b

2b. Consider the following cache state (all values in hex):

  ----------------------------------------------------------
  Set index   tag   val   byte 0   byte 1   byte 2   byte 3
  ----------------------------------------------------------
      0        00    1      40       41       42       43
               83    1      fe       97       cc       d0
  ----------------------------------------------------------
      1        00    1      44       45       46       47
               83    0      --       --       --       --
  ----------------------------------------------------------
      2        00    1      48       49       4a       4b
               40    0      --       --       --       --
  ----------------------------------------------------------
      3        ff    1      9a       c0       03       ff
               00    0      --       --       --       --
  ----------------------------------------------------------

Fill in the table for the following operations, when carried out
***in the order shown***:

  ---------------------------------------------------------------
  Operation    Address    Hit?
                                 Read value (or unknown)
  ---------------------------------------------------------------
  Read         0x834      No     Unknown
  ---------------------------------------------------------------
  Write        0x836      Yes    --
  ---------------------------------------------------------------
  Read         0xffd      Yes    0xc0
  ---------------------------------------------------------------