Memory Hierarchy


David A. Eckhardt
School of Computer Science
Carnegie Mellon University


de0u@andrew.cmu.edu
Outline

Lecture versus book
	· Some of Chapter 2
	· Some of Chapter 10

Memory hierarchy
	· A principle (not just a collection of hacks)
Am I in the wrong class?

"Memory hierarchy": OS or Architecture?
	· Yes

Why cover it here?
	· OS manages several layers
		· RAM cache(s)
		· Virtual memory
		· File system buffer cache
	· Learn core concept, apply as needed
You can't have it all
Memory Desiderata
	· big
	· fast
	· cheap
	· compact
	· cold
	· non-volatile (can remember without electricity)

Pick one
	· ok, maybe two

Why?
	· Bigger -> slower (speed of light)
	· Bigger -> more defects (assuming a constant defect rate per unit area)
	· Faster, denser -> hotter (at least for FETs)
Users want it all
The ideal
	· Infinitely large, fast, cheap memory
	· Users want it (those pesky users!)
	· They can't have it
		· Ok, so cheat!

Locality of reference
	· Users don't really access 4 gigabytes uniformly by byte
	· 80/20 "rule"
		· 80% of the time is spent in 20% of the code
		· Great, only 20% of the memory needs to be fast!

Deception strategy
	· harness 2 (or more) kinds of memory together
	· secretly move information among memory types
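
A rough way to see why the cheat pays off: if some fraction of accesses hit the fast memory, the average access time is a weighted mix of the two speeds. The sketch below assumes a simple two-level model; the function name and any timings fed to it are illustrative assumptions, not figures from the lecture.

```c
/* Average access time for a two-level hierarchy (illustrative sketch).
 * hit_rate: fraction of accesses served by the fast memory (0.0 to 1.0).
 * fast_ns, slow_ns: assumed access times of each level, in nanoseconds.
 * Simplification: a miss is charged only the slow access time. */
double effective_access_ns(double hit_rate, double fast_ns, double slow_ns)
{
    return hit_rate * fast_ns + (1.0 - hit_rate) * slow_ns;
}
```

With assumed timings of 1 ns (fast) and 100 ns (slow), an 80% hit rate gives about 20.8 ns on average, while a 99% hit rate gives about 1.99 ns. The hit rate matters far more than the 80/20 slogan suggests.
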
Cache
Small, fast memory...
	· ...backed by a large, slow memory
	· ...indexed according to the large memory's address space
	· ...containing the most popular parts (now)
SRAM cache holds popular pixels
	· DRAM holds popular image areas
		· Disk holds popular satellite images
			· Tape holds one orbit's worth of images

Clean general-purpose implementation?
	· No: tradeoffs different at each level
		· size ratio: data address / data size
		· speed ratio
		· access time = f(address)
But the idea is general-purpose
Deception Picture
				L1 CPU cache				
			L2 cache					
		RAM						
	disk							
CD-R								

The questions
	· Line size
	· Placement/search
	· Miss policy
	· Eviction
	· Write policy
Today's Examples

L1 CPU cache
	· Smallest, fastest
	· Maybe on the same die as the CPU
	· Maybe 2nd chip of multi-chip module
	· As far as CPU is concerned, this is the memory

Disk block cache
	· Holds disk sectors in RAM
	· Entirely defined by software
	· You will implement one
Line size

"Line size" = item size
	· Many caches handle fixed-size objects
		· Simpler
		· Predictable operation times

L1 cache line size
	· Four 32-bit words, i.e. 16 bytes (486, IIRC)

Disk cache line size
	· Maybe disk sector (512 bytes)
	· Maybe "file system block" (small # of sectors)
Picking a Line Size
What should it be?
	· See "locality of reference"
		· ("typical" reference pattern)

Too big
	· Waste throughput
		· Fetch a megabyte, use 1 byte
	· Reduce "hit rate"
		· String move: *q++ = *p++
		· Better have at least two cache lines!

Too small
	· Waste latency
		· Frequently need to fetch another line
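
The tradeoff can be put in numbers. This is a minimal sketch, assuming a purely sequential scan and counting only which lines get touched; both helper names are made up for illustration.

```c
#include <stddef.h>

/* Number of line fetches needed to scan `nbytes` bytes sequentially,
 * starting at address `start`, when lines are `line_size` bytes long.
 * (Hypothetical helper: counts distinct lines touched.) */
size_t sequential_line_fetches(size_t start, size_t nbytes, size_t line_size)
{
    if (nbytes == 0)
        return 0;
    size_t first = start / line_size;
    size_t last  = (start + nbytes - 1) / line_size;
    return last - first + 1;
}

/* Fraction of the fetched bytes that the scan actually used. */
double fetch_efficiency(size_t start, size_t nbytes, size_t line_size)
{
    size_t lines = sequential_line_fetches(start, nbytes, line_size);
    if (lines == 0)
        return 0.0;
    return (double)nbytes / (double)(lines * line_size);
}
```

Scanning 1024 bytes takes 64 fetches with 16-byte lines but 256 fetches with 4-byte lines (too small: latency paid over and over). Touching a single byte with a 1 MB line costs one fetch, of which less than one millionth is used (too big: wasted throughput).
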
Content-Addressable Memory
RAM
	· store(address, value)
	· fetch(address) -> value

CAM
	· store(address, value)
	· fetch(value) -> address
		· Are we having P2P yet?

"It's always the last place you look"
	· Not with a CAM!

Cool!
	· But fast CAMs are small (speed of light)
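
A software model of the CAM interface above may help. It is only a sketch: the slot count and names are assumptions, and the loop stands in for what hardware does with one comparator per slot, all at once.

```c
#include <stddef.h>

#define CAM_SLOTS 8   /* hypothetical size; real fast CAMs are small too */

struct cam {
    int      valid[CAM_SLOTS];
    unsigned value[CAM_SLOTS];
};

/* store(address, value): same as RAM. Out-of-range addresses are ignored. */
void cam_store(struct cam *c, size_t address, unsigned value)
{
    if (address < CAM_SLOTS) {
        c->value[address] = value;
        c->valid[address] = 1;
    }
}

/* fetch(value) -> address, or -1 if no slot holds the value.
 * Hardware compares every slot in parallel; this loop models only
 * the result, not the speed. */
long cam_fetch(const struct cam *c, unsigned value)
{
    for (size_t a = 0; a < CAM_SLOTS; a++)
        if (c->valid[a] && c->value[a] == value)
            return (long)a;
    return -1;
}
```

The sequential loop is exactly the "last place you look" search that a hardware CAM avoids, which is why the hardware version costs a comparator per slot and stays small.
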
Placement/search
Placement = "Where can we put ____?"
	· "Direct mapped" - each item has one place
		· Think: hash function
	· "Fully associative" - each item can be any place
		· Think: CAM

Direct Mapped
	· Placement & search are trivial
	· False collisions are common
		· String move: *q++ = *p++
		· Each iteration could be two cache misses!

Fully Associative
	· No false collisions
	· Cache size limited by CAM size
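
The string-move collision can be reproduced with a toy direct-mapped cache. This is a sketch under stated assumptions: 64 lines of 16 bytes (1 KB total), byte-at-a-time copying, and write-allocate on a write miss. The sizes and names are invented for illustration.

```c
#include <stdint.h>

#define LINE_SIZE 16u    /* assumed line size in bytes */
#define NUM_LINES 64u    /* assumed number of lines: 1 KB cache */

struct dm_cache {
    int      valid[NUM_LINES];
    uint32_t tag[NUM_LINES];
};

/* Returns 1 on a hit, 0 on a miss (the missing line is then installed). */
int dm_access(struct dm_cache *c, uint32_t addr)
{
    uint32_t line  = addr / LINE_SIZE;
    uint32_t index = line % NUM_LINES;   /* the one place this line may go */
    uint32_t tag   = line / NUM_LINES;   /* tells apart lines sharing a slot */

    if (c->valid[index] && c->tag[index] == tag)
        return 1;
    c->valid[index] = 1;
    c->tag[index]   = tag;
    return 0;
}

/* Count misses for a byte-wise copy: for each byte, read *p++ then write *q++. */
unsigned copy_misses(struct dm_cache *c, uint32_t src, uint32_t dst, unsigned n)
{
    unsigned misses = 0;
    for (unsigned i = 0; i < n; i++) {
        misses += !dm_access(c, src + i);
        misses += !dm_access(c, dst + i);
    }
    return misses;
}
```

Copying 64 bytes from address 0 to address 1024 (exactly one cache size apart) makes source and destination fight over the same slots: 128 misses, two per iteration. Moving the destination to 1536 removes the collision and the same copy takes 8 misses, one per line touched.
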
Sample choices

L1 cache
	· Often direct mapped
	· Sometimes 2-way associative
	· Depends on phase of transistor

Disk block cache
	· Fully associative
		· Open hash table = large variable-time CAM
		· Fine since "CAM" lookup time << disk seek time
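
One way to read "open hash table" is a chained hash table keyed by block number. The sketch below takes that reading; the table size, struct layout, and function names are assumptions, and buffer allocation, locking, and eviction are left out.

```c
#include <stddef.h>

#define NBUCKETS   64    /* hypothetical table size */
#define BLOCK_SIZE 512   /* one disk sector */

struct buf {
    unsigned long blockno;           /* key: disk block number */
    struct buf   *next;              /* chain of blocks in the same bucket */
    unsigned char data[BLOCK_SIZE];
};

struct block_cache {
    struct buf *bucket[NBUCKETS];
};

/* Any block may be cached ("fully associative"); the hash only picks
 * which chain to search, so lookup time varies with chain length. */
struct buf *bc_lookup(struct block_cache *bc, unsigned long blockno)
{
    struct buf *b = bc->bucket[blockno % NBUCKETS];
    while (b != NULL && b->blockno != blockno)
        b = b->next;
    return b;                        /* NULL means miss */
}

/* Insert a buffer that the caller has already filled from disk. */
void bc_insert(struct block_cache *bc, struct buf *b)
{
    struct buf **head = &bc->bucket[b->blockno % NBUCKETS];
    b->next = *head;
    *head = b;
}
```

Walking a chain of a few entries costs nanoseconds to microseconds, which disappears next to a disk seek measured in milliseconds.
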

Choosing associativity
	· Trace-driven simulation
	· Packaging constraints
Miss policy
Miss policy: {Read,Write} X {Allocate,Around}
	· Allocate: miss -> allocate a slot
	· Around: miss -> don't change cache state

L1 cache
	· Mostly read-allocate, write-allocate
	· But not for "uncacheable" memory
		· ...such as Ethernet card ring buffers
	· "Memory system" provides "cacheable" bit
	· Some CPUs have "write block" instructions

Disk block cache
	· Mostly read-allocate, write-allocate
	· What about reading (writing) a huge file?
	· see (e.g.) madvise()
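
For the huge-file case, a program that maps a file can hint that it will read it once, front to back. The sketch below assumes a Unix-like system that provides madvise() and MADV_SEQUENTIAL; the helper name is invented, and what the kernel does with the hint is system-dependent.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE 1   /* ask glibc to declare madvise() */
#endif
#include <fcntl.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Sum the bytes of a file in one sequential pass (hypothetical helper).
 * Returns -1 on error; an empty file is treated as an error for brevity. */
long sum_file_sequential(const char *path)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    struct stat st;
    if (fstat(fd, &st) < 0 || st.st_size == 0) {
        close(fd);
        return -1;
    }
    size_t len = (size_t)st.st_size;

    unsigned char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, 0);
    close(fd);                        /* the mapping survives the close */
    if (p == MAP_FAILED)
        return -1;

    /* Advice only: the kernel may read ahead and may reclaim pages soon
     * after use. Failure is harmless, so the result is ignored. */
    (void)madvise(p, len, MADV_SEQUENTIAL);

    long sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += p[i];

    munmap(p, len);
    return sum;
}
```

The hint lets the cache avoid evicting everything else on behalf of data that will never be touched again, which is the "around" idea applied by request rather than by default.
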
Eviction
"The steady state of disks is 'full'."
	· Each placement requires an eviction
	· Easy for direct-mapped caches
	· Otherwise, policy is necessary

Ideal policy - consult an oracle!
	· Evict whichever item won't be used for the longest time
	· Useful only in simulation comparisons

Least-recently-used (LRU)
	· LRU may be a reasonable approximation of Ideal
		· ("Past performance does not guarantee future results")
	· Or it may be the worst possible thing
		· Cache size: 4 (fully associative)
		· Reference pattern: 1, 2, 3, 4, 5, ...
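
The bad case is easy to simulate. The sketch below counts misses for a small fully associative LRU cache; it reads the reference pattern as a loop over five items (1, 2, 3, 4, 5, 1, 2, ...), which is an assumption about what the "..." means. The function name and slot limit are illustrative.

```c
#include <stddef.h>

#define MAX_SLOTS 16   /* hypothetical upper bound on cache size */

/* Count misses when `refs[0..n-1]` runs through a fully associative
 * cache of `capacity` slots with least-recently-used eviction. */
size_t lru_misses(const int *refs, size_t n, size_t capacity)
{
    int    item[MAX_SLOTS];
    size_t last_use[MAX_SLOTS];
    size_t used = 0, misses = 0;

    if (capacity == 0)
        return n;                    /* nothing can ever hit */
    if (capacity > MAX_SLOTS)
        capacity = MAX_SLOTS;

    for (size_t t = 0; t < n; t++) {
        size_t slot = used;          /* `used` means "not found" */
        for (size_t i = 0; i < used; i++)
            if (item[i] == refs[t]) { slot = i; break; }

        if (slot == used) {          /* miss */
            misses++;
            if (used < capacity) {
                used++;              /* take the next free slot */
            } else {
                slot = 0;            /* evict the oldest last_use */
                for (size_t i = 1; i < used; i++)
                    if (last_use[i] < last_use[slot])
                        slot = i;
            }
            item[slot] = refs[t];
        }
        last_use[slot] = t;          /* record this reference time */
    }
    return misses;
}
```

Three passes over five items (15 references) through 4 slots miss all 15 times: LRU always evicts exactly the item needed next. With 5 slots the same trace misses only 5 times.
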
Eviction

Random
	· Pick a random item to evict
	· Randomness protects against pathological cases

Could "Random" be good?
	· What would it take?

L1 cache
	· LRU is easy for 2-way associative!
Disk block cache
	· Frequently LRU, frequently modified
		· "Prefer metadata"
		· Other hacks
Write policy
Assume a write hit (not write-around)

Write-through
	· Store new value in cache
	· Also store it through to next level
	· Simple

Write-back
	· Store new value in cache
	· Store it to next level only on eviction
		· "Mandatory optimization": "dirty bit"
	· May save substantial work
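
The difference shows up when the same location is written repeatedly. This is a sketch of a single cached word under each policy, assuming every write is a hit; the struct and function names are invented for illustration.

```c
enum write_policy { WRITE_THROUGH, WRITE_BACK };

struct cached_word {
    enum write_policy policy;
    int      dirty;              /* meaningful only for write-back */
    unsigned value;              /* the cached copy */
    unsigned next_level;         /* stands in for the next level of memory */
    unsigned next_level_writes;  /* how many slow writes were paid for */
};

/* A write that hits in the cache. */
void write_hit(struct cached_word *w, unsigned value)
{
    w->value = value;
    if (w->policy == WRITE_THROUGH) {
        w->next_level = value;   /* store through immediately */
        w->next_level_writes++;
    } else {
        w->dirty = 1;            /* remember that the next level is stale */
    }
}

/* Eviction: write-back pays its one deferred write here, if dirty. */
void evict(struct cached_word *w)
{
    if (w->policy == WRITE_BACK && w->dirty) {
        w->next_level = w->value;
        w->next_level_writes++;
        w->dirty = 0;
    }
}
```

After 100 write hits followed by an eviction, the write-through word has written to the next level 100 times and the write-back word once. The dirty bit also means evicting an unmodified item costs nothing.
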
Write policy

L1 cache
	· It depends
	· May be write-through if next level is L2 cache

Disk block cache
	· Write-back
	· Popular mutations
		· Pre-emptive write-back if disk idle
		· Bound write-back delay (crashes happen)
Translation caches
Address mapping
	· CPU presents virtual address (CS:EIP)
	· Fetch segment descriptor from L1 cache (or not)
	· Now fetch page table entry from L1 cache (or not)
	· Now fetch the actual word from L1 cache (or not)

"Translation lookaside buffer" (TLB)
	· Observe result of segmentation, virtual->physical mapping
	· Key = virtual address
	· Value = physical address
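
A toy TLB shows the key/value idea. The sketch makes several simplifying assumptions: 4 KB pages, a flat one-level page table indexed by virtual page number, no segmentation, and a direct-mapped TLB (real TLBs are often associative). All names and sizes are illustrative.

```c
#include <stdint.h>

#define PAGE_SHIFT  12u                       /* assumes 4 KB pages */
#define PAGE_MASK   ((1u << PAGE_SHIFT) - 1u)
#define TLB_ENTRIES 16u                       /* hypothetical TLB size */
#define NUM_PAGES   256u                      /* size of the toy page table */

struct tlb {
    int      valid[TLB_ENTRIES];
    uint32_t vpn[TLB_ENTRIES];                /* key: virtual page number */
    uint32_t pfn[TLB_ENTRIES];                /* value: physical frame number */
    unsigned hits, misses;
};

/* Translate vaddr. page_table[vpn] yields the frame number (the slow path).
 * Caller must keep vaddr below NUM_PAGES << PAGE_SHIFT. */
uint32_t translate(struct tlb *t, const uint32_t *page_table, uint32_t vaddr)
{
    uint32_t vpn  = vaddr >> PAGE_SHIFT;
    uint32_t slot = vpn % TLB_ENTRIES;

    if (t->valid[slot] && t->vpn[slot] == vpn) {
        t->hits++;
    } else {
        t->misses++;                          /* walk the table, then remember */
        t->valid[slot] = 1;
        t->vpn[slot]   = vpn;
        t->pfn[slot]   = page_table[vpn];
    }
    return (t->pfn[slot] << PAGE_SHIFT) | (vaddr & PAGE_MASK);
}

/* After a context switch the cached pairs describe the wrong address
 * space, so discard them all. */
void tlb_flush(struct tlb *t)
{
    for (unsigned i = 0; i < TLB_ENTRIES; i++)
        t->valid[i] = 0;
}
```

With page 1 mapped to frame 7, translating 0x1234 misses once and yields 0x7234; a second access to the same page hits without touching the page table at all.
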
Challenges
Write-back failure
	· Power failure?
		· Battery-backed RAM!
	· Crash?
		· Maybe the old disk cache is ok after reboot?

Coherence
	· What about shared caches?
		· Multiprocessor: 4 L1 caches share L2 cache
		· TLB: v->p all wrong after context switch
	· What about non-participants?
		· I/O device does DMA
	· Solutions
		· Snooping
		· Invalidation messages
Summary

Memory hierarchy has many layers
	· Size: kilobytes through terabytes
	· Access time: nanoseconds through minutes

Common questions, solutions
	· Each instance is a little different
		· But there are lots of cookbook solutions