15-213  
"The course that gives CMU its Zip!"  
Cache Memories  
September 30, 2008

**Topics**  
- Generic cache memory organization  
- Direct mapped caches  
- Set associative caches  
- Impact of caches on performance

---

**Announcements**

**Exam grading done**
- Everyone should have gotten email with score (out of 72)  
  - mean was 50, high was 70  
  - solution sample should be up on website soon  
- Getting your exam back  
  - some got them in recitation  
  - working on plan for everyone else (worst case = recitation on Monday)  
- If you think we made a mistake in grading  
  - please read the syllabus for details about the process for handling it

---

**General cache mechanics**

- Smaller, faster, more expensive memory caches a subset of the blocks  
- Data is copied between levels in block-sized transfer units  
- Larger, slower, cheaper memory is partitioned into "blocks"

---

**Cache Performance Metrics**

**Miss Rate**
- Fraction of memory references not found in cache (misses / accesses)  
  - 1 – hit rate  
- Typical numbers (in percentages):  
  - 3-10% for L1  
  - can be quite small (e.g., < 1%) for L2, depending on size, etc.

**Hit Time**
- Time to deliver a line in the cache to the processor  
  - includes time to determine whether the line is in the cache  
- Typical numbers:  
  - 1-2 clock cycles for L1  
  - 5-20 clock cycles for L2

**Miss Penalty**
- Additional time required because of a miss  
  - typically 50-200 cycles for main memory (Trend: increasing!)  

---

Let's think about those numbers

Huge difference between a hit and a miss
- 100X, if just L1 and main memory

Would you believe 99% hits is twice as good as 97%?
- Consider these numbers:
  - cache hit time of 1 cycle
  - miss penalty of 100 cycles

So, average access time is:
- 97% hits: 1 cycle + 0.03 * 100 cycles = 4 cycles
- 99% hits: 1 cycle + 0.01 * 100 cycles = 2 cycles
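
The arithmetic above can be sketched as a small helper. This is a minimal illustration, not part of the original slides; the function name `avg_access_time` is an assumption chosen for this example.

```c
/* Average access time in cycles: the hit time plus the miss penalty
   weighted by the miss rate. (Hypothetical helper for illustration.) */
double avg_access_time(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}
```

With the numbers above, `avg_access_time(1.0, 0.03, 100.0)` gives 4 cycles and `avg_access_time(1.0, 0.01, 100.0)` gives 2 cycles.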

Many types of caches

Examples
- Hardware: L1 and L2 CPU caches, TLBs, ...
- Software: virtual memory, FS buffers, web browser caches, ...

Many common design issues
- each cached item has a “tag” (an ID) plus contents
- need a mechanism to efficiently determine whether given item is cached
  - combinations of indices and constraints on valid locations
- on a miss, usually need to pick something to replace with the new item
  - called a “replacement policy”
- on writes, need to either propagate change or mark item as “dirty”
  - write-through vs. write-back

Hardware cache memories

Cache memories are small, fast SRAM-based memories managed automatically in hardware
- Hold frequently accessed blocks of main memory

CPU looks first for data in L1, then in main memory

Typical system structure:

Inserting an L1 Cache Between the CPU and Main Memory

The tiny, very fast CPU register file has room for four 4-byte words.

The transfer unit between the CPU register file and the cache is a 4-byte word.

The small fast L1 cache has room for two 4-word blocks.

The big slow main memory has room for many 4-word blocks.

General Organization of a Cache

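Cache size: \( C = B \times E \times S \) data bytes, where \( S \) is the number of sets, \( E \) the number of lines per set, and \( B \) the number of data bytes per block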

Addressing Caches

The word at address \( A \) is in the cache if the tag bits in one of the valid lines in set `<set index>` match `<tag>`.

The word contents begin at offset `<block offset>` bytes from the beginning of the block.

*Address A:*

```
m-1                                                          0
<tag> (t bits) | <set index> (s bits) | <block offset> (b bits)

set 0:    v  tag  0 1 ... B-1
          ...
set 1:    v  tag  0 1 ... B-1
          ...
set S-1:  v  tag  0 1 ... B-1
          ...
```

1. Locate the set based on <set index>
2. Locate the line in the set based on <tag>
3. Check that the line is valid
4. Locate the data in the line based on <block offset>
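
The four steps above can be sketched in C. This is a minimal sketch under assumed parameters, not the hardware's actual implementation: the geometry (S = 4 sets, E = 2 lines per set, B = 8 bytes per block) and all type and function names are hypothetical, chosen only for illustration.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical geometry: s = 2 set-index bits, b = 3 block-offset bits. */
enum { S_BITS = 2, B_BITS = 3, NUM_SETS = 1 << S_BITS,
       LINES_PER_SET = 2, BLOCK_SIZE = 1 << B_BITS };

typedef struct {
    int      valid;
    uint32_t tag;
    uint8_t  block[BLOCK_SIZE];
} cache_line;

typedef struct {
    cache_line sets[NUM_SETS][LINES_PER_SET];
} cache;

/* Split an address into its three fields. */
uint32_t block_offset(uint32_t addr) { return addr & (BLOCK_SIZE - 1); }
uint32_t set_index(uint32_t addr)    { return (addr >> B_BITS) & (NUM_SETS - 1); }
uint32_t tag_bits(uint32_t addr)     { return addr >> (B_BITS + S_BITS); }

/* Return a pointer to the cached byte for addr, or NULL on a miss. */
const uint8_t *cache_lookup(const cache *c, uint32_t addr)
{
    const cache_line *set = c->sets[set_index(addr)];       /* step 1 */
    for (int i = 0; i < LINES_PER_SET; i++) {
        if (set[i].tag == tag_bits(addr) && set[i].valid)   /* steps 2 and 3 */
            return &set[i].block[block_offset(addr)];       /* step 4 */
    }
    return NULL;
}
```

With `LINES_PER_SET` set to 1 the loop runs once, which corresponds to the single tag compare of a direct-mapped cache.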

---

**Accessing Direct-Mapped Caches**

*Set selection*

- Use the set index bits to determine the set of interest.

```
selected set
set 0: valid tag cache block
set 1: valid tag cache block
set S-1: valid tag cache block
```

---

**Example: Direct-Mapped Cache**

Simplest kind of cache, easy to build
(only 1 tag compare required per access)
Characterized by exactly one line per set.

```
set 0: valid tag cache block
set 1: valid tag cache block
set S-1: valid tag cache block
```

Cache size: \( C = B \times S \) data bytes

---

**Accessing Direct-Mapped Caches**

*Line matching and word selection*

- **Line matching:** Find a valid line in the selected set with a matching tag
- **Word selection:** Then extract the word

```
selected set (i):
0 1 1 0 1 0 1 0
(1) The valid bit must be set
(2) The tag bits in the cache line must match the tag bits in the address

If (1) and (2), then cache hit
```


```
selected set (i): cache block bytes 0 1 2 3 4 5 6 7
address fields:   tag | set index | block offset
```

(3) If cache hit, block offset selects starting byte.

```
t bits  s bits  b bits
0110    i      100
```

Direct-Mapped Cache Simulation

\( M = 16 \) byte addresses, \( B = 2 \) bytes/block, \( S = 4 \) sets, \( E = 1 \) line/set

Address trace (reads):

Address format (4-bit addresses):

| t = 1 | s = 2 | b = 1 |
|---|---|---|
| x | xx | x |
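
A simulation of this configuration can be sketched as follows. The slide's address trace did not survive extraction, so any trace used with this code is an assumption, for example the reads 0, 1, 7, 8, 0. The names `dm_line` and `simulate_direct_mapped` are hypothetical.

```c
#include <stddef.h>

/* Parameters from the slide: b = 1 offset bit, s = 2 set bits, E = 1. */
enum { SIM_B_BITS = 1, SIM_S_BITS = 2, SIM_NUM_SETS = 1 << SIM_S_BITS };

typedef struct { int valid; unsigned tag; } dm_line;

/* Simulate reads on a direct-mapped cache that starts empty.
   Sets hit[i] to 1 if access i hit, 0 if it missed; returns the hit count. */
int simulate_direct_mapped(const unsigned *trace, size_t n, int *hit)
{
    dm_line sets[SIM_NUM_SETS] = {0};
    int hits = 0;
    for (size_t i = 0; i < n; i++) {
        unsigned set = (trace[i] >> SIM_B_BITS) & (SIM_NUM_SETS - 1);
        unsigned tag = trace[i] >> (SIM_B_BITS + SIM_S_BITS);
        hit[i] = sets[set].valid && sets[set].tag == tag;
        if (hit[i]) {
            hits++;
        } else {
            /* Miss: load the block, evicting whatever the set held. */
            sets[set].valid = 1;
            sets[set].tag = tag;
        }
    }
    return hits;
}
```

On the assumed trace 0, 1, 7, 8, 0, only the read of address 1 hits. Address 8 maps to the same set as address 0 with a different tag, so it evicts that block and the final read of 0 misses again.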

Example: Set Associative Cache

Characterized by more than one line per set (E lines per set in an E-way associative cache)

```
set 0:    valid  tag  cache block
          ...
set 1:    valid  tag  cache block
          ...
  ...
set S-1:  valid  tag  cache block
          ...
```

Accessing Set Associative Caches

- **Set selection**
  - identical to direct-mapped cache

```
t bits  s bits  b bits
0001    0000   0000
```

```
m-1                                                    0
tag (t bits) | set index (s bits) | block offset (b bits)
```

(1) The valid bit must be set
(2) The tag bits in one of the cache lines must match the tag bits in the address

If (1) and (2), then cache hit

| t bits | s bits | b bits |
|---|---|---|
| 0110 | i | 100 |

(3) If cache hit, block offset selects starting byte.

Notice that the middle bits are used as the index.

Why Use Middle Bits as Index?

High-Order Bit Indexing
- Adjacent memory lines would map to same cache entry
- Poor use of spatial locality

Middle-Order Bit Indexing
- Consecutive memory lines map to different cache lines
- Can hold S*B*E-byte region of address space in cache at one time
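
The difference between the two indexing schemes can be sketched with a toy example. This assumes 16 memory blocks (4-bit block numbers) and a cache with S = 4 sets; these sizes and both function names are hypothetical.

```c
/* Hypothetical sizes: 4-bit block numbers, 2 set-index bits (S = 4). */
enum { SET_BITS = 2, BLOCK_NUM_BITS = 4 };

/* High-order indexing: the top bits of the block number pick the set. */
unsigned set_high_order(unsigned block_num)
{
    return block_num >> (BLOCK_NUM_BITS - SET_BITS);
}

/* Middle-order indexing: the address bits just above the block offset
   pick the set; these are the low bits of the block number. */
unsigned set_middle_order(unsigned block_num)
{
    return block_num & ((1u << SET_BITS) - 1);
}
```

Under these assumptions, consecutive blocks 0, 1, 2, 3 all land in set 0 with high-order indexing, but in sets 0, 1, 2, 3 with middle-order indexing, so a contiguous region can reside in the cache at once.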

Sidebar: Multi-Level Caches

Options: separate data and instruction caches, or a unified cache

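Levels, from the processor outward (the registers and L1 caches are on the processor):

| | Regs | L1 d-cache / L1 i-cache | Unified L2 cache | Memory | Disk |
|---|---|---|---|---|---|
| Size | 200 B | 8-64 KB | 1-4 MB SRAM | 128 MB DRAM | 30 GB |
| Speed | 3 ns | 3 ns | 6 ns | 60 ns | 8 ms |
| $/Mbyte | | | $100/MB | $1.50/MB | $0.05/MB |
| Line size | 8 B | 32 B | 32 B | 8 KB | |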

What about writes?

Multiple copies of data exist:
- L1
- L2
- Main Memory
- Disk

What to do when we write?
- Write-through
- Write-back
  - need a dirty bit
- What to do on a write-miss?
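
The two write policies can be contrasted with a toy model of a single cached word. This is only a sketch: it assumes every write hits in the cache, and the struct and function names are hypothetical.

```c
/* One cached word, enough to contrast the two write policies. */
typedef struct {
    int value;
    int dirty;          /* used only by write-back */
    int memory_writes;  /* writes propagated to the next level */
} cached_word;

/* Write-through: update the cache and propagate the write immediately. */
void write_through(cached_word *w, int value)
{
    w->value = value;
    w->memory_writes++;
}

/* Write-back: update only the cache and mark the line dirty. */
void write_back(cached_word *w, int value)
{
    w->value = value;
    w->dirty = 1;
}

/* On replacement, a dirty line must be written to the next level. */
void evict(cached_word *w)
{
    if (w->dirty) {
        w->memory_writes++;
        w->dirty = 0;
    }
}
```

In this model, three writes followed by an eviction cost three memory writes under write-through but only one under write-back, which is why write-back needs the dirty bit.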

What to do on a replacement?
- Depends on whether it is write-through or write-back

Software caches are more flexible

Examples
- File system buffer caches, web browser caches, etc.

Some design differences
- Almost always fully associative
  - so, no placement restrictions
  - index structures like hash tables are common
- Often use complex replacement policies
  - misses are very expensive when disk or network involved
- Not necessarily constrained to single “block” transfers
  - may fetch or write-back in larger units, opportunistically

Locality Example #1

Being able to look at code and get a qualitative sense of its locality is a key skill for a professional programmer

Question: Does this function have good locality?

```c
int sum_array_rows(int a[M][N])
{
   int i, j, sum = 0;
   for (i = 0; i < M; i++)
      for (j = 0; j < N; j++)
         sum += a[i][j];
   return sum;
}
```

Locality Example #2

Question: Does this function have good locality?

```c
int sum_array_cols(int a[M][N])
{
   int i, j, sum = 0;
   for (j = 0; j < N; j++)
      for (i = 0; i < M; i++)
         sum += a[i][j];
   return sum;
}
```
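
One way to reason about the two functions above is to compute the stride between consecutive inner-loop accesses. The sketch below assumes C's row-major array layout; the dimensions and helper names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical dimensions for illustration. */
enum { ROWS = 4, COLS = 8 };

/* Element offset of a[i][j] from a[0][0] in C's row-major layout. */
size_t elem_offset(size_t i, size_t j) { return i * COLS + j; }

/* Stride, in elements, between consecutive inner-loop accesses. */
size_t stride_row_wise(void) { return elem_offset(0, 1) - elem_offset(0, 0); }
size_t stride_col_wise(void) { return elem_offset(1, 0) - elem_offset(0, 0); }
```

Scanning along a row (inner loop over `j`) advances one element at a time, while scanning down a column (inner loop over `i`) jumps a whole row of `COLS` elements on each access.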

Locality Example #3

Question: Can you permute the loops so that the function scans the 3-d array `a[ ]` with a stride-1 reference pattern (and thus has good spatial locality)?

```c
int sum_array_3d(int a[M][N][N])
{
   int i, j, k, sum = 0;
   for (i = 0; i < M; i++)
      for (j = 0; j < N; j++)
         for (k = 0; k < N; k++)
            sum += a[i][j][k];
   return sum;
}
```
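
The same stride reasoning extends to three dimensions. As a sketch, with a hypothetical value for N and a hypothetical helper name, the offset of `a[i][j][k]` in `int a[M][N][N]` is:

```c
#include <stddef.h>

enum { DIM_N = 4 };  /* hypothetical value of N */

/* Element offset of a[i][j][k] from a[0][0][0] in row-major layout. */
size_t offset_3d(size_t i, size_t j, size_t k)
{
    return (i * DIM_N + j) * DIM_N + k;
}
```

Only the last index moves one element at a time, so a stride-1 scan needs the variable used as the last subscript to change in the innermost loop.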