# Lectures 21 Prefetching Arrays I. Tolerating Memory Latency II. Prefetching Compiler Algorithm III. Experimental Results Material from: T.C. Mowry, M. S. Lam and A. Gupta. "Design and Evaluation of a Compiler Algorithm for Prefetching." In Proceedings of ASPLOS-V, Oct. 1992, pp. 62-73. ALSU 11.11.4 Phillip B. Gibbons 15-745: Prefetching Arrays



## Coping with Memory Latency Reduce Latency: - Locality Optimizations • reorder iterations to improve cache reuse Tolerate Latency: - Prefetching • move data close to the processor before it is needed



## Prefetching Research Goals • Domain of Applicability • Performance Improvement – maximize benefit – minimize overhead Carnegie Mellon



### Prefetching Concepts possible only if addresses can be determined ahead of time coverage factor = fraction of misses that are prefetched unnecessary if data is already in the cache effective if data is in the cache when later referenced Analysis: what to prefetch - maximize coverage factor - minimize unnecessary prefetches Scheduling: when/how to schedule prefetches - maximize effectiveness - minimize overhead per prefetch Carnegie Mellon



### Recall: Steps in Locality Analysis 1. Find data reuse — if caches were infinitely large, we would be finished 2. Determine "localized iteration space" — set of inner loops where the data accessed by an iteration is expected to fit within the cache 3. Find data locality: — reuse ∩ localized iteration space ⇒ locality Carnegie Mellon











### Loop Splitting

- · Decompose loops to isolate cache miss instances
  - cheaper than inserting IF(Prefetch Predicate) statements

| Locality Type | Predicate     | Loop Transformation |
|---------------|---------------|---------------------|
| None          | True          | None                |
| Temporal      | i = 0         | Peel loop i         |
| Spatial       | (i mod L) = 0 | Unroll loop i by L  |

(L elements/cache line)

- · Apply transformations recursively for nested loops
- Suppress transformations when loops become too large
  - avoid code explosion

15-745: Prefetching Arrays

Carnegie Mellon

```
Prefetching via Software Pipelining
                        Iterations Ahead = \left\lceil \frac{1}{c} \right\rceil
      where / = memory latency, s = shortest path through loop body
                                  Software Pipelined Loop
     Original Loop
                                    (6 iterations ahead)
                                                            /* Prolog */
 for (i = 0; i<100; i++)
                                for (i = 0; i < 6; i += 2)
    a[i] = 0;
                                   prefetch(&a[i]);
                                for (i = 0; i<94; i+=2){ /* Steady State*/
                                   prefetch(&a[i+6]);
                                   a[i] = 0;
(2 elements/cache line)
                                   a[i+1] = 0;
                                for (i = 94; i<100; i++) /* Epilog */
                                   a[i] = 0;
                                                             Carnegie Mellon
15-745: Prefetching Arrays
```



# Experimental Results (Dense Matrix Uniprocessor) Performance of Prefetching Algorithm Locality Analysis Software Pipelining Interaction with Locality Optimizer

### III. Experimental Framework Architectural Extensions: Prefetching support: - lockup-free caches - 16-entry prefetch issue buffer - prefetch directly into both levels of cache Contention: - memory pipelining rate = 1 access every 20 cycles - primary cache tag fill = 4 cycles Misses get priority over prefetches Simulator / Applications: • Detailed cache simulator driven by pixified object code · Memory subsystem: - 8K L1 / 256K L2 direct-mapped caches, 32 byte lines - miss penalties: 12 / 75 cycles · Applications from SPEC, SPLASH, and NAS Parallel Carnegie Mellon 15-745: Prefetching Arrays











## Prefetching Indirections for (i = 0; i<100; i++) sum += A[index[i]]; Analysis: what to prefetch - both dense and indirect references - difficult to predict whether indirections hit or miss Scheduling: when/how to issue prefetches - modification of software pipelining algorithm Carnegie Mellon



### Software Pipelining for Indirections Software Pipelined Loop Original Loop (5 iterations ahead) for (i = 0; i<100; i++) for (i = 0; i<5; i++) /\* Prolog 1 \*/ sum += A[index[i]]; prefetch(&index[i]); for (i = 0; i<5; i++) { /\* Prolog 2 \*/ prefetch(&index[i+5]); prefetch(&A[index[i]]); for (i = 0; i<90; i++) { /\* Steady State\*/ prefetch(&index[i+10]); prefetch(&A[index[i+5]]); sum += A[index[i]]; for (i = 90; i<95; i++) { /\* Epilog 1 \*/ prefetch(&A[index[i+5]]); sum += A[index[i]]; for (i = 95; i<100; i++) /\* Epilog 2 \*/ sum += A[index[i]]; Carnegie Mellon 15-745: Prefetching Arrays 26

### Summary of Results

### Dense Matrix Code:

- eliminated 50% to 90% of memory stall time
- overheads remain low due to prefetching selectively
- significant improvements in overall performance (6 over 45%)

### Indirections, Sparse Matrix Code:

- expanded coverage to handle some important cases

15-745: Prefetching Arrays 28

## Prefetching for Arrays: Concluding Remarks Demonstrated that software prefetching is effective - selective prefetching to eliminate overhead - dense matrices and indirections / sparse matrices - uniprocessors and multiprocessors Hardware should focus on providing sufficient memory bandwidth

