Investigating Load Redundancy

Sagar Chaki
Dominic Mazzoni

A load can be one of the more costly instructions for a modern processor to execute, especially if there is a cache miss. Processors can now execute multiple instructions in the time it takes to fetch one word from memory, and this gap in speed is only growing larger. Superscalar processors can execute other instructions while a load is taking place, but any instruction dependent on the result of the load must stall. It is reasonable to conjecture that a small but significant number of loads are in a sense redundant, because the value read from memory was already in the register. If we could speculatively bypass the load when it was suspected to be redundant (and of course clear the pipeline later if the speculation turned out to be incorrect), this could result in a significant gain in performance.

Milestone report: Nov. 10, 1999

Original report:

Preliminary investigation has showed that a significant fraction of loads may actually be redundant. In our profiling of some common functions on a 21164 Alpha, we found anywhere from 5% to 95% redundant loads, with typical results being >10%. Considering that many of these loads may be taking place in short inner loops, bypassing them could result in measurable performance gains.

Function Redundant Total Percentage
printf("\n"); 3 61 4.91%
malloc(1000); 6 48 12.5%
qsort(p,16,4,f); (reverse order) 179 850 21.06%
qsort(p,16,4,f); (random ints) 206 760 27.11%
fourier_transform(64,a0,0,r0,r1); 46 48 95.83%

Click here for our original project proposal

Update 10/24/99: Analysis of our initial tests indicate that there is high correlation between redundant loads and instruction addresses. So we will model our redundant load predictor exactly like a branch predictor: each load will lookup in a history cache, each one of which points to a 2-bit (or more sophisticated) predictor. If the prediction is that the load is redundant, speculatively execute dependent instructions as if the value to be loaded was correct (but save the current state in case the speculation is wrong). If the load turns out to be not redundant, we will flush the pipeline and go back to our saved state. Note that we are predicting based on the instruction that made the load, not the memory address we loaded from.

One concern is that we may not know if the load was redundant or not for a long time. Our solution to this is to always assume a load is not redundant if we have a cache miss. The main reason for this is that it is not possible to speculatively execute too many instructions anyway; eventually it gets too difficult to backtrack. So if there's a cache miss we treat it as a normal load, but if it's a hit we have the option of predicting that it's redundant. Since cache misses only occur about 5% of the time, it does not make sense to overoptimize this case anyway. However, it certainly would be possible, thought slightly more complicated, to keep speculatively executed instructions in the pipeline after a cache miss, and continue execution many cycles later if the prediction was correct.

We are currently trying to add code to SimpleScalar that will completely skip all redundant loads (by cheating and always guessing correctly). This will give us an upper bound on the speedup we could possible achieve with various benchmarks, and get us ready to start implementing the real predictor.

How is this different than Value Prediction? It has been observed that what we are doing is not very different than Value Prediction. Here are the main differences as we see them:

  1. Less Hardware. Value prediction requires more hardware because it must keep track of the value history of each instruction. Also, in order to be useful it needs to be large enough to predict any possible instruction. Since our predictor is only concerned with loads, it could be much smaller, and only requires a two-bit counter for each load instruction in the cache.
  2. Larger potential performance gains. A load typically requires ten cycles, even with a cache hit. If its value was correctly predicted, this could save much more than predicting the result of most other instructions.
  3. Value prediction does not catch all cases of redundancy. A value predictor typically only predicts the result of an instruction based on previous execution. A load predictor can predict that the value stays the same, regardless of what that value is.