Investigating Load Redundancy
Sagar Chaki
Dominic Mazzoni
A load can be one of the more costly instructions for a modern
processor to execute, especially if there is a cache miss.
Processors can now execute multiple instructions in the time
it takes to fetch one word from memory, and this gap in speed
is only growing larger. Superscalar processors can execute
other instructions while a load is taking place, but any
instruction dependent on the result of the load must stall.
It is reasonable to conjecture that a small but significant
number of loads are in a sense redundant, because the value
read from memory was already in the register. If we could
speculatively bypass the load when it was suspected to be
redundant (and of course clear the pipeline later if the
speculation turned out to be incorrect), this could result
in a significant gain in performance.
Milestone report: Nov. 10, 1999
- What We have Accomplished So Far:
Since our last report, we have modified SimpleScalar so that
it detects redundant loads, counts them, and (depending on
a compile-time option) can eliminate them by replacing them
with NOPS on the fly. We are finishing up an actual predictor
that will predict whether or not a load is redundant using
a two-bit counter associated with the load address.
We have run this modified simulator on our own qsort test
and on several of the Spec95 benchmarks. We are seeing
potential performance gains of anywhere from 2% to 13%,
but we have yet to distinguish between predictable redundant
loads and coincidental ones.
- Major Changes:
We are not going to simulate speculative execution.
SimpleScalar is too complicated and we have not figured out
how to safely use its speculative execution code. Instead,
we will provide three numbers for every benchmark:
the running time with normal (out of order) execution, the
best possible running time assuming all loads were predicted
correctly, and the running time assuming an oversimplified
(branch-predictor-based) load predictor. The actual
expected performance would be somewhere between these latter
two numbers, which essentially give bounds on the possible
speedup.
- Meeting Milestone:
We have essentially met our milestone, except that we have
simplified our proposed changes to SimpleScalar.
- Surprises:
We discovered that the majority of redundant loads
were actually cache hits, resulting in
negligible savings with our load predictor.
Obviously the cache misses give us the biggest wins,
so luckily there are some of these.
- Revised Schedule:
By next Monday, we will have completed our simulations, and
that will only leave tabulation, analysis, and presentation
for the final week of the project.
- Resources Needed:
We have all of the resources we need to complete this project.
Original report:
Preliminary investigation has showed that a significant
fraction of loads may actually be redundant. In our
profiling of some common functions on a 21164 Alpha, we
found anywhere from 5% to 95% redundant loads, with
typical results being >10%. Considering that many of
these loads may be taking place in short inner loops,
bypassing them could result in measurable performance
gains.
| Function |
Redundant |
Total |
Percentage |
| printf("\n"); |
3 |
61 |
4.91% |
| malloc(1000); |
6 |
48 |
12.5% |
| qsort(p,16,4,f); (reverse order) |
179 |
850 |
21.06% |
| qsort(p,16,4,f); (random ints) |
206 |
760 |
27.11% |
| fourier_transform(64,a0,0,r0,r1); |
46 |
48 |
95.83% |
Click here for our original project proposal
Update 10/24/99:
Analysis of our initial tests indicate that there is high correlation
between redundant loads and instruction addresses. So we will model
our redundant load predictor exactly like a branch predictor:
each load will lookup in a history cache, each one of which points to
a 2-bit (or more sophisticated) predictor. If the prediction is that
the load is redundant, speculatively execute dependent instructions as if
the value to be loaded was correct (but save the current state in case
the speculation is wrong). If the load turns out to be not redundant,
we will flush the pipeline and go back to our saved state. Note that
we are predicting based on the instruction that made the load, not the
memory address we loaded from.
One concern is that we may not know if the load was redundant or not for
a long time. Our solution to this is to always assume a load is not
redundant if we have a cache miss. The main reason for this is that it
is not possible to speculatively execute too many instructions anyway;
eventually it gets too difficult to backtrack. So if there's a cache miss
we treat it as a normal load, but if it's a hit we have the option of
predicting that it's redundant. Since cache misses only occur about 5% of
the time, it does not make sense to overoptimize this case anyway.
However, it certainly would be possible, thought slightly more complicated,
to keep speculatively executed instructions in the pipeline after a cache
miss, and continue execution many cycles later if the prediction was
correct.
We are currently trying to add code to SimpleScalar that will completely
skip all redundant loads (by cheating and always guessing correctly).
This will give us an upper bound on the speedup we could possible achieve
with various benchmarks, and get us ready to start implementing the real
predictor.
How is this different than Value Prediction?
It has been observed that what we are doing is not very different than
Value Prediction. Here are the main differences as we see them:
- Less Hardware.
Value prediction requires more hardware because it must keep track
of the value history of each instruction. Also, in order to be useful
it needs to be large enough to predict any possible instruction. Since
our predictor is only concerned with loads, it could be much smaller, and
only requires a two-bit counter for each load instruction in the cache.
- Larger potential performance gains.
A load typically requires ten cycles, even with a cache hit. If its
value was correctly predicted, this could save much more than predicting
the result of most other instructions.
- Value prediction does not catch all cases of redundancy.
A value predictor typically only predicts the result of an instruction
based on previous execution. A load predictor can predict that the value
stays the same, regardless of what that value is.