Value Prediction

Daniel Neill and Adam Wierman



Presentation Slides are available in [ps] [ppt]


Speculative execution is a powerful technique that can be applied in many different forms.  In a typical processor, control speculation is the crucial tool behind branch prediction, which predicts both the direction of a branch and its target.  These papers explore a different realm of speculation: data speculation.  Within data speculation, it is possible to predict either the location of data or the value of data.  Value prediction corresponds to the latter, and it is the type of speculation we explore in this discussion.

As with other forms of speculation, value prediction depends crucially on locality, both spatial and temporal.  Thus, a first step in understanding value prediction is to understand what leads to value locality.  To investigate this, we can consider common assembly instructions and measure how frequently the same instruction produces the same value twice in a row; the sketch below illustrates one way to take this measurement.  Not all instructions exhibit such locality.  For example, a memory store that updates its base register produces a new address on each instance and is quite unlikely to exhibit any locality.  A load, on the other hand, is very likely to exhibit both spatial and temporal locality, because a value stored in memory will likely be accessed more than once.  Thus, some instructions will be predictable while others will not.  For a complete description of the locality of various assembly instructions, we refer the reader to [LS96].
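To make the measurement concrete, here is a minimal C sketch of a last-value locality counter: a small table, indexed by instruction address, records each instruction's most recent result and counts how often the next result matches it.  The table size, the hash on the PC, and the toy trace are our own illustrative assumptions, not details taken from [LS96].

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative last-value locality counter.  Table size, PC hashing,
     * and the toy trace in main() are assumptions for this sketch. */

    #define SLOTS 1024

    static uint64_t last_value[SLOTS];  /* most recent result per PC slot */
    static int      seen[SLOTS];        /* slot written at least once     */
    static long     hits, total;

    static unsigned slot(uint64_t pc) { return (pc >> 2) % SLOTS; }

    /* Record one dynamic instance of the instruction at `pc` producing
     * `value`, counting whether it repeated its previous value. */
    static void record(uint64_t pc, uint64_t value)
    {
        unsigned s = slot(pc);
        if (seen[s]) {
            total++;
            if (last_value[s] == value)
                hits++;                 /* same value twice in a row */
        }
        seen[s] = 1;
        last_value[s] = value;
    }

    int main(void)
    {
        /* A load that keeps returning the same value shows high locality;
         * a store that strides through memory shows none. */
        for (int i = 0; i < 10; i++) {
            record(0x100, 42);                        /* highly predictable */
            record(0x104, 0x8000 + 8 * (uint64_t)i);  /* never repeats      */
        }
        printf("value locality: %ld of %ld instances\n", hits, total);
        return 0;
    }

Run over a real trace instead of the toy loop, this kind of counter yields exactly the per-instruction-class locality percentages reported in [LS96].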

Now that we understand where locality occurs, the next important question is whether there is a benefit to be gained from exploiting it.  The answer is a clear yes.  On any superscalar machine, multiple instructions are fetched at once, and the goal is to let them proceed through the pipeline in parallel.  However, this is not possible if one of these instructions uses a result computed by another instruction in the same group (e.g. add r3, r1, r2 followed by add r5, r3, r4, where the second add consumes the r3 produced by the first).  In this situation the dependent instruction must stall in the pipeline.  However, if we can predict the result of the computation, we can allow both instructions to continue through the pipeline in parallel.  Thus, the potential benefit of value prediction is substantial.

The two papers listed here present two approaches for exploiting value locality: Value Prediction (VP) and Instruction Reuse (IR).  The two techniques are fundamentally different.  VP is a speculative technique that predicts the result of an instruction based on previously seen results.  IR, on the other hand, recognizes that a computation has been performed before and therefore need not be performed again.  The details of the two schemes are best understood by reading the papers themselves, but we provide a general overview of each technique here.

We start with VP.  As we mentioned, the main idea of VP is to predict the result of an instruction based on previously seen results.  The prediction is made during the Fetch and Decode stages of the pipeline, so dependent instructions can be issued immediately using the predicted value.  The only other addition to the pipeline is that before committing the result of an instruction we must verify the prediction.  If the prediction was correct we can commit; if it was incorrect, the instructions that consumed the predicted value must be reissued and reexecuted with the correct value.  The implementation of VP uses two new tables: a Classification Table (CT) and a Value Prediction Table (VPT).  The VPT stores the predicted values for later instances of each instruction; it is a cache indexed by the instruction address that uses the standard LRU replacement mechanism.  The CT answers the question of when a prediction should be made.  It is again indexed by the instruction address, but instead of a value, each entry holds a saturating counter that tracks how predictable the instruction has been.  A minimal sketch of the two tables follows; the detailed results of this implementation can be found in [LS96].
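The following C sketch shows the CT/VPT interaction: a 2-bit saturating counter gates prediction at fetch/decode, and a verify-and-train step runs at commit.  The table size, direct-mapped indexing, counter width, and threshold are illustrative assumptions; [LS96] evaluates several configurations, including LRU-managed tables, which we do not model here.

    #include <stdint.h>

    /* Sketch of VP's two tables.  1024 direct-mapped entries and a
     * 2-bit counter are assumptions; [LS96] explores other designs. */

    #define ENTRIES 1024

    static uint64_t vpt[ENTRIES];  /* Value Prediction Table: last value  */
    static uint8_t  ct[ENTRIES];   /* Classification Table: 2-bit counter */

    static unsigned vp_idx(uint64_t pc) { return (pc >> 2) % ENTRIES; }

    /* Fetch/decode: predict only when the counter classifies the
     * instruction as predictable (states 2 and 3). */
    static int predict(uint64_t pc, uint64_t *value)
    {
        unsigned i = vp_idx(pc);
        if (ct[i] >= 2) {
            *value = vpt[i];
            return 1;
        }
        return 0;
    }

    /* Commit: verify the prediction, train the counter, and store the
     * actual value for the next instance.  On a wrong prediction the
     * consumers of the predicted value must be reissued. */
    static void update(uint64_t pc, uint64_t actual)
    {
        unsigned i = vp_idx(pc);
        if (vpt[i] == actual) {
            if (ct[i] < 3) ct[i]++;    /* saturate at 3 */
        } else {
            if (ct[i] > 0) ct[i]--;
        }
        vpt[i] = actual;
    }

After a few confirming instances, update() drives the counter past the threshold and predict() begins supplying the last value; a single miss merely decays the counter rather than disabling prediction outright.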

We now move to IR, which takes a different approach to exploiting value locality.  Instead of speculating, IR caches.  In parallel with the Fetch and Decode stages, IR checks whether the instruction has executed before and, if so, verifies that its current operands match those of the earlier execution.  If this verification succeeds, the previously computed result can be reused, and IR commits that value without executing the instruction.  The technique is implemented by adding a Reuse Buffer (RB), again a cache indexed by instruction address, which stores the instruction along with the information needed to establish reusability.  An important point here is that this setup requires 4x as much space as VP uses.  A minimal sketch of the RB follows; again, we defer the discussion of the results of this technique to the paper [SS98].
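For contrast with the VP sketch above, here is a minimal C sketch of a Reuse Buffer: each entry tags a static instruction and stores the operand values alongside the result, so a hit with matching operands is non-speculative.  The direct-mapped organization, entry count, and two-operand format are our illustrative assumptions, not the exact design evaluated in [SS98].

    #include <stdint.h>

    /* Sketch of a Reuse Buffer for two-operand instructions.  The
     * organization and entry format are assumptions for illustration. */

    #define ENTRIES 1024

    struct rb_entry {
        uint64_t pc;           /* tag: which static instruction        */
        uint64_t src1, src2;   /* operand values at the last execution */
        uint64_t result;       /* result that execution produced       */
        int      valid;
    };

    static struct rb_entry rb[ENTRIES];

    static unsigned rb_idx(uint64_t pc) { return (pc >> 2) % ENTRIES; }

    /* In parallel with fetch/decode: the stored result is reusable only
     * if this PC executed before with exactly the same operand values.
     * A hit is non-speculative, so the result can be committed without
     * executing the instruction. */
    static int try_reuse(uint64_t pc, uint64_t src1, uint64_t src2,
                         uint64_t *result)
    {
        struct rb_entry *e = &rb[rb_idx(pc)];
        if (e->valid && e->pc == pc && e->src1 == src1 && e->src2 == src2) {
            *result = e->result;
            return 1;
        }
        return 0;
    }

    /* After a normal execution: record operands and result for reuse. */
    static void record_exec(uint64_t pc, uint64_t src1, uint64_t src2,
                            uint64_t result)
    {
        struct rb_entry *e = &rb[rb_idx(pc)];
        e->pc = pc; e->src1 = src1; e->src2 = src2;
        e->result = result; e->valid = 1;
    }

The need to store the operand values and result per entry, rather than a single predicted value, is what drives the RB's larger space requirement relative to VP's tables.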

We will conclude by comparing the two techniques described above with regard to five important architectural metrics: (i) the amount of redundancy captured, (ii) misprediction penalty, (iii) integration with branch prediction, (iv) resource contention, and (v) execution latency.

Redundancy Captured: VP captures three types of redundancy that IR cannot: (1) cases where the result is the same even though the inputs differ, (2) cases where the inputs are not yet ready (IR cannot verify reuse without them), and (3) cases where VP simply makes a lucky guess.

Misprediction: IR is not speculative, so it never pays a misprediction penalty, whereas VP will incur such penalties.

Integration with branches: Depending on the branch prediction mechanism, VP could actually increase the misprediction rate by adding more speculation into the mix.  IR, on the other hand, can detect a branch misprediction earlier (by reusing the instructions that resolve the branch) and, even when a misprediction cannot be avoided, the work done along the mispredicted path is cached for later reuse.

Resource Contention: There are two concerns here.  First, as noted above, IR requires more memory for the RB than VP does for its tables.  With respect to ALU contention, however, the comparison flips: VP may increase contention by causing instructions to be executed twice after a misprediction, whereas IR reduces contention by letting some instructions commit without executing at all.

Execution Latency: Again, a misprediction under VP will sometimes cause instructions to be executed twice, whereas IR guarantees that each instruction executes at most once, and sometimes not at all.

The above comparison makes the point that there is a tradeoff between these techniques.  Thus, an important question suggested by these papers is whether it is possible to somehow obtain the best of both worlds.  That is, does there exist a hybrid technique, such as the one sketched below, that can reuse instructions if they have been previously executed, but also make predictions when reuse is not possible?
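As a purely hypothetical illustration of what such a hybrid might look like, the sketch below composes the try_reuse() and predict() helpers from the sketches above: consult the non-speculative RB first, and fall back to speculative value prediction only on an RB miss.  This policy is our own construction for illustration, not a scheme proposed in either paper.

    /* Hypothetical hybrid lookup, composing the helpers sketched above:
     * prefer non-speculative reuse; speculate only when reuse fails. */

    enum source { FROM_REUSE, FROM_PREDICTION, MUST_EXECUTE };

    static enum source hybrid_lookup(uint64_t pc, uint64_t src1,
                                     uint64_t src2, uint64_t *value)
    {
        if (try_reuse(pc, src1, src2, value))
            return FROM_REUSE;       /* safe: commit without executing */
        if (predict(pc, value))
            return FROM_PREDICTION;  /* speculative: verify at commit  */
        return MUST_EXECUTE;         /* no help: execute normally      */
    }

Under this policy the hybrid would inherit IR's guarantee of at-most-once execution whenever reuse hits, while still capturing the extra redundancy that only speculation can reach.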