CS 15-740 Computer Architecture
TRACE CACHES
Leon Gu (gu+@cs.cmu.edu)
Dipti Motiani (dipti@cmu.edu)
Presentation Slides (PDF, PPT)
PAPERS
- Eric Rotenberg, Steve Bennett, and James E. Smith. A Trace Cache Microarchitecture and Evaluation. IEEE Transactions on Computers, 48(2):111-120, February 1999.
- Bryan Black, Bohuslav Rychlik, and John Paul Shen. The Block-based Trace Cache. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 196-207, May 1999.
MOTIVATION
Each successive microprocessor generation adds more parallel functional
units and a larger instruction issue window. To fully exploit
instruction-level parallelism, the increased issue rate must be balanced by
sufficient instruction fetch bandwidth. Increased fetch bandwidth can
improve overall performance in two ways: the additional fetched
instructions can fill otherwise idle functional units, and a larger pool of
fetched instructions makes it more likely that independent instructions are
available for issue.
Conventional instruction caches cannot fetch noncontiguous blocks in a
single cycle. Since taken conditional branches occur frequently, instruction
fetch bandwidth is severely limited.
INTRODUCTION
A trace is a sequence of at most n instructions and at most m
basic blocks starting at any point in a dynamic instruction stream. Each
line in a trace cache stores a snapshot or trace of the dynamic instruction
stream. The limit n is the trace cache line size and m is the
branch predictor throughput. A trace is fully specified by its starting
address and a sequence of up to m-1 branch outcomes that describe the path
followed. Hence, more than one path (and thus more than one trace) can be
associated with a single starting address. The trace cache is accessed in
parallel with the instruction cache.
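The path-based lookup described above can be sketched as follows. This is a minimal illustration, not either paper's implementation: the class and method names, the dictionary-based tag match, and the limits n=16 and m=3 are all assumptions for the sketch.

```python
# Sketch of a trace cache indexed by (start PC, branch outcomes).
# All names and parameter values are illustrative assumptions.

N_INSTRUCTIONS = 16   # n: trace cache line size (assumed value)
M_BLOCKS = 3          # m: max basic blocks per trace (assumed value)

class TraceCache:
    def __init__(self):
        # A trace is identified by its start PC plus the outcomes of the
        # up-to-(m-1) conditional branches inside it, so one start PC can
        # map to several distinct traces (one per path).
        self.lines = {}

    def lookup(self, start_pc, predicted_outcomes):
        """Return the cached trace for this predicted path, or None on a miss."""
        key = (start_pc, tuple(predicted_outcomes[:M_BLOCKS - 1]))
        return self.lines.get(key)

    def fill(self, start_pc, branch_outcomes, instructions):
        """Store a completed trace (invoked by the fill unit)."""
        assert len(instructions) <= N_INSTRUCTIONS
        key = (start_pc, tuple(branch_outcomes[:M_BLOCKS - 1]))
        self.lines[key] = list(instructions)
```

Note that two different branch histories from the same starting address occupy two separate lines, which is exactly why a single start PC can hit on different traces.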
PAPER 1
This paper presents a microarchitecture organized around traces. The
microarchitecture is compared to fetching single and multiple contiguous
basic blocks per cycle.
The microarchitecture uses trace-level sequencing. It has a trace cache, a
next trace predictor and a fill unit, in addition to the conventional
instruction fetch units. The next trace predictor treats traces as basic
units and predicts sequences of traces. As a trace can span multiple
branches, high branch prediction throughput is implicitly achieved. The fill
unit captures instructions as they retire, detects the end of a trace and
writes the trace to the trace cache.
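The fill unit's capture-and-terminate loop can be sketched roughly as follows; the interface, the termination rule, and the n/m limits are assumptions made for illustration, not the paper's exact trace selection algorithm.

```python
# Sketch of fill-unit trace selection: buffer retiring instructions and
# end the trace at n instructions or m basic blocks. Names and limits
# are illustrative assumptions.

N_INSTRUCTIONS = 16   # n: max instructions per trace (assumed value)
M_BLOCKS = 3          # m: max basic blocks per trace (assumed value)

class FillUnit:
    def __init__(self, write_trace):
        # write_trace is the trace cache's fill interface (an assumption).
        self.write_trace = write_trace
        self._reset()

    def _reset(self):
        self.start_pc = None
        self.outcomes = []   # conditional-branch outcomes inside the trace
        self.instrs = []

    def retire(self, pc, instr, is_branch=False, taken=False):
        """Capture one retiring instruction; flush the trace when full."""
        if self.start_pc is None:
            self.start_pc = pc
        self.instrs.append(instr)
        if is_branch:
            self.outcomes.append(taken)
        # End-of-trace rule: n instructions reached, or the trace already
        # spans m basic blocks (i.e. m-1 internal branches).
        if len(self.instrs) == N_INSTRUCTIONS or len(self.outcomes) == M_BLOCKS - 1:
            self.write_trace(self.start_pc, list(self.outcomes), list(self.instrs))
            self._reset()
```

Because traces are built from retiring instructions, the fill unit sits off the critical path, at the cost of some latency before a newly built trace becomes available.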
Analyses indicate that design decisions such as the trace selection
algorithm, the cache size, and whether partial trace matches are allowed
have a significant impact on the performance of the trace cache.
Conventional trace caches have drawbacks such as inefficient storage space
utilization, fill unit latency and high power consumption. These can be (and
to some extent have been) overcome by enhancements to the basic design, one
of which is proposed in the second paper.
PAPER 2
The paper proposes a block-based trace cache that achieves higher IPC and
better storage efficiency than a conventional trace cache.
The basic difference between the two architectures is that a trace in the
block-based trace cache is a sequence of basic-block ids, as opposed to a
sequence of instructions in a conventional trace cache. Hence, the trace
cache stores only sequences of basic-block ids, reducing both its size and
the fetch time. The basic blocks themselves are stored in a separate
(replicated) block cache, and the blocks within a trace are fetched in
parallel from the block caches.
A rename table and a collapse MUX are the additional architectural elements
needed to implement the block-based trace cache. The rename table maps
instruction fetch addresses to block ids. The final collapse MUX assembles
the blocks fetched from the block caches into a contiguous instruction
stream.
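The rename table, replicated block caches, and collapse step fit together roughly as in the sketch below. The class and method names are assumptions, and the "parallel" reads are modeled as one indexed read per replicated copy; the real design does this in hardware within a single cycle.

```python
# Sketch of block-based fetch: the trace cache stores block-id sequences,
# replicated block caches hold the instructions, and a final collapse
# step concatenates the blocks. All names are illustrative assumptions.

class BlockBasedTraceCache:
    def __init__(self, num_ways):
        self.rename = {}          # rename table: fetch address -> block id
        # One copy of the block cache per trace slot, so each block of a
        # trace can be read from a different copy "in parallel".
        self.block_caches = [dict() for _ in range(num_ways)]
        self.traces = {}          # trace id -> tuple of block ids

    def insert_block(self, addr, instructions):
        """Assign (or reuse) a block id for addr and replicate the block."""
        bid = self.rename.setdefault(addr, len(self.rename))
        for copy in self.block_caches:
            copy[bid] = list(instructions)
        return bid

    def insert_trace(self, trace_id, block_ids):
        self.traces[trace_id] = tuple(block_ids)

    def fetch(self, trace_id):
        """Read each block from its own replicated copy, then collapse."""
        bids = self.traces[trace_id]
        fetched = [self.block_caches[i][b] for i, b in enumerate(bids)]
        return [ins for block in fetched for ins in block]
```

The sketch also makes the storage advantage visible: the trace entries hold only short id tuples, while each basic block's instructions are stored once per block-cache copy rather than once per trace that contains them.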
The replicated block cache design could be improved to store different basic
blocks and still achieve parallel access.
CONCLUSION
Trace caches have been adopted in modern processors because they
significantly increase IPC. Also, since they are accessed in parallel with
the conventional instruction cache, they do not add to the critical path of
instruction fetch. However, this parallel access increases power
consumption to some extent.
REFERENCES
- Michael Sung. Design of Trace Cache for High Bandwidth Instruction Fetching. Master's Thesis, May 1998.
- Eric Rotenberg, Steve Bennett, and James E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. April 1996.