CS 15-740 Computer Architecture
TRACE CACHES
Leon Gu (gu+@cs.cmu.edu)
Dipti Motiani (dipti@cmu.edu)
Presentation Slides (PDF, PPT)
PAPERS
- Eric Rotenberg, Steve Bennett, and James E. Smith. A Trace Cache Microarchitecture and Evaluation. IEEE Transactions on Computers, 48(2):111-120, February 1999.
- Bryan Black, Bohuslav Rychlik, and John Paul Shen. The Block-based Trace Cache. In Proceedings of the 26th Annual International Symposium on Computer Architecture, pages 196-207, May 1999.
MOTIVATION
Each successive microprocessor generation adds more parallel functional
units and a larger instruction issue window. To fully exploit
instruction-level parallelism, the increased issue rate must be balanced by
sufficient instruction fetch bandwidth. Increased fetch bandwidth can
improve overall performance in two ways: the additional fetched
instructions can fill otherwise idle functional units, and a larger pool of
fetched instructions makes it more likely that independent instructions are
available for issue.
Conventional instruction caches cannot fetch noncontiguous blocks in a
single cycle. Since taken conditional branches occur frequently, instruction
fetch bandwidth is severely limited.
INTRODUCTION
A trace is a sequence of at most n instructions and at most m
basic blocks starting at any point in a dynamic instruction stream. Each
line in a trace cache stores a snapshot or trace of the dynamic instruction
stream. The limit n is the trace cache line size and m is the
branch predictor throughput. A trace is fully specified by its starting
address and a sequence of up to m-1 branch outcomes that describe the path
followed. Hence, more than one path (and thus more than one trace) can be
associated with a single starting address. The trace cache is accessed in
parallel with the instruction cache.
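The path-based lookup described above can be sketched as follows. This is a minimal illustration, not either paper's implementation: the class and method names, the dictionary-based tag match, and the limits n=16 and m=3 are all assumptions for the sketch.

```python
# Sketch of a trace cache indexed by (start PC, branch outcomes).
# All names and parameter values are illustrative assumptions.

N_INSTRUCTIONS = 16   # n: trace cache line size (assumed value)
M_BLOCKS = 3          # m: max basic blocks per trace (assumed value)

class TraceCache:
    def __init__(self):
        # A trace is identified by its start PC plus the outcomes of the
        # up-to-(m-1) conditional branches inside it, so one start PC can
        # map to several distinct traces (one per path).
        self.lines = {}

    def lookup(self, start_pc, predicted_outcomes):
        """Return the cached trace for this predicted path, or None on a miss."""
        key = (start_pc, tuple(predicted_outcomes[:M_BLOCKS - 1]))
        return self.lines.get(key)

    def fill(self, start_pc, branch_outcomes, instructions):
        """Store a completed trace (invoked by the fill unit)."""
        assert len(instructions) <= N_INSTRUCTIONS
        key = (start_pc, tuple(branch_outcomes[:M_BLOCKS - 1]))
        self.lines[key] = list(instructions)
```

Note that two different branch histories from the same starting address occupy two separate lines, which is exactly why a single start PC can hit on different traces.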
PAPER 1
This paper presents a microarchitecture organized around traces. The
microarchitecture is compared to fetching single and multiple contiguous
basic blocks per cycle.
The microarchitecture uses trace-level sequencing. It has a trace cache, a
next trace predictor and a fill unit, in addition to the conventional
instruction fetch units. The next trace predictor treats traces as basic
units and predicts sequences of traces. As a trace can span multiple
branches, high branch prediction throughput is implicitly achieved. The fill
unit captures instructions as they retire, detects the end of a trace and
writes the trace to the trace cache.
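The fill unit's capture-and-terminate loop can be sketched roughly as follows; the interface, the termination rule, and the n/m limits are assumptions made for illustration, not the paper's exact trace selection algorithm.

```python
# Sketch of fill-unit trace selection: buffer retiring instructions and
# end the trace at n instructions or m basic blocks. Names and limits
# are illustrative assumptions.

N_INSTRUCTIONS = 16   # n: max instructions per trace (assumed value)
M_BLOCKS = 3          # m: max basic blocks per trace (assumed value)

class FillUnit:
    def __init__(self, write_trace):
        # write_trace is the trace cache's fill interface (an assumption).
        self.write_trace = write_trace
        self._reset()

    def _reset(self):
        self.start_pc = None
        self.outcomes = []   # conditional-branch outcomes inside the trace
        self.instrs = []

    def retire(self, pc, instr, is_branch=False, taken=False):
        """Capture one retiring instruction; flush the trace when full."""
        if self.start_pc is None:
            self.start_pc = pc
        self.instrs.append(instr)
        if is_branch:
            self.outcomes.append(taken)
        # End-of-trace rule: n instructions reached, or the trace already
        # spans m basic blocks (i.e. m-1 internal branches).
        if len(self.instrs) == N_INSTRUCTIONS or len(self.outcomes) == M_BLOCKS - 1:
            self.write_trace(self.start_pc, list(self.outcomes), list(self.instrs))
            self._reset()
```

Because traces are built from retiring instructions, the fill unit sits off the critical path, at the cost of some latency before a newly built trace becomes available.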
Analyses indicate that design decisions such as the trace selection
algorithm, the cache size, and whether partial trace matches are allowed
have a significant impact on the performance of the trace cache.
Conventional trace caches have drawbacks such as inefficient storage space
utilization, fill unit latency and high power consumption. These can be (and
to some extent have been) overcome by enhancements to the basic design, one
of which is proposed in the second paper.
PAPER 2
The paper proposes a block-based trace cache that achieves higher IPC and
better storage efficiency than a conventional trace cache.
The basic difference between the two architectures is that a trace in the
block-based trace cache is a sequence of basic-block ids, as opposed to a
sequence of instructions in a conventional trace cache. Hence, the trace
cache stores only sequences of basic-block ids, reducing both its size and
the fetch time. The basic blocks themselves are stored in a separate
(replicated) block cache, and the blocks within a trace are fetched in
parallel from the block caches.
A rename table and a collapse MUX are the additional architectural elements
needed to implement the block-based trace cache. The rename table maps
instruction fetch addresses to block ids. The final collapse MUX assembles
the blocks fetched from the block caches into a contiguous instruction
stream.
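The rename table, replicated block caches, and collapse step fit together roughly as in the sketch below. The class and method names are assumptions, and the "parallel" reads are modeled as one indexed read per replicated copy; the real design does this in hardware within a single cycle.

```python
# Sketch of block-based fetch: the trace cache stores block-id sequences,
# replicated block caches hold the instructions, and a final collapse
# step concatenates the blocks. All names are illustrative assumptions.

class BlockBasedTraceCache:
    def __init__(self, num_ways):
        self.rename = {}          # rename table: fetch address -> block id
        # One copy of the block cache per trace slot, so each block of a
        # trace can be read from a different copy "in parallel".
        self.block_caches = [dict() for _ in range(num_ways)]
        self.traces = {}          # trace id -> tuple of block ids

    def insert_block(self, addr, instructions):
        """Assign (or reuse) a block id for addr and replicate the block."""
        bid = self.rename.setdefault(addr, len(self.rename))
        for copy in self.block_caches:
            copy[bid] = list(instructions)
        return bid

    def insert_trace(self, trace_id, block_ids):
        self.traces[trace_id] = tuple(block_ids)

    def fetch(self, trace_id):
        """Read each block from its own replicated copy, then collapse."""
        bids = self.traces[trace_id]
        fetched = [self.block_caches[i][b] for i, b in enumerate(bids)]
        return [ins for block in fetched for ins in block]
```

The sketch also makes the storage advantage visible: the trace entries hold only short id tuples, while each basic block's instructions are stored once per block-cache copy rather than once per trace that contains them.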
The replicated block cache design could be improved to store different basic
blocks and still achieve parallel access.
CONCLUSION
Trace caches have been adopted in modern processors because they
significantly increase IPC. Also, since they are accessed in parallel with
the conventional instruction cache, they do not add to the critical path of
instruction fetch. However, this parallel access increases power
consumption to some extent.
REFERENCES
- Michael Sung. Design of Trace Cache for High Bandwidth Instruction Fetching. Master's Thesis, May 1998.
- Eric Rotenberg, Steve Bennett, and James E. Smith. Trace Cache: A Low Latency Approach to High Bandwidth Instruction Fetching. April 1996.