|
||||||||||||||||||||||||
|
The CASH CompilerApplication-Specific Hardware
Handling the translation from one source program to
devices with billions of gates is a tremendous engineering
task. The complexity of the designs is one reason that CAD
tools are so slow and buggy.
In order to scalably manage the compilation problem, we
decompose each program into a series of Split-phase Abstract
Machines (SAMs). The control-flow graph (CFG) of each
procedure is "covered" with disjoint SAMs. The intra-SAM
communication will be mapped to high-speed connections,
while the inter-SAM communication will occur on a slow,
pipelined interconnection network. Compiling for ASH
Our Intermediate Representation: PegasusFor SAM optimization and synthesis, CASH uses a new intermediate representation called Pegasus. Pegasus unifies several features of other intermediate representations:
In order to expose substantial instruction-level parallelism we use several compilation techniques borrowed from predicated architectures:
The resulting representation is a form of Gated Single
Assignment (a relative of Static-Single Assignment) in which
predicate conditions are explicitly represented.
This is an example program, its control-flow graph (which is
a single hyperblock) and the resulting representation:
Data-flow between hyperblocks is represented using
merge and eta operators introduced in the
dataflow-machine literature. Merge operators accept a data
value from one of multiple producers. Eta operators are
controlled by a predicate: when the predicate is
"true", they forward the data at the input.
When the predicate is "false", they simply discard
the data. The following figure displays the implementation
of a program comprising three hyperblocks:
Optimizing PegasusScalar optimizations can be implemented very efficiently on Pegasus.
A variety of new algebraic optimizations:
Implementing Pegasus in hardwareWhen implementing Pegasus in hardware we apply further optimizations. Boolean operations can sometimes produce a result before knowing all input values; for example, a logical "and" can immediately produce a zero when one input is zero, irrespective of the value of the second input. We call such operation implementations lenient. In hardware we implement leniently boolean operations, multiplexors and side-effect operations (which do not need to compute when the guarding predicate is false). The handshaking between operations gives rise to a
natural loop pipelining effect, in which multiple
iterations of a loop can proceed simultaneously.
int f(void)
{
int i, sum=0;
for (i=0; i < 100; i++)
sum += i;
return sum;
}
In this example there are two loops for the two induction variables i and sum. The multiplier operation is drawn as an 8-stage pipeline. The i loop can run ahead of the other one, filling the multiplier with multiple values. For maximum pipelining some FIFO elements have to be introduced (labeled cut-through registers in the figure) to buffer values of the loop-controlling predicate for later consumption by sum's loop. The deeper the FIFO, the better the simulated performance of the code. Memory DisambiguationPegasus explicitly represents may-dependences through
memory between operations using token edges. The
token edges are both a compile-time object, used to focus
the optimizer and scheduler and also a run-time object, used
for fine-grained synchronization for memory access only
between dependent operations.
Operations that may access the same memory regions and which do not commute have token edges inserted between them: the red dashed lines in this figure. Notice that there's no token edge between *q and *r, since there is no control-flow between them. The token edges give an SSA-like representation for memory, enabling compact and powerful optimizations for memory, as the traditional scalar SSA form does. The token edge graph is maintained in a transitively reduced for throughout compilation. The meaning of the edges can thus be summarized as follows:
Memory optimizations
Fine-grained tokens are also carried around loops using mu-eta operators. This gives rise to loop pipelining for memory, similar to the pipelining for scalars described above.
Energy EfficiencyWhen implemented directly in hardware, ASH is up to three orders of magnitude more power-efficient than superscalar processors, one to two orders of magnitude better than low-power DSP processors, one order of magnitude better than asynchronous processors, but one order of magnitude less efficient when compared to custom hand-designed hardware (the comparison is with hardware using similar technologies).
|