CASH Compiler:
  Application-Specific Hardware

The ASH model of computation translates complete applications written in high-level languages into hardware. For each source program (and its associated libraries) we have to carry out the following tasks:

  • Partition the computation into pieces and synthesize circuits for each of the pieces.
  • Partition the program memory into pieces to obtain better locality and increased parallelism.
  • Synthesize an interconnection network connecting the circuits and the memories.
Translating a single source program into devices with billions of gates is a tremendous engineering task. The complexity of such designs is one reason that CAD tools are so slow and buggy.

To manage the compilation problem scalably, we decompose each program into a collection of Split-phase Abstract Machines (SAMs). The control-flow graph (CFG) of each procedure is "covered" with disjoint SAMs. The intra-SAM communication will be mapped to high-speed connections, while the inter-SAM communication will occur on a slower, pipelined interconnection network.

  Compiling for ASH
We have developed the CASH (Compiler for ASH) compiler which translates programs written in C into hardware implementations. The compiler applies several code transformations:
  • It performs traditional software optimizations.
  • We have developed a width analysis algorithm which can discover narrow-width computations in C programs. On average, 30% of the bits computed by C programs are simply useless; in a reconfigurable hardware implementation we can optimize these away.
  • The program is decomposed into parts which run on the reconfigurable fabric and parts which run on the supporting CPU; the CPU handles tasks which do not map well onto reconfigurable hardware, such as virtual memory management, operating-system activity and I/O.
  • The program is analyzed and the memory used is partitioned into blocks which can be accessed in parallel. If a block is accessed only from a small portion of the computation, it will be laid out close to the circuits accessing it (we haven't implemented this part yet).
  • The program is decomposed into SAMs.
  • Each SAM is synthesized in hardware.

  Our Intermediate Representation: Pegasus

For SAM optimization and synthesis, CASH uses a new intermediate representation called Pegasus. Pegasus unifies several features of other intermediate representations:

  • It is based on Static Single Assignment (SSA)
  • It supports predication
  • It supports speculation
  • It unifies control-flow and data-flow
In Pegasus, operations handshake asynchronously with their neighbors to produce and consume data. The resulting circuits can be synthesized directly as locally-synchronous, globally-asynchronous implementations.

[figure: asynchronous handshake]

In order to expose substantial instruction-level parallelism we use several compilation techniques borrowed from predicated architectures:

  • We decompose the CFG into hyperblocks; a hyperblock is a unit of speculative work.
  • We use predication to transform the control-flow within each hyperblock into pure data-flow. For this purpose, each basic block and each CFG edge has an associated "predicate".
  • We use aggressive predicate promotion to speculatively execute all instructions without side-effects. The figure below shows how each basic block gives rise to two types of operations: computations and predicates. The predicate of block b initially guards all instructions in block b (shown by red arrows), but is later promoted to "true", unless it guards an instruction with side-effects (depicted in red).

    [figure: predication and speculation]

The resulting representation is a form of Gated Single Assignment (a relative of Static Single Assignment) in which predicate conditions are explicitly represented.

[figure: gated single-assignment]

This is an example program, its control-flow graph (which is a single hyperblock) and the resulting representation:


Data-flow between hyperblocks is represented using merge and eta operators introduced in the dataflow-machine literature. Merge operators accept a data value from one of multiple producers. Eta operators are controlled by a predicate: when the predicate is "true", they forward the data at the input. When the predicate is "false", they simply discard the data. The following figure displays the implementation of a program comprising three hyperblocks:

  Optimizing Pegasus

Scalar optimizations can be implemented very efficiently on Pegasus.

[figure: various optimizations]
  • Constant folding
  • Dead-code elimination
  • Common-subexpression elimination
These optimizations subsume:
  • Control-flow optimizations such as straightening, branch chaining, unreachable code elimination, code hoisting and tail merging.
  • Dataflow-based global optimizations, such as copy propagation, constant propagation, partial-redundancy elimination, value numbering and redundant memory operation removal.
  • Mixed control-flow and data-flow optimizations, such as the removal of if statements with constant conditions.

We have also implemented a variety of new algebraic optimizations:

[figure: multiplexor optimizations]

  Implementing Pegasus in hardware

When implementing Pegasus in hardware we apply further optimizations.

Boolean operations can sometimes produce a result before all of their input values are known; for example, a logical "and" can immediately produce a zero when one input is zero, irrespective of the value of the other input. We call such operation implementations lenient. In hardware we implement boolean operations, multiplexors and side-effecting operations (which need not compute when the guarding predicate is false) leniently.

The handshaking between operations gives rise to a natural loop pipelining effect, in which multiple iterations of a loop can proceed simultaneously.

	  int f(void)
	  {
	  	int i, sum = 0;
	  	for (i = 0; i < 100; i++)
	  	    sum += i*i;
	  	return sum;
	  }

In this example there are two loops for the two induction variables i and sum. The multiplier operation is drawn as an 8-stage pipeline. The i loop can run ahead of the other one, filling the multiplier with multiple values. For maximum pipelining some FIFO elements have to be introduced (labeled cut-through registers in the figure) to buffer values of the loop-controlling predicate for later consumption by sum's loop. The deeper the FIFO, the better the simulated performance of the code.

  Memory Disambiguation

Pegasus explicitly represents may-dependences through memory between operations using token edges. Token edges are both a compile-time object, used to focus the optimizer and scheduler, and a run-time object, used for fine-grained synchronization of memory accesses, required only between dependent operations.

[figure: token edges]

Operations that may access the same memory regions and which do not commute have token edges inserted between them: the red dashed lines in this figure. Notice that there is no token edge between *q and *r, since no control-flow path connects them. Token edges give an SSA-like representation for memory, enabling compact and powerful memory optimizations, just as the traditional scalar SSA form does. The token-edge graph is maintained in transitively reduced form throughout compilation. The meaning of the edges can thus be summarized as follows:

[figure: edge meaning]

  Memory optimizations

Memory operations guarded by a false predicate can be completely removed.
[figure: removing dead memory]
Two consecutive stores (i.e., connected by a token edge) to the same address can be simplified, since the second may overwrite the first. The first store need only execute when the second does not. The token edge connecting them can then be removed, since the two stores will never both occur.
	  [figure: consecutive stores to the same address]

Accesses at the same address can be coalesced by or-ing the predicates. This is a form of partial-redundancy elimination.
[figure: PRE for memory]

Fine-grained tokens are also carried around loops using mu-eta operators. This gives rise to loop pipelining for memory, similar to the pipelining for scalars described above.

[figure: loop pipelining for memory]

  Energy Efficiency
When implemented directly in hardware, ASH is up to three orders of magnitude more power-efficient than superscalar processors, one to two orders of magnitude better than low-power DSP processors, and one order of magnitude better than asynchronous processors; it is roughly comparable to custom hand-designed hardware.

[figure: energy efficiency]