Scribe Notes
Function-Unit Architectures

By Steve Schlosser
Reconfigurable Computing Seminar 2/11/98

Slide 1: Function-Unit Architectures

Slide 2: Monday's Class
Everyone should prepare a brief overview of their approach to their kernel, explain how they will implement it, and prepare a few slides.

Slide 3: Tool Status
New versions of cvhasm and cvhsim are available; they add some command-line switches and run on AIX and Solaris.

Slide 4: A RC Taxonomy
Some metrics for describing CCMs, shown relative to the picture at left.

bandwidth            -------------->
functionality        <--------------
target size          <--------------
latency              <--------------
available resources  <--------------
# of applications    -------------->
granularity          <--------------
arch complexity      -------------->
compiler complexity  <--------------

Slide 5: Application Grain-Size
Overhead is the limiting factor in many designs.

Slide 6: Functional-Unit Architectures
How do these affect the cycle time?
What effects do caches have?
How do you handle unsafe configurations? Compile time? Run time?

Slide 7: PRISC
The PRISC PFUs implement small logic functions and are simply an additional functional unit in the datapath. They are accessed through special instructions which load the results of the PFU computation into standard CPU registers.

Slide 8: PRISC Architecture

Slide 9: A PFU design
A modest fabric consisting only of LUTs and interconnect. There is no state at all. Among other things, this avoids costly state saving across context switching.

Slide 10: Programming a PRISC
One of the most interesting aspects of PRISC is its automatic compiler. Using profile information, it extracts small sections of code to be converted into hardware and then invoked via PFU instructions. Several different methods are used to choose code sections and then optimize them.

Slide 11: Function Width Analysis
An example of an optimization used in PRISC is function width analysis, in which the compiler examines a computation to determine how wide a datapath it must generate. For example, in the following calculation, only 2 bits are required:

int f(int x, int y, int z)
{
    return ((x & 15) + (y | z)) & 2;
}

The compiler builds a tree which shows that the last operation (& 2) reduces the entire result to two bits, so it builds the entire datapath accordingly.

Slide 12: Optimizations -1
This optimization finds operations that can be done in parallel. It reduces branches but increases the number of required functional blocks. The other problem is that there is no way to communicate exceptions back to the main processor.

Slide 13: Optimizations -2
This kind of optimization reduces the number of branches, but it leads to an exponential increase in the number of code blocks, because the compiler must generate a specialized block for every combination of branch outcomes.

Slide 14: PRISC Results

Slide 15: Extensions?
Some other extensions:

Prefetching of configurations
Add pipeline registers...?
More than 2 operands for each PFU, otherwise DAGs are limited.

Slide 16: Chimaera
Do some of these extensions look familiar?

Slide 17: Chimaera Architecture
Shadow register file (9 registers)
Result bus
Configuration cache, memory bus
RISC core
Decode selects the row to write back (CAM)

Slide 18: Register Usage
The fact that the fabric uses system registers for state, rather than having its own registers, makes context switching much simpler and transparent to the operating system.

Slide 19: Processor/Fabric Communication
The fact that the compiler has to guarantee the latency is problematic. What if the architecture changes slightly?

Still doesn't address exception handling.

Slide 20: Chimaera Results

Slide 21: Garp
Garp addresses many of the same problems as the other architectures and has a similar approach. However, Garp is the only one in which the array can directly address memory, which is significant.

Slide 22: Processor/Fabric Communication
Much more co-processor-like than the other two architectures.

Slide 23: Micro-Architecture
Garp takes an interesting approach to routing and delay estimation. Interconnect delays are defined to be either short or long. Logic functions are broken into three categories: simple functions, functions that do not use the carry chain, and functions that do. The result is the ability to simplify timing specifications and make better estimates.

However, Garp has a problem in that it is physically larger than a standard FPU, so its larger cost is harder to justify than that of the smaller PRISC system. The Hauser and Wawrzynek paper does not explain very clearly how the Garp array is actually attached to its host RISC core.

Slide 24: Some Results

Slide 25: Comparison
Some other metrics for comparing the three machines:

                          PRISC     Chimaera      Garp

Automatic Compilation?    Yes       No            No
Forward Compatibility?    Yes       No            Yes

Scribed by Steve Schlosser