Scribe Notes -- 1/19/98
Slide 1: Admin
- Lab 1:
Typos have been fixed in the Lab 1 web page, and the missing links have
been filled in. Also, minor changes have been made to the setvar847 shell
scripts. If you have not already done so, please download the updated scripts.
You may use either VHDL or Verilog to do Lab 1. If you feel very confident
with one of these languages, try using the opposite one for Lab 1. If you
don't know either, it's probably best to start with Verilog. However, later
labs may require VHDL because the peripherals on the evaluation boards we
have use VHDL models.
Prof. Thomas' Verilog book is available in the bookstore with the 18-240 books.
It is bright red. It comes with an evaluation version of the Veriwell simulator
on CD-ROM. There is another good reference by Samir Palnitkar that is available
in the 18-347 section.
If anyone knows of any good VHDL references (especially ones available on the
Web), please speak up.
- Early paper handout:
We will begin handing out papers for the Wednesday classes one week in advance
instead of two days in advance.
Slide 2: FPGA Technology Intro
XC4000 Issues: We will discuss the following issues as they apply
to the Xilinx 4000 series FPGAs. We will look at the CAD design flow for
working with FPGAs, discuss the pros and cons of heterogeneous and homogeneous
architectures, determine whether the XC4000 obeys Rent's Rule, and finally
look at the speed and reprogrammability of the XC4000.
Altera FLEX: We'll take a quick look at how Altera handles these issues.
Taxonomy Issues: With all of the above in mind, we'll try to develop
a system to classify FPGAs based on their architecture.
Slide 3: Why Study Old FPGAs?
- 90% of the research being done in reconfigurable computing is using
commercially available FPGAs. Custom programmable devices make up only a small
portion of the field.
- We'll be using commercial FPGAs for Lab 1.
- We will be studying two designs that use commercially available FPGAs:
SPLASH and PAM. Therefore, it is a good idea to learn about popular commercial
Slide 4: Xilinx 4000 Series CLB
Slide 5: Xilinx 4000 Series Interconnect
Slide 6: Area
- Interconnect accounts for the vast majority of the physical area of an
XC4000 FPGA. In general, interconnect accounts for 100 times the area
used by the configurable logic blocks. Also, interconnect uses nearly
10 times as much area as the configuration memory.
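As a rough check of what these ratios imply, the sketch below turns them into
fractions of total area. It assumes (this is not stated on the slide) that
logic, configuration memory, and interconnect are the only three consumers of
area, and it treats "nearly 10 times" as exactly 10. The function name and
parameters are made up for illustration.

```python
def area_shares(interconnect_to_logic=100.0, interconnect_to_config=10.0):
    """Return each component's fraction of total area, with logic area set to 1."""
    logic = 1.0
    interconnect = interconnect_to_logic * logic
    # Interconnect is ~10x the configuration memory, so memory is interconnect / 10.
    config = interconnect / interconnect_to_config
    total = logic + interconnect + config
    return {
        "logic": logic / total,
        "config_memory": config / total,
        "interconnect": interconnect / total,
    }

for name, share in area_shares().items():
    print(f"{name:>14}: {share:6.1%}")
```

Under these assumptions, interconnect comes out to about 90 percent of the
area and the logic itself to about 1 percent.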
Slide 7: FPGA CAD Flow
- Timing Analysis: Xilinx's timing analysis tools tend to be very
pessimistic so that emotionally fragile engineers won't become upset if their
design won't run at the speed the tool estimated it would run at.
- Simulation: One step left out of this slide is Simulation. In big
designs, synthesis and Tech Mapping and Place/Route can take hours if not days.
You can't afford to go through this entire cycle only to find out you made a
typo in your state machine transition implementation. Simulation is very
useful to verify the functional correctness of a design before synthesis.
There are not many tools available to test the post-layout design. After
routing the chip, it's easiest to test it in-system.
- Class Discussion: One thing that would be useful is the ability
to partition a design across multiple FPGAs. Xilinx does not have tools to do
this - you have to do it by hand.
Slide 8: Synthesis
- FPGA designs are very different from ASIC designs. At the HDL level, an
FPGA designer must take into account the type of functionality available
in the FPGA hardware. For example, can a function be easily represented in
a LUT like the kind provided in the XC4000 architecture? ASIC designers usually
just go ahead and build their own combinational circuits. ASIC designers must
also build their own registers. In FPGA designs, flip-flops are abundant.
This makes it attractive to make pipelined designs and one-hot FSMs.
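To make the one-hot point concrete, here is a small sketch comparing the
number of flip-flops needed by a binary-encoded and a one-hot state machine.
The helper names are hypothetical and this is only a counting model, not
output from any synthesis tool.

```python
def flops_binary(num_states):
    """Flip-flops needed when states are numbered in binary."""
    return max(1, (num_states - 1).bit_length())

def flops_one_hot(num_states):
    """Flip-flops needed when each state gets its own flip-flop."""
    return num_states

def one_hot(state_index, num_states):
    """State register contents for a given state: exactly one bit is set."""
    return [1 if i == state_index else 0 for i in range(num_states)]

for n in (4, 10, 16):
    print(n, "states:", flops_binary(n), "binary flops vs", flops_one_hot(n), "one-hot flops")

print(one_hot(2, 5))
```

One-hot spends more flip-flops, which is cheap on an FPGA, and in exchange
"are we in state i?" becomes a test of a single bit rather than a decode of
the whole state register.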
- FPGA designs do have some drawbacks. First of all, it is difficult to
determine the area a function will take up. The synthesis process fits
Boolean functions into look-up tables. Often, a complex function can take
the same amount of space as a simple function if they each use a single LUT.
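The point about area can be illustrated by modeling a LUT as a plain truth
table. The sketch below is a simplified model (function names are made up,
and a real CLB has more than just LUTs): any function of 4 inputs, simple or
complex, is just a different setting of the same 16 configuration bits.

```python
from itertools import product

def make_lut(func, k=4):
    """Return the 2**k configuration bits that make a k-input LUT compute func."""
    return [func(*bits) & 1 for bits in product((0, 1), repeat=k)]

def eval_lut(config, inputs):
    """The inputs form an address (first input is the MSB) into the config bits."""
    index = 0
    for bit in inputs:
        index = (index << 1) | bit
    return config[index]

# A "simple" function and a "complex" one, each of 4 inputs.
and4 = make_lut(lambda a, b, c, d: a & b & c & d)
parity4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)

print(len(and4), len(parity4))          # both occupy one 4-LUT's worth of bits
print(eval_lut(parity4, (1, 0, 1, 1)))  # odd number of ones
```

Both functions cost exactly one 4-LUT in this model, which is why gate counts
from the HDL say little about the area a design will take.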
Slide 9: Tech Mapping
- Finding the optimal way to fit a Boolean expression into K-LUTs is an
NP-complete problem. Good heuristics are needed to perform this task, and this
is an excellent area for research.
Slide 10: Heterogeneous vs. Homogeneous
- The main thing that can be gained from a heterogeneous structure is
efficiency. For certain applications, heterogeneous features can be very
helpful for making things fast. However, for the majority of applications,
specialized heterogeneous features will not be used. By adding features
to make complex things faster, we give up versatility.
Slide 11: Hetero vs. Homo: Inside the cell
- A general trend in the FPGA industry is to dedicate chip area to features
that will make specific applications faster. We dubbed this process "throwing
area at speed." For example, the 3LUT HMAP in the Xilinx CLB provides the
primary function of helping implement Xilinx's RAM functionality. In almost
all other cases, the HMAP is left unused. CAD tools seldom take advantage of
this area. Only pre-made modules that have been highly optimized will make
use of the HMAP. But, when the CLB is configured as RAM, the HMAP is very
Slide 12: History
- As an aside, we looked at the history of the Xilinx CLB structure.
In the XC3000 series, the CLB had two 4LUTs that shared three of their inputs.
A mux could select between the outputs of the 4LUTs to make a 5LUT.
Slide 13: Fast Carry Logic
- Fast carry logic is another example of "throwing area at speed." In the
XC4000 series, extra space is dedicated to custom ripple-carry logic at the
input to each CLB. Each CLB has dedicated wiring extending North and South
to allow the synthesis tools to create fast-carry adders in columns going
up and down the chip. Whenever synthesis tools see the '+' operator, they
attempt to use the fast-carry logic. Although Xilinx has published an app note
or two about using the dedicated carry logic for other purposes, nobody ever
Slide 14: Fast Carry Logic
- Each Xilinx CLB can be used to make a 2x2 adder with fast-carry logic.
The heart of the carry chain is a 2x1 mux that appears horizontally in
the middle of the slide.
To set up the carry chain, the CAD tools program the carry logic to work as
follows:
- Cin is driven on the '1' input of the mux.
- Either A or B can be driven on the '0' input. For this example we will use B.
- The remaining multiplexers are set up to get the function A XOR B on
the select line of the mux.
- The output of the mux now has this function:
Cout = ((A XOR B) & Cin) | (~(A XOR B) & B)
This reduces to:
Cout = (~A & B & Cin) | (A & ~B & Cin) | (A & B)
This is the correct equation for Cout.
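The claim that the mux configuration produces the right carry can be checked
exhaustively. The sketch below models the 2x1 mux as described above (select =
A XOR B, '0' input = B, '1' input = Cin) and compares it against both the
reduced equation and the usual majority form of a full adder's carry-out. The
function names are made up for this check.

```python
from itertools import product

def mux2(sel, in0, in1):
    """2x1 mux: pass in1 when sel is 1, otherwise in0."""
    return in1 if sel else in0

def cout_mux(a, b, cin):
    """Carry-out as configured on the slide: select = A XOR B, '0' = B, '1' = Cin."""
    return mux2(a ^ b, b, cin)

def cout_reduced(a, b, cin):
    """The reduced equation from the notes; (1 - x) stands for ~x on single bits."""
    return ((1 - a) & b & cin) | (a & (1 - b) & cin) | (a & b)

def cout_majority(a, b, cin):
    """Textbook full-adder carry: 1 when at least two inputs are 1."""
    return (a & b) | (a & cin) | (b & cin)

mismatches = [
    (a, b, cin)
    for a, b, cin in product((0, 1), repeat=3)
    if not (cout_mux(a, b, cin) == cout_reduced(a, b, cin) == cout_majority(a, b, cin))
]
print("mismatches:", mismatches)
```

All eight input combinations agree. Driving A instead of B on the '0' input
works equally well, since A and B are equal whenever the select line is 0.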
Slide 15: Rent's Rule
- For XC4000, Rent's exponent is approximately 0.5. This means that the amount
of available interconnect grows as the square root of the growth of the area
devoted to logic. In general, interconnect resources are not as rich as one
- Xilinx handles this problem by introducing new architectures, such as the
XC4000EX, which provides twice as much interconnect for high-density devices.
The EX series is generally upwardly-compatible with the 4000E series.
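For reference, Rent's Rule is usually written T = t * G^p, where G is the
number of logic blocks in a region, T is the number of connections crossing
the region's boundary, t is the average number of terminals per block, and p
is Rent's exponent. The sketch below plugs in p = 0.5 from the notes. The
value of t is a placeholder, not a measured XC4000 figure.

```python
def rent_terminals(num_blocks, terminals_per_block=1.0, exponent=0.5):
    """Rent's Rule: T = t * G**p."""
    return terminals_per_block * num_blocks ** exponent

# With p = 0.5, quadrupling the logic only doubles the connections.
for blocks in (100, 400, 1600):
    print(blocks, "blocks ->", rent_terminals(blocks), "terminals")
```

A design whose own Rent exponent is larger than 0.5 will therefore outgrow
the available interconnect as it gets bigger.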
Slide 16: Speed
- Because logic is in fixed locations on FPGAs, interconnect becomes
a major source of delay.
Slide 17: Speed: The Event Horizon
- Von Herzen made a map of the area that is accessible within 1.6 ns from
any given CLB on a Xilinx part. The idea behind this is to figure out where
to place the critical portions of a circuit on an FPGA so that it will run
at the fastest rate possible.
Slide 18: Speed: The Event Horizon
- An interesting result of the experiment is the appearance of some
asymmetry in the "event horizon". On Xilinx parts, it is faster to go to the
East and South of any given logic block. It is unclear whether this is true
for every CLB on the chip, but it is an interesting result nonetheless.
Slide 19: Reprogrammability
- Current FPGAs take a long time to reprogram (on the order of a few hundred
milliseconds). Cutting this time will be important to reconfigurable computing.
- Another important aspect of reconfigurable computing will be the ability to
virtualize a design. To do this we will need to be able to change only part
of the reconfigurable fabric at a time.
- This is not currently possible on Xilinx parts. In order to reprogram only
part of a device while the other part continues to function, we need the ability
to partition off part of the device. With Xilinx's complex interconnect, this
isn't possible. We decided the outcome of trying would be
"pass transistor semiconductor mayhem." Current Xilinx parts have to shut
down completely in order to reconfigure.
- Lastly, in order to program only certain configuration bits, we need an
addressing scheme. The ability to address individual configuration bits requires
lots of extra area devoted to configuration logic, and several additional I/O
Slide 20: Altera FLEX 8000
Slide 21: Altera FLEX 8000
- The functional unit structure of an Altera FLEX 8000 differs from its Xilinx
counterparts in that the Altera functional unit consists of 8 4LUTs. These LUTs
are fully interconnected by a local crossbar. Each LUT can also drive
the global interconnect. A LUT can drive one row channel and up to two column
Slide 22: FastTrack Routing
- Unfortunately, not every functional unit (called a LAB) can see every
wire in the global channels. This causes a big problem in that a small
change to the netlist of a design can end up causing a very big change
in the physical layout of the chip. Also, this mandates that the placement
and routing tools have to be very closely linked. The placement tools
have to place logic in LABs that can see the global interconnect they need
to make routing possible.
- The benefit of Altera's architecture is predictable timing. The intra-LAB
crossbar provides dependable timing for all local connections. Additionally,
the global row and column routing channels all stretch the entire length of the
device, so the timing of inter-LAB connections can also be predicted closely.
- The reason that Altera did not make every global connection available to
every LAB is load considerations. If every global line crossed a switch at
every LAB, the parasitic capacitance of the switches would make the line
Slide 23: FPGA Taxonomy
- Over the next three slides, we'll discuss features useful for classifying
Slide 24: Programming Technology
Programming Technology: How is the FPGA programmed? What does it use
to store its configuration bits?
- SRAM: SRAM takes up lots of space (5 or 6 transistors for 1 bit),
but it can be reconfigured quickly and is generally very fast. Also, the config
bits can be used as RAM in the design.
- Antifuse: This technology doesn't really have a place in
reconfigurable computing because it is only one-time programmable. However,
antifuse-programmed connections are very fast, take up very little space,
and don't load the line as much as transistor switches do.
- EEPROM: EEPROM is just as reconfigurable as SRAM, but it is very
slow. It also does not lose its configuration when power is removed from the
- Flash: This technology can only be reliably reprogrammed a
finite number of times. Its use in reconfigurable computing is probably limited.
Like EEPROM, it does not lose its program when the power is turned off.
Slide 25: Logic Cell Architecture
- The question we have to ask is "What is a good logic block
architecture?" Do we want a fine-grained structure or a coarse-grained one?
As a class, we came up with a big list of different possible things to put
inside a logic block.
- Look-up Tables
- PLA/PAL type structures
- Small DSPs
- Seas of Gates
- NAND Gates
- The tradeoff here is between complexity and versatility. A fine-grained
architecture like NAND gates offers a lot of versatility, but requires an
enormous number of configuration bits. On the other hand, a coarse-grained
array of Pentiums would require an enormous amount of interconnect resources.
A balance has to be reached, but there isn't necessarily a solution that
will be optimal for all applications.
Slide 26: Summary
- Architecture of Interconnect: Is the FPGA tiled, channeled, or
hierarchical? Or does it provide only local interconnect, like the CAL?
Slide 27: Summary
- All in all, one of the most important issues in FPGA technology right
now is the balance between homogeneity and heterogeneity. In general, CAD tools
deal with homogeneous structures much better than heterogeneous structures.
However, there is a push for more complex, heterogeneous structures because
they provide more speed and functionality. This is, of course, at the cost
of versatility and programmability.
- Xilinx's tiled interconnect structure doesn't obey Rent's rule. In general,
the number of I/O connections to a block of logic does not grow as fast as
the complexity of the logic. When the circuits get big, enough interconnect
won't always be available. Altera, on the other hand, has a hierarchical
interconnect scheme. The problem here is that because of load considerations,
not every logic block connects to every part of the interconnect. Small
changes in the HDL circuit description might provoke major changes to the
- Class Discussion: A topic that comes up repeatedly in discussion
is the concept of a module generator. If sophisticated module generation
tools can be made, synthesis tools can be taught to make use of heterogeneous
FPGA features by looking for certain behavioral constructs in HDL descriptions
Scribed by Andrew Mihal