Scribe Notes -- 1/19/98

Lab 1: Typos have been fixed in the Lab 1 web page, and the missing links have been filled in. Also, minor changes have been made to the setvar847 shell scripts. If you have not already done so, please download the updated scripts.
You may use either VHDL or Verilog to do Lab 1. If you feel very confident with one of these languages, try using the opposite one for Lab 1. If you don't know either, it's probably best to start with Verilog. However, later labs may require VHDL because the peripherals on the evaluation boards we have use VHDL models.
Prof. Thomas' Verilog book is available in the bookstore with the 18-240 books. It is bright red. It comes with an evaluation version of the Veriwell simulator on CD-ROM. There is another good reference by Samir Palnitkar that is available in the 18-347 section.
If anyone knows of any good VHDL references (especially ones available on the Web), please speak up.
Early paper handout: We will begin handing out papers for the Wednesday classes one week in advance instead of two days in advance.

XC4000 Issues:

Altera FLEX: We'll take a quick look at how Altera handles these issues.

Taxonomy Issues: With all of the above in mind, we'll try to develop a system to classify FPGAs based on their architecture.

Slide 3: Why Study Old FPGAs?

90% of the research being done in reconfigurable computing is using commercially available FPGAs. Custom programmable devices make up only a small portion of the field.
We'll be using commercial FPGAs for Lab 1.
We will be studying two designs that use commercially available FPGAs: SPLASH and PAM. Therefore, it is a good idea to learn about popular commercial FPGAs now.

Slide 4: Xilinx 4000 Series CLB

Slide 5: Xilinx 4000 Series Interconnect

Slide 6: Area

Interconnect accounts for the vast majority of the physical area of an XC4000 FPGA. In general, interconnect accounts for 100 times the area used by the configurable logic blocks. Also, interconnect uses nearly 10 times as much area as the configuration memory.

Slide 7: FPGA CAD Flow

Timing Analysis: Xilinx's timing analysis tools tend to be very pessimistic so that emotionally fragile engineers won't become upset if their design won't run at the speed the tool estimated it would run at.
Simulation: One step left out of this slide is Simulation. In big designs, synthesis and Tech Mapping and Place/Route can take hours if not days. You can't afford to go through this entire cycle only to find out you made a typo in your state machine transition implementation. Simulation is very useful to verify the functional correctness of a design before synthesis. There are not many tools available to test the post-layout design. After routing the chip, it's easiest to test it in-system.
Class Discussion: One thing that would be useful is the ability to partition a design across multiple FPGAs. Xilinx does not have tools to do this - you have to do it by hand.

Slide 8: Synthesis

FPGA designs are very different from ASIC designs. At the HDL level, an FPGA designer must take into account the type of functionality available in the FPGA hardware. For example, can a function be easily represented in a LUT like the kind provided in the XC4000 architecture? ASIC designers usually just go ahead and build their own combinational circuits, and they must also build their own registers. In FPGA designs, flip-flops are abundant. This makes it attractive to make pipelined designs and one-hot FSMs.
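As a rough illustration of why abundant flip-flops favor one-hot FSMs, the Python sketch below compares flip-flop counts for a binary and a one-hot state encoding and steps a small one-hot machine. The three-state machine and the names flops_binary, flops_one_hot, and step_one_hot are made up for this example; they were not part of the lecture.

```python
def flops_binary(num_states):
    """Flip-flops needed for a dense binary state encoding."""
    return max(1, (num_states - 1).bit_length())

def flops_one_hot(num_states):
    """One-hot encoding spends one flip-flop per state."""
    return num_states

# Hypothetical 3-state machine (not from the lecture):
#   IDLE -> RUN when go is set, RUN -> DONE always, DONE -> IDLE always.
# Each element of `state` models one flip-flop; exactly one is True.
def step_one_hot(state, go):
    idle, run, done = state
    next_idle = (idle and not go) or done   # stay idle, or return from DONE
    next_run = idle and go                  # only entered from IDLE
    next_done = run                         # only entered from RUN
    return (bool(next_idle), bool(next_run), bool(next_done))
```

Each next-state bit depends only on a couple of current-state bits, so the logic per flip-flop stays small even though the one-hot encoding uses more flip-flops than the binary one.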
FPGA designs do have some drawbacks. Chief among them, it is difficult to determine the area a function will take up. The synthesis process fits Boolean functions into look-up tables, so a complex function can take the same amount of space as a simple function if they each use a single LUT.
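The point about area can be sketched in Python by modeling a 4-input LUT as a 16-entry truth table. The helper names make_lut4 and read_lut4 and the two example functions are assumptions chosen for illustration.

```python
def make_lut4(func):
    """Pack any 4-input Boolean function into a 16-entry truth table."""
    table = []
    for index in range(16):
        a, b, c, d = [(index >> i) & 1 for i in range(4)]
        table.append(int(bool(func(a, b, c, d))))
    return table

def read_lut4(table, a, b, c, d):
    """Evaluate the LUT: the inputs simply form an address into the table."""
    return table[a | (b << 1) | (c << 2) | (d << 3)]

# A trivial function and a messier one occupy exactly the same 16 bits.
simple_lut = make_lut4(lambda a, b, c, d: a & b)
messy_lut = make_lut4(lambda a, b, c, d: (a ^ b ^ c ^ d) | (a & (1 - b) & c))
```

Both tables have the same size, which is why gate count in the HDL source is a poor predictor of LUT count.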

Slide 9: Tech Mapping

Finding the optimal way to fit a Boolean expression into K-LUTs is an NP-complete problem. Good heuristics are needed to perform this task, and this is an excellent area for research.

Slide 10: Heterogeneous vs. Homogeneous

The main thing that can be gained from a heterogeneous structure is efficiency. For certain applications, heterogeneous features can be very helpful for making things fast. However, for the majority of applications specialized heterogeneous features will not be used. By adding features to make complex things faster, we give up versatility.

Slide 11: Hetero vs. Homo: Inside the cell

A general trend in the FPGA industry is to dedicate chip area to features that will make specific applications faster. We dubbed this process "throwing area at speed." For example, the 3LUT HMAP in the Xilinx CLB provides the primary function of helping implement Xilinx's RAM functionality. In almost all other cases, the HMAP is left unused. CAD tools seldom take advantage of this area. Only pre-made modules that have been highly optimized will make use of the HMAP. But, when the CLB is configured as RAM, the HMAP is very useful indeed.

Slide 12: History

As an aside, we looked at the history of the Xilinx CLB structure. In the XC3000 series, the CLB had two 4LUTs that shared three of their inputs. A mux could select between the outputs of the 4LUTs to make a 5LUT.
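The mux-between-two-LUTs idea is a Shannon expansion, which can be sketched in Python as follows. This is a simplification: it assumes both 4LUTs see the same four inputs and a fifth input drives the mux select, which may not match the exact input sharing of the XC3000 CLB. The names split_5_input and eval_5lut are hypothetical.

```python
from itertools import product

def split_5_input(func):
    """Shannon-expand a 5-input function on e into two 16-entry 4-LUT tables."""
    lut_e0, lut_e1 = [], []
    # Unpacking as (d, c, b, a) makes a the fastest-changing bit,
    # so the table index is a + 2b + 4c + 8d.
    for d, c, b, a in product((0, 1), repeat=4):
        lut_e0.append(int(bool(func(a, b, c, d, 0))))
        lut_e1.append(int(bool(func(a, b, c, d, 1))))
    return lut_e0, lut_e1

def eval_5lut(lut_e0, lut_e1, a, b, c, d, e):
    """Both 4LUTs are read at the same address; e selects which output to use."""
    index = a | (b << 1) | (c << 2) | (d << 3)
    return lut_e1[index] if e else lut_e0[index]
```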

Slide 13: Fast Carry Logic

Fast carry logic is another example of "throwing area at speed." In the XC4000 series, extra space is dedicated to custom ripple-carry logic at the input to each CLB. Each CLB has dedicated wiring extending North and South to allow the synthesis tools to create fast-carry adders in columns going up and down the chip. Whenever synthesis tools see the '+' operator, they attempt to use the fast-carry logic. Although Xilinx has published an app note or two about using the dedicated carry logic for other purposes, nobody ever does.

Slide 14: Fast Carry Logic

Each Xilinx CLB can be used to make a 2x2 adder with fast-carry logic. The heart of the carry chain is a 2x1 mux that appears horizontally in the middle of the slide. To set up the carry chain, the CAD tools program the carry logic to work as follows:
- Cin is driven on the '1' input of the mux.
- Either A or B can be driven on the '0' input. For this example we will use B.
- The remaining multiplexers are set up to get the function A XOR B on the select line of the mux.
- The output of the mux now has this function:
  Cout = ((A XOR B) & Cin) | (~(A XOR B) & B)
  This reduces to:
  Cout = (~A & B & Cin) | (A & ~B & Cin) | (A & B)
  This is the correct equation for Cout.
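The reduction above can be checked exhaustively with a short Python sketch. The function names are made up for this example, and the sum bit (A XOR B XOR Cin) used in ripple_add is the standard full-adder sum rather than something stated in these notes.

```python
from itertools import product

def carry_mux(a, b, cin):
    """Carry mux as described: select = A XOR B, '1' input = Cin, '0' input = B."""
    select = a ^ b
    return cin if select else b

def carry_reduced(a, b, cin):
    """The reduced sum-of-products form of Cout."""
    return ((1 - a) & b & cin) | (a & (1 - b) & cin) | (a & b)

def ripple_add(x, y, width):
    """Add two unsigned integers by chaining carry_mux one bit at a time."""
    carry, total = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        total |= (a ^ b ^ carry) << i      # standard full-adder sum bit
        carry = carry_mux(a, b, carry)     # carry ripples to the next bit
    return total | (carry << width)

# All eight input combinations agree with the carry of a full adder.
for a, b, cin in product((0, 1), repeat=3):
    assert carry_mux(a, b, cin) == carry_reduced(a, b, cin) == (a + b + cin) // 2
```

Chaining the mux bit by bit, as ripple_add does, mirrors how the carry runs up or down a column of CLBs.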

Slide 15: Rent's Rule

For XC4000, Rent's exponent is approximately 0.5. This means that the amount of available interconnect grows as the square root of the growth of the area devoted to logic. In general, interconnect resources are not as rich as one would hope.
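To make the square-root relationship concrete, here is a minimal Python sketch of Rent's rule in its usual form, terminals = c * N^p. The function name rent_terminals and the value of 4 terminals per block are assumptions for illustration only; they are not XC4000 figures.

```python
def rent_terminals(num_blocks, terminals_per_block, exponent):
    """Rent's rule: external connections needed by a region of num_blocks blocks."""
    return terminals_per_block * num_blocks ** exponent

# With an exponent of 0.5, quadrupling the logic only doubles the interconnect.
small = rent_terminals(100, 4.0, 0.5)   # hypothetical constant c = 4
large = rent_terminals(400, 4.0, 0.5)
ratio = large / small
```

A design whose own Rent exponent is higher than 0.5 will therefore run short of routing as it grows, even if the logic still fits.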
Xilinx handles this problem by introducing new architectures, such as the XC4000EX, which provides twice as much interconnect for high-density devices. The EX series is generally upwardly-compatible with the 4000E series.

Slide 16: Speed

Because logic is in fixed locations on FPGAs, interconnect becomes a major source of delay.

Slide 17: Speed: The Event Horizon

Von Herzen made a map of the area that is accessible within 1.6 ns from any given CLB on a Xilinx part. The idea behind this is to figure out where to place the critical portions of a circuit on an FPGA so that it will run at the fastest rate possible.

Slide 18: Speed: The Event Horizon

An interesting result of the experiment is the appearance of some asymmetry in the "event horizon". On Xilinx parts, it is faster to go to the East and South of any given logic block. It is unclear whether this is true for every CLB on the chip, but it is an interesting result nonetheless.
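The idea of an event horizon can be sketched in Python as a shortest-path search over a grid of CLBs with a delay budget. The per-hop delays below are invented purely to show how a direction-dependent delay produces an asymmetric region; they are not measurements from Von Herzen's map or from any Xilinx part, and the name event_horizon is hypothetical.

```python
import heapq

# Hypothetical per-hop delays in picoseconds (NOT measured values).
# Rows grow toward the South, columns toward the East.
HOP_DELAY_PS = {(1, 0): 300, (0, 1): 300, (-1, 0): 500, (0, -1): 500}  # E, S, W, N

def event_horizon(origin, size, budget_ps):
    """Return {(col, row): delay_ps} for every CLB reachable within budget_ps."""
    best = {origin: 0}
    queue = [(0, origin)]
    while queue:
        delay, (col, row) = heapq.heappop(queue)
        if delay > best.get((col, row), float("inf")):
            continue  # stale queue entry
        for (dc, dr), hop in HOP_DELAY_PS.items():
            nxt = (col + dc, row + dr)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
                continue  # off the edge of the array
            new_delay = delay + hop
            if new_delay <= budget_ps and new_delay < best.get(nxt, float("inf")):
                best[nxt] = new_delay
                heapq.heappush(queue, (new_delay, nxt))
    return best
```

With these made-up numbers and a 1600 ps budget, the reachable region extends farther East and South than West and North, which is the shape of asymmetry described above.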

Slide 19: Reprogrammability

Current FPGAs take a long time to reprogram (on the order of a few hundred milliseconds). Cutting this time will be important to reconfigurable computing.
Another important aspect of reconfigurable computing will be the ability to virtualize a design. To do this we will need to be able to change only part of the reconfigurable fabric at a time.
This is not currently possible on Xilinx parts. In order to reprogram only part of a device while the other part continues to function, we need the ability to partition off part of the device. With Xilinx's complex interconnect, this isn't possible. We decided the outcome would be "pass transistor semiconductor mayhem." Current Xilinx parts have to shut down completely in order to reconfigure.
Lastly, in order to program only certain configuration bits, we need an addressing scheme. The ability to address individual configuration bits requires lots of extra area devoted to configuration logic, and several additional I/O pins.
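A rough Python sketch of the addressing cost: the number of address lines grows with the logarithm of the number of addressable units. The function name address_bits, the idea of grouping bits into words, and the 100,000-bit example are assumptions for illustration, not figures for any real part.

```python
def address_bits(num_config_bits, bits_per_word=1):
    """Address lines needed to select one word of configuration memory."""
    words = -(-num_config_bits // bits_per_word)   # ceiling division
    return max(1, (words - 1).bit_length())

# Hypothetical device with 100,000 configuration bits:
per_bit = address_bits(100_000)        # every bit individually addressable
per_word = address_bits(100_000, 32)   # addressed in 32-bit words instead
```

Coarser addressing saves address lines and decode logic, at the price of rewriting more bits than strictly necessary on each partial update.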

Slide 20: Altera FLEX 8000

Slide 21: Altera FLEX 8000

The functional unit structure of an Altera FLEX 8000 differs from its Xilinx counterparts in that the Altera functional unit consists of 8 4LUTs. These LUTs are fully interconnected by a local crossbar. Each LUT can also drive the global interconnect. A LUT can drive one row channel and up to two column channels.

Slide 22: FastTrack Routing

Unfortunately, not every functional unit (called a LAB) can see every wire in the global channels. This causes a big problem in that a small change to the netlist of a design can end up causing a very big change in the physical layout of the chip. Also, this mandates that the placement and routing tools have to be very closely linked. The placement tools have to place logic in LABs that can see the global interconnect they need to make routing possible.
The benefit of Altera's architecture is predictable timing. The intra-LAB crossbar provides dependable timing for all local connections. Additionally, the global row and column routing channels all stretch the entire length of the device, so the timing of inter-LAB connections can also be predicted closely.
The reason that Altera did not make every global connection available to every LAB is load considerations. If every global line crossed a switch at every LAB, the parasitic capacitance of the switches would make the line very slow.

Slide 23: FPGA Taxonomy

Over the next three slides, we'll discuss features useful for classifying FPGAs.

Slide 24: Programming Technology

Programming Technology: How is the FPGA programmed? What does it use to store its configuration bits?

SRAM: SRAM takes up lots of space (5 or 6 transistors for 1 bit), but it can be reconfigured quickly and is generally very fast. Also, the config bits can be used as RAM in the design.
Antifuse: This technology doesn't really have a place in reconfigurable computing because it is only one-time programmable. However, antifuse-programmed connections are very fast, take up very little space, and don't load the line as much as transistor switches do.
EEPROM: EEPROM is just as reconfigurable as SRAM, but it is very slow to reprogram. Unlike SRAM, it does not lose its configuration when power is removed from the circuit.
Flash: This technology can only be reliably reprogrammed a finite number of times. Its use in reconfigurable computing is probably limited. Like EEPROM, it does not lose its program when the power is turned off.

Slide 25: Logic Cell Architecture

The question we have to ask is "What is a good logic block architecture?" Do we want a fine-grained structure or a coarse-grained one? As a class, we came up with a big list of different possible things to put inside a logic block.
- Look-up Tables
- PLA/PAL type structures
- Small DSPs
- Seas of Gates
- MUXes
- NAND Gates
- Pentiums
The tradeoff here is between complexity and versatility. A fine-grained architecture like NAND gates offers a lot of versatility, but requires an enormous number of configuration bits. On the other hand, a coarse-grained array of Pentiums would require an enormous amount of interconnect resources. A balance has to be reached, but there isn't necessarily a solution that will be optimal for all applications.

Slide 26: Summary

Architecture of Interconnect: Is the FPGA tiled, channeled, or hierarchical? Or does it provide only local interconnect, like the CAL?

Slide 27: Summary

All in all, one of the most important issues in FPGA technology right now is the balance between homogeneity and heterogeneity. In general, CAD tools deal with homogeneous structures much better than heterogeneous structures. However, there is a push for more complex, heterogeneous structures because they provide more speed and functionality. This is, of course, at the cost of versatility and programmability.
Xilinx's tiled interconnect structure doesn't obey Rent's rule. In general, the number of I/O connections to a block of logic does not grow as fast as the complexity of the logic. When the circuits get big, enough interconnect won't always be available. Altera, on the other hand, has a hierarchical interconnect scheme. The problem here is that because of load considerations, not every logic block connects to every part of the interconnect. Small changes in the HDL circuit description might provoke major changes to the hardware layout.
Class Discussion: A topic that comes up repeatedly in discussion is the concept of a module generator. If sophisticated module generation tools can be made, synthesis tools can be taught to make use of heterogeneous FPGA features by looking for certain behavioral constructs in HDL descriptions of circuits.

Scribed by Andrew Mihal