Relationship Strategies for the 90s, or, A Gentle Introduction to CPU Design

(Warning: rather geeky. If you think a CPU is a kind of couch, and out of order execution is what you expect from a Russian firing squad, this may not be for you.)

Old-fashioned relationships follow a simple model exemplified by the plot of a 1950s movie. Boy meets girl, boy gets girl's phone number, boy and girl date, boy and girl ... (here the screen grows misty at the most interesting moment), boy and girl break up, boy meets new girl, etc., etc. Nowadays, of course, sex roles, gender pairings, and mating rituals are more varied. But the big change has been streamlining the whole procedure.

For clarity, let us name the various phases, and let us replace boy and girl with the more politically correct terms U and C. U and C meeting is the fetch phase. Exchanging phone numbers, recent histories, etc. is the decode phase. The execute phase is limited only by the imagination. And lastly comes the write-back phase, whereupon U and C go their separate ways and, implausibly, promise to be friends.

I call the 1950s model the sequential model. U is only interested in one C at a time, and only after write-back completes does U search for L. This model is still satisfactory for many people, but the process itself is slow and cumbersome and, in the modern view, requires streamlining.

The simplest way to make this process more efficient is called pipelining. While U is writing back X, U can also be executing C, decoding Q, and fetching Z! The elapsed time to process each significant other is unchanged, or more likely increased due to the distractions, but SO throughput can be greatly increased, up to a theoretical limit of four times faster. In addition, U will spend a much higher proportion of time in the desirable execute phase.

Perhaps a hypothetical example will make this clearer. Geoff is in the execute phase with Cara. It has been a very deep and complex instruction, and has kept his pipeline stalled for months. Mary is in his decode stage, and his fetch and write-back slots are, sadly, empty. The next day Cara goes off to Italy and moves into Geoff's write-back slot, and Mary enters the execute phase. The next day he eyes Molly at a party, and she moves quickly through the fetch phase and into the decode phase. Later in the week Mary moves into the write-back phase and Molly enters the execute phase. Three instructions in one week is an impressive result, but this example also demonstrates the limitations of pipelining. Due to stalls in the execute phase or insufficient fetch bandwidth, it is very difficult to keep the pipeline full.

Fetch bandwidth can be improved by instruction caching, but stalls in the execute phase are harder to handle. One popular approach is out-of-order execution. Suppose U is happily executing C, a rather complex instruction, when all of a sudden C must fly off to M for a weekend. Simple pipelines would stall, but with out-of-order execution D, who was in the decode phase, is rushed into the execute phase and out into the write-back phase before C's return. Of course this requires that C and D be entirely independent. Indeed, if C and D were friends, the resulting situation could be highly inconsistent! These concerns with semantic consistency are a vexing problem for all of these aggressive approaches, requiring very careful implementation.
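For readers who prefer their metaphors executable, here is a minimal sketch in Python of the scheduling decision just described. The names C, D, and M follow the story above, while the third instruction E, the dependency sets, and the stalled marker are illustrative assumptions of mine; a real out-of-order core tracks register dependencies rather than weekend travel plans.

    # Toy out-of-order issue: repeatedly pick the oldest instruction whose
    # dependencies have all completed, skipping anything currently stalled.
    window = [
        ("C", set()),     # complex, and about to be stalled (off visiting M)
        ("D", set()),     # entirely independent of C, so free to go first
        ("E", {"D"}),     # hypothetical: depends on D, must wait for D
    ]
    stalled = {"C"}       # C is unavailable this weekend

    completed = []
    while len(completed) < len(window):
        for name, deps in window:
            if name in completed or name in stalled:
                continue
            if deps <= set(completed):    # all dependencies written back?
                completed.append(name)    # issue the oldest ready instruction
                break
        else:
            stalled.clear()               # nothing ready: wait for C to return

    print(completed)                      # ['D', 'E', 'C']: D and E finish before C

Note that if D had depended on C, D could never have jumped the queue; that is exactly the independence requirement above.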
The most famous approach to these problems is called RISC, short for Reduced Instruction Set Computer. The idea is to reduce stalls in the execute phase by making each instruction as simple as possible. Mutual understanding, deepening the relationship, etc. are all very well, but the RISC view is that these are not compatible with high throughput. In addition, by making all of the instructions as dumb and simple as possible, the RISC approach eases the crucial task of maintaining semantic consistency. All in all, RISC designs tend to be simpler and deliver more bang for the buck.

The very highest throughput designs are superscalar. These designs remove the fundamental limitation of the pipelined method, which is that only one instruction can occupy any pipeline phase at once. In particular, multiple instructions execute simultaneously, which requires multiple functional units. Imagine the possibilities! Original designs were only 2-way superscalar, but less repressed architects have constructed 4-way designs and beyond. Of course, simultaneous execution produces the same consistency problems as out-of-order scheduling, but in spades, and typically a separate copy of processor state must be maintained for each active instruction, not just executing ones. Implementing these designs correctly can be a real challenge. (A toy 2-way model appears in the postscript below.)

We've sure come a long way from those clunky old designs, haven't we? What allowed this metamorphosis? Not any increase in the ingenuity of the designers, but rather the underlying improvements in technology, which have so greatly expanded the realm of possibilities. Those 1950s designers labored under strict limitations unimaginable to their modern descendants, who casually assume that U can fetch F from across the room while decoding D while executing C under the table. :)
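P.S. For the terminally geeky, here is a minimal sketch in Python of the 2-way superscalar flow promised above. The issue width, the stage list, and the fourth significant other (Zoe) are illustrative assumptions; a real design would also need the dependency checks and per-instruction state discussed earlier, which this toy cheerfully ignores.

    import math

    ISSUE_WIDTH = 2                       # two instructions per stage per step
    STAGES = ["fetch", "decode", "execute", "write-back"]

    def run_superscalar(instructions):
        pipeline = [[] for _ in STAGES]   # each stage holds a group of instructions
        pending = list(instructions)
        groups = math.ceil(len(pending) / ISSUE_WIDTH)
        for step in range(groups + len(STAGES) - 1):
            # Up to ISSUE_WIDTH instructions enter fetch; everyone else advances.
            incoming = [pending.pop(0)
                        for _ in range(min(ISSUE_WIDTH, len(pending)))]
            pipeline = [incoming] + pipeline[:-1]
            print(f"step {step}: " + ", ".join(
                f"{stage}={'+'.join(group) or '-'}"
                for stage, group in zip(STAGES, pipeline)))

    run_superscalar(["Cara", "Mary", "Molly", "Zoe"])

Four significant others clear the machine in five steps instead of the seven a plain four-stage pipeline would need.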