Relationship Strategies for the 90s, or, A Gentle Introduction to CPU Design

(Warning: rather geeky. If you think a CPU is a kind of couch, and out of order execution is what you expect from a Russian firing squad, this may not be for you.)

Old-fashioned relationships follow a simple model exemplified by the plot of a 1950s movie. Boy meets girl, boy gets girl's phone number, boy and girl date, boy and girl ... (here the screen grows misty at the most interesting moment), boy and girl break up, boy meets new girl, etc., etc. Nowadays, of course, sex roles, gender pairings, and mating rituals are more varied. But the big change has been streamlining the whole procedure.

For clarity, let us name the various phases, and let us replace boy and girl with the more politically correct terms U and C. U and C meeting is the fetch phase. Exchanging phone numbers, recent histories, etc. is the decode phase. The execute phase is limited only by the imagination. And lastly comes the write-back phase, whereupon U and C go their separate ways and, implausibly, promise to be friends.

I call the 1950s model the sequential model. U is only interested in one C at a time, and only after write-back completes does U search for L. This model is still satisfactory for many people, but the process itself is slow and cumbersome and, in the modern view, requires streamlining.

The simplest way to make this process more efficient is called pipelining. While U is writing back X, U can also be executing C, decoding Q, and fetching Z! The elapsed time to process each significant other is unchanged, or more likely increased due to the distractions, but SO throughput can be greatly increased, up to a theoretical limit of four times faster. In addition, U will spend a much higher proportion of time in the desirable execute phase.

Perhaps a hypothetical example will make this clearer. Geoff is in the execute phase with Cara. It has been a very deep and complex instruction, and has kept his pipeline stalled for months. Mary is in his decode stage, and his fetch and write-back slots are, sadly, empty. The next day Cara goes off to Italy and moves into Geoff's write-back slot, and Mary enters the execute phase. The next day he eyes Molly at a party, and she moves quickly through the fetch phase and into the decode phase. Later in the week Mary moves into the write-back phase and Molly enters the execute phase. Three instructions in one week is an impressive result, but this example also demonstrates the limitations of pipelining. Due to stalls in the execute phase or insufficient fetch bandwidth, it is very difficult to keep the pipeline full.

Fetch bandwidth can be improved by instruction caching, but stalls in the execute phase are harder to handle. One popular approach is out-of-order execution. Suppose U is happily executing C, a rather complex instruction, when all of a sudden C must fly off to M for a weekend. Simple pipelines would stall, but with out-of-order execution D, who was in the decode phase, is rushed into the execute phase and out into the write-back phase before C's return. Of course this requires that C and D be entirely independent. Indeed, if C and D were friends, the resulting situation could be highly inconsistent! These concerns with semantic consistency are a vexing problem for all of these aggressive approaches, requiring very careful implementation.
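For readers who prefer their metaphors executable, here is a minimal sketch in Python of the scheduling decision just described. The names C, D, and M follow the story above, while the third instruction E, the dependency sets, and the stalled marker are illustrative assumptions of mine; a real out-of-order core tracks register dependencies rather than weekend travel plans.

    # Toy out-of-order issue: repeatedly pick the oldest instruction whose
    # dependencies have all completed, skipping anything currently stalled.
    window = [
        ("C", set()),     # complex, and about to be stalled (off visiting M)
        ("D", set()),     # entirely independent of C, so free to go first
        ("E", {"D"}),     # hypothetical: depends on D, must wait for D
    ]
    stalled = {"C"}       # C is unavailable this weekend

    completed = []
    while len(completed) < len(window):
        for name, deps in window:
            if name in completed or name in stalled:
                continue
            if deps <= set(completed):    # all dependencies written back?
                completed.append(name)    # issue the oldest ready instruction
                break
        else:
            stalled.clear()               # nothing ready: wait for C to return

    print(completed)                      # ['D', 'E', 'C']: D and E finish before C

Note that if D had depended on C, D could never have jumped the queue; that is exactly the independence requirement above.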
The most famous approach to these problems is called RISC, short for Reduced Instruction Set Computer. The idea is to reduce stalls in the execute phase by making each instruction as simple as possible. Mutual understanding, deepening the relationship, etc. are all very well, but the RISC view is that these are not compatible with high throughput. In addition, by making all of the instructions as dumb and simple as possible, the RISC approach eases the crucial task of maintaining semantic consistency. All in all, RISC designs tend to be simpler and deliver more bang for the buck.

The very highest throughput designs are superscalar. These designs remove the fundamental limitation of the pipelined method, which is that only one instruction can occupy any pipeline phase at once. In particular, multiple instructions execute simultaneously, which requires multiple functional units. Imagine the possibilities! Original designs were only 2-way superscalar, but less repressed architects have constructed 4-way designs and beyond. Of course, simultaneous execution produces the same consistency problems as out-of-order scheduling, but in spades, and typically a separate copy of processor state must be maintained for each active instruction, not just executing ones. Implementing these designs correctly can be a real challenge. (A toy 2-way model appears in the postscript below.)

We've sure come a long way from those clunky old designs, haven't we? What allowed this metamorphosis? Not any increase in the ingenuity of the designers, but rather the underlying improvements in technology, which have so greatly expanded the realm of possibilities. Those 1950s designers labored under strict limitations unimaginable to their modern descendants, who casually assume that U can fetch F from across the room while decoding D while executing C under the table. :)
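P.S. For the terminally geeky, here is a minimal sketch in Python of the 2-way superscalar flow promised above. The issue width, the stage list, and the fourth significant other (Zoe) are illustrative assumptions; a real design would also need the dependency checks and per-instruction state discussed earlier, which this toy cheerfully ignores.

    import math

    ISSUE_WIDTH = 2                       # two instructions per stage per step
    STAGES = ["fetch", "decode", "execute", "write-back"]

    def run_superscalar(instructions):
        pipeline = [[] for _ in STAGES]   # each stage holds a group of instructions
        pending = list(instructions)
        groups = math.ceil(len(pending) / ISSUE_WIDTH)
        for step in range(groups + len(STAGES) - 1):
            # Up to ISSUE_WIDTH instructions enter fetch; everyone else advances.
            incoming = [pending.pop(0)
                        for _ in range(min(ISSUE_WIDTH, len(pending)))]
            pipeline = [incoming] + pipeline[:-1]
            print(f"step {step}: " + ", ".join(
                f"{stage}={'+'.join(group) or '-'}"
                for stage, group in zip(STAGES, pipeline)))

    run_superscalar(["Cara", "Mary", "Molly", "Zoe"])

Four significant others clear the machine in five steps instead of the seven a plain four-stage pipeline would need.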