15-740 Fall '94 Final Exam Solutions

********** Part 1. Pipeline and Cache Performance ********

1. Timing of loop (10 points)

Assume the loop starts at time 0.

$L5:                        Start  Finish  Comments
  l.d    $f0,0($4)             0      3    Two loads + bubble
  l.d    $f2,0($5)             2      5    Two loads + bubble
  mul.d  $f0,$f0,$f2           5     10    Load stall, 5 cycle exe.
  addu   $5,$5,8               6      7    No dependence on mult.
  addu   $4,$4,8               7      8    No dependence on mult.
  addu   $3,$3,1               8      9    No dependence on mult.
  slt    $2,$3,512             9     10    No dependence on mult.
  bne    $2,$0,$L5            10     12    One slot delay
  add.d  $f4,$f4,$f0          11     13    Use delay slot, 2 cycle exe.

The next iteration could begin on cycle 12, with the l.d overlapping the
2nd cycle of the add.d.

2. Anomalous Performance (14 [2/6/6] points)

A. Increasing the interleaving allows the cost of updating the pointers and
   the loop index (once per iteration) to be amortized over more data
   operations. In this case, the floating point add that accumulates one
   partial product can then overlap the fetching of the next element of y.

B. The anomalous behavior occurs when vectors x and y map to exactly the
   same cache locations. This causes a cache miss on every access to an
   element of x or y in inner_prod1 (i.e., two misses per element).

C. Cost of a cache miss = (68-12)/2 = 28 cycles. We can also see that the
   line size is 16 bytes (2 doubles): inner_prod2 has half as many cache
   misses as inner_prod1, since it fetches x[i] & x[i+1] (same line) before
   fetching y[i] & y[i+1]. inner_prod4 has the same hit rate, implying the
   line size is no larger.

3. 512 x 512 matrix multiplication (18 [2/4/6/6] points)

A. Total accesses = 512 rows x 2 accesses/element x 512 columns = 524,288

B. With no collisions and with 2 words/line, we would get a 50% miss rate
   accessing matrix elements and a 50% miss rate on the first read of the
   vector x.
   Total misses = (512 rows + 1 vector) x 1/2 x 512 columns
                = 131,328  (25.0%)

C. The cache has capacity for 16 rows or vectors. The vector will interfere
   with a row approximately 1/16 of the time (32 rows), causing two misses
   per element for those rows. The remaining rows miss only on matrix
   elements, for 1/2 of the columns.
   Total misses = (32 x 2 + 480 x 1/2) x 512 columns = 155,648  (29.7%)

D. A miss rate of 29.7% adds ~16.6 cycles per element to the execution
   time, considerably increasing the time required over the ideal (0% miss
   rate) case of 12 cycles/element. Note that the majority of the misses
   stem from the fact that the matrix is initially not in the cache. Here
   is a complete list of possibilities:

   Case            CPU/ele  Mem/ele  Cycles/ele  Total
   All hit cache     12        0        12       3.15E6
   Ideal cache       12       14        26       6.8E6
   Actual cache      12       16.6      28.6     7.5E6
   No cache          12       28(*)     40       10.5E6

   (*) Assumes that a double word fetch requires 14 cycles.

*********** Vector Machines

4. Vector coding of binary search (15 points)

Have vectors:
  VY:              Vector of keys
  VindxA, VindxB:  Representing 64 indices (alternate back & forth to avoid
                   conflicts)
  Vdata:           Fetched values from X

  VindxA = all 0's
  for (int offset = 1<<15; offset > 0; offset = offset>>1) {
      Xoff = X + offset          /* All fetches offset by same amount */
      Vdata = Xoff[VindxA]                 { LVI }
      Compare VY >= Vdata                  { SGEV }
      VindxB = VindxA + offset when mask   { ADDISV #2 }
      Set mask to all 1's                  { CVM }
      Swap references to VindxA & VindxB
  }

Note: in the above code we need to use two index vectors VindxA & VindxB,
since a vector register can't be both the source and the destination of the
same instruction. The actual code would unroll the loop by two, alternating
the use of the two registers.
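For concreteness, here is a scalar C sketch of what the vector loop above
computes for each of the 64 keys (find the largest index i with X[i] <= y).
It is illustrative only: it assumes X is a sorted array of 2^16
double-precision keys with X[0] no larger than any search key, and the
names batch_search, LOG_N, etc. are made up for the sketch.

  #include <stdio.h>

  #define N_KEYS 64
  #define LOG_N  16                 /* X assumed to hold 1 << LOG_N keys */

  /* One pass of the 'offset' loop corresponds to one iteration of the
     vector loop above, applied to all 64 keys at once. */
  void batch_search(const double *X, const double *Y, int *idx)
  {
      for (int k = 0; k < N_KEYS; k++) {
          int i = 0;                              /* one element of VindxA */
          for (int offset = 1 << (LOG_N - 1); offset > 0; offset >>= 1) {
              if (Y[k] >= X[i + offset])          /* the SGEV comparison   */
                  i += offset;                    /* the masked index step */
          }
          idx[k] = i;                             /* largest i with X[i] <= Y[k] */
      }
  }

  int main(void)
  {
      static double X[1 << LOG_N];
      double Y[N_KEYS];
      int idx[N_KEYS];

      for (int i = 0; i < (1 << LOG_N); i++) X[i] = (double)i;  /* sorted */
      for (int k = 0; k < N_KEYS; k++)       Y[k] = 1000.5 + k;

      batch_search(X, Y, idx);
      printf("key %.1f -> index %d\n", Y[0], idx[0]);           /* 1000 */
      return 0;
  }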
5. Timing of iteration (19 points)

It turns out that setting & clearing the vector mask register cannot chain
with any other operations. [You could imagine implementing this register as
if it were a vector register with a 1-bit word size, but this isn't usually
done.]

Assume the iteration starts at time 0:

  Instr.       Start  Finish  Comments
  LVI             0      84   20+64
  SGEV           20      90   Chains with LVI.  6+64
  ADDISV #2      91     161   Stall on mask, 6+64
  CVM           155     155   Must wait until all adds are in the pipe

The next iteration would start on cycle 156.

The biggest limitation is the vector mask register: it prevents chaining
between iterations. Even if the mask register didn't prevent chaining, we
couldn't chain SGEV & ADDISV, since they both have the same source register.

6. Effect of bank conflicts (14 [6/4/4] points)

A. Bank conflicts would be severe. For the first 8 iterations, we are
   guaranteed that the least significant 8 bits of the Vindx elements are
   0, i.e., all references go to a single bank. On iteration 9 we would hit
   2 banks, on iteration 10 we would hit 4, etc. As an approximation,
   consider only the bank conflicts through iteration 9. Hitting a single
   bank slows the LVI by a factor of 4 (note that the memory unit does not
   consolidate repeated accesses to a single memory location). Hitting two
   banks slows it down by ~2. This adds a total of
   8 * 64 * (4 - 1) + 64 * (2 - 1) = 1600 cycles.

B. The basic computation would require ~16*156 = ~2500 cycles, so bank
   conflicts would slow things down by a factor of ~1.6.

C. An idea: initialize VindxA to a range of values (e.g., 0-63). That
   guarantees that the accesses are initially all to different banks, and
   there would be minimal conflicts overall. We'd have to stick large
   "sentinel" values at the end of the data array, and we'd have to handle
   the rare cases that fall below the starting index.

*********** Butterfly Routing

7. Permutation Formulation (20 [10/10] points)

A. Output port y of stage i has bit representation
      <y_{r-1}, ..., y_{j+1}, y_j, y_{j-1}, ..., y_1, y_0>,  where j = r-i-1.
   It is connected to the input port of stage i+1 having bit representation
      <y_{r-1}, ..., y_{j+1}, y_0, y_{j-1}, ..., y_1, y_j>,
   i.e., bits 0 and j are interchanged.

B. Suppose we are routing a message to a destination with bit
   representation <d_{r-1}, ..., d_1, d_0>. A switch has the effect of
   taking a message on input port y, having bit representation
   <y_{r-1}, ..., y_1, y_0>, and sending it out the output port with bit
   address <y_{r-1}, ..., y_1, b>, where b is either 0 or 1. With the
   butterfly routing scheme, b = d_{r-i-1} for a switch at stage i. The
   connection to the next stage will then be to an input port with its
   (r-i-1)st bit equal to d_{r-i-1}. Successive stages preserve the value
   of this bit, and this holds for all values of i. Therefore, the message
   will end up at the destination numbered <d_{r-1}, ..., d_1, d_0>.

8. Reverse routing (15 [10/5] points)

A. For simplicity, keep the stages numbered as they are in the figure, even
   though messages travel from right to left. The routing follows the same
   general scheme as forward routing, but the switching should be based on
   the destination address as follows: at stages i with 1 <= i < r, switch
   on d_{r-i}; at stage 0, switch on d_0.

B. Considering the connections in the reverse direction, those from stage i
   to stage i-1 (1 <= i < r) have the effect of interchanging bits 0 and
   r-i, justifying switching based on d_{r-i}. Stage 0 sets the LSB of the
   destination address.
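To check this argument concretely, here is a small C program that simulates
forward routing under the wiring described above: each switch drives the
low-order bit of the port number with the routing bit d_{r-i-1}, and the
link from stage i to stage i+1 interchanges bits 0 and r-i-1. The function
names and the choice r = 4 are assumptions made for the sketch, not part of
the exam's machine description.

  #include <assert.h>
  #include <stdio.h>

  /* Exchange bits a and b of x. */
  static unsigned swap_bits(unsigned x, int a, int b)
  {
      unsigned ba = (x >> a) & 1u, bb = (x >> b) & 1u;
      x &= ~((1u << a) | (1u << b));
      return x | (ba << b) | (bb << a);
  }

  /* Follow a message from source port src to its final output port. */
  static unsigned route(unsigned src, unsigned dst, int r)
  {
      unsigned port = src;
      for (int i = 0; i < r; i++) {
          unsigned b = (dst >> (r - i - 1)) & 1u;   /* routing bit d_{r-i-1} */
          port = (port & ~1u) | b;                  /* switch drives the LSB */
          if (i < r - 1)                            /* link to stage i+1     */
              port = swap_bits(port, 0, r - i - 1); /* swaps bits 0, r-i-1   */
      }
      return port;
  }

  int main(void)
  {
      const int r = 4;                              /* 2^4 = 16 ports */
      for (unsigned s = 0; s < (1u << r); s++)
          for (unsigned d = 0; d < (1u << r); d++)
              assert(route(s, d, r) == d);          /* every message arrives */
      printf("all %u x %u routes reach their destinations\n",
             1u << r, 1u << r);
      return 0;
  }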
*********** Fetch and Add Synch.

9. MDLX implementation (10 [6/2/2] points)

A. Requires 5 states:

   #1  Ri1->A->ALU   I16->ALU   Add        T->PC
   #2  Read          T->B->AO
   #3  Ri2->A->ALU   Add        Din->B->ALU
   #4  T->A->DO      Write      Din->B->Ri2
   #5  PC->A->ALU,AO IR         +4->ALU    Add   Disp

B. Requires some way to signal that no other operations may touch memory
   between states 2 & 4.

C. Not a significant problem. The page fault event would occur in state 3,
   before any state has been modified. We could take the page fault
   exception and then restart. Locking between states 2 & 4 would guarantee
   that the page cannot be swapped out between these two memory operations.

10. Implementation of Locks (18 points)

The method is like taking a number at a store and waiting until your number
comes up as the next customer to be served. Imagine the lock implemented by
two variables, ticket and next_customer:

  acquire() {
      my_ticket = ticket++;              /* Implemented by fetch and add */
      while (my_ticket != next_customer)
          ;                              /* Busy wait */
  }

  release() {
      next_customer = my_ticket + 1;     /* Doesn't need fetch and add */
  }

Note that processes are guaranteed access in the order in which they fetch
their tickets. As long as fetch-and-add is fair, each process will get a
valid ticket and will eventually acquire the lock.

11. Bus-based cache (12 [6/6] points)

A. There must be some way for the processor to signal the cache controller
   that it wants to do a multicycle atomic memory operation. There must
   also be some way for the controller to remain bus master for multiple
   cycles, or to (temporarily) prevent another cache from grabbing a copy
   of the block owned by this cache.

B. The bus traffic would be fairly minimal. A process executing acquire
   would use one bus cycle to obtain a ticket via fetch-and-add, locking
   out other accesses to this location for a few cycles. It would then
   obtain a clean copy of next_customer and busy wait on a local copy. The
   process releasing the lock would invalidate all the copies of
   next_customer. The waiting processes would then fetch new clean copies
   of next_customer, and the next one in line would enter its critical
   section.

12. Combining in routing network (15 [10/5] points)

A. If two messages F&A(x, a) and F&A(x, b) were received at a switch, these
   would be combined into a single update F&A(x, a+b) sent to the memory.
   When the response (say v) comes back, it would be split into two
   returning messages: v for one request and v + a for the other. This
   requires buffering at the switch to keep track of the original messages
   until the response has been received. We would also require an adder.

B. This hardware is much more complex than that of the CM-5 control
   network, mainly because of the increased buffering. The CM-5 guarantees
   that there can be only one scan going on at a time, and it doesn't have
   to keep track of the origins or destinations of scan operations coming
   into the control network nodes.

   In fact, the RP3 project at IBM was an attempt to build a system based
   on the Ultracomputer. They struggled to build a network that would
   support fetch-and-add, but never got it going.
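As a minimal sketch of the combining behavior described in problem 12, the
following C program mimics a switch that merges two fetch-and-add requests
into one memory update and then splits the response. The names and the
two-request limit are assumptions made for illustration; a real switch
would also need the buffering and adder discussed above.

  #include <stdio.h>

  static int memory_cell = 100;             /* the shared location x */

  /* What the memory module does when a single F&A(x, inc) arrives. */
  static int memory_fetch_and_add(int inc)
  {
      int old = memory_cell;
      memory_cell += inc;
      return old;
  }

  int main(void)
  {
      int a = 3, b = 5;         /* the two requests: F&A(x, a), F&A(x, b) */

      /* Combine: forward one request F&A(x, a+b); buffer 'a' at the
         switch until the response returns. */
      int buffered_a = a;
      int v = memory_fetch_and_add(a + b);  /* single combined update */

      /* Split: the first requester gets v, the second gets v + a, exactly
         as if the two F&As had been serialized at the memory. */
      int reply_first  = v;
      int reply_second = v + buffered_a;

      printf("replies %d and %d, memory now %d\n",
             reply_first, reply_second, memory_cell);
      return 0;
  }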