15-740 Fall '94 Final Exam Solutions

********** Part 1. Pipeline and Cache Performance ********

1. Timing of loop (10 points)

Assume the loop starts at time 0.

$L5:                        Start  Finish  Comments
  l.d    $f0,0($4)             0      3    Two loads + bubble
  l.d    $f2,0($5)             2      5    Two loads + bubble
  mul.d  $f0,$f0,$f2           5     10    Load stall, 5 cycle exe.
  addu   $5,$5,8               6      7    No dependence on mult.
  addu   $4,$4,8               7      8    No dependence on mult.
  addu   $3,$3,1               8      9    No dependence on mult.
  slt    $2,$3,512             9     10    No dependence on mult.
  bne    $2,$0,$L5            10     12    One slot delay
  add.d  $f4,$f4,$f0          11     13    Use delay slot, 2 cycle exe.

The next iteration could begin on cycle 12, with the l.d overlapping the
2nd cycle of the add.d.

2. Anomalous Performance (14 [2/6/6] points)

A. Increasing the interleaving allows the cost of updating the pointers and
   the loop index (once per iteration) to be amortized over more data
   operations. In this case, the floating point add that accumulates one
   partial product can then overlap the fetching of the next element of y.

B. The anomalous behavior occurs when vectors x and y map to exactly the
   same cache locations. This causes a cache miss on every access to an
   element of x or y in inner_prod1 (i.e., two misses per element).

C. Cost of a cache miss = (68-12)/2 = 28 cycles. We can also see that the
   line size is 16 bytes (2 doubles): inner_prod2 has half as many cache
   misses as inner_prod1, since it fetches x[i] & x[i+1] (same line) before
   fetching y[i] & y[i+1]. inner_prod4 has the same hit rate, implying the
   line size is no larger.

3. 512 x 512 matrix multiplication (18 [2/4/6/6] points)

A. Total accesses = 512 rows x 2 accesses/element x 512 columns = 524,288

B. With no collisions and with 2 words/line, we would get a 50% miss rate
   accessing matrix elements and a 50% miss rate on the first read of the
   vector x.
   Total misses = (512 rows + 1 vector) x 1/2 x 512 columns
                = 131,328  (25.0%)

C. The cache has capacity for 16 rows or vectors. The vector will interfere
   with a row approximately 1/16 of the time (32 rows), causing two misses
   per element for those rows. The remaining rows miss only on matrix
   elements, for 1/2 of the columns.
   Total misses = (32 x 2 + 480 x 1/2) x 512 columns = 155,648  (29.7%)

D. A miss rate of 29.7% adds ~16.6 cycles per element to the execution
   time, considerably increasing the time required over the ideal (0% miss
   rate) case of 12 cycles/element. Note that the majority of the misses
   stem from the fact that the matrix is initially not in the cache. Here
   is a complete list of possibilities:

   Case            CPU/ele  Mem/ele  Cycles/ele  Total
   All hit cache     12        0        12       3.15E6
   Ideal cache       12       14        26       6.8E6
   Actual cache      12       16.6      28.6     7.5E6
   No cache          12       28(*)     40       10.5E6

   (*) Assumes that a double word fetch requires 14 cycles.

*********** Vector Machines

4. Vector coding of binary search (15 points)

Have vectors:
  VY:              Vector of keys
  VindxA, VindxB:  Representing 64 indices (alternate back & forth to avoid
                   conflicts)
  Vdata:           Fetched values from X

  VindxA = all 0's
  for (int offset = 1<<15; offset > 0; offset = offset>>1) {
      Xoff = X + offset          /* All fetches offset by same amount */
      Vdata = Xoff[VindxA]                 { LVI }
      Compare VY >= Vdata                  { SGEV }
      VindxB = VindxA + offset when mask   { ADDISV #2 }
      Set mask to all 1's                  { CVM }
      Swap references to VindxA & VindxB
  }

Note: in the above code we need to use two index vectors VindxA & VindxB,
since a vector register can't be both the source and the destination of the
same instruction. The actual code would unroll the loop by two, alternating
the use of the two registers.
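For concreteness, here is a scalar C sketch of what the vector loop above
computes for each of the 64 keys (find the largest index i with X[i] <= y).
It is illustrative only: it assumes X is a sorted array of 2^16
double-precision keys with X[0] no larger than any search key, and the
names batch_search, LOG_N, etc. are made up for the sketch.

  #include <stdio.h>

  #define N_KEYS 64
  #define LOG_N  16                 /* X assumed to hold 1 << LOG_N keys */

  /* One pass of the 'offset' loop corresponds to one iteration of the
     vector loop above, applied to all 64 keys at once. */
  void batch_search(const double *X, const double *Y, int *idx)
  {
      for (int k = 0; k < N_KEYS; k++) {
          int i = 0;                              /* one element of VindxA */
          for (int offset = 1 << (LOG_N - 1); offset > 0; offset >>= 1) {
              if (Y[k] >= X[i + offset])          /* the SGEV comparison   */
                  i += offset;                    /* the masked index step */
          }
          idx[k] = i;                             /* largest i with X[i] <= Y[k] */
      }
  }

  int main(void)
  {
      static double X[1 << LOG_N];
      double Y[N_KEYS];
      int idx[N_KEYS];

      for (int i = 0; i < (1 << LOG_N); i++) X[i] = (double)i;  /* sorted */
      for (int k = 0; k < N_KEYS; k++)       Y[k] = 1000.5 + k;

      batch_search(X, Y, idx);
      printf("key %.1f -> index %d\n", Y[0], idx[0]);           /* 1000 */
      return 0;
  }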
5. Timing of iteration (19 points)

It turns out that setting & clearing the vector mask register cannot chain
with any other operations. [You could imagine implementing this register as
if it were a vector register with a 1-bit word size, but this isn't usually
done.]

Assume the iteration starts at time 0:

  Instr.       Start  Finish  Comments
  LVI             0      84   20+64
  SGEV           20      90   Chains with LVI.  6+64
  ADDISV #2      91     161   Stall on mask, 6+64
  CVM           155     155   Must wait until all adds are in the pipe

The next iteration would start on cycle 156.

The biggest limitation is the vector mask register: it prevents chaining
between iterations. Even if the mask register didn't prevent chaining, we
couldn't chain SGEV & ADDISV, since they both have the same source register.

6. Effect of bank conflicts (14 [6/4/4] points)

A. Bank conflicts would be severe. For the first 8 iterations, we are
   guaranteed that the least significant 8 bits of the Vindx elements are
   0, i.e., all references go to a single bank. On iteration 9 we would hit
   2 banks, on iteration 10 we would hit 4, etc. As an approximation,
   consider only the bank conflicts through iteration 9. Hitting a single
   bank slows the LVI by a factor of 4 (note that the memory unit does not
   consolidate repeated accesses to a single memory location). Hitting two
   banks slows it down by ~2. This adds a total of
   8 * 64 * (4 - 1) + 64 * (2 - 1) = 1600 cycles.

B. The basic computation would require ~16*156 = ~2500 cycles, so bank
   conflicts would slow things down by a factor of ~1.6.

C. An idea: initialize VindxA to a range of values (e.g., 0-63). That
   guarantees that the accesses are initially all to different banks, and
   there would be minimal conflicts overall. We'd have to stick large
   "sentinel" values at the end of the data array, and we'd have to handle
   the rare cases that fall below the starting index.

*********** Butterfly Routing

7. Permutation Formulation (20 [10/10] points)

A. Output port y of stage i has bit representation
      <y_{r-1}, ..., y_{j+1}, y_j, y_{j-1}, ..., y_1, y_0>,  where j = r-i-1.
   It is connected to the input port of stage i+1 having bit representation
      <y_{r-1}, ..., y_{j+1}, y_0, y_{j-1}, ..., y_1, y_j>,
   i.e., bits 0 and j are interchanged.

B. Suppose we are routing a message to a destination with bit
   representation <d_{r-1}, ..., d_1, d_0>. A switch has the effect of
   taking a message on input port y, having bit representation
   <y_{r-1}, ..., y_1, y_0>, and sending it out the output port with bit
   address <y_{r-1}, ..., y_1, b>, where b is either 0 or 1. With the
   butterfly routing scheme, b = d_{r-i-1} for a switch at stage i. The
   connection to the next stage will then be to an input port with its
   (r-i-1)st bit equal to d_{r-i-1}. Successive stages preserve the value
   of this bit, and this holds for all values of i. Therefore, the message
   will end up at the destination numbered <d_{r-1}, ..., d_1, d_0>.

8. Reverse routing (15 [10/5] points)

A. For simplicity, keep the stages numbered as they are in the figure, even
   though messages travel from right to left. The routing follows the same
   general scheme as forward routing, but the switching should be based on
   the destination address as follows: at stages i with 1 <= i < r, switch
   on d_{r-i}; at stage 0, switch on d_0.

B. Considering the connections in the reverse direction, those from stage i
   to stage i-1 (1 <= i < r) have the effect of interchanging bits 0 and
   r-i, justifying switching based on d_{r-i}. Stage 0 sets the LSB of the
   destination address.
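To check this argument concretely, here is a small C program that simulates
forward routing under the wiring described above: each switch drives the
low-order bit of the port number with the routing bit d_{r-i-1}, and the
link from stage i to stage i+1 interchanges bits 0 and r-i-1. The function
names and the choice r = 4 are assumptions made for the sketch, not part of
the exam's machine description.

  #include <assert.h>
  #include <stdio.h>

  /* Exchange bits a and b of x. */
  static unsigned swap_bits(unsigned x, int a, int b)
  {
      unsigned ba = (x >> a) & 1u, bb = (x >> b) & 1u;
      x &= ~((1u << a) | (1u << b));
      return x | (ba << b) | (bb << a);
  }

  /* Follow a message from source port src to its final output port. */
  static unsigned route(unsigned src, unsigned dst, int r)
  {
      unsigned port = src;
      for (int i = 0; i < r; i++) {
          unsigned b = (dst >> (r - i - 1)) & 1u;   /* routing bit d_{r-i-1} */
          port = (port & ~1u) | b;                  /* switch drives the LSB */
          if (i < r - 1)                            /* link to stage i+1     */
              port = swap_bits(port, 0, r - i - 1); /* swaps bits 0, r-i-1   */
      }
      return port;
  }

  int main(void)
  {
      const int r = 4;                              /* 2^4 = 16 ports */
      for (unsigned s = 0; s < (1u << r); s++)
          for (unsigned d = 0; d < (1u << r); d++)
              assert(route(s, d, r) == d);          /* every message arrives */
      printf("all %u x %u routes reach their destinations\n",
             1u << r, 1u << r);
      return 0;
  }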
*********** Fetch and Add Synch.

9. MDLX implementation (10 [6/2/2] points)

A. Requires 5 states:

   #1  Ri1->A->ALU   I16->ALU   Add        T->PC
   #2  Read          T->B->AO
   #3  Ri2->A->ALU   Add        Din->B->ALU
   #4  T->A->DO      Write      Din->B->Ri2
   #5  PC->A->ALU,AO IR         +4->ALU    Add   Disp

B. Requires some way to signal that no other operations may touch memory
   between states 2 & 4.

C. Not a significant problem. The page fault event would occur in state 3,
   before any state has been modified. We could take the page fault
   exception and then restart. Locking between states 2 & 4 would guarantee
   that the page cannot be swapped out between these two memory operations.

10. Implementation of Locks (18 points)

The method is like taking a number at a store and waiting until your number
comes up as the next customer to be served. Imagine the lock implemented by
two variables, ticket and next_customer:

  acquire() {
      my_ticket = ticket++;              /* Implemented by fetch and add */
      while (my_ticket != next_customer)
          ;                              /* Busy wait */
  }

  release() {
      next_customer = my_ticket + 1;     /* Doesn't need fetch and add */
  }

Note that processes are guaranteed access in the order in which they fetch
their tickets. As long as fetch-and-add is fair, each process will get a
valid ticket and will eventually acquire the lock.

11. Bus-based cache (12 [6/6] points)

A. There must be some way for the processor to signal the cache controller
   that it wants to do a multicycle atomic memory operation. There must
   also be some way for the controller to remain bus master for multiple
   cycles, or to (temporarily) prevent another cache from grabbing a copy
   of the block owned by this cache.

B. The bus traffic would be fairly minimal. A process executing acquire
   would use one bus cycle to obtain a ticket via fetch-and-add, locking
   out other accesses to this location for a few cycles. It would then
   obtain a clean copy of next_customer and busy wait on a local copy. The
   process releasing the lock would invalidate all the copies of
   next_customer. The waiting processes would then fetch new clean copies
   of next_customer, and the next one in line would enter its critical
   section.

12. Combining in routing network (15 [10/5] points)

A. If two messages F&A(x, a) and F&A(x, b) were received at a switch, these
   would be combined into a single update F&A(x, a+b) sent to the memory.
   When the response (say v) comes back, it would be split into two
   returning messages: v for one request and v + a for the other. This
   requires buffering at the switch to keep track of the original messages
   until the response has been received. We would also require an adder.

B. This hardware is much more complex than that of the CM-5 control
   network, mainly because of the increased buffering. The CM-5 guarantees
   that there can be only one scan going on at a time, and it doesn't have
   to keep track of the origins or destinations of scan operations coming
   into the control network nodes.

   In fact, the RP3 project at IBM was an attempt to build a system based
   on the Ultracomputer. They struggled to build a network that would
   support fetch-and-add, but never got it going.
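As a minimal sketch of the combining behavior described in problem 12, the
following C program mimics a switch that merges two fetch-and-add requests
into one memory update and then splits the response. The names and the
two-request limit are assumptions made for illustration; a real switch
would also need the buffering and adder discussed above.

  #include <stdio.h>

  static int memory_cell = 100;             /* the shared location x */

  /* What the memory module does when a single F&A(x, inc) arrives. */
  static int memory_fetch_and_add(int inc)
  {
      int old = memory_cell;
      memory_cell += inc;
      return old;
  }

  int main(void)
  {
      int a = 3, b = 5;         /* the two requests: F&A(x, a), F&A(x, b) */

      /* Combine: forward one request F&A(x, a+b); buffer 'a' at the
         switch until the response returns. */
      int buffered_a = a;
      int v = memory_fetch_and_add(a + b);  /* single combined update */

      /* Split: the first requester gets v, the second gets v + a, exactly
         as if the two F&As had been serialized at the memory. */
      int reply_first  = v;
      int reply_second = v + buffered_a;

      printf("replies %d and %d, memory now %d\n",
             reply_first, reply_second, memory_cell);
      return 0;
  }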