Class 10. Optimization B.

Last time we talked about the first step in optimization, which is basically:

1. Don't do anything stupid. E.g., don't use bad algorithms or inadvertently
   create quadratic performance (easier than you might think).
2. Make your code compiler friendly. Specifically, do your own code motion /
   strength reduction on procedure calls and memory references.

Going beyond that requires considering how modern processors operate. We call
these "machine-dependent" optimizations, but really the same set of tricks
works across all modern processors.

Sketch out picture:

* Instruction Processing Engine
* Execution Engine. Contains multiple functional units. Pentium 4 has:
    * 2 "simple" integer units. E.g., add / bit ops
    * 1 "complex" integer unit. E.g., multiply / divide
    * FP/SSE3 unit. All FP arithmetic
    * FP Move. All conversions
    * Load. With address computation
    * Store. With address computation

The instruction unit grabs big blocks of instructions from the instruction
stream & converts them into more primitive "uops". E.g.,

    addq %rax, (%rbx)

is converted to

    load  (%rbx), temp
    add   %rax, temp
    store temp, (%rbx)

where each of these operations is handled by a different functional unit.
(The exact uop format is kept secret.)

Registers are handled by "renaming". Think of the name of a register as only
a tag describing how instructions produce & consume results. E.g.,

    addq %rax, %rbx      defines RBX1
    andq %rbx, %rdx      uses    RBX1
    mulq %rcx, %rbx      defines RBX2
    xorq %rbx, %rdi      uses    RBX2

The combination of operand names and their program order determines the
operand relations. This can be expanded into a "dataflow" view:

    RAX0 -- + -- RBX0
            |
          RBX1 -- & -- RDX0
            |          |
    RCX0 -- *        RDX1
            |
          RBX2 -- ^ -- RDI0
                       |
                     RDI1

Goal is to extract all possible parallelism.

What about branches? Branch prediction guesses which way they go. The
processor sets up & performs the computation speculatively. If the guess was
a mistake, it throws away the partial results.

Example from the book: compute the sum or product of all elements in a
"vector".

    /* Compute sum/product of all elements in vector */
    void combine4(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        for (i = 0; i < length; i++) {
            x = x OPER data[i];
        }
        *dest = x;
    }

Can use declarations to get the different data types and operations:

    /* Integer Sum */
    typedef int data_t;
    #define OPER  +
    #define IDENT 0

    /* Floating Point Product */
    typedef float data_t;
    #define OPER  *
    #define IDENT 1

The above code already avoids the stupid things.

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          2.20    10.00   5.00   7.00

Measured in "Cycles Per Element" (CPE): a metric that expresses execution
time at the level we want to optimize, the loop.

Why this performance? With the Pentium 4, we get the following latencies:

    Int +:  0.5   (but can't get data out of memory fast enough)
    Int *: 10.0
    FP  +:  5.0
    FP  *:  7.0

So for the last 3 cases, we're up against the latency of a single functional
unit:

    a[0]  a[1]  a[2]        a[n-1]
      |     |     |            |
    --*-----*-----*--- ... ----*-->

Can't do any better because of the sequential dependencies.

Let's first try to speed up the integer sum. Inner loop:

    .L543:
        movslq  %edx,%rax
        incl    %edx
        addl    (%rsi,%rax,4), %ecx
        cmpl    %ebp, %edx
        jl      .L543

5 instructions --> 6 uops (the addl with a memory operand splits into a load
plus an add).
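For experimenting with these numbers yourself, here is a minimal,
self-contained sketch of the vector abstraction that combine4 assumes. The
vec_ptr / vec_length / get_vec_start interface follows the book; the new_vec
constructor and the crude clock()-based main are hypothetical stand-ins for
the course's cycle-accurate measurement harness.

    /* Minimal sketch of the book's vector ADT plus a crude timing
     * harness. new_vec and main are hypothetical; real CPE numbers
     * come from a cycle counter, not clock(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef int data_t;
    #define OPER  +
    #define IDENT 0

    typedef struct {
        int length;
        data_t *data;
    } vec_rec, *vec_ptr;

    int vec_length(vec_ptr v)        { return v->length; }
    data_t *get_vec_start(vec_ptr v) { return v->data; }

    vec_ptr new_vec(int length)      /* hypothetical constructor */
    {
        vec_ptr v = malloc(sizeof(vec_rec));
        v->length = length;
        v->data = calloc(length, sizeof(data_t));
        return v;
    }

    void combine4(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        for (i = 0; i < length; i++)
            x = x OPER data[i];
        *dest = x;
    }

    int main(void)
    {
        int n = 1 << 20;
        vec_ptr v = new_vec(n);
        data_t result;
        clock_t start = clock();
        combine4(v, &result);
        clock_t end = clock();
        /* Wall-clock seconds, not CPE: good enough to see the trends */
        printf("result = %d, time = %f s\n",
               result, (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }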
Loop unrolling: get more computation done in each iteration. Saves loop
overhead.

    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = x OPER data[i] OPER data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OPER data[i];
        }
        *dest = x;
    }

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          1.50    10.00   5.00   7.00

    .L518:
        movslq  %edx,%rax
        addl    $2, %edx
        addl    (%rsi,%rax,4), %ecx
        addl    4(%rsi,%rax,4), %ecx
        cmpl    %r12d, %edx
        jl      .L518

6 instructions --> 7 uops. Getting 2 iterations / 3 cycles. Note that this
doesn't help the cases that are limited by sequential dependencies.

Getting more parallelism: need to break the sequential dependencies.

    a[0]  a[2]  a[4]        a[n-2]
      |     |     |            |
    --*-----*-----*--- ... ----*---+
                                   |
    --*-----*-----*--- ... ----*---*-->
      |     |     |            |
    a[1]  a[3]  a[5]        a[n-1]

Accumulate the even and odd elements separately:

    /* Unroll loop by 2, 2-way parallelism */
    void combine6(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *data = get_vec_start(v);
        data_t x0 = IDENT;
        data_t x1 = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x0 = x0 OPER data[i];
            x1 = x1 OPER data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 = x0 OPER data[i];
        }
        *dest = x0 OPER x1;
    }

Warning: this changes the associativity. Results are the same for Int, but
not for FP.

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          1.50    5.00    2.50   3.50

See that we get exactly twice the performance in the long-latency cases.

    void unroll2aa_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = x OPER (data[i] OPER data[i+1]);
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OPER data[i];
        }
        *dest = x;
    }

Gets almost the same effect with a different associativity:

    a[0]--*--a[1]   a[2]--*--a[3]   a[4]--*--a[5]  ...  a[n-2]--*--a[n-1]
          |               |               |                     |
          +-------*-------+-------*-------+------- ... ---------*-->

Gets depth N/2.

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          1.56    5.00    2.75   3.62

Not quite as good, but very close. Not as clean a split into two parallel
streams.

We can push the degree of unrolling and the degree of parallelism ("SS", for
superscalar) further. Only rule: the unrolling factor must be a multiple of
the degree of parallelism. In the tables below, rows give SS, columns give
the unrolling factor; blank entries are invalid or unmeasured combinations.

Saltwater Fish Machines (Intel Nocona)

FP *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  7.00  7.00  7.01  7.00                          |
    |  2 |        3.50        3.50  3.50                    |
    |  3 |              2.34                                |
    |  4 |                    2.01        2.00              |
    |  6 |                          2.00              2.00  |
    |  8 |                                2.01              |
    | 10 |                                      2.00        |
    | 12 |                                            2.00  |
    +----+-------------------------------------------------+

Note that 2.00 seems to be a limit.

FP +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  5.00  5.00  5.02  5.00                          |
    |  2 |        2.50        2.51  2.51                    |
    |  3 |              2.00                                |
    |  4 |                    2.01        2.00              |
    |  6 |                          2.00              1.99  |
    |  8 |                                2.01              |
    | 10 |                                      2.00        |
    | 12 |                                            2.00  |
    +----+-------------------------------------------------+

Also limited to 2.0. Why? The FP unit can only accept a new operation every
2 clock cycles (its issue time is 2).
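As a sketch of what pushing this further looks like, here is the SS=4,
Unroll=4 point from the tables: unroll by 4 with 4 parallel accumulators.
The function name is made up; the pattern simply extends combine6, with the
same associativity caveat for FP.

    /* Sketch: unroll by 4, 4-way parallelism (the 4x4 table entry).
     * Hypothetical name; pattern follows combine6. */
    void combine_u4s4(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length - 3;
        data_t *data = get_vec_start(v);
        data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;
        int i;
        /* Combine 4 elements at a time, one per accumulator */
        for (i = 0; i < limit; i += 4) {
            x0 = x0 OPER data[i];
            x1 = x1 OPER data[i+1];
            x2 = x2 OPER data[i+2];
            x3 = x3 OPER data[i+3];
        }
        /* Finish any remaining elements */
        for (; i < length; i++)
            x0 = x0 OPER data[i];
        *dest = (x0 OPER x1) OPER (x2 OPER x3);
    }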
Saltwater Fish Machines (Intel Nocona)

Int *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 | 10.00 10.00 10.00 10.01                          |
    |  2 |        5.00        5.01  5.00                    |
    |  3 |              3.33                                |
    |  4 |                    2.50        2.51              |
    |  6 |                          1.67              1.67  |
    |  8 |                                1.25              |
    | 10 |                                      1.09        |
    | 12 |                                            1.14  |
    +----+-------------------------------------------------+

Gets close to the theoretical limit of 1.0.

Int +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  2.20  1.50  1.10  1.03                          |
    |  2 |        1.50        1.10  1.03                    |
    |  3 |              1.34                                |
    |  4 |                    1.09        1.03              |
    |  6 |                          1.01              1.01  |
    |  8 |                                1.03              |
    | 10 |                                      1.04        |
    | 12 |                                            1.11  |
    +----+-------------------------------------------------+

Unrolling gives the necessary performance. Note that a high degree of
unrolling limits performance in the shorter cases (more elements end up in
the leftover cleanup loop).

General Results for AMD Opteron

This is Intel's main competition. Note how much faster it is in CPE terms,
but realize that its clock rate is about 1/2 of Intel's.

FP *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  4.00  4.00  4.00  4.01                          |
    |  2 |        2.00        2.00  2.00                    |
    |  3 |              1.34                                |
    |  4 |                    1.00        1.00              |
    |  6 |                          1.00              1.00  |
    |  8 |                                1.00              |
    | 10 |                                      1.00        |
    | 12 |                                            1.00  |
    +----+-------------------------------------------------+

Latency = 4.0, issue time = 1.0.

FP +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  4.00  4.00  4.00  4.00                          |
    |  2 |        2.01        2.00  2.01                    |
    |  3 |              1.33                                |
    |  4 |                    1.01        1.00              |
    |  6 |                          1.00              1.00  |
    |  8 |                                1.00              |
    | 10 |                                      1.00        |
    | 12 |                                            1.00  |
    +----+-------------------------------------------------+

Latency = 4.0, issue time = 1.0.

Int *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  3.00  3.00  2.25  1.88                          |
    |  2 |        2.33        2.00  1.75                    |
    |  3 |              2.00                                |
    |  4 |                    1.75        1.38              |
    |  6 |                          1.50              1.50  |
    |  8 |                                1.75              |
    | 10 |                                      1.30        |
    | 12 |                                            1.33  |
    +----+-------------------------------------------------+

Latency = 3.0, issue time = 1.0.

Int +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  2.32  1.50  0.75  0.63                          |
    |  2 |        1.50        0.83  0.63                    |
    |  3 |              1.00                                |
    |  4 |                    1.00        0.63              |
    |  6 |                          0.83              0.67  |
    |  8 |                                0.63              |
    | 10 |                                      0.60        |
    | 12 |                                            0.85  |
    +----+-------------------------------------------------+

Wow, got it below 1.0! That means it can do more than one memory read per
cycle.

Performance of some machines and some functions (latency / issue time, in
cycles):

                Nocona     Opteron    Pentium M   Pentium III
    Int +       0.5/0.5    1/1?       1/1         1/1
    Int *       10/1       3/1        4/1         4/1
    Int /       36/36      46/46      20/20       36/36
    Long /      106/106    76/76
    FP +        5/2        4/1        3/1         3/1
    FP *        7/2        4/1        5/2         5/2
    Float /     32         14         36          36
    Double /    46         17         36          36

Here's something from an old exam problem: factorial computation.

    int rfact(int n)
    {
        if (n <= 1)
            return 1;
        return n * rfact(n-1);
    }

Fish CPE = 15.5

    int fact(int n)
    {
        int i;
        int result = 1;
        for (i = n; i > 0; i--)
            result = result * i;
        return result;
    }

Fish CPE = 10.0

    int fact_u3a(int n)
    {
        int i;
        int result = 1;
        for (i = n; i >= 3; i-=3) {
            result = result * i * (i-1) * (i-2);
        }
        for (; i > 0; i--)
            result *= i;
        return result;
    }

Fish CPE = 10.0 (didn't help)

    int fact_u3b(int n)
    {
        int i;
        int result = 1;
        for (i = n; i >= 3; i-=3) {
            result *= i * (i-1) * (i-2);
        }
        for (; i > 0; i--)
            result *= i;
        return result;
    }

Fish CPE = 3.34. Wow! What happened?

Got parallelism through a different associativity. Each group's subproduct
i * (i-1) * (i-2) is computed off the critical path; the sequential chain
through result sees only one multiply per 3 elements:

    (100 * 99 * 98)   (97 * 96 * 95)   (94 * 93 * 92)
           |                 |                |
    1 -----*-----------------*----------------*------ ...

Breaks the depth to N/3. Note that this too changes the associativity (fine
for ints; would change results for FP).
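The same effect could be made explicit with separate accumulators, in the
style of combine6. This variant (fact_u3c) is a hypothetical sketch, not
from the exam, and was not measured; it carries the same associativity
caveat.

    /* Sketch: 3-way unrolling with 3 explicit accumulators.
     * Hypothetical variant, not from the exam problem. */
    int fact_u3c(int n)
    {
        int i;
        int r0 = 1, r1 = 1, r2 = 1;
        for (i = n; i >= 3; i -= 3) {
            r0 *= i;        /* three independent multiply chains */
            r1 *= i - 1;
            r2 *= i - 2;
        }
        for (; i > 0; i--)  /* finish any remaining factors */
            r0 *= i;
        return r0 * r1 * r2;  /* combine at the end */
    }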
Branch Prediction

The processor tries to guess which way branches will go.

Initially: backward taken, forward not taken.
Refinement: keep state of how branches have gone; build simple predictors
that can recognize regular patterns.

Examples:

    int max(int x, int y)
    {
        return (x < y) ? y : x;
    }

(Compiles to a conditional move.)

    int bmax(int x, int y)
    {
        int mask = -(x > y);
        return (mask & x) | (~mask & y);
    }

(Bit tricks similar to Lab 1.)

    volatile int tcnt = 0;

    int bjtmax(int x, int y)
    {
        if (x < y) {
            tcnt++;
            return y;
        }
        return x;
    }

Forces the compiler to use branching code. Penalty (the counter update) on
the < case.

    volatile int ecnt = 0;

    int bjemax(int x, int y)
    {
        if (x < y) {
            return y;
        }
        ecnt++;
        return x;
    }

Forces the compiler to use branching code. Penalty on the > case.

Measurements, over four argument patterns:

    always <        max(0, +1)
    always >        max(0, -1)
    alternate < >   max(0, a+-1)
    random < >      max(0, r+-1)

    Function: max     max(0, +1)     CPT = 14.19
    Function: max     max(0, -1)     CPT = 14.19
    Function: max     max(0, a+-1)   CPT = 14.20
    Function: max     max(0, r+-1)   CPT = 14.20

Conclusion: oblivious to direction. A conditional move doesn't require
branch prediction.

    Function: bmax    max(0, +1)     CPT = 17.23
    Function: bmax    max(0, -1)     CPT = 17.25
    Function: bmax    max(0, a+-1)   CPT = 17.23
    Function: bmax    max(0, r+-1)   CPT = 17.25

Better to use conditional moves than the bit tricks here.

    Function: bjtmax  max(0, +1)     CPT = 12.27
    Function: bjtmax  max(0, -1)     CPT = 16.19
    Function: bjtmax  max(0, a+-1)   CPT = 14.45
    Function: bjtmax  max(0, r+-1)   CPT = 32.09

<: 12 cycles. >: 16 cycles. Alternating is the average of the two --> the
branch predictor learned the pattern. Random = 32 cycles, about 16 cycles
worse than the average. Prediction rate = 0.5 ==> a ~32-cycle branch
misprediction penalty (0.5 misses x ~32 cycles = ~16 extra cycles per test
on average).

    Function: bjemax  max(0, +1)     CPT = 16.19
    Function: bjemax  max(0, -1)     CPT = 12.27
    Function: bjemax  max(0, a+-1)   CPT = 14.22
    Function: bjemax  max(0, r+-1)   CPT = 31.33

Same behavior as above, with the roles of < and > swapped.

Important lessons:
* Can get major speedups.
* Understand parallelism.
* Don't need detailed knowledge of the hardware.
* Mispredicted branches are the worst performance enemy.
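Appendix: for reference, a sketch of the kind of driver that could generate
the four argument patterns measured above. Everything here (run_pattern,
TRIALS, sink) is made up for illustration; the course's real harness
measures CPT with a cycle counter and subtracts the loop overhead.

    /* Hypothetical driver sketch for the four branch patterns above. */
    #include <stdlib.h>

    #define TRIALS 1000000

    static volatile int sink;   /* keeps calls from being optimized away */

    static int max(int x, int y) { return (x < y) ? y : x; }

    static void run_pattern(int (*fn)(int, int), char pattern)
    {
        int i;
        for (i = 0; i < TRIALS; i++) {
            int arg;
            switch (pattern) {
            case '+': arg = +1; break;                 /* always <  */
            case '-': arg = -1; break;                 /* always >  */
            case 'a': arg = (i & 1) ? +1 : -1; break;  /* alternate */
            default:  arg = (rand() & 1) ? +1 : -1;    /* random    */
            }
            sink = fn(0, arg);   /* max(0, +/-1), as in the measurements */
        }
    }

    int main(void)
    {
        run_pattern(max, 'r');
        return 0;
    }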