Class 09 Optimization A Focus on "machine-independent" optimizations, ones that make sense for any combination of machine & compiler. In ideal world, would let compilers figure out best way to transform our source code into efficient machine code. In reality, compilers have shortcomings that can be overcome by rewriting source code. We refer to these as "optimization blockers". Warning: code can become very obscure. Should limit hacking to parts of code that are performance critical. Optimizing Compilers * Handle low-level details well. Register allocation, code selection & ordering, eliminating minor inefficiencies * Don't (usually) improve asymptotic efficiency * Have difficulty overcoming "optimization blockers" * Work under constraint: Don't change valid program behavior, even for conditions that would occur under pathological conditions. * Behavior that may be obvious to programmer can be obfuscated by languages and coding styles. * Analysis performed only at procedural level Too expense to do whole-program analysis Can't rely on fixed external functions When in doubt, don't do an optimization. Code motion: moving computation to place where performed less frequently. E.g. funding element in dynamically allocated array long *a = calloc(n*n, sizeof(double)); Element i,j is at a[i*n+j] /* Set row i of n X n matrix a to row vector b */ void set_row(double *a, double *b, long i, long n) { long j; for (j = 0; j < n; j++) a[n*i+j] = b[j]; } Appears to require n*i in inner loop. But notice that n*i doesn't change set_row: xorl %r8d, %r8d cmpq %rcx, %r8 jge .L7 movq %rcx, %rax imulq %rdx, %rax # n*i outside of inner loop leaq (%rdi,%rax,8), %rdx # rowp = A + n*i*8 .L5: movq (%rsi,%r8,8), %rax incq %r8 movq %rax, (%rdx) addq $8, %rdx cmpq %rcx, %r8 jl .L5 .L7: rep ; ret (Notice that this is floating point code, but doesn't seem to do any floating point operations. That's because it just treats doubles as 8-byte values and moves them between memory and GPRs) C version of assembly code void set_rowp(double *a, double *b, long i, long n) { long j; long ni = n*i; double *rowp = a+ni; for (j = 0; j < n; j++) *rowp++ = b[j]; } Reduction in Strength. Use less costly operation when possible. E.g., shift and add to do constant multiply, shift to do divide by power of two /* Set all rows of matrix a to row vector b */ void set_rows(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) { for (j = 0; j < n; j++) a[n*i+j] = b[j]; } } set_rows: # No multiplies anywhere in code! xorl %r9d, %r9d cmpq %rdx, %r9 jge .L28 xorl %r10d, %r10d .L26: xorl %r8d, %r8d cmpq %rdx, %r8 jge .L30 leaq (%rdi,%r10,8), %rcx .L25: movq (%rsi,%r8,8), %rax incq %r8 movq %rax, (%rcx) addq $8, %rcx cmpq %rdx, %r8 jl .L25 .L30: incq %r9 addq %rdx, %r10 cmpq %rdx, %r9 jl .L26 .L28: rep ; ret Approximate C version Keep track of n*i. Initially i = 0 ==> n*i = 0. When increment i, increase n*i by n. void set_rowsp(double *a, double *b, long n) { long i, j; long ni = 0; for (i = 0; i < n; i++) { double *rowp = a+ni; for (j = 0; j < n; j++) *rowp++ = b[j]; ni += n; } } Common Subexpressions. Places where single computation can be shared in several places. Compilers have basic capability, but not very clever about associativity and commutativity. Assum val is pointer to n*n array. Want to sum elements surrounding i,j /* Sum NSEW neighbors of array element */ double sum_NSEW(double *val, long i, long j, long n) { double up = val[(i-1)*n+j]; double down = val[(i+1)*n+j]; double left = val[i*n+j-1]; double right = val[i*n+j+1]; return up+down+left+right; } Seems to require 4 multiplications. Actual code uses three sum_NSEW: leaq 1(%rsi), %rax leaq -1(%rsi), %r8 imulq %rcx, %rsi # i*n imulq %rcx, %rax # (i+1)*n imulq %rcx, %r8 # (i-1)*n addq %rdx, %rsi addq %rdx, %rax movlpd (%rdi,%rax,8), %xmm0 # Floating point register addq %rdx, %r8 addsd (%rdi,%r8,8), %xmm0 # FP add addsd -8(%rdi,%rsi,8), %xmm0 # FP add addsd 8(%rdi,%rsi,8), %xmm0 # FP add ret Can be more clever. Recognize that basic computation can be improved by locating element i,j and offsetting from there: /* Sum NSEW neighbors of array element */ double sum_NSEW_opt(double *val, long i, long j, long n) { long inj = i*n+j; double up = val[inj - n]; double down = val[inj + n]; double left = val[inj - 1]; double right = val[inj + 1]; return up+down+left+right; } sum_NSEW_opt: imulq %rcx, %rsi # Single multiply addq %rdx, %rsi movq %rsi, %rax subq %rcx, %rax leaq (%rsi,%rcx), %rcx movlpd (%rdi,%rcx,8), %xmm0 addsd (%rdi,%rax,8), %xmm0 addsd -8(%rdi,%rsi,8), %xmm0 addsd 8(%rdi,%rsi,8), %xmm0 ret Optimization Blocker: Procedure calls Mostly will treat procedure call as black box. Will make whatever calls appear in code. Example for 213 student lab code /* Convert string to lower case */ void lower1(char *s) { int i; for (i = 0; i < strlen(s); i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } Lower Case Conversion Performance (Seconds) Length lower1 5000 0.027318 10000 0.109128 20000 0.462196 40000 1.850365 80000 7.424950 See quadratic performance. As double array size, quadruple seconds. Look at assembly code to see what's going on: /* Convert string to lower case */ # lower1 inner Loop .L79: movzbl (%rbx,%rbp), %edx leal -65(%rdx), %eax cmpb $25, %al ja .L77 leal 32(%rdx), %eax movb %al, (%rbx,%rbp) .L77: incl %r12d movq %rbp, %rdi movslq %r12d,%rbx call strlen # strlen being called in inner loop cmpq %rax, %rbx jb .L79 Look at how the loop gets compiled by do-while template (similar for others) void lower1_goto(char *s) { int i = 0; if (i >= strlen(s)) return; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < strlen(s)) # Call part of loop goto loop; } In general, the update expression of for loop is part of loop. In C, can only determine string length by traversing entire string /* Compute length of string */ size_t my_strlen(const char *s) { int length = 0; while (*s != '\0') { s++; length++; } return length; } Overall performance is quadratic: n calls to strlen, each taking time n. Obviously, we want to do code motion: /* Convert string to lower case */ void lower2(char *s) { int i = 0; int len = strlen(s); for (i = 0; i < len; i++) if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); } # lower2 inner Loop .L95: movslq %ebx,%rcx movzbl (%rcx,%rbp), %edx leal -65(%rdx), %eax cmpb $25, %al ja .L89 leal 32(%rdx), %eax movb %al, (%rcx,%rbp) .L89: incl %ebx cmpl %esi, %ebx jl .L95 void lower2_goto(char *s) { int i = 0; size_t len = strlen(s); if (i >= len) return; loop: if (s[i] >= 'A' && s[i] <= 'Z') s[i] -= ('A' - 'a'); i++; if (i < len) goto loop; } Lower Case Conversion Performance (Seconds) Length lower1 lower2 5000 0.027318 0.000012 10000 0.109128 0.000027 20000 0.462196 0.000050 40000 1.850365 0.000100 80000 7.424950 0.000202 Linear performance. Why can't GCC do this code motion? What if strlen incremented counter? Even if libc strlen doesn't do this, might link in some version that does. Would take subtle analysis to see that length of string invariant during this code, since we are actually modifying the string elements & could easily change some character to null. Optimization Blocker: Memory Aliasing /* Sum rows is of n X n matrix a and store in vector b */ void sum_rows1(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) { b[i] = 0; for (j = 0; j < n; j++) b[i] += a[i*n + j]; } } # sum_rows1 inner loop .L53: addsd (%rcx), %xmm0 # FP add addq $8, %rcx decq %rax movsd %xmm0, (%rsi,%r8,8) # FP store jne .L53 /* Sum rows is of n X n matrix a */ void sum_rows1p(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) { double val = 0; b[i] = val; for (j = 0; j < n; j++) { val += a[i*n + j]; b[i] = val; } } } This looks weird. Why does it keep storing intermediate values to b[i]? Reason: What if a & b overlap. E.g.: a: [0 1 2][4 8 16][32 64 128] b = a+3 Intitially b = [4 8 16] i = 0: get b = [3 8 16] i = 1: j = 0: b = [3 3 16] j = 1: b = [3 6 16] j = 2: b = [3 22 16] i = 2: get b = [3 22 224] /* Sum rows is of n X n matrix a */ void sum_rows2(double *a, double *b, long n) { long i, j; for (i = 0; i < n; i++) { double v = 0; for (j = 0; j < n; j++) v += a[i*n + j]; b[i] = v; } } Intitially b = [4 8 16] i = 0: get b = [3 8 16] i = 1: get b = [3 27 16] i = 2: get b = [3 27 224] # sum_rows2 inner loop .L66: addsd (%rcx), %xmm0 addq $8, %rcx decq %rax jne .L66 Pathological? Yes! But, compiler must assume aliasing is possible. General rule: Accumulate in temporaries.