Class 10. Optimization B.

Last time we talked about the first step in optimization, which is basically:

1. Don't do anything stupid. E.g., don't use bad algorithms or inadvertently
   create quadratic performance (easier than you might think).
2. Make your code compiler friendly. Specifically, do your own code motion /
   strength reduction on procedure calls and memory references.

Going beyond that requires considering how modern processors operate. We call
these "machine-dependent" optimizations, but really the same set of tricks
works across all modern processors.

Sketch out picture:

* Instruction Processing Engine
* Execution Engine. Contains multiple functional units. Pentium 4 has:
    * 2 "simple" integer units. E.g., add / bit ops
    * 1 "complex" integer unit. E.g., multiply / divide
    * FP/SSE3 unit. All FP arithmetic
    * FP Move. All conversions
    * Load. With address computation
    * Store. With address computation

The instruction unit grabs big blocks of instructions from the instruction
stream & converts them into more primitive "uops". E.g.,

    addq %rax, (%rbx)

is converted to

    load  (%rbx), temp
    add   %rax, temp
    store temp, (%rbx)

where each of these operations is handled by a different functional unit.
(The exact uop format is kept secret.)

Registers are handled by "renaming". Think of the name of a register as only
a tag describing how instructions produce & consume results. E.g.,

    addq %rax, %rbx      defines RBX1
    andq %rbx, %rdx      uses    RBX1
    mulq %rcx, %rbx      defines RBX2
    xorq %rbx, %rdi      uses    RBX2

The combination of operand names and their program order determines the
operand relations. This can be expanded into a "dataflow" view:

    RAX0 -- + -- RBX0
            |
          RBX1 -- & -- RDX0
            |          |
    RCX0 -- *        RDX1
            |
          RBX2 -- ^ -- RDI0
                       |
                     RDI1

Goal is to extract all possible parallelism.

What about branches? Branch prediction guesses which way they go. The
processor sets up & performs the computation speculatively. If the guess was
a mistake, it throws away the partial results.

Example from the book: compute the sum or product of all elements in a
"vector".

    /* Compute sum/product of all elements in vector */
    void combine4(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        for (i = 0; i < length; i++) {
            x = x OPER data[i];
        }
        *dest = x;
    }

Can use declarations to get the different data types and operations:

    /* Integer Sum */
    typedef int data_t;
    #define OPER  +
    #define IDENT 0

    /* Floating Point Product */
    typedef float data_t;
    #define OPER  *
    #define IDENT 1

The above code already avoids the stupid things.

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          2.20    10.00   5.00   7.00

Measured in "Cycles Per Element" (CPE): a metric that expresses execution
time at the level we want to optimize, the loop.

Why this performance? With the Pentium 4, we get the following latencies:

    Int +:  0.5   (but can't get data out of memory fast enough)
    Int *: 10.0
    FP  +:  5.0
    FP  *:  7.0

So for the last 3 cases, we're up against the latency of a single functional
unit:

    a[0]  a[1]  a[2]        a[n-1]
      |     |     |            |
    --*-----*-----*--- ... ----*-->

Can't do any better because of the sequential dependencies.

Let's first try to speed up the integer sum. Inner loop:

    .L543:
        movslq  %edx,%rax
        incl    %edx
        addl    (%rsi,%rax,4), %ecx
        cmpl    %ebp, %edx
        jl      .L543

5 instructions --> 6 uops (the addl with a memory operand splits into a load
plus an add).
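For experimenting with these numbers yourself, here is a minimal,
self-contained sketch of the vector abstraction that combine4 assumes. The
vec_ptr / vec_length / get_vec_start interface follows the book; the new_vec
constructor and the crude clock()-based main are hypothetical stand-ins for
the course's cycle-accurate measurement harness.

    /* Minimal sketch of the book's vector ADT plus a crude timing
     * harness. new_vec and main are hypothetical; real CPE numbers
     * come from a cycle counter, not clock(). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    typedef int data_t;
    #define OPER  +
    #define IDENT 0

    typedef struct {
        int length;
        data_t *data;
    } vec_rec, *vec_ptr;

    int vec_length(vec_ptr v)        { return v->length; }
    data_t *get_vec_start(vec_ptr v) { return v->data; }

    vec_ptr new_vec(int length)      /* hypothetical constructor */
    {
        vec_ptr v = malloc(sizeof(vec_rec));
        v->length = length;
        v->data = calloc(length, sizeof(data_t));
        return v;
    }

    void combine4(vec_ptr v, data_t *dest)
    {
        int i;
        int length = vec_length(v);
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        for (i = 0; i < length; i++)
            x = x OPER data[i];
        *dest = x;
    }

    int main(void)
    {
        int n = 1 << 20;
        vec_ptr v = new_vec(n);
        data_t result;
        clock_t start = clock();
        combine4(v, &result);
        clock_t end = clock();
        /* Wall-clock seconds, not CPE: good enough to see the trends */
        printf("result = %d, time = %f s\n",
               result, (double)(end - start) / CLOCKS_PER_SEC);
        return 0;
    }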
Loop unrolling: get more computation done in each iteration. Saves loop
overhead.

    void unroll2a_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = x OPER data[i] OPER data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OPER data[i];
        }
        *dest = x;
    }

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          1.50    10.00   5.00   7.00

    .L518:
        movslq  %edx,%rax
        addl    $2, %edx
        addl    (%rsi,%rax,4), %ecx
        addl    4(%rsi,%rax,4), %ecx
        cmpl    %r12d, %edx
        jl      .L518

6 instructions --> 7 uops. Getting 2 iterations / 3 cycles. Note that this
doesn't help the cases that are limited by sequential dependencies.

Getting more parallelism: need to break the sequential dependencies.

    a[0]  a[2]  a[4]        a[n-2]
      |     |     |            |
    --*-----*-----*--- ... ----*---+
                                   |
    --*-----*-----*--- ... ----*---*-->
      |     |     |            |
    a[1]  a[3]  a[5]        a[n-1]

Accumulate the even and odd elements separately:

    /* Unroll loop by 2, 2-way parallelism */
    void combine6(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *data = get_vec_start(v);
        data_t x0 = IDENT;
        data_t x1 = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x0 = x0 OPER data[i];
            x1 = x1 OPER data[i+1];
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x0 = x0 OPER data[i];
        }
        *dest = x0 OPER x1;
    }

Warning: this changes the associativity. Results are the same for Int, but
not for FP.

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          1.50    5.00    2.50   3.50

See that we get exactly twice the performance in the long-latency cases.

    void unroll2aa_combine(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length-1;
        data_t *data = get_vec_start(v);
        data_t x = IDENT;
        int i;
        /* Combine 2 elements at a time */
        for (i = 0; i < limit; i+=2) {
            x = x OPER (data[i] OPER data[i+1]);
        }
        /* Finish any remaining elements */
        for (; i < length; i++) {
            x = x OPER data[i];
        }
        *dest = x;
    }

Gets almost the same effect with a different associativity:

    a[0]--*--a[1]   a[2]--*--a[3]   a[4]--*--a[5]  ...  a[n-2]--*--a[n-1]
          |               |               |                     |
          +-------*-------+-------*-------+------- ... ---------*-->

Gets depth N/2.

On Saltwater Fish Machine:

    CPE:  Int +   Int *   FP +   FP *
          1.56    5.00    2.75   3.62

Not quite as good, but very close. Not as clean a split into two parallel
streams.

We can push the degree of unrolling and the degree of parallelism ("SS", for
superscalar) further. Only rule: the unrolling factor must be a multiple of
the degree of parallelism. In the tables below, rows give SS, columns give
the unrolling factor; blank entries are invalid or unmeasured combinations.

Saltwater Fish Machines (Intel Nocona)

FP *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  7.00  7.00  7.01  7.00                          |
    |  2 |        3.50        3.50  3.50                    |
    |  3 |              2.34                                |
    |  4 |                    2.01        2.00              |
    |  6 |                          2.00              2.00  |
    |  8 |                                2.01              |
    | 10 |                                      2.00        |
    | 12 |                                            2.00  |
    +----+-------------------------------------------------+

Note that 2.00 seems to be a limit.

FP +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  5.00  5.00  5.02  5.00                          |
    |  2 |        2.50        2.51  2.51                    |
    |  3 |              2.00                                |
    |  4 |                    2.01        2.00              |
    |  6 |                          2.00              1.99  |
    |  8 |                                2.01              |
    | 10 |                                      2.00        |
    | 12 |                                            2.00  |
    +----+-------------------------------------------------+

Also limited to 2.0. Why? The FP unit can only accept a new operation every
2 clock cycles (its issue time is 2).
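As a sketch of what pushing this further looks like, here is the SS=4,
Unroll=4 point from the tables: unroll by 4 with 4 parallel accumulators.
The function name is made up; the pattern simply extends combine6, with the
same associativity caveat for FP.

    /* Sketch: unroll by 4, 4-way parallelism (the 4x4 table entry).
     * Hypothetical name; pattern follows combine6. */
    void combine_u4s4(vec_ptr v, data_t *dest)
    {
        int length = vec_length(v);
        int limit = length - 3;
        data_t *data = get_vec_start(v);
        data_t x0 = IDENT, x1 = IDENT, x2 = IDENT, x3 = IDENT;
        int i;
        /* Combine 4 elements at a time, one per accumulator */
        for (i = 0; i < limit; i += 4) {
            x0 = x0 OPER data[i];
            x1 = x1 OPER data[i+1];
            x2 = x2 OPER data[i+2];
            x3 = x3 OPER data[i+3];
        }
        /* Finish any remaining elements */
        for (; i < length; i++)
            x0 = x0 OPER data[i];
        *dest = (x0 OPER x1) OPER (x2 OPER x3);
    }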
Saltwater Fish Machines (Intel Nocona)

Int *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 | 10.00 10.00 10.00 10.01                          |
    |  2 |        5.00        5.01  5.00                    |
    |  3 |              3.33                                |
    |  4 |                    2.50        2.51              |
    |  6 |                          1.67              1.67  |
    |  8 |                                1.25              |
    | 10 |                                      1.09        |
    | 12 |                                            1.14  |
    +----+-------------------------------------------------+

Gets close to the theoretical limit of 1.0.

Int +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  2.20  1.50  1.10  1.03                          |
    |  2 |        1.50        1.10  1.03                    |
    |  3 |              1.34                                |
    |  4 |                    1.09        1.03              |
    |  6 |                          1.01              1.01  |
    |  8 |                                1.03              |
    | 10 |                                      1.04        |
    | 12 |                                            1.11  |
    +----+-------------------------------------------------+

Unrolling gives the necessary performance. Note that a high degree of
unrolling limits performance in the shorter cases (more elements end up in
the leftover cleanup loop).

General Results for AMD Opteron

This is Intel's main competition. Note how much faster it is in CPE terms,
but realize that its clock rate is about 1/2 of Intel's.

FP *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  4.00  4.00  4.00  4.01                          |
    |  2 |        2.00        2.00  2.00                    |
    |  3 |              1.34                                |
    |  4 |                    1.00        1.00              |
    |  6 |                          1.00              1.00  |
    |  8 |                                1.00              |
    | 10 |                                      1.00        |
    | 12 |                                            1.00  |
    +----+-------------------------------------------------+

Latency = 4.0, issue time = 1.0.

FP +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  4.00  4.00  4.00  4.00                          |
    |  2 |        2.01        2.00  2.01                    |
    |  3 |              1.33                                |
    |  4 |                    1.01        1.00              |
    |  6 |                          1.00              1.00  |
    |  8 |                                1.00              |
    | 10 |                                      1.00        |
    | 12 |                                            1.00  |
    +----+-------------------------------------------------+

Latency = 4.0, issue time = 1.0.

Int *
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  3.00  3.00  2.25  1.88                          |
    |  2 |        2.33        2.00  1.75                    |
    |  3 |              2.00                                |
    |  4 |                    1.75        1.38              |
    |  6 |                          1.50              1.50  |
    |  8 |                                1.75              |
    | 10 |                                      1.30        |
    | 12 |                                            1.33  |
    +----+-------------------------------------------------+

Latency = 3.0, issue time = 1.0.

Int +
    +----+-------------------------------------------------+
    | SS |     1     2     3     4     6     8    10    12  |
    +----+-------------------------------------------------+
    |  1 |  2.32  1.50  0.75  0.63                          |
    |  2 |        1.50        0.83  0.63                    |
    |  3 |              1.00                                |
    |  4 |                    1.00        0.63              |
    |  6 |                          0.83              0.67  |
    |  8 |                                0.63              |
    | 10 |                                      0.60        |
    | 12 |                                            0.85  |
    +----+-------------------------------------------------+

Wow, got it below 1.0! That means it can do more than one memory read per
cycle.

Performance of some machines and some functions (latency / issue time, in
cycles):

                Nocona     Opteron    Pentium M   Pentium III
    Int +       0.5/0.5    1/1?       1/1         1/1
    Int *       10/1       3/1        4/1         4/1
    Int /       36/36      46/46      20/20       36/36
    Long /      106/106    76/76
    FP +        5/2        4/1        3/1         3/1
    FP *        7/2        4/1        5/2         5/2
    Float /     32         14         36          36
    Double /    46         17         36          36

Here's something from an old exam problem: factorial computation.

    int rfact(int n)
    {
        if (n <= 1)
            return 1;
        return n * rfact(n-1);
    }

Fish CPE = 15.5

    int fact(int n)
    {
        int i;
        int result = 1;
        for (i = n; i > 0; i--)
            result = result * i;
        return result;
    }

Fish CPE = 10.0

    int fact_u3a(int n)
    {
        int i;
        int result = 1;
        for (i = n; i >= 3; i-=3) {
            result = result * i * (i-1) * (i-2);
        }
        for (; i > 0; i--)
            result *= i;
        return result;
    }

Fish CPE = 10.0 (didn't help)

    int fact_u3b(int n)
    {
        int i;
        int result = 1;
        for (i = n; i >= 3; i-=3) {
            result *= i * (i-1) * (i-2);
        }
        for (; i > 0; i--)
            result *= i;
        return result;
    }

Fish CPE = 3.34. Wow! What happened?

Got parallelism through a different associativity. Each group's subproduct
i * (i-1) * (i-2) is computed off the critical path; the sequential chain
through result sees only one multiply per 3 elements:

    (100 * 99 * 98)   (97 * 96 * 95)   (94 * 93 * 92)
           |                 |                |
    1 -----*-----------------*----------------*------ ...

Breaks the depth to N/3. Note that this too changes the associativity (fine
for ints; would change results for FP).
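The same effect could be made explicit with separate accumulators, in the
style of combine6. This variant (fact_u3c) is a hypothetical sketch, not
from the exam, and was not measured; it carries the same associativity
caveat.

    /* Sketch: 3-way unrolling with 3 explicit accumulators.
     * Hypothetical variant, not from the exam problem. */
    int fact_u3c(int n)
    {
        int i;
        int r0 = 1, r1 = 1, r2 = 1;
        for (i = n; i >= 3; i -= 3) {
            r0 *= i;        /* three independent multiply chains */
            r1 *= i - 1;
            r2 *= i - 2;
        }
        for (; i > 0; i--)  /* finish any remaining factors */
            r0 *= i;
        return r0 * r1 * r2;  /* combine at the end */
    }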
Branch Prediction

The processor tries to guess which way branches will go.

Initially: backward taken, forward not taken.
Refinement: keep state of how branches have gone; build simple predictors
that can recognize regular patterns.

Examples:

    int max(int x, int y)
    {
        return (x < y) ? y : x;
    }

(Compiles to a conditional move.)

    int bmax(int x, int y)
    {
        int mask = -(x > y);
        return (mask & x) | (~mask & y);
    }

(Bit tricks similar to Lab 1.)

    volatile int tcnt = 0;

    int bjtmax(int x, int y)
    {
        if (x < y) {
            tcnt++;
            return y;
        }
        return x;
    }

Forces the compiler to use branching code. Penalty (the counter update) on
the < case.

    volatile int ecnt = 0;

    int bjemax(int x, int y)
    {
        if (x < y) {
            return y;
        }
        ecnt++;
        return x;
    }

Forces the compiler to use branching code. Penalty on the > case.

Measurements, over four argument patterns:

    always <        max(0, +1)
    always >        max(0, -1)
    alternate < >   max(0, a+-1)
    random < >      max(0, r+-1)

    Function: max     max(0, +1)     CPT = 14.19
    Function: max     max(0, -1)     CPT = 14.19
    Function: max     max(0, a+-1)   CPT = 14.20
    Function: max     max(0, r+-1)   CPT = 14.20

Conclusion: oblivious to direction. A conditional move doesn't require
branch prediction.

    Function: bmax    max(0, +1)     CPT = 17.23
    Function: bmax    max(0, -1)     CPT = 17.25
    Function: bmax    max(0, a+-1)   CPT = 17.23
    Function: bmax    max(0, r+-1)   CPT = 17.25

Better to use conditional moves than the bit tricks here.

    Function: bjtmax  max(0, +1)     CPT = 12.27
    Function: bjtmax  max(0, -1)     CPT = 16.19
    Function: bjtmax  max(0, a+-1)   CPT = 14.45
    Function: bjtmax  max(0, r+-1)   CPT = 32.09

<: 12 cycles. >: 16 cycles. Alternating is the average of the two --> the
branch predictor learned the pattern. Random = 32 cycles, about 16 cycles
worse than the average. Prediction rate = 0.5 ==> a ~32-cycle branch
misprediction penalty (0.5 misses x ~32 cycles = ~16 extra cycles per test
on average).

    Function: bjemax  max(0, +1)     CPT = 16.19
    Function: bjemax  max(0, -1)     CPT = 12.27
    Function: bjemax  max(0, a+-1)   CPT = 14.22
    Function: bjemax  max(0, r+-1)   CPT = 31.33

Same behavior as above, with the roles of < and > swapped.

Important lessons:
* Can get major speedups.
* Understand parallelism.
* Don't need detailed knowledge of the hardware.
* Mispredicted branches are the worst performance enemy.
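Appendix: for reference, a sketch of the kind of driver that could generate
the four argument patterns measured above. Everything here (run_pattern,
TRIALS, sink) is made up for illustration; the course's real harness
measures CPT with a cycle counter and subtracts the loop overhead.

    /* Hypothetical driver sketch for the four branch patterns above. */
    #include <stdlib.h>

    #define TRIALS 1000000

    static volatile int sink;   /* keeps calls from being optimized away */

    static int max(int x, int y) { return (x < y) ? y : x; }

    static void run_pattern(int (*fn)(int, int), char pattern)
    {
        int i;
        for (i = 0; i < TRIALS; i++) {
            int arg;
            switch (pattern) {
            case '+': arg = +1; break;                 /* always <  */
            case '-': arg = -1; break;                 /* always >  */
            case 'a': arg = (i & 1) ? +1 : -1; break;  /* alternate */
            default:  arg = (rand() & 1) ? +1 : -1;    /* random    */
            }
            sink = fn(0, arg);   /* max(0, +/-1), as in the measurements */
        }
    }

    int main(void)
    {
        run_pattern(max, 'r');
        return 0;
    }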