======================
Back-End Optimizations

1> instruction scheduling

motivate with example below

is this really important? only in corner cases for out-of-order x86 superscalar
chips, but it might be important for ARM - some ARM processors don't use
out-of-order execution, though recent ones do (since 2018)

list scheduling - greedy, priority based. Ch. 17 has a good description.
  build dependency graph
    dataflow (true) dependence: write followed by read
    anti-dependence: read followed by write (WAR); also output dependence:
      write after write (WAW)
    need only consider true dataflow dependences when everything is in
      virtual registers/SSA
    have to conservatively approximate data dependences for memory reads/writes
  simulate execution
    maintain a list of ready instructions - those that can run without stalls
    pick a ready instruction by priority (longest latency-weighted path to the
      return value, number of successors, number of descendants, latency).
      Heuristic; no metric is always best.
  schedule first, allocate registers, then schedule again
  (a C sketch of list scheduling appears after the register-allocation notes below)

trace scheduling - tries to optimize the most common path

Example - assume * and / cost 4 cycles, other arithmetic costs 1 cycle

  a := x * 37    (3 stall)
  b := a / 3     (3 stall)
  c := a + b
  i := y >> 2
  h := i - 1
  g := h << 2
  f := h + i
  e := f - g
  d := c + e

15 cycles

After scheduling:

  [live x y]
  a = mult x 37
  [live a y]
  i = shiftr y 2
  [live a i]
  h = sub i 1
  [live a h i]
  g = shiftl h 2
  [live a g h i]
  b = div a 3
  [live a b g h i]
    [will end up spilling something here, probably g - high degree and only
    two defs/uses. one heuristic = maximize degree / (defs + uses)]
  f = add h i
  [live a b f g]
  e = sub f g    (stall: waiting for b)
  [live a b e]
  c = add a b
  d = add c e

10 cycles

2> register allocation (ch 17) & spilling (ch 15) [1:45]

Chaitin: register allocation and spilling via graph coloring
  discover live ranges
  build interference graph
  algorithm:
    color(g, r)
      let stack = empty
      while true do
        while there exists a node n in g of degree < r
          remove n from g
          push n onto stack
        if g is empty
          while stack is nonempty
            pop n from stack
            add n back to g
            assign n a color that doesn't conflict with its neighbors
          break
        else
          select a node n in g according to a heuristic (spill candidate)
          remove n from g
  one heuristic to spill: maximize (degree / (defs + uses))

Linear scan algorithm used for JITs (because graph coloring is too slow)
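To make the simplify/select loop concrete, here is a minimal C sketch of it. The
interference graph, the def/use counts, and the values of N and R are invented
for illustration; a real Chaitin allocator would rewrite the spilled live range
with loads and stores and restart allocation, where this sketch just reports the
spill and keeps going.

#include <stdio.h>
#include <string.h>

#define N 6  /* virtual registers v0..v5 (hypothetical example) */
#define R 3  /* machine registers r0..r2 */

static int adj[N][N];   /* interference graph as an adjacency matrix */
static int removed[N];  /* 1 once a node has been simplified or spilled away */
/* defs+uses per node, for the degree/(defs+uses) spill heuristic (made up) */
static int uses[N] = {2, 4, 2, 3, 5, 2};

static int degree(int n) {
    int d = 0;
    for (int m = 0; m < N; m++)
        if (!removed[m] && adj[n][m]) d++;
    return d;
}

int main(void) {
    /* hypothetical live ranges: v1..v4 form a clique, so with R = 3
       something must spill */
    int edges[][2] = {{1,2},{1,3},{1,4},{2,3},{2,4},{3,4},{0,1},{4,5}};
    for (unsigned k = 0; k < sizeof edges / sizeof edges[0]; k++)
        adj[edges[k][0]][edges[k][1]] = adj[edges[k][1]][edges[k][0]] = 1;

    int stack[N], sp = 0, left = N;
    while (left > 0) {
        int progress = 1;            /* simplify: remove nodes of degree < R */
        while (progress) {
            progress = 0;
            for (int n = 0; n < N; n++)
                if (!removed[n] && degree(n) < R) {
                    removed[n] = 1; stack[sp++] = n; left--; progress = 1;
                }
        }
        if (left == 0) break;
        /* blocked: spill the node maximizing degree / (defs + uses) */
        int best = -1;
        double bestscore = -1.0;
        for (int n = 0; n < N; n++)
            if (!removed[n]) {
                double score = (double)degree(n) / uses[n];
                if (score > bestscore) { bestscore = score; best = n; }
            }
        removed[best] = 1; left--;
        printf("spill v%d (rewrite with loads/stores, then retry)\n", best);
    }

    /* select: pop in reverse order, give each node the lowest free color;
       a color always exists because each node was pushed with degree < R */
    int color[N];
    memset(color, -1, sizeof color);
    while (sp > 0) {
        int n = stack[--sp];
        int used[R] = {0};
        removed[n] = 0;
        for (int m = 0; m < N; m++)
            if (!removed[m] && adj[n][m] && color[m] >= 0) used[color[m]] = 1;
        for (int c = 0; c < R; c++)
            if (!used[c]) { color[n] = c; break; }
        printf("v%d -> r%d\n", n, color[n]);
    }
    return 0;
}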
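And the list-scheduling sketch promised in part 1: a single-issue simulator over
the dependence DAG of the worked example, using longest latency-weighted path to
the end as the priority. The instruction table and DAG encoding are our own
transcription of that example; running it reproduces the 10-cycle schedule
above, including the stall before c = add a b.

#include <stdio.h>

/* Dependence DAG for the scheduling example above (encoding is ours):
   0: a = mult x 37   1: b = div a 3    2: c = add a b
   3: i = shiftr y 2  4: h = sub i 1    5: g = shiftl h 2
   6: f = add h i     7: e = sub f g    8: d = add c e     */
#define N 9
static const char *name[N] = {
    "a = mult x 37", "b = div a 3", "c = add a b",
    "i = shiftr y 2", "h = sub i 1", "g = shiftl h 2",
    "f = add h i", "e = sub f g", "d = add c e"
};
static int lat[N] = {4, 4, 1, 1, 1, 1, 1, 1, 1};
static int dep[N][N];   /* dep[i][j] = 1 if j reads i's result */
static int prio[N];     /* longest latency-weighted path to the end */

static int longest(int i) {
    if (prio[i]) return prio[i];   /* memoized (every true value is >= 1) */
    int best = 0;
    for (int j = 0; j < N; j++)
        if (dep[i][j]) {
            int p = longest(j);
            if (p > best) best = p;
        }
    return prio[i] = lat[i] + best;
}

int main(void) {
    int edges[][2] = {{0,1},{0,2},{1,2},{3,4},{3,6},{4,5},
                      {4,6},{5,7},{6,7},{2,8},{7,8}};
    for (unsigned k = 0; k < sizeof edges / sizeof edges[0]; k++)
        dep[edges[k][0]][edges[k][1]] = 1;
    for (int i = 0; i < N; i++) longest(i);

    int done_at[N];   /* last cycle i is in flight; -1 = unscheduled */
    for (int i = 0; i < N; i++) done_at[i] = -1;
    for (int cycle = 1, scheduled = 0; scheduled < N; cycle++) {
        int pick = -1;   /* highest-priority ready instruction */
        for (int i = 0; i < N; i++) {
            if (done_at[i] >= 0) continue;
            int ready = 1;
            for (int j = 0; j < N; j++)
                if (dep[j][i] && (done_at[j] < 0 || done_at[j] >= cycle))
                    ready = 0;   /* operand unscheduled or still in flight */
            if (ready && (pick < 0 || prio[i] > prio[pick])) pick = i;
        }
        if (pick < 0) { printf("cycle %2d: stall\n", cycle); continue; }
        done_at[pick] = cycle + lat[pick] - 1;
        printf("cycle %2d: %s\n", cycle, name[pick]);
        scheduled++;
    }
    return 0;
}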
3> Loop optimizations [2:10]

loop invariant code motion (ch 15, well described, uses reaching defs)
  apply to an expression if all reaching defs are outside the loop
  (or if some are inside, but are themselves loop-invariant)

Example (a hoisted version is sketched at the end of these notes):

  void copy_offset(int *a, int offset, int n) {
    for (int i = 0; i < n; ++i) {
      int *c = a + offset;   // invariant
      int *d = c + i;
      int *e = c + n;        // invariant
      int *f = e + i;
      *d = *f;
    }
  }

loop unrolling
  duplicate the body of a loop. Can reduce the overhead of the loop header and
  allow optimization across two loop bodies. In some cases you know the loop
  executes exactly n times, so you can unroll it n times and get straight-line
  code (example below).

if time: software pipelining

  loop
    load a
    t := a * 15
    sum := sum + t

  is transformed to

  load a
  load b
  t := a * 15
  loop
    load c
    u := b * 15
    sum := sum + t
    b = c
    t = u
  u := b * 15
  sum := sum + t
  sum := sum + u

4> interprocedural optimization: inlining (and knowing what to inline) [2:25]

substitute the body of a function at a call site
  benefit: eliminates the overhead of the function call (good for small functions)
  benefit: enables optimization of the function based on call-site information
  cost: duplicates the function (worse for large functions)
  requirement: need to know what function is called
    dataflow analysis to trace function definitions in a functional language
    type of objects in an OO language

[2:30] example:

  let fact n = {
    var result:int = 1;
    for (var i = n; i > 0; --i)
      result = result * i
    return result
  }
  let apply (f, v) = f(v)

  apply fact 4

inline apply -> last line becomes fact(4)

inline fact ->
  var result:int = 1;
  for (var i = 4; i > 0; --i)
    result = result * i
  result

unroll loop ->
  var result:int = 1;
  result = result * 4
  result = result * 3
  result = result * 2
  result = result * 1
  result

constant folding and propagation -> 24
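As promised in part 3, a sketch of copy_offset after loop-invariant code motion
has run; the function name is ours, and this is plausible output for that
example rather than what any particular compiler emits.

/* copy_offset after LICM: c and e are computed once, before the loop.
   Hoisting is safe here because the address arithmetic has no side
   effects, even when the loop body never executes. */
void copy_offset_hoisted(int *a, int offset, int n) {
    int *c = a + offset;   /* invariant: depends only on a, offset */
    int *e = c + n;        /* invariant: depends only on c, n */
    for (int i = 0; i < n; ++i) {
        int *d = c + i;
        int *f = e + i;
        *d = *f;
    }
}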