======================
Back-End Optimizations

1> instruction scheduling

motivate with example below

is this really important? only in corner cases for out-of-order x86 superscalar
chips, but it might be important for ARM - some ARM processors don't use
out-of-order execution, though recent ones do (since 2018)

list scheduling - greedy, priority based. Ch. 17 has a good description.
  build dependency graph
    dataflow (true) dependence: write followed by read
    anti-dependence: read followed by write (WAR); also output dependence:
      write after write (WAW)
    need only consider true dataflow dependences when everything is in
      virtual registers/SSA
    have to conservatively approximate data dependences for memory reads/writes
  simulate execution
    maintain a list of ready instructions - those that can run without stalls
    pick a ready instruction by priority (longest latency-weighted path to the
      return value, number of successors, number of descendants, latency).
      Heuristic; no metric is always best.
  schedule first, allocate registers, then schedule again
  (a C sketch of list scheduling appears after the register-allocation notes below)

trace scheduling - tries to optimize the most common path

Example - assume * and / cost 4 cycles, other arithmetic costs 1 cycle

  a := x * 37    (3 stall)
  b := a / 3     (3 stall)
  c := a + b
  i := y >> 2
  h := i - 1
  g := h << 2
  f := h + i
  e := f - g
  d := c + e

15 cycles

After scheduling:

  [live x y]
  a = mult x 37
  [live a y]
  i = shiftr y 2
  [live a i]
  h = sub i 1
  [live a h i]
  g = shiftl h 2
  [live a g h i]
  b = div a 3
  [live a b g h i]
    [will end up spilling something here, probably g - high degree and only
    two defs/uses. one heuristic = maximize degree / (defs + uses)]
  f = add h i
  [live a b f g]
  e = sub f g    (stall: waiting for b)
  [live a b e]
  c = add a b
  d = add c e

10 cycles

2> register allocation (ch 17) & spilling (ch 15) [1:45]

Chaitin: register allocation and spilling via graph coloring
  discover live ranges
  build interference graph
  algorithm:
    color(g, r)
      let stack = empty
      while true do
        while there exists a node n in g of degree < r
          remove n from g
          push n onto stack
        if g is empty
          while stack is nonempty
            pop n from stack
            add n back to g
            assign n a color that doesn't conflict with its neighbors
          break
        else
          select a node n in g according to a heuristic (spill candidate)
          remove n from g
  one heuristic to spill: maximize (degree / (defs + uses))

Linear scan algorithm used for JITs (because graph coloring is too slow)
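To make the simplify/select loop concrete, here is a minimal C sketch of it. The
interference graph, the def/use counts, and the values of N and R are invented
for illustration; a real Chaitin allocator would rewrite the spilled live range
with loads and stores and restart allocation, where this sketch just reports the
spill and keeps going.

#include <stdio.h>
#include <string.h>

#define N 6  /* virtual registers v0..v5 (hypothetical example) */
#define R 3  /* machine registers r0..r2 */

static int adj[N][N];   /* interference graph as an adjacency matrix */
static int removed[N];  /* 1 once a node has been simplified or spilled away */
/* defs+uses per node, for the degree/(defs+uses) spill heuristic (made up) */
static int uses[N] = {2, 4, 2, 3, 5, 2};

static int degree(int n) {
    int d = 0;
    for (int m = 0; m < N; m++)
        if (!removed[m] && adj[n][m]) d++;
    return d;
}

int main(void) {
    /* hypothetical live ranges: v1..v4 form a clique, so with R = 3
       something must spill */
    int edges[][2] = {{1,2},{1,3},{1,4},{2,3},{2,4},{3,4},{0,1},{4,5}};
    for (unsigned k = 0; k < sizeof edges / sizeof edges[0]; k++)
        adj[edges[k][0]][edges[k][1]] = adj[edges[k][1]][edges[k][0]] = 1;

    int stack[N], sp = 0, left = N;
    while (left > 0) {
        int progress = 1;            /* simplify: remove nodes of degree < R */
        while (progress) {
            progress = 0;
            for (int n = 0; n < N; n++)
                if (!removed[n] && degree(n) < R) {
                    removed[n] = 1; stack[sp++] = n; left--; progress = 1;
                }
        }
        if (left == 0) break;
        /* blocked: spill the node maximizing degree / (defs + uses) */
        int best = -1;
        double bestscore = -1.0;
        for (int n = 0; n < N; n++)
            if (!removed[n]) {
                double score = (double)degree(n) / uses[n];
                if (score > bestscore) { bestscore = score; best = n; }
            }
        removed[best] = 1; left--;
        printf("spill v%d (rewrite with loads/stores, then retry)\n", best);
    }

    /* select: pop in reverse order, give each node the lowest free color;
       a color always exists because each node was pushed with degree < R */
    int color[N];
    memset(color, -1, sizeof color);
    while (sp > 0) {
        int n = stack[--sp];
        int used[R] = {0};
        removed[n] = 0;
        for (int m = 0; m < N; m++)
            if (!removed[m] && adj[n][m] && color[m] >= 0) used[color[m]] = 1;
        for (int c = 0; c < R; c++)
            if (!used[c]) { color[n] = c; break; }
        printf("v%d -> r%d\n", n, color[n]);
    }
    return 0;
}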
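And the list-scheduling sketch promised in part 1: a single-issue simulator over
the dependence DAG of the worked example, using longest latency-weighted path to
the end as the priority. The instruction table and DAG encoding are our own
transcription of that example; running it reproduces the 10-cycle schedule
above, including the stall before c = add a b.

#include <stdio.h>

/* Dependence DAG for the scheduling example above (encoding is ours):
   0: a = mult x 37   1: b = div a 3    2: c = add a b
   3: i = shiftr y 2  4: h = sub i 1    5: g = shiftl h 2
   6: f = add h i     7: e = sub f g    8: d = add c e     */
#define N 9
static const char *name[N] = {
    "a = mult x 37", "b = div a 3", "c = add a b",
    "i = shiftr y 2", "h = sub i 1", "g = shiftl h 2",
    "f = add h i", "e = sub f g", "d = add c e"
};
static int lat[N] = {4, 4, 1, 1, 1, 1, 1, 1, 1};
static int dep[N][N];   /* dep[i][j] = 1 if j reads i's result */
static int prio[N];     /* longest latency-weighted path to the end */

static int longest(int i) {
    if (prio[i]) return prio[i];   /* memoized (every true value is >= 1) */
    int best = 0;
    for (int j = 0; j < N; j++)
        if (dep[i][j]) {
            int p = longest(j);
            if (p > best) best = p;
        }
    return prio[i] = lat[i] + best;
}

int main(void) {
    int edges[][2] = {{0,1},{0,2},{1,2},{3,4},{3,6},{4,5},
                      {4,6},{5,7},{6,7},{2,8},{7,8}};
    for (unsigned k = 0; k < sizeof edges / sizeof edges[0]; k++)
        dep[edges[k][0]][edges[k][1]] = 1;
    for (int i = 0; i < N; i++) longest(i);

    int done_at[N];   /* last cycle i is in flight; -1 = unscheduled */
    for (int i = 0; i < N; i++) done_at[i] = -1;
    for (int cycle = 1, scheduled = 0; scheduled < N; cycle++) {
        int pick = -1;   /* highest-priority ready instruction */
        for (int i = 0; i < N; i++) {
            if (done_at[i] >= 0) continue;
            int ready = 1;
            for (int j = 0; j < N; j++)
                if (dep[j][i] && (done_at[j] < 0 || done_at[j] >= cycle))
                    ready = 0;   /* operand unscheduled or still in flight */
            if (ready && (pick < 0 || prio[i] > prio[pick])) pick = i;
        }
        if (pick < 0) { printf("cycle %2d: stall\n", cycle); continue; }
        done_at[pick] = cycle + lat[pick] - 1;
        printf("cycle %2d: %s\n", cycle, name[pick]);
        scheduled++;
    }
    return 0;
}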
3> Loop optimizations [2:10]

loop invariant code motion (ch 15, well described, uses reaching defs)
  apply to an expression if all reaching defs are outside the loop
  (or if some are inside, but are themselves loop-invariant)

Example (a hoisted version is sketched at the end of these notes):

  void copy_offset(int *a, int offset, int n) {
    for (int i = 0; i < n; ++i) {
      int *c = a + offset;   // invariant
      int *d = c + i;
      int *e = c + n;        // invariant
      int *f = e + i;
      *d = *f;
    }
  }

loop unrolling
  duplicate the body of a loop. Can reduce the overhead of the loop header and
  allow optimization across two loop bodies. In some cases you know the loop
  executes exactly n times, so you can unroll it n times and get straight-line
  code (example below).

if time: software pipelining

  loop
    load a
    t := a * 15
    sum := sum + t

  is transformed to

  load a
  load b
  t := a * 15
  loop
    load c
    u := b * 15
    sum := sum + t
    b = c
    t = u
  u := b * 15
  sum := sum + t
  sum := sum + u

4> interprocedural optimization: inlining (and knowing what to inline) [2:25]

substitute the body of a function at a call site
  benefit: eliminates the overhead of the function call (good for small functions)
  benefit: enables optimization of the function based on call-site information
  cost: duplicates the function (worse for large functions)
  requirement: need to know what function is called
    dataflow analysis to trace function definitions in a functional language
    type of objects in an OO language

[2:30] example:

  let fact n = {
    var result:int = 1;
    for (var i = n; i > 0; --i)
      result = result * i
    return result
  }
  let apply (f, v) = f(v)

  apply fact 4

inline apply -> last line becomes fact(4)

inline fact ->
  var result:int = 1;
  for (var i = 4; i > 0; --i)
    result = result * i
  result

unroll loop ->
  var result:int = 1;
  result = result * 4
  result = result * 3
  result = result * 2
  result = result * 1
  result

constant folding and propagation -> 24
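As promised in part 3, a sketch of copy_offset after loop-invariant code motion
has run; the function name is ours, and this is plausible output for that
example rather than what any particular compiler emits.

/* copy_offset after LICM: c and e are computed once, before the loop.
   Hoisting is safe here because the address arithmetic has no side
   effects, even when the loop body never executes. */
void copy_offset_hoisted(int *a, int offset, int n) {
    int *c = a + offset;   /* invariant: depends only on a, offset */
    int *e = c + n;        /* invariant: depends only on c, n */
    for (int i = 0; i < n; ++i) {
        int *d = c + i;
        int *f = e + i;
        *d = *f;
    }
}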