Lecture 17, 18, 19 - Optimizing JIT compilers

Optimizing compilers
- look a lot like a traditional compiler
- designed for better code quality
- SSA or other IR designed for optimization
- performs one or more optimizations on the IR
- full-fledged register allocator
  - linear scan, graph coloring, or another algorithm
- multiple backends for multiple architectures
- make use of profiling and type feedback
- differentiation from static compilers
  - speculative optimization and deopt
  - higher emphasis on compile speed
  - assembles bytes of machine code directly in memory

Design of an optimizing compiler
- Typical design looks like a traditional compiler
  - Frontend: parse bytecode into IR directly, or walk the AST
  - Middle: optimizations and analysis on the IR
  - Backend: generate code from the IR
- possible to skip the IR, but that falls into the baseline or single-pass design
- will focus on the basic "three ends" model

First design choices: what does the IR look like?
- Instructions and basic blocks
- Static single assignment (SSA) form
  - every variable is assigned only once in the program
  - source-level variables become multiple IR-level variables
  - phi nodes merge data flow from different branches at merge points
  - every variable is defined by exactly one instruction => the instruction *is* the value
  - a definition must *dominate* its uses
    - dominance: X dom Y if all paths from the start to Y must first pass through X
    - guarantees that a definition has been computed before any of its uses execute
  - Without SSA, many optimizations require reasoning about overwriting of variables
  - ultimately another form of abstract interpretation
- Graph-based IR, often called "sea of nodes"
  - "SSA on steroids"
  - control and effect edges represent ordering constraints
  - requires a scheduling pass before generating code

Implications of IR design:
- How easy/fast is it to build the IR?
  - transform from AST or bytecode into IR
- How easy/fast are optimizations?
  - reasoning based on facts about inputs: a forward dataflow problem
  - reasoning based on uses: a backward dataflow problem
- How easy/fast is it to generate code?
  - maybe a second IR is necessary before machine code?

- SSA construction via abstract interpretation
  - Single-pass IR construction is similar to single-pass compilation; it is a form of abstract interpretation
    - abstract values are *references* to SSA values, i.e. IR nodes
    - an instruction pops its input values off the abstract operand stack, creates a node or nodes, and pushes the node(s)
    - SSA renaming can occur in a forward pass over the locals and the operand stack
  - Same problems as baseline compilers: how to handle loops and merges

        func #1:              x 1
         +1 | local.get[0]
         +3 | i32.const[0]
         +5 | i32.lt_s
         +6 | ╭ if[i32]
            | |               x 0
         +8 | | local.get[0]
        +10 | | call[func=0]
        +12 | ╭─else
            | │
        +13 | │ local.get[0]
        +15 | │ i32.const[1]
        +17 | │ i32.sub
        +18 | ╰>end           x 1
        +19 ╰>end             x 1

  - Differences from the baseline compiler:
    - building an IR, not generating code
    - at merges, build phis for control-flow paths with differing values (see the sketch after this section)
    - no need to register allocate; build IR nodes instead
      => can afford to relax state "lazily" when reaching merges
    - a pre-pass to determine which variables are modified in loops is worthwhile
    - Wasm's structured control flow makes this easier
  - What about JVM bytecode?
    - pre-pass to determine basic blocks and loops
    - keep a count of predecessors seen
    - only construct IR when all predecessors have been seen, except at loops, where back edges won't have been seen yet
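
A minimal sketch of this single-pass SSA construction by abstract interpretation, assuming a toy stack-machine bytecode; the Node and AbstractState classes and the opcode names are illustrative, not any engine's actual API. Each instruction pops its input nodes off the abstract operand stack and pushes the node it creates; at a two-way merge, phis are built only for locals whose values differ between the two paths.

    # Sketch only: abstract values are references to IR nodes; interpreting the
    # bytecode once performs SSA renaming of the locals and the operand stack.

    class Node:
        def __init__(self, op, inputs=(), value=None):
            self.op = op                 # 'const', 'add', 'param', 'phi', ...
            self.inputs = list(inputs)   # references to other Nodes
            self.value = value           # constant payload, if any
        def __repr__(self):
            return f"{self.op}({self.value})" if self.op == 'const' else self.op

    class AbstractState:
        # Abstract frame: locals and operand stack hold references to IR nodes.
        def __init__(self, locals_):
            self.locals = list(locals_)
            self.stack = []
        def copy(self):
            s = AbstractState(self.locals)
            s.stack = list(self.stack)
            return s

    def build(code, state, nodes):
        # One forward pass: pop input nodes, create a result node, push it.
        for op, *args in code:
            if op == 'const':
                state.stack.append(Node('const', value=args[0]))
            elif op == 'local.get':
                state.stack.append(state.locals[args[0]])
            elif op == 'local.set':
                state.locals[args[0]] = state.stack.pop()
            elif op == 'add':
                b, a = state.stack.pop(), state.stack.pop()
                node = Node('add', [a, b])
                nodes.append(node)
                state.stack.append(node)
        return state

    def merge(a, b, nodes):
        # At a control-flow merge, build phis only where the two paths differ.
        merged = a.copy()
        for i, (x, y) in enumerate(zip(a.locals, b.locals)):
            if x is not y:
                phi = Node('phi', [x, y])
                nodes.append(phi)
                merged.locals[i] = phi
        return merged

    # if/else that assigns local 0 differently on each path, then merges:
    nodes = []
    entry = AbstractState([Node('param', value=0)])
    then_st = build([('const', 1), ('local.set', 0)], entry.copy(), nodes)
    else_st = build([('const', 2), ('local.set', 0)], entry.copy(), nodes)
    after = merge(then_st, else_st, nodes)   # local 0 is now phi(const 1, const 2)

A production builder must also merge the operand stack and, as noted above, benefits from a pre-pass over loops so that phis can be created for variables assigned in loop bodies before the back edge has been seen.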
  - Does it pay off to do optimizations while parsing?
    - SSA renaming of the inputs, plus abstract interpretation, provides opportunities for peephole optimizations
    - usually easy constant folding and strength reduction pay off
    - some compilers do obvious inlining as well

- Optimizations in the optimizing compiler
  - Constant folding
  - Common subexpression elimination
  - Speculative optimizations
    - speculate on types (source types, shapes)
    - speculate on numeric ranges (e.g. int, not double)
  - Load elimination
  - Type analysis and propagation
  - Devirtualization
  - Inlining
    - Needs heuristics on:
      - what to inline: monomorphic calls, IC feedback, hot, small
      - when to inline: recompilation heuristics, callsite, hotness
  - Check elimination
    - null checks, zero checks (division), bounds checks
  - Write barrier elimination
  - Dead code elimination
  - Code layout
    - hot/cold path layout

- Load elimination
      getfield [C.f] (o)
      ...                    // (no intervening stores to C.f of o)
      getfield [C.f] (o)     // redundant
- Load-store elimination
      getfield [C.f] (o)     // redundant
      ...                    // (no uses of the load)
      putfield [C.f] (o)
- Store elimination
      putfield [C.f] (o)     // redundant
      ...                    // (no intervening loads)
      putfield [C.f] (o)
- Load-store forwarding
      putfield [C.f] (o, v)
      ...
      getfield [C.f] (o)     // replace with v

- Speculation
  - Idea: incorporate information recorded during interpretation/baseline execution
    - turn previous observations into assumptions
    - insert guards into the code to check that the assumptions hold
    - optimize the code under the new assumptions
  - Usually happens fairly early in the compilation pipeline
    - helps global type analysis do better
  - What to do if assumptions fail?
    - tail-duplication (copy the merge into slow/fast paths)
    - compile a slow path (compensating code)
    - deoptimize to a lower tier (optimized code can assume the slow path never happens)

- Deoptimization
  - requires reconstructing execution frame(s) for lower tiers
  - "side-exits" or "deopt points" -- places where a guard fails and execution returns to a lower tier
  - every deopt point contains metadata or compensating code to allow reconstruction
    - maps the abstract state (e.g. locals and operand stack slots) to registers, stack slots, or constants
    - can consume a lot of space that is never used unless a deopt actually happens
  - the runtime reconstructs source frames from the metadata
  - when can deoptimization happen?
    - a failed guard check
    - the debugger is activated
    - a global invariant changed

- Lowering
  - While single-pass compilers directly generate the machine code sequence for each bytecode or AST node, optimizing compilers usually *lower* operations first
    - translate high-level operations into low-level operations within the same IR (TurboFan)
    - translate a high-level IR into a low-level IR (Crankshaft and TurboFan)

- Common subexpression elimination
  - Many programs recompute the same values repeatedly
    - especially true once many methods are inlined
    - loads of a field (load/store elimination)
    - conversions between types
    - intermediate computations (e.g. object header, vtable) used repeatedly
  - Cache previously-computed results and remove redundant computations
  - Local CSE: use a value-numbering table, aka "local value numbering" (sketched below)
  - Global CSE/GVN: often done by making use of the dominator tree
  - Partial redundancy elimination: some computations are only used on some paths
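
To make local CSE concrete, here is a small sketch of local value numbering over a single basic block, assuming a tiny three-address form with pure operators; the tuple encoding and the function name are illustrative rather than any particular compiler's IR.

    # Minimal local value numbering sketch (one basic block, pure ops only).
    # Instructions are (dest, op, operands); a recomputation of an already-seen
    # (op, operands) key is dropped and its destination forwards to the earlier result.

    def local_value_numbering(block):
        table = {}      # (op, canonical operands) -> canonical destination
        number = {}     # variable name -> its canonical definition
        out = []
        for dest, op, operands in block:
            key = (op,) + tuple(number.get(v, v) for v in operands)
            if key in table:
                # redundant computation: later uses of dest are renamed to the earlier result
                number[dest] = table[key]
            else:
                table[key] = dest
                number[dest] = dest
                out.append((dest, op, tuple(number.get(v, v) for v in operands)))
        return out

    # Example: the second add is eliminated and t3's uses forward to t1.
    block = [("t1", "add", ("a", "b")),
             ("t2", "mul", ("t1", "c")),
             ("t3", "add", ("a", "b")),
             ("t4", "mul", ("t3", "d"))]
    print(local_value_numbering(block))
    # [('t1', 'add', ('a', 'b')), ('t2', 'mul', ('t1', 'c')), ('t4', 'mul', ('t1', 'd'))]

Global value numbering applies the same table across blocks, typically by walking the dominator tree so that a value computed in a dominator can be reused by the blocks it dominates.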
- Instruction selection
  - Low-level IR often has simpler operations than are available as CPU instructions
    - or, conversely, most ISAs have more complex instructions than just load/store/arithmetic
    - examples are x86 addressing modes, arithmetic-with-immediate, load/store at an offset, etc.
  - Instruction selection takes 1 or more low-level operations and generates 1 machine instruction
  - Tree-based pattern matching
    - work on instructions in reverse order
    - try to "cover" expression trees with "tiles"
    - a "tile" corresponds to a single machine instruction (typically)
    - a tile is connected to other tiles via virtual registers (vregs)
    - an unbounded supply of virtual registers is usually assumed

- SSA deconstruction
  - need to convert SSA form into non-SSA form for execution on a real machine
  - assign phis virtual registers during instruction selection
  - typical approach: insert moves (to vregs) at the predecessors of phi locations (merges)

- Register allocation
  - typically done after instruction selection, when the machine instructions are known
  - task: map each virtual register to a machine register (a linear-scan sketch follows below)
  - constraint: two vregs that are live at the same time cannot be in the same physical register
  - too many live vregs at once: split and spill vregs to the stack frame
    - spill: the vreg lives on the stack for its entire lifetime and is only loaded just before each use
    - split: the vreg's live range is broken into shorter ranges that are (re)allocated individually
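
A minimal sketch of linear-scan register allocation over precomputed live intervals, in the spirit of the classic Poletto-Sarkar algorithm; the interval format, register names, and spill-slot naming are illustrative, and a production allocator adds live-range splitting, fixed-register constraints, and move resolution.

    # Sketch only: allocate whole live intervals to registers in order of start point,
    # spilling when more intervals are live than there are physical registers.

    def linear_scan(intervals, num_regs):
        """intervals: list of (vreg, start, end); returns vreg -> register or spill slot."""
        location = {}
        free = [f"r{i}" for i in range(num_regs)]
        active = []                                  # (end, vreg) currently held in a register
        next_slot = 0

        for vreg, start, end in sorted(intervals, key=lambda iv: iv[1]):
            # Expire intervals that ended before this one starts, freeing their registers.
            for e, v in list(active):
                if e < start:
                    active.remove((e, v))
                    free.append(location[v])
            if free:
                location[vreg] = free.pop()
                active.append((end, vreg))
            else:
                # No register available: spill the interval that ends furthest away.
                active.sort()
                spill_end, spill_vreg = active[-1]
                if spill_end > end:
                    # Steal its register; the spilled vreg lives on the stack instead.
                    location[vreg] = location[spill_vreg]
                    location[spill_vreg] = f"slot{next_slot}"
                    active[-1] = (end, vreg)
                else:
                    location[vreg] = f"slot{next_slot}"
                next_slot += 1
        return location

    # Three overlapping vregs but only two registers: one of them ends up spilled.
    print(linear_scan([("v0", 0, 10), ("v1", 2, 4), ("v2", 3, 12)], num_regs=2))

The spill heuristic here evicts the interval with the furthest end point, freeing its register for the longest time; graph-coloring allocators instead build an interference graph over all live ranges and tend to produce better assignments at higher compile-time cost.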