Lecture 17, 18, 19 - Optimizing JIT compilers

Optimizing compilers
- look a lot like a traditional compiler
- designed for better code quality
- SSA or other IR designed for optimization
- performs one or more optimizations on the IR
- full-fledged register allocator
  - linear scan, graph coloring, or another algorithm
- multiple backends for multiple architectures
- make use of profiling and type feedback
- differentiation from static compilers
  - speculative optimization and deopt
  - higher emphasis on compile speed
  - assembles bytes of machine code directly in memory

Design of an optimizing compiler
- Typical design looks like a traditional compiler
  - Frontend: parse bytecode into IR directly, or walk the AST
  - Middle: optimizations and analysis on the IR
  - Backend: generate code from the IR
- possible to skip the IR, but that falls into the baseline or single-pass design
- will focus on the basic "three ends" model

First design choices: what does the IR look like?
- Instructions and basic blocks
- Static single assignment (SSA) form
  - every variable is assigned only once in the program
  - source-level variables become multiple IR-level variables
  - phi nodes merge data flow from different branches at merge points
  - every variable is defined by exactly one instruction => the instruction *is* the value
  - a definition must *dominate* its uses
    - dominance: X dom Y if all paths from the start to Y must first pass through X
    - guarantees that a definition has been computed before any of its uses execute
  - Without SSA, many optimizations require reasoning about overwriting of variables
  - ultimately another form of abstract interpretation
- Graph-based IR, often called "sea of nodes"
  - "SSA on steroids"
  - control and effect edges represent ordering constraints
  - requires a scheduling pass before generating code

Implications of IR design:
- How easy/fast is it to build the IR?
  - transform from AST or bytecode into IR
- How easy/fast are optimizations?
  - reasoning based on facts about inputs: a forward dataflow problem
  - reasoning based on uses: a backward dataflow problem
- How easy/fast is it to generate code?
  - maybe a second IR is necessary before machine code?

- SSA construction via abstract interpretation
  - Single-pass IR construction is similar to single-pass compilation; it is a form of abstract interpretation
    - abstract values are *references* to SSA values, i.e. IR nodes
    - an instruction pops its input values off the abstract operand stack, creates a node or nodes, and pushes the node(s)
    - SSA renaming can occur in a forward pass over the locals and the operand stack
  - Same problems as baseline compilers: how to handle loops and merges

        func #1:              x 1
         +1 | local.get[0]
         +3 | i32.const[0]
         +5 | i32.lt_s
         +6 | ╭ if[i32]
            | |               x 0
         +8 | | local.get[0]
        +10 | | call[func=0]
        +12 | ╭─else
            | │
        +13 | │ local.get[0]
        +15 | │ i32.const[1]
        +17 | │ i32.sub
        +18 | ╰>end           x 1
        +19 ╰>end             x 1

  - Differences from the baseline compiler:
    - building an IR, not generating code
    - at merges, build phis for control-flow paths with differing values (see the sketch after this section)
    - no need to register allocate; build IR nodes instead
      => can afford to relax state "lazily" when reaching merges
    - a pre-pass to determine which variables are modified in loops is worthwhile
    - Wasm's structured control flow makes this easier
  - What about JVM bytecode?
    - pre-pass to determine basic blocks and loops
    - keep a count of predecessors seen
    - only construct IR when all predecessors have been seen, except at loops, where back edges won't have been seen yet
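
A minimal sketch of this single-pass SSA construction by abstract interpretation, assuming a toy stack-machine bytecode; the Node and AbstractState classes and the opcode names are illustrative, not any engine's actual API. Each instruction pops its input nodes off the abstract operand stack and pushes the node it creates; at a two-way merge, phis are built only for locals whose values differ between the two paths.

    # Sketch only: abstract values are references to IR nodes; interpreting the
    # bytecode once performs SSA renaming of the locals and the operand stack.

    class Node:
        def __init__(self, op, inputs=(), value=None):
            self.op = op                 # 'const', 'add', 'param', 'phi', ...
            self.inputs = list(inputs)   # references to other Nodes
            self.value = value           # constant payload, if any
        def __repr__(self):
            return f"{self.op}({self.value})" if self.op == 'const' else self.op

    class AbstractState:
        # Abstract frame: locals and operand stack hold references to IR nodes.
        def __init__(self, locals_):
            self.locals = list(locals_)
            self.stack = []
        def copy(self):
            s = AbstractState(self.locals)
            s.stack = list(self.stack)
            return s

    def build(code, state, nodes):
        # One forward pass: pop input nodes, create a result node, push it.
        for op, *args in code:
            if op == 'const':
                state.stack.append(Node('const', value=args[0]))
            elif op == 'local.get':
                state.stack.append(state.locals[args[0]])
            elif op == 'local.set':
                state.locals[args[0]] = state.stack.pop()
            elif op == 'add':
                b, a = state.stack.pop(), state.stack.pop()
                node = Node('add', [a, b])
                nodes.append(node)
                state.stack.append(node)
        return state

    def merge(a, b, nodes):
        # At a control-flow merge, build phis only where the two paths differ.
        merged = a.copy()
        for i, (x, y) in enumerate(zip(a.locals, b.locals)):
            if x is not y:
                phi = Node('phi', [x, y])
                nodes.append(phi)
                merged.locals[i] = phi
        return merged

    # if/else that assigns local 0 differently on each path, then merges:
    nodes = []
    entry = AbstractState([Node('param', value=0)])
    then_st = build([('const', 1), ('local.set', 0)], entry.copy(), nodes)
    else_st = build([('const', 2), ('local.set', 0)], entry.copy(), nodes)
    after = merge(then_st, else_st, nodes)   # local 0 is now phi(const 1, const 2)

A production builder must also merge the operand stack and, as noted above, benefits from a pre-pass over loops so that phis can be created for variables assigned in loop bodies before the back edge has been seen.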
  - Does it pay off to do optimizations while parsing?
    - SSA renaming of the inputs, plus abstract interpretation, provides opportunities for peephole optimizations
    - usually easy constant folding and strength reduction pay off
    - some compilers do obvious inlining as well

- Optimizations in the optimizing compiler
  - Constant folding
  - Common subexpression elimination
  - Speculative optimizations
    - speculate on types (source types, shapes)
    - speculate on numeric ranges (e.g. int, not double)
  - Load elimination
  - Type analysis and propagation
  - Devirtualization
  - Inlining
    - Needs heuristics on:
      - what to inline: monomorphic calls, IC feedback, hot, small
      - when to inline: recompilation heuristics, callsite, hotness
  - Check elimination
    - null checks, zero checks (division), bounds checks
  - Write barrier elimination
  - Dead code elimination
  - Code layout
    - hot/cold path layout

- Load elimination
      getfield [C.f] (o)
      ...                    // (no intervening stores to C.f of o)
      getfield [C.f] (o)     // redundant
- Load-store elimination
      getfield [C.f] (o)     // redundant
      ...                    // (no uses of the load)
      putfield [C.f] (o)
- Store elimination
      putfield [C.f] (o)     // redundant
      ...                    // (no intervening loads)
      putfield [C.f] (o)
- Load-store forwarding
      putfield [C.f] (o, v)
      ...
      getfield [C.f] (o)     // replace with v

- Speculation
  - Idea: incorporate information recorded during interpretation/baseline execution
    - turn previous observations into assumptions
    - insert guards into the code to check that the assumptions hold
    - optimize the code under the new assumptions
  - Usually happens fairly early in the compilation pipeline
    - helps global type analysis do better
  - What to do if assumptions fail?
    - tail-duplication (copy the merge into slow/fast paths)
    - compile a slow path (compensating code)
    - deoptimize to a lower tier (optimized code can assume the slow path never happens)

- Deoptimization
  - requires reconstructing execution frame(s) for lower tiers
  - "side-exits" or "deopt points" -- places where a guard fails and execution returns to a lower tier
  - every deopt point contains metadata or compensating code to allow reconstruction
    - maps the abstract state (e.g. locals and operand stack slots) to registers, stack slots, or constants
    - can consume a lot of space that is never used unless a deopt actually happens
  - the runtime reconstructs source frames from the metadata
  - when can deoptimization happen?
    - a failed guard check
    - the debugger is activated
    - a global invariant changed

- Lowering
  - While single-pass compilers directly generate the machine code sequence for each bytecode or AST node, optimizing compilers usually *lower* operations first
    - translate high-level operations into low-level operations within the same IR (TurboFan)
    - translate a high-level IR into a low-level IR (Crankshaft and TurboFan)

- Common subexpression elimination
  - Many programs recompute the same values repeatedly
    - especially true once many methods are inlined
    - loads of a field (load/store elimination)
    - conversions between types
    - intermediate computations (e.g. object header, vtable) used repeatedly
  - Cache previously-computed results and remove redundant computations
  - Local CSE: use a value-numbering table, aka "local value numbering" (sketched below)
  - Global CSE/GVN: often done by making use of the dominator tree
  - Partial redundancy elimination: some computations are only used on some paths
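
To make local CSE concrete, here is a small sketch of local value numbering over a single basic block, assuming a tiny three-address form with pure operators; the tuple encoding and the function name are illustrative rather than any particular compiler's IR.

    # Minimal local value numbering sketch (one basic block, pure ops only).
    # Instructions are (dest, op, operands); a recomputation of an already-seen
    # (op, operands) key is dropped and its destination forwards to the earlier result.

    def local_value_numbering(block):
        table = {}      # (op, canonical operands) -> canonical destination
        number = {}     # variable name -> its canonical definition
        out = []
        for dest, op, operands in block:
            key = (op,) + tuple(number.get(v, v) for v in operands)
            if key in table:
                # redundant computation: later uses of dest are renamed to the earlier result
                number[dest] = table[key]
            else:
                table[key] = dest
                number[dest] = dest
                out.append((dest, op, tuple(number.get(v, v) for v in operands)))
        return out

    # Example: the second add is eliminated and t3's uses forward to t1.
    block = [("t1", "add", ("a", "b")),
             ("t2", "mul", ("t1", "c")),
             ("t3", "add", ("a", "b")),
             ("t4", "mul", ("t3", "d"))]
    print(local_value_numbering(block))
    # [('t1', 'add', ('a', 'b')), ('t2', 'mul', ('t1', 'c')), ('t4', 'mul', ('t1', 'd'))]

Global value numbering applies the same table across blocks, typically by walking the dominator tree so that a value computed in a dominator can be reused by the blocks it dominates.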
- Instruction selection
  - Low-level IR often has simpler operations than are available as CPU instructions
    - or, conversely, most ISAs have more complex instructions than just load/store/arithmetic
    - examples are x86 addressing modes, arithmetic-with-immediate, load/store at an offset, etc.
  - Instruction selection takes 1 or more low-level operations and generates 1 machine instruction
  - Tree-based pattern matching
    - work on instructions in reverse order
    - try to "cover" expression trees with "tiles"
    - a "tile" corresponds to a single machine instruction (typically)
    - a tile is connected to other tiles via virtual registers (vregs)
    - an unbounded supply of virtual registers is usually assumed

- SSA deconstruction
  - need to convert SSA form into non-SSA form for execution on a real machine
  - assign phis virtual registers during instruction selection
  - typical approach: insert moves (to vregs) at the predecessors of phi locations (merges)

- Register allocation
  - typically done after instruction selection, when the machine instructions are known
  - task: map each virtual register to a machine register (a linear-scan sketch follows below)
  - constraint: two vregs that are live at the same time cannot be in the same physical register
  - too many live vregs at once: split and spill vregs to the stack frame
    - spill: the vreg lives on the stack for its entire lifetime and is only loaded just before each use
    - split: the vreg's live range is broken into shorter ranges that are (re)allocated individually
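
A minimal sketch of linear-scan register allocation over precomputed live intervals, in the spirit of the classic Poletto-Sarkar algorithm; the interval format, register names, and spill-slot naming are illustrative, and a production allocator adds live-range splitting, fixed-register constraints, and move resolution.

    # Sketch only: allocate whole live intervals to registers in order of start point,
    # spilling when more intervals are live than there are physical registers.

    def linear_scan(intervals, num_regs):
        """intervals: list of (vreg, start, end); returns vreg -> register or spill slot."""
        location = {}
        free = [f"r{i}" for i in range(num_regs)]
        active = []                                  # (end, vreg) currently held in a register
        next_slot = 0

        for vreg, start, end in sorted(intervals, key=lambda iv: iv[1]):
            # Expire intervals that ended before this one starts, freeing their registers.
            for e, v in list(active):
                if e < start:
                    active.remove((e, v))
                    free.append(location[v])
            if free:
                location[vreg] = free.pop()
                active.append((end, vreg))
            else:
                # No register available: spill the interval that ends furthest away.
                active.sort()
                spill_end, spill_vreg = active[-1]
                if spill_end > end:
                    # Steal its register; the spilled vreg lives on the stack instead.
                    location[vreg] = location[spill_vreg]
                    location[spill_vreg] = f"slot{next_slot}"
                    active[-1] = (end, vreg)
                else:
                    location[vreg] = f"slot{next_slot}"
                next_slot += 1
        return location

    # Three overlapping vregs but only two registers: one of them ends up spilled.
    print(linear_scan([("v0", 0, 10), ("v1", 2, 4), ("v2", 3, 12)], num_regs=2))

The spill heuristic here evicts the interval with the furthest end point, freeing its register for the longest time; graph-coloring allocators instead build an interference graph over all live ranges and tend to produce better assignments at higher compile-time cost.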