Lecture - Threads

- Some languages offer threads as a primitive (e.g. Java, C#)
- Other languages offer lightweight fibers (e.g. Go)
- Other languages have threads "snuck in" through libraries (e.g. Python)
- To coordinate concurrent work, synchronization mechanisms are needed
  - locks (mutexes)
  - reader/writer locks
  - monitors (reentrant locks)
  - condition variables
- Threads can be mapped 1:1 or M:N onto OS (hardware) threads
  - 1:1 mapping gets true parallelism from the OS, but can be costly in terms of memory
    - also requires the runtime to track all OS threads
  - In systems that allow native code, threads can be created "outside" the runtime and then enter it, which may require the runtime system to accommodate new threads dynamically
- User-level threads (M:N or green threading)
  - threads logically run in parallel, but the runtime multiplexes (M) stacks onto (N) OS threads
  - used in Go, where programs often have a large number of goroutines
  - can also be an implementation detail, e.g. in early JVMs or in the early Solaris threading library
  - saves memory when OS threads are expensive
  - the runtime manages stacks (or lightweight fibers or continuations) separately from OS threads
    - generally lower memory overhead than the 1:1 threading model
    - segmented stacks (cactus stacks)
    - growable stacks
  - cooperative scheduling: threads explicitly yield
  - preemptive scheduling: the runtime uses OS timers or inserts polling
    - polling is often piggy-backed on the GC safepoint polling mechanism
  - blocking I/O: the runtime has to interact carefully with the OS, e.g. by spawning additional threads
- Runtime data structures
  - thread-local allocation buffers (TLABs) let threads allocate moderate amounts of objects (e.g. hundreds of KB or a few MB) without contention
  - thread table: central registry of all threads/stacks
    - needed, e.g., to stop them for GC
  - per-object locks
  - per-class locks (e.g. Java metadata for linking)
- Java monitors
  - Every object has a reentrant mutex associated with it
  - The monitorenter and monitorexit bytecodes lock and unlock the object
    - these are inserted by the Java source compiler for synchronized blocks
    - this also happens implicitly for synchronized methods
  - A given thread can *reenter* a monitor => the monitor has a counter
  - The Java bytecode verifier checks that monitorenter/monitorexit are "balanced"
    - every control-flow path from a monitorenter to the end contains exactly one monitorexit
  - When a stack frame is unwound because of an exception, the VM implicitly unlocks all monitors the method acquired
- Java condition variables
  - Every object has wait(), notify(), and notifyAll() methods
  - A thread must own an object's monitor to call any of them
  - wait() blocks the calling thread (temporarily giving up the monitor) until another thread calls notify()
  - When wait() returns successfully, the monitor has been implicitly reacquired
- Thin locks
  - Pioneered for Java, where every object has both a reentrant monitor and a condition variable
  - reuses an existing full synchronization implementation, i.e. it is an optimization
  - the header of most objects only needs a few bits to indicate that the object has never been locked
  - In the original paper from 1998, 24 bits in the header are used:
    - 1 bit: thin or fat (if fat, the remaining 23 bits refer to an inflated lock)
    - 15 bits: thread id
    - 8 bits: locking count
  - while locked, an object knows which thread owns it
  - Assumed frequency ranking of operations:
    1. locking an unlocked object
    2. locking an object already locked by the current thread a small number of times (shallowly nested locking)
    3. locking an object already locked by the current thread many times (deeply nested locking)
    4. attempting to lock an object already locked by another thread, for which no other threads are waiting
    5. attempting to lock an object already locked by another thread, for which other threads are already waiting
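The monitor and condition-variable behavior described above can be sketched with a one-slot buffer. This is an illustrative example (the class name BoundedBuffer is invented, not from the lecture): synchronized methods implicitly bracket their bodies with monitorenter/monitorexit, and wait()/notifyAll() use the object's built-in condition variable.

```java
// Sketch: a one-slot buffer using the per-object monitor (synchronized)
// and the built-in condition variable (wait/notifyAll).
public class BoundedBuffer {
    private Object slot; // null means empty

    // synchronized method: the JVM acquires this object's monitor on entry
    // and releases it on exit (including exceptional exit).
    public synchronized void put(Object item) throws InterruptedException {
        while (slot != null) {
            wait(); // releases the monitor; reacquires it before returning
        }
        slot = item;
        notifyAll(); // wake consumers blocked in take()
    }

    public synchronized Object take() throws InterruptedException {
        while (slot == null) {
            wait();
        }
        Object item = slot;
        slot = null;
        notifyAll(); // wake producers blocked in put()
        return item;
    }

    public static void main(String[] args) throws Exception {
        BoundedBuffer buf = new BoundedBuffer();
        Thread producer = new Thread(() -> {
            try { buf.put("hello"); } catch (InterruptedException e) { }
        });
        producer.start();
        System.out.println(buf.take()); // prints "hello"
        producer.join();
    }
}
```

Note the `while` loop around wait(): the monitor may have been handed to another thread (or the wakeup may be spurious) between notify and the return from wait(), so the condition must be rechecked.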
  - locking an unlocked object requires a compare-and-swap
  - unlocking an owned object is just a store
  - locking an already-locked object (if owned by the current thread) is an increment and a store
  - the lock is inflated if the locking count overflows, or if wait() or notify() is called
  - the lock is also inflated when a second thread attempts to lock a locked object (contention)
- Biased locking
  - Often an object is only ever locked by one thread
  - By biasing an object *to a thread*, that thread can lock and unlock it without any synchronization
  - Upon contention, the VM stops the thread to which the object is biased and revokes the bias (rebiasing)
  - The implementation in the paper "Eliminating Synchronization-Related Atomic Operations with Biased Locking and Bulk Rebiasing" relies on HotSpot's lock records, which are stored on the stack
  - OpenJDK is in the process of deprecating biased locking because hardware atomics are much cheaper now
- Lock coarsening
  - static analysis (or the JIT compiler) can remove synchronization operations under certain circumstances
    - e.g. locking the same object twice in a nested fashion (e.g. due to inlining multiple synchronized methods)
    - whole-program analysis can determine that one lock dominates another
- java.util.concurrent
  - Threads, monitors, and condition variables are the base level of language support
  - java.util.concurrent provides more customized and sophisticated mechanisms
    - more specialized locks, such as reader/writer locks
    - concurrent data structures such as ConcurrentHashMap, queues, executors, futures
  - Some methods are native (i.e. intrinsified) because they need carefully-tuned machine code sequences (both for correctness and performance)
- Memory models
  - VMs and languages with threads should define a *memory model* that specifies which outcomes of concurrent memory accesses are legal
  - E.g. data races: two or more concurrent accesses to the same memory location, where at least one is a write
    - Some languages make this undefined behavior (C/C++), giving no guarantees
    - Other languages define rules, e.g. a data race cannot break the source-level type system
  - What about tearing and out-of-thin-air values?
    - Java allows long and double writes to be *torn*
    - Reference writes are atomic
    - The Java Memory Model forbids out-of-thin-air values
- Using threads to parallelize runtime services
  - Many runtime services can be made parallel and/or asynchronous
    - Parallel: parse, verify, and compile modules
    - Asynchronous: optimizing compilations, GC, I/O
  - Though JavaScript has no threads, many JS VMs use threads to make VM services go faster

https://dl.acm.org/doi/10.1145/2660193.2660210
https://dl.acm.org/doi/10.1145/2554850.2554858
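The "parallel runtime services" idea can be sketched with a thread pool. This is a hedged illustration, not a real VM's implementation: the class name, module strings, and the checksum-style verify() stand-in are all invented; a real runtime would run actual parse/verify/compile work in the tasks.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch: farming a runtime service (here, "verifying" modules) out to a
// fixed thread pool, in the spirit of parallel parse/verify/compile.
public class ParallelVerifier {
    // Stand-in for real bytecode verification: just sums the characters.
    static int verify(String module) {
        int sum = 0;
        for (char c : module.toCharArray()) sum += c;
        return sum;
    }

    static int verifyAll(List<String> modules) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (String m : modules) {
                results.add(pool.submit(() -> verify(m))); // one task per module
            }
            int total = 0;
            for (Future<Integer> f : results) {
                total += f.get(); // join: block until each task finishes
            }
            return total;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(verifyAll(List.of("a", "b", "c")));
    }
}
```

The same structure works for asynchronous services: instead of joining the futures immediately, the runtime keeps them and picks up results later (e.g. installing optimized code when a background compilation completes).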