15-213 Recitation 11: Malloc
Discussion/board notes
Author: Sol Boucher

Goal: Understand how allocator design decisions affect performance/functionality.

Think-pair-share activity: Why use the heap? And what constraints does this place on allocator implementations?
====================================================================================
Adjacency: (Need a contiguous block of addresses, as guaranteed by the C standard)
    (This makes the consequences of fragmentation severe: we might have to tell the application we failed!)
    [External fragmentation]
Flexibility: (Might use it to store any type of data)
    (This mandates using the widest possible alignment.)
    [Internal fragmentation]
Lifetime: (Need to keep data around for a long time)
    (This makes our decisions really important: each placement could haunt us for the rest of the run.)
Memory: (The application isn't required to remember how large the space is)
    (This means the application won't remind us at deallocation time, so we need to track the size as part of our metadata.)
    [Internal fragmentation]
Ordering: (Might deallocate in any order)
    (This means we need to think about how space might [or might not] be reclaimed on deallocation, rather than simply how it might be affected at allocation time.)
    [External fragmentation]
Size: (Need to store a lot of data)
    (This basically means that blocks can't be of a fixed size: we need to allow really big allocations without compromising on all allocations.)
    [External fragmentation]
Speed: (Need the space immediately)
    (The inability to perform batching means we always have to predicate decisions on incomplete knowledge, and that we want to avoid syscalls.)
    [External fragmentation]
Stability: (Need stable addresses)
    (This forces us to use a greedy approach, so there's always guaranteed to be a sequence for which we're suboptimal.)
    [External fragmentation]

Define types of fragmentation. Which of the implementation constraints contribute to each?
==========================================================================================
Internal fragmentation: w/in an allocation, due to:
    (Alignment)
    (Overhead)
    In the worst case, we have a ton of *small allocations* and accounting information begins to dominate.
External fragmentation: across allocations, due to:
    (Adjacency)
    (Ordering)
    (Size)
    (Speed)
    (Stability)
    In the worst case, we either fail to accommodate a *large allocation* or have to expand the heap.
Notice that application behavior can cause both types of fragmentation; all the implementation can do is try to minimize it.

Begin tracethrough/discussion of needed structures and design decisions
=======================================================================
Keep this on the board:
    Map of process image (text, rodata, data, bss, heap, free space, stack)
    Instance data
    Globals
    Application calls
    Linear heap layout
Use a different color for design decisions!

3 main ways to perform global block tracking
============================================
Implicit free list
Explicit free list
Segregated list (multiple explicit lists)

Develop implicit list
=====================
What's the first thing we have to do to handle malloc(8)?
1. Know whether a location is free.
    -> [Global data] Root node
    -> [Instance data] Header w/ valid bit
    ! It's the *payload* (not the header) that must be aligned, but we're going to disregard alignment altogether for now.
    * Initialize root block to invalid!
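One common way to encode such a header (in the style of the CS:APP textbook allocator) is to pack the block size together with the valid/allocated bit in a single word: block sizes are multiples of the alignment, so the low bit is always zero and free to reuse. A minimal sketch, anticipating the size field introduced in step 2 below; the names and types are illustrative, not the lab's required layout.

    #include <stddef.h>

    typedef size_t header_t;       /* one word per block header          */

    #define ALLOCATED 0x1u         /* low bit = valid/allocated flag     */

    /* Build a header from a block size and a valid bit. */
    static inline header_t pack(size_t size, int allocated) {
        return size | (allocated ? ALLOCATED : 0);
    }

    /* Recover the pieces from a header. */
    static inline size_t get_size(header_t h)     { return h & ~(header_t)ALLOCATED; }
    static inline int    is_allocated(header_t h) { return (int)(h & ALLOCATED); }

    /* e.g. initialize the root block as one big invalid (free) block. */
    static inline void init_free_block(header_t *hdr, size_t size) {
        *hdr = pack(size, 0);
    }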
How initialization happens
--------------------------
mm_init() in the lab
Normal C programs don't have to do anything at the start of main(), though...
    Kernel calls _start()---NOT main()---on new processes!

2. Know where to put the next allocation.
    -> [Instance data] Size of this allocation
        Decide whether to include the header and alignment in the size, then be consistent or bug out!
    ! I'm assuming sizes can fit in a single byte for simplicity/laziness. (Note: initial free size was 27.)
    * Initialize next block to invalid!

DESIGN: {First,Next,Best} fit
-----------------------------
First is easy to implement
Next requires remembering to wrap back to the beginning if nothing is found in the latter part
    -> [Global data] Fit start (for next fit or efficient first fit)
Next might cause more fragmentation
Best is hard to do with an implicit list
    -> [Global data] Some kind of list for efficient best fit (we'll come back to it)

Have people do malloc(3), then just add malloc(2) and malloc(5).

3. Know how to dealloc.
    * Reset valid bit
    Anything else we need to do?
Run through dealloc() of 2 then 3.

4. Know how to coalesce.
    * What happens when we dealloc() 3 then 2 instead?
    -> [Instance data] Footers w/ sizes
    (UPDATE THEM WITH FOOTERS and change their usable data sizes.)

DESIGN: When to coalesce
------------------------
Disadvantage of doing it at dealloc time? (Causes cache misses when we wouldn't have to traverse anyway.)

What happens when we try to do a malloc(6)?

5. Know when we need to expand the heap.
    -> [Global data] Current heap size (to avoid extra sbrk() calls, which are expensive due to the poor interface)

DESIGN: How much to sbrk()
--------------------------
More: Faster due to fewer syscalls
Less: Faster due to reduced space usage/swapping avoidance

Implicit pros/cons
==================
+ LEAST BOOKKEEPING/less internal fragmentation
- SEARCHES are expensive, especially for large allocs.
- Hard to implement BEST FIT efficiently/more external fragmentation
- Numerous CACHE MISSES on every allocation search
+ Makes COALESCING EASY

Changes for explicit list
=========================
+ Want to prevent the *number of active allocations* from hurting search time.
Keep a list of just the free blocks, rather than commingling free and alloc'd.
    Repurpose the start pointer to indicate the free list head node
Do we still need to store sizes? (Yes: for coalescing)
Can keep all instance data within the allocation.
    Do we need to expand the header? (No: only free nodes need pointers, so we can store them within the payload, BUT we will increase the minimum block size. The pointer layout is sketched after the pros/cons below.)
1. Changes to allocation: search the free list instead of all nodes; remember to remove from the list!
2. Changes to freeing: re-add to the list

DESIGN: Where to place?
-----------------------
FIFO (deallocated blocks aren't reused for a while)
LIFO:
    + Finding the front is constant-time regardless of double linkage
    + Fewer cache misses due to block reuse
    - Worse fragmentation
Address-ordered:
    Allows you to do away with the size footer if you keep a doubly-linked list
    Sacrifice payload space/minimum block size to reduce instance metadata
3. Changes to splitting/coalescing: remember to remove the original entry and add the new one!

Explicit pros/cons
==================
+ From above: active allocations don't affect SEARCH TIME
- Still hard to do BEST FIT
+ Fewer CACHE MISSES because searches traverse fewer nodes
- Easier to introduce bugs/reach an inconsistent state
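From the explicit-list changes above, here is a minimal sketch of where the links might live: the prev/next pointers sit inside the payload of *free* blocks only, and freed blocks are pushed LIFO onto the head of the list. The names are illustrative, and real code must also keep the size header/footer from the implicit design up to date.

    #include <stddef.h>

    typedef struct free_node {
        struct free_node *prev;    /* previous free block in the list */
        struct free_node *next;    /* next free block in the list     */
    } free_node;                   /* overlaid on a free block's payload */

    static free_node *free_list_head;   /* [Global data] repurposed start pointer */

    /* LIFO placement: push a newly freed block onto the front of the list. */
    static void free_list_push(free_node *blk) {
        blk->prev = NULL;
        blk->next = free_list_head;
        if (free_list_head != NULL)
            free_list_head->prev = blk;
        free_list_head = blk;
    }

    /* Unlink a block when it is allocated, split, or coalesced away. */
    static void free_list_remove(free_node *blk) {
        if (blk->prev != NULL)
            blk->prev->next = blk->next;
        else
            free_list_head = blk->next;
        if (blk->next != NULL)
            blk->next->prev = blk->prev;
    }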
Changes for segregated list
===========================
+ Want to efficiently implement *best fit* without driving up search time.
Keep multiple free lists, one for each size (range).
What if we run out of a particular size? (Need to search larger sizes and split.)
Better to assign size buckets linearly or logarithmically? (Determines the asymptotic time of small allocations)
    Can do a combination: linear for small sizes, then logarithmic, then a catch-all

Segregated pros/cons
====================
- Most GLOBAL OVERHEAD
+ SEARCHES can be cheaper, especially for large allocs.
- SMALL ALLOCATIONS disadvantaged as lists run out
+ Can finally reasonably implement BEST FIT
- Coalescing requires a DOUBLY-LINKED LIST to avoid searching other lists
- Most COMPLICATED to implement

Heap checker
============
Implement mm_heapcheck() early on and keep it updated. (A skeleton checker is sketched at the end of these notes.)
What are some invariants to check? (Annotate existing stuff with a new color.)
Block: (Header/footer consistent), (*Payload* aligned)
List: (Prev/next consistent between blocks?), (Blocks in list iff free), (No contiguous free blocks?), (No cycles in list?), (Segregated sizes obeyed)
Heap: (Nothing outside heap boundaries)
Any others? (e.g. Address order)

Safety and debugging
====================
Look at the slides for a pointer arithmetic reminder
Write reusable helper functions and macros to avoid duplication bugs
Include conditionally-compiled debugging code
Check invariants with assertions
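A skeleton of such a checker, covering a few of the list invariants above (no cycles, prev/next consistency, nothing outside the heap). The free_node type and the heap_lo/heap_hi globals are assumptions standing in for whatever the allocator actually tracks; a full checker would also walk every block to verify header/footer consistency, payload alignment, and that no two free blocks are contiguous.

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct free_node {
        struct free_node *prev;
        struct free_node *next;
    } free_node;

    static free_node *free_list_head;   /* head of the explicit free list */
    static void *heap_lo, *heap_hi;     /* current heap boundaries        */

    static bool in_heap(const void *p) {
        return (const char *)p >= (const char *)heap_lo
            && (const char *)p <  (const char *)heap_hi;
    }

    /* Returns true iff the free-list invariants hold; call it on entry to
     * and exit from malloc()/dealloc() while debugging. */
    static bool check_free_list(void) {
        /* 1. No cycles (Floyd's tortoise and hare), checked first so the
         *    walk below is guaranteed to terminate. */
        const free_node *slow = free_list_head, *fast = free_list_head;
        while (fast != NULL && fast->next != NULL) {
            slow = slow->next;
            fast = fast->next->next;
            if (slow == fast)
                return false;            /* cycle in the free list */
        }

        /* 2. Every node lies within the heap and the prev/next links agree. */
        const free_node *prev = NULL;
        for (const free_node *cur = free_list_head; cur != NULL; cur = cur->next) {
            if (!in_heap(cur))
                return false;            /* node outside heap boundaries */
            if (cur->prev != prev)
                return false;            /* inconsistent back link       */
            prev = cur;
        }
        return true;
    }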
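And a sketch of the conditionally-compiled debugging style mentioned above: build with -DDEBUG and the tracing and invariant checks run; build without it and they compile away, so the fast build isn't slowed down. dbg_printf/dbg_assert are common conventions here, not required names.

    #include <assert.h>
    #include <stdio.h>

    #ifdef DEBUG
    #define dbg_printf(...)  fprintf(stderr, __VA_ARGS__)
    #define dbg_assert(expr) assert(expr)
    #else
    #define dbg_printf(...)  ((void)0)
    #define dbg_assert(expr) ((void)0)
    #endif

    /* Example use inside a dealloc routine:
     *     dbg_printf("free(%p)\n", ptr);
     *     dbg_assert(check_free_list());
     */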