Floating Point Idea: |--|---------|---|---|-----------|--| -inf -norm -dn +dn +norm +inf NaN -0 +0 Let's build up a simple 7-bit format: Fractional part: n=3 bits. Viewed as fractional binary numbers this gives: 000 0 001 1/8 010 2/8 = 1/4 011 3/8 100 4/8 = 1/2 101 5/8 110 6/8 = 3/4 111 7/8 Now, let's add k=3 exponent bits with a bias of 3 (= 2^{k-1}-1) 001 1 2^{1-3} = 1/4 010 2 2^(2-3) = 1/2 011 3 2^{3-3} = 1 100 4 2^{4-3} = 2 101 5 2^{5-3} = 4 110 6 2^{6-3} = 8 Now, let's encode normalized numbers. Rule (1 + f) * 2^E 001 000 (1+0) * 1/4 = 1/4 = 8/32 001 001 (1+1/8) * 1/4 = 9/32 (Delta = 1/32) ... 001 111 (1+7/8) * 1/4 = 15/32 010 000 (1+0) * 1/2 = 1/2 = 16/32 010 001 (1+1/8) * 1/2 = 9/16 = 18/32 (Delta = 1/16 = 2/32) ... 010 111 (1+7/8) * 1/2 = 30/32 011 000 (1+0) * 1 = 1 = 32/32 011 001 (1+1/8) * 1 = 9/8 = 36/32 (Delta = 1/8 = 4/32) 011 010 (1+1/4) * 1 = 5/4 = 40/32 ... 110 000 (1+0) * 8 = 8 = 256/32 110 001 (1+1/8) * 8 = 9 = 288/32 (Delta = 1 = 32/32) 110 111 (1+7/8) * 8 = 15 = 480/32 Let's add in the denormalized. Rule (0 + f) * 2^{1-Bias} = f * 1/4 000 000 0 * 1/4 = 0 000 001 1/8 * 1/4 = 1/32 ... 000 111 7/8 * 1/4 = 7/32 (Observe smooth transition from denormalized to normalized) Other numbers: 111 000 Infinity 111 XXX (other than 000) NaN Getting numbers from bits: float b2f(unsigned int b) { union { float f; unsigned u; } b4; b4.u = b; return b4.f; } double b2d(unsigned int bl, unsigned int bh) { union { double d; unsigned u[2]; } b4; if (is_little_endian()) { b4.u[0] = bl; b4.u[1] = bh; } else { b4.u[1] = bl; b4.u[0] = bh; } return b4.d; } Sign bit: Just negates entire number. In general: k exponent bits, n fraction bits, bias = 2^{k-1} - 1 IEEE FP Examples Single Precision: k = 8 + n = 32 Bias = 2^{8-1} - 1 = 127 Double Precision k = 11 + n = 52 Bias = 2^{11-1) - 1 = 1023 Extended Double Precision k = 15 + n = 63 Bias = 2^{15-1) - 1 = 16383 Smallest Pos. Denorm Fraction 000 ... 1 1/2^{n} = 2^{-n} Exponent 000 ... 0 2^{1-Bias} = 2^{2-2^{k-1}} SP: 2^{-23} * 2^{-126} ~= 1.4 X 10^{-45} fshow 0x00000001 DP: 2^{-52} * 2^{-1022} ~= 4.9 X 10^{-324} fshow 0x00000000 0x00000001 Smallest Pos. Norm Fraction 000 ... 0 1+0 = 1 Exponent 000 ... 0 2^{1-Bias) = 2^{2-2^{k-1}} SP: 2^{-126} ~= 1.2 X 10^{-38} fshow 0x00800000 DP: 2^{-1022} ~= 2.3 X 10^{-308} fshow 0x00100000 0x00000000 One Fraction 000 ... 0 Exponent: Bias SP 0x3F800000 DP 0x3FF00000 0x00000000 FIVE Fraction 010 ... 0 Exponent: Bias + 2 SP 0x40a00000 [Bias = 129], 0 (100 0000 1) (010 0000 0000 ...) DP 0x40140000 0x00000000 [Bias = 1025], 0 (100 0000 0001) (0100 0000 ... ) Biggest Norm Fraction 111 ... 1 ~= 2 Exponent 111 ... 0 = 2^Bias SP 0x7F7FFFFF ~= 3.4 X 10^{38} DP 0x7FEFFFFF 0xFFFFFFFF ~= 1.8 X 10^{308} Infinity SP 0x7F800000 DP 0x7FF00000 0x00000000 NAN SP 0x7FFFFFFF Negatives: Just add leading one. Rounding: Round to even: make trailing bit be 0, but only when exactly on boundary Examples: Round 4 bits to two 01.11 = 1 3/4 --> 10.00 = 2 (up) 01.01 = 1 1/4 --> 01.00 = 1 (down) 01.10 = 1 1/2 --> 10.00 = 2 (even ==> up) 00.11 --> 01 (up) 00.01 --> 00 (down) 00.10 --> 00 (even ==> down) Rule for rounding: Perform precise computation and then round result. FLOATING-POINT PUZZLES x+(y-z) == (x+y)-z print (3.14+1e20)-1e20 --> 0 print 3.14+(1e20-1e20) --> 3.1400000000000001 Try 1e5, 1e13 1e15 1e16 1e17 x*x >= 0? print 1e154*1e154 --> 1e308 print -1e154*-1e154 --> 1e308 print 1e155*1e155 --> Inf print 1e-161 * 1e-161 --> 9.8813129168249309e-323 print 1e-162 * 1e-162 --> 0 Puzzles x == (int)(float) x No: 24 bit significand x == (int)(double) x Yes: 53 bit significand f == (float)(double) f Yes: increases precision d == (float) d No: loses precision f == -(-f); Yes: Just change sign bit 2/3 == 2/3.0 No: 2/3 == 0 d < 0.0 ==> ((d*2) < 0.0) Yes! d > f ==> -f > -d Yes! d * d >= 0.0 Yes! (d+f)-d == f No: Not associative