Class 01a Notes

Great Realities of Computer Science

Reality #1: Ints are not Integers, Floats are not Reals

INTEGER PUZZLES

Is x * x >= 0 always true?
    print 5000 * 5000     --> 25000000
    print 50000 * 50000   --> -1794967296

Is x*(y*z) == (x*y)*z?
    print 100*(10000*1000000) --> -727379968
    print (100*10000)*1000000 --> -727379968
    print 10000*1000000       --> 1410065408
    print 1410065408*100      --> -727379968

FLOATING-POINT PUZZLES

Is x+(y-z) == (x+y)-z?
    print (3.14+1e20)-1e20 --> 0
    print 3.14+(1e20-1e20) --> 3.1400000000000001
    (Try 1e5, 1e13, 1e15, 1e16, 1e17 in place of 1e20.)

Is x*x >= 0?
    print 1e154*1e154     --> 1e308
    print -1e154*-1e154   --> 1e308
    print 1e155*1e155     --> Inf
    print 1e-161 * 1e-161 --> 9.8813129168249309e-323
    print 1e-162 * 1e-162 --> 0

Lessons:
- Finite precision of representation leads to unexpected behavior.
- It's not black magic. There are underlying principles.
- Critical for writing reliable code.
- Compiler writers must be especially careful.
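A runnable version of these puzzles, added here as a study aid (not from the lecture). It assumes 32-bit ints and IEEE-754 doubles, which is what the printed values above imply, and it keeps the operands in variables so the compiler doesn't fold the overflows at compile time:

/* puzzles.c: reproduces the Reality #1 puzzles in C.
 * A sketch, not lecture code. Assumes 32-bit int and IEEE-754
 * double. Signed overflow is formally undefined behavior in C;
 * on typical two's-complement hardware it wraps mod 2^32 as shown.
 * Build: gcc -o puzzles puzzles.c */
#include <stdio.h>

int main(void)
{
    int a = 50000, b = 10000, c = 1000000;

    printf("%d\n", a * a);          /* -1794967296: 2.5e9 wraps   */
    printf("%d\n", 100 * (b * c));  /* -727379968                 */
    printf("%d\n", (100 * b) * c);  /* -727379968: same value,
                                       grouping doesn't matter    */

    double x = 3.14, big = 1e20;
    printf("%.17g\n", (x + big) - big); /* 0: 3.14 lost in rounding */
    printf("%.17g\n", x + (big - big)); /* 3.1400000000000001       */

    printf("%g\n", 1e154 * 1e154);   /* 1e+308: barely fits a double */
    printf("%g\n", 1e155 * 1e155);   /* inf: overflow                */
    printf("%g\n", 1e-162 * 1e-162); /* 0: underflow past subnormals */
    return 0;
}

Note the asymmetry in the lessons: two's-complement multiplication really is associative (both groupings print -727379968), it just overflows; floating-point keeps x*x >= 0 but gives up associativity.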
Reality #2: You've got to know assembly

- It's the window between your code & the system.
- Performance tuning.
- Implementing systems.
- Debugging.

double fun(int i)
{
    volatile double d[1] = {3.14};
    volatile long int a[2];
    a[i] = 1073741824; /* Possibly out of bounds */
    return d[0];
}

refbug: Try with i = 0, 1, 2, 3, 4.
(i = 0 or 1 stays inside a; i = 2 or 3 silently corrupts d, so the
function no longer returns 3.14; i = 4 clobbers the saved state and
typically crashes the program later.)

What's going on (stack frame, addresses relative to %ebp):

    +--------------------+
    |    Saved State     |
    +--------------------+ <-- %ebp
    |     d7 ... d4      |
    +--------------------+
    |     d3 ... d0      |
    +--------------------+ <-- %ebp - 8
    |        a[1]        |
    +--------------------+
    |        a[0]        |
    +--------------------+ <-- %ebp - 16

How'd you figure that out? Read the assembly:

_fun:
    pushl   %ebp
    movl    $1073741824, %edx
    movl    %esp, %ebp
    subl    $16, %esp
    movl    8(%ebp), %eax
    fldl    LC0                    # LC0 holds the constant 3.14
    fstpl   -8(%ebp)               # see that d starts at %ebp-8
    movl    %edx, -16(%ebp,%eax,4) # see that a starts at %ebp-16
    fldl    -8(%ebp)
    leave
    ret

Reality #3: Memory Matters

Matrix copying code (a sketch of a timing harness for this
experiment appears at the end of these notes):

void copyij(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (i = 0; i < 2048; i++)
        for (j = 0; j < 2048; j++)
            dst[i][j] = src[i][j];
}

67,545,663 cycles (CPE = 16, where CPE is cycles per element copied)
(On a 2 GHz machine, how long does that take? Answer: ~33 ms.)

void copyji(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (j = 0; j < 2048; j++)
        for (i = 0; i < 2048; i++)
            dst[i][j] = src[i][j];
}

1,566,311,479 cycles (CPE = 373.4)
(~780 ms total, i.e. >20X slower)

Explain why: a[10][10] is laid out in memory row by row:

a[0][0] a[0][1] ... a[0][9] a[1][0] ... a[1][9] ... a[9][0] ... a[9][9]

copyij accesses memory in linear fashion, copyji in strided fashion.
Bad locality; a bad point on the memory mountain.

Reality #4: Big-Oh is nice, but don't forget about constants

Example: Compute the product of the elements in a vector.

typedef struct {
    int len;
    long int *data;
} vec_rec, *vec_ptr;

(Minimal sketches of the helper functions vec_length,
get_vec_element, and get_vec_start appear at the end of these notes.)

Nice abstract code:

void abs_combine(vec_ptr v, long int *dest)
{
    int i;
    *dest = 1;
    for (i = 0; i < vec_length(v); i++) {
        long int val;
        get_vec_element(v, i, &val);
        *dest = *dest * val;
    }
}

Performance: 20.6 cycles/element on laptop, 32 on server.

Take away the data abstractions:

void direct_combine(vec_ptr v, long int *dest)
{
    int i;
    int length = vec_length(v);
    long int *data = get_vec_start(v);
    long int x = 1;
    for (i = 0; i < length; i++) {
        x = x * data[i];
    }
    *dest = x;
}

Performance: 4 cycles/element on laptop, 10 on server
(limited by the delay through the multiplier).

void parallel_combine(vec_ptr v, long int *dest)
{
    int length = vec_length(v);
    int limit = length - 7;
    long int *data = get_vec_start(v);
    long int x = 1;
    int i;
    /* Combine 8 elements at a time */
    for (i = 0; i < limit; i += 8) {
        long int t1 = data[i]   * data[i+1];
        long int t2 = data[i+2] * data[i+3];
        long int u1 = t1 * t2;
        long int t3 = data[i+4] * data[i+5];
        long int t4 = data[i+6] * data[i+7];
        long int u2 = t3 * t4;
        x = x * (u1 * u2);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x * data[i];
    }
    *dest = x;
}

Performance: 1.15 cycles/element on laptop, 1.75 on server.
Overall improvement: ~18X.

Inner loop of abs_combine:

L21:
    movl    %ebx, 4(%esp)
    leal    -16(%ebp), %eax
    incl    %ebx
    movl    %eax, 8(%esp)
    movl    %edi, (%esp)
    call    _get_vec_element   # calls get_vec_element every iteration
    movl    -16(%ebp), %eax
    movl    (%esi), %edx
    imull   %edx, %eax
    movl    %eax, (%esi)
    movl    %edi, (%esp)
    call    _vec_length        # calls vec_length every iteration
    cmpl    %ebx, %eax
    jg      L21

Inner loop of direct_combine:

L30:
    movl    (%eax,%edx,4), %ebx
    incl    %edx
    imull   %ebx, %ecx
    cmpl    %esi, %edx
    jl      L30

Reality #5: Computers don't just run programs

- They read & write data through I/O devices.
- They are connected to the world via networks.

This leads to many system-level issues:
- Concurrency
- Unreliable media (e.g., dropped packets)
- Cross-platform compatibility
- Protocol standards

We'll learn about the OS features that support concurrency, process
management, and exceptional conditions.

Generalities:
- Programmer-centric, rather than builder-centric.

Quiz Problem

Think powers of 2: 2^10 = 1024 ~= 1000 = 10^3.

Example: How big is 2^32? Answer: 2^30 ~= 10^9, so 2^32 ~= 4 billion.

0. Question: How many seconds are there in a year?
   Answer: 3600 * 24 * 365 ~= 2^12 * 2^4 * 2^9 = 2^25 ~= 32 million.
   (Precise answer: 31,536,000.)

Function: int fun(long int x);
Run on a 4 GHz machine. Each call to fun takes 64 (2^6) clock cycles.

1. How long to run fun exhaustively, on every input, when long ints
   are 32 bits?
   Answer: 4 GHz ~= 2^32 cycles/second. Each call takes 2^6 cycles,
   so 2^26 calls/second. 2^32 calls / 2^26 calls/second = 2^6 = 64
   seconds ~= 1 minute.

2. How long to run exhaustively when long ints are 64 bits?
   Answer: Now we need 2^64 / 2^26 = 2^38 seconds ~= 2^13 years
   ~= 8 millennia.
   World's fastest supercomputer: ~60,000X faster than a single
   processor. 60,000 ~= 2^6 * 2^10 = 2^16, so the time drops to
   2^38 / 2^16 = 2^22 seconds ~= 7 weeks.

Lessons:
- 2^32 evaluations is manageable.
- 2^64 is not, even for a supercomputer.
- And that's just to exhaustively run a single function.
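A quick sketch that redoes the quiz arithmetic with exact numbers instead of powers-of-2 estimates (my addition; the 4 GHz and 2^6-cycles-per-call figures come from the problem statement above):

/* quizcheck.c: exact versions of the powers-of-2 estimates above.
 * A sketch; the 4 GHz and 64-cycle figures are from the problem
 * statement. Build: gcc -o quizcheck quizcheck.c */
#include <stdio.h>

int main(void)
{
    double hz = 4e9;                  /* 4 GHz ~= 2^32 cycles/second */
    double calls_per_sec = hz / 64.0; /* ~2^26 calls/second */

    double t32 = 4294967296.0 / calls_per_sec;           /* 2^32 inputs */
    printf("32-bit: %.0f seconds\n", t32);               /* ~69 s */

    double t64 = 18446744073709551616.0 / calls_per_sec; /* 2^64 inputs */
    printf("64-bit: %.0f years\n", t64 / 31536000.0);    /* ~9,400 years */

    printf("supercomputer: %.0f days\n", t64 / 60000.0 / 86400.0);
    /* ~57 days: far faster, still nowhere near practical */
    return 0;
}

The exact answers (~69 seconds, ~9,400 years, ~57 days) all land within roughly 20% of the powers-of-2 estimates, which is exactly why the estimation trick is worth knowing.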
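Appendix: the timing harness promised in Reality #3. This is a reconstruction, not the lecture's measurement code; it reports CPU seconds via clock() rather than counting cycles, so the absolute numbers will differ from those above, but the large copyij/copyji gap should still show. Build with a modest optimization level (e.g., gcc -O1) so the compiler doesn't restructure the loops:

/* copytime.c: times the two copy orders from Reality #3.
 * A sketch; clock() gives CPU seconds, not cycle counts. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 2048

static int (*src)[N], (*dst)[N];

static void copyij(int src[N][N], int dst[N][N])
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            dst[i][j] = src[i][j];  /* row order: linear accesses */
}

static void copyji(int src[N][N], int dst[N][N])
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            dst[i][j] = src[i][j];  /* column order: 8 KB stride */
}

static double seconds(void (*copy)(int [N][N], int [N][N]))
{
    clock_t t0 = clock();
    copy(src, dst);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    src = malloc(sizeof(int[N][N]));
    dst = malloc(sizeof(int[N][N]));
    if (src == NULL || dst == NULL)
        return 1;
    /* Touch all pages up front so page faults don't skew the timing */
    memset(src, 1, sizeof(int[N][N]));
    memset(dst, 0, sizeof(int[N][N]));
    printf("copyij: %.3f s\n", seconds(copyij));
    printf("copyji: %.3f s\n", seconds(copyji));
    free(src);
    free(dst);
    return 0;
}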
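Appendix: the vector helpers assumed by the Reality #4 code. The lecture never shows them; this is a minimal guess at the interface implied by the call sites, and the real code may differ (e.g., in bounds-check policy):

/* vec.c: minimal versions of the vector helpers assumed above.
 * Interface inferred from the call sites; a sketch only. */
#include <stdio.h>

typedef struct {
    int len;
    long int *data;
} vec_rec, *vec_ptr;

int vec_length(vec_ptr v)
{
    return v->len;
}

/* Copy element i into *dest; fail cleanly if i is out of bounds.
 * This per-access bounds check, plus the call overhead, is much of
 * why abs_combine costs ~20 cycles/element. */
int get_vec_element(vec_ptr v, int i, long int *dest)
{
    if (i < 0 || i >= v->len)
        return 0;
    *dest = v->data[i];
    return 1;
}

/* Expose the raw array: the abstraction-breaking step that
 * direct_combine and parallel_combine rely on. */
long int *get_vec_start(vec_ptr v)
{
    return v->data;
}

/* Tiny driver so the file stands alone. */
int main(void)
{
    long int data[4] = {2, 3, 5, 7};
    vec_rec r = {4, data};
    long int val, x = 1;
    int i;
    for (i = 0; i < vec_length(&r); i++) {
        get_vec_element(&r, i, &val);
        x = x * val;
    }
    printf("product = %ld\n", x);  /* 210 */
    return 0;
}

Whatever the real implementation looks like, the point of the comparison survives: abs_combine pays for a procedure call and a bounds check on every element, and the optimized versions simply refuse to pay it.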