Class 01a Notes

Great Realities of Computer Science

Reality #1: Ints are not Integers, Floats are not Reals

INTEGER PUZZLES

Is x * x >= 0 always true?
    print 5000 * 5000     --> 25000000
    print 50000 * 50000   --> -1794967296

Is x*(y*z) == (x*y)*z?
    print 100*(10000*1000000) --> -727379968
    print (100*10000)*1000000 --> -727379968
    print 10000*1000000       --> 1410065408
    print 1410065408*100      --> -727379968

FLOATING-POINT PUZZLES

Is x+(y-z) == (x+y)-z?
    print (3.14+1e20)-1e20 --> 0
    print 3.14+(1e20-1e20) --> 3.1400000000000001
    (Try 1e5, 1e13, 1e15, 1e16, 1e17 in place of 1e20.)

Is x*x >= 0?
    print 1e154*1e154     --> 1e308
    print -1e154*-1e154   --> 1e308
    print 1e155*1e155     --> Inf
    print 1e-161 * 1e-161 --> 9.8813129168249309e-323
    print 1e-162 * 1e-162 --> 0

Lessons:
- Finite precision of representation leads to unexpected behavior.
- It's not black magic. There are underlying principles.
- Critical for writing reliable code.
- Compiler writers must be especially careful.
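A runnable version of these puzzles, added here as a study aid (not from the lecture). It assumes 32-bit ints and IEEE-754 doubles, which is what the printed values above imply, and it keeps the operands in variables so the compiler doesn't fold the overflows at compile time:

/* puzzles.c: reproduces the Reality #1 puzzles in C.
 * A sketch, not lecture code. Assumes 32-bit int and IEEE-754
 * double. Signed overflow is formally undefined behavior in C;
 * on typical two's-complement hardware it wraps mod 2^32 as shown.
 * Build: gcc -o puzzles puzzles.c */
#include <stdio.h>

int main(void)
{
    int a = 50000, b = 10000, c = 1000000;

    printf("%d\n", a * a);          /* -1794967296: 2.5e9 wraps   */
    printf("%d\n", 100 * (b * c));  /* -727379968                 */
    printf("%d\n", (100 * b) * c);  /* -727379968: same value,
                                       grouping doesn't matter    */

    double x = 3.14, big = 1e20;
    printf("%.17g\n", (x + big) - big); /* 0: 3.14 lost in rounding */
    printf("%.17g\n", x + (big - big)); /* 3.1400000000000001       */

    printf("%g\n", 1e154 * 1e154);   /* 1e+308: barely fits a double */
    printf("%g\n", 1e155 * 1e155);   /* inf: overflow                */
    printf("%g\n", 1e-162 * 1e-162); /* 0: underflow past subnormals */
    return 0;
}

Note the asymmetry in the lessons: two's-complement multiplication really is associative (both groupings print -727379968), it just overflows; floating-point keeps x*x >= 0 but gives up associativity.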
Reality #2: You've got to know assembly

- It's the window between your code & the system.
- Performance tuning.
- Implementing systems.
- Debugging.

double fun(int i)
{
    volatile double d[1] = {3.14};
    volatile long int a[2];
    a[i] = 1073741824; /* Possibly out of bounds */
    return d[0];
}

refbug: Try with i = 0, 1, 2, 3, 4.
(i = 0 or 1 stays inside a; i = 2 or 3 silently corrupts d, so the
function no longer returns 3.14; i = 4 clobbers the saved state and
typically crashes the program later.)

What's going on (stack frame, addresses relative to %ebp):

    +--------------------+
    |    Saved State     |
    +--------------------+ <-- %ebp
    |     d7 ... d4      |
    +--------------------+
    |     d3 ... d0      |
    +--------------------+ <-- %ebp - 8
    |        a[1]        |
    +--------------------+
    |        a[0]        |
    +--------------------+ <-- %ebp - 16

How'd you figure that out? Read the assembly:

_fun:
    pushl   %ebp
    movl    $1073741824, %edx
    movl    %esp, %ebp
    subl    $16, %esp
    movl    8(%ebp), %eax
    fldl    LC0                    # LC0 holds the constant 3.14
    fstpl   -8(%ebp)               # see that d starts at %ebp-8
    movl    %edx, -16(%ebp,%eax,4) # see that a starts at %ebp-16
    fldl    -8(%ebp)
    leave
    ret

Reality #3: Memory Matters

Matrix copying code (a sketch of a timing harness for this
experiment appears at the end of these notes):

void copyij(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (i = 0; i < 2048; i++)
        for (j = 0; j < 2048; j++)
            dst[i][j] = src[i][j];
}

67,545,663 cycles (CPE = 16, where CPE is cycles per element copied)
(On a 2 GHz machine, how long does that take? Answer: ~33 ms.)

void copyji(int src[2048][2048], int dst[2048][2048])
{
    int i, j;
    for (j = 0; j < 2048; j++)
        for (i = 0; i < 2048; i++)
            dst[i][j] = src[i][j];
}

1,566,311,479 cycles (CPE = 373.4)
(~780 ms total, i.e. >20X slower)

Explain why: a[10][10] is laid out in memory row by row:

a[0][0] a[0][1] ... a[0][9] a[1][0] ... a[1][9] ... a[9][0] ... a[9][9]

copyij accesses memory in linear fashion, copyji in strided fashion.
Bad locality; a bad point on the memory mountain.

Reality #4: Big-Oh is nice, but don't forget about constants

Example: Compute the product of the elements in a vector.

typedef struct {
    int len;
    long int *data;
} vec_rec, *vec_ptr;

(Minimal sketches of the helper functions vec_length,
get_vec_element, and get_vec_start appear at the end of these notes.)

Nice abstract code:

void abs_combine(vec_ptr v, long int *dest)
{
    int i;
    *dest = 1;
    for (i = 0; i < vec_length(v); i++) {
        long int val;
        get_vec_element(v, i, &val);
        *dest = *dest * val;
    }
}

Performance: 20.6 cycles/element on laptop, 32 on server.

Take away the data abstractions:

void direct_combine(vec_ptr v, long int *dest)
{
    int i;
    int length = vec_length(v);
    long int *data = get_vec_start(v);
    long int x = 1;
    for (i = 0; i < length; i++) {
        x = x * data[i];
    }
    *dest = x;
}

Performance: 4 cycles/element on laptop, 10 on server
(limited by the delay through the multiplier).

void parallel_combine(vec_ptr v, long int *dest)
{
    int length = vec_length(v);
    int limit = length - 7;
    long int *data = get_vec_start(v);
    long int x = 1;
    int i;
    /* Combine 8 elements at a time */
    for (i = 0; i < limit; i += 8) {
        long int t1 = data[i]   * data[i+1];
        long int t2 = data[i+2] * data[i+3];
        long int u1 = t1 * t2;
        long int t3 = data[i+4] * data[i+5];
        long int t4 = data[i+6] * data[i+7];
        long int u2 = t3 * t4;
        x = x * (u1 * u2);
    }
    /* Finish any remaining elements */
    for (; i < length; i++) {
        x = x * data[i];
    }
    *dest = x;
}

Performance: 1.15 cycles/element on laptop, 1.75 on server.
Overall improvement: ~18X.

Inner loop of abs_combine:

L21:
    movl    %ebx, 4(%esp)
    leal    -16(%ebp), %eax
    incl    %ebx
    movl    %eax, 8(%esp)
    movl    %edi, (%esp)
    call    _get_vec_element   # calls get_vec_element every iteration
    movl    -16(%ebp), %eax
    movl    (%esi), %edx
    imull   %edx, %eax
    movl    %eax, (%esi)
    movl    %edi, (%esp)
    call    _vec_length        # calls vec_length every iteration
    cmpl    %ebx, %eax
    jg      L21

Inner loop of direct_combine:

L30:
    movl    (%eax,%edx,4), %ebx
    incl    %edx
    imull   %ebx, %ecx
    cmpl    %esi, %edx
    jl      L30

Reality #5: Computers don't just run programs

- They read & write data through I/O devices.
- They are connected to the world via networks.

This leads to many system-level issues:
- Concurrency
- Unreliable media (e.g., dropped packets)
- Cross-platform compatibility
- Protocol standards

We'll learn about the OS features that support concurrency, process
management, and exceptional conditions.

Generalities:
- Programmer-centric, rather than builder-centric.

Quiz Problem

Think powers of 2: 2^10 = 1024 ~= 1000 = 10^3.

Example: How big is 2^32? Answer: 2^30 ~= 10^9, so 2^32 ~= 4 billion.

0. Question: How many seconds are there in a year?
   Answer: 3600 * 24 * 365 ~= 2^12 * 2^4 * 2^9 = 2^25 ~= 32 million.
   (Precise answer: 31,536,000.)

Function: int fun(long int x);
Run on a 4 GHz machine. Each call to fun takes 64 (2^6) clock cycles.

1. How long to run fun exhaustively, on every input, when long ints
   are 32 bits?
   Answer: 4 GHz ~= 2^32 cycles/second. Each call takes 2^6 cycles,
   so 2^26 calls/second. 2^32 calls / 2^26 calls/second = 2^6 = 64
   seconds ~= 1 minute.

2. How long to run exhaustively when long ints are 64 bits?
   Answer: Now we need 2^64 / 2^26 = 2^38 seconds ~= 2^13 years
   ~= 8 millennia.
   World's fastest supercomputer: ~60,000X faster than a single
   processor. 60,000 ~= 2^6 * 2^10 = 2^16, so the time drops to
   2^38 / 2^16 = 2^22 seconds ~= 7 weeks.

Lessons:
- 2^32 evaluations is manageable.
- 2^64 is not, even for a supercomputer.
- And that's just to exhaustively run a single function.
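A quick sketch that redoes the quiz arithmetic with exact numbers instead of powers-of-2 estimates (my addition; the 4 GHz and 2^6-cycles-per-call figures come from the problem statement above):

/* quizcheck.c: exact versions of the powers-of-2 estimates above.
 * A sketch; the 4 GHz and 64-cycle figures are from the problem
 * statement. Build: gcc -o quizcheck quizcheck.c */
#include <stdio.h>

int main(void)
{
    double hz = 4e9;                  /* 4 GHz ~= 2^32 cycles/second */
    double calls_per_sec = hz / 64.0; /* ~2^26 calls/second */

    double t32 = 4294967296.0 / calls_per_sec;           /* 2^32 inputs */
    printf("32-bit: %.0f seconds\n", t32);               /* ~69 s */

    double t64 = 18446744073709551616.0 / calls_per_sec; /* 2^64 inputs */
    printf("64-bit: %.0f years\n", t64 / 31536000.0);    /* ~9,400 years */

    printf("supercomputer: %.0f days\n", t64 / 60000.0 / 86400.0);
    /* ~57 days: far faster, still nowhere near practical */
    return 0;
}

The exact answers (~69 seconds, ~9,400 years, ~57 days) all land within roughly 20% of the powers-of-2 estimates, which is exactly why the estimation trick is worth knowing.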
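Appendix: the timing harness promised in Reality #3. This is a reconstruction, not the lecture's measurement code; it reports CPU seconds via clock() rather than counting cycles, so the absolute numbers will differ from those above, but the large copyij/copyji gap should still show. Build with a modest optimization level (e.g., gcc -O1) so the compiler doesn't restructure the loops:

/* copytime.c: times the two copy orders from Reality #3.
 * A sketch; clock() gives CPU seconds, not cycle counts. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define N 2048

static int (*src)[N], (*dst)[N];

static void copyij(int src[N][N], int dst[N][N])
{
    int i, j;
    for (i = 0; i < N; i++)
        for (j = 0; j < N; j++)
            dst[i][j] = src[i][j];  /* row order: linear accesses */
}

static void copyji(int src[N][N], int dst[N][N])
{
    int i, j;
    for (j = 0; j < N; j++)
        for (i = 0; i < N; i++)
            dst[i][j] = src[i][j];  /* column order: 8 KB stride */
}

static double seconds(void (*copy)(int [N][N], int [N][N]))
{
    clock_t t0 = clock();
    copy(src, dst);
    return (double)(clock() - t0) / CLOCKS_PER_SEC;
}

int main(void)
{
    src = malloc(sizeof(int[N][N]));
    dst = malloc(sizeof(int[N][N]));
    if (src == NULL || dst == NULL)
        return 1;
    /* Touch all pages up front so page faults don't skew the timing */
    memset(src, 1, sizeof(int[N][N]));
    memset(dst, 0, sizeof(int[N][N]));
    printf("copyij: %.3f s\n", seconds(copyij));
    printf("copyji: %.3f s\n", seconds(copyji));
    free(src);
    free(dst);
    return 0;
}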
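Appendix: the vector helpers assumed by the Reality #4 code. The lecture never shows them; this is a minimal guess at the interface implied by the call sites, and the real code may differ (e.g., in bounds-check policy):

/* vec.c: minimal versions of the vector helpers assumed above.
 * Interface inferred from the call sites; a sketch only. */
#include <stdio.h>

typedef struct {
    int len;
    long int *data;
} vec_rec, *vec_ptr;

int vec_length(vec_ptr v)
{
    return v->len;
}

/* Copy element i into *dest; fail cleanly if i is out of bounds.
 * This per-access bounds check, plus the call overhead, is much of
 * why abs_combine costs ~20 cycles/element. */
int get_vec_element(vec_ptr v, int i, long int *dest)
{
    if (i < 0 || i >= v->len)
        return 0;
    *dest = v->data[i];
    return 1;
}

/* Expose the raw array: the abstraction-breaking step that
 * direct_combine and parallel_combine rely on. */
long int *get_vec_start(vec_ptr v)
{
    return v->data;
}

/* Tiny driver so the file stands alone. */
int main(void)
{
    long int data[4] = {2, 3, 5, 7};
    vec_rec r = {4, data};
    long int val, x = 1;
    int i;
    for (i = 0; i < vec_length(&r); i++) {
        get_vec_element(&r, i, &val);
        x = x * val;
    }
    printf("product = %ld\n", x);  /* 210 */
    return 0;
}

Whatever the real implementation looks like, the point of the comparison survives: abs_combine pays for a procedure call and a bounds check on every element, and the optimized versions simply refuse to pay it.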