15-213 Lecture 1 (January 17, 2002)

Welcome
Welcome to 15-213. We're really excited to be here, and we hope that you are, too. Randy Bryant and I (Gregory Kesden) will be co-teaching the course this semester. I'm particularly excited, because it is my first time teaching the course.

I've taught the OS course as either the lab instructor or instructor for five of the last six semesters. And, in that time, I have really observed the impact of 15-213. We now require it as a prerequisite for the other systems classes -- and with good reason. The impact has been tremendous. I'm very excited to be here, because I'll get to see what is "inside of the box" and how the magic happens.

So, Why Are We Here?

So far in the curriculum, you have studied computer science from the perspective of many different and powerful abstractions. You have studied High Level Languages (HLLs) such as Java. You have learned to specify and use Abstract Data Types (ADTs), such as classes. You have also studied powerful generic algorithms for solving problems, and learned to analyze their performance through amortized analysis and upper-bound (asymptotic) complexity analysis.

No doubt about it, through abstraction you have learned, and will continue to learn, many tremendous things. But, what does abstraction mean?

Abstraction: From Latin, ab-stractus and ab-strahere. The prefix ab- means "from or away" and the roots "-stractus" and "-strahere" mean "to drag or pull".

Abstract (V) means "to drag or pull away". And Abstraction (N) is what is left after something has been pulled away and isolated from its own, or any, real circumstances.

That's what we've done so far. We've dragged important ideas away from some distracting realities to make them clearer. But, in the real world, the ugly details sometimes matter. That's why we're here. We're here to learn about real computers -- the machines themselves, with all of their real-world parts and esoteric features.

Let's take a few minutes to look at some examples of real-world details that are inconsistent with common abstractions. I think these examples will illustrate at least a few good reasons to study the inner workings of real-world hardware and software systems.

"ints" are not Integers and "floats" are not Reals

In the abstract, "ints" are just like Integers. But, in reality, nothing is infinite. All of our data types have finite capacities. If an "int" gets too big for its storage, it no longer models an integer. Its value may change erratically -- and may even become negative. This is what we call "overflow".
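
Here's a minimal sketch of what that looks like in C (assuming 32-bit ints, which is typical on our machines):

    #include <stdio.h>
    #include <limits.h>

    int main(void) {
        int x = INT_MAX;      /* largest representable int: 2147483647 when int is 32 bits */
        int y = x + 1;        /* the mathematical result no longer fits */

        printf("%d + 1 = %d\n", x, y);   /* typically prints "2147483647 + 1 = -2147483648" */
        return 0;
    }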

Many people dismiss overflow and blame the problem on the machine. But the machine is not at fault -- it is functioning correctly. Instead, the problem is that the abstraction doesn't exactly match the reality of the machine.

Here's another one. I coach our team for the world-wide ACM programming competition. Although it was slightly before my time, the team actually lost the world finals in Amsterdam because of a problem like this. And last year, when we finished 20-something in the world, we missed out on a top-10 standing because of an error like this.

Here's the question:

(x + y) + z ?= x + (y + z)

Almost everyone in the class immediately reacted "yes" -- and a few cited the "associative property of addition". But, this isn't necessarily the case if x, y, and z are floating point numbers.

Floating point numbers are designed to make the best use they can of limited storage. We'll take a look at the details very soon. But for now, let's think of things this way. If the number is very small, the decimal point floats left and the bulk of the space is used to store the fraction with great precision. If the number is very large, the decimal point floats right, and the space is used to track the integer component of the number. As the decimal point moves around with computation, "rounding error" is introduced.

In the example above, we can run into problems if the numbers are of vastly different sizes. Due to "rounding error", x and y might not cancel unless they are added to each other directly. The left side and the right side may end up with different values.
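
Here's a small sketch of the failure (the exact output depends on the machine's floating point format, but results like these are typical):

    #include <stdio.h>

    int main(void) {
        float x = 1e20f;     /* a huge value  */
        float y = -1e20f;    /* its negation  */
        float z = 3.14f;     /* a small value */

        /* x and y cancel exactly when added to each other first... */
        printf("(x + y) + z = %f\n", (x + y) + z);   /* typically 3.140000 */

        /* ...but y + z rounds back to -1e20, and the 3.14 is lost. */
        printf("x + (y + z) = %f\n", x + (y + z));   /* typically 0.000000 */
        return 0;
    }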

This is why we tell the programming team that they are never allowed to use == with floats or doubles. Instead, they should subtract and check that the absolute value of the difference is less than some small epsilon. This makes sure that "noise" doesn't become meaningful.
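
In code, the test looks something like this (a sketch -- the right epsilon depends on the problem, and 1e-6 here is just a placeholder):

    #include <math.h>

    #define EPSILON 1e-6

    /* Treat a and b as equal if they differ by less than EPSILON. */
    int nearly_equal(double a, double b) {
        return fabs(a - b) < EPSILON;
    }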

I also experienced a related problem when doing research with neural networks. The decision to implement the network using floating point numbers instead of integers led to a problem with rounding error (almost) dominating the computation. We fixed it by scaling and normalizing our input to a range that suffered less rounding...but the real solution (pardon the pun) would have been to recognize the problem and use ints. For values within a known, fixed range, ints can represent the numbers just as well, but don't have the rounding problems.

Memory Is Complex, and It Comes in All Different Shapes, Sizes, and Styles

Computers have different types of storage: disk, main memory, L2 cache, L1 cache, and registers. They work together, and in the common case they can appear to form a nearly-ideal storage system. But, in certain edge cases, things may fall apart. Although caches, small amounts of very fast memory, are designed to hold onto the data that is most critically needed, in reality the process is speculative and reactive, not perfect. Bad memory access patterns can ruin a cache's effectiveness. You'll see this in the optimization lab.
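
As a preview, here's the kind of thing that matters (a sketch -- the array size is arbitrary, but the access pattern is the classic example):

    #define N 1024

    double a[N][N];

    /* Good: visits memory in the order it is laid out, so each cache line
       that is fetched gets fully used before it is evicted. */
    double sum_by_rows(void) {
        double sum = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                sum += a[i][j];
        return sum;
    }

    /* Bad: strides N doubles between accesses, so each access is likely to
       touch a different cache line -- and, for a large array, to miss. */
    double sum_by_columns(void) {
        double sum = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                sum += a[i][j];
        return sum;
    }

Both functions compute the same sum; only the access pattern differs, and that difference alone can change the running time by a large factor.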

Many people take the effectiveness of the memory system for granted. Most of the time, it works well by itself. But, if your application is one that breaks it -- you will suffer unless you know how to fix it. I had this experience when I was doing Automated Target Recognition (ATR) research for a company called Neural Technologies, Inc. (NTI). Prior to using neural networks to classify images, we did extensive preprocessing. This preprocessing used wavelet filters that looked at pixel values, performed some computation, and then made adjustments. It had to consider each pixel in the image many, many times. Memory access latency -- often from cache misses -- contributed to weeks' worth of delay during testing. Perhaps if we had paid more attention to this part of our design before the problem got so large, we could have reduced the problem (or, perhaps, that was just as good as it got).

Another memory-related reality is that objects that are unrelated in your mind might be right next door in memory. This doesn't matter -- until something goes wrong and one object scribbles outside of its box into a neighboring object's memory. The results can be very, very hard to debug. Unless you have an understanding of how memory is organized and where objects live within it, it can be hard to find the offending code.
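
Here's a tiny sketch of how the scribbling happens (the layout described in the comments is typical, but not guaranteed by the language):

    #include <stdio.h>
    #include <string.h>

    int main(void) {
        char name[8];         /* 8 bytes for a name...             */
        int  balance = 100;   /* ...quite possibly right next door */

        /* 15 characters plus a terminator: this writes well past name[7]
           and may silently trash balance (or something else nearby). */
        strcpy(name, "Gregory Kesden!");

        printf("balance = %d\n", balance);   /* may print garbage */
        return 0;
    }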

My longest all-nighter in college or grad school was the result of an error like this -- 4 days straight! I actually suffered delusions. Only an effective understanding of memory -- and skill with tools like debuggers -- can make this type of situation manageable.

Assembly Is the Only Language the Computer Understands

We usually program in an HLL like C, C++, or Java. We do this because these languages map well to our problem domain. Unfortunately, they don't map well to the actual organization of the machine. So, the compiler translates them into assembly code (and ultimately machine code, which is roughly equivalent) so the computer can execute the instructions. Although we're not likely to program a lot in assembly, it is true: no high-level language can be more powerful than assembly.

But, more importantly, we may need to look at assembly from time to time. It is a useful debugging tool. As a graduate student, I found an incorrect optimization in the Sun WorkShop C compiler by reading a piece of the assembly. The problem caused my team's project to hang occasionally if it was compiled with "-O". Without the ability to do this, we would have been dead in the water -- our running hypothesis had been a pointer-related memory error.

There are also other times when assembly might be useful. Sometimes you find yourself debugging an application that is linked against someone else's libraries. Although debugging in this environment isn't fun, being able to read the assembly can often provide valuable hints about what is happening "on the other side" (as can looking at the names of variables preserved as part of the debugging information).

Perhaps most importantly, studying assembly helps us to learn how the compiler works. By understanding that process, we can write better code in HLLs. We can select constructs that map better to the hardware, avoid constructs that don't optimize well, &c.
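
It is easy to look at what the compiler produces: "gcc -S file.c" writes the generated assembly to file.s instead of producing an object file. Here's a small function worth trying that on (a sketch -- what you see in the .s file will vary with the compiler and optimization level):

    /* Compile with something like "gcc -O2 -S sum.c" and read sum.s.
       The loop often becomes just a handful of instructions, and at higher
       optimization levels the compiler may restructure it in instructive ways. */
    int sum(int *a, int n) {
        int s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }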

I/O is critical

Computers are pretty close to useless unless they can exchange information with their environment. But, there is little in the computer world that is more diverse or more esoteric than I/O devices. From disks to networks, each device, and the software systems we build around them, can have a very different "personality". Only by understanding the machinery can we write software that is truly correct.

I shudder to think about how many "network errors" are really programming errors. The same goes for "heat related" problems associated with disks. The edge cases are tough, and programmers often squint and ignore them -- but for truly important or dependable systems, this isn't a valid approach. Things must really be done right.
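
One classic example of a "network error" that is really a programming error: assuming that read() always returns as many bytes as were requested. On a socket it frequently doesn't, so robust code has to loop over the "short counts" (a sketch, with error handling abbreviated):

    #include <unistd.h>
    #include <errno.h>

    /* Read up to n bytes from fd, retrying until n bytes arrive, an error
       occurs, or the other end closes the connection. */
    ssize_t read_fully(int fd, void *buf, size_t n) {
        size_t nread = 0;
        while (nread < n) {
            ssize_t rc = read(fd, (char *)buf + nread, n - nread);
            if (rc < 0) {
                if (errno == EINTR)    /* interrupted by a signal: retry */
                    continue;
                return -1;             /* a real error                   */
            }
            if (rc == 0)               /* EOF: the other side closed     */
                break;
            nread += rc;
        }
        return nread;
    }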

I probably should also note that many problems are caused by incompatible or misused communication paradigms. Here again, many programmers assume an ideal world -- even though bad things can happen. The overhead of communication can easily cripple a system -- and the lack of communication can leave valuable resources idle.

In the Real World, Errors Happen

Typically, we just ignore the error cases when we code. They probably won't happen. And, besides that, coding to handle them takes time and makes the code look uglier.

But, in real-world systems, errors are commonplace. Data gets corrupted. Communications fail. Users do stupid things. Life happens. We need to be able to manage errors and problems, instead of allowing them to cause catastrophic failures.
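
Even something as mundane as checking return values is part of this. Here's a sketch of the habit (the capitalized wrapper name is just a convention, nothing standard):

    #include <stdio.h>
    #include <stdlib.h>

    /* A malloc that never silently returns NULL: fail loudly instead of
       letting some later dereference fail mysteriously. */
    void *Malloc(size_t size) {
        void *p = malloc(size);
        if (p == NULL) {
            fprintf(stderr, "malloc of %lu bytes failed\n", (unsigned long)size);
            exit(EXIT_FAILURE);
        }
        return p;
    }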

Syllabus

We talked our way through the class syllabus and the course schedule. Both are available on-line.