15-410 Triple-Fault Advice Page

Introduction

What should you do when you first experience the dreaded "triple fault"? Here are some suggestions.

First, take a moment to smile. If you had been running your kernel on a real PC, it would have suddenly cleared the screen, beeped, and rebooted. You would have no way to figure out what had happened aside from comparing your source code to a previous version and reasoning forward from there. In 15-410, you have thousands of dollars of the world's best PC simulation software at your beck and call. So maybe as you smile you should remind yourself how grateful you are to the guys in Sweden.
Now, take a second moment to smile. A triple fault means you are about to learn something! What's more, it's the kind of learning you can't get from books. Maybe you will learn something about the x86, something about OS structure, or something about debugging strategy.

Q: Ok, Mr. Miyagi, enough telling me how to polish the car, can you give me some actual advice about my triple fault already?

A: Ah, grasshopper, you are in such a hurry...

Advice

Generally, a triple fault means that you (the OS kernel author) told the machine to do something it couldn't do. For example, you may have asked it to execute a trap handler that doesn't exist, asked it to run code which you didn't give it permission to run, told it to access memory you didn't give it permission to access, etc. This condition is very unlikely to result from a bug in Simics or in the course-provided run-time environment. So sending us mail of the form
I have a triple fault, now what?
isn't going to get you very far. We genuinely don't know what's going wrong!

Sending us register dumps won't help much either. As you will see below, it is unlikely we will be able to point out one "incorrect" bit. Things are right or wrong in context, and you (the kernel author) are responsible for defining the context and therefore what is right and wrong.
Since you have access to all processor state and all of memory, you should be able to figure out first what was happening when the processor ran into trouble and then what the problem was.
- Use the 15-410 Simics Command Guide to collect as much information as you can about the processor state. Collectively the registers and stack trace should suggest what was supposed to happen.
  
  Note: though there is a command called preg, it does not print all the registers...
- Then try to figure out why you thought that thing should have worked. Your answer will probably consist of multiple stages, probably involving two or three kinds of memory and maybe a privilege level.
- Then try to figure out which part didn't work and why not.
- Don't be too eager to skip past information which is "odd". If some lines on the stack trace are incomplete, look at the parts which are there and see if they tell you anything about why the missing parts are missing.
  
  For example, if the debugger complains at you, think about why... if it says "not in TLB", apply the steps of this list, i.e., ask yourself
  1. what things should be in the TLB (translation lookaside buffer),
  2. how those things are supposed to get there,
  3. why you think the thing should have been there,
  4. how you could check to see why the thing didn't get there
This will probably be an interative process. You may need to change your test code, add trace information to your kernel, come up with an innovative breakpoint strategy, etc.
Lots of faults are somehow related to memory. Make sure you've gone over what we've given you related to memory. For example, the textbook devotes several pages to x86 virtual memory, and we have written a handout devoted to that topic from a different angle. The textbook also covers TLBs.

[Last modified Sunday January 11, 2004]