15-410 Triple-Fault Advice Page

Introduction

What should you do when you first experience the dreaded "triple fault"? Here are some suggestions.

First, take a moment to smile. If you had been running your kernel on a real PC, it would have suddenly cleared the screen, beeped, and rebooted. You would have no way to figure out what had happened aside from comparing your source code to a previous version and reasoning forward from there. In 15-410, you have thousands of dollars of the world's best PC simulation software at your beck and call. So maybe as you smile you should remind yourself how grateful you are to the Simics developers.
Now, take a second moment to smile. A triple fault means you are about to learn something! What's more, it's the kind of learning you can't get from books. Maybe you will learn something about the x86, something about OS structure, or something about debugging strategy.

Q: Ok, Mr. Miyagi, enough telling me how to polish the car, can you give me some actual advice about my triple fault already?

A: Ah, grasshopper, you are in such a hurry...

The Problem

Students are sometimes unsure about the definitions of double and triple faults.

A regular fault (or exception) occurs when the processor is unable to execute an instruction to completion. For example, a page-fault exception occurs when one of the memory references needed to execute an instruction can't be completed: paging is enabled, and the paging system decides either that the reference is to a page that does not exist or that the type of access is not legitimate for the page in question.

A double fault occurs when there is a fault, but the processor cannot successfully execute to completion the first instruction of the handler for the primary fault; this causes the processor to switch to running the first instruction of the double-fault handler.

A triple fault occurs when the processor cannot successfully execute to completion the first instruction of the double-fault handler. It is possible that a single underlying problem makes it impossible to execute the three instructions in question, or it may be the case that each instruction cannot be executed for its own personal reason. Regardless, at this point the processor gives up on the entire endeavor of executing instructions and resets.
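
The definitions above are deliberately simplified. As a rough sketch of the more detailed rule, the Intel manuals sort exceptions into "benign", "contributory", and "page fault" classes, and whether a second exception escalates depends on the classes of both. The function and type names below are hypothetical, chosen only for illustration, and the classification covers only the classic vectors 0 through 19.

```c
#include <stdbool.h>

/* Exception classes used when deciding whether two back-to-back
 * exceptions escalate. Names are illustrative, not from any header. */
enum exc_class { EXC_BENIGN, EXC_CONTRIBUTORY, EXC_PAGE_FAULT };

/* Classify a classic x86 exception vector (0-19). Not meant to be
 * called with vector 8 (#DF) itself. */
enum exc_class classify(unsigned vector)
{
    switch (vector) {
    case 0:   /* #DE divide error */
    case 10:  /* #TS invalid TSS */
    case 11:  /* #NP segment not present */
    case 12:  /* #SS stack-segment fault */
    case 13:  /* #GP general protection */
        return EXC_CONTRIBUTORY;
    case 14:  /* #PF page fault */
        return EXC_PAGE_FAULT;
    default:
        return EXC_BENIGN;
    }
}

/* True if `second`, raised while the processor is trying to deliver
 * `first`, is reported as a double fault (#DF, vector 8) rather than
 * being handled on its own. */
bool escalates_to_double_fault(unsigned first, unsigned second)
{
    enum exc_class a = classify(first);
    enum exc_class b = classify(second);

    if (a == EXC_CONTRIBUTORY)
        return b == EXC_CONTRIBUTORY;
    if (a == EXC_PAGE_FAULT)
        return b != EXC_BENIGN;
    return false;   /* a benign first exception never escalates */
}

/* True if `second`, raised while the processor is trying to call the
 * double-fault handler, causes shutdown (the "triple fault"). */
bool escalates_to_triple_fault(unsigned second)
{
    return classify(second) != EXC_BENIGN;
}
```

For instance, under this model a general-protection fault raised while delivering a page fault escalates, but a page fault raised while delivering a general-protection fault does not.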

Advice

Generally, a triple fault means that you (the OS kernel author) told the machine to do something it couldn't do. For example, you may have asked it to execute a trap handler that doesn't exist, asked it to run code which you didn't give it permission to run, told it to access memory you didn't give it permission to access, etc. This condition is very unlikely to result from a bug in Simics or in the course-provided run-time environment. So sending us mail of the form "I have a triple fault, now what?" isn't going to get you very far. We genuinely don't know what's going wrong!

Sending us register dumps won't help much either. As you will see below, it is unlikely we will be able to point out one "incorrect" bit. Things are right or wrong in context, and you (the kernel author) are responsible for defining the context and therefore what is right and wrong.
Since you have access to all processor state and all of memory, you should be able to figure out first what was happening when the processor ran into trouble and then what the problem was.

When you encounter a fault or exception, you must determine three key pieces of information:
1. Determine which instruction (not "line of code") can't be executed. Processors don't execute "lines of code"; they execute instructions.
2. Based on the surrounding code, determine what that instruction was intended to accomplish. Generally speaking, the instruction was selected by a compiler, based on preconditions expected to be true before the instruction executes and on conditions desired to be true after it's done. It is possible you will need to look up a description of exactly what the instruction does.
3. Determine exactly why the instruction could not be executed. Generally speaking, some precondition isn't true, or some input value is wrong. Depending on the exception, the processor may write down some information about this particular execution failure; you will need to consult appropriate documentation to find what information is available and how to decode it. It is unwise to guess at which precondition/value is the source of the problem.
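
As one example of information the processor "writes down": for a page fault, the x86 pushes an error code onto the handler's stack and stores the faulting linear address in %cr2. The sketch below decodes the low bits of that error code into a readable sentence. The macro and function names are hypothetical; how you obtain the error code and %cr2 value (from your handler or from the debugger) is up to you.

```c
#include <stdint.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code. */
#define PF_PRESENT  (1u << 0)  /* 0: page not present, 1: protection violation */
#define PF_WRITE    (1u << 1)  /* 0: read access,      1: write access */
#define PF_USER     (1u << 2)  /* 0: supervisor mode,  1: user mode */
#define PF_RESERVED (1u << 3)  /* 1: reserved bit set in a paging entry */
#define PF_IFETCH   (1u << 4)  /* 1: instruction fetch (only where supported) */

/* Format a description of a page fault into buf.
 * Returns whatever snprintf returns. */
int describe_page_fault(uint32_t error_code, uint32_t cr2,
                        char *buf, size_t len)
{
    const char *mode = (error_code & PF_USER) ? "user" : "supervisor";
    const char *access = (error_code & PF_IFETCH) ? "instruction fetch"
                       : (error_code & PF_WRITE)  ? "write"
                                                  : "read";
    const char *cause = (error_code & PF_PRESENT) ? "protection violation"
                                                  : "page not present";
    const char *rsvd = (error_code & PF_RESERVED)
                           ? " (reserved bit set in a paging entry)"
                           : "";

    return snprintf(buf, len, "%s-mode %s at 0x%08lx: %s%s",
                    mode, access, (unsigned long)cr2, cause, rsvd);
}
```

Decoding the bits this way replaces a guess ("it's probably a bad pointer") with a specific statement about what kind of access failed and why.
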
In more detail...
- Use the 15-410 Simics Command Guide to collect as much information as you can about the processor state. Collectively the registers and stack trace should suggest what was supposed to happen.
  
  Note: though there is a command called preg, it does not necessarily print all the registers...
- Then try to figure out why you thought that thing should have worked. Your answer will probably consist of multiple stages, likely involving two or three kinds of memory and maybe a privilege level.
- Then try to figure out which part didn't work and why not.
- Don't be too eager to skip past information which is "odd". If some lines on the stack trace are incomplete, look at the parts which are there and see if they tell you anything about why the missing parts are missing.
  
  For example, if the debugger complains at you, think about why. If it says "not in TLB", apply the steps of this list, i.e., ask yourself
  1. what things should be in the TLB (translation lookaside buffer),
  2. how those things are supposed to get there,
  3. why you think the thing should have been there,
  4. how you could check to see why the thing didn't get there.
Actually solving the problem will probably be an iterative process. You may need to change your test code, add trace information to your kernel, come up with an innovative breakpoint strategy, etc.
Lots of faults are somehow related to memory. Make sure you've gone over what we've given you related to memory. For example, the textbook devotes several pages to x86 virtual memory, and we have written a handout devoted to that topic from a different angle. The textbook also covers TLBs.
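
When checking whether a virtual address "should have worked", it can help to walk the translation by hand. The sketch below models classic 32-bit x86 two-level paging with 4 KB pages: the top 10 bits of the address index the page directory, the next 10 bits index a page table, and the low 12 bits are the offset within the page. It is a simplified model, not kernel code: the structure and function names are hypothetical, and it uses C pointers where real directory entries hold physical frame addresses.

```c
#include <stdint.h>
#include <stddef.h>

#define PTE_PRESENT 0x1u
#define ENTRIES     1024

/* Simplified model of the paging structures (names are illustrative). */
struct page_table { uint32_t entry[ENTRIES]; };
struct page_dir {
    uint32_t flags[ENTRIES];            /* low flag bits of each entry */
    struct page_table *table[ENTRIES];  /* stands in for the frame address */
};

enum walk_result { WALK_OK, WALK_NO_TABLE, WALK_NO_PAGE };

uint32_t dir_index(uint32_t va)   { return va >> 22; }
uint32_t table_index(uint32_t va) { return (va >> 12) & 0x3FFu; }
uint32_t page_offset(uint32_t va) { return va & 0xFFFu; }

/* Translate va; on success store the physical address in *pa.
 * On failure, report which level of the walk was not present. */
enum walk_result walk(const struct page_dir *pd, uint32_t va, uint32_t *pa)
{
    uint32_t di = dir_index(va);
    if (!(pd->flags[di] & PTE_PRESENT) || pd->table[di] == NULL)
        return WALK_NO_TABLE;

    uint32_t pte = pd->table[di]->entry[table_index(va)];
    if (!(pte & PTE_PRESENT))
        return WALK_NO_PAGE;

    *pa = (pte & 0xFFFFF000u) | page_offset(va);
    return WALK_OK;
}
```

Knowing which level of the walk fails narrows the question from "why did this access fault?" to "why is this particular entry not what I expected?"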

[Last modified Friday September 10, 2021]