Project 3: Writing a Kernel From Scratch

Table of Contents

Project Overview

In the past two projects we have explored implementing device drivers and a thread library, learning about interrupts and concurrency issues in the along the way. The creation of a thread library relied on the services of a fully functional kernel whose implementation was closed to you. In this project you will be implementing your own kernel. This kernel will support multiple address spaces (through the mechanism of paging), preemptive scheduling, and a small set of important system calls.

The kernel project is a four week project, unlike the two projects you have completed. It will be very important to adhere to a reasonable schedule throughout the four weeks to ensure completion of the project. While project four (the file system) is probably more code than the kernel, the kernel is arguably the most complicated project conceptually.

Goals

Technology Disclaimer

Because of the availability, low cost, and widespread use of the x86 architecture, it was chosen as the platform for this sequence of projects. As its creator, Intel Corporation has provided much of the documentation used in the development of these projects. In its literature Intel uses and defines terms like interrupt, fault, etc. On top of this the x86 architecture will be the only platform used in these projects.

The goal of this project set is certainly not to teach the idiosyncrasies of the x86 architecture (or Intel's documentation). That said, it will be necessary to become accustomed to the x86 way of doing things, and the Intel nomenclature, for the purposes of completing this project set. Just keep in mind that the x86 way of doing things is not the only way of doing things. It is the price to be paid for learning the principles of operating systems on a real world system instead of a simulated architecture.

Important Dates

Wednesday, February 19th: Project three assigned.
Friday, February 28th:Checkpoint one.
Friday, March 7th:Checkpoint two.
Friday, March 14th:Checkpoint three.
Friday, March 21st: Project three is due at midnight (Friday night).
This is the last day of class before spring break. If you are planning on leaving before this time, you may certainly turn in your kernel early. You have over four weeks to complete your kernel, plan accordingly.

Groups

The kernel project is a group assignment. You should already be in a group of two from the previous project. If you are not in a group, or you are having other group difficulties, send email to staff-412@cs.cmu.edu.

Grading

The primary criteria for grading the kernel will be correctness and robustness. Does your kernel crash if you pass a null pointer to your printline system call? Kernels are expected to take abuse and continue to run.

Although it is not the first thing we will look at, style is also important. Use an appropriate level of commenting, name variables consistently, and try not to resort to global variables when possible.

It is also our intent that part of the project 3 grade will be based on a face to face interview process. We reserve the right to give partners in a group different project grades if circumstances make that necessary.

Please hand in a design document with your kernel. It should talk about all the important design decisions you made, what works, and what does not. If you tell us about something that doesn't work, it puts you in a better position to be given more partial credit than if you do not tell us and we stumble across it ourselves, because it lets us know that you understand that there is a problem and have spent some nonzero amount of time trying to find it.

Handin

The procedure for handing in project 3 will be the same as it was for project 2. You should have a group directory...

Hardware Primitives

Project one involved a small amount of talking to devices and dealing with processor data structure (like the interrupt descriptor table). In this project, you will need to understand more processor data structures (such as the x86 page table structure), and you will need to incorporate some of the functionality of project 1 as well. In this section we will look at some of the important hardware primitives that you will need to use.

Segment Selector Registers

As was mentioned in the project 1 handout, we will be making as little use of segmentation as possible in these projects. However we do need to specify code, stack, and data segments for the processor. These segments are specified in segment registers. Specifically, the code segment for the currently running process is stored in the %cs register. The stack segment for the currently running process is stored in the %ss register. It is possible to specify up to four data segments for the processor, and they are stored in %ds, %es, %fs, and %gs. The code segment is used to access instructions. The stack segment is used in stack related operations (such as push and pop instructions). The data segments are used in all other operations that access memory, such as data structure reads and writes.

Each segment has a privilege level associated with it, which is the minimum privilege to access that segment. In project 1, all of our segments were privilege level 0, and we never left kernel mode. In this project however, we will have user processes which cannot access segments that require privilege level 0. Fortunately we have already defined segments for user code and data (user data can be used for both the user's stack segment and data segment). They are specified in seg.h.

When you start a user process, you will need to specify a user-level code and stack segment. You will also need to load a user-level data segment into the data segment registers. When a user process takes an interrupt, the code and stack segment registers will get saved automatically (see intel-sys.pdf page 153). The data segment registers will not be saved, though. You will need to save these along with the other general purpose registers, and then restore them before the user process is run again.

The EFLAGS Register

The EFLAGS register controls some important state like whether or not interrupts are enabled. You will never need to modify the EFLAGS register directly, however it may be useful to understand its format. The EFLAGS register is discussed in section 2.3 of intel-sys.pdf.

Although the EFLAGS register stores a flag controlling whether interrupts are enabled or disabled, there is a much faster way to disable and enable interrupts. The sti and cli instructions enable and disable interrupts. You can find more information on them in intel-isr.pdf.

Control Registers

The control registers contain some of the most important flags on the x86 architecture. A summary of the control registers can be found on page 50 of intel-sys.pdf. Control register 0 (CR0) contains the most powerful system flags. The only flag in CR0 that you should modify is bit 31 (the top bit), which activates paging. Paging is active when the flag is set, and inactive when the flag is cleared. Make sure you have set up your page directory (discussed below) properly before setting this flag. Do not modify the state of any of the other flags in this register.

Control register 1 (CR1) is reserved and should not be touched.

Control register 2 (CR2) is only used in one instance. When your kernel takes a page fault, CR2 will contain the address that caused the page fault. You will need to get the value of this register inside your page fault handler.

Control register 3 (CR3) is sometimes referred to as the page directory base register, or PDBR. It holds the address of the current page directory in the top 20 bits (these are the top twenty bits of the page directory's address, the bottom 12 must be zero since the page directory must be page-aligned). It also has two flags, present in bits 3 and 4. Bit 3 controls whether the processor's internal caches are write-back or write-through (if the bit is set, write-through caching is enabled, if it is clear, write-back caching is enabled). Bit 4 controls whether the page directory may be cached (if the bit is set, the page directory may not be cached). Both of these flags should be cleared (not set). You will need to update CR3 whenever you wish to switch address spaces. Fortunately the x86 architecture flushes the TLB for you when you update CR3.

Control register 4 (CR4) contains a variety of extension flags, which for the most part we will not be using. The one flag we do care about is the Page Global Enable (PGE) flag, stored in bit 7. We will want to set this flag to enable global pages. The reason for this is made clear below. Do not modify the state of any of the other flags in this register.

We have provided assembly macros to read and write from the control registers. Be very careful about reading and writing to control registers (think about interrupts).

The Kernel Stack Pointer

On the x86 architecture, the stacks for user level code and kernel level code are separate. When an interrupt occurs that transitions the current privilege level of the processor to kernel mode (refer to the project one handout for background on privilege levels), the stack pointer is set to the top of the kernel stack. A small amount of context information is then pushed onto the stack to allow the previously running process to resume once the interrupt handler has finished running.

The initial value of the stack pointer when we enter kernel mode is defined by the currently running task. Tasks are a hardware process mechanism provided by the x86 architecture. We will not be using tasks, however. Not only is it faster to manipulate the process abstraction in software, we can also tailor that abstraction to our particular design needs. However it is necessary on the x86 to define at least one task. The OSKit libraries take care of setting that up for you. We have provided a function to go into the processor data structure for this single task and set the initial kernel stack pointer value. It is called set_esp0(void* addr), defined in seg.h.

For more information on hardware tasks, refer to chapter 6 of intel-sys.pdf.

Paging

The x86 architecture uses a two-level paging scheme with 4k pages. It is also possible to use larger page sizes, however 4k is by far the most commonly used page size. The top level of the paging structure is the page directory, while the second level consists of page tables. The format of the entries in the page directory and page tables are very similar, however, their fields have slightly different meaning. Here is the format of both a page directory entry and a page table entry.

Both entry formats use the top 20 bits to specify an address (a page directory entry specifies the address of a page table, a page table entry specifies the address of a frame in memory). Note that both page tables and frames must be page aligned (they are both exactly 4k in size and the bottom 12 bits of their addresses must be zero), which is why we can simply specify the top 20 bits of the address.

The designers of the x86 architecture have kindly left us three bits which we can use as we please. These are the bits in the field marked "Avail" in both types of entries.

Bit 8 is the global flag, which has no meaning in the page directory entry (it can be set to either 0 or 1). In a page table entry however, if the global flag is set the virtual-to-physical mapping for this page will not be flushed from the TLB automatically. This can be useful for pages that are the same for all address spaces. You should use this to prevent the kernel mappings from being flushed on context switches. To use this bit, you also need to set the page global enable flag (PGE) in control register 4.

Bit 7 is the page size flag in the page directory entry, and in the page table entry it is the page attribute table index flag. Since we are only using 4k pages and they are all the same type of page, both of these flags should be set to zero.

Bit 6 is the dirty bit, and is only valid in page table entries. This flag gets set by the hardware when the page pointed to by the page table entry is written to. This can be used in implementing demand paging. It will not be used, however, in this project.

Bit 5 is the accessed bit. It gets set by the hardware when the page pointed to by a page table entry is accessed. The accessed bit gets set in a page directory entry as soon as any of its page tables point to pages have been accessed. You will not need to worry about using this flag.

Bit 4 is the page-level disable caching bit. If the flag is set then caching of the associated page (or page table if the entry is a page directory entry) is disabled. This flag should always be cleared to enable caching.

Bit 3 is the page-level write through bit. If it is set, write-through caching is enabled for that page (or page table if the entry is a page directory entry), otherwise write-back caching is used. This flag should be left clear.

Bits 1 and 2 of the page directory and page table entries interact with each other to provide page-level protection. We will be using page-level protection in the kernel to make sure that user processes cannot clobber the kernel. See the table on page 136 of intel-sys.pdf for the specific behavior of these two flags:
Bit 2 is the user/supervisor flag. If the flag is set, then the page is user accessible. This flag has different meaning depending on whether it is set in the page directory entry or the page table entry. It also interacts with the user/supervisor flag. See the table on page 136 of intel-sys.pdf for specific behavior.

Bit 1 is the read/write flag. If the flag is set, the page may be written to. If it is clear, the page is read-only. This flag has different meaning depending on whether it is set in the page directory entry or the page table entry. It also interacts with the user/supervisor flag. See the table on page 136 of intel-sys.pdf for specific behavior.

Bit 0 is the present flag in both the page directory and page table entries. If the page is marked not present, then an attempt to read, write, or execute data stored at an address within that page will generate a page fault. When you install a page into a page frame, you should set this bit. Likewise, if a page table is marked not present, then an attempt to read, write, or execute data stored anywhere at an address spanned by that page table will generate a page fault. When you install a new page table, make sure to set the present bit in its corresponding page directory entry.

Recall from the section on control registers that the address of the active page directory is kept in CR3, and the top bit of CR0 controls whether paging is active or not. You will need to install a page fault handler (this is similar to your installation of timer and keyboard interrupt handlers). That page fault handler will need to look at CR2 for the address of the fault, and decide whether it is a valid request (in which case you will need to allocate a new page and possibly a new page table), or an invalid request (an attempt to access kernel memory or the void between the heap and the stack). In order to get page faults when a user process attempts to access kernel memory, set the proper permission bits in the corresponding page directory and page table entries.

Of course, some accesses between the heap and the stack are valid; in fact, any time the user process grows its stack, it will access memory between the heap and the current stack pointer. You will need to decide how far is "too far" when judging whether a pointer access is an attempt to grow the stack, or a wild read/write into the middle of nowhere.

The Interrupt Descriptor Table

You should remember the interrupt descriptor table from project 1. It is where we specify the handlers for interrupts, both hardware and software, as well as exceptions. You will eventually need to install handlers for page faults. You also need to install a handler for the illegal instruction exception, divide by zero exception, and general protection fault. Details on all of the exceptions (and their index into the IDT) can be found in section 5.12 of intel-sys.pdf. The general protection fault is the x86 catch-all exception that you will probably get a lot of in your kernel development - it would be valuable to get this handler installed early. The need for the other exception handlers should be obvious.

Installing a handler should be familiar from project one; if you're a little rusty, consult your project one solution or the project one handout.

The Layout of Physical Memory

Although there are many ways to partition physical memory, it is required that you follow the following model. The bottom 16MB of memory (from 0x0 to 0xffffff) is reserved for the kernel. This kernel memory should appear as the bottom 16MB of each process' address space. This kernel memory is also direct mapped; that is, the virtual addresses for the bottom 16MB of memory are the same as the physical addresses. Note that user processes should not be able to read or write kernel memory, even though it is resident at the bottom of each user process' address space.

The rest of physical memory, i.e. from 0x1000000 on up, should be reserved for page frames. The simulated machines have 256MB of memory. Although it is quite possible to determine how much memory there is in the system, for now your kernel should have a #define for the amount of memory.

#define MACHINE_MEMORY_SIZE_MB 256

You should use this memory size in determining the size of your kernel data structures.

Kernel Programming Environment

The programming environment for this project is quite similar to the one provided in project one. OSKit will take care of rudimentary machine initialization and get you to the main() function in kernel.c. If this main() function ever returns, OSKit will print a message prompting you to press a key to reboot. Your kernel should never reach this point. We will provide you a shell which does not exit. If you ever detect that the shell process has died, you should respawn another one.

While we could make you write your own memory management routines for kernel space, we have decided that you should be focusing on process management, virtual memory and system calls instead. To that end, there is a small collection of malloc functions that are available to you. OSKit supports these malloc functions with a list-based memory manager. In particular, here are the malloc functions that you can use.

To use these functions, #include <oskit/c/malloc.h.> You may also use the str*** and mem*** functions, as well as bzero. To use any of those functions, #include <oskit/c/string.h>. The prototypes of these functions are the same as their Unix counterparts, so you can look up the behavior/prototype using common man pages.

Your console output should be done using your solution to project one. If you did not get project one to work, or were not satisfied with it, you may contact us for some simple console and keyboard routines. We would much rather you use your own code if possible however.

Context Switching

Context switching is probably the messiest part of this project. You will have to write some assembly to do it, and it may seem a little mind bending at first. It is a great feeling though when you successfully schedule two processes for the first time, however!

Context switching involves saving the general purpose registers and the segment selector registers of one process, halting its execution somewhere, and then loading the general purpose registers and segment selector registers of another process and resuming its execution where it previously was stopped. You will also need to switch the active page directory to correspond to the new process' address space.

One way to simplify the burden of context switching is to always do it in the same place. If you have a single function which performs the context switch, then the point of execution for every process not currently running is inside that function. You need not explicitly save the instruction pointer.

Writing the routine to context switch processes that have already run is actually not so bad. The messier part is bootstrapping a process to get it to execute for the first time. Before you attempt to run a process for the first time, you will need to set up the top of its kernel stack so that your context switching code loads meaningful context. You might also have a routine which context switches a little differently if the new process has never run before.

A useful tool in bootstrapping processes is the iret instruction. The iret, which you should be familiar with from project one, is used to return from interrupt handler routines. It is capable of changing the code and stack segments, stack pointer, instruction pointer, and EFLAGS register all in one step. It is particularly useful in starting user processes for the first time. A good place to find more information on the behavior of the iret instruction (and what it expects on the stack) is page 153 of intel-sys.pdf. This gives a diagram of what iret expects on the stack.

Do not forget that you will need to register a handler for the timer interrupt, and that it may need to call your scheduler and context switch. Refer to your solution for project one and the project one handout to see how to use the timer. As in project one, your timer should go off every ten milliseconds (i.e. "ticks" are separated by ten milliseconds). As far as your scheduler is concerned, a simple round robin scheme is sufficient for this project.

A Loader for a.out

It will be a tough task indeed to convince everyone in the world to use your new operating system if it cannot run any user processes. In order to take a binary executable and get it running inside a process, you need to load it from the executable into its address space. However this is not as simple as just copying the file into some fixed location. There are a variety of executable image formats. Because we do not want you to dwell on the loader too much, we have chosen the simple a.out format as the executable format of choice for the project three kernel. Your task will be to write a loader for a.out, and use it to load your executables.

Getting the Executable's Bytes

Because we do not have a file system as of yet (you will be working on that in your next project), you will be loading your executables from large arrays compiled into the kernel. We are providing to you a utility, exec2obj, which takes as an argument an a.out executable and creates a .c file. The .c file simply contains a char array with the name of the a.out executable, and its contents are the data of that executable. So, if I had an a.out executable whose filename was hello, and I executed

exec2obj hello

the result would be a new file in that directory called hello_obj.c. Inside that hello_obj.c file would be a char array called hello, and it would be initialized to the data contained within the hello executable. You can then refer to that array in your loader when you wish to load that particular user application. A table of contents is also created which you can use to locate the array (of which there may be more than one) based on the filename of the executable. The definition for the table of contents is in exec2obj.h.

Eventually, you will be reading your executables from a file system. To facilitate an easy switch from exec2obj to a file system, we would like you to create the following interface to the arrays provided by exec2obj:

int getbytes(char* filename, int offset, int size, char* buf)

The getbytes function should find the file 'filename' in the table of contents in exec2obj.h, seek to the specified offset within the array of bytes for that file, and copy size bytes to buf. The function should return the number of bytes succesfully copied (this could be cut short if you run into the end of the array). If the offset is to the end of the file, getbytes should return zero (regardless of size). If offset goes past the end of the file, -1 should be returned.

Your loader should make calls to your getbytes function in order to acquire the actual bytes of the executable.

Details on the a.out Format

For information on the a.out executable format, refer to the FreeBSD a.out man page, available at:

http://www.freebsd.org/cgi/man.cgi

There are some macros that you will find useful in dealing with the a.out format. They are:

#define N_MAGIC(exec) ((exec).a_info & 0xffff)
#define N_BADMAG(x) (N_MAGIC(x) != NMAGIC)
#define N_TXTOFF(x) (sizeof( exec_t ))
#define N_DATOFF(x) (N_TXTOFF + (x)->a_text)

In addition, we will only be using the NMAGIC variant of a.out. You will need this then as well:

#define NMAGIC 0410 /* 0x108 */

The rest of the information should be available in the man page on a.out.

Linking User Processes

Your user processes require some special care in linking, since they are going to be run in a very strange environment. In particular, you will want to compile them with these flags:

You will also need to pass some special flags to the linker. These flags are:

Finally, your user programs will need some sort of C runtime setup function. This function is responsible for setting up the arguments to the user process' main function. It also ensures that if the user process' main function does not call the exit system call (described later in this document), that it is still called before execution runs off the edge of the code.

Kernel System Calls

The heart of your kernel will be the system call interface. User processes need to be able to fork, exec, getpid, etc. You used system calls in project 2 to implement your thread library. Now, in this project, you will implement those same system calls, as well as many others.

To receive the system call you will need to install a handler for the int instruction you have been using. You should remember how to do this from project one, if not, consult your project one handout or solution. In short you will need to add an entry into the interrupt descriptor table (IDT).

To ensure that we can test for robustness easily, we are requiring you to use int 0x40 as your system call software interrupt. Please do not use a different software interrupt. We also require that you store your system call number in %edi, and a single argument to your system call in %esi. For system calls that take more than one parameter, an argument structure is defined below. The system call should, in this case, take a pointer to this structure in user space.

System Call Interface

You need to provide the following system calls. Of course, all of these system calls will use an int instruction. As you no doubt discovered in project two it is inconvenient for the user to have to write wrappers around the int instruction. Please write user-level wrappers around these functions that take care of collecting arguments into a structure if necessary, and calling the int instruction. This will make it easier for you to test these system calls as well.

Hints on Implementing a Kernel

Here are some tips on implementing tricky areas of the kernel.

Plan of Attack

The kernel project will seem very daunting at first, which is probably a good thing, because hopefully that will drive you to start right away! There is a lot of work here, and some bugs may take you one or two days to find. Please do not be fooled into thinking you can waste two of your four weeks. This is a four week assignment because it takes four weeks of work.

To help you, we have established a recommended plan of attack. Hopefully this will give you an idea of how to start. This also establishes the criteria used for the checkpoints.

  1. Read this handout thoroughly.

  2. Re-read the complicated parts to make sure you understand what's being said.

  3. Think about it all for a while.

  4. Write up pseudocode for all the system calls, as well as the timer interrupt handler, scheduler, paging system, and context switcher. You may want to do this in several iterations. Start by just writing down how all these pieces fit together. Then take the level of detail up a notch and start thinking about how they break down into functions. Then write some high-level pseudocode for those functions.

  5. Using the previous step as a guide, come up with the data structures you think you will need for your kernel. The most important data structure is probably your process control block, or PCB. What needs to be in the PCB?

  6. Write the a.out loader. The a.out loader is very straightforward and you will need it to get your user processes running.

  7. Write a user-level idle process (a simple while(1); will do), create a PCB for it, and switch to it in your kernel. At this point you will not be running virtual memory, and you will not have any other processes. Being able to switch to a user-mode process is an important step however.

  8. If you can switch to the idle process, you have enough for checkpoint one.

  9. Implement paging. Write a page fault handler that allocates frames for legal accesses in unallocated memory, and prints a message when a bad memory reference occurs.

  10. Now that you can load a user process and switch to it, do the same thing except this time load a second user process (init) and have the timer interrupt switch between the two of them. This will require you to write your context switching code.

  11. If you can load two processes and context switch between them, you have successfully reached checkpoint two.

  12. Implement the getpid system call. Once this is working, your system call interface is functioning correctly.

  13. Implement the brk system call and test it with user processes that use malloc.

  14. Implement sleep. This will test blocking a process from executing.

  15. Implement fork.

  16. Implement exec. You can now test fork and exec by having your init process spawn off a third user process.

  17. Implement wait and exit.

  18. If fork, exec, wait, and exit are all working, you have enough to pass checkpoint three.

  19. Implement yield,deschedule, and make_runnable.

  20. Implement minclone. You should now be able to use your thread library with your kernel.

  21. Implement the terminal I/O related system calls.

  22. Implement rand.

  23. Bang on your kernel to shake all the bugs out until your fists are sore.

  24. Hand your kernel project in.

  25. Have an excellent Spring Break!

Steve Muckle
Last modified: Sun Mar 2 16:08:07 EST 2003