Return to the lecture notes index
Lecture 19 (October 10, 2001)

From open() to the inode

The operating system maintains two data structures representing the state of open files: the per-process file descriptor table and the system-wide open file table.

When a process calls open(), a new entry is created in the open file table. A pointer to this entry is stored in the process's file descriptor table. The file descriptor table is a simple array of pointers into the open file table. We call the index into the file descriptor table a file descriptor. It is this file descriptor that is returned by open(). When a process accesses a file, it uses the file descriptor to index into the file descriptor table and locate the corresponding entry in the open file table.

The open file table contains several pieces of information about each file:

Each entry in the open file table maintains its own read/write pointer for three important reasons:

One important note: In modern operating systems, the "open file table" is usually a doubly linked list, not a static table. This ensures that it is typically a reasonable size while capable of accomodating workloads that use massive numbers of files.

Session Semantics

Consider the cost of many reads or writes may to one file.

The solution is to amortize the cost of this overhead over many operations by viewing operations on a file as within a session. open() creates a session and returns a handle and close() ends the session and destroys the state. The overhead can be paid once and shared by all operations.

Consequences of Fork()ing

In the absence of fork(), there is a one-to-one mapping from the file descriptor table to the open file table. But fork introduces several complications, since the parent task's file descriptor table is cloned. In other words, the child process inherits all of the parent's file descriptors -- but new entries are not created in the system-wide open file table.

One interesting consequence of this is that reads and writes in one process can affect another process. If the parent reads or writes, it will move the offset pointer in the open file table entry -- this will affect the parent and all children. The same is of course true of operations performed by the children.

What happens when the parent or child closes a shared file descriptor?

Why clone the file descriptors on fork()?

Memory-Mapped Files

Now that we've talked about how the file system usually maintains and access files, let me take the opportunity to point out that there is actually another way. It is actually possible to hand a file over to the VMM and ask it to manage it, as if it were backing store for virtual memory. If we do this, we only use the file system to set things up -- and then, only to name the file.

If we do this, when a page is accessed, a page fault will occur, and the page will be read into a physical frame. The access to the data in file is conducted as if it were an access to data in the backing-store. The contents of the file are then accessed via an address in virtual memory. The file can be viewed as an array of chars, ints, or any other primitive variable or struct.

Only those pages that are actually used are read into memory. The pages are cached in physical memory, so frequently accessed pages will not need to be read from external storage each access. It is important to realize that the placement and replacement of the pages of the file in physical memory competes with the pages form other memory mapped files and those from other virtual memory sources like program code, data, &c and is subject to the same placement/replacement scheme.

As is the case with virtual memory, changes are written upon page-out and unmodified pages do not require a page-out.

The system call to memory map a file is mmap(). It returns a pointer to the file. The pages of the file are faulted in as is the case with any other pages of memory. This call takes several parameters. See "man mmap" for the full details. But a simplified version is this:

void *mmap (int fd, int flags, int protection)

The file descriptor is associated with an already open file. In this way the filesystem does the work of locating the file. Flags specifies the usual type of stuff: executable, readable, writable, &c. Protection is something new.

Consider what happens if multiple processes are using a memory-mapped file. Can they both share the same page? What if one of them changes a page? Will each see it?

MAP_PRIVATE ensures that pages are duplicated on write, ensuring that the calling process cannot affect another process's view of the file.

MAP_SHARED does not force the duplication of dirty pages -- this implies that changes are visible to all processes.

A memory mapped file is unmapped upon a call to munmap(). This call destroys the memory mapping of a file, but it should still be closed using close() (Remember -- it was opened with open()). A simplified interface follows. See "man munmap" for the full details.

int munmap (void *address) // address was returned by mmap.

If we want to ensure that changes to a memory-mapped file have been committed to disk, instead of waiting for a page-out, we can call msync(). Again, this is a bit simplified -- there are a few options. You can see "man msync" for the details.

int msync (void *address)

Cost of Memory Mapped Access To Files

Memory mapping files reduces the cost of accessing files imposed by the need for traditional access to copy the data first from the device into system space and then from system space into user space.

BUt it does come at another, somewhat interesting cost. Since the file is being memory mapped into the VM space, it is competing with regular memory pages for frames. That is to say that, under sufficient memory pressure, access to a meory-mapped file can force the VMM to push a page of program text, data, or stack off to disk.

Now, let's consider the cost of a copy. Consider for example, this "quick and dirty" copy program:

int main (int argc, char *argv)
{
  int fd_source; 
  int fd_dest;
  struct stat info;
  unsigned char *data;

  fd_source = open (argv[1], O_RDONLY);
  fd_dest = open (argv[2], O_WRONLY | O_CREAT | O_TRUNC, 0666);

  fstat (fd_source, &info);

  data = mmap (0, info.st_size, PROT_READ, MAP_SHARED, fd_source, 0);
  write (fd_dest, data, info.st_size);

  munmap (data, info.st_size);
  close (fd_source);
  close (fd_dest);
}

Notice that in copying the file, the file is viewed as a collection of pages and each page is mapped into the address space. As the write() writes the file, each page, individually, will be faulted into physical memory. Each page of the source file will only be accessed once. After that, the page won't be used again.

The unfortunate thing is that these pages can force pages that are likely to be used out of memory -- even, for example, the text area of the copy program. The observation is that memory mapping files is best for small files, or those (or parts) that will be frequently accessed.