Simulation Environment Overview
15-441 Project 2 and 3, Fall 2001

1 Overview

In this document, we describe the simulation environment which you will be using in projects 2 and 3 for this class. The simulator implements the basic components of an operating system kernel, as well as the socket, transport, link and physical layers. You will be responsible for adding the network layer to the kernel. The details of your project assignments can be found in the project handouts.

**Figure 1:** Simulator Overview: The kernel of each node is implemented as a separate UNIX process. Each application running on a node is in a separate process.

Figure 1 shows a picture of a sample simulated network. The kernel of each node in the network is a separate UNIX process. An application running on top of a node is a UNIX process separate from the kernel process. The fact that each node is implemented as a separate process enables you to simulate communications between nodes even though all the nodes are actually running on the same machine. Applications are implemented as separate processes so that they can be started after the simulation is already running (i.e. the kernel on each node is running) and so that more than one application can be run on the same node.

Each node has its own operating system kernel, which you implement. Some nodes use all the layers of the network stack implemented in your kernel and have applications running on top of them (Nodes 2 and 3 in the figure). These nodes represent end-systems or communication endpoints. Other nodes, e.g. Node 1, use only the physical, link and network layers of the network stack. These nodes are routers. They are responsible only for forwarding packets, and since forwarding is a function provided by the network layer, they do not need the layers above it. Endpoints, on the other hand, need all layers of the network stack, since packets sent and received by the application layer are processed by every layer below it.

In this handout, we will use $PDIR to denote the project directory.

The project directory for project 2 will be:
/afs/cs.cmu.edu/academic/class/15441-f01/projects/project2/

The project directory for project 3 will be:
/afs/cs.cmu.edu/academic/class/15441-f01/projects/project3/

2 Building the kernel and running a network simulation

The support code for your projects provides an environment that emulates a simple machine with hardware-level network devices and a system call interface. It is provided to you in the form of a library: libkernel.a.

When building the kernel, all your files will be compiled and linked against the libraries we provide to form a single Solaris executable.

You will be using the simulator to simulate a network. Typically, a network consists of more than one node (otherwise it is not very interesting). A sample network configuration is shown in Figure 2. This configuration may be represented in the simulator as shown in Figure 1.

A script $PDIR/template/startkernel.pl is provided to help you bring your network up when you start the simulation. This script reads a network configuration file (see Section 2.1) that you specify and launches the appropriate number of kernels. Each kernel is started in its own xterm window. An optional second argument (debug) may be specified to startkernel.pl so that it runs each kernel within gdb. If you specify the debug option, you will have to start the kernel in gdb manually. You have to set the arguments for the kernel using ``set args ...'' (see the script for what arguments are needed). Then you can issue the command ``run''. If you don't specify the debug option, problems may be difficult to debug since when a kernel crashes, the xterm window corresponding to that kernel will close.

**Figure 2:** A sample network configuration.

2.1 Network configuration file

As mentioned above, you need to specify a network configuration file when you run a simulation. This configuration file specifies each node in the network along with all of its interfaces and their respective addresses, as well as all the links that exist between each node and other nodes in the network.

We use the network from Figure 2 to illustrate how network configuration files are built. Interface 1 on node R1 is connected to interface 1 on node R2, and interface 2 on node R1 is connected to interface 1 on node R3. The network configuration file for this network is the following:

# Configuration for Router 1
Router 1 {
     1  1.1.1.1  255.255.255.255
     2  1.1.2.1  255.255.255.255
     1:1 2:1
     1:2 3:1
}

# Configuration for Router 2
Router 2 {
     1  1.1.1.2  255.255.255.255
     2:1 1:1
}

# Configuration for Router 3
Router 3 {
     1  1.1.2.2  255.255.255.255
     3:1 1:2
}

As usual, lines that start with a ``#'' are comments and will be ignored by the simulator. Each interface on each router has its own IP address and netmask. The notation X:Y refers to interface Y on node X. Thus, the line ``1:1 2:1'' in the configuration entry for node R1, shown above, specifies that interface 1 on R1 should be connected to interface 1 on R2. Note that in this configuration, R2 and R3 are actually end points, not routers. However, the simulator requires the word ``Router'' for each node in the configuration file.

This sample configuration file is provided in $PDIR/template/network3.cfg. You can modify the sample or create your own configuration for testing purposes.
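
To make the X:Y link notation concrete, here is a minimal sketch of how a link line such as ``1:2 3:1'' could be split into its two endpoints. It is not the simulator's parser: struct link_end and parse_link_line() are hypothetical names introduced only for illustration.

```c
#include <stdio.h>

/* Hypothetical helper type (not part of the simulator): one end of a link,
 * written "X:Y" in the configuration file, i.e. interface Y on node X. */
struct link_end {
    int node;
    int iface;
};

/* Parse a link line such as "1:2 3:1" into its two endpoints.
 * Returns 0 on success and -1 if the line is not a link line
 * (for example a comment or an interface/address line). */
int parse_link_line(const char *line, struct link_end *a, struct link_end *b)
{
    int n = sscanf(line, " %d:%d %d:%d",
                   &a->node, &a->iface, &b->node, &b->iface);
    return (n == 4) ? 0 : -1;
}
```

With the sample file above, the line ``1:2 3:1'' yields node 1, interface 2 on one end and node 3, interface 1 on the other, while interface lines and comment lines are rejected.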

3 Building and running user programs

User programs can be run on each simulated node. Each user program is run as a separate user process as shown in Figure 1.

All user programs used with the simulator must be linked against the user libraries we provide (see the template Makefile for more details). There are three important issues regarding user programs:

Use Main() instead of main(). The entry point of a user program is actually in the simulator, not the code you write. After the simulator is done with the necessary initializations, it invokes your Main() function. The Main() function is exactly the same as main(); that is, the usual argc and argv are still there.
User programs must be run with ``-n i'' as the first argument. This argument is to specify that this user program should be run on node i. Note that the Main() function will not get this argument (i.e. the simulator will strip this argument before passing the arguments to Main()).
You should close all open sockets before returning from Main(). If you forget to close an open socket, the simulator will not close it for you and future invocations of your program might fail.
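
The following sketch illustrates the first two points: the program defines Main() rather than main(), and the ``-n i'' argument is removed before Main() sees its arguments. The function sketch_startup() is a hypothetical stand-in for the simulator's real startup code, which this handout does not show; only the stripping behavior described above is modeled.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* User entry point: Main(), not main(). By the time it runs,
 * "-n i" has already been removed from argv. */
int Main(int argc, char *argv[])
{
    int i;

    for (i = 0; i < argc; i++)
        printf("argv[%d] = %s\n", i, argv[i]);
    return 0;
}

/* HYPOTHETICAL stand-in for the simulator's startup code: checks that
 * "-n i" is the first argument, records the node number, strips the
 * two words from argv, and then calls Main(). */
int sketch_startup(int argc, char *argv[], int *node_out)
{
    int i;

    if (argc < 3 || strcmp(argv[1], "-n") != 0)
        return -1;                 /* "-n i" must come first */
    *node_out = atoi(argv[2]);

    for (i = 3; i < argc; i++)     /* shift the remaining arguments down */
        argv[i - 2] = argv[i];
    argc -= 2;
    argv[argc] = NULL;

    return Main(argc, argv);
}
```

A program started as ``client -n 2 hello'' would, under this model, run on node 2 and see only ``client'' and ``hello'' in its argv.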

4 The `pbuf` structure

A packet sent or received by an application is processed by several different layers in the network stack. In real BSD-style implementations, an mbuf structure is used for passing the packet between the different layers. In projects 2 and 3, you will be using a pbuf structure for building and passing packets between network stack layers. The pbuf structure is a simplified version of the BSD mbuf.

The definition of the pbuf structure is the following:

    struct p_hdr {
            struct  pbuf *ph_next;    /* next buffer in chain */
            struct  pbuf *ph_nextpkt; /* next chain in queue/record */
            caddr_t ph_data;          /* location of data */
            int     ph_len;           /* amount of data in this pbuf */
            int     ph_type;          /* type of data in this pbuf */
            int     ph_flags;         /* flags; see below */
    };

    struct pbuf {
            struct p_hdr p_hdr;
            char         p_databuf[PHLEN];
    };
    #define p_next    p_hdr.ph_next
    #define p_nextpkt p_hdr.ph_nextpkt
    #define p_data    p_hdr.ph_data
    #define p_len     p_hdr.ph_len
    #define p_type    p_hdr.ph_type
    #define p_flags   p_hdr.ph_flags
    #define p_dat     p_databuf

pbuf's must be allocated and deallocated using the routines p_get() and p_free() declared in $PDIR/include/pbuf.h. Since a pbuf contains less than 512 bytes of data (PHLEN is defined as 512 minus header length), an MTU-sized packet (1500 bytes in your projects) will consist of 4 pbuf structures linked together by the p_next field in each pbuf; this is called a pbuf chain. The p_nextpkt field can be used to link multiple packets together on a queue. By convention, only the first pbuf in a pbuf chain should be used to link to another pbuf chain (through p_nextpkt).
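
The count of 4 pbufs for an MTU-sized packet can be checked with a short calculation. The sketch below assumes that PHLEN is exactly 512 minus sizeof(struct p_hdr), as the text suggests, and that every pbuf in the chain is filled completely; the real value is whatever $PDIR/include/pbuf.h defines. The header layout is copied from the definition above, with char * standing in for caddr_t, and pbufs_needed() is a hypothetical helper.

```c
#include <stddef.h>

struct pbuf;                       /* only pointers to it are needed here */

/* Header layout copied from the handout (char * stands in for caddr_t). */
struct p_hdr {
    struct pbuf *ph_next;
    struct pbuf *ph_nextpkt;
    char        *ph_data;
    int          ph_len;
    int          ph_type;
    int          ph_flags;
};

/* ASSUMPTION: data area size = 512 bytes minus the header size. */
#define PHLEN (512 - sizeof(struct p_hdr))

/* Number of completely filled pbufs needed to hold pkt_len bytes
 * (rounds up; 0 bytes need 0 pbufs). */
size_t pbufs_needed(size_t pkt_len)
{
    return (pkt_len + PHLEN - 1) / PHLEN;
}
```

The header occupies 24 bytes with 4-byte pointers and 40 bytes with 8-byte pointers, so PHLEN is 488 or 472 and pbufs_needed(1500) is 4 in both cases. A chain that reserves space for headers at the front of a pbuf may need more.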

**Figure 3:** A 48-byte IP packet spread out over 2 `pbuf` structures. There is a 20-byte IP header, an 8-byte UDP header, and 20 bytes of user data. The IP header starts at the beginning of the first pbuf's `p_databuf`, while the UDP header and data bytes start in the middle of the second pbuf's `p_databuf`. Placing data in the middle of `p_databuf` and modifying `p_data` to point to it is a clever way to leave space for headers, or to push and pop headers, without requiring additional pbufs.

The field p_data points to the location where the packet data starts within the p_databuf[PHLEN] buffer. Why implement pbufs this way? Suppose your transport layer has built a UDP packet with 20 bytes of data and an 8-byte UDP header. Before this packet gets sent on the wire, it will have to go through network and link layer processing. If you place the data at the beginning of the pbuf, the network layer will have to allocate a new pbuf in which to store the 20-byte IP header and prepend this pbuf to the packet. However, if you were clever enough to leave 20 bytes of space at the beginning of the p_databuf[] buffer, you could simply subtract 20 from the value of p_data and then copy the 20-byte IP header to the address indicated by this pointer. An example of a packet consisting of multiple pbuf structures is shown in Figure 3.
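
The headroom trick can be sketched as follows. This is an illustration only: struct sketch_pbuf is a cut-down local copy of the pbuf fields involved, the value 488 for PHLEN is an assumption, and sketch_fill() and sketch_prepend() are hypothetical helpers rather than routines from $PDIR/include/pbuf.h.

```c
#include <string.h>

#define PHLEN 488                  /* ASSUMED size of the data area */

/* Cut-down local copy of the pbuf fields used in this sketch. */
struct sketch_pbuf {
    char *p_data;                  /* start of valid data in p_databuf */
    int   p_len;                   /* bytes of valid data */
    char  p_databuf[PHLEN];
};

/* Store a payload, leaving `headroom` free bytes in front of it. */
int sketch_fill(struct sketch_pbuf *p, const void *payload, int len,
                int headroom)
{
    if (len < 0 || headroom < 0 || headroom + len > PHLEN)
        return -1;
    p->p_data = p->p_databuf + headroom;
    memcpy(p->p_data, payload, (size_t)len);
    p->p_len = len;
    return 0;
}

/* Prepend a header by moving p_data backwards into the headroom.
 * Returns -1 if there is not enough room in front of the data. */
int sketch_prepend(struct sketch_pbuf *p, const void *hdr, int hlen)
{
    if (hlen < 0 || p->p_data - p->p_databuf < hlen)
        return -1;
    p->p_data -= hlen;
    memcpy(p->p_data, hdr, (size_t)hlen);
    p->p_len += hlen;
    return 0;
}
```

Filling a pbuf with the 28-byte UDP packet at an offset of 20 and then prepending a 20-byte IP header leaves p_data at the start of p_databuf and p_len at 48, with no second pbuf allocated.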

The field p_len is the length of data contained in the pbuf; it is not the total length of the packet. p_type is managed by the pbuf allocation code and p_flags is presently not used at all by the kernel.

In addition to the functions p_get() and p_free() mentioned above, there are some other functions that you may find useful for manipulating pbuf's. They are defined in $PDIR/include/pbuf.h. Some examples of these are:

p_pktlen(): Returns the total length of a packet.
p_freep(): Frees all pbuf's of a packet.
p_copyp(): Makes a copy of a packet.
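
Because p_len covers only one pbuf, the total length of a packet is the sum of p_len over the whole chain. The sketch below shows that walk on a cut-down local structure; sketch_pktlen() is a hypothetical illustration of the idea, not the source of p_pktlen(), whose exact signature is given in $PDIR/include/pbuf.h.

```c
#include <stddef.h>

/* Cut-down local copy of the two pbuf fields needed for the walk. */
struct sketch_pbuf {
    struct sketch_pbuf *p_next;    /* next pbuf in the same packet */
    int                 p_len;     /* data bytes in this pbuf only */
};

/* Total packet length: follow p_next and add up every p_len. */
int sketch_pktlen(const struct sketch_pbuf *p)
{
    int total = 0;

    for (; p != NULL; p = p->p_next)
        total += p->p_len;
    return total;
}
```

For the packet in Figure 3, a first pbuf holding the 20-byte IP header linked to a second pbuf holding 28 bytes gives a total of 48.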

5 Interacting with the link layer

In your projects you will be adding a network layer to the simulator. The network layer transmits and receives packets from the network with the help of the link layer. In this section, we describe how to do this.

5.1 The network interface list

The link layer at each node is initialized by the simulated kernel at boot time. The kernel boot code reads the network configuration file (Section 2.1) and creates a list of networking interfaces on the node. Each element on this list is a struct ifnet defined in $PDIR/include/if.h:

  struct ifnet {
    TAILQ_ENTRY(ifnet)      if_next;

    int                     if_index;       /* interface number */
    struct sockaddr_in      if_addr;        /* address of interface */
    struct sockaddr_in      if_netmask;     /* netmask of if_addr */
    int                     if_mtu;         /* MTU of interface */

    void (*if_start)(struct ifnet *ifp, struct pbuf *p);

    struct hwif             *if_hwif;       /* hardware device */
  };

The head of this list can be accessed by calling the function ifnet_listhead() provided by the simulator. The TAILQ_ENTRY() macro is a macro defined in $PDIR/include/queue.h that is useful for creating linked lists. Iterating over the interface list can be done as follows:

  struct ifnet *ifp = ifnet_listhead();

  for( ; ifp; ifp = TAILQ_NEXT(ifp, if_next)) {
    printf("interface index: %d\n", ifp->if_index);
  }

5.2 Handing packets to the network interface for transmission

Once your IP layer has completely built a packet and has determined which interface the packet should be sent out on, the IP layer can send this packet by calling the if_start() routine of the appropriate interface. Note that if_start() will free the pbuf of a packet after transmitting the packet. For example, if your routing table indicates that a packet should go out interface 1, you would do the following:

  struct ifnet *ifp;
  struct pbuf *p;            /* packet to be sent */

  /* ifp = code to find interface 1 here */

  ifp->if_start(ifp, p);     /* send the packet */

5.3 Getting packets received by the network interface

When a network interface receives a packet from the network, it copies the packet from its own internal buffer into a pbuf data structure (or a pbuf chain if the packet is larger than a pbuf) in main memory. The interface then calls your ip_input() routine which is the entry point into the network layer. The device knows it should call this function because the function is registered as the entry point into the network layer during kernel initialization.

6 The Socket API

The socket layer provides an API (application program interface) for user programs to access the networking functionality of the kernel. In project 1, you wrote your FTP server using the socket API provided by the Solaris kernel, for example, socket(), bind(), accept(), etc. These calls are ``system calls'' provided by the kernel so that user programs can use kernel functionalities.

For user programs to interface to the simulator, you can use the following socket calls: Socket(), Close(), Bind(), Accept(), Connect(), Read(), Write(), Sendto(), Recvfrom(), and Setsockopt(). Their prototypes are defined in $PDIR/include/Socket.h (this header file should be included by user programs, not your kernel). There are two important issues regarding these calls:

Observe that the first letter of each call is capitalized. This is to distinguish them from the actual Solaris system calls, which will go into the Solaris kernel upon invocation. All your user programs will be linked against a library provided by us so that when they invoke these capitalized calls, the corresponding handlers in our simulated kernel (not Solaris) are called.
These calls meet the standard specifications, as described by the man pages of the ``lower-case'' versions on a Solaris machine or by Stevens' network programming book [1]. However, there are some exceptions (for example, added flags for some calls) which will be described in the remainder of this section.

6.0.1 The `Socket()` call

The Socket() call accepts three arguments: family, type, and protocol. It supports the following four combinations of family and type:

AF_INET/SOCK_STREAM: this combination specifies that the user wants to create a TCP socket. The following system calls are allowed on a TCP socket: Close(), Bind(), Accept(), Connect(), Write(), Read(), and Setsockopt(). There is no Listen() call; its functionality is subsumed by the Accept() call, as described in Section 7.
AF_INET/SOCK_DGRAM: this combination specifies that the user wants to create a UDP socket. The following system calls are allowed on a UDP socket: Close(), Bind(), Sendto(), Recvfrom(), and Setsockopt().
AF_INET/SOCK_ICMP: this combination specifies that the user wants to create an ICMP socket. The following system calls are allowed on an ICMP socket: Close() and Recvfrom().
AF_ROUTE/SOCK_RAW: this combination specifies that the user wants to create a routing socket. The following system calls are allowed on a routing socket: Close(), Write(), and Setsockopt().

6.0.2 The `Recvfrom()` call

Here is the prototype of the Recvfrom() system call:

int Recvfrom(int s, void *buf, int len, int flags, struct sockaddr *from, int *fromlen);

The arguments to this call are basically the same as the standard socket call. The Recvfrom() call reads ``one packet at a time''. It returns the length of the message written to the buffer pointed to by the buf argument (the second argument). Even if one packet worth of message does not ``fill up'' the buffer, Recvfrom() will return immediately and will not read the second packet. However, if a message in a packet is too long to fit in the supplied buffer, the excess bytes are discarded.

By default, Recvfrom() is blocking: when a process issues a Recvfrom() that cannot be completed immediately (because there is no packet), the process is put to sleep waiting for a packet to arrive at the socket. Therefore, a call to Recvfrom() will return immediately only if a packet is available on the socket.

When the flags argument of Recvfrom() is set to MSG_NOBLOCK, Recvfrom() does not block if there is no data to be read, but returns immediately with a return value of 0 bytes. MSG_NOBLOCK is defined in $PDIR/include/systm.h. In an actual UNIX system, by contrast, a socket descriptor is made non-blocking by setting the O_NONBLOCK flag with fcntl(), and recvfrom() then fails with errno set to EWOULDBLOCK when there is no data to be read on the non-blocking socket.
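
The Recvfrom() semantics described in this section (one packet per call, truncation of oversized messages, and a return value of 0 with MSG_NOBLOCK when nothing is queued) can be modeled with a small in-memory queue. Everything here is a hypothetical model, not simulator code: SKETCH_NOBLOCK stands in for MSG_NOBLOCK, treating it as a bit flag is an assumption, and the return value -1 merely marks the case where the real call would put the process to sleep.

```c
#include <string.h>

#define SKETCH_NOBLOCK 1           /* stands in for MSG_NOBLOCK (assumed bit) */
#define QCAP 8

struct dgram {
    const char *data;
    int         len;
};

/* Small FIFO of datagrams waiting on a socket. */
struct dgram_queue {
    struct dgram q[QCAP];
    int head;
    int count;
};

/* Model of one Recvfrom() call. Returns the number of bytes copied,
 * 0 if nothing is queued and SKETCH_NOBLOCK is set, and -1 where the
 * real call would block until a packet arrives. */
int sketch_recv(struct dgram_queue *dq, void *buf, int len, int flags)
{
    struct dgram d;
    int n;

    if (dq->count == 0)
        return (flags & SKETCH_NOBLOCK) ? 0 : -1;

    d = dq->q[dq->head];           /* take exactly one packet */
    dq->head = (dq->head + 1) % QCAP;
    dq->count--;

    n = (d.len < len) ? d.len : len;
    memcpy(buf, d.data, (size_t)n);
    return n;                      /* bytes beyond len are discarded */
}
```

With a 5-byte and a 9-byte datagram queued, a call with a 16-byte buffer returns 5 and leaves the second datagram untouched; a following call with a 4-byte buffer returns 4, and the remaining 5 bytes of that datagram are lost.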

6.0.3 The `Sendto()` call

The Sendto() call has an argument flags, which is ignored by the current implementation.

6.0.4 The `Write()` call

As you can see in Section 6.0.1, the Write() call cannot be used with a UDP socket. This is quite different from the standard write() call in UNIX, which can be used with any socket. The Write() call can only be used with a routing or a TCP socket.

6.0.5 The `Setsockopt()` call

The Setsockopt() call has five arguments. The arguments level and optname specify the option that you want to set. The simulator supports only the IPPROTO_IP level and the IP_FW_SET, IP_NAT_SET, and IP_IF_SET options (defined in $PDIR/include/Socket.h). The option value for the IP_IF_SET option needs to be a pointer to a struct if_info (defined in $PDIR/include/route.h). The data structures of the option values for IP_FW_SET and IP_NAT_SET have not been defined.

7 TCP

Typically, there are two types of applications that use TCP sockets: servers and clients. A TCP server listens on a well-known port (or IP address and port pair) and accepts connections from TCP clients. A TCP client initiates a connection request to a TCP server in order to set up a connection with the server. A real TCP server can accept multiple connections on a socket. In contrast, a server socket in the simulator accepts only one TCP connection in its lifetime.

Below is the sequence of socket calls made by a TCP server and a TCP client:

Server: Socket -> Bind -> Accept -> Read/Write -> Close
Client: Socket -> (Bind) -> Connect -> Read/Write -> Close

Below are the details of the socket calls specification.

A server must call Bind() in order to bind the socket to a port, before calling Accept(). Otherwise, Accept() returns an error. Bind() to a client socket is optional.
We make the following changes to the Accept() specification:
1. Accept() returns 0 on success (instead of a new file descriptor), and -1 on failure. Accept() does not create a new file descriptor (unlike the Berkeley Socket specification), and uses the same file descriptor for all subsequent socket calls.
2. The socket starts accepting client connection requests only after the Accept() call has been made (i.e. packets arriving before the Accept() call are discarded).
3. If during the connection establishment phase the protocol times out (e.g. during the TCP three-way handshake) or an error occurs, Accept() waits for other connection attempts, and returns only when a connection has been established successfully.
4. Accept() is always blocking--Accept() should block until after a connection establishment is completed.
Connect() is always blocking. It returns 0 if the connection establishment (TCP three-way handshake) succeeds, and returns -1 if an error occurs. One possible error is a timeout during the three-way handshake. In all cases where errors occur during the Connect() call, an application should not call Connect() again but should call Close() immediately.
Read() is always blocking. On success, the number of bytes read is returned. If Read() returns 0, this indicates an ``End of File'', meaning that the other side has closed the connection, and no more data will be received by this socket.
Write() returns almost immediately, except when the send buffer is full. This buffer is used by TCP to keep all data that could not be sent immediately (because of window limitations), as well as data that has been sent but has not yet been acknowledged. If the send buffer is full, Write() blocks until enough space is freed in the send buffer to enqueue another packet.
When a user program calls Close() on a TCP socket, the socket is marked as closed and the Close() function returns to the process immediately. The socket descriptor is no longer usable by the process, i.e. it would not be usable as an argument to Read() or Write(). TCP tries to send all data that is already queued to be sent to the other end, and after this occurs the normal TCP connection termination sequence takes place [1].

8 Special IP addresses

8.1 The IP address `INADDR_ANY`

When you wrote your simple FTP server in project 1, you probably bound your listening socket to the special IP address INADDR_ANY. This allowed your program to work without knowing the IP address of the machine it was running on, or, in the case of a machine with multiple network interfaces, it allowed your server to receive packets destined to any of the interfaces. In reality, the semantics of INADDR_ANY are more complex and involved.

In the simulator, INADDR_ANY has the following semantics: When receiving, a socket bound to this address receives packets from all interfaces. For example, suppose that a host has interfaces 0, 1 and 2. If a UDP socket on this host is bound using INADDR_ANY and udp port 8000, then the socket will receive all packets for port 8000 that arrive on interfaces 0, 1, or 2. If a second socket attempts to Bind to port 8000 on interface 1, the Bind will fail since the first socket already ``owns'' that port/interface.

When sending, a socket bound with INADDR_ANY binds to the default IP address, which is that of the lowest-numbered interface.

8.2 The IP address `INADDR_BROADCAST`

The kernel picks the UDP or TCP socket to which a packet sent to the INADDR_BROADCAST address (255.255.255.255) is delivered in the following way: If there is a socket that is bound to the address assigned to the interface from which the packet was received, the packet will be delivered to this socket. If there is no such socket, the packet will be delivered to one of the sockets bound to INADDR_ANY. In both cases, the destination port of the packet must match the port to which the socket is bound.
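
The selection rule above can be sketched as a two-level match: prefer a socket bound to the receiving interface's address, otherwise fall back to one bound to INADDR_ANY, and require a matching port either way. The types and names below (struct sk_sock, SK_ANY, pick_bcast_socket()) are hypothetical, and choosing the first INADDR_ANY socket found is an assumption; the text only says ``one of'' them.

```c
#include <stddef.h>
#include <stdint.h>

#define SK_ANY 0u                  /* stands in for INADDR_ANY */

/* Hypothetical record of a bound socket: local address and port. */
struct sk_sock {
    uint32_t addr;
    uint16_t port;
};

/* Choose the socket for a broadcast packet that arrived on the interface
 * with address if_addr and carries destination port dport.
 * Returns NULL if no bound socket matches. */
const struct sk_sock *pick_bcast_socket(const struct sk_sock *socks, size_t n,
                                        uint32_t if_addr, uint16_t dport)
{
    const struct sk_sock *any = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (socks[i].port != dport)
            continue;              /* ports must match in both cases */
        if (socks[i].addr == if_addr)
            return &socks[i];      /* exact interface address wins */
        if (socks[i].addr == SK_ANY && any == NULL)
            any = &socks[i];       /* remember the first wildcard socket */
    }
    return any;
}
```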

9 Routing Sockets

**Figure 4:** An example of how a user process could add a route to the kernel's routing table.
    int s; char buf[1024]; str... ...msglen) < 0) { perror("Write"); exit(1); }

Routing sockets are created with domain AF_ROUTE and type SOCK_RAW. This type of socket is used with the following system calls: Socket(), Close(), Write(), and Setsockopt(). A routing socket is a special type of socket that is not specific to any particular network protocol, but allows ``privileged'' user processes to write information into the kernel. User processes use this type of socket to add and remove information from the routing table. This is done by filling in the rt_msghdr structure and passing it to Write(). The rt_msghdr structure is defined in $PDIR/include/route.h. Figure 4 shows example code for modifying the route table.

The following values of the rtm_type field of the rt_msghdr structure are supported:

RTM_ADD: add an entry to the routing table;
RTM_DELETE: delete an entry from the routing table;
RTM_CHANGE: change an entry in the routing table. The implementation of this command is equivalent to performing an RTM_DELETE followed by an RTM_ADD.

These constants are defined in $PDIR/include/route.h.

To look up a route within the kernel, call rt_lookup_dest() (defined in $PDIR/include/rtable.h). This function will return the address of the gateway to which a packet with the given destination address should be sent. It also returns the index of the interface over which the packet should be sent. Note that the current implementation of the simulator supports only host routes; there is no support for network masks or prefix matching. If there is no host route available in the routing table, the function returns the default gateway. The default gateway route is defined as a route that has a destination address of 0.0.0.0. It can be inserted into the routing table in the same way as host routes.
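
The lookup rule (exact host route first, otherwise the 0.0.0.0 default route) can be sketched on a plain array. This does not reproduce rt_lookup_dest() or its signature, which is declared in $PDIR/include/rtable.h; struct sk_route and sketch_lookup() are hypothetical names used only to illustrate the matching order.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical routing table entry. */
struct sk_route {
    uint32_t dst;                  /* destination host; 0 = default route */
    uint32_t gateway;              /* next hop for this destination */
    int      if_index;             /* outgoing interface */
};

/* Host-route lookup: an exact match on dst wins; otherwise the default
 * route (dst 0.0.0.0) is used if present. No masks, no prefix matching.
 * Returns NULL if neither exists. */
const struct sk_route *sketch_lookup(const struct sk_route *tbl, size_t n,
                                     uint32_t dst)
{
    const struct sk_route *def = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        if (tbl[i].dst == dst)
            return &tbl[i];
        if (tbl[i].dst == 0 && def == NULL)
            def = &tbl[i];
    }
    return def;
}
```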

10 ICMP

ICMP is an integral part of any IP implementation. ICMP is normally used to communicate error messages between IP nodes (both routers and endhosts), but it is occasionally used by user-level applications such as traceroute. If you are not familiar with ICMP, you should consult RFC 792 [2] and/or Stevens' TCP/IP Illustrated, Volume 1 [3].

An ICMP message can be either a query message or an error message, and it has a type field and a code field. To send an ICMP message within the kernel, call icmp_send() (defined in $PDIR/include/icmp.h) and pass it the packet causing the error condition and the desired type and code.

There is also a mechanism for a user-level process to read ICMP packets received by a host. In a real UNIX socket implementation, a process opens a RAW socket to receive ICMP packets. In the simulator, there is a new socket type SOCK_ICMP, which is defined in $PDIR/include/systm.h. You can open an ICMP socket as follows:

Socket(AF_INET, SOCK_ICMP, 0);

The ICMP socket implementation also supports the Recvfrom() and Close() socket calls. Note that the ICMP header of a received packet is not stripped by the kernel so that a user-level process can access the ICMP header to see the error type and code, among other things.

Another important issue is that there is no need to bind an ICMP socket to a particular IP address: the ICMP socket will get ICMP messages received by any of the host's interface(s) (similar to the use of INADDR_ANY). As a result, you should make sure that at most one ICMP socket is opened at any given time. If an ICMP packet arrives at a host, and no ICMP socket is opened, the host drops the packet.

11 Internet checksum: the `in_cksum()` function

Checksum computation is one of the operations that dominate packet processing time. Efficient checksum computation is difficult to implement since it is hardware dependent. Therefore, an operating system kernel usually implements several machine-dependent versions of the Internet checksum function in_cksum() to be used on different platforms. To simplify your task in projects 2 and 3, we provide you with a portable C version of the in_cksum() function (see $PDIR/include/in_cksum.h). This version is from the BSD TCP/IP implementation (though modified to use our pbuf structure, instead of the BSD mbuf). in_cksum() calculates the checksum of the packet specified by the first argument (a pointer to pbuf) with length specified by the second argument. You can use this function to compute all checksums in projects 2 and 3.
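
For reference, the Internet checksum itself is a 16-bit one's-complement sum, and a minimal version over a flat byte buffer is sketched below. This is not the provided in_cksum(): that function walks a pbuf chain, whereas sketch_cksum() is a hypothetical helper that assumes contiguous data of at most 65535 bytes and returns the checksum as a host-order value computed over big-endian 16-bit words.

```c
#include <stddef.h>
#include <stdint.h>

/* One's-complement Internet checksum over a contiguous buffer.
 * ASSUMES len <= 65535 so the 32-bit accumulator cannot overflow. */
uint16_t sketch_cksum(const void *data, size_t len)
{
    const uint8_t *p = data;
    uint32_t sum = 0;

    while (len > 1) {              /* add up big-endian 16-bit words */
        sum += (uint32_t)((p[0] << 8) | p[1]);
        p += 2;
        len -= 2;
    }
    if (len == 1)                  /* odd trailing byte: pad with zero */
        sum += (uint32_t)(p[0] << 8);

    while (sum >> 16)              /* fold the carries back in */
        sum = (sum & 0xffff) + (sum >> 16);

    return (uint16_t)~sum;
}
```

For the bytes 00 01 f2 03 f4 f5 f6 f7 the folded sum is 0xddf2, so the checksum is 0x220d. Running the same function over the data followed by the bytes 22 0d gives 0, which is how a receiver verifies a packet.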

Bibliography

1: W. Richard Stevens.
UNIX Network Programming Volume 1, Networking APIs: Sockets and XTI.
Prentice Hall, second edition, 1997.
2: J. Postel.
Internet Control Message Protocol.
RFC 792, USC/Information Sciences Institute, September 1981.
3: W. Richard Stevens.
TCP/IP Illustrated, Volume 1: The Protocols.
Addison-Wesley, 1994.

About this document ...

The translation was initiated by Urs Hengartner on 2001-10-02
