# TCP/IP: Inside the Internet

Textbook: since this section is inexplicably missing from this year's textbook, we'll make do with these notes, and the chapters handed out in class.

Note: We will undoubtedly defer a large part of this to the next session.

## The Internet: Big, and Getting Bigger

```
Number of host computers on the Internet

  year            hosts
  1981              213
  1987           28,000
  1992        1,100,000
  1997       16,000,000
  Jan 1999   43,200,000
  Jan 2003  171,638,297
  Jan 2007  433,193,199
```

How does my message get through the Internet?
Also, I can send pictures and movies (and viruses) as attachments. How does that work?

## Representing information: Bits and bytes

A bit is a single binary digit, either 0 or 1. Not the only way to build hardware, but the simplest.
Think:

electricity through wire = 1
no electricity through wire = 0

Bits are so small that they're inconvenient. So: a byte is a binary number contained in 8 bits:

00000000₂ = 0₁₀    11111111₂ = 255₁₀
A byte can represent/stand for/symbolize 256 things.
Which 256 things depends on the byte's type!
The most basic types are integers, reals, and characters(!)

(A kilobyte is 2¹⁰ = 1,024 bytes.
A megabyte is 2²⁰ = 1,048,576 bytes.)
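To make the "it depends on the type" point concrete, here is a small sketch that reads the same 8 bits as an integer and as a character. The function names `interpret_byte` and `patterns` are made up for this illustration.

```python
def interpret_byte(bits: str) -> dict:
    """Read one 8-bit pattern as an unsigned integer and, if possible, as ASCII."""
    value = int(bits, 2)
    # ASCII only defines codes 0..127, so larger values have no character here.
    character = chr(value) if value < 128 else None
    return {"integer": value, "character": character}


def patterns(n_bits: int) -> int:
    """Number of distinct values that n_bits bits can stand for."""
    return 2 ** n_bits


if __name__ == "__main__":
    print(interpret_byte("01000001"))  # {'integer': 65, 'character': 'A'}
    print(patterns(8))                 # 256
    print(patterns(10))                # 1024 (bytes in a kilobyte)
```

The bits never change; only the reading applied to them does.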

### Integers

Integers are represented fairly directly as binary numbers in N bytes,
with N depending on how new the computer is.
The only tricky part is representing negative integers;
two's complement is the name of the dominant method.
For our purposes, just think of using one bit to represent the sign.
So one byte can hold -127 to +127 (two's complement squeezes in one more value, -128).

(As an aside, there are some situations where you need really big integers, and some languages like Lisp support that.)
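For the curious, here is a minimal sketch of the two's complement encoding itself, for a single byte by default. The helper names are hypothetical. The idea is that a negative number is stored as the bit pattern of that number plus 2ⁿ, so the top bit ends up acting as a sign marker.

```python
def to_twos_complement(value: int, n_bits: int = 8) -> str:
    """Return the n_bits-wide two's-complement bit pattern of value."""
    lowest = -(1 << (n_bits - 1))
    highest = (1 << (n_bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError(f"{value} does not fit in {n_bits} bits")
    # Masking maps negative values onto the upper half of the unsigned range.
    return format(value & ((1 << n_bits) - 1), f"0{n_bits}b")


def from_twos_complement(bits: str) -> int:
    """Decode a two's-complement bit pattern back into an integer."""
    value = int(bits, 2)
    if bits[0] == "1":          # top bit set: the number is negative
        value -= 1 << len(bits)
    return value


if __name__ == "__main__":
    print(to_twos_complement(5))             # 00000101
    print(to_twos_complement(-1))            # 11111111
    print(from_twos_complement("10000000"))  # -128
```

Python's own integers grow as large as memory allows, which is the "really big integers" facility mentioned in the aside.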

### Reals (or "doubles"!)

Real numbers are trickier, because you'd like to work with really big and really small numbers.
So we typically use floating-point numbers, which work like scientific notation, e.g., -1.6745 × 10⁴ or 3.142 × 10⁻¹⁴, except in base 2.
For this, the dominant scheme is the IEEE 754 standard.
In its 32-bit format, this uses:
• 1 bit for the sign
• 8 bits for the exponent (stored with 127 added, so the stored value is never negative)
• 23 bits for the fractional part of the mantissa

So 3.14 in 32 bits is

`0 10000000 10010001111010111000011`
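The bit fields can be checked with Python's standard `struct` module. This is a sketch for ordinary (normalized) numbers only; the function names are invented here, and the decoder assumes the usual 32-bit layout with an exponent bias of 127 and an implicit leading 1 in the mantissa.

```python
import struct


def float32_bits(x: float) -> str:
    """Return the 'sign exponent fraction' fields of x as a 32-bit IEEE float."""
    (word,) = struct.unpack(">I", struct.pack(">f", x))
    bits = format(word, "032b")
    return f"{bits[0]} {bits[1:9]} {bits[9:]}"


def decode_float32(fields: str) -> float:
    """Rebuild the value from the three fields (normalized numbers only)."""
    sign, exponent, fraction = fields.split()
    mantissa = 1 + int(fraction, 2) / 2 ** 23     # implicit leading 1
    value = mantissa * 2.0 ** (int(exponent, 2) - 127)
    return -value if sign == "1" else value


if __name__ == "__main__":
    fields = float32_bits(3.14)
    print(fields)                   # 0 10000000 10010001111010111000011
    print(decode_float32(fields))   # close to, but not exactly, 3.14
```

The decoded value differs slightly from 3.14 because 3.14 has no exact representation in 23 fraction bits.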

### Characters

Characters are translated into bits using the ASCII standard (American Standard Code for Information Interchange - not that it matters).

```
letter  binary code      decimal
'A'      01000001     64 +  1 = 65
'B'      01000010     64 +  2 = 66
 :
'Z'      01011010     64 + 26 = 90
```
And so on, to include lower case letters, punctuation, etc. So
`01001001 00100000 01100001 01101101 00101110`
is the string "I am."
(Languages with other character sets use other encoding schemes, such as Unicode, many of which require more than one byte per letter.)
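The "I am." example can be reproduced in a few lines. This sketch uses hypothetical helper names and handles ASCII text only.

```python
def bits_to_text(bit_string: str) -> str:
    """Decode space-separated 8-bit groups as ASCII characters."""
    codes = [int(group, 2) for group in bit_string.split()]
    return bytes(codes).decode("ascii")


def text_to_bits(text: str) -> str:
    """Encode ASCII text as space-separated 8-bit groups."""
    return " ".join(format(code, "08b") for code in text.encode("ascii"))


if __name__ == "__main__":
    print(bits_to_text("01001001 00100000 01100001 01101101 00101110"))  # I am.
    print(text_to_bits("AZ"))  # 01000001 01011010
```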

The main point: it's all just a bunch of bytes!

## Our goal

On one level... The Internet is a network of wires connected with computers.

On another... The Internet gives programs the capability to communicate between computers.

We'll see how the Internet bridges this gap.

## Division of labor: Layers

To simplify the matter, we split the bridge into four layers:

• application layer: transmits data between application programs (e.g., HTTP, SMTP, FishNet).
• transport layer: works with packets but provides the illusion of a telephone-like connection between programs (e.g., TCP).
• internetwork layer: "best-effort" delivery of packets across Internet (e.g., IP).
• physical layer: gets message through single physical network (e.g., Ethernet, DSL, cable-modem).

Each layer adds a header explaining how to handle the message at its level:

```
+----------+--------------+---------------+------------------------------+
| physical | internetwork | transport     |       application            |
+----------+--------------+---------------+------------------------------+
<--- front
```
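The header stacking can be sketched as plain byte concatenation. The bracketed header strings below are placeholders invented for this illustration; real headers are binary structures with many fields.

```python
# Placeholder headers (hypothetical); real ones carry addresses, ports, etc.
HEADERS = {
    "transport": b"[TCP]",
    "internetwork": b"[IP]",
    "physical": b"[ETH]",
}


def encapsulate(message: bytes, headers: dict) -> bytes:
    """Going down the stack, each layer puts its header in front of what it got."""
    packet = message
    for layer in ("transport", "internetwork", "physical"):
        packet = headers[layer] + packet
    return packet


def decapsulate(packet: bytes, headers: dict) -> bytes:
    """Going up the stack, each layer strips its own header and passes the rest on."""
    for layer in ("physical", "internetwork", "transport"):
        header = headers[layer]
        if not packet.startswith(header):
            raise ValueError(f"missing {layer} header")
        packet = packet[len(header):]
    return packet


if __name__ == "__main__":
    wire = encapsulate(b"hello", HEADERS)
    print(wire)                        # b'[ETH][IP][TCP]hello'
    print(decapsulate(wire, HEADERS))  # b'hello'
```

The physical header ends up at the front because it is added last on the way down and read first on the way up.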

Of the physical layer we will say little: It magically sends a message across a single network.

The IP software must figure out where a packet should go, and how to get it there.

Machines have two names: a mnemonic name (composed of words) for humans to remember:

```
  jasmine.bh.andrew.cmu.edu
```
and a 4-byte numerical IP address that is really used by the machines:
```
  128.2.124.152
```
The IP address is used to describe the destination of a message. The first two numbers, 128.2, indicate CMU's network domain, cmu.edu.
(CMU has so many machines that we now use other numeric domains in addition to 128.2.)

(The hierarchical naming in both the mnemonic and numerical forms is very clever, but not essential to examine for our purposes.)

In order to send a message, the computer must first convert the mnemonic name into the real numerical IP address. This is called name resolution.

Since the Internet is big and always changing, the computer contacts a domain name server (DNS) to resolve the IP address. In principle, to find jasmine.bh.andrew.cmu.edu, your computer

• contacts the edu domain name server,
• which sends you to the cmu name server,
• which sends you to the andrew name server,
• which sends you to the bh name server,
• which returns the IP address for jasmine.

Of course, done literally this would make the .edu (or .com!) domain name server really busy, and each name resolution would take a long time.
So in practice the system uses caches: each computer in the chain stores the IP addresses it sees. This saves time and network traffic, and allows names to be resolved quickly most of the time. But if you're the first person in a while to try to contact a webserver in Zanzibar, it will take noticeably longer for the DNS to resolve the name.
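The effect of caching can be seen in a toy model. Everything here is a simplification for illustration: the nested dictionary `ZONES` stands in for the chain of name servers, the `Resolver` class is hypothetical, and cached entries never expire (real caches discard entries after a while).

```python
# Toy zone data: each level plays the role of one name server in the chain.
ZONES = {"edu": {"cmu": {"andrew": {"bh": {"jasmine": "128.2.124.152"}}}}}


class Resolver:
    def __init__(self, zones: dict):
        self.zones = zones
        self.cache = {}     # name -> address seen earlier
        self.lookups = 0    # how many steps down the hierarchy we have taken

    def resolve(self, name: str) -> str:
        if name in self.cache:          # answered locally, no traffic at all
            return self.cache[name]
        node = self.zones
        for label in reversed(name.split(".")):   # edu, cmu, andrew, bh, jasmine
            self.lookups += 1
            node = node[label]
        self.cache[name] = node
        return node


if __name__ == "__main__":
    resolver = Resolver(ZONES)
    print(resolver.resolve("jasmine.bh.andrew.cmu.edu"), resolver.lookups)
    print(resolver.resolve("jasmine.bh.andrew.cmu.edu"), resolver.lookups)
```

The second call returns the same address without increasing the lookup count, which is the whole point of the cache.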

Who is in charge of domain names? The Internet Corporation for Assigned Names and Numbers (ICANN) coordinates them.

### Routing

Since we aren't worrying about the physical layer, the IP only needs to worry about routing between networks. Messages get between networks via gateways, computers that are members of more than one network, which transfer packets between networks.

The gateway computers have routing tables that tell them where non-local messages go. Consider the following relatively simple case:

```
              gateway             gateway             gateway
              10.0.0.5            20.0.0.6            30.0.0.7
              20.0.0.5            30.0.0.6            40.0.0.7

network 10.?.?.?    network 20.?.?.?    network 30.?.?.?    rest of Internet
```

The routing table of the middle gateway above might look something like this:

| if destination is: | then route to: |
|---|---|
| 10.?.?.? | 20.0.0.5 |
| 20.?.?.? | local destination |
| 30.?.?.? | local destination |
| else | 30.0.0.7 |
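The middle gateway's table translates almost directly into a lookup. This is a sketch that matches only on the first number of the address, as the table does; real routers match on address prefixes of varying length. The names `ROUTING_TABLE`, `DEFAULT_ROUTE`, and `next_hop` are made up here.

```python
LOCAL = "local destination"

# Routing table of the middle gateway (20.0.0.6 / 30.0.0.6), keyed by first number.
ROUTING_TABLE = {
    10: "20.0.0.5",   # hand off to the gateway that also sits on network 10
    20: LOCAL,
    30: LOCAL,
}
DEFAULT_ROUTE = "30.0.0.7"   # everything else heads toward the rest of the Internet


def next_hop(destination: str) -> str:
    """Return where the middle gateway sends a packet for this destination."""
    first_number = int(destination.split(".")[0])
    return ROUTING_TABLE.get(first_number, DEFAULT_ROUTE)


if __name__ == "__main__":
    print(next_hop("10.1.2.3"))       # 20.0.0.5
    print(next_hop("30.4.5.6"))       # local destination
    print(next_hop("128.2.124.152"))  # 30.0.0.7
```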

These routing tables need to evolve over time. Periodically, gateways tell their neighbors about the best routes they know. If the recipient decides it needs to update its routing table, it tells its neighbors.

As mentioned before, gateways do not guarantee delivery, only best-effort. They frequently drop packets, for a number of reasons:

• the gateway is too busy
• the gateway doesn't know a route to the destination
• the packet has passed through too many computers (the net might have a loop!)
• and other reasons

50% packet loss is not uncommon!
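Two of the drop reasons above (no known route, and too many hops) can be sketched with a hop budget that shrinks at every gateway. The gateway names and the budget of 8 are arbitrary choices for this illustration; IP carries a similar counter in its header.

```python
def forward(next_hops: dict, start: str, destination: str, hop_budget: int = 8):
    """Follow next-hop pointers from start; return (delivered, hops_taken)."""
    current, hops = start, 0
    while current != destination:
        if hop_budget == 0:
            return False, hops        # passed through too many computers: drop
        if current not in next_hops:
            return False, hops        # this gateway knows no route: drop
        current = next_hops[current]
        hop_budget -= 1
        hops += 1
    return True, hops


if __name__ == "__main__":
    print(forward({"A": "B", "B": "C"}, "A", "C"))   # (True, 2)
    print(forward({"A": "B", "B": "A"}, "A", "C"))   # (False, 8), a routing loop
    print(forward({"A": "B"}, "A", "C"))             # (False, 1), no route at B
```

Without the budget, the packet in the second case would circle between A and B forever.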

## TCP: making IP look better

IP gives us

• best-effort delivery
• between computers

But our programs want

• reliable delivery
• between programs

This is a job for TCP (Transmission Control Protocol). TCP covers for the packet loss, reordering, and other nasty details of IP.

### Ports

Since more than one program might want to use the Internet on a single computer, each program reserves a port when it wants to communicate. A port number fits in two bytes, so there are 65,536 of them (0 to 65,535).

When a program establishes a TCP connection, it sends its port number, so that the other program knows how to find it to respond (its numerical internet address is already in the IP header).

### Clients and Servers

A server is a program waiting for connections on a computer with a port reserved. Common servers have certain well-known ports reserved for them, so that other programs can easily find them and send them messages:

| port | protocol |
|---|---|
| 21 | FTP |
| 25 | SMTP |
| 53 | DOMAIN |
| 80 | HTTP |
| 1530 | FishNet |

A client reserves a port on its own computer and sends messages to the server by sending messages to the server's port-computer combination. Then the server can respond by sending messages to the client's port-computer combination. And they talk.

(Notice there's nothing wrong with a server or client talking to multiple programs using the same port.)
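Here is a minimal sketch of a server and a client talking over TCP on one machine, using Python's standard `socket` module. The upper-casing "service" and the function names are invented for the example, and the server asks the operating system for any free port rather than a well-known one.

```python
import socket
import threading


def serve_once(server_sock: socket.socket) -> None:
    """Accept one connection, reply with the data in upper case, then close."""
    conn, _client_address = server_sock.accept()
    with conn:
        data = conn.recv(1024)
        conn.sendall(data.upper())


def demo():
    """Run one exchange on the local machine; return (reply, server_port, client_port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))       # port 0: let the OS pick a free port
        server.listen(1)
        server_port = server.getsockname()[1]
        worker = threading.Thread(target=serve_once, args=(server,))
        worker.start()
        # The client also gets a port of its own, chosen by the OS.
        with socket.create_connection(("127.0.0.1", server_port)) as client:
            client_port = client.getsockname()[1]
            client.sendall(b"hello")
            reply = client.recv(1024)
        worker.join()
    return reply, server_port, client_port


if __name__ == "__main__":
    print(demo())
```

The server learns the client's address and port from the connection itself, which is how it knows where to send the reply.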

### Reliable delivery

The basic approach to providing reliable delivery is straightforward, but things get complicated in order to be efficient.

The receiver sends an acknowledgement message (ACK) when it receives some data. If the sender doesn't get an ACK soon enough, it resends the data.

The packets in one connection are numbered in order to allow the receiver to be sure it has them all, and in the right order.

One challenging problem is deciding how long to wait before giving up on acknowledgements.

The sender adapts based on what it has recently seen. Doing this well turns out to be quite complicated.

Let's skip that part.
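The basic ACK-and-resend idea (without the adaptive timing) can be simulated in a few lines. This is a sketch under simplifying assumptions: one packet is outstanding at a time, a lost packet or lost ACK is modelled by a seeded random draw, and "waiting too long" is simply the next loop iteration. The function name and parameters are hypothetical.

```python
import random


def send_reliably(chunks, loss_rate=0.3, seed=1, max_tries=50):
    """Deliver chunks in order over a lossy channel; return (received, transmissions)."""
    rng = random.Random(seed)
    received = []
    transmissions = 0
    for number, chunk in enumerate(chunks):
        for _ in range(max_tries):
            transmissions += 1
            if rng.random() < loss_rate:
                continue                  # data packet lost: no ACK, so resend
            if len(received) == number:   # receiver keeps only the packet it expects
                received.append(chunk)    # (a resent duplicate is ignored)
            if rng.random() < loss_rate:
                continue                  # ACK lost: sender cannot tell, so resend
            break                         # ACK arrived: move on to the next packet
        else:
            raise RuntimeError("gave up after too many tries")
    return received, transmissions


if __name__ == "__main__":
    data = ["a", "b", "c", "d"]
    print(send_reliably(data))                  # all four arrive, in order
    print(send_reliably(data, loss_rate=0.0))   # (['a', 'b', 'c', 'd'], 4)
```

The packet numbers are what let the receiver discard the duplicate that arrives when only the ACK was lost.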

### Sliding window

Our simple acknowledgement protocol is very slow, like a bucket brigade with only one bucket.

It would clearly be better to use many buckets at once.
This is done with a sliding window:
```
     +-------------------+
+----|----+----+----+----|----+----+----+----+----+----+
|  1 |  2 |  3 |  4 |  5 |  6 |  7 |  8 |  9 | 10 | 11 |
+----|----+----+----+----|----+----+----+----+----+----+
     +-------------------+
      A sliding window of size 4
```
The window size corresponds to the number of buckets.
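A rough way to see the benefit is to count round trips. The sketch below assumes no packets are lost and that every packet in flight is acknowledged one round trip after it was sent; both are simplifications, and the function names are invented for the example.

```python
def window_contents(oldest_unacked: int, window_size: int) -> list:
    """Packet numbers the sender may have in flight right now."""
    return list(range(oldest_unacked, oldest_unacked + window_size))


def round_trips_needed(n_packets: int, window_size: int) -> int:
    """Round trips to deliver n_packets when nothing is lost."""
    oldest_unacked = 0
    next_to_send = 0
    rounds = 0
    while oldest_unacked < n_packets:
        # Fill the window: send until window_size packets are unacknowledged.
        while next_to_send < n_packets and next_to_send - oldest_unacked < window_size:
            next_to_send += 1
        # One round trip later the ACKs arrive and the window slides forward.
        oldest_unacked = next_to_send
        rounds += 1
    return rounds


if __name__ == "__main__":
    print(window_contents(2, 4))       # [2, 3, 4, 5]
    print(round_trips_needed(11, 1))   # 11 (one bucket)
    print(round_trips_needed(11, 4))   # 3  (four buckets)
```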

### Sliding-window delivery

Armed with the information above, we can understand the actual TCP headers used in the Internet:
```
byte  0   source port
byte  2   destination port
byte  4   sequence number
            (tells which segment is sent)
byte  8   acknowledgement number
            (tells which segment has been received)
byte 12   header length (4 bits)
byte 12.5 ignore (reserved bits and flags)
byte 14   desired window size
byte 16   ignore (checksum and urgent pointer)

byte 20
   :      options
   :
byte ??
   :      application message
```
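The fixed 20-byte part of this header can be pulled apart with Python's standard `struct` module. This is a sketch: the sample segment is fabricated for the example (the port, sequence, and window values are arbitrary), and the fields the notes tell us to ignore are unpacked but not interpreted.

```python
import struct

# Network byte order: two 2-byte ports, two 4-byte numbers, then four 2-byte fields.
TCP_FIXED_HEADER = struct.Struct("!HHIIHHHH")   # 20 bytes in total


def parse_tcp_header(segment: bytes) -> dict:
    """Split a TCP segment into the header fields discussed above and the message."""
    (source_port, destination_port, sequence, acknowledgement,
     length_and_flags, window, _checksum, _urgent) = TCP_FIXED_HEADER.unpack(segment[:20])
    header_bytes = (length_and_flags >> 12) * 4   # top 4 bits count 4-byte words
    return {
        "source_port": source_port,
        "destination_port": destination_port,
        "sequence": sequence,
        "acknowledgement": acknowledgement,
        "window": window,
        "message": segment[header_bytes:],        # whatever follows header and options
    }


# A made-up segment: header length 5 words (no options), window of 4096.
SAMPLE = TCP_FIXED_HEADER.pack(49152, 80, 1000, 2000, 5 << 12, 4096, 0, 0) + b"GET"

if __name__ == "__main__":
    print(parse_tcp_header(SAMPLE))
```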

## Application protocols

Okay, so TCP/IP gives programs the ability to communicate smoothly between computers. Now, what do we want to do with that?
Let's look at simple examples of two of the most popular Internet applications: the WWW and email.

### HTTP: Web content

HTTP, the HyperText Transfer Protocol, is the basis for Web communication.
Suppose we point our browser at
```
http://avrim.pc.cs.cmu.edu/index.html
```
This indicates to the browser that it should use HTTP to request the file /index.html from avrim.pc.cs.cmu.edu. So the browser uses TCP to open a connection to port 80 on that machine (since that is HTTP's well-known port number).

Once the connection is open, it sends the following message to the web server there:

```
GET /index.html HTTP/1.1
Host: avrim.pc.cs.cmu.edu
Accept: text/html

```
(This ends with a blank line.)

The server responds with a message like the following, and then closes the connection:

```
HTTP/1.0 200 Document follows
Server: CERN/3.0A
Date: Mon, 11 Jan 1999 03:22:42 GMT
Content-Type: text/html
Content-Length: 115
Last-Modified: Mon, 11 Jan 1999 03:17:24 GMT

<p>I'm <tt>avrim.pc.cs.cmu.edu</tt>; my primary user is
<a href=http://www.cburch.com/>Carl Burch</a>.</p>
```

The HTML (HyperText Markup Language) encoding seen here is not part of the network protocols, but rather a well-designed way of embedding addresses etc. invisibly into text.
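Since HTTP messages are just lines of text, building a request and taking a response apart needs nothing beyond string handling. This is a sketch with hypothetical function names; it includes a `Host` line, which HTTP/1.1 requires, and it does no networking itself (the bytes would be sent over a TCP connection to port 80).

```python
def build_request(host: str, path: str) -> bytes:
    """Assemble a GET request; lines end with CR LF and a blank line ends the headers."""
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {host}",
        "Accept: text/html",
        "",     # the blank line that ends the request
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def parse_response(raw: bytes):
    """Split a response into (status code, reason, headers, body)."""
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line, *header_lines = head.decode("iso-8859-1").split("\r\n")
    _version, code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return int(code), reason, headers, body


if __name__ == "__main__":
    print(build_request("avrim.pc.cs.cmu.edu", "/index.html"))
    sample = (b"HTTP/1.0 200 Document follows\r\n"
              b"Content-Type: text/html\r\n"
              b"\r\n"
              b"<p>hi</p>")
    print(parse_response(sample))
```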

### SMTP: email

Most email on the Internet is transferred using SMTP (Simple Mail Transfer Protocol).

Suppose I'm spot@cburch.com working on avrim.pc.cs.cmu.edu and I tell my email program to send email to burch@andrew.cmu.edu. It uses TCP on avrim to open a connection to port 25 on andrew.cmu.edu. (Below, lines marked C: are sent from avrim, and lines marked S: are andrew's replies.)

First andrew responds with a welcome message (220 codes let any program reading this know that it's a welcome message):

```
S: 220-andrew.cmu.edu ESMTP Sendmail 8.8.5/8.8.2
S: 220-Mis-identifying the sender of mail is an abuse of computing facilities
S: 220 ESMTP spoken here
C: helo avrim.pc.cs.cmu.edu
S: 250 andrew.cmu.edu Hello AVRIM.PC.CS.CMU.EDU [128.2.185.114], pleased to meet you
C: mail from: spot@cburch.com
S: 250 spot@cburch.com... Sender ok
C: rcpt to: burch@andrew.cmu.edu
S: 250 burch@andrew.cmu.edu... Recipient ok
C: data
S: 354 Enter mail, end with "." on a line by itself
C: Arf, arf!
C: .
S: 250 XAA21092 Message accepted for delivery
C: quit
S: 221 andrew.cmu.edu closing connection
```
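The numeric codes are what make this conversation easy for a program to follow. The sketch below groups server lines into replies (a `-` after the code means more lines follow, a space marks the last line) and lists the lines a client would send. The function names are hypothetical, and the doubling of a leading "." in the message body is the standard SMTP precaution so that body text is not mistaken for the end-of-message marker.

```python
def split_replies(lines):
    """Group raw server lines into (code, [text lines]) replies."""
    replies, texts = [], []
    for line in lines:
        texts.append(line[4:])
        if line[3:4] != "-":            # "220 ..." ends a reply, "220-..." continues it
            replies.append((int(line[:3]), texts))
            texts = []
    return replies


def client_lines(client_host, sender, recipient, body_lines):
    """Lines a client sends, in order, for one simple message."""
    commands = [
        f"helo {client_host}",
        f"mail from: {sender}",
        f"rcpt to: {recipient}",
        "data",
    ]
    # A body line starting with "." gets an extra "." in front of it.
    body = ["." + line if line.startswith(".") else line for line in body_lines]
    return commands + body + [".", "quit"]


if __name__ == "__main__":
    welcome = [
        "220-andrew.cmu.edu ESMTP Sendmail 8.8.5/8.8.2",
        "220-Mis-identifying the sender of mail is an abuse of computing facilities",
        "220 ESMTP spoken here",
    ]
    print(split_replies(welcome))
    print(client_lines("avrim.pc.cs.cmu.edu", "spot@cburch.com",
                       "burch@andrew.cmu.edu", ["Arf, arf!"]))
```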

## Current network research

• malevolent traffic
• mobile computers
• wireless networks
• guarantees on throughput, delay, etc
• multimedia
• speed