Basics of Information Theory

David S.Touretzky
Computer Science Department
Carnegie Mellon University
Pittsburgh,PA 15213

dst@cs.cmu.edu

Version of 24 November 2004

Translations available: Belorussian

Preliminaries

Although information is sometimes measured in characters, as when describing the length of an email message, or in digits (as in the length of a phone number), the convention in information theory is to measure information in bits. A "bit" (the term is a contraction of binary digit) is either a zero or a one. Because there are 8 possible configurations of three bits (000, 001, 010, 011, 100, 101, 110, and 111), we can use three bits to encode any integer from 1 to 8. So when we refer to a "3-bit number", what we mean is an integer in the range 1 through 8. All logarithms used in this paper will be to the base two, so log 8 is 3. Similarly, log 1000 is slightly less than 10,and log 1,000,000 is slightly less than 20.

Efficient Encodings

Suppose you flip a coin one million times and write down the sequence of results. If you want to communicate this sequence to another person, how many bits will it take? If it's a fair coin, the two possible outcomes, heads and tails, occur with equal probability. Therefore each flip requires 1 bit of information to transmit. To send the entire sequence will require one million bits.

But suppose the coin is biased so that heads occur only 1/4 of the time, and tails occur 3/4. Then the entire sequence can be sent in 811,300 bits, on average. (The formula for computing this will be explained below.) This would seem to imply that each flip of the coin requires just 0.8113 bits to transmit. How can you transmit a coin flip in less than one bit, when the only language available is that of zeros and ones? Obviously, you can't. But if the goal is to transmit an entire sequence of flips, and the distribution is biased in some way, then you can use your knowledge of the distribution to select a more efficient code. Another way to look at it is: a sequence of biased coin flips contains less "information" than a sequence of unbiased flips, so it should take fewer bits to transmit.

Let's look at an example. Suppose the coin is very heavily biased, so that the probability of getting heads is only 1/1000, and tails is 999/1000. In a million tosses of this coin we would expect to see only about 1,000 heads. Rather than transmitting the results of each toss, we could just transmit the numbers of the tosses that came up heads; the rest of the tosses can be assumed to be tails. Each toss has a position in the sequence: a number between 1 and 1,000,000. A number in that range can be encoded using just 20 bits. So, if we transmit 1,000 20-bit numbers, we will have transmitted all the information content of the original one million toss sequence, using only around 20,000 bits. (Some sequences will contain more than 1,000 heads, and some will contain fewer, so to be perfectly correct we should say that we expect to need 20,000 bits on average to transmit a sequence this way.)

What if we invented an even cleverer encoding? What is the limit on how efficient any encoding can be? The limit works out to about 0.0114 bits per flip, so we're already very close to the optimal encoding.

The information content of a sequence is defined as the number of bits required to transmit that sequence using an optimal encoding. We are always free to use a less efficient coding, which will require more bits, but that does not increase the amount of information transmitted.

Variable Length Codes

The preceding examples were based on fixed-length codes,such as 12-bit numbers encoding values between 1 and 4,000. We can often do better by adopting a variable length code. Here is an example. Suppose that instead of flipping a coin we are throwing an eight-sided die. Label the sides A-H. To encode a number between 1 and 8 (or between 0 and 7, if you're a computer scientist) takes 3 bits, so a thousand throws of a fair die will take 3,000 bits to transmit. Now suppose the die is not fair, but biased in a specific way: the chances of throwing an A are 1/2, the chances of throwing a B are 1/4, C is 1/8, D is 1/16, E is 1/32, F is 1/64, and G and H are each 1/128. Let us verify that the sum of these probabilities is 1, as it must be for any proper probability distribution: Now let's consider an encoding ideally suited to this probability distribution. If we throw the die and get an A, we will transmit a single 0. If we throw a B we will transmit a 1 followed by a 0, which we'll write 10. If we throw a C the code will be 11 followed by 0, or 110. Similarly we'll use 1110 for D, 11110 for E, 111110 for F, 1111110 for G, and 1111111 for H. Notice that the code for A is very concise, requiring a single bit to transmit. The codes for G and H require 7 bits each, which is way more than the 3 bits needed to transmit one throw if the die were fair. But Gs and Hs occur with low probability, so we will rarely need to use that many bits to transmit a single throw. On average we will need fewer than 3 bits. We can easily calculate the average number of bits required to transmit a throw: it's the sum of the number of bits required to transmit each of the eight possible outcomes, weighted by the probability of that outcome: So 1,000 throws of the die can be transmitted in just 1,984 bits rather than 3,000. This simple variable length code is the optimal encoding for the probability distribution above. In general, though, probability distributions are not so cleanly structured,and optimal encodings are a lot more complicated.

Exercise: suppose you are given a five-sided biased die that has a probability of 1/8 of coming up A, 1/8 for B, and 1/4 for each of C, D, and E. Design an optimal code for transmitting throws of this die. (Answer at end.)

Measuring Information Content

In the preceding example we used a die with eight faces. Since eight is a power of two, the optimal code for a uniform probability distribution is easy to caclulate: log 8 = 3 bits. For the variable length code, we wrote out the specific bit pattern to be transmitted for each face A-H, and were thus able to directly count the number of bits required.

Information theory provides us with a formula for determining the number of bits required in an optimal code even when we don't know the code. Let's first consider uniform probability distributions where the number of possible outcomes is not a power of two. Suppose we had a conventional die with six faces. The number of bits required to transmit one throw of a fair six-sided die is: log 6 = 2.58. Once again,we can't really transmit a single throw in less than 3 bits, but a sequence of such throws can be transmitted using 2.58 bits on average. The optimal code in this case is complicated, but here's an approach that's fairly simple and yet does better than 3 bits/throw. Instead of treating throws individually, consider them three at a time. The number of possible three-throw sequences is = 216. Using 8 bits we can encode a number between 0 and 255, so a three-throw sequence can be encoded in 8 bits with a little to spare; this is better than the 9 bits we'd need if we encoded each of the three throws seperately.

In probability terms, each possible value of the six-sided die occurs with equal probability P=1/6. Information theory tells us that the minmum number of bits required to encode a throw is -log P = 2.58. If you look back at the eight-sided die example,you'll see that in the optimal code that was described, every message had a length exactly equal to -log P bits. Now let's look at how to apply the formula to biased (non-uniform) probability distributions. Let the variable x range over the values to be encoded,and let P(x) denote the probability of that value occurring. The expected number of bits required to encode one value is the weighted average of the number of bits required to encode each possible value,where the weight is the probability of that value: Now we can revisit the case of the biased coin. Here the variable ranges over two outcomes: heads and tails. If heads occur only 1/4 of the time and tails 3/4 of the time, then the number of bits required to transmit the outcome of one coin toss is: A fair coin is said to produce more "information" because it takes an entire bit to transmit the result of the toss: The Intuition Behind the -P log P Formula

The key to gaining an intuitive understanding of the -P log P formula for calculating information content is to see the duality between the number of messages to be encoded and their probabilities. If we want to encode any of eight possible messages, we need 3 bits, because log 8 = 3. We are implicitly assuming that the messages are drawn from a uniform distribution.

The alternate way to express this is: the probability of a particular message occurring is 1/8, and -log(1/8) = 3, so we need 3 bits to transmit any of these messages. Algebraically, log n = -log (1/n), so the two approaches are equivalent when the probability distribution is uniform. The advantage of using the probability approach is that when the distribution is non-uniform, and we can't simply count the number of messages, the information content can still be expressed in terms of probabilities.

Sometimes we write about rare events as carrying a high number of bits of information. For example, in the case where a coin comes up heads only once in every 1,000 tosses, the signal that a heads has occurred is said to carry 10 bits of information. How is that possible, since the result of any particular coin toss takes 1 bit to describe? Transmitting when a rare event occurs, if it happens only about once in a thousand trials, will take 10 bits.  Using our message counting approach, if a value occurs only 1/1000 of the time in a uniform distribution, there will be 999 other  possible values, all equally likely, so transmitting any one value would indeed take 10 bits.

With a coin there are only two possible values. What information theory says we can do is consider each value separately.  If a particular value occurs with probability P, we assume that it is drawn from a uniformly distributed set of values when calculating its information content. The size of this set would be 1/P elements. Thus, the number of bits required to encode one value from this hypothetical set is -log P. Since the actual distribution we're trying to encode is not uniform, we take the weighted average of the estimated information content of each value (heads or tails, in the case of a coin), weighted by the probability P of that value occuring. Information theory tells us that an optimal encoding can do no better than this. Thus, with the heavily biased coin we have the following:

P(heads) = 1/1000,  so heads takes -log(1/1000) = 9.96578 bits to encode

P(tails) = 999/1000,  so tails takes -log(999/1000) = 0.00144 bits to encode

Avg.bits required  = = (1/1000) × 9.96578 + (999/1000) × 0.00144  =  0.01141 bits per coin toss