Information theory

Information theory is a branch of mathematics dealing with the representation, storage, and transmission of information. It has found important applications in biology, electrical engineering, linguistics, and computer science.

This learning resource is intended to give an intuitive introduction to the ideas of information theory. We encourage you to follow the links above and below to other online resources to broaden the understanding you may begin to build here.

Information
Have you ever played the parlor game called Twenty Questions? It's both easy and fun: one person (Answerer) thinks of some specific object (for example, the Earth), and the other (Questioner) asks questions to figure out which object Answerer has chosen. Questioner can ask any question about the object, so long as it can be answered simply "yes" or "no". One question to ask first might be "Is it alive?" Since the Earth itself isn't alive, Answerer says "no" (Answerer must tell the truth). This answered question gives Questioner information about what the object is and helps Questioner decide which question to ask next. If Questioner guesses the object by the $$20^{\rm th}$$ question, Questioner wins; otherwise Answerer wins.

Information theory is simply a way to think about quantifying, representing, and communicating information of the sort the answered question above provides. To be more precise, information theory is relevant to any setting in which there is a set of events $$\{ x_i \}$$ along with one or more ideas of how likely each event is to occur (or to have occurred). A particular idea of how likely each event is to occur is a probability function $$P = \{ p_i \}$$, where $$p_i = \Pr{\{ x_i\} }$$ is the probability that event $$ x_i$$ will occur. In Twenty Questions, the events refer to Answerer's choice of an object: "Answerer chose a tiger", "Answerer chose the Earth", or "Answerer chose his left thumb" are all events that might have occurred at the start of the game. In general we can think of event $$x_i$$ as referring to "Answerer chose the $$i^{\rm th}$$ object from a master list of all objects that exist".

At the start of a game of Twenty Questions, Questioner doesn't know anything about the object Answerer has chosen. That is, as far as Questioner knows, any object is equally likely to be the one Answerer has thought of: a bacterium, Answerer's left thumb, the entire universe. In mathematical terms, $$p_i = p_j$$ for any two events $$i$$ and $$j$$. After the first question is answered, Questioner forms a new probability function from the new information. In information theory we assume that Questioner understands the full importance of the information the answered question provided, because the information is there to be used whether Questioner is a perfect player or not. Most questions asked in Twenty Questions rule out some set of objects entirely; in other settings, new information changes the probability function in other ways, making some events seem more likely and others less likely. The amount of information the answered question provided is the difference between the best probability function that could have been formed before the question was asked, taking full advantage of everything known to that point, and the best probability function available after the question was answered.
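The update described above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical four-object list (the objects and the `update` helper are inventions for this example, not part of any standard library): an answer zeroes out the probability of every object it rules out, and the remaining probabilities are rescaled to sum to one.

```python
# Hypothetical master list of objects Answerer might have chosen.
objects = ["tiger", "the Earth", "bacterium", "a rock"]

# Before any question is asked, every object is equally likely.
prior = {obj: 1 / len(objects) for obj in objects}

def update(dist, is_consistent):
    """Zero out events ruled out by the answer, then renormalize."""
    posterior = {obj: (p if is_consistent(obj) else 0.0)
                 for obj, p in dist.items()}
    total = sum(posterior.values())
    return {obj: p / total for obj, p in posterior.items()}

# "Is it alive?" -- the answer "no" rules out the living objects.
alive = {"tiger", "bacterium"}
posterior = update(prior, lambda obj: obj not in alive)
print(posterior)  # the two surviving objects now have probability 1/2 each
```

Note that this sketch only models questions that rule objects out entirely; as the text observes, in other settings an answer may merely shift probabilities without driving any of them to zero.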

Quantities of information
Quantities of information are relevant to a lot of different questions, such as:


 * How many yes/no questions does it take to guess a number between 1 and 3 billion?
 * How many songs can my MP3 player hold?
 * NASA needs to send a message to a spacecraft orbiting Mars, but up to 10% of the message may get garbled along the way. How can NASA ensure the spacecraft gets the full message?
 * Would a book written in German be shorter than the same book written in Spanish?
 * Which parts of the human genome are genes?

The standard unit of information is the bit: one bit is the amount of information provided by choosing one of two equally likely possibilities. One toss of a fair coin provides one bit of information: before tossing the coin either heads or tails is equally likely to be the result, but afterwards we know whether the coin landed heads up or tails up.

We can now answer the first question above: how many yes/no questions does it take to guess a number between $$1$$ and $$3$$ billion? The best strategy is to ask whether the number is in the upper or lower half of the remaining range, repeating until the range shrinks to a single number. Each question eliminates half the possibilities, no matter whether the answer is "upper half" or "lower half". The number of questions is therefore the number of times the total range can be halved before only one number remains; i.e. what is the smallest $$N$$ such that $$2^N$$ is at least $$3$$ billion? Since $$2^{31}$$ is about $$2.1$$ billion and $$2^{32}$$ is about $$4.3$$ billion, $$32$$ questions suffice. If you were served exactly one of the first $$3$$ billion hamburgers served by McDonald's, knowing exactly which one you were served gives you about $$32$$ bits of information, because $$3,000,000,000$$ must be halved $$32$$ times to reach a range of only $$1$$ number. In general, guessing a number between $$1$$ and $$N$$ requires about $$\log_2{(N)}$$ bits of information, or $$\lceil \log_2{(N)} \rceil$$ yes/no questions.
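The halving argument is easy to check directly. The short sketch below counts how many times a range of $$3$$ billion numbers must be halved (rounding up, since a yes/no answer may leave us with the larger half) before a single number remains, and compares the count to $$\log_2$$ of the range:

```python
import math

n = 3_000_000_000  # size of the guessing range

# Repeatedly halve the range, keeping the larger half each time
# (the worst case for Questioner), until one number remains.
questions = 0
remaining = n
while remaining > 1:
    remaining = math.ceil(remaining / 2)
    questions += 1

print(questions)      # 32 questions suffice in the worst case
print(math.log2(n))   # about 31.48 bits of information
```

The two numbers differ because questions come only in whole units: $$\log_2(3\,{\rm billion}) \approx 31.48$$ bits, so $$31$$ questions are not quite enough and $$32$$ are needed.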

Entropy and Information
Let $$\{ x_i \}$$ be a finite set of possible events (for example, event $$x_i$$ corresponds to "Answerer chose the $$i^{\rm th}$$ object from a known list of all objects Answerer is allowed to choose in a game of Twenty Questions"). We can represent the full information we have about which of these possibilities is the truth (which object Answerer chose) as a discrete probability distribution $$P = \{ p_i = \Pr{\{x_i\}}\}.$$ As a step toward determining the amount of information we have, we define the uncertainty or entropy of $$P$$ as

$$H(P) = -\sum_i{p_i \log_2{(p_i)}}.$$

Note that if $$x_i$$ is one of $$N$$ equally likely possibilities, the amount of information needed to determine that $$x_i$$ is the event that occurred is $$\log_2{(N)} = -\log_2{(1/N)} = -\log_2{(\Pr{\{x_i\}})}$$ bits. Thus we can think of $$-\log_2{(p_i)}$$ as the amount of information required to identify that $$x_i$$ occurred. By weighting each such term by its probability $$p_i$$, the entropy of a probability distribution can be thought of as the average information required to identify which event occurred, under the assumption that the events occur according to the distribution in question (i.e. according to the information the distribution represents). Entropy is thus a measure of how much doubt remains about which event occurred after taking into account the information represented by $$P$$, and so it is appropriate to consider entropy as defined above a measure of uncertainty.
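The entropy formula translates directly into code. A minimal sketch in Python (the convention $$0 \log_2 0 = 0$$ is applied by skipping zero-probability events):

```python
import math

def entropy(p):
    """Entropy H(P) = -sum_i p_i log2(p_i), in bits; 0 log 0 is taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# A fair coin: two equally likely outcomes carry one bit of uncertainty.
print(entropy([0.5, 0.5]))   # 1.0
# Four equally likely objects: log2(4) = 2 bits.
print(entropy([0.25] * 4))   # 2.0
# A heavily biased coin leaves much less doubt about the outcome.
print(entropy([0.9, 0.1]))   # about 0.469
```

The last example shows why entropy is an average rather than a count of possibilities: a biased coin still has two possible outcomes, but because one of them is far more likely, much less information is needed on average to identify the result.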

Receiving the answer to a (helpful) question reduces the amount of uncertainty we have about which event occurred. Information therefore corresponds to a reduction in uncertainty. In fact, this is how information is defined in mathematical terms. Formally, if $$p = \{ p_i \}$$ is the probability distribution over a finite set of possibilities before a question is asked and $$q = \{ q_i \}$$ is the probability distribution over the same set of possibilities after the question is answered, then the information provided by the answered question is

Information $$= H(p) - H(q) = \sum_i {( q_i\, \log_2{q_i} - p_i \log_2{p_i})}$$

The information provided by a question that changes the probability distribution from $$p$$ to $$q$$ is therefore defined as the reduction in entropy from $$p$$ to $$q$$.
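This definition connects back to the Twenty Questions example. In the sketch below (a hypothetical four-object game), a "no" answer rules out two of four equally likely objects, and the entropy reduction comes out to exactly one bit, matching the earlier observation that halving the possibilities yields one bit of information:

```python
import math

def entropy(p):
    """Entropy in bits, with 0 log 0 taken as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Four equally likely objects before the question is asked...
p = [0.25, 0.25, 0.25, 0.25]
# ...and after an answer that rules out two of them.
q = [0.5, 0.5, 0.0, 0.0]

information = entropy(p) - entropy(q)
print(information)  # 1.0 bit: the answer halved the possibilities
```

A question whose answer leaves the distribution unchanged ($$q = p$$) provides zero information by this definition, which matches the intuition that an unhelpful question teaches us nothing.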

Textbooks
Thomas M. Cover and Joy A. Thomas, Elements of Information Theory, 2nd Ed., Wiley-Interscience, 2006. A modern textbook.

Robert B. Ash, Information Theory, Dover Publications, 1990.

John R. Pierce, An Introduction to Information Theory, Dover Publications, 1980. A very readable introduction.

Claude E. Shannon and Warren Weaver, The Mathematical Theory of Communication, University of Illinois Press, 1963.

On-line tutorials
David P. Hornacek, A Brief Tutorial on Information Theory, Excess Entropy and Statistical Complexity: Discovering and Quantifying Statistical Structure, Lecture notes downloadable in several formats, 1997.