Information retrieval is concerned with how to classify information and how to judge the similarity between two objects, such as written documents. As the amount of information available in digital form has grown, so too has the need for accurate and scalable algorithms for handling this information. Information theory is concerned with the production and transmission of information. Using a framework known as the source-channel model of communication, information theory has established theoretical bounds on the limits of data compression and communication in the presence of noise and has contributed to practical technologies as varied as cellular communication and statistical translation. The StART project relies heavily on the source-channel model when considering problems in information retrieval.

The figure below depicts the standard information theoretic view of communication. In some sense, information theory is the study of what belongs in the boxes in this diagram. Before transmitting some information across an unreliable channel, the sender can add redundancy to it, so that noise introduced in transit can be identified and corrected by the receiver. This is known as encoding. There are different ways to model how information is compromised in traveling through a channel, but typically one characterizes a channel's behavior by a conditional probability distribution p(y|x), where x is a random variable representing the input to the channel, and y the output. Decoding is the inverse of encoding: given a message which was encoded and then corrupted, recover the original message.

The classical application of information theory is communication between source and receiver separated by some distance. Deep-space probes and cellular phones, for example, use a form of codes based on polynomial arithmetic in a finite field to guard against losses and errors in transmission. Error-correcting codes are also becoming popular for guarding against packet loss in Internet traffic, where the technique is known as forward error correction.

The source-channel framework has also found application in settings seemingly unrelated to communication. For instance, the now-standard approach to automatic speech recognition views the problem of transcribing a human utterance from a source-channel perspective. In this case, the source message is a sequence of (English, say) words s. In contrast to communication via error-correcting codes, we aren't free to select the code here---rather, it's the product of thousands of years of linguistic evolution. The encoding function maps a sequence of words to a pronounciation x, and the channel ``corrupts'' this into an acoustic signal y. The decoder's responsibility is to recover the original word sequence s, given

One can also apply the source-channel model to language translation. Imagine that the person generating the text to be translated originally thought of a string x of English words, but the words were ``corrupted'' into a French sequence y in writing them down. Here the channel is purely conceptual, but no matter; decoding is still a well-defined problem of recovering the original English x, given the observed French sequence y, a model p(y| x) for how English translates to French, and a prior p(x) on English word sequences.

For document retrieval, we take the view that in formulating a query, a user is distilling an information need into a succinct representation. The goal of a retrieval algorithm is to discover, among the documents available to it, which are closest to this information need. The distillation plays the role of a noisy channel, corrupting the information need into a query; retrieval is a decoding step, from query to back to document(s). For single-label document classification, we take the source message---the message to be recovered---to be the correct label of the document. The encoding step converts this label into a member of an error-correcting code. The channel will corrupt this codeword by changing some of its bits. These corrupted bits correspond to errant binary classifiers. As with the source-channel model for machine translation, the encoding and channel are purely conceptual: the classification step occurs in decoding, when the correct label is (hopefully) recovered by comparing the received bitvector to all codewords, and taking the label corresponding to the closest codeword.

We emphasize that the source-channel framework is only a restatement of a problem, not a solution: our goal is not to frame a problem in a new way, but to show that doing so leads to practical and high-performance algorithms.