Information Processing and Learning

10-704, Fall 2016

Aarti Singh

Teaching Assistant: Shashank Singh
Class Assistant: Sandra Winkler


Project Instructions:

There will be four components to the project:
  • Project proposal (10%) - 1 page, due Oct 5
  • Midterm report (25%) - 4 pages, due Nov 7
  • Class presentation (25%) - 10-15 mins, Dec 5 and 7
  • Final report (40%) - 8 pages, due Dec 4
All project reports will be in NIPS format.

The class project is an opportunity for you to explore (existing or new) connections of information theory or signal processing to some machine learning task or application that is not discussed in class. If you survey existing literature, please provide a detailed and clear summary of what is known, and identify possible new directions. Projects will be judged on scope, novelty, and clarity of description. Explorations that lead to interesting, albeit negative, results are acceptable. All projects must involve information theory or signal processing ideas along with a machine learning aspect. Projects that connect the fundamentals of these areas to your current research are encouraged. Please consult the TA or instructor if you have any questions.

Ideas:

Possible ideas for projects include:
  • Online learning - We will see in class that, apart from probabilistic coding of a source, universal coding that compresses a sequence without assuming it comes from a probabilistic model is often desirable and more fundamental. Universal prediction is closely related to universal coding; see, for example, the paper by Merhav and Feder. Online learning is a machine learning paradigm in which universal prediction is achieved without placing any probabilistic assumptions on the data source. Investigate the information-theoretic foundations of online learning and the connection of some online learning algorithms to universal coding (a small exponential-weights sketch appears after this list).

  • Clustering - There are several connections between clustering algorithms and information theory: e.g., k-means is closely related to data compression by vector quantization, and ML estimation in mixture models via the EM algorithm is closely related to maximum entropy modeling. Summarize these connections in detail. There is also recent work on selecting models and the number of clusters using information-theoretic principles, e.g., see the paper by Buhmann. Explore the literature or new connections (a Lloyd's-algorithm sketch appears after this list).

  • Dimensionality Reduction - Independent Component Analysis (ICA) tries to find a linear representation of non-Gaussian data whose components are statistically independent. Since independence corresponds to zero mutual information, ICA algorithms are often based on mutual information, e.g., see the papers by Bell-Sejnowski and Cardoso. Explore this connection (a short FastICA example appears after this list).

  • Boosting and Maximum Entropy - The connection between minimizing the exponential loss used by AdaBoost and maximum likelihood for exponential models was shown by Lebanon & Lafferty. Also, Kivinen & Warmuth provide a view of AdaBoost as entropy projection. Summarize the information-theoretic connections of boosting (a minimal AdaBoost sketch appears after this list).

  • Active learning - The information content of the outcome of a random experiment equals the minimum number of binary queries needed to describe the outcome. If the nature of the binary queries is restricted (e.g., in the ship example in class, you can only ask whether the ship is in a particular cell, not whether it is in the first 32 cells), then you might need more queries. Active learning is a machine learning paradigm that aims to minimize the number of training samples or queries needed to learn a target function. Often queries are selected based on information gain or maximal uncertainty; e.g., see MacKay's paper or the active learning tutorial. Is there a principled way to think about the nature of the queries and the hypothesis class to be learned that characterizes the benefits of active learning? Since queries are selected sequentially based on feedback from prior queries, active learning gains are closely related to the capacity of a channel with feedback; see, for example, the paper by Horstein - for other references, see the instructor. Explore this connection. Also, lower bounds for active sensing (as for most other learning paradigms) are based on information theory. Here is a simple paper to read by Arias-Castro, Candes & Davenport. (A query-counting sketch of the ship example appears after this list.)

  • Estimating Entropy - While initial results in information theory assumed that the source distribution was known (since it was often designed), estimating entropy efficiently from finite samples, particularly for high-dimensional variables, has only recently received attention from the research community. See the paper by Sricharan-Hero and also the work of Barnabas Poczos. Summarize what is known and what the remaining challenges are. A question to ask is: are there "high-dimensional" assumptions (e.g., sparsity) under which estimating entropy is more tractable? (A k-nearest-neighbor entropy estimator sketch appears after this list.)

  • Coding Theory and Inference in graphical models - Some of the best codes, such as LDPC codes and Turbo codes, are graph codes, and their decoding is closely tied to inference in graphical models. See, for example, Montanari-Urbanke and Wainwright-Jordan. We will cover this briefly in class, but the scope of the connection is much broader, especially for inference in high-dimensional settings, where it has been exploited for Compressed sensing (also see this talk) and Matrix Completion. For an information-theoretic treatment of Matrix Completion, also see the paper by Vishwanath. (A tiny decoding-as-inference sketch appears after this list.)

  • (Conditional) Independence Testing - One approach to testing whether two random variables are independent is to estimate the mutual information and build a confidence interval around this estimate (see this paper for some details). There are many other approaches to independence testing (see Arthur Gretton's project page and related work in his papers), but it is not clear when these tests work well or when one test is better than another. An empirical project could consist of a thorough comparison of these testing procedures. A more theoretical project could analyze the power of these test statistics under various alternatives. Another project could consist of an empirical or theoretical exploration of conditional independence testing (Bergsma's paper and this paper might be good starting points). (A permutation-test sketch based on plug-in mutual information appears after this list.)

  • Structure Learning in Graphical Models via Submodularity - Recent work has exploited the submodularity properties of entropy and mutual information for learning the structure of graphical models (see this paper by Chechetka and Guestrin or this paper by Narasimhan and Bilmes). These works have focused on learning specific forms of graphical models (trees, bounded tree-width, etc.). Are there ways to extend this to more general settings? A theoretical or empirical exploration of these papers and related ideas could make for an interesting project. (A Chow-Liu baseline sketch appears after this list.)
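
Illustrative code sketches:

The sketches below are minimal Python examples written to make the connections above concrete. They are not course-provided code; all function names, parameter choices, and synthetic data in them are illustrative assumptions.

Online learning: a minimal sketch (assuming numpy) of the exponential-weights / Hedge algorithm for prediction with expert advice; its mixture-over-experts weight update closely parallels the Bayes mixture used in universal coding.

    import numpy as np

    def hedge(expert_losses, eta):
        """Run Hedge on a (T, N) array of per-round expert losses in [0, 1]."""
        T, N = expert_losses.shape
        log_w = np.zeros(N)                       # log-weights, start uniform
        total = 0.0
        for t in range(T):
            p = np.exp(log_w - log_w.max())
            p /= p.sum()                          # current mixture over experts
            total += p @ expert_losses[t]         # learner's expected loss this round
            log_w -= eta * expert_losses[t]       # exponential (multiplicative) update
        return total

    rng = np.random.default_rng(0)
    losses = rng.random((1000, 10))               # synthetic losses for 10 experts
    eta = np.sqrt(8 * np.log(10) / 1000)          # standard tuning of the learning rate
    print("learner loss:", hedge(losses, eta))
    print("best expert: ", losses.sum(axis=0).min())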
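
Clustering as vector quantization: a minimal sketch (assuming numpy) of Lloyd's algorithm, which is simultaneously k-means and the classical design procedure for a vector quantizer; the quantity it decreases is the mean squared distortion of coding each point by its nearest codeword.

    import numpy as np

    def lloyd(X, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        codebook = X[rng.choice(len(X), k, replace=False)]         # initial codewords
        for _ in range(iters):
            d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
            assign = d2.argmin(axis=1)                             # quantization step
            for j in range(k):                                     # codebook update step
                if np.any(assign == j):
                    codebook[j] = X[assign == j].mean(axis=0)
        d2 = ((X[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        return codebook, d2.argmin(axis=1), d2.min(axis=1).mean()

    X = np.random.default_rng(1).normal(size=(500, 2))
    codebook, assign, distortion = lloyd(X, k=4)
    print("mean squared distortion:", distortion)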
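
ICA: a minimal sketch (assuming scikit-learn is available) that unmixes two independent non-Gaussian sources from linear mixtures with FastICA, whose contrast function serves as a proxy for negentropy / mutual information; the sources and mixing matrix are arbitrary.

    import numpy as np
    from sklearn.decomposition import FastICA

    rng = np.random.default_rng(0)
    t = np.linspace(0, 8, 2000)
    S = np.c_[np.sign(np.sin(3 * t)), rng.laplace(size=t.size)]   # two independent sources
    A = np.array([[1.0, 0.5], [0.6, 1.0]])                        # mixing matrix
    X = S @ A.T                                                    # observed mixtures

    S_hat = FastICA(n_components=2, random_state=0).fit_transform(X)
    # up to permutation and scale, each recovered column should track one source
    corr = np.corrcoef(S.T, S_hat.T)[:2, 2:]
    print(np.round(np.abs(corr), 2))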
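
Boosting: a minimal sketch (assuming numpy) of AdaBoost with threshold stumps on 1-D data; the multiplicative update w_i <- w_i * exp(-alpha * y_i * h(x_i)) followed by renormalization is the exponential-loss / entropy-projection view mentioned above. The toy target and number of rounds are arbitrary.

    import numpy as np

    def best_stump(x, y, w):
        """Return (threshold, sign, weighted error) of the best decision stump."""
        best = (0.0, 1, np.inf)
        for thr in np.unique(x):
            for s in (1, -1):
                pred = s * np.where(x >= thr, 1, -1)
                err = w[pred != y].sum()
                if err < best[2]:
                    best = (thr, s, err)
        return best

    def adaboost(x, y, rounds=30):
        w = np.full(len(x), 1.0 / len(x))
        stumps = []
        for _ in range(rounds):
            thr, s, err = best_stump(x, y, w)
            err = np.clip(err, 1e-10, 1 - 1e-10)
            alpha = 0.5 * np.log((1 - err) / err)
            pred = s * np.where(x >= thr, 1, -1)
            w *= np.exp(-alpha * y * pred)        # exponential reweighting
            w /= w.sum()                          # renormalize (entropy projection view)
            stumps.append((alpha, thr, s))
        return stumps

    rng = np.random.default_rng(0)
    x = rng.uniform(-1, 1, 200)
    y = np.where(np.abs(x) > 0.5, 1, -1)          # a target no single stump can fit
    F = sum(a * s * np.where(x >= thr, 1, -1) for a, thr, s in adaboost(x, y))
    print("training accuracy:", np.mean(np.sign(F) == y))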
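
Active learning / restricted queries: a minimal sketch (pure Python) of the ship query game from class: with arbitrary binary queries, bisection locates the ship in ceil(log2(n)) questions, matching the entropy of a uniform location, while single-cell queries need on the order of n questions in the worst case. The board size is arbitrary.

    import math

    def unrestricted_queries(n, ship):
        """Bisection: each query asks whether the ship lies in a chosen half."""
        lo, hi, q = 0, n, 0
        while hi - lo > 1:
            mid = (lo + hi) // 2
            q += 1
            if ship < mid:
                hi = mid
            else:
                lo = mid
        return q

    def restricted_queries(n, ship):
        """Only 'is the ship in cell i?' is allowed: scan cell by cell."""
        for q, cell in enumerate(range(n), start=1):
            if cell == ship:
                return q

    n = 64
    print("unrestricted worst case:", max(unrestricted_queries(n, s) for s in range(n)),
          "= ceil(log2 n) =", math.ceil(math.log2(n)))
    print("restricted worst case:  ", max(restricted_queries(n, s) for s in range(n)))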
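
Entropy estimation: a minimal sketch (assuming numpy and scipy) of the Kozachenko-Leonenko k-nearest-neighbor estimator of differential entropy, checked against the closed form for a standard Gaussian; the sample size, dimension, and k are arbitrary.

    import numpy as np
    from scipy.spatial import cKDTree
    from scipy.special import digamma, gammaln

    def kl_entropy(X, k=3):
        """Kozachenko-Leonenko estimate of differential entropy, in nats."""
        n, d = X.shape
        eps = cKDTree(X).query(X, k=k + 1)[0][:, -1]            # distance to k-th neighbor
        log_vd = (d / 2) * np.log(np.pi) - gammaln(d / 2 + 1)   # log volume of the unit ball
        return digamma(n) - digamma(k) + log_vd + d * np.mean(np.log(eps))

    rng = np.random.default_rng(0)
    d = 3
    X = rng.normal(size=(5000, d))
    print("estimate:", kl_entropy(X))
    print("truth:   ", 0.5 * d * np.log(2 * np.pi * np.e))      # entropy of N(0, I_d)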
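
Decoding as inference: a minimal sketch (assuming numpy) for a single-parity-check code over a binary symmetric channel. The posterior bit marginals are computed two ways: by brute force over codewords and by sum-product message passing on the code's (tree-structured) factor graph. On a tree the two agree exactly, which is the decoding-as-inference connection in miniature. The crossover probability and received word are arbitrary.

    import itertools
    import numpy as np

    p = 0.1                                        # BSC crossover probability
    y = np.array([0, 1, 1])                        # received word
    lik = np.array([[(1 - p) if b == y[i] else p for b in (0, 1)] for i in range(3)])

    # brute force: sum over the 4 codewords satisfying x0 ^ x1 ^ x2 = 0
    post = np.zeros((3, 2))
    for c in itertools.product([0, 1], repeat=3):
        if sum(c) % 2 == 0:
            w = np.prod([lik[i, b] for i, b in enumerate(c)])
            for i, b in enumerate(c):
                post[i, b] += w
    post /= post.sum(axis=1, keepdims=True)

    # sum-product: the check-to-variable message sums over the other two
    # bits with even parity; beliefs multiply it by the channel likelihood
    bp = np.zeros((3, 2))
    for i in range(3):
        j, k = [a for a in range(3) if a != i]
        for b in (0, 1):
            s = sum(lik[j, bj] * lik[k, bk]
                    for bj in (0, 1) for bk in (0, 1) if (b + bj + bk) % 2 == 0)
            bp[i, b] = lik[i, b] * s
    bp /= bp.sum(axis=1, keepdims=True)

    print(np.allclose(post, bp))
    print(np.round(post, 3))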
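
Independence testing: a minimal sketch (assuming numpy) of a permutation test built on plug-in mutual information over a fixed binning of the data; the bin count, sample size, and the dependent-but-uncorrelated example are arbitrary choices.

    import numpy as np

    def plugin_mi(x, y, bins=8):
        """Plug-in mutual information of a 2-D histogram, in nats."""
        pxy = np.histogram2d(x, y, bins=bins)[0]
        pxy /= pxy.sum()
        px, py = pxy.sum(axis=1), pxy.sum(axis=0)
        nz = pxy > 0
        return np.sum(pxy[nz] * np.log(pxy[nz] / np.outer(px, py)[nz]))

    def perm_test(x, y, n_perm=500, seed=0):
        """Return the MI statistic and a permutation p-value."""
        rng = np.random.default_rng(seed)
        stat = plugin_mi(x, y)
        null = np.array([plugin_mi(x, rng.permutation(y)) for _ in range(n_perm)])
        return stat, np.mean(null >= stat)

    rng = np.random.default_rng(1)
    x = rng.normal(size=500)
    print("dependent:  ", perm_test(x, x ** 2 + 0.3 * rng.normal(size=500)))
    print("independent:", perm_test(x, rng.normal(size=500)))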
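
Structure learning: a minimal sketch (assuming numpy and scipy) of the classical Chow-Liu procedure, a tree-structured baseline related to the submodularity-based methods above: compute pairwise empirical mutual information and take a maximum-weight spanning tree. The binary chain used to generate data is an arbitrary illustration.

    import numpy as np
    from scipy.sparse.csgraph import minimum_spanning_tree

    def pairwise_mi(X):
        """Empirical MI between all pairs of binary columns of X."""
        d = X.shape[1]
        mi = np.zeros((d, d))
        for i in range(d):
            for j in range(i + 1, d):
                pij = np.array([[np.mean((X[:, i] == a) & (X[:, j] == b))
                                 for b in (0, 1)] for a in (0, 1)]) + 1e-12
                pi, pj = pij.sum(axis=1), pij.sum(axis=0)
                mi[i, j] = mi[j, i] = np.sum(pij * np.log(pij / np.outer(pi, pj)))
        return mi

    # synthetic Markov chain X0 - X1 - X2 - X3: each bit copies its parent w.p. 0.9
    rng = np.random.default_rng(0)
    X = np.zeros((4000, 4), dtype=int)
    X[:, 0] = rng.random(4000) < 0.5
    for j in range(1, 4):
        flip = rng.random(4000) < 0.1
        X[:, j] = np.where(flip, 1 - X[:, j - 1], X[:, j - 1])

    tree = minimum_spanning_tree(-pairwise_mi(X))   # max-MI tree = MST of negated MI
    edges = sorted((int(i), int(j)) for i, j in zip(*tree.nonzero()))
    print("recovered edges:", edges)                # expect the chain (0,1), (1,2), (2,3)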