Undirected graphical models compactly represent the structure of large, high-dimensional data sets, which are especially important in science. Some data sets may run to multiple terabytes, and current methods are intractable in both memory size and running time. We introduce a new, highly scalable optimization algorithm (that grew out of the work on my Data Analysis Project) to estimate a sparse inverse covariance matrix based on a regularized pseudolikelihood framework. Our parallel proximal gradient method runs across a multi-node cluster and achieves parallel scalability using a novel communication-avoiding linear algebra algorithm. We demonstrate scalability on problems with 1.28 million dimensions (over 800 billion parameters), and show that it can outperform a previous method on a single node. We use our method to estimate the underlying conditional dependency structure of the brain from fMRI data, and use the result to automatically identify functional regions. The results show good agreement with a state-of-the-art clustering from the neuroscience literature.
Deep learning techniques have had phenomenal success in computer vision, speech, and natural language processing, but their applications in information retrieval have only begun to show promising results. In this talk, I will present our recent work in neural information retrieval: a kernel-based neural ranking model (K-NRM) and its convolutional version (Conv-KNRM). K-NRM leverages large-scale user feedback from search logs to train IR-customized word embeddings, which behave very differently from word2vec embeddings and better reflect the relevance signals between query and document. At the same time, K-NRM also embodies the core intuitions of standard IR approaches, for example, translation models, frequency-based ranking signals, learning to rank, and soft matches. These IR customizations are necessary for neural methods' effectiveness in search engines: K-NRM and Conv-KNRM were the first to outperform feature-based machine learning models on search logs from Sogou (Chinese) and Bing (English), as well as on the TREC benchmarks.
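The kernel-based soft matching described above can be sketched roughly as follows. This is a minimal illustration of kernel pooling over a query-document similarity matrix, not the authors' implementation: the function name, the kernel means `mus`, the width `sigma`, and the toy similarity values are all assumptions, and in the real model the similarities would come from learned embeddings.

```python
import math

def kernel_pooling(sim_matrix, mus, sigma=0.1):
    """Turn a (query words x document words) similarity matrix into one feature per kernel.

    Each RBF kernel softly counts document words whose similarity to a query
    word is near the kernel mean mu; the log of that soft count is summed over query words.
    """
    features = []
    for mu in mus:
        total = 0.0
        for row in sim_matrix:  # one row per query word
            soft_count = sum(math.exp(-((s - mu) ** 2) / (2 * sigma ** 2)) for s in row)
            total += math.log(max(soft_count, 1e-10))  # clamp to avoid log(0)
        features.append(total)
    return features

# Toy similarities (hypothetical): 2 query words, 3 document words.
toy_sims = [[1.0, 0.4, 0.0],
            [0.5, 0.5, 0.1]]
toy_features = kernel_pooling(toy_sims, mus=[1.0, 0.5, 0.0])
```

A kernel centered at 1.0 behaves like an exact-match counter, while kernels centered at lower similarities pick up soft matches; a ranking layer would then combine the resulting features into a score.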
Machine learning conferences such as NIPS are growing at an exponential rate. There is thus an urgent need to design peer-review schemes that guarantee high accuracy at scale. With this goal in mind, this talk will present a posthoc analysis of the NIPS 2016 review data. The analysis reveals some surprising observations, actionable items, and many open problems.
We study large-scale multi-label classification (MLC) on two recently released datasets, YouTube-8M and Open Images, which contain millions of data instances and thousands of classes. The unprecedented problem scale poses great challenges for MLC. First, finding the correct label subset among exponentially many choices incurs substantial ambiguity and uncertainty. Second, the large data size and class size entail considerable computational cost. To address the first challenge, we investigate two strategies: capturing label correlations from the training data and incorporating label co-occurrence relations obtained from external knowledge, which effectively eliminate semantically inconsistent labels and provide contextual clues to differentiate visually ambiguous labels. Specifically, we propose a Deep Determinantal Point Process (DDPP) model which seamlessly integrates a DPP with deep neural networks (DNNs) and supports end-to-end multi-label learning and deep representation learning. The DPP is able to capture label correlations of any order with a polynomial computational cost, while the DNNs learn hierarchical features of images/videos and capture the dependency between input data and labels. To incorporate external knowledge about label co-occurrence relations, we impose relational regularization over the kernel matrix in DDPP. To address the second challenge, we study an efficient low-rank kernel learning algorithm based on inducing point methods. Experiments on the two datasets demonstrate the efficacy and efficiency of the proposed methods.
We consider the problem of estimating a function defined over n locations on a d-dimensional grid. When the function is constrained to have discrete total variation bounded by C_n, we derive the minimax optimal \ell_2 estimation error rate, parametrized by n and C_n. Total variation denoising, also known as the fused lasso, is seen to be rate optimal. Linear smoothers are shown to be suboptimal, extending fundamental findings of Donoho and Johnstone [1998] to higher dimensions. We also derive minimax rates for discrete Sobolev spaces, which are smaller than the total variation function spaces. Indeed, these spaces are small enough that linear estimators can be optimal. We define two higher-order TV classes and derive lower bounds on their minimax errors. We also analyze two naturally associated trend filtering methods; when d=2, each is seen to be rate optimal over the appropriate class.
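For concreteness, the function class and the estimator named above can be written in the following standard form. The notation is an assumption introduced here, not taken from the abstract: y is the vector of noisy observations at the n grid locations, D is the edge-difference operator of the grid graph, and \lambda is a tuning parameter.

```latex
% Discrete total variation class with radius C_n
\mathcal{T}_d(C_n) = \bigl\{ \theta \in \mathbb{R}^n : \|D\theta\|_1 \le C_n \bigr\}

% Total variation denoising (fused lasso) estimator
\hat{\theta} = \operatorname*{argmin}_{\theta \in \mathbb{R}^n}
  \; \tfrac{1}{2}\,\|y - \theta\|_2^2 + \lambda\,\|D\theta\|_1
```

A linear smoother, by contrast, has the form \hat{\theta} = S y for a matrix S that does not depend on y.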
While Bayesian methods are praised for their ability to incorporate useful prior knowledge, in practice, convenient priors that allow for computationally cheap or tractable inference are commonly used. In this work, we investigate the following question: for a given model, is it possible to compute an inference result with any convenient false prior, and afterwards, given any target prior of interest, quickly transform this result into the target posterior? A potential solution is to use importance sampling (IS). However, we demonstrate that IS will fail for many choices of the target prior, depending on its parametric form and similarity to the false prior. Instead, we propose prior swapping, a method that leverages the pre-inferred false posterior to efficiently generate accurate posterior samples under arbitrary target priors. Prior swapping lets us apply less-costly inference algorithms to certain models, and incorporate new or updated prior information “post-inference”. We give theoretical guarantees about our method, and demonstrate it empirically on a number of models and priors.
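The importance sampling baseline mentioned above can be sketched as follows. This is not the prior swapping method itself, only the reweighting it is compared against. The toy conjugate Gaussian model, the function name, and the sample size are assumptions chosen so that the exact target posterior is known.

```python
import math
import random

def reweight_to_target_prior(samples, log_target_prior, log_false_prior):
    """Self-normalized importance weights pi(theta) / pi_f(theta) for false-posterior samples."""
    log_w = [log_target_prior(t) - log_false_prior(t) for t in samples]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)
    weights = [x / total for x in w]
    ess = 1.0 / sum(x * x for x in weights)  # effective sample size diagnostic
    return weights, ess

# Toy model (assumption): one observation x = 2 with likelihood N(x | theta, 1).
# False prior N(0, 10^2) gives the false posterior N(2 / 1.01, 1 / 1.01);
# target prior N(0, 1) gives the exact target posterior N(1, 0.5).
rng = random.Random(0)
samples = [rng.gauss(2.0 / 1.01, math.sqrt(1.0 / 1.01)) for _ in range(20000)]
weights, ess = reweight_to_target_prior(
    samples,
    log_target_prior=lambda t: -0.5 * t * t,         # log N(0, 1) up to a constant
    log_false_prior=lambda t: -0.5 * t * t / 100.0,  # log N(0, 100) up to a constant
)
posterior_mean = sum(w * t for w, t in zip(weights, samples))
```

In this benign case the weighted mean lands close to the exact value of 1. When the target prior puts its mass where the false posterior has few samples, the effective sample size collapses, which is the failure mode the abstract refers to.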
In this talk, we will explore some applications of machine learning (deep learning) outside of conventional homogeneous data sources like images and text. We begin by looking at the scenario where each input data instance is neither a fixed-dimensional vector nor a sequence with an ordering, but just a set of points. Elements of the set have no ordering, and each set can be of a different size. Such problems are widespread, ranging from estimation of population statistics, to point cloud classification, to audience expansion, to cosmology. By characterizing the permutation-invariant functions, which have a special structure, we design a deep network architecture that can operate on sets and can be deployed in a variety of scenarios, including both unsupervised and supervised learning tasks. Next, we consider using compressed data sources directly for deep learning. A particularly useful setting is video, where an hour-long raw 720p video would have a size of 222GB, compared to just under 1GB using an MPEG codec, without much loss of visual information. Traditional deep learning methods require the decompressed frames to perform tasks like action recognition, making them very expensive in both computation and space. To address these issues, we design a convolutional network that can be fed the compressed signal directly instead of decompressed frames. Finally, we explore the case of heterogeneous data sources. Specifically, we look at the task of question answering, where traditional methods infer answers either from a knowledge base (KB) or from raw text. Knowledge bases are useful in answering compositional questions, but are highly incomplete. Web text, by contrast, contains millions of facts that are absent from the KB, but in an unstructured form. We design a universal schema for natural language question answering that can support reasoning on the union of both structured KBs and unstructured text by aligning them in a common embedded space.
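The special structure of permutation-invariant set functions can be illustrated with a sum-decomposition: apply a per-element map, pool by summation, then apply a readout. The sketch below assumes that form. In an actual network the two maps would be learned, whereas here `phi` and `rho` are fixed, hypothetical choices that recover a population statistic (the variance).

```python
import math

def phi(x):
    """Per-element feature map (fixed here; a learned network in practice)."""
    return [x, x * x, 1.0]

def rho(pooled):
    """Set-level readout applied to the summed features; here it returns the variance."""
    total, total_sq, count = pooled
    mean = total / count
    return total_sq / count - mean * mean

def set_function(points):
    """f(X) = rho(sum over x in X of phi(x)); invariant to element order and valid for any set size."""
    feats = [phi(x) for x in points]
    pooled = [math.fsum(column) for column in zip(*feats)]  # fsum is order-independent
    return rho(pooled)

var_a = set_function([1.0, 2.0, 3.0, 4.0])
var_b = set_function([4.0, 1.0, 3.0, 2.0])  # same set, different order
```

Because the only interaction between elements is the sum, reordering the input cannot change the output, and sets of different sizes are handled without padding.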
We present a new model, Predictive State Recurrent Neural Networks (PSRNNs), for filtering and prediction in dynamical systems. PSRNNs draw on insights from both Recurrent Neural Networks (RNNs) and Predictive State Representations (PSRs), and inherit advantages from both types of models. Like many successful RNN architectures, PSRNNs use (potentially deeply composed) bilinear transfer functions to combine information from multiple sources. We show that such bilinear functions arise naturally from state updates in Bayes filters like PSRs, in which observations can be viewed as gating belief states. We also show that PSRNNs can be learned effectively by combining Backpropagation Through Time (BPTT) with an initialization derived from a statistically consistent learning algorithm for PSRs called two-stage regression (2SR). Finally, we show that PSRNNs can be factorized using tensor decomposition, reducing model size and suggesting interesting connections to existing multiplicative architectures such as LSTMs. We apply PSRNNs to four datasets and show that they outperform several popular alternative approaches to modeling dynamical systems in all cases.
When performing inference in a dynamic partially observable environment, a core operation is filtering: maintaining an estimate of the state of the environment and updating it given new observations. In this talk, I will present predictive state models, a class of filters for partially observable environments that enjoy several favorable qualities: They can represent controlled environments which can be affected by actions, they have a scalable and theoretically justified learning algorithm, and they admit a non-parametric representation that is suitable for non-linear dynamics.
I will start by setting up a framework for the uncontrolled setting, showing how to construct a filter with a consistent learning algorithm based on the generalized method of moments. Then, I will show how to extend this framework to the controlled setting, where an agent can affect the environment through actions.
I will show promising results for the proposed method in two settings: learning to predict observations in an environment controlled by an external agent, and learning to control the environment using reinforcement learning.
We present a simple nearest-neighbor (NN) approach that synthesizes high-frequency photorealistic images from an "incomplete" signal such as a low-resolution image, a surface normal map, or edges. Current state-of-the-art deep generative models designed for such conditional image synthesis lack two important things: (1) they are unable to generate a large set of diverse outputs, due to the mode collapse problem; and (2) they are not interpretable, making it difficult to control the synthesized output. We demonstrate that NN approaches potentially address such limitations, but suffer in accuracy on small datasets. We design a simple pipeline that combines the best of both worlds: the first stage uses a convolutional neural network (CNN) to map the input to an (overly smoothed) image, and the second stage uses a pixel-wise nearest-neighbor method to map the smoothed output to multiple high-quality, high-frequency outputs in a controllable manner. We demonstrate our approach for various input modalities, and for various domains ranging from human faces to cats-and-dogs to shoes and handbags.
The project page can be found here: http://www.cs.cmu.edu/~aayushb/pixelNN/
Probabilistic modeling is a powerful paradigm which has seen dramatic innovations in recent years. These innovations in approximate inference, mainly due to automatic differentiation and stochastic optimization, have made probabilistic modeling scalable and broadly applicable to many complex model classes. I start my talk by reviewing the dynamic skip-gram model (ICML 2017) as an example of this class. The model results from combining a probabilistic interpretation of word2vec with latent diffusion priors, and allows us to study the dynamics of word embeddings for text data that are associated with different time stamps. Our Bayesian approach allows us to share information across the time domain, and is robust even when the data at individual points in time is small. As a result, we can automatically detect words that change their meanings even in moderately-sized corpora. Yet, the model is Bayesian non-conjugate, and therefore we have to draw on modern variational inference methods to train it efficiently on large data. The second part of my talk is therefore devoted to advances in variational inference. Here, I will review our very recent perturbative black box variational inference algorithm (NIPS 2017), that uses variational perturbation theory of statistical physics to construct corrections to the standard variational lower bound. Last, I will demonstrate that simple stochastic gradient descent with a constant step size is a form of approximate Bayesian inference (JMLR and ICML 2016).
References: R. Bamler and S. Mandt. Dynamic Word Embeddings. ICML 2017. R. Bamler*, C. Zhang*, M. Opper, and S. Mandt. Perturbative Black Box Variational Inference. NIPS 2017. S. Mandt, M. Hoffman, and D. Blei. A Variational Analysis of Stochastic Gradient Algorithms. ICML 2016. S. Mandt, M. Hoffman, and D. Blei. Stochastic Gradient Descent as Approximate Bayesian Inference. JMLR 2017.
Several scientific and engineering problems can be cast as the optimisation of an expensive black box function. Bayesian optimisation (BO), a method which models this function as a sample from a Gaussian Process, is used quite successfully in several applications for such problems, e.g. model selection, computational astrophysics, drug discovery, materials science and online advertising. In this work, we study parallelised Bayesian optimisation, where we can simultaneously evaluate the function of interest at multiple points. We adopt the classical Thompson sampling (TS) algorithm and show that a straightforward application of TS in either synchronous or asynchronous parallel settings yields a surprisingly powerful result: making n evaluations distributed among M workers is essentially equivalent to performing n evaluations in sequence. Further, when there is variability in evaluation times, asynchronously parallel TS achieves asymptotically lower regret than both the synchronous and sequential versions. The proposed procedure is conceptually and computationally much simpler than existing work for parallel BO and outperforms baselines in synthetic and real experiments. I will conclude the talk with a discussion of some open challenges, both theoretical and practical, which are essential to scaling up TS for large scale and high dimensional optimisation problems.
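The asynchronous variant described above can be sketched as a small event simulation: whenever a worker finishes, its result updates the posterior, a fresh posterior sample is drawn, and the maximizer of that sample is dispatched to the free worker. This is a simplified stand-in rather than the method from the talk. It replaces the Gaussian Process with independent Gaussian posteriors over a finite candidate set, and the exponential evaluation times, function name, and parameters are all assumptions.

```python
import heapq
import math
import random

def async_parallel_ts(true_means, n_evals, n_workers, noise_sd=0.1, seed=0):
    rng = random.Random(seed)
    k = len(true_means)
    post_mean = [0.0] * k   # independent N(0, 1) prior on each candidate (assumption)
    post_prec = [1.0] * k
    noise_prec = 1.0 / noise_sd ** 2

    def choose():
        # Thompson sampling: draw one posterior sample per candidate, pick the argmax.
        draws = [rng.gauss(post_mean[i], 1.0 / math.sqrt(post_prec[i])) for i in range(k)]
        return max(range(k), key=draws.__getitem__)

    events = []  # min-heap of (finish_time, worker_id, candidate)
    dispatched = 0
    for worker in range(min(n_workers, n_evals)):
        heapq.heappush(events, (rng.expovariate(1.0), worker, choose()))
        dispatched += 1

    history = []
    while events:
        clock, worker, arm = heapq.heappop(events)
        y = rng.gauss(true_means[arm], noise_sd)  # noisy evaluation result
        new_prec = post_prec[arm] + noise_prec    # conjugate Gaussian update
        post_mean[arm] = (post_prec[arm] * post_mean[arm] + noise_prec * y) / new_prec
        post_prec[arm] = new_prec
        history.append((clock, arm, y))
        if dispatched < n_evals:
            # The freed worker starts a new evaluation immediately, without waiting for others.
            heapq.heappush(events, (clock + rng.expovariate(1.0), worker, choose()))
            dispatched += 1
    return history, post_mean

history, post_mean = async_parallel_ts([0.1, 0.5, 0.9], n_evals=30, n_workers=4)
```

A synchronous version would instead wait for all M workers to finish before drawing the next batch of M samples, which leaves fast workers idle when evaluation times vary.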
One characteristic that sets humans apart from modern learning-based computer vision algorithms is the ability to acquire knowledge about the world and use that knowledge to reason about the visual world. Humans can learn about the characteristics of objects and the relationships that occur between them to learn a large variety of visual concepts, often with few examples. This paper investigates the use of structured prior knowledge in the form of knowledge graphs and shows that using this knowledge improves performance on image classification. We build on recent work on end-to-end learning on graphs, introducing the Graph Search Neural Network as a way of efficiently incorporating large knowledge graphs into a vision classification pipeline. We show in a number of experiments that our method outperforms standard neural network baselines for multi-label classification.
In everyday life, language derives its meaning not only from the semantics of its constituent words, but also from the environment in which it is spoken. In particular, awareness of the situational context of the environment (pragmatics) can improve computational models of language interpretation. In this talk, I will present two scenarios where contextual cues from the environment assist the process of semantic interpretation.
In the first half, we focus on semantic parsing of conversations. Most existing methods for semantic parsing have focused on interpreting single sentences at a time. However, understanding real-life conversations requires an understanding of pragmatics, discourse structures and conversational context. We formulate semantic parsing of conversations as a sequence prediction task, incorporating structural features that model the flow of discourse.
In the second half, we explore a framework where language interpretation is not the goal in itself, but is integrated in a real-world learning task. We consider the problem of concept learning from natural language explanations. For example, in learning the concept of a phishing email, one might say ‘this is a phishing email because it asks for your bank account number’. Solving this problem involves both learning to interpret open ended natural language explanations, and learning the concept itself. We present a joint method for (1) language interpretation (semantic parsing) and (2) concept learning (classification) that does not require labeling statements with semantic representations. Instead, the model prefers discriminative interpretations of statements as a weak signal for driving the learning of a semantic parser.
Machine learning (ML) has become one of the most powerful classes of tools for artificial intelligence, personalized web services and data science problems across fields. However, the use of ML on sensitive data sets involving medical, financial and behavioral data is greatly limited due to privacy concerns. In this talk, we consider the problem of statistical learning with privacy constraints. Under Vapnik's general learning setting and the formalism of differential privacy (DP), we establish simple conditions that characterize private learnability, which reveal a mixture of positive and negative insights. We then identify generic methods that reuse existing randomness to effectively solve private learning in practice, and discuss a weaker notion of privacy, On-Average KL-Privacy, that allows for an orders-of-magnitude more favorable privacy-utility tradeoff while preserving key properties of differential privacy. Moreover, we show that On-Average KL-Privacy is **equivalent** to generalization for a large class of commonly used tools in statistics and machine learning that sample from Gibbs distributions, a class of distributions that arises naturally from the maximum entropy principle. Finally, I will describe a few exciting future directions that use statistics and machine learning tools to advance the state of the art for privacy, and use privacy (and privacy-inspired techniques) to formally address the problem of p-hacking (or selective bias) in scientific discovery.
Note: this will be a practice job talk.
References: Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. "Learning with differential privacy: Stability, learnability and the sufficiency and necessity of ERM principle." *Journal of Machine Learning Research* 17.183 (2016): 1-40. Yu-Xiang Wang, Stephen E. Fienberg, and Alexander J. Smola. "Privacy for Free: Posterior Sampling and Stochastic Gradient Monte Carlo." *ICML*. 2015. Yu-Xiang Wang, Jing Lei, and Stephen E. Fienberg. "On-Average KL-Privacy and Its Equivalence to Generalization for Max-Entropy Mechanisms." *International Conference on Privacy in Statistical Databases*. Springer International Publishing, 2016.
As technological advancements facilitate the collection of datasets with increasing size and dimensionality, data analysis is becoming more and more challenging due to computational constraints. Motivated by the observation that real data tend to concentrate near regions of lower intrinsic dimension, subspace learning techniques like Principal Component Analysis (PCA) have become invaluable tools with applications ranging from noise reduction to visualization. Unfortunately, underlying linearity assumptions can often limit their effectiveness.
Thus, in this talk we propose Additive Component Analysis (ACA), a novel nonlinear extension of PCA that explicitly learns data manifolds as generalizations of subspaces. Inspired by multivariate nonparametric regression with additive models, ACA learns a smooth nonlinear mapping from a low-dimensional latent space to the input space, which trivially enables tasks like denoising. Furthermore, ACA can be used as a drop-in replacement in algorithms that use linear component analysis as a subroutine via the tangent space of the learned manifold. Unlike many other nonlinear dimensionality reduction techniques such as kernel PCA and Isomap, ACA can be efficiently applied to large datasets since it does not require computing pairwise similarities or storing training data during testing. Multiple ACA layers can also be composed and learned jointly with essentially the same training procedure for improved representational power, demonstrating the encouraging feasibility of nonparametric deep learning. We evaluate ACA as an alternative to PCA for geometric data analysis on a variety of datasets, showing improved robustness, reconstruction performance, and interpretability.
The associated paper will be presented at CVPR 2017 and can be found at www.calvinmurdock.com/aca.
There are two major programming paradigms for creating and training neural networks: static declaration, where we define the computation that we would like to perform at the beginning of training then feed data to this computation, and dynamic declaration, where we re-define computation for each data point we would like to process. The latter paradigm of dynamic neural networks has many advantages: it allows for simpler creation and debugging of networks in the native semantics of standard programming languages such as C++ or Python. This is particularly true for networks with complicated structure such as those encountered in natural language processing or complicated control tasks. However, dynamic declaration has the disadvantage that it introduces some overhead to the processing of every instance, and thus graph creation must be light-weight to the point that it doesn't significantly slow down training. In this talk, I will describe DyNet, a neural network toolkit that was designed with dynamic declaration in mind, and discuss some current and future research challenges in the creation of efficient dynamic neural networks.
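The idea of dynamic declaration can be illustrated without any toolkit: the computation is rebuilt for each data point by ordinary host-language control flow, so its shape follows the input. The sketch below is plain Python and does not use DyNet's API. The scalar "embeddings", the `weight` parameter, and the tree encoding are hypothetical simplifications.

```python
import math

def encode(tree, embeddings, weight):
    """Recursively encode a binary tree: leaves are words, internal nodes are (left, right) pairs.

    The recursion itself defines the computation, so each input tree
    yields a differently shaped sequence of operations.
    """
    if isinstance(tree, str):
        return embeddings[tree]
    left, right = tree
    combined = encode(left, embeddings, weight) + encode(right, embeddings, weight)
    return math.tanh(weight * combined)

embeddings = {"the": 0.1, "cat": 0.5, "sat": -0.3}  # toy scalar embeddings (assumption)
flat = encode(("cat", "sat"), embeddings, weight=1.0)
nested = encode((("the", "cat"), "sat"), embeddings, weight=1.0)
```

Under static declaration, handling both inputs would require expressing the recursion inside a fixed graph up front. Here the cost is that the structure is rebuilt per instance, which is the overhead the abstract says must be kept light.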
Sample-efficient exploration is a cornerstone of successful reinforcement learning in challenging real-world tasks. In particular, in high-stakes applications (e.g. in healthcare or education), not only good empirical performance but also theoretical sample-complexity guarantees that ensure the performance of algorithms a-priori are highly desirable. In this talk I will present UBEV, a simple and efficient reinforcement learning algorithm for fixed-horizon episodic Markov decision processes with finite state- and action-spaces. UBEV enjoys a sample-complexity bound that holds for all accuracy levels simultaneously with high probability, and matches the lower bound except for logarithmic terms and one factor of the horizon. I will illustrate that this new uniform type of sample-complexity bound provides more meaningful performance guarantees than the predominant PAC (Probably Approximately Correct) and regret frameworks. In fact, our uniform sample-complexity bound implies that our UBEV algorithm achieves near-optimal PAC and regret bounds simultaneously. Besides stronger theoretical guarantees, empirical comparisons show that UBEV is also practically superior to existing algorithms with known sample-complexity guarantees.
This presentation is based on our recent work available on arxiv: https://arxiv.org/abs/1703.07710
In this talk, I will present two recent ideas that can help solve large-scale optimization problems. In the first part, I will present a method for solving ell-1 penalized linear and logistic regression problems where data are distributed across many machines. In such a scenario it is computationally expensive to communicate information between machines. Our proposed method requires a small number of rounds of communication to achieve the optimal error bound. Within each round, every machine only communicates a local gradient to the central machine, and the central machine solves an ell-1 penalized shifted linear or logistic regression. In the second part, I will discuss the use of sketching as a way to solve linear and logistic regression problems with large sample sizes and many dimensions. This work is aimed at solving large-scale optimization problems on a single machine, while the extension to a distributed setting is work in progress.
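The communication pattern described above, in which each machine sends only a local gradient and the central machine handles the ell-1 penalty, can be sketched with a plain proximal gradient loop. This is a generic stand-in, not the method from the talk: it needs many more rounds, and the central step here is a soft-thresholding update rather than a full shifted ell-1 penalized regression. All names, the step size, and the toy data are assumptions.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |v|."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def local_gradient(X, y, w):
    """Gradient of the local least-squares loss (1 / 2n) * ||Xw - y||^2 on one machine."""
    n = len(X)
    grad = [0.0] * len(w)
    for row, target in zip(X, y):
        resid = sum(a * b for a, b in zip(row, w)) - target
        for j, a in enumerate(row):
            grad[j] += resid * a / n
    return grad

def distributed_lasso(shards, lam, step, n_rounds):
    """Each round: machines send local gradients, the center averages and soft-thresholds."""
    d = len(shards[0][0][0])
    w = [0.0] * d
    for _ in range(n_rounds):
        grads = [local_gradient(X, y, w) for X, y in shards]  # one message per machine
        # Plain averaging assumes equally sized shards.
        avg = [sum(g[j] for g in grads) / len(grads) for j in range(d)]
        w = [soft_threshold(wj - step * gj, step * lam) for wj, gj in zip(w, avg)]
    return w

# Toy problem (assumption): two machines, one feature, minimizer is 2 - lam = 1.5.
shards = [([[1.0], [1.0]], [2.0, 2.0]), ([[1.0], [1.0]], [2.0, 2.0])]
w_hat = distributed_lasso(shards, lam=0.5, step=0.5, n_rounds=100)
```

Each round costs one d-dimensional message per machine, which is why reducing the number of rounds matters when communication is the bottleneck.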
This presentation is based on our work available on arxiv: https://arxiv.org/abs/1605.07991 https://arxiv.org/abs/1610.03045
Dictionary learning with Bayesian nonparametric priors is a promising technique for sparse coding. In this talk, I will review a dictionary learning method using the beta process for nonparametric sparse coding called BPFA, and show an example application to a compressed sensing MRI problem. I then discuss two new directions: scaling inference to large data sets using a stochastic extension of a new EM algorithm for BPFA, and modeling greater structure within the data by extending BPFA to modeling subspaces. This new model, called BPSA, can be viewed as a blending of the Bayesian mixture of factor analyzers (MFA) and non-Bayesian independent subspace analysis (ISA) models.
Bio: John Paisley is an assistant professor in the Department of Electrical Engineering at Columbia University, where he is also a member of the Data Science Institute. He received the B.S. and Ph.D. degrees in Electrical and Computer Engineering from Duke University in 2004 and 2010. From 2010 to 2013 he was a postdoc in the Computer Science departments at Princeton University and UC Berkeley. His research focuses on Bayesian methods for machine learning, including Bayesian nonparametrics and variational inference techniques. He applies these to several problems in signal and information processing, including compressed sensing and topic modeling.
Detecting causal associations in time series datasets is a key challenge for novel insights into complex dynamical systems such as the Earth system or the human brain. Interactions in high-dimensional dynamical systems often involve time delays, nonlinearity, and strong autocorrelations. These present major challenges for causal discovery techniques such as Granger causality or the PC algorithm, leading to low detection power, biases, and unreliable hypothesis tests. Here we introduce a reliable and fast method that outperforms current approaches in detection power and scales up to high-dimensional datasets. It overcomes detection biases, especially when strong autocorrelations are present, and allows ranking associations in large-scale analyses by their causal strength. We provide analytical results, evaluate our method in extensive numerical experiments, and illustrate its capabilities in a large-scale analysis of the global surface-pressure system, where we unravel spurious associations and find several potentially causal links that are difficult to detect with standard methods.
Computational approaches to drug discovery can reduce the time and cost associated with experimental assays and enable the screening of novel chemotypes. Structure-based drug design methods rely on scoring functions to rank and predict binding affinities and poses. The ever-expanding amount of protein-ligand binding and structural data enables the use of deep machine learning techniques for protein-ligand scoring.
We describe convolutional neural network (CNN) scoring functions that take as input a comprehensive 3D representation of a protein-ligand interaction. A CNN scoring function automatically learns the key features of protein-ligand interactions that correlate with binding. We train and optimize our CNN scoring functions to discriminate between correct and incorrect binding poses and known binders and non-binders. We find that our CNN scoring function outperforms the AutoDock Vina scoring function when ranking poses both for pose prediction and virtual screening.
The Restricted Boltzmann Machine (RBM) is a bipartite graphical model that is used as the building block in energy-based deep generative models. Due to its numerical stability and the quantifiability of its likelihood, the RBM is commonly used with Bernoulli units. Here, we consider an alternative member of the exponential family RBM with leaky rectified linear units, called the leaky RBM. We first study the joint and marginal distributions of the leaky RBM under different leakiness, which leads to an interesting interpretation of the leaky RBM model as a truncated Gaussian distribution. We then propose a simple yet efficient method for sampling from this model, where the basic idea is to anneal the leakiness rather than the energy, i.e., start from a fully Gaussian/linear unit and gradually decrease the leakiness over iterations. This serves as an alternative to annealing the temperature parameter and enables numerical estimation of the likelihood that is more efficient and far more accurate than the commonly used annealed importance sampling (AIS). We then show the generative power of the leaky RBM, which has not been studied before. Last, I will also discuss some unsolved problems we faced.
The advancement of AI technology will revolutionize our society. However, there is a dangerous cycle in the current state of AI development. There is a distinct lack of diversity of people in the AI community. This lack of diversity of people leads to a lack of diversity of thought. This results in technology biased for certain demographics, needs and values, accruing the benefits of AI to the few instead of the all. This lack of diversity of thought becomes a self-fulfilling prophecy by further discouraging diversity in the next generation of AI technologists and technologies.
In this talk, I will propose a humanistic-focused education approach to break the cycle. Studies have shown that women and other underrepresented groups are drawn to humanitarian applications of computing and are often alienated by the existing homogeneous culture. I will describe the Stanford Artificial Intelligence Laboratory’s Outreach Summer (SAILORS) program, which is a two-week summer camp teaching humanistic AI to rising 10th grade girls. SAILORS has been featured in Wired and its impact has been documented in a SIGCSE publication. I will then introduce the new AI4ALL foundation, which was inspired by SAILORS and which will facilitate the launch of similar programs at other academic institutions nationwide.
AI4ALL is working with Prof. Manuela Veloso on creating a humanistic AI educational program for low-income high school students at CMU. Come hear more about how you can be involved in this effort. With your help, we can take the first key steps in breaking the dangerous cycle.
Bio: Olga Russakovsky is currently a postdoctoral research fellow at Carnegie Mellon University and will be an Assistant Professor at Princeton University starting in July 2017. She completed her PhD in computer science at Stanford University in August 2015. Her research is in computer vision, closely integrated with machine learning and human-computer interaction. Her work was featured in the New York Times and MIT Technology Review; she is also a recipient of the PAMI Everingham Prize and the “100 Leading Global Thinkers” award. She is a co-founder and director of the Stanford AI Laboratory’s outreach camp SAILORS, and a co-founder and board member of the AI4ALL foundation which aims to bring a diversity of thought and voices to the education, research, development and policy making of AI.
In this presentation, we consider anytime linear prediction in the common machine learning setting where features are in groups that have costs. We achieve anytime (or interruptible) predictions by sequencing the computation of feature groups and reporting results using the computed features at interruption. We present our extension to Orthogonal Matching Pursuit (OMP) and Forward Regression (FR) to learn the sequencing greedily under this group setting with costs. Both of our algorithms can provably achieve near-optimal linear predictions at each budget when a feature group is chosen.
In addition, we present a further extension that can achieve uniformly near-optimal predictions at any cost B. At any cost B, our novel algorithm can approximate the optimal performance of cost B/4; we also prove that no anytime algorithm can approximate the optimal performance of cost B/4 with a cost less than B. Finally, we introduce extensions of our methods to generalized linear models.
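The greedy sequencing idea can be illustrated with a minimal sketch. It is not the speakers' algorithm: the selection score (squared gradient norm of a group divided by its cost), the function name `greedy_group_sequence`, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def greedy_group_sequence(X, y, groups, costs):
    """Order feature groups greedily by (squared gradient norm) / cost.

    groups: list of column-index lists; costs: one positive cost per group.
    The score is an assumed OMP-style criterion, not taken from the talk.
    """
    selected_cols, order = [], []
    remaining = set(range(len(groups)))
    residual = y.astype(float).copy()
    while remaining:
        def score(g):
            # Correlation of the group's features with the current residual, per unit cost.
            grad = X[:, groups[g]].T @ residual
            return float(grad @ grad) / costs[g]
        best = max(sorted(remaining), key=score)
        remaining.remove(best)
        order.append(best)
        selected_cols.extend(groups[best])
        # Refit least squares on all selected features and update the residual.
        w, *_ = np.linalg.lstsq(X[:, selected_cols], y, rcond=None)
        residual = y - X[:, selected_cols] @ w
    return order

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 6))
y = 3.0 * X[:, 0] + 0.1 * X[:, 4]   # group 0 is informative, group 2 weakly so
groups = [[0, 1], [2, 3], [4, 5]]
costs = [1.0, 1.0, 1.0]
order = greedy_group_sequence(X, y, groups, costs)
```

An anytime predictor would compute the groups in this order and, when interrupted, predict with the least-squares weights fitted on the groups computed so far.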
In this presentation we analyze functional regression problems where input covariates, and possibly output responses, are functions from a nonparametric function class. Such problems cover a large range of interesting applications including time-series prediction problems, and also more general tasks like parameter estimation.
We present two novel scalable nonparametric estimators: the Double-Basis Estimator (2BE) for function-to-real regression problems; and the Triple-Basis Estimator (3BE) for function-to-function regression problems. Both the 2BE and 3BE can scale to massive data-sets. We show an improvement of several orders of magnitude in terms of prediction speed and a reduction in error over previous estimators in various synthetic and real-world data-sets.
The Bayesian paradigm is attractive in big data settings such as text and image processing because it allows for the construction of rich and understandable models, along with propagation of inferential uncertainty to predictive uncertainty. However, most forms of modern Bayesian computation are unable to scale to these problems. To make Bayesian inference in these settings tractable, researchers typically either alter the models being fitted so that they lead to scalable algorithms, or develop novel algorithms with favorable complexity-theoretic properties in the standard setting. In this talk, we present a third approach: adapting Bayesian algorithms to the novel computational environments that big data is typically found in. We review and compare the characteristics of these environments, showing that different types of parallelism have different requirements and performance consequences. To motivate their importance, we show how differences in processing and IO performance, combined with parallelism, can in some cases be exploited to allow for computations to be performed with no cost. Finally, we demonstrate that in spite of the fact that many Bayesian methods such as Markov Chain Monte Carlo algorithms are inherently iterative, they can successfully be adapted to massively parallel architectures.
Despite progress in visual perception tasks such as image classification and detection, computers still struggle to understand the interdependency of objects in the scene as a whole, e.g., relations between objects or their attributes. Existing methods often ignore global context cues capturing the interactions among different object instances, and can only recognize a handful of types by exhaustively training individual detectors for all possible relationships. To capture such global interdependency, we propose a deep Variation-structured Reinforcement Learning (VRL) framework to sequentially discover object relationships and attributes in the whole image. First, a directed semantic action graph is built using language priors to provide a rich and compact representation of semantic correlations between object categories, predicates, and attributes. Next, we use a variation-structured traversal over the action graph to construct a small, adaptive action set for each step based on the current state and historical actions. In particular, an ambiguity-aware object mining scheme is used to resolve semantic ambiguity among object categories that the object detector fails to distinguish. We then make sequential predictions using a deep RL framework, incorporating global context cues and semantic embeddings of previously extracted phrases in the state vector. Our experiments on the Visual Relationship Detection (VRD) dataset and the large-scale Visual Genome dataset validate the superiority of VRL, which can achieve significantly better detection results on datasets involving thousands of relationship and attribute types.
In this talk, I will discuss the problem of recovering an incomplete m\times n matrix of rank r with columns arriving online over time. This is known as the problem of life-long matrix completion, and is widely applied to recommendation systems, computer vision, system identification, etc. The challenge is to design provable algorithms that tolerate a large amount of noise with small sample complexity. I will give algorithms achieving strong guarantees under two realistic noise models. Under bounded deterministic noise, an adversary can add any bounded yet unstructured noise to each column. For this problem, I will present an algorithm that returns a matrix with small error, with sample complexity almost as small as the best prior results in the noiseless case. For sparse random noise, where the corrupted columns are sparse and drawn randomly, I will show an algorithm that exactly recovers a \mu_0-incoherent matrix with probability at least 1-\delta and sample complexity as small as O(\mu_0rn\log (r/\delta)). This result advances the state of the art and matches the lower bound in the worst case. I will also discuss the scenario where the hidden matrix lies on a mixture of subspaces and show that the sample complexity can be even smaller. The proposed algorithms perform well experimentally on both synthetic and real-world datasets.
Data-driven approaches to modeling time-series are important in a variety of applications from market prediction in economics to the simulation of robotic systems. However, traditional supervised machine learning techniques designed for i.i.d. data often perform poorly on these sequential problems. We propose that time series and sequential prediction, whether for forecasting, filtering, or reinforcement learning, can be effectively achieved by directly training recurrent prediction procedures rather than building generative probabilistic models.
To this end, we introduce a new training algorithm for learned time-series models, Data as Demonstrator (DaD), that theoretically and empirically improves multi-step prediction performance on model classes such as recurrent neural networks, kernel regressors, and random forests. Additionally, experimental results indicate that DaD can accelerate model-based reinforcement learning for control tasks. We next show that latent-state time-series models, where a sufficient state parametrization may be unknown, can be learned effectively in a supervised way. Our approach, Predictive State Inference Machines (PSIMs), directly optimizes – through a DaD-style training procedure – the inference performance by identifying the recurrent hidden state as a predictive belief over statistics of future observations. Fundamental to our learning frameworks is that the prediction of observable quantities is a lingua franca for building AI systems.
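A rough sketch of a DaD-style training loop follows. It assumes a linear one-step model fitted by least squares and a toy damped-rotation system; the names `fit_linear`, `dad_train`, and `simulate` are hypothetical and the abstract does not specify these details. The core idea shown is that states predicted by rolling out the current model are paired with the true next states and added to the training set as corrections.

```python
import numpy as np

def fit_linear(inputs, targets):
    """Least-squares one-step model: next_state ≈ state @ A."""
    A, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return A

def dad_train(trajectories, n_iters=5):
    """trajectories: list of arrays of shape (T, d)."""
    # Start from the usual one-step supervised pairs (x_t -> x_{t+1}).
    inputs = [traj[:-1] for traj in trajectories]
    targets = [traj[1:] for traj in trajectories]
    A = fit_linear(np.vstack(inputs), np.vstack(targets))
    for _ in range(n_iters):
        for traj in trajectories:
            # Roll the current model forward from the true initial state.
            preds = [traj[0]]
            for _t in range(len(traj) - 2):
                preds.append(preds[-1] @ A)
            preds = np.array(preds)
            # Pair each predicted state with the true next state as a correction.
            inputs.append(preds[1:])
            targets.append(traj[2:])
        A = fit_linear(np.vstack(inputs), np.vstack(targets))
    return A

rng = np.random.default_rng(0)
A_true = 0.95 * np.array([[np.cos(0.3), -np.sin(0.3)],
                          [np.sin(0.3),  np.cos(0.3)]])

def simulate(x0, T=30):
    xs = [x0]
    for _ in range(T - 1):
        xs.append(xs[-1] @ A_true + 0.01 * rng.standard_normal(2))
    return np.array(xs)

trajectories = [simulate(rng.standard_normal(2)) for _ in range(5)]
A_hat = dad_train(trajectories)
```

Any supervised regressor could stand in for `fit_linear`; the linear model is used here only to keep the sketch short.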
Social network services generally allow users to post, forward, share, or “like” a piece of information; order or comment on a product or hotel; or “check in” at a place of interest. All of these behaviors can be recorded as temporal sequences of users and the times at which they act, which are called temporal cascades. A temporal cascade thus contains users’ actions on a specific piece of information, product, hotel, place, etc. Since such actions are publicly visible, or are deliberately pushed to related users or communities by the systems, users in a cascade influence each other, much like the contagion of behaviors. Building on this contagion phenomenon, researchers study viral marketing (for example, how to maximize influence) and how opinions form. Those studies and applications need to know the causality of who influences whom, as well as accurate values of the influences. A fundamental problem in those applications is that users' influences are difficult to quantify. We therefore propose a model that parameterizes every user with a latent influence vector and a latent susceptibility vector. Such low-dimensional, distributed representations naturally account for the dependencies among interpersonal influences and reduce the model complexity compared to previous models. We conduct extensive experiments on real Microblog data, showing that our model with distributed representations achieves better performance.
We focus mainly on the offline 1-dimensional multiple changepoint detection problem with Gaussian errors where we want to detect locations where the underlying mean changes. In this problem, we prove that any procedure with a fast enough \ell_2 error rate, in terms of its estimation of the underlying piecewise constant mean vector, automatically has an (approximate) changepoint screening property. Specifically, each true changepoint in the underlying mean vector has an estimated changepoint nearby. We also show, again assuming only knowledge of the \ell_2 error rate, that a simple post-processing step can eliminate spurious estimated changepoints, and thus deliver asymptotic bounds on the Hausdorff distance between the true and estimated set of changepoints. Specifically, in addition to the screening property described previously, we are assured that each estimated changepoint has a true changepoint nearby.
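To make the two properties concrete, here is a small sketch that computes the Hausdorff distance between a true and an estimated changepoint set and applies a simple filter to remove spurious estimates. The filter (compare local averages on either side of each candidate) and its `window` and `threshold` parameters are assumptions for illustration, not necessarily the post-processing step analyzed in the talk.

```python
import numpy as np

def directed_distance(a, b):
    """Largest distance from a point in a to its nearest point in b."""
    return max(min(abs(x - y) for y in b) for x in a)

def hausdorff(a, b):
    """Small only if every point of each set has a point of the other set nearby."""
    return max(directed_distance(a, b), directed_distance(b, a))

def filter_changepoints(y, candidates, window, threshold):
    """Keep a candidate only if local means on its two sides differ by more than threshold."""
    kept = []
    for c in candidates:
        left = y[max(0, c - window):c]
        right = y[c:c + window]
        if len(left) and len(right) and abs(right.mean() - left.mean()) > threshold:
            kept.append(c)
    return kept

true_cps = [100, 200]
est_cps = [98, 150, 203]   # 150 is a spurious estimate
mean = np.concatenate([np.zeros(100), 2 * np.ones(100), np.zeros(100)])
y = mean + 0.1 * np.random.default_rng(0).standard_normal(300)

before = hausdorff(true_cps, est_cps)
filtered = filter_changepoints(y, est_cps, window=20, threshold=1.0)
after = hausdorff(true_cps, filtered)
```

Screening alone bounds only the distance from each true changepoint to the estimates, which is 3 here; the spurious estimate at 150 is what makes the full Hausdorff distance large before filtering.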
As a special case, we focus on the application of our results whereby an improved \ell_2 error rate for the 1-dimensional fused lasso for a fixed number of changepoints can yield competitive bounds on the Hausdorff distance compared to other changepoint methods (i.e., binary segmentation, wild binary segmentation, SMUCE). If time remains, I will discuss our simulations and extensions to other variants of offline changepoint problems (changepoint detection on graphs, trend filtering).
Paper: https://arxiv.org/abs/1606.06746. This is joint work with James Sharpnack (UC Davis), Alessandro Rinaldo (CMU) and Ryan Tibshirani (CMU).
Halloween is a spooky time of year, and forensic science can be a spooky area of application. As a special Halloween treat, this week's machine learning lunch will feature statistical methods for analyzing two types of forensic evidence: bullet cartridges and shoe prints.
Bullet Cartridges: When a gun is fired, it leaves marks on cartridges that are believed to be unique to the gun. This means that cartridges collected from crime scenes can be compared to cartridge images stored in a database, to determine if they were fired from the same gun. In this talk, we will describe fully automated methods for comparing 2D cartridge images. We pre-process and register the images, and use correlation to measure the similarity between them. We quantify the uncertainty in making any statement of a match using a hypothesis test.
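As an illustration of correlation-based image comparison, the sketch below scores two images by their maximum normalized correlation over small integer translations. This is a crude stand-in for the registration step and is not the speakers' pipeline; the function names, the `max_shift` parameter, and the random test images are assumptions.

```python
import numpy as np

def normalized_correlation(a, b):
    """Pearson correlation between two equally sized images."""
    a = a - a.mean()
    b = b - b.mean()
    return float((a * b).sum() / np.sqrt((a * a).sum() * (b * b).sum()))

def best_shift_correlation(a, b, max_shift=3):
    """Maximum correlation over small integer shifts of b (wraps around at the borders)."""
    best = -1.0
    for dy in range(-max_shift, max_shift + 1):
        for dx in range(-max_shift, max_shift + 1):
            shifted = np.roll(b, (dy, dx), axis=(0, 1))
            best = max(best, normalized_correlation(a, shifted))
    return best

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
same_source = np.roll(image, (2, -1), axis=(0, 1))   # translated copy of the same image
other_source = rng.standard_normal((32, 32))         # unrelated image

score_same = best_shift_correlation(image, same_source)
score_other = best_shift_correlation(image, other_source)
```

A real registration would also handle rotation and avoid the wrap-around that `np.roll` introduces; a match decision would then compare the score against its distribution for known non-matches.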
Shoe prints: The President’s Council of Advisors on Science and Technology recently released a report which criticized the science behind latent footwear impression analysis based on accidentals (e.g. cuts, holes, and debris on a shoe’s sole which accumulate on a shoe through wear). In particular, the report acknowledged that the assumptions of existing methods are inappropriate and need to be empirically evaluated. Acting on these recommendations, we propose a new nonparametric Bayesian hierarchical model for the distribution of accidentals on shoe surfaces. This model features nonparametric mixtures of Bernstein polynomials hierarchically modelled using compound random measures. To train the model, we are using a database of 386 shoes collected and marked by the Israeli Police Department. Wearing of costumes to the talk is encouraged. Best costume gets a free lunch. Okay everyone gets a free lunch.
In many settings, predictions must be made over structured output spaces. Examples include both discrete structures such as sequences and clusterings, as well as continuous ones such as trajectories. The conventional machine learning approach to such "structured prediction" problems is to learn over a holistically pre-specified structured model class (e.g., via conditional random fields or structural SVMs). In this talk, I will discuss recent work along an alternative direction of using learning reductions, or "learning to optimize".
In learning to optimize, the goal is to reduce the structured prediction problem into a sequence of standard prediction problems that can be solved via conventional supervised learning. Such an approach is attractive because it can easily leverage powerful function classes such as random forests and deep neural nets. The main challenge lies in identifying a good learning reduction that is both principled and practical. I will discuss two projects in detail: contextual submodular optimization, and smooth online sequence prediction.
These days, computer vision and machine learning are strongly shaped by the availability of datasets. For example, we have seen great achievements in image classification and object detection by exploiting publicly available datasets such as ImageNet and PASCAL VOC. However, comparable datasets for understanding human interaction and non-verbal behavior are extremely rare. One of the major obstacles to building such a dataset is that measuring the non-verbal cues of multiple interacting people is itself challenging.
The Panoptic Studio is a system composed of more than 500 diverse sensors, specifically designed to measure subtle non-verbal signals of multiple interacting people. The system takes, as input, 480 synchronized video streams of multiple people engaged in social activities, and produces, as output, the time-varying 3D structure of anatomical body landmarks. Using this system, we have collected and processed various interesting scenes where multiple people are naturally interacting, and recently we have released the CMU Panoptic Studio Dataset. In this dataset, we publicly share 500+ video inputs, fully automatically reconstructed 3D body pose, and calibration data for all the sequences, with a toolbox for a quick start. In this talk, I will introduce our system, reconstruction algorithms, and the dataset with various potential applications.
Empirical risk minimization is a key tool in modern machine learning which is well understood when the underlying problem is well-behaved. Many emerging problems, however, are not so well-behaved and contain complicated nonconvex, nonsmooth, and unbounded structures. Examples include graphical models, neural networks, dictionary learning, and matrix factorization. For such problems, existing techniques are unsuitable and new principles are necessary to solve them. Using directed graphical models as a motivating example, this talk will explore the problem of regularized empirical risk minimization with nonconvex and nonsmooth parameter constraints as well as unbounded loss functions. We will focus on both computational and theoretical aspects of this problem in a high-dimensional setting, motivated by applications to computational biology and human genetics. The main result is an efficient algorithm for approximating directed graphs with millions of free parameters that is several orders of magnitude faster than existing approaches and comes with strong statistical guarantees. We will also discuss progress towards a general nonasymptotic framework for understanding the statistical behaviour of estimators defined via nonconvex empirical risk minimization.
Bio: Bryon recently joined CMU as a postdoc in the SAILING Lab after receiving his PhD in Statistics from UCLA. His research interests include graphical models, multi-task learning, and high-dimensional statistics with applications to computational biology, genomics, and causal inference.
We present an algorithm for estimating bounds on causal effects from observational data which combines graphical model search with simple linear regression. We assume that the underlying system can be represented by a linear structural equation model with no feedback, and we allow for the possibility of latent variables. Under assumptions standard in the causal search literature, we use conditional independence constraints or greedy search to learn an equivalence class of ancestral graph Markov models. Then, for each model in the equivalence class, we perform the appropriate regression (using causal structure information to guide covariate adjustment) to estimate a set of possible causal effects. Our approach is based on the "IDA" procedure of Maathuis et al. (2009), which assumes that all relevant variables have been measured (i.e., no unmeasured confounders). We generalize their work by relaxing this assumption, which is often violated in applied contexts. We validate the performance of our algorithm on simulated data and demonstrate improved precision over IDA when latent variables are present.
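The regression step can be sketched as follows. The example assumes the graph search has already produced a set of candidate adjustment sets (here hard-coded and hypothetical), simulates data from a small linear model, and reports one effect estimate per candidate set. It illustrates covariate adjustment in the spirit of IDA only; it does not handle latent variables as the algorithm in the talk does.

```python
import numpy as np

def adjusted_effect(data, x, y, adjust):
    """OLS coefficient on column x when regressing column y on x plus adjustment columns."""
    cols = [x] + list(adjust)
    design = np.column_stack([data[:, cols], np.ones(len(data))])
    coef, *_ = np.linalg.lstsq(design, data[:, y], rcond=None)
    return float(coef[0])

rng = np.random.default_rng(0)
n = 5000
z = rng.standard_normal(n)                        # common cause of x and y
x = 0.8 * z + rng.standard_normal(n)
y = 0.5 * x + 0.7 * z + rng.standard_normal(n)    # true effect of x on y is 0.5
data = np.column_stack([z, x, y])

# Hypothetical output of the search step: models in the equivalence class
# disagree on whether z must be adjusted for.
candidate_adjustment_sets = [[], [0]]
effects = [adjusted_effect(data, 1, 2, s) for s in candidate_adjustment_sets]
lower_bound = min(abs(e) for e in effects)
```

Adjusting for `z` recovers a value near 0.5, while the unadjusted regression is biased upward; reporting the whole set (or its minimum absolute value) reflects the uncertainty left by the equivalence class.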
Reference: M. H. Maathuis, M. Kalisch, and P. Buhlmann. Estimating high-dimensional intervention effects from observational data. The Annals of Statistics, 37(6A):3133–3164, 2009.
Paper: http://jmlr.org/proceedings/papers/v52/malinsky16.pdf
Bio: Daniel Malinsky is a PhD student in Logic, Computation, and Methodology at Carnegie Mellon University. He works on causal inference with graphical models, especially structure learning from time series data and applications in policy and health.
We explore architectures for general pixel-level prediction problems, from low-level edge detection to mid-level surface normal estimation to high-level semantic segmentation. Convolutional predictors, such as the fully-convolutional network (FCN), have achieved remarkable success by exploiting the spatial redundancy of neighboring pixels through convolutional processing. Though computationally efficient, we point out that such approaches are not statistically efficient during learning precisely because spatial redundancy limits the information learned from neighboring pixels. We demonstrate that (1) stratified sampling allows us to add diversity during batch updates and (2) sampled multi-scale features allow us to explore more nonlinear predictors (implemented through multiple fully-connected layers followed by ReLU) that improve overall accuracy. We demonstrate that our single architecture produces state-of-the-art results for semantic segmentation on PASCAL-Context, surface normal estimation on NYUD dataset, and edge detection on BSDS without contextual post-processing.
Given a stream of multimodal sensory data, an autonomous robot must continuously refine its understanding of itself and its environment as it makes decisions on how to act to achieve a goal. These are difficult problems that roboticists have attacked using classical tools from mechanics and controls and, more recently, machine learning. However, the classical methods and machine learning algorithms are often seen to be at odds, and researchers continue to debate the merits of engineering vs. learning.
A recurring theme in this talk will be that prior knowledge and domain insights can make learning and inference easier. I will discuss several fundamental robotics problems including continuous-time motion planning, localization, and mapping from a unified probabilistic inference perspective. I will show how models from statistical machine learning like Gaussian Processes can be tightly integrated with insights from engineering expressed as differential equations to solve these problems efficiently. Finally, I will demonstrate the effectiveness of these algorithms on several real-world robotics platforms.
Bio: Byron Boots is an assistant professor in the School of Interactive Computing and the Institute for Robotics and Intelligent Machines at Georgia Tech. Prior to joining Georgia Tech, Byron was a postdoctoral researcher working with Dieter Fox in the Robotics and State Estimation Lab at the University of Washington. He received his Ph.D. in Machine Learning from Carnegie Mellon in 2012 where he was advised by Geoff Gordon. Byron’s work on learning models of dynamical systems received the 2010 Best Paper award at ICML. His current research focuses on developing theory and systems that integrate perception, learning, and decision making.
When human annotators are given a choice about what to label in an image, they apply their own subjective judgments on what to ignore and what to mention. We refer to these noisy "human-centric" annotations as exhibiting human reporting bias. Examples of such annotations include image tags and keywords found on photo sharing sites, or in datasets containing image captions. In this paper, we use these noisy annotations for learning visually correct image classifiers. Such annotations do not use consistent vocabulary, and miss a significant amount of the information present in an image; however, we demonstrate that the noise in these annotations exhibits structure and can be modeled. We propose an algorithm to decouple the human reporting bias from the correct visually grounded labels. Our results are highly interpretable for reporting "what's in the image" versus "what's worth saying." We demonstrate the algorithm's efficacy along a variety of metrics and datasets, including MS COCO and Yahoo Flickr 100M. We show significant improvements over traditional algorithms for both image classification and image captioning, doubling the performance of existing methods in some cases.
We propose a Laplace approximation that creates a stochastic unit from any smooth monotonic activation function, using only Gaussian noise. This paper investigates the application of this stochastic approximation in training a family of Restricted Boltzmann Machines (RBM) that are closely linked to Bregman divergences. This family, which we call exponential family RBM (Exp-RBM), is a subset of the exponential family Harmoniums that expresses family members through a choice of smooth monotonic non-linearity for each neuron. Using contrastive divergence along with our Gaussian approximation, we show that Exp-RBM can learn useful representations using novel stochastic units.
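A minimal sketch of such a stochastic unit is given below. It assumes the Gaussian approximation has mean f(η) and variance f'(η) for activation f and input η; the abstract does not state this form, so treat it as an assumption. Softplus is used as the example activation, whose derivative is the logistic sigmoid.

```python
import numpy as np

def softplus(eta):
    return np.logaddexp(0.0, eta)

def sigmoid(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def sample_unit(eta, f, f_prime, rng):
    """Draw from N(f(eta), f'(eta)); this Gaussian form is an assumption of the sketch."""
    mean = f(eta)
    var = f_prime(eta)   # nonnegative because f is monotonically increasing
    return mean + np.sqrt(var) * rng.standard_normal(np.shape(eta))

rng = np.random.default_rng(0)
eta = np.full(100_000, 1.0)                 # same input for many independent units
samples = sample_unit(eta, softplus, sigmoid, rng)
```

Swapping in another smooth monotonic `f` with its derivative `f_prime` yields a different stochastic unit without changing the sampling code.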
How can we enable computers to process natural language in multilingual settings (think airports, social media, etc.)? Thanks to decades of natural language processing (NLP) research, we have been able to develop language analyzers in many languages, including languages with little or no training data. I will review some of the key developments in multilingual NLP research, with an emphasis on syntax.
Despite this progress, the mainstream approach to developing multilingual NLP models (for high-resource languages) has been to independently train one model for each language, which is unsatisfactory for practical as well as theoretical reasons. To that end, I will describe a general framework for training language-universal models, and how to instantiate it for dependency parsing.
To address several important issues in latent variable models (LVMs), such as effectively capturing long-tail patterns, shrinking model size without sacrificing modeling power, and improving interpretability, several studies have been devoted to "diversifying" LVMs, which aim to encourage the components in LVMs to be diverse. Most existing studies are performed in a frequentist-style regularization framework, where the components are learned via point estimation. In this work, we investigate how to "diversify" LVMs in another learning paradigm, Bayesian learning, which has advantages complementary to point estimation, such as alleviating overfitting via model averaging and quantifying uncertainty. We propose two approaches that have complementary advantages. One is to define diversity-promoting mutual angular priors, which assign larger density to components with larger mutual angles, based on a Bayesian network and the von Mises-Fisher distribution, and to use these priors to affect the posterior via Bayes' rule. We develop two efficient approximate posterior inference algorithms based on variational inference and Markov chain Monte Carlo sampling. The other approach is to impose diversity-inducing regularization directly over the post-data distribution of components. These two methods are applied to the Bayesian mixture of experts model to encourage the "experts" to be diverse, and experimental results demonstrate the effectiveness and efficiency of our methods.
Bio: Pengtao Xie is a PhD student in the Machine Learning Department at Carnegie Mellon University. His primary research focus is latent variable models and distributed machine learning systems. He received an M.E. from Tsinghua University in 2013 and a B.E. from Sichuan University in 2010. He is a recipient of the Siebel Scholarship, the Goldman Sachs Global Leader Scholarship and the National Scholarship of China.
We propose a novel high dimensional nonparametric model named ATLAS which is a generalization of the sparse additive model. The ATLAS model assumes the high dimensional regression function can be locally approximated by a sparse additive function, while such an approximation may change from the global perspective. We aim to estimate the high dimensional function using a novel kernel-sieve hybrid regression estimator that combines local kernel regression with the B-spline basis approximation. We establish the estimation rate of the true function in the supremum norm. We also propose two types of confidence bands for the true function. Both procedures proceed in two steps: (1) a novel bias correction method is applied to remove the shrinkage introduced by the model selection penalty and (2) quantiles of the normalized de-biased estimator are approximated by quantiles of the limiting distribution or a Gaussian multiplier bootstrap. We further show that the covering probability of the bootstrap confidence bands converges to the nominal one at a polynomial rate.
Joint work with Junwei Lu and Han Liu
In many animation projects, the animation artist typically spends significant time animating the face. This process involves many labor-intensive tasks that offer relatively little potential for creative expression. One particularly tedious task is speech animation: animating the face to match spoken audio. Indeed, the often prohibitive cost of speech animation has limited the types of animations that are feasible, including localization to different languages.
In this talk, I will show how to view speech animation through the lens of data-driven sequence prediction. In contrast to previous sequence prediction settings, visual speech animation is an instance of contextual spatiotemporal sequence prediction, where the output is continuous and high-dimensional (e.g., a configuration of the lower face), and also depends on an input context (e.g., audio or phonetic input).
I will present a decision tree framework for learning to generate context-dependent spatiotemporal sequences given training data. This approach enjoys several attractive properties, including ease of training, fast performance at test time, and the ability to robustly tolerate corrupted training data using a novel latent variable approach. I will showcase this approach in a case study on speech animation, where our approach outperforms several competitive baselines in both quantitative and qualitative evaluations, and also demonstrates strong robustness to corrupted training data.
This is joint work with Taehwan Kim, Sarah Taylor, Barry-John Theobald, and Iain Matthews.
Intelligent agents acting in the real world need advanced vision capabilities to perceive, learn from, reason about and interact with their environment. In this talk, I will explore the role that humans play in the design and deployment of computer vision systems. Large-scale manually labeled datasets have proven instrumental for scaling up visual recognition, but they come at a substantial human cost. I will first briefly talk about strategies for making optimal use of human annotation effort for computer vision progress. However, no dataset can foresee all the visual scenarios that a real-world system might encounter. I will argue that seamlessly integrating human expertise at runtime will become increasingly important for open-world computer vision. I will introduce both mathematical frameworks for human-machine collaboration and deep reinforcement learning models that open up new avenues for human-in-the-loop exploration.
Bio: Olga Russakovsky recently completed her PhD in computer science at Stanford and is now a postdoctoral fellow at Carnegie Mellon University. Her research is in computer vision, closely integrated with machine learning and human-computer interaction. Her work was featured in the New York Times and MIT Technology Review. She served as an Area Chair for WACV’16, led the ImageNet Large Scale Visual Recognition Challenge effort for two years, and organized multiple workshops and tutorials on large-scale recognition at premier computer vision conferences ICCV’13, ECCV’14, CVPR’15, ICCV’15 and CVPR’16. In addition, she founded and directs the Stanford AI Laboratory’s outreach camp SAILORS (featured in Wired and published in SIGCSE’16) designed to expose high school students in underrepresented populations to the field of AI.
Infection and diffusion processes over networks arise in many domains. These introduce many challenging prediction tasks, such as influence estimation, trend prediction, and epidemic source localization. The standard approach to such problems is generative: assume an underlying infection model, learn its parameters, and infer the required output. In order to learn efficiently, the chosen infection models are often simple, and learning is focused on inferring the parameters of the model rather than on optimizing prediction accuracy. Here we argue that for prediction tasks, a discriminative approach is more adequate.
We introduce DIMPLE, a novel discriminative learning framework for training classifiers based on dynamic infection models. We show how highly non-linear predictors based on infection models can be "linearized" by considering a larger class of prediction functions. Efficient learning over this class is performed by constructing "infection kernels" based on the outputs of infection models, which can be plugged into any kernel-supporting framework. DIMPLE can be applied to virtually any infection-related prediction task and any infection model for which the desired output can be calculated or simulated. For influence estimation in well-known infection models, we show that the kernel can either be computed in closed form, or reduces to estimating co-influence of seed pairs.
We apply DIMPLE to the tasks of influence estimation on synthetic and real data from Digg, and to predicting customer network value in Polly, a viral phone-based development-related service deployed in low-literate communities. Our results show that DIMPLE outperforms strong baselines.
Computer vision is currently undergoing a period of rapid progress, brought in part through the integration of machine-learning techniques with big training datasets. This talk will attempt to examine some of the modeling insights behind this progress, as well as open challenges that remain. A well-known but under-appreciated observation is that visual phenomena follow a long-tail distribution: a few modes of appearance are common, while many rare modes are in the tail. As an example, people commonly stand or walk, but can contort their body into many more poses. I will argue that the "tail" remains the open challenge because training data is limited (even in the big-data setting). I will describe some promising methods that address this difficulty by synthesizing new data examples, either explicitly with a computer graphics pipeline or implicitly through compositional representations. The latter view suggests novel variants of deep architectures that reason about compositional variables. I will conclude by demonstrating such architectures on various visual recognition tasks, including perceptual grouping, object recognition, and people tracking.
Deep learning (DL) has achieved notable successes in many machine learning tasks. A number of software frameworks have been developed to expedite the process of designing and training deep neural networks (DNNs), such as Caffe, Torch, and Theano. Currently these frameworks can harness multiple GPUs on the same machine, but are unable to use GPUs that are distributed across multiple machines; as even average-sized deep networks can take days to train on a single GPU with 100s of GBs to TBs of data, distributed GPUs present a prime opportunity for scaling up DL. However, the limited inter-machine bandwidth available on commodity Ethernet networks presents a bottleneck to distributed GPU training, and prevents its trivial realization. To investigate how to adapt existing software frameworks to efficiently support distributed GPUs, we propose Poseidon, a scalable system architecture for distributed inter-machine communication in existing DL frameworks. We integrate Poseidon into the Caffe framework and evaluate its performance at training convolutional neural networks for object recognition in images. Poseidon features three key contributions that accelerate DNN training on clusters: (i) a three-level hybrid architecture that allows Poseidon to support both CPU-only and GPU-equipped clusters, (ii) a distributed wait-free backpropagation (DWBP) algorithm to improve GPU utilization and to balance communication, and (iii) a structure-aware communication protocol (SACP) to minimize communication overheads. We empirically show that Poseidon converges to the same objectives as a single machine, and achieves state-of-the-art training speedup across multiple models and well-established datasets, using a commodity GPU cluster of 8 nodes (e.g. 4.5x speedup on AlexNet, 4x on GoogLeNet, 4x on CIFAR-10). 
On the much larger ImageNet 22K dataset, Poseidon with 8 nodes achieves better speedup and competitive accuracy to recent CPU-based distributed deep learning systems such as Adam and Le et al, which use 10s to 1000s of nodes.
As both a computer scientist and a musician, I design intelligent systems to understand and extend human musical expression. To understand means to model the musical expression conveyed through acoustic, gestural, and emotional signals. To extend means to use this understanding to create expressive, interactive, and autonomous agents, serving both amateur and professional musicians. In particular, I create interactive artificial performers that are able to perform expressively in concert with humans by learning musicianship from rehearsal experience. This study unifies machine learning and knowledge representation of music structure and performance skills in an HCI framework. In this talk, I will go over the learning techniques and present robot musicians capable of playing collaboratively and reacting to musical nuance with facial and body gestures.
As data acquisition methods improve and the volume of data for analysis increases in brain science, computer- and neuroscientists have begun to work more closely together to extract fundamental principles that govern the organization and function of neural circuits. To facilitate these interactions, BrainHub will sponsor a hackathon for CS and ML graduate students using experimental data acquired by Carnegie Mellon neuroscientists. The basic goal will be to use the data to find something interesting, and the prize will involve graduate student support for the following academic year. Please come to this open discussion to help organize the event -- what datasets are most useful, how much time should be allotted, how much faculty involvement would be optimal, as well as other important details that will help you and your peers get the most out of this competition.
Learning to reason and understand the world's knowledge is a fundamental problem in Artificial Intelligence (AI). While it is always hypothesized that both the symbolic and statistical approaches are necessary to tackle complex problems in AI, in practice, bridging the two in a combined framework might bring intractability: most probabilistic first-order logics are simply not efficient enough for real-world sized tasks. In this talk, I will describe some of my recent progress on theories and practices in statistical relational learning: 1) a scalable learning and reasoning framework called ProPPR, whose inference time does not depend on the size of the knowledge graph; 2) a meta-reasoning theory that learns structures from relational data; and 3) a joint approach for scalable information extraction and relational reasoning. This is joint work with William Cohen and Katie Mazaitis.
In this talk I am going to focus on the distribution regression problem: regressing to vector-valued outputs from probability measures. Many important machine learning and statistical tasks fit into this framework, including multi-instance learning or point estimation problems without analytical solution such as hyperparameter or entropy estimation. Despite the large number of available heuristics in the literature, the inherent two-stage sampled nature of the problem makes the theoretical analysis quite challenging: in practice only samples from sampled distributions are observable, and the estimates have to rely on similarities computed between sets of points. To the best of our knowledge, the only existing technique with consistency guarantees for distribution regression requires density estimation as an intermediate step (which often performs poorly in practice), and the domain of the distributions to be compact Euclidean. I propose a simple, analytically computable, ridge regression based alternative to distribution regression by embedding the distributions into a reproducing kernel Hilbert space, and learning the regressor from the embeddings to the outputs. I am going to present the main ideas behind why this scheme is consistent in the two-stage sampled setup under mild conditions (on separable topological domains enriched with kernels) and present an exact computational-statistical efficiency tradeoff description showing that the studied estimator is able to match the one-stage sampled minimax optimal rate. Specifically, this result answers a 16-year-old open question by establishing the consistency of the classical set kernel [Haussler, 1999; Gartner et al., 2002] in regression, and also covers more recent kernels on distributions, including those due to [Christmann and Steinwart, 2010]. [Joint work with Bharath Sriperumbudur, Barnabas Poczos, Arthur Gretton.]
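The embed-then-regress idea can be sketched on a toy problem. The example below is only an illustration under stated assumptions, not the speaker's implementation: inputs are one-dimensional, outputs are scalar rather than vector-valued, the base kernel is Gaussian, and all function names (`set_kernel`, `fit`, `predict`, `solve`) are hypothetical. Each distribution is seen only through a bag of samples, the set kernel is the inner product of the empirical mean embeddings, and ridge regression is run on the resulting Gram matrix.

```python
import math
import random


def gauss_kernel(x, y, bandwidth=1.0):
    """Gaussian base kernel between two real numbers."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def set_kernel(bag_a, bag_b, bandwidth=1.0):
    """Inner product of the empirical mean embeddings of two sample bags."""
    total = sum(gauss_kernel(x, y, bandwidth) for x in bag_a for y in bag_b)
    return total / (len(bag_a) * len(bag_b))


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit(bags, targets, lam=1e-3, bandwidth=1.0):
    """Kernel ridge regression coefficients: (G + lam * n * I) alpha = y."""
    n = len(bags)
    gram = [[set_kernel(bags[i], bags[j], bandwidth) for j in range(n)]
            for i in range(n)]
    for i in range(n):
        gram[i][i] += lam * n
    return solve(gram, targets)


def predict(alpha, bags, new_bag, bandwidth=1.0):
    """Predicted output for a new bag of samples."""
    return sum(a * set_kernel(b, new_bag, bandwidth)
               for a, b in zip(alpha, bags))


# Toy task (an assumption for illustration): recover the mean of a Gaussian
# from a bag of 30 samples drawn from it.
rng = random.Random(0)
means = [-6.0, -3.0, 0.0, 3.0, 6.0]
bags = [[rng.gauss(mu, 1.0) for _ in range(30)] for mu in means]
alpha = fit(bags, means, lam=1e-3)
new_bag = [rng.gauss(1.5, 1.0) for _ in range(30)]
estimate = predict(alpha, bags, new_bag)
```

The two-stage sampling discussed in the abstract shows up here directly: the regressor never sees the underlying distributions, only the sampled bags.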
Bio: Zoltan Szabo is a Research Associate at the Gatsby Unit, University College London (2013 - present). He holds a double PhD in Computer Science and Applied Mathematics from the Eotvos Lorand University (2009-2012; Budapest, Hungary). His primary research interests are information theory, statistical machine learning, empirical processes and kernel methods with applications in remote sensing (sustainability), distribution regression, structured sparsity, independent subspace analysis and its extensions, collaborative filtering.
This talk emphasizes the use of structured methods to model Natural Language Understanding (NLU) problems. We argue that many NLU tasks can benefit from using models that are capable of incorporating not just linguistic cues but also the contexts in which these cues appear. In this talk, we use a structured approach to model the 'flow of information' in text to solve two problems: (i) Analyzing a paragraph to identify if a desire expressed in the paragraph was fulfilled, and (ii) Predicting need for instructor intervention in MOOC discussion forums.
We first address the problem of reading and understanding a textual paragraph containing an expression of a desire to identify if the desire got fulfilled. The method reads the paragraph as a story from the perspective of the protagonist - the entity that expressed the desire. We track the protagonist's actions and emotional states to make the binary prediction.
We then analyze contents of online educational discussion forums to automatically suggest threads to instructors that require their intervention. This can alleviate the need for the instructor to manually peruse all threads of the forum, and help students who need to interact with the instructor. Our method incorporates thread structure for the problem by using latent variables that abstract contents of individual posts and model the flow of information in the thread.
We propose a class of nonparametric two-sample tests with a cost linear in the sample size. Two tests are given, both based on an ensemble of distances between analytic functions representing each of the distributions. The first test uses smoothed empirical characteristic functions to represent the distributions, the second uses distribution embeddings in a reproducing kernel Hilbert space. Analyticity implies that differences in the distributions may be detected almost surely at a finite number of randomly chosen locations/frequencies. The new tests are consistent against a larger class of alternatives than the previous linear-time tests based on the (non-smoothed) empirical characteristic functions, while being much faster than the current state-of-the-art quadratic-time kernel-based or energy distance-based tests. Experiments on artificial benchmarks and on challenging real-world testing problems demonstrate that our tests give a better power/time tradeoff than competing approaches, and in some cases, better outright power than even the most expensive quadratic-time tests. This performance advantage is retained even in high dimensions, and in cases where the difference in distributions is not observable with low order statistics.
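As a rough illustration of the second test, the sketch below evaluates the difference of kernel mean embeddings at a single location, which is the simplest case of the "finite number of locations" idea. It assumes a Gaussian kernel, one-dimensional data, one fixed location, and a small variance regularizer `reg`; the function names are hypothetical and the tests in the talk use several random locations and a full covariance matrix. With one location the statistic is compared against the 95% quantile of a chi-squared distribution with one degree of freedom.

```python
import math

# 95% quantile of the chi-squared distribution with 1 degree of freedom.
CHI2_1DOF_95 = 3.841


def gauss_kernel(x, t, bandwidth=1.0):
    """Gaussian kernel evaluated between a data point and a test location."""
    return math.exp(-((x - t) ** 2) / (2.0 * bandwidth ** 2))


def me_statistic(xs, ys, location, bandwidth=1.0, reg=1e-8):
    """Normalized squared difference of mean embeddings at one location.

    The cost is a single pass over the paired samples, so it is linear in n.
    """
    n = min(len(xs), len(ys))
    diffs = [gauss_kernel(xs[i], location, bandwidth)
             - gauss_kernel(ys[i], location, bandwidth) for i in range(n)]
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return n * mean * mean / (var + reg)


def reject_null(xs, ys, location, bandwidth=1.0):
    """Reject 'same distribution' at the 5% level (asymptotic threshold)."""
    return me_statistic(xs, ys, location, bandwidth) > CHI2_1DOF_95


# Deterministic toy samples: the second sample is the first shifted by 3.
sample_x = [0.1 * ((i % 5) - 2) for i in range(50)]
sample_y = [3.0 + x for x in sample_x]
decision = reject_null(sample_x, sample_y, location=0.0)
```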
Several scientific and engineering problems can be cast as the optimisation of an expensive black box function. Bayesian Optimisation (BO), a method which models this function as a sample from a Gaussian Process, is used quite successfully in a plethora of applications. In Machine Learning, BO is fast becoming the method of choice to tune hyperparameters for expensive Machine Learning algorithms (e.g. Neural Networks).
In this talk, I will start with a general introduction to the techniques used in BO and survey existing theoretical results. We will look at the statistical and computational challenges in scaling BO to high dimensions. Then I will present some of our recent work which tackles these challenges using additive Gaussian processes. We show that regret improves from exponential in dimension to linear in dimension for additive functions. Empirically, our methods outperform naive BO and other global optimisation methods on several synthetic and real problems.
Some relevant papers:
- Brochu et al., "A Tutorial on Bayesian Optimisation ...", arXiv.
- Srinivas et al., "Gaussian Process Optimization in the Bandit Setting: No Regret and Experimental Design", arXiv.
- Kandasamy et al., "High Dimensional Bayesian Optimisation and Bandits via Additive Models", arXiv.
I will introduce the Non-Metric Space Library, which is an ongoing effort to develop and evaluate methods for generic non-metric spaces. First, I will review applications of nearest-neighbor (NN) search in ML/NLP and explain why non-metric spaces are important. Second, I will survey the state of the art and explain how ML can be used to devise more efficient search methods. Third, I will talk about the library itself (i.e., technical details, use cases, and underlying engineering decisions) and how it compares to the state of the art. The talk may be useful to those planning to apply NN-methods in their work. Furthermore, individuals convinced that locality sensitive hashing methods are superior to all other approaches may also find this talk interesting.
Lexical semantics is a sub-field of natural language processing that aims to obtain and represent the meaning of words in a format that is mutually intelligible to humans and computers. In this talk we will explore two different forms of word meaning representations: word vector representations and word lexicons. In the first part of the talk, I will present a method to improve existing models of word vector representations with explicit knowledge from semantic lexicons using a graph-based model. In the second part of the talk, I will use graph-based semi-supervised learning to construct wide-coverage morpho-syntactic lexicons with high quality from a small seed lexicon. The automatically constructed lexicons provide features that significantly improve performance in two downstream tasks: morphological tagging and dependency parsing.
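To make the first part concrete, here is one simple graph-based scheme for pulling word vectors toward their lexicon neighbours. It is a hypothetical sketch, not necessarily the model presented in the talk: each word's new vector is repeatedly set to the average of its original vector and the mean of its neighbours' current vectors, and the equal weighting of the two terms is an assumption. The names `retrofit`, `vectors`, and `lexicon` are illustrative.

```python
def retrofit(vectors, lexicon, iterations=50):
    """Move each vector toward its lexicon neighbours while staying close to
    the original embedding.

    vectors: dict mapping word -> list of floats (the original embeddings)
    lexicon: dict mapping word -> list of related words (graph edges)
    """
    new = {word: list(vec) for word, vec in vectors.items()}
    for _ in range(iterations):
        for word, neighbours in lexicon.items():
            linked = [n for n in neighbours if n in new]
            if word not in new or not linked:
                continue  # nothing to average with; keep the vector as-is
            dim = len(new[word])
            avg = [sum(new[n][d] for n in linked) / len(linked)
                   for d in range(dim)]
            # Equal weight on the original vector and the neighbour average.
            new[word] = [(vectors[word][d] + avg[d]) / 2.0 for d in range(dim)]
    return new


# Tiny example: "a" and "b" are listed as related, "c" has no lexicon entry.
toy_vectors = {"a": [0.0], "b": [2.0], "c": [5.0]}
toy_lexicon = {"a": ["b"], "b": ["a"]}
toy_result = retrofit(toy_vectors, toy_lexicon)
```

In this toy case the two linked words move toward each other but do not collapse to the same point, because each stays anchored to its original vector.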
Trend filtering is a recently developed tool (Steidl et al., 2006; Kim et al., 2009) for nonparametric regression. Given n points, the trend filtering estimate is defined as the minimizer of a penalized least squares problem, where the penalty is the l1-norm of the kth order discrete derivatives over the input points. We will give an overview of some interesting connections between these estimates and adaptive spline estimation through a new function basis called the "Falling Factorial Basis", and illustrate the provable statistical superiority of trend filtering to other common nonparametric regression tools, such as smoothing splines and kernel smoothing.
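The penalized least squares criterion can be written down in a few lines. The sketch below assumes evenly spaced inputs and unscaled differences, and leaves the order of differencing as a parameter because indexing conventions for k vary between papers; the function names are hypothetical. It only evaluates the objective and does not solve the minimization.

```python
def discrete_diff(values, order):
    """Apply first differences `order` times to a sequence."""
    out = list(values)
    for _ in range(order):
        out = [b - a for a, b in zip(out, out[1:])]
    return out


def trend_filter_objective(y, beta, lam, order):
    """0.5 * ||y - beta||^2 + lam * ||D^(order) beta||_1."""
    fit = 0.5 * sum((yi - bi) ** 2 for yi, bi in zip(y, beta))
    penalty = sum(abs(d) for d in discrete_diff(beta, order))
    return fit + lam * penalty


# A piecewise-linear sequence with one kink: only the kink is penalized
# when second differences are used.
kinked = [0, 1, 2, 3, 2, 1]
second_diffs = discrete_diff(kinked, 2)
```

Because the l1 penalty charges only for nonzero differences, a candidate fit with a few kinks is cheap, which is one way to see why the estimate adapts locally.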
I will also present a generalization of trend filtering to nonparametric estimation over graphs. This approach is more locally adaptive compared to the standard methods such as wavelet smoothing and Laplacian smoothing and can be generically used in many applications. This is closely related to TV denoising, graph cut based image segmentation, and is applied to event detection and semi-supervised learning. Finally, I will talk about statistical and computational challenges on this problem that are still open to date.
Sparse Gaussian graphical model estimation has become a modern workhorse for learning networks from high-dimensional datasets of a single modality, such as gene-expression levels. However, we often would like to learn a cascade of networks that combines multiple types of data into a single statistical analysis. In biology, this desire is motivated by the hope that integrating various kinds of genomic and clinical data will allow us to better dissect the underpinnings of complex diseases.
This talk describes a principled, scalable approach to learning and interpreting such cascades of networks with sparse Gaussian chain graph models. We propose a general recipe for learning such models using sparse conditional Gaussian graphical models as components. I will also discuss how our method can be used to discover structured sparsity and can naturally handle partially-available data. Finally, I will describe recent work on learning such networks with a million variables on a single machine in about a day.
Apache Spark is a fast and general engine for large-scale data processing, and it is one of the fastest-growing open-source projects in big data, with ~1K contributors from academia and industry. This talk will discuss MLlib, the Machine Learning library built on top of Spark, and our approach to distributed ML. We will overview the project's current status and future directions, including algorithmic coverage and scaling/speed improvements. Finally, we will discuss one or two algorithms in more detail, mentioning challenges in distributed computing.
This talk will provide background on Spark and MLlib for a general audience, as well as provide roadmap and implementation details interesting to experienced Spark users.
Bio: Joseph Bradley is an Apache Spark Committer working on MLlib at Databricks, the startup founded by the creators of Spark. He worked with Prof. Carlos Guestrin at CMU, receiving a Ph.D. in Machine Learning in 2013 (conditional random fields, parallel sparse regression). He spent a year as a postdoc at UC Berkeley (sparse models, peer grading in MOOCs) before joining Databricks.
Diversity regularization of latent variable models (LVMs) aims to encourage the latent factors in LVMs to be different from each other, which can 1) capture long-tail latent factors; 2) reduce the complexity of models without losing their modeling power. In this talk, I will introduce 1) how the diversity regularizer is defined; 2) how to optimize it; 3) a theoretical justification on why it works; 4) its applications in document modeling and distance learning.
Bio: Pengtao Xie is a PhD student in the machine learning department of CMU. His research interests lie in the diversity regularization and scalability of latent variable models. He obtained a M.E. from Tsinghua University in 2013 and a B.E. from Sichuan University in 2010. He is the recipient of Siebel Scholarship, Goldman Sachs Global Leader Scholarship and National Scholarship of China.
Convex optimization has developed a wide variety of useful tools critical to many applications in machine learning. However, unlike linear and quadratic programming, general convex solvers have not yet reached sufficient maturity to fully decouple the convex programming model from the numerical algorithms required for implementation. Especially as datasets grow in size, there is a significant gap in speed and scalability between general solvers and specialized algorithms.
This talk addresses this gap with a new model for convex programming based on an intermediate representation of convex problems as a sum of functions with efficient proximal operators. This representation serves two purposes: 1) many problems can be expressed in terms of functions with simple proximal operators, and 2) the proximal operator form serves as a general interface to any specialized algorithm that can incorporate additional l2-regularization. On a single CPU core, numerical results demonstrate that the sum-of-prox form results in significantly faster algorithms than existing general solvers based on conic forms. In addition, splitting problems into separable sums is attractive from the perspective of distributing solver work amongst multiple cores and machines. We develop a system that scales to 100s of CPU cores and gigabyte-scale data, enabling general convex programming frameworks to be applied to a much larger class of problems.
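For readers unfamiliar with proximal operators, a minimal sketch of two standard ones follows. The proximal operator of f at v is the minimizer of f(x) + ||x - v||^2 / (2 * lam), that is, f plus an l2-regularization term. The function names are hypothetical and are not part of the system described in the talk.

```python
def prox_l1(v, lam):
    """Proximal operator of lam * ||x||_1: elementwise soft-thresholding."""
    out = []
    for x in v:
        if x > lam:
            out.append(x - lam)
        elif x < -lam:
            out.append(x + lam)
        else:
            out.append(0.0)
    return out


def prox_nonneg(v):
    """Proximal operator of the indicator of x >= 0: Euclidean projection."""
    return [max(x, 0.0) for x in v]


def prox_objective(f, x, v, lam):
    """The function minimized by the proximal operator of f at v."""
    return f(x) + sum((xi - vi) ** 2 for xi, vi in zip(x, v)) / (2.0 * lam)


def l1_norm(x):
    return sum(abs(xi) for xi in x)


shrunk = prox_l1([3.0, -0.5, -2.0], 1.0)
```

Both operators have closed forms that cost one pass over the vector, which is the kind of "simple proximal operator" the sum-of-prox representation relies on.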
We apply large-scale convex programming to several problems arising from building the next-generation, information-enabled electrical grid. In these problems (as is common in many domains) large, high-dimensional datasets present opportunities for novel data-driven solutions. We present approaches based on convex models for several problems: probabilistic forecasting of electricity generation and demand, model predictive control for device energy management and source separation for whole-home energy disaggregation.
Data clustering and representation learning are two closely related tasks that can mutually benefit each other. On one hand, the feature vectors produced by representation learning are the inputs of data clustering and largely affect the clustering performance. On the other hand, the cluster labels generated by clustering algorithms can supervise representation learning to make the learned representations tailored to clustering tasks. In this work, we aim to integrate these two tasks into one unified framework to perform them simultaneously and enable them to mutually promote each other. Specifically, for text data, we design a Multi-grain Clustering Topic Model to simultaneously perform document clustering and topic learning; for image data, we design a Double Layer Gaussian Mixture Model to integrate image clustering and codebook learning. Experiments on various datasets demonstrate that integrating data clustering and representation learning can effectively improve the performance of both tasks.
Bio: Pengtao Xie is a graduate student in the Language Technologies Institute, working with Professor Eric Xing. His primary research interests lie in latent space models and large scale distributed machine learning. He received a M.E. from Tsinghua University in 2013 and a B.E. from Sichuan University in 2010. He is the recipient of Siebel Scholarship, Goldman Sachs Global Leader Scholarship and National Scholarship of China.
An accurate model of a patient's individual survival distribution can help determine the appropriate treatment and care of terminal patients. The common practice of estimating such survival distributions uses only population averages for (say) the site and stage of cancer; however, this is not very precise, as it ignores many important individual differences among patients. This paper describes a novel technique, PSSP (patient-specific survival prediction), for estimating a patient's individual survival curve, based on the characteristics of that specific patient, using a model that was learned from earlier patients. We describe how PSSP works, and explain how PSSP differs from the more standard tools for survival analysis (Kaplan-Meier, Cox Proportional Hazard, etc). We also show that PSSP is "calibrated", which means that its probabilistic estimates are meaningful. Finally, we demonstrate, over many real-world datasets (various cancers, and liver transplantation), that PSSP provides survival estimates that are helpful for patients, clinicians and researchers. This tool is freely available at http://pssp.srv.ualberta.ca/.
Bio: After earning a PhD from Stanford, Russ Greiner worked in both academic and industrial research before settling at the University of Alberta, where he is now a Professor in Computing Science and the founding Scientific Director of the Alberta Innovates Centre for Machine Learning, which won the ASTech Award for "Outstanding Leadership in Technology" in 2006. He has been Program Chair for the 2004 "Int'l Conf. on Machine Learning", Conference Chair for 2006 "Int'l Conf. on Machine Learning", Editor-in-Chief for "Computational Intelligence", and is serving on the editorial boards of a number of other journals. He was elected a Fellow of the AAAI (Association for the Advancement of Artificial Intelligence) in 2007, and was awarded a McCalla Professorship in 2005-06 and a Killam Annual Professorship in 2007. He has published over 200 refereed papers and patents, most in the areas of machine learning and knowledge representation, including 4 that have been awarded Best Paper prizes. The main foci of his current work are (1) bioinformatics and medical informatics; (2) learning and using effective probabilistic models and (3) formal foundations of learnability.
In this talk, I will describe some theoretical results that provide new perspectives on why kernel methods might work well in practice. In particular, we provide some partial answers to (Q) why might the Gaussian RBF kernel (which projects the data into infinite dimensions) often yield high classification accuracies without overfitting in practice?
To answer this question, I first present results from a simpler setting called two-sample testing (TST), a hypothesis testing problem that has close ties with classification. For TST, we prove that* the power of an RBF kernel test statistic equals the power of the linear kernel test statistic in high dimensions, whenever the linear kernel actually suffices to differentiate the two distributions. I will then relate the linear kernel TST to the Fisher LDA classifier, and present a result claiming that* this classifier also performs optimally for TST.
Based on this theory, my conjectured answer for (Q) has two parts - the first is that practical examples have fairly simple decision boundaries (low-degree polynomials), and the second is that the RBF kernel is "automatically adaptive" to such simple boundaries (i.e. it behaves similarly to a linear kernel whenever a linear classifier suffices).
* means under some reasonable assumptions.
I am currently proving similar results for the related problems of independence testing and regression. Many people have contributed to this sequence of works (some in preparation), including Larry Wasserman, Aarti Singh, Sashank Reddi, and Barnabas Poczos.
Many emerging applications of machine learning involve time series and spatio-temporal data. In this talk, I will discuss a collection of machine learning approaches to effectively analyze and model large-scale time series and spatio-temporal data, including temporal causal models, sparse extreme-value models, and fast tensor-based forecasting models. Experiment results will be shown to demonstrate the effectiveness of our models in climate science and social media applications.
Many developments in Statistics and Machine Learning involve the computation of higher order derivatives of Gaussian density functions. In the multidimensional case, it is necessary to first establish a convenient formulation to assemble all the partial derivatives. Theoretically, it is possible to derive succinct explicit expressions for the multidimensional Gaussian density derivatives of an arbitrarily high order if we formally arrange all the partial derivatives into a high-dimensional vector. However, the huge matrices involved in these general expressions have traditionally made them of little practical use. In this talk we propose several recursive algorithms that overcome these difficulties and make it possible to compute these higher order derivatives very efficiently. We will also highlight the underlying connections between higher order derivatives of Gaussian density functions, the expected value of products of quadratic forms in Gaussian random variables, and V-statistics of degree two based on Gaussian density functions, and their applications to nonparametric kernel smoothing.
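The flavour of a recursive computation is easiest to see in one dimension. The sketch below covers only the univariate standard normal case, which is much simpler than the multidimensional algorithms of the talk: the r-th derivative of the density equals (-1)^r He_r(x) times the density, and the probabilists' Hermite polynomials satisfy He_{r+1}(x) = x He_r(x) - r He_{r-1}(x). Function names are hypothetical.

```python
import math


def normal_pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def hermite(x, r):
    """Probabilists' Hermite polynomial He_r(x) via the three-term recursion."""
    prev, curr = 1.0, x
    if r == 0:
        return prev
    for n in range(1, r):
        prev, curr = curr, x * curr - n * prev
    return curr


def normal_pdf_derivative(x, r):
    """r-th derivative of the standard normal density at x."""
    return (-1) ** r * hermite(x, r) * normal_pdf(x)


second_at_zero = normal_pdf_derivative(0.0, 2)
```

Each extra order costs a constant amount of work, instead of expanding and differentiating an explicit polynomial expression.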
Proteins are the workhorses of the cellular machinery. Disease-causing pathogens, such as bacteria and viruses, introduce their proteins into the host cells. There they interact with the host's proteins and enable the pathogen to obtain nutrients, replicate and survive inside the host. Often, multiple diseases involve organisms that are related phylogenetically or share some biological properties. For instance, related viruses will employ similar strategies to infect the host cells. Therefore, knowledge can be shared across diseases to better understand various biological phenomena.
In this talk I will present two approaches towards integrating host-pathogen protein interactions from various diseases, with the goal of building good predictive models for each disease. While one approach uses the commonality in the infection mechanisms to enable sharing of information across tasks, the other enforces a shared low-rank structure along with a task-specific sparse structure. We see significant improvements in the prediction performance. This is joint work with Jaime Carbonell and Judith Klein-Seetharaman.
Contextual bandit learning is a fairly recent learning paradigm that captures a fundamental tension between exploration and exploitation in many real-world learning problems. In this setting, a learner, for several rounds, observes a context, makes a decision, and receives reward for her decision, with a goal of having low regret relative to a class of policies mapping contexts to actions.
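The interaction protocol can be sketched with a toy simulation. Everything specific here is an assumption made for illustration: two contexts, two actions, a deterministic reward of 1 when the action matches the context, and an epsilon-greedy learner, which is just one simple strategy and not an algorithm singled out by the talk. Regret is measured against the best policy in the class of all context-to-action maps, which in this toy problem earns reward 1 in every round.

```python
import random


def run_epsilon_greedy(rounds=2000, epsilon=0.1, seed=0):
    """Simulate the observe-decide-reward loop and return cumulative regret."""
    rng = random.Random(seed)
    contexts, actions = (0, 1), (0, 1)
    totals = {(c, a): 0.0 for c in contexts for a in actions}
    counts = {(c, a): 0 for c in contexts for a in actions}

    def estimate(context, action):
        n = counts[(context, action)]
        return totals[(context, action)] / n if n else 0.0

    learner_reward = 0.0
    best_policy_reward = 0.0
    for _ in range(rounds):
        context = rng.choice(contexts)               # observe a context
        if rng.random() < epsilon:                   # explore
            action = rng.choice(actions)
        else:                                        # exploit current estimates
            action = max(actions, key=lambda a: estimate(context, a))
        reward = 1.0 if action == context else 0.0   # only this reward is seen
        totals[(context, action)] += reward
        counts[(context, action)] += 1
        learner_reward += reward
        best_policy_reward += 1.0  # the policy "action = context" always earns 1
    return best_policy_reward - learner_reward


regret = run_epsilon_greedy()
```

The learner never observes what the other action would have paid, which is the source of the exploration-exploitation tension described above.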
In this talk, I will give a brief overview of the many algorithms and techniques for contextual bandit learning, highlighting the positive and negative aspects of these algorithms. I will also, time permitting, talk about several extensions to the classical set up, and the approaches used for these settings.
I will present a method that detects compact low-dimensional structures if they exist, and uses them to construct compact interpretable models for different machine learning tasks that can benefit practical applications. To start with, I will formalize Informative Projection Recovery, the problem of extracting a small set of low-dimensional projections of data that jointly support an accurate model for a given learning task. Our solution to this problem is a regression-based algorithm that identifies informative projections by optimizing over a matrix of point-wise loss estimators. It generalizes to multiple types of machine learning problems, offering solutions to classification, clustering, regression, and active learning tasks. Experiments show that our method can discover and leverage low-dimensional structures in data, yielding accurate and compact models. Our method is particularly useful in applications in which expert assessment of the results is essential, such as classification tasks in the healthcare domain.
Additionally, we developed an active learning framework that works with the obtained compact models in finding the most informative unlabeled data. For this purpose, we enhance standard active selection criteria using the information encapsulated by the trained model. The advantage of our approach is that the labeling effort is expended mainly on samples that benefit models from the hypothesis class we are considering. Experiments show that this results in an improved learning rate over standard selection criteria for data from the clinical domain, while the comprehensible view of the data supports the labeling process and helps preempt labeling errors.
The development of sparse methods for high-dimensional supervised and unsupervised learning has been one of the main directions of research in machine learning and statistics for several years. Recently, this trend has seen the proposal of several feature-sparse algorithms for the twin problems of clustering and Gaussian Mixture Model (GMM) learning. The motivation for the development of these methods is twofold. First is the typical statistical benefit, namely the potential reduction of the effects of the curse of dimensionality. Second, as clustering and GMM learning are often used for exploratory data analysis, sparse results may improve the intuitive interpretability of results in a high-dimensional setting.
In this talk, I will give an overview of some of the existing feature-sparse algorithms for clustering and GMM learning, focusing mainly on what is known about their statistical properties. I will describe recent advances made towards developing methods with proven statistical benefits, and the large gaps left to fill to make these methods practical.
Unsupervised discovery of synonymous phrases is useful in a variety of tasks ranging from text mining and search engines to semantic analysis and machine translation. The Near Synonym System (NeSS) presents an unsupervised corpus-based conditional model for finding phrasal synonyms and near synonyms that requires only a large monolingual corpus. The method is based on maximizing information-theoretic combinations of shared contexts and is parallelizable for large-scale processing. An evaluation framework with crowd-sourced judgments is proposed and the results are compared with alternate methods, demonstrating considerably superior results to the literature and to thesaurus look up for multi-word phrases. Moreover, the results show that the statistical scoring functions and overall scalability of the system are more important than language specific NLP tools. The method is language-independent and is practically useable due to accuracy and real-time performance via parallel decomposition.
How is information organized in the brain when it reads? Where and when do the required processes occur, such as perceiving the individual words, combining them with the previous words and maintaining a representation of the overall meaning?
I will present results from a recent experiment in which we align context-based neural network language models and brain activity during reading. When processing a text word by word, both the brain and the neural networks perform the same processes. They both maintain a representation for the previous context. They both represent the properties of the incoming word and then integrate it with context. We study the alignment between the latent vectors used by these neural networks and the brain activity observed via Magnetoencephalography (MEG) when subjects read a chapter from Harry Potter and the Sorcerer's Stone. For that purpose we apply the neural network to the same chapter the subjects are reading, and explore the ability of these vector representations to predict the observed word-by-word brain activity.
Our novel results reveal that context is more predictive of brain activity than the properties of the current word, hinting that more brain activity is involved in storing context than in perceiving the current word. We uncover the time-line of how the brain updates its representation of context. We demonstrate the incremental perception of every new word starting early in the visual cortex, moving next to the temporal lobes and finally to the frontal regions. We show the integration process occurring in the temporal lobes after the new word has been perceived.
This is joint work with Ashish Vaswani, Kevin Knight and Tom Mitchell, and is a part of a larger effort to understand how the brain organizes information in natural reading. I will describe this general research direction. I will also mention results from a sister experiment in which we demonstrate how the brain areas involved in reading are processing different types of information (such as syntax, semantics or narrative information). This second experiment is joint work with Brian Murphy, Partha Talukdar, Alona Fyshe, Aaditya Ramdas and Tom Mitchell.
One of the fundamental challenges in reinforcement learning (RL) is to guarantee that a newly proposed policy that has not yet been deployed will be an improvement upon the current policy---that the RL algorithm is "safe". Such an algorithm would be a significant step towards widespread application of RL to real-life problems where deployment of an RL algorithm can be costly or dangerous if the policies it proposes perform worse than the current policy.
I will discuss my recent work (in collaboration with Adobe Research) on such a safe policy search algorithm. The viability of our approach hinges on a new concentration inequality particularly well suited to this application. I will present results from a large real-world digital marketing application before concluding with discussion of potential future research directions.
Bio: Philip is a PhD student supervised by Professor Andrew Barto at the University of Massachusetts Amherst. During his PhD he has been an intern with Adobe Research and Kenji Doya's Neural Computation Unit. He received his B.S. and M.S. in computer science from Case Western Reserve University, where he was supervised by Professor Michael Branicky and where he was a member of Team Case in the DARPA Urban Challenge.
We study the distributed computing setting in which there are multiple servers, each holding a set of points, who wish to compute functions on the union of their point sets. We first discuss two popular center-based clustering objectives, k-median and k-means. Following a classic approach in clustering by Har-Peled and Mazumdar, we reduce the problem of finding a clustering with low cost to the problem of finding a coreset (i.e., a summary) of small size. We provide a distributed method for constructing a global coreset which improves over the previous methods by reducing the communication cost.
We then consider Principal Component Analysis (PCA) where the servers would like to compute a low dimensional subspace capturing as much of the variance of the union of their point sets as possible. We provide a computation and communication efficient algorithm for distributed PCA and also demonstrate how this can be used to further improve the communication and computational costs of k-means clustering and related problems. Our empirical study on real world data shows a speedup of orders of magnitude, preserving communication with only a negligible degradation in solution quality.
Communication costs, resulting from synchronization requirements during learning, can greatly slow down many parallel machine learning algorithms. We present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets of data are processed independently, with very little communication. First, we arbitrarily partition data onto multiple machines. Then, on each machine, any classical MCMC method may be used to draw samples from a posterior distribution given the data subset. Finally, the samples from each machine are combined to form samples from the full posterior. This embarrassingly parallel algorithm allows each machine to act independently on a subset of the data (without communication) until the final combination stage. We prove that our algorithm generates asymptotically exact samples and empirically demonstrate its ability to parallelize burn-in and sampling in several models.
This is joint work with Chong Wang and Eric Xing.
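The final combination stage of the parallel MCMC scheme described above can be illustrated with a small sketch. The abstract does not say how the per-machine samples are combined, so the rule below is an assumption: each machine's samples are summarized by a one-dimensional Gaussian, and the Gaussians are multiplied, which gives a precision-weighted mean. The function name `combine_gaussian` is hypothetical.

```python
import random
import statistics


def combine_gaussian(subposterior_samples, n_draws, seed=0):
    """Combine 1-D sample sets from several machines.

    Assumption (not from the abstract): each machine's samples are
    approximated by a Gaussian, and the full posterior is taken to be
    proportional to the product of those Gaussians.
    """
    total_precision = 0.0
    weighted_mean_sum = 0.0
    for samples in subposterior_samples:
        mean = statistics.fmean(samples)
        var = statistics.variance(samples)  # sample variance, needs >= 2 points
        total_precision += 1.0 / var
        weighted_mean_sum += mean / var

    # A product of Gaussians is Gaussian: precisions add, and the mean is
    # the precision-weighted average of the individual means.
    combined_mean = weighted_mean_sum / total_precision
    combined_std = (1.0 / total_precision) ** 0.5

    rng = random.Random(seed)
    draws = [rng.gauss(combined_mean, combined_std) for _ in range(n_draws)]
    return combined_mean, combined_std, draws


# Two machines, each holding samples drawn independently on its own data subset.
mean, std, draws = combine_gaussian([[0.0, 2.0], [4.0, 6.0]], n_draws=5)
```

Each machine only needs to send its samples (or just a mean and variance) once, at the end, which is the sense in which no communication is needed until the combination stage.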
This is joint work with Ryan Tibshirani (Statistics and ML, Carnegie Mellon University).
This talk will be about a common practical problem - estimating piecewise constant/linear/quadratic fits to low-dimensional data. I will first introduce Trend Filtering, a recently proposed tool for this problem, and compare it to the popular smoothing splines. Given its theoretical optimality (that I will briefly touch upon), the only roadblock to using it in practice is having robust and efficient algorithms. We take a major step in overcoming this problem with a state-of-the-art specialized "ADMM" algorithm. Furthermore, the proposed implementation is very simple, and importantly, it is flexible enough to extend to many interesting related problems, such as sparse trend filtering and isotonic trend filtering, as I will demonstrate with an application to a neuroscience problem.
Software for our method is easy to use, and it is made freely available, highly optimized in C++, and interfaced through an R package (function "trendfilter" in package "genlasso").
Dependency trees used in syntactic parsing often include a root node representing a dummy word prefixed or suffixed to the sentence, a device that is generally considered a mere technical convenience and is tacitly assumed to have no impact on empirical results. We demonstrate that this assumption is false and that the accuracy of data-driven dependency parsers can in fact be sensitive to the existence and placement of the dummy root node. In particular, we show that a greedy, left-to-right, arc-eager transition-based parser consistently performs worse when the dummy root node is placed at the beginning of the sentence (following the current convention in data-driven dependency parsing) than when it is placed at the end or omitted completely. Control experiments with an arc-standard transition-based parser and an arc-factored graph-based parser reveal no consistent preferences but nevertheless exhibit considerable variation in results depending on root placement. We conclude that the treatment of dummy root nodes in data-driven dependency parsing is an underestimated source of variation in experiments and may also be a parameter worth tuning for some parsers.
Miguel is a visiting lecturer and postdoc at Pompeu Fabra University, Barcelona, Spain. He works on natural language processing and machine learning with a special interest in linguistic structure prediction problems, such as dependency parsing and phrase structure parsing. He completed his BSc, MSc and PhD at the Universidad Complutense de Madrid. In recent years, he has been a visiting researcher at the Universities of Uppsala, Birmingham and Singapore.
A common assumption in economics is that agents are utility maximizers: an agent, facing prices for a set of goods, will choose to buy the bundle of goods that she most prefers among all bundles that she can afford, according to some concave, non-decreasing utility function. In the classical revealed preference analysis, the goal is to fit some concave, non-decreasing model to data from an agent's past purchases. However, due to the richness of this model class, this approach will not generalize well to predicting future purchases. A recent line of work, starting with Beigman and Vohra (2006) and Zadimoghaddam and Roth (2012), has addressed this issue by proposing to learn restricted classes of utility functions from revealed preference data.
This talk will present recent advances in this line of work. We provide sample complexity guarantees and efficient algorithms for learning linear utility functions from revealed preference data. At a technical level, our work establishes connections between learning from revealed preferences and problems of multi-class learning, combining recent advances on intrinsic sample complexity of multi-class learning based on compression schemes with a new algorithmic analysis yielding time- and sample-efficient procedures. Our technique yields numerous generalizations including the ability to learn other well-studied classes of utility functions, to deal with a misspecified model, and with non-linear prices. We believe it may lead to new solutions to a variety of learning problems in economic and game theoretic contexts.
This talk is based on joint work with Nina Balcan, Amit Daniely, Ruta Mehta and Vijay V. Vazirani, which will appear at WINE '14.
Data mining (i.e., finding repeated, informative patterns in large datasets) has proven extremely difficult for visual data. A key issue in visual data mining is the lack of a reliable way to tell whether two images or image patches even depict the same thing. In this talk, I'll cover two recent works which can successfully mine discriminative sets of image patches in both weakly-supervised and fully unsupervised settings.
Our first work proposes discriminative mode seeking, an extension of Mean Shift to weakly-labeled data. Instead of finding the local maxima of a density, we exploit the weak label to partition the data into two sets and find the maxima of the density ratio. Given a dataset of image patches, and weak labels such as scene categories, these 'discriminative modes' correspond to remarkably meaningful visual patterns, including objects and object parts. Using these discriminative patches as an image representation, we obtain state-of-the-art results on a challenging indoor scene classification benchmark.
In the second part of the talk, I will discuss how we can extend this formulation to a fully unsupervised setting. Instead of using weak labels as supervision, we use the ability of an object patch to predict the rest of the object (its context) as supervisory signal to help discover visually consistent object clusters. The proposed method outperforms previous unsupervised as well as weakly-supervised object discovery approaches, and is shown to provide correspondences detailed enough to transfer keypoint annotations, even for extremely difficult datasets intended for benchmarking fully supervised object detection algorithms (e.g. Pascal VOC).
Recently, many cloud-based machine learning (ML) services have been launched, including Microsoft Azure Machine Learning, GraphLab, Google Prediction API and Ersatz Labs. Cloud ML makes machine learning very easy to use for common users. However, it also puts the privacy and security of users' data at risk, and protecting users' privacy in cloud ML is a big challenge. In this work, we focus on neural networks, a backbone model in machine learning, and investigate how to perform privacy-preserving neural network prediction on encrypted data. Users encrypt their data before uploading it to the cloud. The cloud performs neural network predictions over the encrypted data and obtains results that are also in encrypted form, which the cloud cannot decipher. The encrypted results are sent back to the users, who decrypt them to get the plaintext results. In this process, the cloud never learns the users' input data or output results, since both are encrypted. This achieves strong protection of users' privacy. Meanwhile, with the help of homomorphic encryption, predictions made on encrypted data are nearly the same as those on plaintext data, so the predictive performance of the neural network is maintained.
Inferring the "tree of life" from genetic data is an important problem in evolutionary biology. It has been customary to think about this problem as one of learning the evolutionary history of a single gene. However, individual genes might have evolutionary histories that are (topologically) distinct from each other and, of course, from the underlying tree of life. In this talk, I will discuss a probabilistic model of evolution that takes this into account. I will then outline our recent theoretical explorations into questions such as reliable algorithms for learning the tree of life from several genes and the amount of data required for such algorithms to succeed.
This is joint work with Rob Nowak and Sebastien Roch.
Vector space models (VSMs) represent word meanings as points in a high dimensional space. VSMs are typically created using large text corpora, and so represent word semantics as observed in text. We present a new algorithm (JNNSE) that can incorporate a measure of semantics not previously used to create VSMs: brain activation data recorded while people read words. The resulting model takes advantage of the complementary strengths and weaknesses of corpus and brain activation data to give a more complete representation of semantics. Evaluations show that the model 1) matches a behavioral measure of semantics more closely, 2) can be used to predict corpus data for unseen words and 3) has predictive power that generalizes across brain imaging technologies and across subjects. We believe that the model is thus a more faithful representation of mental vocabularies.
Joint work with Partha Talukdar, Brian Murphy and Tom Mitchell
We consider the question of how unlabeled data can be used to estimate the true accuracy of learned classifiers. This is an important question for any autonomous learning system that must estimate its accuracy without supervision, and also when classifiers trained from one data distribution must be applied to a new distribution (e.g., document classifiers trained on one text corpus are to be applied to a second corpus). We show how to accurately estimate error rate from unlabeled data when given a collection of competing classifiers that make independent errors, based on the agreement rates between subsets of these classifiers. We further show that even when the classifiers do not make independent errors, both their accuracies and error dependencies can be estimated in a multitask learning setting under practical assumptions. Experiments on two data sets demonstrate accurate estimates of accuracy from unlabeled data. These results are of practical significance in situations where labeled data is scarce, and shed light on the more general question of how the consistency among multiple functions is related to their true accuracies.
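To make the independent-errors case concrete, here is a minimal sketch for the simplest setting we could think of: three binary classifiers, each with error rate below 0.5, whose errors are independent. This setting and the function name `error_rates_from_agreement` are assumptions for illustration; the abstract does not spell out the estimator. Under these assumptions, two classifiers agree with probability a_ij = (1 - e_i)(1 - e_j) + e_i e_j, so 2 a_ij - 1 = (1 - 2 e_i)(1 - 2 e_j), and three pairwise agreement rates determine the three error rates without any labels.

```python
import math


def error_rates_from_agreement(a12, a13, a23):
    """Recover error rates of three binary classifiers from agreement rates.

    Assumes independent errors and error rates below 0.5 (an illustrative
    setting, not necessarily the one used in the talk).
    """
    # c_ij = 2 * a_ij - 1 equals (1 - 2 e_i) * (1 - 2 e_j).
    c12, c13, c23 = 2 * a12 - 1, 2 * a13 - 1, 2 * a23 - 1
    if min(c12, c13, c23) <= 0:
        raise ValueError("agreement rates must exceed 0.5 under these assumptions")

    # Solve the three product equations for each factor (1 - 2 e_i).
    c1 = math.sqrt(c12 * c13 / c23)
    c2 = math.sqrt(c12 * c23 / c13)
    c3 = math.sqrt(c13 * c23 / c12)
    return tuple((1 - c) / 2 for c in (c1, c2, c3))


# Agreement rates measured on unlabeled data (values here are made up so that
# the true error rates are 0.1, 0.2 and 0.3).
estimates = error_rates_from_agreement(0.74, 0.66, 0.62)
```

In practice the agreement rates would be estimated by counting how often each pair of classifiers outputs the same label on the unlabeled pool.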
Commanding robots through unconstrained natural language directions is intuitive, flexible, and does not require specialized interfaces or training. Providing this capability would enable effortless coordination in human robot teams that operate in non-specialized environments. However, natural language direction following through unknown environments requires understanding the meaning of language, using a partial semantic world model to generate actions in the world, and reasoning about the environment and landmarks that have not yet been detected.
We address the problem of robots following natural language directions through complex unknown environments. By exploiting the structure of spatial language, we can frame direction following as a problem of sequential decision making under uncertainty. We learn a policy using imitation learning from demonstrations of people following directions. The trained policy predicts a sequence of actions that follow the directions, explores the environment (discovering new landmarks), backtracks when necessary, and explicitly declares when it has reached its destination. By training explicitly in unknown environments we can generalize to situations that have not been encountered previously.
This is work with Anthony Stentz (CMU), Tom Kollar (CMU), Matt Walter (MIT), Tom Howard (MIT), and Sachi Hemachandra (MIT).
Ensemble methods are widely used in practice with the hope of obtaining better predictive performance than could be obtained from any of the constituent classifiers in the ensemble. Most of the existing literature is concerned with learning ensembles in a supervised setting. In this paper we propose an unsupervised iterative algorithm to combine the discriminant scores from different binary classifiers. We prove that (under certain assumptions) the Area Under the ROC Curve (AUC) of the resulting ensemble is greater than or equal to the AUC of the best classifier (with maximum AUC). We also experimentally validate this claim on a number of datasets and show that the performance is better than that of supervised ensembles.
In this talk, I will present some of my recent work with my collaborators on building models for inferring political ideologies from text. Given political candidate speeches from the 2008 and 2012 US elections, we seek to measure their ideological positioning. To accomplish this, we infer ideological cues from a corpus of political writings annotated with known ideologies. We then represent the speeches of U.S. presidential candidates as sequences of cues and lags (filler distinguished only by its length in words). We apply a domain-informed Bayesian HMM to infer the proportions of ideologies each candidate uses in each campaign. The results are validated against a set of preregistered, domain expert authored hypotheses. I will also present some preliminary results on our work with briefs and data from the US Supreme Court, studying the latent behaviors of amicus brief filers from a utility maximizing perspective.
Graph-based semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of the graph starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of edges (|E|) and with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, there exist many applications of practical significance with very large m over large graphs, demanding better space and time complexity. In this talk, we propose MAD-Sketch, a novel graph-based SSL algorithm which compactly stores the label distribution on each node using Count-min Sketch, a randomized data structure. We present theoretical analysis showing that under mild conditions, MAD-Sketch can reduce space complexity at each node from O(m) to O(log m), and achieve similar savings in time complexity as well. We support our analysis through experiments on multiple real world datasets. We observe that MAD-Sketch achieves performance similar to existing state-of-the-art graph-based SSL algorithms, while requiring a smaller memory footprint and at the same time achieving up to 10x speedup. We find that MAD-Sketch is able to scale to datasets with one million labels, which is beyond the scope of existing graph-based SSL algorithms.
Joint work with William Cohen (CMU).
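For readers unfamiliar with the Count-min Sketch mentioned above, the following is a minimal generic sketch of the data structure, not the MAD-Sketch implementation. The class name, the use of `hashlib.blake2b` for the row hashes, and the chosen width and depth are all assumptions for illustration. The idea is to store label scores in a small `depth` by `width` table instead of one slot per label.

```python
import hashlib


class CountMinSketch:
    """Generic Count-min Sketch over string keys with non-negative updates."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0.0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One hash function per row, derived by salting the key with the row id.
        digest = hashlib.blake2b(f"{row}:{key}".encode(), digest_size=8).digest()
        return int.from_bytes(digest, "big") % self.width

    def add(self, key, amount=1.0):
        for row in range(self.depth):
            self.table[row][self._index(row, key)] += amount

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum over rows is the
        # tightest available estimate and never underestimates the true total.
        return min(self.table[row][self._index(row, key)]
                   for row in range(self.depth))


# A node's label scores stored in 4 x 64 cells rather than one slot per label.
node_labels = CountMinSketch(width=64, depth=4)
node_labels.add("label_a", 3.0)
node_labels.add("label_b", 1.5)
score = node_labels.estimate("label_a")
```

The memory used per node depends only on `width * depth`, not on the number of distinct labels, which is what makes very large label sets feasible.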
In many modern applications built on massive data and using high-dimensional models, such as web-scale content extraction via topic models, genome-wide association mapping via sparse regression, and image understanding via deep neural networks, one needs to handle BIG machine learning problems that threaten to exceed the limit of current infrastructures and algorithms. While the ML community continues to strive for new scalable algorithms, and several attempts at developing new system architectures for BIG ML have emerged to address the challenge on the backend, good dialogs between ML and systems remain difficult --- most algorithmic research remains disconnected from the real systems and data it is to face; and the generality, programmability, and theoretical guarantees of most systems on ML programs remain largely unclear. In this talk, I will present Petuum -- a general-purpose framework for distributed machine learning, and demonstrate how innovations in scalable algorithms and distributed systems design work in concert to achieve multiple orders of magnitude of scalability on a modest cluster for a wide range of large scale problems in social networks (mixed-membership inference on 40M nodes), personalized genome medicine (sparse regression on 100M dimensions), and computer vision (classification over 20K labels), with provable guarantees on the correctness of distributed inference.
In recent years, with the advancement of large-scale data acquisition technology in various engineering, scientific, and socio-economical domains, traditional machine learning and statistical methods have started to struggle with massive amounts of increasingly high-dimensional data. Luckily, in many problems there is a simple structure underlying high-dimensional data that can be exploited to make learning feasible.
In this talk, I will focus on the problem of detection and localization of a contiguous block of weak activation in a large matrix, from a small number of noisy, possibly adaptive, compressive measurements. This is closely related to the problem of compressed sensing, where the task is to estimate a sparse vector using a small number of linear measurements. Contrary to results in compressed sensing, where it has been shown that neither adaptivity nor contiguous structure help much, we show that for reliable localization the magnitude of the weakest signals is strongly influenced by both structure and the ability to choose measurements adaptively while for detection neither adaptivity nor structure reduce the requirement on the magnitude of the signal. We characterize the precise tradeoffs between the various problem parameters, the signal strength and the number of measurements required to reliably detect and localize the block of activation. The sufficient conditions are complemented with information theoretic lower bounds.
Joint work with Sivaraman Balakrishnan, Alessandro Rinaldo and Aarti Singh.
Human language is the result of cognitive processes whose contours are---at best---incompletely understood. Given the incomplete information we have about the processes involved, the frequently disappointing results obtained from attempts to use unsupervised learning to uncover latent linguistic structures (e.g., part-of-speech sequences, syntax trees, or word alignments in parallel data) can be attributed---in large part---to model misspecification.
This work introduces a novel framework for unsupervised learning of structured predictors with overlapping, global features. Each input's latent representation is predicted conditional on the observable data using a feature-rich conditional random field. Then a reconstruction of the input is generated, conditional on the latent structure, as drawn from cheaply-estimated multinomials. The autoencoder structure enables efficient inference without unrealistic independence assumptions, enabling us to incorporate the often conflicting, overlapping theories (in the form of hand-crafted features) about how latent structures relate to observed data in a coherent model. We contrast our approach with traditional joint unsupervised models that are learned to maximize the marginal likelihood of observed data. We show competitive results with instantiations of the model for two canonical NLP tasks: part-of-speech induction and bitext word alignment, and show that training our model is substantially more efficient than training feature-rich models.
This is joint work with Waleed Ammar and Noah Smith.
Modern applications awaiting next generation machine intelligence systems have posed unprecedented scalability challenges. These scalability needs arise from at least two aspects: 1) massive data volume, such as societal-scale social graphs with up to hundreds of millions of nodes; and 2) massive model size, such as the Google Brain deep neural network containing billions of parameters. Although there exist means and theories to support reductionist approaches like subsampling data or using small models, there is an imperative need for sound and effective distributed ML methodologies for users who cannot be well-served by such shortcuts. To this end, we propose a parameter server system for distributed ML, which follows a Stale Synchronous Parallel (SSP) model of computation that maximizes the time computational workers spend doing useful work on ML algorithms, while still providing correctness guarantees. The parameter server provides an easy-to-use shared interface for read/write access to an ML model's values (parameters and variables), and the SSP model allows distributed workers to read older, stale versions of these values from a local cache, instead of waiting to get them from a central storage. This significantly increases the proportion of time workers spend computing, as opposed to waiting. Furthermore, the SSP model ensures ML algorithm correctness by limiting the maximum age of the stale values. We provide a proof of correctness under SSP, as well as empirical results demonstrating that the SSP model achieves faster algorithm convergence on several different ML problems, compared to fully-synchronous and asynchronous schemes.
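The bounded-staleness idea behind SSP can be sketched in a few lines. This is a single-process toy model, not the parameter server itself: the class `SSPClock`, its method names, and the exact form of the blocking condition are assumptions for illustration. Each worker has a clock counting its finished iterations, and a worker may keep computing on cached values only while it is at most `staleness` iterations ahead of the slowest worker.

```python
class SSPClock:
    """Toy model of the Stale Synchronous Parallel progress rule."""

    def __init__(self, num_workers, staleness):
        self.clocks = [0] * num_workers  # completed iterations per worker
        self.staleness = staleness       # maximum allowed lead over the slowest

    def tick(self, worker):
        """Record that `worker` finished one iteration."""
        self.clocks[worker] += 1

    def can_proceed(self, worker):
        """True if `worker` may start another iteration without waiting.

        Assumed rule: the worker's clock may exceed the slowest worker's
        clock by at most `staleness`.
        """
        return self.clocks[worker] - min(self.clocks) <= self.staleness


clock = SSPClock(num_workers=2, staleness=1)
clock.tick(0)                  # worker 0 is one iteration ahead: still allowed
first = clock.can_proceed(0)   # True
clock.tick(0)                  # worker 0 is now two ahead: must wait
second = clock.can_proceed(0)  # False
clock.tick(1)                  # the slow worker catches up by one
third = clock.can_proceed(0)   # True
```

With `staleness=0` this rule reduces to fully synchronous execution, and with an unbounded `staleness` it behaves like a fully asynchronous scheme, which are the two baselines named in the abstract.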
Being able to effectively model latent structure in data is a key challenge in modern AI research, particularly in Natural Language Processing (NLP) where it is crucial to discover and leverage syntactic and semantic relationships that may not be explicitly annotated in the training set. Unfortunately, while incorporating latent variables to represent hidden structure can substantially increase representation power, the key problems of model design and learning become significantly more complicated. For example, unlike fully observed models, latent variable models can suffer from non-identifiability, making it difficult to distinguish the desired latent structure from the others. Moreover, learning is usually formulated as a non-convex optimization problem, leading to the use of local search methods that may become trapped in local optima.
In this talk, we take a different perspective and approach two key problems in NLP, unsupervised parsing and language modeling, through the lens of linear algebra. By exploiting the connection between hidden variables and low rank factorization, we propose a method for unsupervised constituent parsing that has theoretical guarantees on latent structure recovery. Empirically, our approach compares favorably to the Constituent Context Model of Klein and Manning (2002) without the need for careful initialization.
In our on-going work in language modeling, we leverage matrix factorization to generalize existing n-gram language models to non-integer n. Our method includes existing smoothing methods such as Absolute Discounting and Kneser Ney as special cases, and gives us noticeable improvements in perplexity over state-of-the-art Kneser Ney baselines.
This is joint work with Shay Cohen, Avneesh Saluja, Chris Dyer, and Eric Xing.
In this talk I will try to convince all the fervent believers in proximal gradient methods, ADMM, or coordinate descent that there is a better method for optimizing general (smooth) objectives with an L1 penalty: Newton coordinate descent. The method is the current state-of-the-art in tasks like sparse inverse covariance estimation, and I will highlight two examples from my group's work that use this approach to achieve substantial speedups over existing algorithms. In particular, I will discuss how we use this algorithm to learn sparse Gaussian conditional random field models (applied to energy forecasting), and to design sparse optimal control laws (applied to distributed control in a smart grid).
Being part of the Machine Learning team at Amazon is one of the most exciting engineering job opportunities in the world today. Our Machine Learning (ML) team is comprised of technical leaders with different backgrounds who create and develop novel and infinitely-scalable applications that optimize Amazon's systems using cutting edge machine learning techniques. We develop innovative algorithms that model patterns within data to drive automated decisions at scale in all corners of the company, including our e-Commerce site and subsidiaries, Amazon Web Services, Seller and Buyer Services and Digital Media including Kindle. In this talk, I will give you an overview of some of the technical challenges we face and I will present some examples of ML-based solutions that had a significant impact inside the company.
Glenn Fung received a B.S. in pure mathematics from Universidad Lisandro Alvarado in Barquisimeto, Venezuela. He then earned an M.S. in applied mathematics from Universidad Simon Bolivar, Caracas, Venezuela, where he later worked as an assistant professor for two years. He also earned an M.S. degree and a Ph.D. degree in computer sciences from the University of Wisconsin-Madison. His main interests are optimization approaches to machine learning and data mining, with emphasis on kernel methods. In the summer of 2003 he joined the computer aided diagnosis group at Siemens Healthcare in Malvern, PA where he worked for 10 years developing and applying novel machine learning techniques to solve challenging problems that arise in the medical domain. In September of 2013 he joined Amazon, where he has been working on applying novel machine learning techniques to solve challenging problems that arise in e-commerce retail.
Determinantal Point Processes (DPPs) are random point processes well-suited for modelling repulsion. In machine learning and statistics, DPPs are a natural model for subset selection problems where diversity is desired. For example, they can be used to select diverse sets of sentences to form document summaries, or to return relevant but varied text and image search results, or to detect non-overlapping multiple object trajectories in video. Among many remarkable properties, they offer tractable algorithms for exact inference, including computing marginals, computing conditional probabilities, and sampling. In our recent work, we extended these algorithms to approximately infer non-linear DPPs defined over a large amount of data, as well as DPPs defined on continuous spaces using low-rank approximations. We demonstrated the advantages of our models on several machine learning and statistical tasks: motion capture video summarization, repulsive mixture modelling and synthesizing diverse human poses. Given time, I will also briefly touch on our other related work, such as extending DPPs into a temporal process that sequentially selects multiple diverse subsets across time, and how we go about learning the parameters of a DPP kernel. This is joint work with Emily Fox, Ben Taskar and Alex Kulesza.
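The repulsion property can be seen in the smallest possible case. For a DPP with marginal kernel K, the probability that a set A is contained in the random subset is det(K_A); for a pair {i, j} with symmetric K this is K_ii K_jj - K_ij^2. The sketch below uses a made-up 3 x 3 kernel (the matrix values and the function name `pair_inclusion_probability` are assumptions for illustration) in which items 0 and 1 are similar and item 2 is unrelated to both.

```python
def pair_inclusion_probability(K, i, j):
    """P(both i and j are selected) = det of the 2 x 2 submatrix of K."""
    return K[i][i] * K[j][j] - K[i][j] * K[j][i]


# Hypothetical marginal kernel: the diagonal holds each item's inclusion
# probability, and off-diagonal entries encode similarity between items.
K = [
    [0.5, 0.4, 0.0],
    [0.4, 0.5, 0.0],
    [0.0, 0.0, 0.5],
]

similar_pair = pair_inclusion_probability(K, 0, 1)    # about 0.09
unrelated_pair = pair_inclusion_probability(K, 0, 2)  # 0.25
independent_baseline = K[0][0] * K[1][1]              # 0.25
```

The similar pair is selected together far less often than it would be under independent selection, while the unrelated pair matches the independent baseline. This is the sense in which a DPP favors diverse subsets.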
We study the problem of learning in the presence of a drifting target concept. Specifically, we provide bounds on the expected number of mistakes on a sequence of i.i.d. points, labeled according to a target concept that can change by a given amount on each round. Some of the results also describe an active learning variant of this setting, and provide bounds on the number of queries for the labels of points in the sequence sufficient to obtain the stated bounds on the number of mistakes.
This is joint work with Steve Hanneke and Varun Kanade.
We develop upper and lower bounds for the probability of Boolean functions by treating multiple occurrences of variables as independent and assigning them new individual probabilities. We call this approach dissociation and give an exact characterization of optimal oblivious bounds, i.e. when the new probabilities are chosen independent of the probabilities of all other variables. Our motivation comes from the weighted model counting problem (or, equivalently, the problem of computing the probability of a Boolean function), which is #P-hard in general. By performing several dissociations, one can transform a Boolean formula whose probability is difficult to compute, into one whose probability is easy to compute, and which is guaranteed to provide an upper or lower bound on the probability of the original formula by choosing appropriate probabilities for the dissociated variables. Our new bounds shed light on the connection between previous relaxation-based and model-based approximations and unify them as concrete choices in a larger design space. In the second part of the talk, we focus on the problem of query evaluation over probabilistic databases. We show how our theory allows a standard relational database management system (DBMS) to both upper and lower bound hard probabilistic queries in guaranteed polynomial time.
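A tiny worked instance may help fix the idea of dissociation. The formula, the probabilities, and the choice to give each copy the original variable's probability are all assumptions made for this illustration; the talk's characterization of optimal bounds is more general. Take (x AND y) OR (x AND z), where x occurs twice, and replace the two occurrences by independent copies x1 and x2. Brute-force enumeration over all assignments then shows that, for this formula, the dissociated probability bounds the exact one from above.

```python
from itertools import product


def probability(formula, probs):
    """Exact probability of a Boolean formula over independent variables,
    computed by enumerating every truth assignment (exponential time)."""
    names = sorted(probs)
    total = 0.0
    for values in product([False, True], repeat=len(names)):
        world = dict(zip(names, values))
        if formula(world):
            weight = 1.0
            for name in names:
                weight *= probs[name] if world[name] else 1.0 - probs[name]
            total += weight
    return total


def original(w):
    # x occurs in both conjuncts.
    return (w["x"] and w["y"]) or (w["x"] and w["z"])


def dissociated(w):
    # Each occurrence of x is replaced by its own independent copy.
    return (w["x1"] and w["y"]) or (w["x2"] and w["z"])


exact = probability(original, {"x": 0.5, "y": 0.5, "z": 0.5})
upper = probability(dissociated, {"x1": 0.5, "x2": 0.5, "y": 0.5, "z": 0.5})
print(exact, upper)  # prints 0.375 0.4375
```

The dissociated formula is a disjunction of conjuncts that share no variables, so its probability can also be computed directly as 1 - (1 - 0.25)(1 - 0.25) = 0.4375 without enumeration, which is the kind of easy computation the abstract refers to.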
Equipping machines with knowledge, through the construction of machine-readable knowledge bases, presents a key asset for semantic search, question answering and other applications. In this talk, I will present methods for knowledge base construction. The first is a method for fact extraction. The second is a method for generating a large collection of binary relations organized by synonym sets and into a taxonomy based on relation subsumptions. The third is a method for discovering and semantically typing new entities as they emerge in dynamic Web sources such as news articles and social media.
Over the past few years, I have been developing and deploying interactive crowd-powered systems that solve characteristic "hard" problems to help people get things done in their everyday lives. For instance, VizWiz answers visual questions for blind people in less than a minute, Legion drives robots in response to natural language commands, Chorus holds helpful conversations with human partners, and Scribe converts streaming speech to text in less than five seconds.
The future envisioned by my research is one in which the intelligent systems that we have dreamed about for decades, and that have inspired generations of computer scientists since the field's beginning, are brought about for the benefit of people. My work illustrates a path for achieving this vision by leveraging the on-demand labor of people to fill in for components that we cannot currently automate, and by building frameworks that allow groups to do together what even expert individuals cannot do alone. A crowd-powered world may seem counter to the goals of computer science, but I believe that it is precisely by creating and deploying the systems of our dreams that we will learn how to advance computer science to create the machines that will someday realize them.
A fundamental challenge in robotics is the so-called "critter problem": a robot, capable of performing actions and receiving observations, is placed in an unknown environment. The robot has no interpretation for its actions or observations and no knowledge of the structure of the environment. The problem is to program the robot to learn about its observations, actions, and environment well enough to make predictions of future observations given sequences of actions.
This modeling problem is especially challenging given the wide variety of sensors, actuators, and domains that are encountered in modern robotics. As a result, researchers have largely abandoned the critter problem and have instead focused on leveraging extensive domain knowledge to develop special-purpose tools for special cases of the problem: e.g., system identification to learn Kalman filters, body schema learning for manipulators, structure-from-motion in vision, or the many methods for simultaneous localization and mapping.
In this talk I will revisit the original critter problem from a modern machine learning perspective. I will discuss how spectral learning algorithms can unify disparate tools and special cases encountered in different sub-areas of robotics into a single general-purpose toolkit. Finally, I will show how spectral methods have achieved state-of-the-art performance on real-world robotics problems in several different domains.
Bayes nets are not only useful for developing AI systems, but can also help to explain how humans reason under uncertainty. I will present a Bayes net framework for reasoning about multiple causal systems and will apply it to two aspects of human reasoning. The first application considers how people make inferences that depend on both causal relationships between features (e.g. animals with wings often fly) and similarity relationships between objects (e.g. eagles and hawks are rather similar). The second application explores how people make counterfactual inferences, or inferences about scenarios that differ from the real world in some respect.
Charles Kemp is an associate professor in CMU's psychology department. His research focuses on probabilistic models of human learning and inference.
In this talk, I will be summarizing our work on enabling robots to produce motion that is suitable for human-robot collaboration and co-existence. Most motion in robotics is purely functional: industrial robots move to package parts, vacuuming robots move to suck dust, and personal robots move to clean up a dirty table. This type of motion is ideal when the robot is performing a task in isolation. Collaboration, however, does not happen in isolation. In collaboration, the robot's motion has an observer, watching and interpreting the motion. In this work, we move beyond functional motion, and introduce the notion of an observer into motion planning, so that robots can generate motion that is mindful of how it will be interpreted by a human collaborator. We formalize predictability and legibility as properties of motion that naturally arise from the inferences that the observer makes, drawing on action interpretation theory in psychology. We propose models for these inferences based on the principle of rational action, and use a combination of constrained trajectory optimization and machine learning techniques to enable robots to plan motion for collaboration.
Bio: Anca Dragan is a PhD candidate at Carnegie Mellon's Robotics Institute, and a member of the Personal Robotics Lab. She was born in Romania and received her B.Sc. in Computer Science from Jacobs University Bremen in 2009. Her research lies at the intersection of robotics, machine learning, and human-computer interaction: she is interested in enabling robots to seamlessly work with and around people. Anca is an Intel PhD Fellow for 2013, a Google Anita Borg Scholar for 2012, and serves as General Chair in the Quality of Life Technology Center's student council.
Many models in machine learning, computer vision or speech processing have the form of a sequence of nested, parameterized functions, such as a multilayer neural net, an object recognition pipeline, or a "wrapper" for feature selection. Joint estimation of the parameters of all the layers and selection of an optimal architecture is widely considered to be a difficult numerical nonconvex optimization problem, difficult to parallelize for execution in a distributed computation environment, and requiring significant human expert effort, which leads to suboptimal systems in practice. We describe a general mathematical strategy to learn the parameters and, to some extent, the architecture of nested systems, called the method of auxiliary coordinates (MAC). MAC has provable convergence, is easy to implement reusing existing algorithms for single layers, can be parallelized trivially and massively, applies even when parameter derivatives are not available or not desirable (so computing gradients with the chain rule does not apply), and is competitive with state-of-the-art nonlinear optimizers even in the serial computation setting, often providing reasonable models within a few iterations.
If time permits, I will illustrate how to use MAC to derive training algorithms for a range of problems, such as deep nets, best-subset feature selection, joint dictionary and classifier learning, supervised dimensionality reduction, and others.
This is joint work with Weiran Wang.
BIOGRAPHY: Miguel Á. Carreira-Perpinan is an associate professor in Electrical Engineering and Computer Science at the University of California, Merced. He received the degree of "licenciado en informática" (MSc in computer science) from the Technical University of Madrid in 1995 and a PhD in computer science from the University of Sheffield in 2001. Prior to joining UC Merced, he did postdoctoral work at Georgetown University (in computational neuroscience) and the University of Toronto (in machine learning), and was an assistant professor at the Oregon Graduate Institute (Oregon Health and Science University). He is the recipient of an NSF CAREER award, a Google Faculty Research Award and a best student paper award at Interspeech. He is an associate editor for the IEEE Transactions on Pattern Analysis and Machine Intelligence and an area chair for NIPS. His research interests lie in machine learning, in particular unsupervised learning problems such as dimensionality reduction, clustering and denoising, with an emphasis on optimization aspects, and with applications to speech processing (e.g. articulatory inversion and model adaptation), computer vision, sensor networks and other areas.
While machine learning plays a significant role in the community of practice known as "computational social science", one area ripe for interdisciplinary work but fraught with its own challenges is the humanities, which encompass such domains as English, Literary Studies, History and Archaeology (among many others). These areas have a long history of engaging with quantitative and computational methods (pre-dating modern notions of the "digital humanities"), and offer a fascinating, complex proving ground for classic ML problems of learning and inference.
In this talk, I will discuss recent and ongoing work into two probabilistic latent variable models that fall in this domain: the first is a model for inferring character types (or "personas") in text, where a "persona" is defined as a set of mixtures over fine-grained latent lexical classes. These lexical classes capture the stereotypical actions of which a character is the agent and patient (villains "kill" and "are foiled"), as well as the attributes by which they are described (e.g., "evil"); I present results applying this model to a collection of movie plot summaries.
The second model addresses the problem of jointly inferring the identity and social rank of members of an Old Assyrian trade network from the 2nd millennium BCE, leveraging evidence in the form of local, partial ranks over observed name mentions in cuneiform tablets to learn a global rank over (latent) individuals.
Researchers from both CS and the humanities are welcome.
The rapid growth in the size and scope of datasets in science and technology has created a need for novel foundational perspectives on data analysis that blend the statistical and computational sciences. That classical perspectives from these fields are not adequate to address emerging problems in "Big Data" is apparent from their sharply divergent nature at an elementary level---in computer science, the growth of the number of data points is a source of "complexity" that must be tamed via algorithms or hardware, whereas in statistics, the growth of the number of data points is a source of "simplicity" in that inferences are generally stronger and asymptotic results can be invoked. Indeed, if data are a data analyst's principal resource, why should more data be burdensome in some sense? Shouldn't it be possible to exploit the increasing inferential strength of data at scale to keep computational complexity at bay? I present three research vignettes that pursue this theme, the first involving the deployment of resampling methods such as the bootstrap on parallel and distributed computing platforms, the second involving large-scale matrix completion, and the third introducing a methodology of "algorithmic weakening," whereby hierarchies of convex relaxations are used to control statistical risk as data accrue.
[Joint work with Venkat Chandrasekaran, Ariel Kleiner, Lester Mackey, Purna Sarkar, and Ameet Talwalkar].
The internet has revolutionized the way we communicate, leading to a constant flood of informal text available in electronic format, including: email, Twitter, SMS and the clinical text found in electronic medical records. This presents a big opportunity for Natural Language Processing (NLP) and Information Extraction (IE) technology to enable new large scale data-analysis applications by extracting machine-processable information from unstructured text at scale.
In this talk I will discuss several challenges and opportunities which arise when applying NLP and IE to informal text, focusing specifically on Twitter, which has recently risen to prominence, challenging the mainstream news media as the dominant source of realtime information on current events. I will describe several NLP tools we have adapted to handle Twitter's noisy style, and present a system which leverages these to automatically extract a calendar of popular events occurring in the near future.
I will further discuss fundamental challenges which arise when extracting meaning from such massive open-domain text corpora. Several probabilistic latent variable models will be presented, which are applied to infer the semantics of large numbers of words and phrases and also enable a principled and modular approach to extracting knowledge from large open-domain text corpora.
Large amounts of data are routinely collected in the Electronic Medical Record yet their use in informing patient care is limited to manual assessment by the caregivers. In this talk, we discuss probabilistic approaches motivated by clinical practice for analyzing continuously measured physiologic data. We develop our models on data collected from instrumenting a neonatal intensive care unit. Based on insights derived from modeling this data, we tackle the application of risk stratification. As part of routine care, every infant at birth is risk stratified based on the Apgar score. We develop a cheap, non-invasive and simple risk stratification tool using markers from physiologic data for predicting infants at risk, dubbed by Science news as the modern electronic Apgar.
Bio: Suchi Saria is an Assistant Professor at Johns Hopkins University within the Schools of Engineering and Public Health. She received her PhD in machine learning from Stanford University. Her research interests span computational modeling of diverse, large temporal data, and in particular those from sensing devices and the electronic health record for improving patient management. Her work on predictive modeling for clinical data from infants has been covered by numerous national and international press sources. She is the recipient of multiple awards including a best student, a best paper finalist, Microsoft Full Scholarships, Rambus Corporation fellowship, an NSF Computing Innovation fellowship, and a Gordon and Betty Moore foundation award.
Regularized ERM is the engine driving numerous machine learning techniques such as SVMs and the LASSO, and is to thank for many modern results on fast optimization, model recovery, and noise tolerance. How then is it that the popular algorithm AdaBoost not only corresponds to an unregularized ERM problem, but moreover AdaBoost's optimization problem fails many of the basic sanity checks --- e.g., existence of minimizers --- provided by regularization?
This talk bridges this apparent rift by presenting the two theoretical guarantees substantiating the strong performance of AdaBoost --- convergence to the Bayes predictor in the general case, and fast convergence in the margin / weak-learnable case --- in a way that emphasizes the connection to regularization. In particular, in lieu of an algorithm-specified regularization parameter, the data itself constrains the behavior of the algorithm.
(Technical note: some results will focus on Lipschitz losses (e.g., the logistic loss), but the exponential loss will be discussed throughout.)
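As a worked illustration of why minimizers can fail to exist (the notation here is assumed for the example, not taken from the talk): write $h_j$ for the weak learners, $\lambda_j$ for their weights, and $y_i \in \{-1,+1\}$ for the labels, and view AdaBoost as minimizing the unregularized empirical exponential risk. If some $\bar\lambda$ attains margin $y_i \sum_j \bar\lambda_j h_j(x_i) \ge \gamma > 0$ on every training point, then scaling it by $c$ gives risk at most $e^{-c\gamma}$, which tends to zero but is never zero.

```latex
\[
  \widehat{R}(\lambda) \;=\; \frac{1}{n}\sum_{i=1}^{n}
     \exp\!\Big(-y_i \sum_{j} \lambda_j h_j(x_i)\Big),
  \qquad
  \widehat{R}(c\,\bar\lambda) \;\le\; e^{-c\gamma} \xrightarrow[c\to\infty]{} 0,
  \qquad
  \widehat{R}(\lambda) > 0 \ \text{ for every finite } \lambda .
\]
```

So in the weak-learnable case the infimum is zero and is not attained, which is the kind of sanity check that an explicit regularizer would normally restore.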
In the first part of this talk, we investigate information validation tasks that are initiated as queries from either automated agents or humans. We introduce OpenEval, a new online information validation technique, which uses information on the web to automatically evaluate the truth of queries that are stated as multi-argument predicate instances (e.g., DrugHasSideEffect(Aspirin, GI Bleeding)). OpenEval takes a small number of instances of a predicate as seed positive examples and automatically learns how to evaluate the truth of a new predicate instance by querying the web and processing the retrieved unstructured web pages. We show that OpenEval is able to respond to queries within a limited amount of time while also achieving a high F1 score. In addition, we show that the accuracy of the responses provided by OpenEval increases as more time is given for evaluation.
In the second part of the talk, we explain how OpenEval can be used to provide knowledge to anytime intelligent agents, in particular for a find-deliver task in a real mobile robot (CoBot), for a trip planner agent, and for the knowledge-on-demand system which is part of the NELL project.
These projects are joint work with Manuela Veloso, Manuel Blum, Thomas Kollar, and Tom Mitchell.
In many real-world applications, data come as discrete metric spaces sampled around 1-dimensional linear structures that can be seen as metric trees or graphs. In this talk we will consider the reconstruction problem of such filamentary structures from data sampled around them. We will present two reconstruction methods. The first one comes with topological and metric guarantees but relies on parameter choices that might be tricky. The second one comes with metric guarantees but is much less sensitive to parameter choices and can be very efficiently implemented.
These are joint works with M. Aanjaneya, F. Chazal, D. Chen, M. Glisse, L. J. Guibas, D. Morozov (Stanford University) and with Jian Sun (Mathematical Sciences Center, Tsinghua University).
Models of prediction from text have seen an increase in interest with the rise of Social Media. Applications range from predicting political sentiment to financial indicators and flu rates. Most approaches treat the problem as linear regression, learning to relate word frequencies to the response variable.
In this talk I will present a bilinear approach to text regression with a sparse regulariser. This model learns a sparse representation of two distinct sets of variables, in our case the words and the users. Our method is inspired by the fact that most Social Media users' posts are not indicative for our prediction task. Further, we explore a multi-task learning approach in order to exploit the relationship between the output variables.
Uniformity is a parsimonious description of randomness. Unfortunately, a probability measure can come nowhere close to approximating uniformity on unbounded sets. This talk takes up that problem in the setting of the integers. By replacing the countable additivity axiom of probability with finite additivity, one can assign to each integer the same probability. Because assigning to each integer the same probability does not uniquely define a uniform distribution, various additional properties have been proposed. In this talk, three such properties will be reviewed, and a new one introduced. It turns out these notions can be ordered so that one implies the next. While stronger notions may be required to attain uniqueness for certain applications, the weaker notions have the virtue of being more parsimonious and interpretable. Moreover, stronger properties may be decomposed into weaker ones, and the weaker notions may have special number theoretic properties. For instance, the family of uniform distributions introduced in this talk may assign positive probability to the set of prime numbers.
In some reinforcement learning problems an agent may be provided with a set of input policies, perhaps learned from prior experience or provided by advisors. We present a reinforcement learning with policy advice (RLPA) algorithm which leverages this input set and learns to use the best policy in the set for the reinforcement learning task at hand. We prove that RLPA has a sublinear regret of $\widetilde O(\sqrt{T})$ relative to the best input policy, and that both this regret and its computational complexity are independent of the size of the state and action space. Our empirical simulations support our theoretical analysis. This suggests RLPA may offer significant advantages in large domains where some prior good policies are provided.
Nowadays, the scale of graph data that needs to be processed is massive. For example, in the context of online services, the Web graph amounts to at least one trillion links, and Facebook recently reported more than 1 billion users and 140 billion friend connections.
In the first part of the talk, we will discuss the balanced graph partitioning problem in the context of big dynamic graph data, a key problem for efficiently solving a wide range of computational tasks. There exist two widely used families of heuristics for graph partitioning in the streaming setting: place the newly arrived vertex in the cluster with the largest number of neighbors, or in the cluster with the least number of non-neighbors. We will present a framework that unifies these two seemingly orthogonal approaches and allows us to interpolate between them, obtaining superior performance. Surprisingly, this performance is even comparable to that of non-streaming algorithms. For instance, for the Twitter graph with more than 1.4 billion edges, our method partitions the graph in about 40 minutes, achieving a balanced partition that cuts as few as 6.8% of edges, whereas METIS took more than 8.5 hours to produce a balanced partition that cuts 11.98% of edges. Finally, we provide an $O(\log k / k)$-approximation algorithm and we show experimentally using Apache Giraph that the partitions we obtain result in significant gains in terms of communication cost and runtime.
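The two streaming heuristics above can be sketched as a single one-pass rule with a tunable size penalty. The score used below, neighbors already in the part minus `alpha * gamma * size ** (gamma - 1)`, is an assumed illustrative interpolation and not necessarily the exact objective from the talk: with `gamma = 1` the penalty is constant and the rule reduces to "most neighbors", while `gamma = 2, alpha = 0.5` penalizes by part size and reduces to "fewest non-neighbors". The function names and the hard `capacity` cap are likewise hypothetical.

```python
def stream_partition(stream, k, capacity, alpha=1.0, gamma=1.5):
    """Assign each arriving vertex to one of k parts in a single pass.

    stream: iterable of (vertex, neighbors) pairs in arrival order.
    Requires k * capacity >= number of vertices so some part always has room.
    """
    part_of = {}
    sizes = [0] * k
    for v, nbrs in stream:
        # Count how many already-placed neighbors of v sit in each part.
        counts = [0] * k
        for u in nbrs:
            if u in part_of:
                counts[part_of[u]] += 1
        best, best_score = None, None
        for i in range(k):
            if sizes[i] >= capacity:
                continue  # hard balance constraint
            score = counts[i] - alpha * gamma * sizes[i] ** (gamma - 1)
            if best_score is None or score > best_score:
                best, best_score = i, score
        part_of[v] = best
        sizes[best] += 1
    return part_of

def cut_edges(edges, part_of):
    """Number of edges whose endpoints land in different parts."""
    return sum(1 for u, v in edges if part_of[u] != part_of[v])

# Hypothetical toy graph: two triangles joined by the single edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
adj = {v: set() for v in range(6)}
for u, v in edges:
    adj[u].add(v)
    adj[v].add(u)

parts = stream_partition(((v, adj[v]) for v in range(6)), k=2, capacity=3, alpha=0.5)
```

On this toy input the rule recovers the two triangles and cuts only the bridging edge. Real streams are far less forgiving, and the arrival order matters; this sketch is only meant to show the shape of the decision made per vertex.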
In the second half of the talk we will discuss the problem of finding dense subgraphs in large-scale graphs. We introduce a general framework which subsumes popular density functions and provide theoretical insights into our framework. We introduce a special instance of our framework as a density function which favors small, dense, small-diameter subgraphs. We provide an additive approximation algorithm, and a multiplicative approximation algorithm with tight guarantees. We also develop applications of our method in data mining and bioinformatic tasks, such as forming a successful team of domain experts and finding highly correlated genes from a microarray dataset.
Joint work with Christos Gkantsidis (MSR Research), Bozidar Radunovic (MSR Research), Milan Vojnovic (MSR Research) and Francesco Bonchi (Yahoo! Research), Aris Gionis (Aalto University), Francesco Gullo (Yahoo! Research), Maria Tsiarli (University of Pittsburgh).
Much work in optimal control and inverse control has assumed that the controller has perfect knowledge of plant dynamics. However, if the controller is a human or animal subject, the subject's internal dynamics model may differ from the true plant dynamics. Here, we consider the problem of learning the subject's internal model from demonstrations of control and knowledge of task goals. Due to sensory feedback delay, the subject uses an internal model to generate an internal prediction of the current plant state, which may differ from the actual plant state. We develop a probabilistic framework and exact EM algorithm to jointly estimate the internal model, internal state trajectories, and feedback delay. We applied this framework to demonstrations by a nonhuman primate of brain-machine interface (BMI) control. We discovered that the subject's internal model deviated from the true BMI plant dynamics and provided significantly better explanation of the recorded neural control signals than did the true plant dynamics.
Story comprehension is a rich and rapid phenomenon, requiring multiple simultaneous processes (e.g. letter recognition, word understanding, sentence parsing...). Our goal is to study how the brain processes this complex information, by modeling the fMRI brain activity during story reading at a close to normal speed. This is a challenging goal, in part because of the coarse time-resolution of fMRI and the lack of a comprehensive model of word meaning composition. Classically, fMRI has been used to localize brain areas that process specific elements of text, e.g. which areas are involved in syntactic processing, but not to model how the brain represents different instances of these elements, e.g. how those areas represent different syntactic structures.
We created a generative model that predicts the fMRI activity created when subjects read a complex story where the words are presented in a serial manner, for 0.5 seconds each. We then performed an exploratory analysis in which we tested several types of story features (e.g. word length, syntax, semantics, story characters) to search for a good basis of features for story comprehension. We found different patterns of representation in the brain for different types of features. These patterns align with the predictions from the field. To test the validity of our model, we performed a classification task that decodes what passage of the story a time segment of brain activity corresponds to. The classification accuracy was significantly higher than chance (p < 10^-6). Our approach has the advantage of being flexible: any feature of language can be added to the model and tested, and features can range from simple perceptual features, to compositional semantics, to higher order reasoning about narrative structure and story comprehension.
Stochastic convex optimization (SCO) under the first order oracle model deals with approximately optimizing a convex function over a convex set, given access to noisy function and gradient values at any point, using as few queries as possible. Active learning of one-dimensional threshold (ALT) classifiers is a classification problem that deals with approximately locating a "threshold" on a subinterval of the real line (a point to the left of which labels are more likely to be negative, and to the right of which labels are more likely to be positive), given access to an oracle that returns these noisy labels, using as few queries as possible.
Exploiting the sequential nature of both problems, we establish a concrete similarity between the "Tsybakov Noise Condition" from ALT theory and the "Strong/Uniform Convexity Condition" from SCO theory, and show how information-theoretic lower-bound techniques from ALT can be used to get very similar lower bound rates in SCO (and show these are tight with matching upper bounds). Time permitting, I will also show a kind of algorithmic reduction from SCO to ALT for strongly/uniformly convex functions, as well as a new adaptive ALT algorithm that was inspired by a recent adaptive SCO algorithm.
Estimation methods in high-dimensional linear models, as studied in compressed sensing and sparse linear regression, basically rely on minimization of a squared error subject to a sparsity constraint. In the first part of this talk we examine the problem of sparsity-constrained minimization of non-quadratic objective functions that can arise in models with non-linearities. We propose a greedy algorithm for these problems and prove its accuracy for objectives with a "stable restricted Hessian" or "stable restricted linearization".
In the second part of the talk, we inspect the non-convex $\ell_p$-constrained least squares. In particular, we obtain accuracy guarantees for the projected gradient descent method under the restricted isometry property. We further discuss the implication of the result and derive necessary conditions for projection onto an $\ell_p$ ball.
A common classifier for unlabeled nodes on undirected graphs uses label propagation from the labeled nodes, a.k.a. the harmonic predictor on Gaussian random fields (GRFs). For active learning on GRFs, the commonly used V-optimality criterion queries nodes that reduce the L2 (regression) loss. It has a submodularity property, which implies that greedy application achieves a (1 - 1/e) approximation of the globally optimal solution. However, L2 loss may not characterize the true nature of 0/1 loss in classification problems and thus may not be the best choice for active learning.
We propose a new criterion we call Sigma-optimality, which queries the node that minimizes the sum of the elements in the predictive covariance. Theoretically, we extend submodularity guarantees from V-optimality to Sigma-optimality using properties specific to GRFs. The proofs are interesting because we further show that GRFs have the suppressor-free condition in addition to the conditional independence inherited from Markov random fields.
Sigma-optimality directly optimizes the loss of the surveying problem, which is to determine the proportion of nodes belonging to one class. We test Sigma-optimality on several real-world graphs and show that in addition to the surveying problem, it also outperforms V-optimality and expected error reduction on the classification problem.
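The difference between the two criteria can be sketched on a small Gaussian random field. The example assumes a noise-free label query, so that querying node q updates the predictive covariance C to C - C[:, q] C[q, :] / C[q][q]; V-optimality then reduces the trace and Sigma-optimality reduces the sum of all entries. The path graph, the regularizer `delta`, and the helper names are hypothetical choices for illustration only.

```python
def invert(m):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(m)
    a = [row[:] + [1.0 if i == j else 0.0 for j in range(n)] for i, row in enumerate(m)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(a[r][c]))
        a[c], a[p] = a[p], a[c]
        piv = a[c][c]
        a[c] = [x / piv for x in a[c]]
        for r in range(n):
            if r != c:
                f = a[r][c]
                a[r] = [x - f * y for x, y in zip(a[r], a[c])]
    return [row[n:] for row in a]

def next_query(C, criterion):
    """Pick the node whose query most reduces the chosen criterion.

    V-optimality (trace of C) drops by     sum_i C[i][q]**2 / C[q][q].
    Sigma-optimality (sum of C) drops by  (sum_i C[i][q])**2 / C[q][q].
    """
    n = len(C)

    def gain(q):
        col = [C[i][q] for i in range(n)]
        num = sum(x * x for x in col) if criterion == "V" else sum(col) ** 2
        return num / C[q][q]

    return max(range(n), key=gain)

# Hypothetical graph: the path 0-1-2-3-4, with precision = Laplacian + delta * I.
n, delta = 5, 0.1
lap = [[0.0] * n for _ in range(n)]
for u in range(n - 1):
    v = u + 1
    lap[u][u] += 1.0
    lap[v][v] += 1.0
    lap[u][v] -= 1.0
    lap[v][u] -= 1.0
precision = [[lap[i][j] + (delta if i == j else 0.0) for j in range(n)] for i in range(n)]
C = invert(precision)

sigma_choice = next_query(C, "Sigma")
v_choice = next_query(C, "V")
```

On this regularized path graph every column of `C` sums to `1 / delta`, so the Sigma criterion selects the node with the smallest prior variance, which is the central node. In a greedy active learning loop one would apply the covariance update after each query and call `next_query` again.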
Convex optimization is a key tool in computer science, with applications ranging from machine learning to operational research. Due to the fast growth of data sizes, the development of faster algorithms is becoming a more pressing question. This talk aims to discuss several emerging approaches for faster and more accurate optimization algorithms using techniques from combinatorial algorithms, numerical analysis and spectral graph theory.
For quadratic, or L_2 minimization, we will present faster algorithms when the underlying matrix has highly uneven dimensions, or when the problem has graph-like structure. Key to these algorithms are methods for sampling the rows of a matrix that preserve the structure of its outer product. We will also discuss the close connections between L_1 regression and quadratic minimization, and describe some ongoing work in image processing using these ideas.
Nonparametric mixture models based on the Dirichlet process are an elegant alternative to finite models when the number of underlying components is unknown, but inference in such models can be slow. Existing attempts to parallelize inference in such models have relied on introducing approximations, which can lead to inaccuracies in the posterior estimate. In this talk, I will construct auxiliary variable representations for the Dirichlet process and the hierarchical Dirichlet process that facilitate the development of distributed Markov chain Monte Carlo schemes that use the correct equilibrium distribution. Experimental analyses show that this approach allows scalable inference without the deterioration in estimate quality that accompanies existing methods.
This is joint work with Avinava Dubey and Eric Xing.
As large scale methods of measuring gene expression via images have been developed, the question arises: how can we predict gene interaction networks from images? The four major machine learning challenges are (a) how do you deal with multiple images per data source, and (b) multiple data sources, in (c) an unsupervised manner, while analyzing (d) global conditional independencies instead of pairwise interactions.
In this talk, I will describe a two-step solution to this problem. First, we present an algorithm, Gin-IM, for learning interaction networks from images in a single data source. Gin-IM combines multi-instance kernels with recent work in learning sparse undirected graphical models to predict interactions between genes.
Next, we propose NP-MuScL (nonparanormal multi source learning) to estimate a gene interaction network that is consistent with multiple sources of data, having the same underlying relationships between the nodes. NP-MuScL uses the semiparametric Gaussian copula to model the distribution of the different data sources, with the different copulas sharing the same covariance matrix.
We apply our algorithms on Drosophila embryonic ISH images from the Berkeley Drosophila Genome Project. Data from different time steps in Drosophila embryonic development are treated as separate data sources. With spatial gene interactions predicted via Gin-IM, and temporal predictions combined via NP-MuScL, we can finally predict spatiotemporal gene networks from these images.
Parts of this work were presented at ECCV 2012, and will be presented at RECOMB 2013. This is joint work with my advisor Eric P. Xing. No knowledge of biology is needed to follow this talk.
In this talk, I will discuss several interesting data mining / machine learning challenges for making personalized recommendations of TV shows and movies to each user at Netflix, and I will take a deeper dive into two of them. While the presented experiments are based on Netflix data, the challenges and approaches apply also to other domains besides movies.
The objective of our recommender system is to rank movies according to each user's preferences. Users' preferences can be estimated from their feedback data like plays or ratings of movies, among others. While the extreme data sparsity is a well-known problem (each user interacts with only a small fraction of all movies), I will in particular discuss the following two aspects: (1) the feedback data are missing not at random (MNAR), and (2) the distribution of the data is very skewed.
The MNAR nature of the feedback data originates from the fact that users can choose which movies to play or to rate. As a result, the choice of which movies a user interacted with itself carries useful information. From a statistical perspective, however, MNAR data pose interesting challenges, not only for training a recommender system but also for designing meaningful test/validation procedures on such data.
The second aspect addressed in detail in this talk is the skewed distribution of these data: while a few popular movies receive most of the attention, the majority of movies are in the long tail of the popularity distribution. Recommendations from the long tail are generally considered to be particularly valuable for the user, as they are otherwise difficult to discover. On the other hand, recommendation accuracy tends to decrease towards the long tail due to lack of data.
I will discuss machine learning approaches to tackle both challenges, and present empirical evidence that large improvements can be achieved by tackling these key properties of the feedback data.
Harald Steck is a data scientist at Netflix, where he develops personalization algorithms and recommender systems. He has over ten years of experience in machine learning, in particular in graphical models and more recently in recommender systems. He has conducted research at various industrial and academic organizations, including Bell Labs, ETH Zurich in Switzerland, MIT AI Lab, as well as Technical University of Munich in Germany, where he obtained a PhD degree in Computer Science in 2001.
User feedback has become an invaluable source of training data for optimizing recommender systems in a rapidly expanding range of domains, most notably content recommendation (e.g., news, movies, ads). When designing recommender systems that adapt to user feedback, two important challenges arise. First, the system should recommend optimally diversified content that maximizes coverage of the information the user finds interesting (to maximize positive feedback). Second, the system should make appropriate exploratory recommendations in order to learn a reliable model from feedback.
In this talk, I will describe the Linear Submodular Bandits Problem, which is a framework for jointly modeling the utility of a set of recommendations (so as to encourage diversity), as well as the exploration/exploitation trade-off that arises when learning from user feedback. In particular, the utility of a set of recommendations is modeled as a parameterized submodular function, which naturally encodes a notion of diminishing returns that encourages diversity. For this setting, I will also present an online learning algorithm that can efficiently converge to a near-optimal model.
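As a rough illustration of how a parameterized submodular utility encourages diversity, the sketch below uses a weighted probabilistic-coverage function and greedy selection. The coverage form, the topic weights, and the article names are assumptions made for this example, and the exploration/exploitation part of the bandit algorithm is omitted.

```python
def coverage_utility(selected, articles, topic_weights):
    """Weighted probabilistic coverage: linear in the weights, submodular in the set."""
    total = 0.0
    for topic, weight in topic_weights.items():
        prob_missed = 1.0
        for name in selected:
            prob_missed *= 1.0 - articles[name].get(topic, 0.0)
        total += weight * (1.0 - prob_missed)
    return total


def greedy_select(articles, topic_weights, k):
    """Pick k articles, each time adding the one with the largest marginal gain."""
    chosen = []
    for _ in range(min(k, len(articles))):
        base = coverage_utility(chosen, articles, topic_weights)
        best = max(
            (name for name in articles if name not in chosen),
            key=lambda name: coverage_utility(chosen + [name], articles, topic_weights) - base,
        )
        chosen.append(best)
    return chosen


# Hypothetical articles: per-topic probability that the article covers the topic.
ARTICLES = {
    "match_report": {"sports": 0.9},
    "transfer_news": {"sports": 0.8},
    "election": {"politics": 0.7},
}
# Hypothetical user preference weights (the parameters a bandit would learn).
WEIGHTS = {"sports": 1.0, "politics": 1.0}

print(greedy_select(ARTICLES, WEIGHTS, 2))
```

Once one sports article is chosen, a second sports article adds little further coverage, so the greedy step prefers the politics article even though its stand-alone score is lower.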
As with any bandit learning problem, the inefficiency (or regret) of a recommendation algorithm is due primarily to the cost of exploration (i.e., making exploratory recommendations due to not knowing the user's preferences a priori). One way to reduce the cost of exploration is by leveraging prior knowledge. Intuitively, most users bear some similarity to "stereotypical users" that can be represented in a low-dimensional, or coarse, feature space. I will show how to construct a coarse-to-fine hierarchy of feature spaces from the preference profiles of existing users, and also how to conduct bandit learning using this feature hierarchy to drastically reduce the amount of exploration required.
I will present a live user study, where these approaches were applied to the setting of personalized news recommendation. Our results demonstrate improved performance against approaches that do not directly model diversification, do not employ exploration, or do not incorporate prior knowledge to reduce the amount of exploration required.
This is joint work with Carlos Guestrin and Sue Ann Hong.
We discuss a general notion of "sparsity structure" and present a unified framework for the recovery of "sparse-structured" signals from their linear image of reduced dimension possibly corrupted with noise. This unified treatment covers usual sparse and block-sparse recoveries via commonly used $\ell_1$ regularization as well as low-rank matrix reconstruction via nuclear norm minimization. We present null-space type sufficient conditions for the recovery to be precise in the noiseless case, derive error bounds for imperfect recovery (nearly sparse signal, presence of observation noise) and relate to the other well-known conditions (Restricted Isometry Property, Mutual Incoherence) from the literature. Our emphasis is on efficiently verifiable sufficient conditions on the problem parameters (sensing matrix and sparsity structure) for the validity of the associated nullspace properties. While the efficient verifiability of a condition is by no means necessary for the condition to be meaningful and useful, we believe that verifiability has its value and is worthy of being investigated. In particular, verifiability allows us to design new recovery routines with explicit confidence bounds for the recovery error, which can then be optimized over the method parameters leading to recovery procedures with improved statistical properties.
This is joint work with Anatoli Juditsky, Arkadi Nemirovski and Boris Polyak.
Given a large repository of geotagged imagery, we seek to automatically find visual elements, e.g. windows, balconies, and street signs, that are most distinctive for a certain geo-spatial area, for example the city of Paris. This is a tremendously difficult task as the visual features distinguishing architectural elements of different places can be very subtle. In addition, we face a hard search problem: given all possible patches in all images, which of them are both frequently occurring and geographically informative? To address these issues, we propose to use a discriminative clustering approach able to take into account the weak geographic supervision. We show that geographically representative image elements can be discovered automatically from Google Street View imagery in a discriminative manner. We demonstrate that these elements are visually interpretable and perceptually geo-informative. The discovered visual elements can also support a variety of computational geography tasks, such as mapping architectural correspondences and influences within and across cities, finding representative elements at different geo-spatial scales, and geographically-informed image retrieval.
This work was presented at SIGGRAPH this past summer, and has been featured in the Wall Street Journal and other media outlets. http://graphics.cs.cmu.edu/projects/whatMakesParis/
We consider the high-dimensional heteroscedastic regression model, where the mean and the log variance are modeled as a linear combination of input variables. Existing literature on high-dimensional linear regression models has largely ignored non-constant error variances, even though they commonly occur in a variety of applications ranging from biostatistics to finance. In this paper we study a class of non-convex penalized pseudolikelihood estimators for both the mean and variance parameters. We show that the Heteroscedastic Iterative Penalized Pseudolikelihood Optimizer (HIPPO) achieves the oracle property, that is, we prove that the rates of convergence are the same as if the true model was known. We demonstrate numerical properties of the procedure on a simulation study and real world data.
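For readers who want the model in concrete terms, the sketch below evaluates the Gaussian negative log-likelihood when both the mean and the log variance are linear in the inputs. It shows only the unpenalized likelihood of the model, not the HIPPO procedure or its non-convex penalty; the names `beta` and `theta` for the mean and variance parameters are assumptions.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def neg_log_likelihood(xs, ys, beta, theta):
    """Gaussian NLL with mean x.beta and log-variance x.theta for each sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        mean = dot(x, beta)
        log_var = dot(x, theta)
        residual_sq = (y - mean) ** 2
        total += 0.5 * (math.log(2.0 * math.pi) + log_var + residual_sq / math.exp(log_var))
    return total


# Toy data: an intercept column plus one covariate.
xs = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
ys = [0.1, 1.3, 1.8]
print(neg_log_likelihood(xs, ys, beta=[0.0, 1.0], theta=[0.0, 0.0]))
```

Setting `theta` to zero recovers the usual constant-variance regression likelihood, which is the case most of the high-dimensional literature assumes.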
This work was presented at ICML this past summer. http://icml.cc/2012/papers/722.pdf
Semantic parsing is the problem of automatically converting natural language into a computer-understandable formal representation (essentially, a statement in a programming language). The promise of semantic parsing is its generality: many language understanding tasks can be posed as semantic parsing problems, including question answering and information extraction. However, current approaches to semantic parsing suffer from two serious defects: (1) semantic parsers are trained on manually annotated sentences, which are impractical to obtain when working at web-scale and (2) semantic parsers rely on manually constructed knowledge bases, which are challenging to construct and typically incomplete or nonexistent. The second problem is especially problematic in grounded domains, such as robotics, where language may refer to objects in the real-world.
This talk presents two weakly supervised training algorithms for semantic parsers that overcome these two limitations. The first algorithm eliminates the need for annotated sentences; we demonstrate this algorithm by training an accurate semantic parser for Freebase that has the most expressive knowledge representation of any published semantic parser. The second algorithm eliminates the need for a prespecified knowledge representation; we demonstrate this algorithm on a natural language grounding task, identifying the objects in an image referred to by natural language expressions such as "the mug to the left of the monitor." In both cases, reducing the supervision requirements of semantic parsing allows us to tackle problems which are infeasible in the traditional, supervised paradigm.
This is joint work with Tom Mitchell and Thomas Kollar.
One way to build natural language processing tools is to ask human experts to annotate examples of the desired output for real-world text inputs, then apply supervised learning. Another is to start with text and a model family and apply unsupervised learning. In either case, how do we know whether the human or machine annotations are "correct"? In this talk, I'll give a little bit of background about the research area, and I'll discuss some of the weaknesses of our current evaluation methodology. I'll present a new abstract framework for evaluation. The central idea is to make explicit certain adversarial roles among researchers, so that the different roles in an evaluation are more clearly defined and participants in all roles are offered ways to make measurable contributions to the larger goal. This framework can be instantiated in many ways, simulating some familiar intrinsic and extrinsic evaluations as well as some new evaluations. This talk is entirely based on preliminary ideas (no theoretical or experimental results) and is intended to spark discussion.
Many important scientific and data-driven problems involve quantities which vary over both space and time. Examples include functional magnetic resonance imaging (fMRI), climate data, or experimental studies in physics and chemistry. Principal goals of many methods in statistics, machine learning, and signal processing are to use this data and i) extract informative structures and remove noisy, uninformative parts; ii) understand and reconstruct underlying spatio-temporal dynamics that govern these systems; and iii) forecast the data, i.e. describe the system in the future.
In this talk I present generally applicable, statistical methods that address all three problems in a unifying approach. I introduce two new techniques for optimal nonparametric forecasting of spatio-temporal data: hard and mixed LICORS (Light Cone Reconstruction of States). Hard LICORS is a consistent estimator of the predictive state space of continuous-valued data. Mixed LICORS builds on a new, fully probabilistic model of light cones and predictive states mappings, and is an EM-like version of hard LICORS. These estimators can then be used to estimate local statistical complexity (LSC), a fully automatic technique for pattern discovery in dynamical systems. Simulations and applications to fMRI data demonstrate that the proposed methods work well and give useful results in very general scientific settings.
This is joint work with Cosma Shalizi, Larry Wasserman, Christopher Genovese (CMU Stats), and Elisha Merriam (NYU, Center for Neural Science).
Timestamped data present a challenge to machine learning for predictions in the future: ignoring timestamps and assuming data are i.i.d. is scalable but risks distracting a model with irrelevant "ancient history," while using only the most recent portion of the data risks overfitting to current trends and missing important time-insensitive effects. We seek a general approach to learning model parameters that considers the variation in how different effects change over time.
We construct two novel prior distributions that allow parameters of probabilistic models to vary over time. Our priors encourage correlation between parameters at successive timesteps. We show how to do learning and inference under these priors. We test the approaches on several real-world datasets, demonstrating significant improvements over time-series-ignorant priors. Moreover, inspecting feature coefficients in the model allows us to identify trends and changes over time.
Current systems for graph computation require a distributed computing cluster to handle very large real-world problems, such as analysis on social networks or the web graph. While distributed computation resources have become more accessible, developing distributed graph algorithms still remains challenging, especially to non-experts.
In this work, we present GraphChi, a disk-based system for computing efficiently on graphs with billions of edges. By using a well-known method to break large graphs into small parts, and a novel parallel sliding windows method, GraphChi is able to execute several advanced data mining, graph mining, and machine learning algorithms on very large graphs, using just a single consumer-level computer. We further extend GraphChi to support graphs that evolve over time, and demonstrate that on a single computer, GraphChi can process over one hundred thousand graph updates per second, while simultaneously performing computation. We show, by experiments and theoretical analysis, that GraphChi performs well on SSDs and, surprisingly, also on rotational hard drives.
By repeating experiments reported for existing distributed systems, we show that with only a fraction of the resources, GraphChi can solve the same problems in very reasonable time. Our work makes large-scale graph computation available to anyone with a modern PC.
This work has been accepted to OSDI '12, and will be presented in Hollywood on October 8, 2012.
GraphChi: Big Data - small machine: http://graphchi.org
We analyze the problem of partitioning a 0-1 array or bipartite graph into subgroups (also known as co-clustering), under a relatively mild assumption that the data is generated by a general nonparametric process. We show that detection of co-clusters in the data implies with high probability the existence of co-clusters of similar proportion and connectivity in the generative process. Our main application is the analysis of a crude model for networks -- the stochastic co-blockmodel -- when the data is not assumed to be generated (even approximately) by a blockmodel, but rather by an unknown exchangeable process. Our result suggests that the stochastic co-blockmodel and other community detection algorithms may be robust to model misspecification.
We consider two nonparametric hypothesis testing problems: (1) Given samples from distributions p and q, a two-sample test determines whether to reject the null hypothesis p=q; and (2) Given a joint distribution p_xy over random variables x and y, an independence test determines whether to reject the null hypothesis of independence, p_xy = p_x p_y. In testing whether two distributions are identical, or whether two random variables are independent, we require a test statistic which is a measure of distance between probability distributions. One choice of test statistic is the maximum mean discrepancy (MMD), a distance between embeddings of the probability distributions in a reproducing kernel Hilbert space. The kernel used in obtaining these embeddings is critical in ensuring the test has high power, and correctly distinguishes unlike distributions with high probability.
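The MMD statistic described above can be sketched in a few lines. The example below computes the biased (V-statistic) estimate of the squared MMD for one-dimensional samples with a Gaussian kernel; the fixed bandwidth of 1.0 is an arbitrary assumption, whereas the talk concerns how to choose the kernel well.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=1.0):
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, kernel=gaussian_kernel):
    """Biased estimate of squared MMD: mean k(x,x') + mean k(y,y') - 2 mean k(x,y)."""
    m, n = len(xs), len(ys)
    kxx = sum(kernel(a, b) for a in xs for b in xs) / (m * m)
    kyy = sum(kernel(a, b) for a in ys for b in ys) / (n * n)
    kxy = sum(kernel(a, b) for a in xs for b in ys) / (m * n)
    return kxx + kyy - 2.0 * kxy


rng = random.Random(0)
p_sample = [rng.gauss(0.0, 1.0) for _ in range(100)]
q_same = [rng.gauss(0.0, 1.0) for _ in range(100)]
q_shifted = [rng.gauss(2.0, 1.0) for _ in range(100)]

print("same distribution:   ", mmd2_biased(p_sample, q_same))
print("shifted distribution:", mmd2_biased(p_sample, q_shifted))
```

A two-sample test would compare the statistic against a null threshold, for instance one obtained by recomputing it on random re-splits of the pooled samples; that step is left out here.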
In this talk, I will provide a tutorial overview of kernel distances on probabilities, and show how these may be used in two-sample and independence testing. I will then describe a strategy for optimal kernel choice, and compare it with earlier heuristics (including other multiple kernel learning approaches).
Joint work with: Bharath Sriperumbudur, Dino Sejdinovic, Heiko Strathmann, Sivaraman Balakrishnan, Massimiliano Pontil, Kenji Fukumizu
We study the challenging problem of learning decision lists attribute-efficiently, giving both positive and negative results.
Our main positive result is a new tradeoff between the running time and mistake bound for learning length-k decision lists over n Boolean variables. When the allowed running time is relatively high, our new mistake bound improves significantly on the mistake bound of the best previous algorithm of Klivans and Servedio.
Our main negative result is a new lower bound on the weight of any degree-d polynomial threshold function (PTF) that computes a particular decision list over k variables. This lower bound establishes strong limitations on the effectiveness of the Klivans and Servedio approach and suggests that it may be difficult to improve on our positive result. The main tool used in our lower bound is a new variant of Markov's classical inequality which may be of independent interest; it provides a bound on the derivative of a univariate polynomial in terms of both its degree and the size of its coefficients.
Deep learning and unsupervised feature learning offer the potential to transform many domains such as vision, speech, and NLP. However, these methods have been fundamentally limited by our computational abilities, and typically applied to small-sized problems. In this talk, I describe the key ideas that enabled scaling deep learning algorithms to train a very large model on a cluster of 16,000 CPU cores (2000 machines). This network has 1.15 billion parameters, which is more than 100x larger than the next largest network reported in the literature.
Such a network, when applied at this huge scale, is able to learn abstract concepts in a much more general manner than previously demonstrated. Specifically, we find that by training on 10 million unlabeled images, the network produces features that are very selective for high-level concepts such as human faces and cats. Using these features, we also obtain significant leaps in recognition performance on several large-scale computer vision tasks.
This talk motivates the study of degrees of freedom in statistical machine learning problems. Essentially, the degrees of freedom of a fitting procedure is its effective number of parameters. Though this is a vague concept, it has a precise definition for a wide class of problems. I will discuss what is known for some key problems in this class, and how some other important problems are not particularly well-understood.
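One standard way to make "effective number of parameters" precise, which may or may not be the exact definition used in the talk, is the covariance form below. It assumes observations $y_i = \mu_i + \epsilon_i$ with uncorrelated errors of common variance $\sigma^2$; the symbols $\hat{y}$, $S$, and $\sigma^2$ are notation introduced here.

```latex
\mathrm{df}(\hat{y}) = \frac{1}{\sigma^2} \sum_{i=1}^{n} \mathrm{Cov}(\hat{y}_i, y_i),
\qquad
\hat{y} = S y \ \text{with fixed } S \;\Longrightarrow\; \mathrm{df}(\hat{y}) = \mathrm{tr}(S).
```

For least squares on $p$ linearly independent predictors, $S$ is a projection matrix and the trace equals $p$, which matches the intuitive parameter count.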
We study the problem of active learning in a stream-based setting, allowing the distribution of the examples to change over time. We prove upper bounds on the number of prediction mistakes and number of label requests for established disagreement-based active learning algorithms, both in the realizable case and under Tsybakov noise. We further prove minimax lower bounds for this problem.
It is often computationally hard to run these methods with the 0-1 loss. Passive learning often resolves this problem by replacing the 0-1 loss with a convex relaxation, called a surrogate loss. We examine the extent to which this trick can also be useful in active learning. We start with a negative result for active learning with convex losses, where we prove that even under bounded noise constraints, the minimax rates for optimizing a convex loss with proper active learning are often no better than for passive learning. Then we explore a strategy that makes use of a given surrogate loss function in a different way, so that although the algorithm does not necessarily optimize the surrogate loss, it does optimize the 0-1 loss under certain conditions. We further present label complexity results for this method, showing that it sometimes improves over the analogous passive learning method.
Finding the k nearest neighbors (k-NNs) of a given vertex in a graph has many applications such as link prediction, recommendation systems and keyword search. One robust measure of vertex-proximity in graphs is the Personalized PageRank (PPR) score based on random walk with restarts. Since PPR scores have long-range correlations, computing them accurately and efficiently is challenging when the graph is too large to fit in memory, especially when it also changes over time. In this work, we propose ClusterRank, an efficient algorithm to answer PPR-based k-NN queries in large time-evolving graphs. ClusterRank represents a given graph as a collection of dense vertex-clusters with their interconnections. Each vertex-cluster maintains certain information related to internal random walks and updates this information as the graph changes. At query time, ClusterRank combines this information from a small set of relevant clusters and computes PPR scores efficiently. While ClusterRank can perform exact computations, we also propose several heuristics in order to reduce its query response time while sacrificing only a little accuracy. We validate the effectiveness of our method on several synthetic and real-world graphs from diverse domains.
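For reference, the baseline that a PPR-based k-NN query has to reproduce can be sketched as plain power iteration for a random walk with restarts. This is not ClusterRank, which avoids touching the whole graph at query time; the restart probability of 0.15 and the toy graph are assumptions.

```python
def personalized_pagerank(graph, source, restart=0.15, iters=100):
    """Power iteration for random walk with restart at `source`.

    `graph` maps every vertex to a list of its neighbors (all of which
    must themselves be keys of `graph`).
    """
    scores = {v: 0.0 for v in graph}
    scores[source] = 1.0
    for _ in range(iters):
        nxt = {v: 0.0 for v in graph}
        for v, nbrs in graph.items():
            walk_mass = (1.0 - restart) * scores[v]
            if not nbrs:
                # Dangling vertex: send its walking mass back to the source.
                nxt[source] += walk_mass
                continue
            share = walk_mass / len(nbrs)
            for u in nbrs:
                nxt[u] += share
        # Total mass stays 1, so the restart step returns exactly `restart`.
        nxt[source] += restart
        scores = nxt
    return scores


def ppr_knn(graph, source, k):
    scores = personalized_pagerank(graph, source)
    ranked = sorted((v for v in graph if v != source), key=lambda v: -scores[v])
    return ranked[:k]


# Two triangles joined by the bridge c-d.
GRAPH = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e", "f"], "e": ["d", "f"], "f": ["d", "e"],
}
print(ppr_knn(GRAPH, "a", 2))
```

Each iteration visits every edge, which is why this direct approach becomes expensive on graphs that do not fit in memory or that change between queries.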
Leman Akoglu is a Ph.D. candidate in the Computer Science Department at Carnegie Mellon University, advised by Prof. Christos Faloutsos. She received her B.S. at Bilkent University in 2007. She won 2 best paper awards and published 15 refereed articles in major data mining venues. She is one of the inventors of 3 U.S. patents (pending), filed by IBM T. J. Watson Research Labs. Her research focuses on large-scale data analytics, with an emphasis on anomaly and event detection in large, time-varying graphs using scalable algorithms and tools.
We present the first PAC bounds for learning parameters of Conditional Random Fields (Lafferty et al., 2001) with general structures over discrete and real-valued variables. Our bounds apply to composite likelihood (Lindsay, 1988), which generalizes maximum likelihood and pseudolikelihood (Besag, 1975). Moreover, we show that the only existing algorithm with a PAC bound for learning high-treewidth discrete models (Abbeel, 2006) can be viewed as a computationally inefficient method for computing pseudolikelihood. We present an extensive empirical study of the statistical efficiency of these estimators, as predicted by our bounds. Finally, we use our bounds to show how to construct computationally and statistically efficient composite likelihood estimators.
(Work appearing in AISTATS 2012)
How can machine learning help you write a song? In this talk, I will discuss two preliminary projects that have grown out of February Album Writing Month (http://fawm.org/), an online community for thousands of international musicians that I organize. The goal of each participant is to write 14 new songs in the month of February. First, I will discuss simple generative language models which were used to create a suite of online "computational creativity tools" called The Muse (http://muse.fawm.org). Despite their simplicity, these tools were successful at helping hundreds of participants write new songs. Second, I will present work in modeling user behavior that begins to explain and characterize what social interactions are most associated with successful outcomes (number of songs written, attaining the 14-song goal, donating money, returning the following year, etc.). I will conclude by proposing a few open directions for how machine learning can support individual and group creativity.
Mitchell (2008) demonstrated that brain states can be understood in terms of a componential semantics; that is, certain conceptual dimensions or other semantic properties can be used to interpret, and to predict, the activity of the brain when it is processing the meaning of a word. Meaning can be characterized in many computational ways, including hand-crafted ontologies, electronic thesauri, and corpus-based distributional models like HAL or LSA. Broad-coverage "people-based" distributional models are also emerging through crowd-sourcing of word associations, properties and similarity judgements. Here I'll examine corpus-based models to see which kinds of co-occurrence data are most informative for understanding patterns of activity in the brain, using regularized regressions. I'll also compare how labour- and computationally-intensive they are relative to human benchmarks, and speculate on how a combination of automatically derived and hand-tailored resources might provide a more comprehensive description of the human lexicon.
This talk will focus on the contextual bandit problem, which captures many situations from medical testing to computational advertising. In this setting, a learner only observes feedback for the actions it takes, creating a need to balance exploiting good strategies and exploring new ones. I will discuss some recent results on the contextual bandit problem, including the first high-probability optimal algorithm, a reduction from bandits to cost-sensitive learning, as well as an interesting generalization of the setting. These advances point to the possibility of launching better and more effective algorithms in practice.
Differential Privacy is a criterion used to judge whether a randomized algorithm operating over a database of individuals may be deemed to preserve privacy. In this work we apply the notion of Differential Privacy to reproducing kernel Hilbert spaces of functions. As in the finite dimensional Differential Privacy literature, we achieve privacy via noise addition where the variance is calibrated to the "sensitivity" of the output. In our setting the noise in question is the sample path of a Gaussian process, and the sensitivity is measured in the RKHS norm rather than the Euclidean norm. We give examples of private versions of kernel density estimators and support vector machines.
This talk will be self-contained: it will not assume prior knowledge of differential privacy, stochastic processes, etc.
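The finite-dimensional idea that this work generalizes, noise scaled to the sensitivity of the output, can be illustrated with the classic Laplace mechanism for a mean. This is the scalar analogue only, not the Gaussian process construction of the talk; it assumes values lie in [0, 1] and that neighboring databases differ in one record with the size held fixed.

```python
import random


def laplace_noise(scale, rng):
    """Laplace(0, scale) sample as the difference of two i.i.d. exponentials."""
    rate = 1.0 / scale
    return rng.expovariate(rate) - rng.expovariate(rate)


def private_mean(values, epsilon, rng):
    """Release the mean of values in [0, 1] with epsilon-differential privacy."""
    n = len(values)
    clipped = [min(1.0, max(0.0, v)) for v in values]
    # Changing one record moves the mean by at most 1/n.
    sensitivity = 1.0 / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(0)
data = [0.2, 0.4, 0.9, 0.5, 0.7, 0.1, 0.3, 0.8]
print("true mean:   ", sum(data) / len(data))
print("private mean:", private_mean(data, epsilon=1.0, rng=rng))
```

Smaller values of `epsilon` give stronger privacy and proportionally larger noise; larger databases have lower sensitivity and therefore need less noise for the same guarantee.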
We derive generalization error bounds (bounds on the expected inaccuracy of the predictions) for traditional time series forecasting models. These bounds allow forecasters to select among competing models and to guarantee that with high probability, their chosen model will perform well without making strong assumptions about the data generating process or appealing to asymptotic theory. Extending results from statistical learning theory, we demonstrate how these techniques can benefit time-series forecasters interested in choosing models which behave well under uncertainty and misspecification.
Hierarchical Bayesian models have become a popular tool for analyzing large-scale real-world data, such as text and images. Through these models, people can build useful tools for latent structure discovery, browsing and recommendations. In this talk, I will present several pieces of my work in the area of hierarchical Bayesian modeling, with an emphasis on topic modeling, Bayesian nonparametrics and approximate posterior inference. Specifically, I will first talk about an online variational inference framework that can scale up to millions of articles on a single machine and also does model selection using Bayesian nonparametric models. Then I will describe a novel recommendation model on scientific articles that provides an interpretable latent structure for both users and items. This is important for end-user interactions, but it is usually difficult to obtain and is often ignored in the literature. A demo is at http://www.cs.princeton.edu/~chongw/citeulike/ .
The problem of estimating high-dimensional network structures arises naturally in the analyses of many physical, biological and socio-economic systems. Examples include stock price fluctuations in financial markets and gene regulatory networks representing effects of regulators (transcription factors) on regulated genes in genetics. In many of these applications, the variables have inherent grouping structures that, when incorporated, can result in improved estimation and prediction. We aim to learn the structure of the network over time employing the framework of Granger causal models, under the assumptions of sparsity of its edges and inherent grouping structure among its nodes. I introduce a truncated penalty variant of group lasso to discover the Granger causal interactions among the nodes of the network. Asymptotic results on the consistency of the new estimation procedure are developed. The performance of the proposed methodology is assessed through an extensive set of simulation studies and comparisons with existing techniques. Finally, various extensions of the framework to more complex Granger causal structures are discussed.
Joint work with Sumanta Basu and George Michailidis (University of Michigan)
Most machine learning algorithms, such as classification or regression, treat the individual data point as the object of interest. Here we consider extending machine learning algorithms to operate on groups of data points. We suggest treating a group of data points as a set of i.i.d. samples from an underlying feature distribution for the group. Our approach is to generalize kernel machines from vectorial inputs to i.i.d. sample sets of vectors. For this purpose, we use a nonparametric estimator that can consistently estimate the inner product and certain kernel functions of two distributions. The projection of the estimated Gram matrix to the cone of semi-definite matrices enables us to employ the kernel trick, and hence use kernel machines for classification, regression, anomaly detection, and low-dimensional embedding in the space of distributions. Our numerical results demonstrate that in many cases this approach can outperform state-of-the-art competitors on both simulated and challenging real-world datasets.
How do you learn a linear predictor on a dataset with 2 trillion nonzero features in a reasonable amount of time? All single-machine algorithms fail, because the time to even stream the data through a network interface is too great. I will discuss an algorithm for doing this based on a combination of online learning, LBFGS, parallel learning, and a new communication infrastructure: a Hadoop-compatible Allreduce that we created. Deployed on 1000 nodes, we can learn faster than any single machine _ever_ will be able to learn a linear predictor on this hardware, the first such learning algorithm for which this claim can be made. The code behind this is open source in Vowpal Wabbit (http://hunch.net/~vw/).
This is joint work with Alekh Agarwal, Olivier Chapelle and Miroslav Dudik. A full draft is here: http://arxiv.org/abs/1110.4198
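The role Allreduce plays in this kind of training can be sketched with a single-process simulation: each node computes a gradient on its own shard, and the allreduce step leaves every node holding the sum. The functions below are hypothetical stand-ins written for this example and are not the Vowpal Wabbit API; the squared loss is also an assumption.

```python
def local_gradient(w, shard):
    """Gradient of 0.5 * sum (w.x - y)^2 over one shard of (x, y) pairs."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj
    return grad


def allreduce_sum(vectors):
    """Simulated allreduce: every node ends up with the element-wise sum."""
    total = [sum(column) for column in zip(*vectors)]
    return [list(total) for _ in vectors]


def sharded_gradient(w, shards):
    # On a real cluster each local_gradient call runs on a different node.
    per_node = [local_gradient(w, shard) for shard in shards]
    return allreduce_sum(per_node)[0]


data = [([1.0, 2.0], 3.0), ([0.5, -1.0], 1.0), ([2.0, 0.0], -1.0), ([1.0, 1.0], 0.5)]
w = [0.1, -0.2]
print(sharded_gradient(w, [data[:2], data[2:]]))
print(local_gradient(w, data))
```

Because the summed gradient is identical to the single-machine gradient, a batch optimizer such as LBFGS can run unchanged on top of it while only fixed-size vectors cross the network.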
I will discuss some work on sequential decision making under uncertainty in large, partially observable domains. During the talk I will demonstrate that by leveraging structural properties in the dynamics model we can scale to domains orders of magnitude larger than generic approaches. The talk will particularly focus on the strengths and technical challenges that arise in designing automated, adaptive instructional sequences for computerized tutoring systems.
Fielding ability remains a difficult quantity to estimate in baseball. I present a sophisticated hierarchical model that uses current ball-in-play data to evaluate individual fielders. I will discuss continuing efforts to extend these fielding models to examine the evolution of fielding ability over multiple seasons. Many challenges in this area remain: our modeling efforts are constrained by the aspects of fielding measured in the current data. These limitations will be discussed with a look towards the potential availability of much higher resolution data in the near future.
Modeling the purposeful behavior of imperfect agents from a small number of observations is a challenging task. When restricted to the single-agent decision-theoretic setting, inverse optimal control techniques assume that observed behavior is an approximately optimal solution to an unknown decision problem. These techniques learn a utility function that explains the example behavior and can then be used to accurately predict or imitate future behavior in similar observed or unobserved situations. In this work, we consider similar tasks in competitive and cooperative multi-agent domains. Here, unlike single-agent settings, a player cannot myopically maximize its reward --- it must speculate on how the other agents may act to influence the game's outcome. Employing the game-theoretic notion of regret and the principle of maximum entropy, we introduce a technique for predicting and generalizing behavior, as well as recovering a reward function in these domains.
Many interesting applications require solving the following problem: by watching an incoming stream of sensor data, hypothesize a dynamical system model which explains that data. The general problem of learning a dynamical system from a sensor data stream is difficult: to discover the right latent state representation and model parameters, one must solve difficult temporal and structural credit assignment problems, often leading to a search space with a host of (bad) local optima. In this talk, I will discuss how to overcome these problems by pairing an expressive class of models called Predictive State Representations with statistically consistent spectral learning algorithms. I will show that this framework is very general, easy to implement, computationally efficient, and can be used to unify and solve a number of different learning problems that are typically addressed in isolation.
We explore a transfer learning setting, in which a finite sequence of target concepts is sampled independently with an unknown distribution from a known family. We study the total number of labeled examples required to learn all targets to an arbitrary specified expected accuracy, focusing on the asymptotics in the number of tasks and the desired accuracy. Our primary interest is formally understanding the fundamental benefits of transfer learning, compared to learning each target independently from the others. Our approach to the transfer problem is general, in the sense that it can be used with a variety of learning protocols. The key insight driving our approach is that the distribution of the target concepts is identifiable from the joint distribution over a number of random labeled data points equal to the Vapnik-Chervonenkis dimension of the concept space. This is not necessarily the case for the joint distribution over any smaller number of points.
This work has particularly interesting implications when applied to active learning methods. In particular, we study in detail the benefits of transfer for self-verifying active learning; in this setting, we find that the number of labeled examples required for learning with transfer is often significantly smaller than that required for learning each target independently.
Latent feature models are an appropriate choice for image modeling, since images generally contain multiple objects or features. However, many latent feature models either do not account for the fact that objects can appear at different locations in different images or they require pre-segmentation of images and cluster the resulting segments. While the recently-proposed transformed Indian buffet process (tIBP) provides a method for modeling transformation-invariant features in simple binary images without the need for pre-segmentation, it cannot be applied to real images because of both computational constraints and its modeling assumptions. In this talk, I will show how the tIBP can be combined with an appropriate likelihood to create a model applicable to real images, and describe a novel Metropolis-Hastings inference algorithm that significantly improves the scalability of the tIBP.
This is joint work with Ke Zhai, Yuening Hu and Jordan Boyd-Graber at the University of Maryland.
The Lemonade Stand Game Tournament is a multiagent competition focused on understanding the issues faced when playing in a multi-agent environment. In contrast to Robocup, the annual computer poker competitions, and the TAC Competitions, the computational aspects of the game are intentionally kept to a minimum, forcing players to focus on the far more enigmatic task of understanding their opponents. While Nash equilibria are easily computed, the game is unsolvable, meaning that if everyone plays a different equilibrium, there is no guarantee that their joint behavior is meaningful.
In this talk, I will discuss how, in the past three years of the competition, new and interesting strategies have arisen which have no single agent equivalent: strategies that attempt not only to understand and adapt to the world, but strategies that also force the world to adapt to them. Although the competition is about maximizing utility, teams have designed radically new decision methodologies to achieve this. It is a known failing of the Nash equilibrium that it is a static concept and does not incorporate learning. The ultimate goal of this competition is to create a "group thought experiment" where not just an understanding of the static states of multi-agent systems can be developed, but also new empirical laws governing their dynamics can be explored, to be then tested and analyzed in a wider setting.
I will also discuss the upcoming 2012 Lemonade Stand Game Tournament, with a deadline in early summer 2012. More details can be found at http://martin.zinkevich.org/lemonade.
Martin Zinkevich is a Senior Research Scientist at Yahoo! Research, where he has worked since 2007. His primary interests are anti-abuse, large scale machine learning, and multi-agent learning. He completed his Ph.D. with Avrim Blum at CMU in 2004 with a thesis entitled "Theoretical Guarantees in Multiagent Settings". In between, he worked as a postdoc with Amy Greenwald at Brown University, looking at multi-agent reinforcement learning, and as a postdoc with Michael Bowling at the University of Alberta, building (among other things) poker programs that played at a professional level, a project which culminated in an exhibition match in Vegas at the World Series of Poker and a Wired article. At his day job, he works on blocking spam e-mails and spam comments under news articles.
Facebook is the most popular social network in the world and millions check news from their friends on its home page every day. There is a machine learning system that creates a personalized news stream for every user on every load of the page. Filtering and ranking stories in the news feed are unique Facebook problems but they are similar to many other machine learning challenges. The large scale of a growing social network and changes in interface and user behavior do not give us an opportunity to find the best solution once and forever; this is an endless competition of ideas and algorithms. The presentation contains some details about the problem, architecture of the system, our ideas, and algorithms that we use.
Using Lipschitz extensions for classification in metric spaces was apparently first proposed by von Luxburg and Bousquet (2004), who also noted that algorithmically, the solution can be realized as a nearest-neighbor search. In a COLT 2010 paper, we showed how to exploit the intrinsic geometry of the metric space to construct highly efficient classifiers and to derive data-dependent generalization bounds. We employed the doubling dimension on two fronts: information-theoretically, to control the fat-shattering dimension of Lipschitz functions (which yields error estimates), and algorithmically, to perform approximate nearest-neighbor search exponentially faster than the exact one. Since then, we have extended this technique to regression and anomaly detection. The talk, intended for a broad audience, will present an overview of our recent results, obtained in collaboration with: Daniel Berend, Lee-Ad Gottlieb, Danny Hendler, Eitan Menahem, Robert Krauthgamer.
Aryeh (Leonid) Kontorovich received his undergraduate degree in mathematics with a certificate in applied mathematics from Princeton University in 2001. His M.Sc. and Ph.D. are from Carnegie Mellon University, where he graduated in 2007. After a postdoctoral fellowship at the Weizmann Institute of Science, he joined the Computer Science department at Ben-Gurion University of the Negev in 2009 as an assistant professor; this is his current position. His research interests are mainly in machine learning, with a focus on probability, statistics, and automata theory.
The variety and complexity of potentially-related data resources available for querying --- webpages, databases, data warehouses --- has been growing ever more rapidly. There is a growing need to pose integrative queries across multiple such sources, exploiting foreign keys and other means of interlinking data to merge information from diverse sources. This has traditionally been the focus of research within Information Extraction (IE) and Information Integration (II) communities, with IE focusing on converting unstructured sources into structured sources, and II focusing on providing a unified view of diverse structured data sources. However, most of the current IE and II methods, which can potentially be applied to the problem of integration across sources, require large amounts of human supervision, often in the form of annotated data. This need for extensive supervision makes existing methods expensive to deploy and difficult to maintain. Faced with this challenge, in this talk, I shall present an overview of my research into graph-based weakly-supervised methods for IE and II.
Joint work with Koby Crammer, Sudipto Guha, Zack Ives, Marie Jacob, Marius Pasca, Fernando Pereira, Joseph Reisinger
Partha Pratim Talukdar is a Postdoctoral Fellow in the Machine Learning Department at CMU, working with Tom Mitchell. He received his PhD (2010) in CIS from the University of Pennsylvania, working under the supervision of Fernando Pereira, Mark Liberman, and Zack Ives. He is broadly interested in Machine Learning, Natural Language Processing, Data Integration, and Cognitive Science, with particular interest in large-scale learning and inference over graphs. Partha has worked at a variety of industrial research labs, including HP Labs, Google Research, and most recently Microsoft Research where he spent a year before coming to Carnegie Mellon.
The saliency of regions or objects in an image can be significantly boosted if they recur in multiple images. Leveraging this idea, cosegmentation jointly segments common regions from multiple images. In this paper, we propose CoSand, a distributed cosegmentation approach for a highly variable large-scale image collection. The segmentation task is modeled by temperature maximization on anisotropic heat diffusion, of which the temperature maximization with finite K heat sources corresponds to a K-way segmentation that maximizes the segmentation confidence of every pixel in an image. We show that our method takes advantage of a strong theoretic property in that the temperature under linear anisotropic diffusion is a submodular function; therefore, a greedy algorithm guarantees at least a constant factor approximation to the optimal solution for temperature maximization. Our theoretic result is successfully applied to scalable cosegmentation as well as diversity ranking and single-image segmentation. We evaluate CoSand on MSRC and ImageNet datasets, and show its competence both in competitive performance over previous work, and in much superior scalability.
"Presented in partial fulfillment of the CSD speaking skill requirement".
Although spectral clustering has enjoyed considerable empirical success in machine learning, its theoretical properties are not yet fully developed. We analyze the performance of a spectral algorithm for hierarchical clustering and show that on a class of hierarchically structured similarity matrices, this algorithm can tolerate noise that grows with the number of data points while still perfectly recovering the hierarchical clusters with high probability. We additionally improve upon previous results for k-way spectral clustering to derive conditions under which spectral clustering makes no mistakes. Further, using minimax analysis, we derive tight upper and lower bounds for the clustering problem and compare the performance of spectral clustering to these information theoretic limits. We also present experiments on simulated and real world data illustrating the strength of our results.
Joint work with Sivaraman Balakrishnan, Akshay Krishnamurthy, and Aarti Singh
Min is the opposite of max.
When dealing with time series with complex and uncertain non-stationarities, low retrospective regret on individual realizations is in general a more appropriate goal than low prospective risk in expectation. Online learning algorithms provide powerful guarantees of this form and have often been proposed for use with non-stationary processes because of their ability to switch between different forecasters or "experts." However, existing methods assume that this set of experts whose forecasts are to be combined is given at the start and fixed over time, and such assumptions are not generally plausible when dealing with genuinely historical or evolutionary systems. I show how to modify the "fixed shares" algorithm for tracking the best expert to handle a steadily growing set of experts, in which new experts are fitted to new data as they become available, and obtain regret bounds for the growing ensemble. Joint work with Kristina Klinkner, Abigail Jacobs, and Aaron Clauset.
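For readers unfamiliar with the fixed-share idea, the sketch below shows one round of a uniform-share variant of the update, plus one possible way to admit a new expert. The admission rule in `add_expert` is a hypothetical placeholder chosen for illustration; the abstract does not state the rule used in the talk, and the parameter names `eta`, `alpha` and `new_mass` are likewise assumptions.

```python
import math


def fixed_share_update(weights, losses, eta, alpha):
    """One round of a uniform-share variant of fixed-share: exponentially
    down-weight each expert by its loss, renormalise, then redistribute a
    fraction alpha of the mass uniformly so that a currently weak expert
    can recover quickly after a regime switch."""
    scaled = [w * math.exp(-eta * loss) for w, loss in zip(weights, losses)]
    total = sum(scaled)
    posterior = [s / total for s in scaled]
    n = len(posterior)
    return [(1.0 - alpha) * p + alpha / n for p in posterior]


def add_expert(weights, new_mass):
    """Hypothetical admission rule (an assumption, not the talk's method):
    the newcomer receives new_mass and existing weights shrink
    proportionally, so the weights still sum to one."""
    return [w * (1.0 - new_mass) for w in weights] + [new_mass]
```

The combined forecast at each step would be the weighted average of the experts' forecasts under the current weights.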
Extracting useful knowledge from large network datasets has become a fundamental challenge in many domains, from scientific literature to social networks and the web. We introduce Apolo, a system that uses a mixed-initiative approach --- combining visualization, rich user interaction and machine learning --- to guide the user to incrementally and interactively explore large network data and make sense of it. Apolo engages the user in bottom-up sensemaking to gradually build up an understanding over time by starting small, rather than starting big and drilling down. Apolo also helps users find relevant information by specifying exemplars, and then using a machine learning method called Belief Propagation to infer which other nodes may be of interest. We evaluated Apolo with twelve participants in a between-subjects study, with the task being to find relevant new papers to update an existing survey paper. Using expert judges, participants using Apolo found significantly more relevant papers. Subjective feedback of Apolo was also very positive.
The rise of social media has yielded tremendous amounts of behavioral and textual data recording people's attitudes and interests in various social contexts. I will present several projects, employing techniques from simple statistics to LDA-style graphical models, that infer cultural phenomena from these data:
(1) Relating opinion polls to Twitter sentiment analysis
(2) Geographic linguistic communities on Twitter
(3) Cultural interest groupings in the Facebook "Like" graph
Many activities traditionally regarded as based on individual choice are in fact strongly influenced by the social connections of each customer. In my talk I will present a novel method for analyzing social connections called the Group-First approach. This method exploits the structure of customer interactions to identify social leaders, and through it predict their behavior, as well as that of their peers. I will demonstrate the applicability of this approach to prediction of customer churn in cellular networks. I will present our results from several carriers, which confirm the unique advantages of our approach. Finally, I will discuss the parallel architecture which allows us to process the large volumes of data required for this analysis.
Elad Yom-Tov is a Senior Research Scientist at Yahoo Research. Before joining Yahoo in 2010, he was with the Machine Learning group at IBM Research Haifa Lab and Rafael. Dr. Yom-Tov received his B.Sc. from Tel-Aviv University and his M.Sc. and Ph.D. from the Technion - Israel Institute of Technology. Dr. Yom-Tov has co-authored two books and over 40 publications in top international conferences and journals, and filed over 30 patents (6 of which have been granted so far). His primary research interests are in large-scale Machine Learning, Information Retrieval, and in the past few years, social analysis.
Every month, users spend 700 billion minutes on Facebook. Surfacing personalized relevant content to each user at this scale, and for each of the billions of page views, presents interesting machine learning challenges. As a sample application for matching content to user interests, I will describe a method for improving the relevance of online advertising. The method is based on a user interest model that is trained on past clickthrough data. I will also talk briefly about our work on a recommendation system to suggest new friends. This system is responsible for 40% of all new friend connections made on Facebook.
This talk describes recent progress on generally characterizing the number of label requests sufficient for active learning to achieve a given accuracy (i.e., the label complexity), in both noisy and noise-free settings. Specifically, we begin by discussing a disagreement-based approach to the design and analysis of active learning algorithms, which has recently gained popularity in the literature. We find that the label complexities achieved by algorithms of this type are typically well-characterized by a simple quantity known as the disagreement coefficient. We then proceed to discuss a newer approach to the design of active learning algorithms, based on shatterable sets, and characterize the label complexities achievable by these methods in terms of a quantity analogous to the disagreement coefficient. In particular, these label complexities are often significantly better than those achievable by disagreement-based methods.
Models of interacting Self-Propelled Particles (SPPs) have proven adept at reproducing realistic-looking simulations of animal swarms and form the basis for our understanding of how collective behaviour emerges from individual interactions. With the continued progression of animal tracking technology it is now possible to consider inferring the optimal interaction model based on empirical data of individual movements within co-moving groups.
Group simulations often result in similar patterns of collective behaviour despite including different fine-scale interaction rules. The emergence of similar collective patterns in simulations of animal groups points to a restricted set of "universal" classes for these patterns. Universality presents a challenge to inferring such interactions from macroscopic group dynamics since these can be consistent with many underlying interaction models. As such, fine-scale movements of the animals must be considered if one is to understand exactly how individuals interact.
I will present a study of using Bayesian model selection on simulated data of animal swarms to distinguish between competing interaction models and demonstrate the limitations posed by accepted standards of experimental design.
Richard Mann is a postdoc in the Centre for Interdisciplinary Mathematics at Uppsala University, Sweden. He works on applying methods of inference and machine learning to the analysis of data in animal behaviour experiments. He completed his PhD at the University of Oxford, UK in 2010.
I will discuss two projects, both of which are concerned with finding structure in data. The first is "Graph-Valued Regression" where we estimate an undirected graph for a random vector Y, as a function of a second random vector X. (Joint work with Han Liu, Xi Chen, and John Lafferty).
The second is the problem of "Minimax Estimation of Manifolds From Noisy Data, in the Hausdorff Distance". (Joint work with Chris Genovese, Marco Perone-Pacifico and Isabela Verdinelli).
For naturally occurring data, the dimension of the given input space is often very large while the data themselves have a low intrinsic dimensionality. Spectral kernel methods are non-linear techniques for transforming data into a coordinate system that efficiently reveals the underlying structure -- in particular, the "connectivity" -- of the data. In this talk, I will focus on one particular technique -- diffusion maps -- but the analysis can be used for other spectral methods as well. I will give examples of various applications of the method in high-dimensional inference. I will also present a new extension of the diffusion framework to comparing distributions in high-dimensional spaces with an application to content-based image retrieval. (Part of this work is joint with R.R. Coifman, S. Lafon, C. Schafer and L. Wasserman)
As computing hardware becomes more portable and more instrumented with sensors, algorithms to parse this data become ever more important. We introduce a framework for building context awareness into standard smart-phones, which allows us to personalize the behavior of the phone--or other devices communicating remotely with the phone--using observations made about the owner's usage patterns.
In this test-bed application, we utilize data from a GPS, microphone, proximity sensor, and accelerometer, as well as from user interactions, to predict whether the phone should have its ringer set to silent or loud in the current context. The system must cope with noise in the sensor data and inconsistent user feedback, and it must do so using only a minimal amount of battery power.
Initial results from the In-Context system will be presented, and a variety of other applications for this technology will be introduced.
Graphical models or Markov random fields provide a graph-based framework for capturing dependencies between random variables of a large-scale multivariate distribution. This interdisciplinary topic has found widespread application in a variety of areas including image processing, bioinformatics, combinatorial optimization and machine learning. Estimating the graph structure of the model using samples drawn from it forms an important task, since the structure reveals important relationships between the variables. However, structure estimation has several challenges: in general graphical models, it is NP-hard, the models are typically in the high-dimensional regime where the number of variables is much larger than the number of samples obtained, and there could be many latent variables which are unobserved. I will address these challenges in the talk and provide solutions for certain classes of models.
I will focus on latent tree models in the first part of the talk. These are tree graphical models where there are latent variables, but there is no knowledge of the number or the location of the latent variables. We have developed novel algorithms which are consistent, computationally efficient and have low sample complexity. These algorithms are based on the presence of an additive metric on the tree, due to the properties of correlations on a tree model. The first algorithm uses these properties to check for sibling relationships between node pairs and builds the tree in a recursive fashion. The second algorithm initially builds a tree over the observed variables, and then adds hidden nodes in a step-by-step fashion by only operating on small subsets of variables. This leads to considerable computational savings compared to the first algorithm. We modify the second algorithm for experiments on real data by trading off the number of added latent variables against the accuracy of the resulting model fit via the Bayesian Information Criterion (BIC). Experiments on the S&P 100 monthly returns data and on the occurrence of words in newsgroups reveal interesting relationships.
In the second part, I will talk about recent results on learning graphical models on sparse Erdos-Renyi random graphs. These random graphs are relevant in social networks. Since these graphs are locally tree-like, it is a natural question whether structure learning is feasible in these models, given that learning tree models is tractable. We provide a positive answer when the model is in the so-called uniqueness regime, where there is a decay of long-range correlations. The algorithm is based on a set of conditional mutual information tests and is shown to be consistent for structure estimation with almost order-optimal sample complexity. A simpler algorithm based on correlation thresholding is also consistent, but under more stringent conditions. Finally, time permitting, I will briefly mention related work on consistent estimation of high-dimensional forest distributions and the characterization of extremal tree structures with respect to error rates for structure learning.
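The simpler of the two estimators mentioned above, correlation thresholding, can be illustrated with a short sketch: declare an edge between two variables whenever the magnitude of their sample correlation exceeds a threshold. The function names and the threshold parameter `tau` are assumptions for illustration, and the sketch says nothing about the conditions under which the talk proves consistency.

```python
import math


def correlation(xs, ys):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def threshold_graph(columns, tau):
    """Estimated edge set: pairs (i, j) with i < j whose absolute sample
    correlation exceeds tau. columns[i] holds all samples of variable i."""
    p = len(columns)
    return {
        (i, j)
        for i in range(p)
        for j in range(i + 1, p)
        if abs(correlation(columns[i], columns[j])) > tau
    }
```

On a three-variable chain where neighbours have correlation 0.8, the two end variables have correlation 0.64, so any threshold between those values recovers the chain given enough samples. This also shows why the method needs correlations to decay with graph distance: without that gap, no threshold separates edges from non-edges.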
Anima Anandkumar received her B.Tech in Electrical Engineering from the Indian Institute of Technology (IIT) Madras in 2004 and her MS and PhD degrees in Electrical Engineering from Cornell University, Ithaca, NY in 2009. She was at the Stochastic Systems Group at MIT, Cambridge, MA as a post-doctoral researcher. She has been an assistant professor in the EECS Dept. at U.C. Irvine since July 2010. She is the recipient of the 2009 Best Thesis Award by the ACM Sigmetrics Society, 2008 IEEE Signal Processing Society Young Author Best Paper Award, 2008 IBM Fran Allen PhD fellowship, and student paper award at 2006 IEEE ICASSP. Her research interests are in the area of statistical signal processing, network theory and information theory.
We sharply characterize the performance of different penalization schemes for the problem of selecting the relevant variables in the multi-task setting. Previous work focuses on the regression problem where conditions on the design matrix complicate the analysis. A clearer and simpler picture emerges by studying the Normal means model. This model, often used in the field of statistics, is a simplified model that provides a laboratory for studying complex procedures. These theoretical results will be presented together with implications for practitioners.
Markov switching processes, such as hidden Markov models (HMMs) and switching linear dynamical systems (SLDSs), are often used to describe rich classes of dynamical phenomena. They describe complex temporal behavior via repeated returns to a set of simpler models: imagine, for example, a person alternating between walking, running and jumping behaviors, or a stock index switching between regimes of high and low volatility.
Traditional modeling approaches for Markov switching processes typically assume a fixed, pre-specified number of dynamical models. Here, in contrast, we develop Bayesian nonparametric approaches that define priors on an unbounded number of potential Markov models. Using stochastic processes including the beta and Dirichlet process, we develop methods that allow the data to define the complexity of inferred classes of models, while permitting efficient computational algorithms for inference. The new methodology also has generalizations for modeling and discovery of dynamic structure shared by multiple related time series.
Interleaved throughout the talk are results from studies of the NIST speaker diarization database, stochastic volatility of a stock index, the dances of honeybees, and human motion capture videos.
An embedding of probability distributions into a reproducing kernel Hilbert space (RKHS) has been introduced: like the characteristic function, this provides a unique representation of a probability distribution in a high dimensional feature space. This representation forms the basis of an inference procedure on graphical models, where the likelihoods are represented as RKHS functions. The resulting algorithm is completely nonparametric: all aspects of the model are represented implicitly, and learned from a training sample. Both exact inference on trees and loopy belief propagation on pairwise Markov random fields are demonstrated.
Kernel message passing can be applied to general domains where kernels are defined, handling challenging cases such as discrete variables with huge domains, or very complex, non-Gaussian continuous distributions. We apply kernel message passing and competing approaches to cross-language document retrieval, depth prediction from still images, protein configuration prediction, and paper topic inference from citation networks: these are all large-scale problems, with continuous-valued or structured random variables having complex underlying probability distributions. In all cases, kernel message passing performs outstandingly, being orders of magnitude faster than state-of-the-art nonparametric alternatives, and returning more accurate results.
Work with Danny Bickson, Kenji Fukumizu, Carlos Guestrin, Yucheng Low, Le Song
The number of triangles is a computationally expensive graph statistic which is frequently used in complex network analysis (e.g., transitivity ratio), in various random graph models (e.g., exponential random graph model) and in important real world applications such as spam detection, uncovering of the hidden thematic structure of the Web and link recommendation. Counting triangles in graphs with millions and billions of edges requires algorithms which run fast, use a small amount of space, provide accurate estimates of the number of triangles and preferably are parallelizable.
In this paper we present an efficient triangle counting algorithm which can be adapted to the semistreaming model. The key idea of our algorithm is to combine the sampling algorithm of Tsourakakis et al. and the partitioning of the set of vertices into a high degree and a low degree subset respectively as in Alon, Yuster and Zwick, treating each set appropriately. We obtain a running time $O \left( m + \frac{m^{3/2} \Delta \log{n} }{t \epsilon^2} \right)$ and an $\epsilon$ approximation (multiplicative error), where $n$ is the number of vertices, $m$ the number of edges and $\Delta$ the maximum number of triangles an edge is contained in. Furthermore, we show how this algorithm can be adapted to the semistreaming model with space usage $O\left(m^{1/2}\log{n} + \frac{m^{3/2} \Delta \log{n}}{t \epsilon^2} \right)$ and a constant number of passes (three) over the graph stream. We apply our methods to various networks with several millions of edges and we obtain excellent results (e.g., for the Orkut graph with ~120M edges and ~286M triangles, our method runs in ~5sec with 98% accuracy). Finally, we propose a random projection based method for triangle counting and provide a sufficient condition to obtain an estimate with low variance.
Joint work with Mihail Kolountzakis, Gary Miller and Richard Peng
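The edge-sampling ingredient of the triangle counting abstract above can be illustrated with a minimal sketch. It assumes the simplest form of edge sparsification (keep each edge independently with probability p, count triangles in what remains, rescale by 1/p^3) and leaves out the degree-based vertex partitioning entirely, so it is not the algorithm of the paper. The function names are hypothetical.

```python
import random


def count_triangles(edges):
    """Exact triangle count of a simple undirected graph given as (u, v) pairs."""
    adj = {}
    unique_edges = set()
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        unique_edges.add((u, v) if u < v else (v, u))
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # Every triangle is found once from each of its three edges.
    return sum(len(adj[u] & adj[v]) for u, v in unique_edges) // 3


def estimate_triangles(edges, p, seed=0):
    """Keep each edge with probability p, count triangles, rescale by 1/p^3.

    A triangle survives only if all three of its edges are kept, which
    happens with probability p^3, so the rescaled count is unbiased.
    """
    rng = random.Random(seed)
    kept = [e for e in edges if rng.random() < p]
    return count_triangles(kept) / p ** 3
```

Smaller values of p reduce the work of the exact count at the price of higher variance in the estimate.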
Dynamic Programming, since its introduction by Richard Bellman in the 1940s, is one of the most important problem solving techniques, with numerous applications in operations research, databases (histogram construction), time series analysis, speech recognition, robotics, biology and many other fields. In this talk we will present two new techniques for performing dynamic programming approximately, for a recurrence not treated efficiently by existing methods. The basis of our first algorithm is the definition of a constant-shifted variant of the objective function that can be efficiently approximated using state of the art methods for range searching. Our technique approximates the optimal value of our objective function within additive $\epsilon$ error and runs in $\tilde{O}(n^{4/3+\delta} \log(\frac{U}{\epsilon}))$ time, where $\delta$ is an arbitrarily small positive constant and $U = \max \{ \sqrt{C},(P_i)_{i=1,\ldots,n} \}$. The second algorithm we provide solves a similar recurrence within a multiplicative factor of $(1+\epsilon)$ and runs in $O(n \log{n} / \epsilon )$ time. The new technique introduced by our algorithm is the decomposition of the initial problem into a small (logarithmic) number of Monge optimization subproblems which we can speed up using existing techniques. Finally, we demonstrate a biological application of our recurrence where we obtain results superior to leading competitors both on benchmarks and real data.
Joint work with Richard Peng, David Tolliver, Maria Tsiarli, Stanley Shackney, Gary Miller and Russell Schwartz
In this work, we describe PeGaSus, an open source Peta Graph Mining library which performs typical graph mining tasks such as computing the diameter of the graph, computing the radius of each node, finding the connected components, and computing the importance score of nodes. As the size of graphs reaches several Giga-, Tera- or Peta-bytes, the necessity for such a library grows too. To the best of our knowledge, PeGaSus is the first such library, implemented on top of the Hadoop platform, the open source version of MapReduce.
Many graph mining operations (PageRank, spectral clustering, diameter estimation, connected components etc.) are essentially a repeated matrix-vector multiplication. In this paper we describe a very important primitive for PeGaSus, called GIM-V (Generalized Iterated Matrix-Vector multiplication). GIM-V is highly optimized, achieving (a) good scale-up on the number of available machines, (b) linear running time on the number of edges, and (c) more than 5 times faster performance over the non-optimized version of GIM-V.
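To make the "repeated matrix-vector multiplication" view concrete, here is a small single-machine sketch. It assumes the generalized product is parameterized by three user-supplied functions (one combining a matrix entry with a vector entry, one aggregating the partial results of a row, one deciding the new vector entry); the names `combine2`, `combine_all` and `assign` are labels chosen for this illustration, and nothing here reflects the Hadoop implementation or its optimizations.

```python
def gim_v(entries, v, combine2, combine_all, assign):
    """One generalized matrix-vector product.

    entries: sparse matrix as (i, j, m_ij) triples.
    v: current vector, indexed by node id 0..len(v)-1.
    """
    partial = {}
    for i, j, m in entries:
        partial.setdefault(i, []).append(combine2(m, v[j]))
    return [assign(v[i], combine_all(partial.get(i, []))) for i in range(len(v))]


def matvec(entries, v):
    """Ordinary product: multiply, sum, overwrite."""
    return gim_v(entries, v, lambda m, vj: m * vj, sum, lambda old, new: new)


def connected_components(n, undirected_edges):
    """Label each node with the smallest node id in its component."""
    entries = [(i, j, 1) for i, j in undirected_edges]
    entries += [(j, i, 1) for i, j in undirected_edges]
    v = list(range(n))
    while True:
        new_v = gim_v(
            entries,
            v,
            lambda m, vj: vj,                          # pass the neighbor's label
            lambda xs: min(xs, default=float("inf")),  # smallest neighbor label
            min,                                       # keep the smaller label
        )
        if new_v == v:
            return v
        v = new_v
```

Swapping the three functions changes the task without touching the iteration itself, which is the property that makes a single optimized primitive attractive.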
Our experiments ran on M45, one of the top 50 supercomputers in the world. We report our findings on several real graphs, including one of the largest publicly available Web graphs, thanks to Yahoo!, with ~6.7 billion edges.
We consider the problem of identifying an activation pattern in a complex, large-scale network that is embedded in very noisy measurements. This problem is relevant to several applications, such as identifying traces of a biochemical spread by a sensor network, expression levels of genes, and anomalous activity or congestion in the Internet. Extracting such patterns is a challenging task, especially if the network is large (the pattern is very high-dimensional) and the noise is so excessive that it masks the activity at any single node. However, typically there are statistical dependencies in the network activation process that can be leveraged to fuse the measurements of multiple nodes and enable reliable extraction of high-dimensional noisy patterns. In this paper, we analyze an estimator based on the graph Laplacian eigenbasis, and establish the limits of mean square error recovery of noisy patterns arising from a probabilistic (Gaussian or Ising) model based on an arbitrary graph structure. We consider both deterministic and probabilistic network evolution models, and our results indicate that by leveraging the network interaction structure, it is possible to consistently recover high-dimensional patterns even when the noise variance increases with network size.
I am currently a Ph.D. student in Machine Learning and Statistics at CMU. I have a master's in Statistics and a bachelor's in Math and Physics from the Ohio State University. I have worked on exploiting structure for statistical estimation and pattern localization. Currently, I am working on optimal design and active learning for Ising and infection models, structured sparsity, and density estimation over large graphs.
Latent variable models are an essential tool for the analysis of data in numerous application areas, whether it be source separation, image analysis, matrix completion or preference elicitation, since they provide a means of understanding the inherent structure in data. This talk will, in two parts, examine the natural evolution of latent variable models and the statistical tools - Bayesian tools in particular - used for learning with these models.
In the first part, the motivation for generalising such models to the exponential family will be given, which will allow for the modelling of data that may be binary, categorical, counts or non-negative. The focus will be on Bayesian analysis of this class of models, showing how inference is performed using Hybrid Monte Carlo sampling, the relationship between the generalised models and various other existing models, and examining aspects of model identifiability.
In the second part, the discussion will switch to the priors that are used to learn latent representations of data, and in the context of the generalised latent variable models just discussed. The focus will be on sparse Bayesian learning: weak sparsity achieved through priors based on the Gaussian scale mixture construction; and strong sparsity using discrete mixture priors. A comparison of these two methods will be given and an efficient sampler for learning with discrete mixture priors will be described, which has many advantages over a corresponding optimisation approach to learning.
Shakir Mohamed is a PhD Candidate in the Machine Learning Group at the University of Cambridge working under the supervision of Prof. Zoubin Ghahramani. At Cambridge, he is a Commonwealth scholar and a member of St John's College. His research interests lie in Bayesian statistics and latent variable models, sparse Bayesian learning and its connections to compressed sensing and learning in infinite dimensional settings, and the probabilistic modelling of tensor data, amongst others.
Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, however, since there are $n!$ possibilities. A pervasive technique in machine learning for making large problems tractable is to exploit probabilistic independence structures for decomposing large problems into much smaller ones. However, it is not obvious how one might exploit independence for permutation data due to the mutual exclusivity constraints which disallow, for example, two items to map to the same rank.
I will talk about recent progress on tracking and ranking problems. First, I will discuss probabilistically tracking multiple moving targets by decomposing large sets of targets into smaller independent subsets. Via independence decompositions, we are able to track much larger collections of targets. Unlike multiobject tracking, distributions over rankings are typically not amenable to independence assumptions. Instead, I will present a novel generalization of independence, called \emph{riffled independence}, encompassing a more expressive family of distributions while retaining many of the properties necessary for performing efficient inference and reducing sample complexity. In riffled independence, one draws two permutations independently, then performs the \emph{riffle shuffle}, common in card games, to combine the two permutations to form a single permutation. In ranking, riffled independence corresponds to ranking disjoint sets of objects independently, then interleaving those rankings.
This is joint work with Carlos Guestrin, Leo Guibas, Xiaoye Jiang and Ashish Kapoor
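The generative process behind riffled independence (rank two disjoint item sets independently, then interleave) can be sketched as follows. The sketch assumes uniformly random rankings within each set and a uniformly random interleaving, which is only one member of the family described in the talk; the function names are hypothetical.

```python
import random


def riffle(ranking_a, ranking_b, rng):
    """Interleave two rankings of disjoint item sets.

    The relative order inside each ranking is preserved; only the choice
    of which positions go to which ranking is random (uniform here).
    """
    n_a, n_b = len(ranking_a), len(ranking_b)
    slots_for_a = set(rng.sample(range(n_a + n_b), n_a))
    it_a, it_b = iter(ranking_a), iter(ranking_b)
    return [next(it_a) if k in slots_for_a else next(it_b)
            for k in range(n_a + n_b)]


def sample_riffled_ranking(items_a, items_b, rng):
    """Draw the two sub-rankings independently, then riffle them together."""
    ranking_a, ranking_b = list(items_a), list(items_b)
    rng.shuffle(ranking_a)
    rng.shuffle(ranking_b)
    return riffle(ranking_a, ranking_b, rng)
```

Replacing the uniform choices with learned distributions over each sub-ranking and over the interleaving is where the modeling flexibility would come from.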
Given a contact network that changes over time (say, day vs night connectivity), and the SIS (susceptible/infected/susceptible, flu like) virus propagation model, what can we say about its epidemic threshold? That is, can we determine when a small infection will "take off" and create an epidemic? Consequently, which nodes should we immunize to prevent an epidemic? This is a very real problem, since, e.g., people have different connections during the day at work, and during the night at home. Static graphs have been studied for a long time, with numerous analytical results. Time-evolving networks are so hard to analyze that most existing works are simulation studies. Specifically, our contributions in this paper are: (a) we formulate the problem by approximating it by a Non-linear Dynamical system (NLDS), (b) we derive the first closed formula for the epidemic threshold of time-varying graphs under the SIS model, and finally (c) we show the usefulness of our threshold by presenting efficient heuristics and evaluating the effectiveness of our methods on synthetic and real data like the MIT reality mining graphs.
Joint work with Hanghang Tong, Nicholas Valler, Michalis Faloutsos and Christos Faloutsos
Topic modeling has been popularly used for data analysis in various domains including text documents. Previous topic models, such as probabilistic Latent Semantic Analysis (pLSA) and Latent Dirichlet Allocation (LDA), have shown impressive success in discovering low-rank hidden structures for modeling text documents. These models, however, do not take into account the manifold structure of data, which is generally informative for the non-linear dimensionality reduction mapping. More recent models, namely Laplacian PLSI (LapPLSI) and Locally-consistent Topic Model (LTM), have incorporated the local manifold structure into topic models and have shown the resulting benefits. But these approaches fall short of the full discriminating power of manifold learning as they only enhance the proximity between the low-rank representations of neighboring pairs without any consideration for non-neighboring pairs. In this paper, we propose Discriminative Topic Model (DTM) that separates non-neighboring pairs from each other in addition to bringing neighboring pairs closer together, thereby preserving the global manifold structure as well as improving the local consistency. We also present a novel model fitting algorithm based on the generalized EM and the concept of Pareto improvement. As a result, DTM achieves higher classification performance in a semi-supervised setting by effectively exposing the manifold structure of data. We provide empirical evidence on text corpora to demonstrate the success of DTM in terms of classification accuracy and robustness to parameters compared to state-of-the-art techniques.
High-level parallel frameworks like MapReduce (Hadoop) have begun to receive considerable attention within the machine learning (ML) community. However, while MapReduce is ideal for many large-scale data processing tasks, it does not naturally or efficiently express asynchronous iterative computation with sparse dependencies. Unfortunately, many popular machine learning algorithms like belief propagation, Gibbs sampling, CoEM, and the lasso (shooting algorithm) require asynchronous iterative computation and impose sparse parameter dependencies. To fill this critical void, we developed GraphLab which naturally expresses asynchronous iterative algorithms with sparse computational dependencies while ensuring data consistency and achieving a high degree of parallel performance. As part of our recent UAI'10 publication, we released a shared-memory (multicore) implementation of the GraphLab API and are in the process of developing cluster and GPU versions.
In this talk I will briefly review the MapReduce abstraction and demonstrate how coercing popular iterative machine learning algorithms into MapReduce can lead to highly inefficient parallel algorithms. I will then introduce the GraphLab framework and explain how it addresses the critical limitations of the MapReduce framework while retaining the advantages of a high-level abstraction. I will show how the GraphLab abstraction can be used to represent efficient, provably correct versions of several popular sequential machine learning algorithms. Finally, I will present scaling results from our UAI'10 paper which describes our initial Shared-Memory API (currently available for download). I will conclude by briefly discussing some of our ongoing work on the cluster (distributed-memory) API.
This is joint work with: Yucheng Low, Aapo Kyrola, Kannat Tangwonsan, Danny Bickson, Carlos Guestrin, Guy Blelloch, David O'Hallaron, Joseph M. Hellerstein
In this talk, I will describe an interactive approach for learning ranking functions. In particular, this approach leverages clickthrough data collected from online interleaving experiments.
Interleaving experiments are an effective methodology for eliciting reliable implicit feedback. For any query, the rankings computed by two retrieval functions are interleaved together and then presented to the user. Afterward, clicks can be interpreted as a relative preference of one retrieval function over the other.
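As an illustration of how two rankings can be interleaved and clicks credited, here is a sketch of team-draft interleaving, one commonly used scheme. Whether the experiments in this talk use this particular scheme is an assumption, and the function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, rng):
    """Merge two rankings; remember which ranker contributed each result."""
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team = [], {}
    while True:
        remaining = {t: [d for d in r if d not in team]
                     for t, r in rankings.items()}
        if not remaining["A"] and not remaining["B"]:
            return interleaved, team
        # The ranker with fewer picks goes next; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            turn = "A" if picks["A"] < picks["B"] else "B"
        else:
            turn = rng.choice("AB")
        if not remaining[turn]:
            turn = "B" if turn == "A" else "A"
        doc = remaining[turn][0]  # that ranker's best result not yet shown
        interleaved.append(doc)
        team[doc] = turn
        picks[turn] += 1


def preference(clicked_docs, team):
    """Credit each click to the contributing ranker and compare totals."""
    a = sum(1 for d in clicked_docs if team.get(d) == "A")
    b = sum(1 for d in clicked_docs if team.get(d) == "B")
    return "A" if a > b else "B" if b > a else "tie"
```

Aggregating `preference` over many queries gives the relative comparison between the two retrieval functions that the paragraph above describes.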
I will present a novel online learning framework, called the Dueling Bandits Problem, that characterizes the exploration/exploitation trade-off of running online interleaving experiments (since they might lower current retrieval quality). For example, intuitively, we should quickly stop running experiments using low quality retrieval functions once we have discovered that they are of low quality. The Dueling Bandits Problem differs from existing online learning frameworks in that only relative information (e.g., is A better than B?) is assumed to be available to the learning algorithm.
I will also describe a learning approach to improve the statistical power of interleaving experiments. This approach is motivated by the intuition that not all clicks are equally informative. By learning a weighting function on clicks collected from interleaving experiments, we can arrive at an improved statistical test that will more efficiently tease apart the relative quality between pairs of retrieval functions.
This is joint work with Thorsten Joachims, Bobby Kleinberg, Josef Broder, Yue Gao, Olivier Chapelle and Ya Zhang.
We develop a penalized kernel smoothing method for the problem of selecting non-zero elements of the conditional precision matrix, known as conditional covariance selection. This problem has a key role in many modern applications such as finance and computational biology. However, it has not been properly addressed. Our estimator is derived under minimal assumptions on the underlying probability distribution and works well in the high-dimensional setting. The efficiency of the algorithm is demonstrated on both simulation studies and the analysis of the stock market.
With the increasing popularity of large-scale probabilistic graphical models, even "lightweight" approximate inference methods are becoming infeasible. Fortunately, often large parts of the model are of no immediate interest to the end user. Given the variable that the user actually cares about, we show how to quantify edge importance in graphical models and to significantly speed up inference by focusing computation on important parts of the model. Our algorithm empirically demonstrates convergence speedup by multiple times over state of the art.
Hidden Markov Models (HMMs) are important tools for modeling sequence data. However, they are restricted to discrete latent states, and are largely restricted to Gaussian and discrete observations. Moreover, learning algorithms for HMMs have predominantly relied on local search heuristics, with the exception of spectral methods such as those described below. We propose a nonparametric HMM that extends traditional HMMs to structured and non-Gaussian continuous distributions. Furthermore, we derive a local-minimum-free kernel spectral algorithm for learning these HMMs. We apply our method to robot vision data, slot car inertial sensor data and audio event classification data, and show that in these applications, embedded HMMs exceed the previous state-of-the-art performance.
One potentially effective way to deal with a large collection of data samples is to discover a subspace (usually much lower dimensional and latent) representation, which can be used for further knowledge discovery or data management tasks. To automatically discover a low dimensional representation, many methods have been developed, such as the classic PCA, CCA, and the probabilistic topic models (e.g., latent Dirichlet allocation or LDA). One of the advocated advantages of such models is that they do not require supervision during training, which is arguably preferred over supervised learning that would necessitate extra cost. But with the increasing availability of free on-line information such as image tags, user ratings, etc., various forms of side-information that can potentially offer “free” supervision have led to a need for new models and training schemes that can make effective use of such information to achieve better results, such as more discriminative topic representations of image contents, and more accurate image classifiers. The standard LDA and many other models are unsupervised and ignore the commonly available supervision information, and thus can discover a sub-optimal representation for prediction tasks. Extensions to supervised models which can explore side information for discovering predictive subspace representations have been proposed, and their training is typically performed with maximum likelihood estimation, which may not yield conclusive results or may result in an unbalanced prediction rule. Our goal is to investigate how the arguably more discriminative maximum margin principle can be effectively applied to discover predictive subspace representations from a large collection of data, which can be bag-of-word text documents or images with multiple types of features.
We aim to develop a generic learning framework for discovering predictive latent subspace representations and provide several useful tools for helping users in managing the huge amount of online content. In this talk, I will present one recent work on multi-view data analysis.
Recently, we developed a new ML framework that allows us to systematically avoid density estimation. The key idea is to directly estimate the ratio of density functions, not densities themselves. Our framework includes various ML tasks such as importance sampling (e.g., covariate shift adaptation, transfer learning, multitask learning), divergence estimation (e.g., two-sample test, outlier detection, change detection in time-series), mutual information estimation (e.g., independence test, independent component analysis, feature selection, sufficient dimension reduction, causal inference), and conditional probability estimation (e.g., probabilistic classification, conditional density estimation).
In this talk, I introduce the density ratio framework, review methods of density ratio estimation, and show various real-world applications including brain-computer interface, speech recognition, image recognition, and robot control.
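One simple way to estimate a density ratio without estimating either density is to train a probabilistic classifier that separates samples of the numerator distribution from samples of the denominator distribution and convert its odds into a ratio. The sketch below assumes one-dimensional data and plain logistic regression fitted by gradient descent; it illustrates the general idea only and is not one of the specific estimators reviewed in the talk. All names are hypothetical.

```python
import math
import random


def fit_logistic_1d(xs, ys, lr=0.1, steps=2000):
    """Fit p(y=1 | x) = sigmoid(w * x + b) by batch gradient descent."""
    w = b = 0.0
    n = len(xs)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def density_ratio(x_num, x_den):
    """Return r(x), an estimate of p_num(x) / p_den(x).

    By Bayes' rule the ratio equals the classifier odds
    p(y=1 | x) / p(y=0 | x) = exp(w * x + b), corrected by the
    sample-size ratio n_den / n_num.
    """
    xs = list(x_num) + list(x_den)
    ys = [1] * len(x_num) + [0] * len(x_den)
    w, b = fit_logistic_1d(xs, ys)
    prior = len(x_den) / len(x_num)
    return lambda x: prior * math.exp(w * x + b)


rng = random.Random(0)
numerator = [rng.gauss(2.0, 1.0) for _ in range(300)]
denominator = [rng.gauss(0.0, 1.0) for _ in range(300)]
ratio = density_ratio(numerator, denominator)
```

For these two unit-variance Gaussians the true ratio is exp(2x - 2), so `ratio` should be large near the numerator mean and small near the denominator mean.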
We consider the problem of re-ranking search results by incorporating user feedback. We present a graph theoretic measure for discriminating irrelevant results from relevant results using a few labeled examples provided by the user. The key intuition is that nodes relatively closer (in graph topology) to the relevant nodes than the irrelevant nodes are more likely to be relevant. We present a simple sampling algorithm to evaluate this measure at specific nodes of interest, and an efficient branch and bound algorithm to compute the top k nodes from the entire graph under this measure. On quantifiable prediction tasks the introduced measure outperforms other diffusion-based proximity measures which take only the positive relevance feedback into account. On the Entity-Relation graph built from the authors and papers of the entire DBLP citation corpus (1.4 million nodes and 2.2 million edges) our branch and bound algorithm takes about 1.5 seconds to retrieve the top 10 nodes w.r.t. this measure with 10 labeled nodes.
Considerable speed-ups to machine learning problems have been achieved by two developments: distributed computing (either on multi-core or "cloud" architectures) and rapidly converging online learning algorithms. In this talk, we combine these two. Distributed computing has largely been paired with "batch" algorithms like EM and L-BFGS, in which the entire training dataset is processed once per iterative update; our approach makes more frequent online updates asynchronously, either in a pure online or mini-batch setting. Asynchronous updates can introduce error, but the approach has similar convergence guarantees to other online learning algorithms in certain settings, such as the case of online gradient-based optimization for convex objectives. We first consider this setting, and present a series of experiments exploring practical issues for a structured prediction task in natural language processing, named-entity recognition. We also consider settings that are not yet supported by theoretical results. We apply an online version of EM (Cappe and Moulines, 2009) to two unsupervised structured learning tasks: (1) word alignment for machine translation, and (2) unsupervised part-of-speech tagging. For the former we use a model that actually has a concave log-likelihood function, while the latter fits the more common unsupervised learning scenario with a non-concave objective. In both cases we find significant speed-ups over batch algorithms with no observable problems arising from the use of asynchronous updates. In addition, we present experimental results when running asynchronous mini-batch algorithms on M45, a large cluster running the Hadoop MapReduce framework. We find that, while MapReduce is not an ideal fit for these algorithms, they do converge faster than batch algorithms on the same hardware and we expect that the MapReduce framework may become more appropriate for asynchronous learning as problem sizes continue to grow.
This is joint work with Dipanjan Das and Noah Smith.
Genome rearrangement refers to large structural changes on genomes which affect their chromosomal organizations. Over the past 15 years, significant progress has been achieved for problems with two genomes. However, problems with three or more genomes, e.g., to infer phylogeny and ancestral genome organizations simultaneously, had not been satisfactorily solved, due to the underlying computational difficulties.
In this talk I present my solutions to these challenging problems. I start with the problem with three genomes; then introduce a new mathematical theory, which captures its essential combinatorial structures and whose iterative application quickly finds optimal solutions to most instances. For the general problem with any number of genomes, I develop a method which extends this theory and systematically exploits information from known genomes. The new method gives extremely accurate results within a few minutes on large datasets, which are far beyond the reach of previous methods. Finally this talk presents results of the new method applied on UCSC and ENSEMBL high-resolution datasets.
Andrew Wei Xu received a Bachelor's degree in Biophysics (2002, Nanjing University, China) and a Master's degree in Physics (2005, McMaster University). He then studied under David Sankoff and received his Ph.D. in mathematics (2008, University of Ottawa) with a specialization in probability theory and discrete math. He now works with Bernard Moret as a postdoctoral fellow in the School of Computer and Communication Sciences at the Swiss Federal Institute of Technology, Lausanne. His research interests center on computational biology, with emphasis on solving genetic and genomic problems using various quantitative analysis methods, including probability analysis, statistical inference, algorithm design and combinatorial analysis. He is now looking forward to developing and applying statistical machine learning methods to solve computational biology problems.
In dual-income families, attending to the detail required to make and monitor transportation plans for kids’ activities requires much parental attention. Families rely on routines as a mechanism to support this transportation process. Families face their largest challenges, however, on non-routine days, where the effectiveness of routine resources and behavioral patterns is significantly diminished. When Dad takes over a chore Mom usually manages, for example, the task is significantly more at risk of coordination breakdowns. Especially when making or changing plans, information routine to Mom could easily be unknown to Dad.
In this talk, I will discuss work towards the creation of technologies that can support busy families during such moments. In particular, I will describe the collection of a massive dataset on family coordination, and demonstrate how only simple GPS is needed to create intelligent applications that can support family coordination. I will demonstrate how learned models of location can be used to detect when a parent has forgotten to pick up a child, and propose a variety of applications that can capture and reveal important but invisible (to people) aspects of family movement patterns. These applications open a space where Machine Learning and HCI research can collaborate to create novel and valuable services for families.
A line of recent work has demonstrated that sparsity is a powerful technique in signal reconstruction and in statistical estimation.
Given n noisy samples with p dimensions, where n << p, we show that the multi-step thresholding procedure based on the Lasso -- we call it the Thresholded Lasso, can accurately estimate a sparse vector beta in p dimensional space in a linear model. We show that under the Restricted Eigenvalue (RE) condition (Bickel-Ritov-Tsybakov 09), it is possible to achieve the L2 loss within a logarithmic factor of the ideal mean square error one would achieve with an oracle while selecting a sufficiently sparse model -- hence achieving "sparse oracle inequalities"; the oracle would supply perfect information about which coordinates are non-zero and which are above the noise level.
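The general shape of a threshold-after-Lasso procedure can be sketched as follows: fit the Lasso, discard coefficients whose magnitude falls below a threshold, then refit by least squares on the surviving coordinates. The sketch assumes the objective 0.5 * ||y - X b||^2 + lam * ||b||_1, a single thresholding step, and user-chosen values of `lam` and `tau`; the choices analyzed in the talk may differ, and the function names are hypothetical.

```python
def soft_threshold(z, lam):
    """Shrink z toward zero by lam."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, sweeps=200):
    """Coordinate descent for 0.5 * ||y - X b||^2 + lam * ||b||_1."""
    n, p = len(X), len(X[0])
    b = [0.0] * p
    resid = list(y)  # equals y - X b while b is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) for j in range(p)]
    for _ in range(sweeps):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue
            # Correlation of column j with the residual that excludes b[j].
            rho = sum(X[i][j] * resid[i] for i in range(n)) + col_sq[j] * b[j]
            new_bj = soft_threshold(rho, lam) / col_sq[j]
            delta = new_bj - b[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                b[j] = new_bj
    return b


def thresholded_lasso(X, y, lam, tau):
    """Lasso fit, hard threshold at tau, then least-squares refit on the support."""
    initial = lasso_cd(X, y, lam)
    support = [j for j, bj in enumerate(initial) if abs(bj) > tau]
    b = [0.0] * len(X[0])
    if support:
        X_s = [[row[j] for j in support] for row in X]
        # With lam = 0 the same coordinate descent solves least squares.
        for j, bj in zip(support, lasso_cd(X_s, y, 0.0)):
            b[j] = bj
    return b
```

The refit removes the shrinkage that the L1 penalty imposes on the coefficients that survive thresholding.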
Shuheng Zhou received her Ph.D. from Carnegie Mellon University in August 2006, co-advised by Professors Greg Ganger and Bruce Maggs. Her dissertation work focused on combinatorial optimization problems in network routing. She then continued as a postdoctoral fellow at CMU, working with Professors John Lafferty and Larry Wasserman on statistical and machine learning algorithms and theory. She has been a postdoctoral fellow with the Seminar for Statistics in the Department of Mathematics at ETH Zurich since August 2008. She is currently visiting the Department of Statistics at UC Berkeley, hosted by Professor Bin Yu.
A lot of research in machine learning and data mining is concentrated on building predictive models with the best possible performance. In most cases such models act as black boxes: they make good predictions, but do not provide much insight into the decision making process. This is unsatisfactory for domain scientists who also want to answer questions like: What effects do important features have on the response variable? Which features are involved in complex effects and should be studied only together with some other features? How can we visualize and interpret such complex effects? Separate post-processing techniques are needed to answer these questions.
The term statistical interaction is used to describe the presence of non-additive effects among two or more variables in a function. When variables interact, their effects must be modeled and interpreted simultaneously. Thus, detecting statistical interactions can be critical for an understanding of processes by domain researchers. In this talk I will describe an approach to interaction detection based on comparing the performance of unrestricted and restricted prediction models, where restricted models are prevented from modeling an interaction in question. I will show that an additive model-based regression ensemble, Additive Groves, can be restricted appropriately for use with this framework, and thus has the right properties for accurately detecting variable interactions. Apart from being useful for interaction detection, Additive Groves is a strong predictive model by itself. I will show that it consistently outperforms other state-of-the-art ensembles of trees for regression and yields high performance across a variety of classification problems. In the last part of the talk I will describe several applications of Additive Groves and interaction detection techniques to real data, such as tracking environment change based on bird abundance, Salmonella risk prediction in meat-processing establishments and successful participation in recent KDD and ICDM data mining competitions. All algorithms presented in the talk are implemented as a part of an open source C++ package available at www.cs.cmu.edu/~daria/TreeExtra.htm
Programming robots is hard. While demonstrating a desired behavior may be easy, designing a system that behaves this way is often difficult, time consuming, and ultimately expensive. Machine learning promises to enable "programming by demonstration" for developing high-performance robotic systems. Unfortunately, many approaches that utilize the classical tools of supervised learning fail to meet the needs of imitation learning.
Perhaps foremost, classical statistics and supervised machine learning exist in a vacuum: predictions made by these algorithms are explicitly assumed to not affect the world in which they operate. I'll discuss the problems that result from ignoring the effect of actions influencing the world, and I'll highlight simple "reduction-based" approaches that, both in theory and in practice, mitigate these problems.
Additionally, robotic systems are often built atop sophisticated planning algorithms that efficiently reason far into the future; consequently, ignoring these planning algorithms in favor of a supervised learning approach often leads to poor and myopic performance. While planners have demonstrated dramatic success in applications ranging from legged locomotion to outdoor unstructured navigation, such algorithms rely on fully specified cost functions that map sensor readings and environment models to a scalar cost. Such cost functions are usually manually designed and programmed. Recently, our group has developed a set of techniques that learn these functions from human demonstration by applying an Inverse Optimal Control (IOC) approach to find a cost function for which planned behavior mimics an expert's demonstration. These approaches shed new light on the intimate connections between probabilistic inference and optimal control.
I'll consider case studies in activity forecasting of drivers and pedestrians as well as the imitation learning of robotic locomotion and rough-terrain navigation. These case-studies highlight key challenges in applying the algorithms in practical settings.
The exploration-exploitation tradeoff is crucial to reinforcement-learning (RL) agents, and a number of sample-efficiency results guaranteeing near-optimal behavior with a relatively small amount of exploration have been derived for agents in propositional domains. But these algorithms and results do not cover many traditional AI representations that rely on relational state descriptions (like STRIPS rules). In this talk, we consider sample-efficient RL algorithms for a rich class of representations: relational action schemas. These high-level representations, combined with our sample-efficient algorithms, allow us to specify transitions in a general and compact form that is still efficiently learnable under some conditions, and have important implications in a number of real-world domains described using relational models.
Tom Walsh is a PhD candidate at Rutgers University, advised by Dr. Michael Littman. His research interests include reinforcement learning, relational representations, machine learning and data mining. He completed his undergraduate degree and research at UMBC, and he has also had stints at NIST and Siemens Corporate Research.
Dependence measures between random variables play a fundamental role in many subtasks of machine learning, such as clustering, feature selection, active learning, image registration and more. One of the best-founded measures of dependence is Renyi's mutual information, a generalization of Shannon's mutual information. The question studied here is how to estimate Renyi's mutual information given a finite sample. Here we give a new estimator, show its consistency (under some regularity assumptions) and demonstrate that it is both more efficient and robust than its competitors. For illustration we use examples from image registration and independent subspace analysis. The beauty of the new estimator is that it uses the rank-statistics of the sample only, and connects copulae, information theory and Euclidean graph optimization techniques.
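To illustrate the rank-statistics idea, the following minimal Python sketch maps each coordinate of a sample to its normalized rank (the empirical copula transform). It is only an illustration, assuming no ties within a coordinate; the function name `rank_transform` is hypothetical, and the estimator described in the talk involves further steps (such as the Euclidean graph optimization) that are not shown here.

```python
def rank_transform(sample):
    """Map each coordinate of each observation to its normalized rank in (0, 1].

    sample: list of equal-length rows, one row per observation.
    Assumes no ties within a coordinate.
    """
    n = len(sample)
    dims = len(sample[0])
    out = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Indices of the observations, sorted by their value in coordinate d.
        order = sorted(range(n), key=lambda i: sample[i][d])
        for rank, i in enumerate(order, start=1):
            out[i][d] = rank / n
    return out


if __name__ == "__main__":
    data = [[0.3, 10.0], [0.1, 30.0], [0.2, 20.0]]
    for row in rank_transform(data):
        print(row)
```

Because only ranks are used, the output does not change when a coordinate is passed through any strictly increasing transformation.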
Barnabas Poczos received his M.Sc. degree (summa cum laude) in applied mathematics in 2001 at the Eotvos Lorand University, Budapest, Hungary. He won the first prize in the Hungarian national scientific student competition in Informatics in 2001. From 2005 to 2007 he was an assistant lecturer at the Eotvos Lorand University, where he received his Ph.D. (summa cum laude) in computing science in 2007. His thesis was about independent subspace analysis, a multidimensional generalization of the independent component analysis problem. He is currently a Postdoctoral Fellow at the Department of Computing Science of the University of Alberta under the supervision of Dr. Csaba Szepesvari. His areas of expertise include machine learning, unsupervised learning, Bayesian methods, active learning, bioinformatics, independent component analysis, and information and entropy estimation.
In recent years, the blogosphere has experienced a substantial increase in the number of posts published daily, forcing users to cope with information overload. The task of guiding users through this flood of information has thus become critical. To address this issue, we present a principled approach for picking a set of posts that best covers the important stories in the blogosphere.
We define a simple and elegant notion of coverage and formalize it as a submodular optimization problem, for which we can efficiently compute a near-optimal solution. In addition, since people have varied interests, the ideal coverage algorithm should incorporate user preferences in order to tailor the selected posts to individual tastes. We define the problem of learning a personalized coverage function by providing an appropriate user-interaction model and formalizing an online learning framework for this task. We then provide a no-regret algorithm which can quickly learn a user's preferences from limited feedback.
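As a rough illustration of why a submodular formulation is convenient, the sketch below runs the standard greedy algorithm on a simplified set-cover notion of coverage. For monotone submodular objectives under a cardinality constraint, greedy selection is known to achieve at least a (1 - 1/e) fraction of the optimal value. The data layout (a dict from post id to the set of stories it covers) and the function name `greedy_cover` are assumptions made for this example; the coverage function used in the talk may differ, and personalization is not modeled here.

```python
def greedy_cover(posts, k):
    """Greedily pick up to k posts that maximize the number of covered stories.

    posts: dict mapping a post id to the set of story ids it covers
           (a hypothetical, simplified coverage model).
    Returns (chosen post ids in selection order, set of covered stories).
    """
    covered = set()
    chosen = []
    for _ in range(k):
        best, best_gain = None, 0
        for pid in sorted(posts):  # sorted for deterministic tie-breaking
            if pid in chosen:
                continue
            gain = len(posts[pid] - covered)  # marginal coverage gain
            if gain > best_gain:
                best, best_gain = pid, gain
        if best is None:  # no remaining post adds new coverage
            break
        chosen.append(best)
        covered |= posts[best]
    return chosen, covered


if __name__ == "__main__":
    example = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
    print(greedy_cover(example, 2))
```

The marginal gain of a post can only shrink as more posts are selected, which is the diminishing-returns property that makes the objective submodular.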
Mixed integer linear programming (MILP) is a powerful representation often used to formulate decision-making problems under uncertainty. However, it lacks a natural mechanism to reason about objects, groups of objects, and relations. First-order logic (FOL), on the other hand, excels at reasoning about groups of objects, but lacks a natural representation of uncertainty. While representing propositional logic in MILP has been extensively explored, no theory exists yet for fully combining FOL with MILP. We propose a new representation, called first-order programming or FOP, which fully subsumes both FOL and MILP. We establish formal methods for reasoning about first order programs, including a sound and complete lifted inference procedure based on Gomory cuts. Since FOP can offer exponential savings in representation size compared to FOL, and since our inference procedure can directly duplicate any FOL resolution proof, we anticipate that inference in FOP will be more tractable than inference in FOL for corresponding problems.
There has been an explosion of interest in statistical models for analyzing network data, and considerable interest in the class of exponential random graph (ERG) models.
In this talk I will relate the properties of ERG models to the properties of the broader class of discrete exponential families. I will describe a general geometric result about discrete exponential families with polyhedral support. I will show how the properties of these families can be well captured by some fundamental geometric objects of polyhedral form.
I will discuss the relevance of such results to maximum likelihood estimation, both from a theoretical and computational standpoint. I will then apply these results to the analysis of ERG models. By means of a detailed example, I will provide some characterization and a partial explanation of certain pathological features of ERG models known as degeneracy.
Joint work with S.E. Fienberg and Y. Zhou
Different representation languages are useful for capturing different inductive biases. What is the language that best captures human inductive biases? I will present several models and studies which suggest that human knowledge is mentally represented in a logical language, and that human learning can be characterized in terms of probabilistic inference over these logical representations. The applications presented will include category learning, relational learning, and one-shot learning.
Inferring spatially co-located regions of interest is an important problem in several applications, such as identifying activation regions in the brain or contamination regions in environmental monitoring. In this talk, I will present multi-resolution methods for passive and active learning of sets that aggregate data at appropriate resolutions, to achieve optimal bias and variance tradeoffs for set estimation. In the passive setting, we observe some data such as a noisy fMRI image of the brain and then extract the regions with statistically significant brain activity. The active setting, on the other hand, involves feedback where the location of an observation is decided based on the data observed in the past. This can be used for rapid extraction of set estimates, such as a contamination region in environmental monitoring, by designing data-adaptive spatial survey paths for a mobile sensor. I will describe a method that uses information gleaned from coarse surveys to focus sampling around informative regions (boundaries), thus generating successively refined multi-resolution set estimates. I will also discuss some current research directions which aim at efficient extraction of spatially distributed sets of interest by exploiting non-local dependencies in the data.
Human-computer interaction researchers use diverse methods, from eye-tracking to ethnography, to understand human activity, and machine learning is growing in popularity as a method within the community. This talk will survey projects from HCI researchers at CMU that apply machine learning, combined with other techniques, to problems such as adapting interfaces to individuals with motor impairments, predicting routines in dual-income families, classifying controversial Wikipedia articles, and identifying rhetorical strategies newcomers use in online support groups that elicit responses. Researchers without strong computational backgrounds can face practical challenges as consumers of machine learning tools; this talk will highlight opportunities for tool design and collaboration across communities.
A machine learning approach to learning to rank trains a model to optimize a target evaluation measure with respect to training data. Currently, existing information retrieval measures are impossible to optimize directly except for models with a very small number of parameters. This poses a major challenge: how can IR measures of interest be optimized directly? We have shown that LambdaRank, which smoothly approximates the gradient of the target measure, can be adapted to work with four popular IR target evaluation measures using the same underlying gradient construction. It is likely, therefore, that this construction is extendable to other evaluation measures. We empirically show that LambdaRank finds a locally optimal solution for mean NDCG@10, mean NDCG, MAP and MRR with a 99% confidence rate. We also show that the amount of effective training data varies with IR measure and that with a sufficiently large training set size, matching the training optimization measure to the target evaluation measure yields the best accuracy. In this talk, I will first review LambdaRank and then present the local optimality testing and results.
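For readers unfamiliar with the measures named above, the sketch below computes NDCG@k for one ranked list. It assumes the commonly used gain 2^rel - 1 and a log2 position discount; the abstract does not state which variant is used, and the function names are hypothetical. It does not implement LambdaRank itself.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain of the first k items.

    relevances: graded relevance labels in ranked order (best-scored first).
    Uses gain 2^rel - 1 and discount log2(position + 1), one common convention.
    """
    return sum((2 ** rel - 1) / math.log2(i + 2)
               for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(ndcg_at_k([3, 2, 1, 0], 10))  # already ideally ordered
    print(ndcg_at_k([0, 1], 2))         # relevant item ranked second
```

The value depends on model scores only through the sort order, so it is piecewise constant in the scores. That is one way to see why such measures are hard to optimize directly with gradient methods.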
We consider the problem of learning structured-output regression, where the output consists of multiple response variables that are related by a graph and correlated response variables are dependent on the input variables in a sparse but synergistic fashion, rather than independently. Such problems can be found in genetic investigation of causal DNA variations that lead to perturbations of multiple related genes in a module, or in engineering problems, where complex responses must be jointly mapped to causal factors to overcome weak statistical power in independent analysis of individual responses. We propose graph-guided fused lasso methods, which incorporate couplings among response variables on top of the standard lasso penalty for overall sparsity. Our approach represents the dependency structure among the output variables explicitly as a network, and leverages this network to encode structured regularizations in a multivariate regression model, so that the inputs that jointly influence subgroups of highly correlated outputs can be detected with high sensitivity and specificity. Experimental comparisons with standard lasso are performed on both simulated and real data, and our results show that there is a significant advantage in detecting the true relevant inputs when we incorporate the correlation pattern in outputs using our proposed methods.
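One way such a penalty could be written, sketched here under stated assumptions, is a lasso term plus a fusion term over the edges of the output graph: lam * sum |B[j][k]| + gamma * sum over edges (m, l) of w(r_ml) * sum_j |B[j][m] - sign(r_ml) * B[j][l]|. The edge weight w(r) = |r|, the argument layout, and the function name below are assumptions for illustration; the abstract does not spell out the exact form.

```python
def graph_guided_penalty(B, edges, lam, gamma):
    """Evaluate a lasso-plus-fusion penalty on a coefficient matrix.

    B:     list of rows; B[j][k] is the coefficient of input j for output k.
    edges: list of (m, l, r) where outputs m and l are linked in the graph
           and r is their correlation (hypothetical input format).
    lam:   weight of the overall sparsity (lasso) term.
    gamma: weight of the fusion term that couples linked outputs.
    """
    # Standard lasso term: encourages overall sparsity.
    l1 = sum(abs(b) for row in B for b in row)

    # Fusion term: for each edge, pull each input's coefficients for the two
    # linked outputs together (or toward opposite signs if r is negative).
    fusion = 0.0
    for m, l, r in edges:
        sign = 1.0 if r >= 0 else -1.0
        fusion += abs(r) * sum(abs(row[m] - sign * row[l]) for row in B)

    return lam * l1 + gamma * fusion


if __name__ == "__main__":
    edges = [(0, 1, 0.5)]
    print(graph_guided_penalty([[1.0, 1.0]], edges, lam=1.0, gamma=2.0))
    print(graph_guided_penalty([[1.0, -1.0]], edges, lam=1.0, gamma=2.0))
```

In the example, an input whose coefficients agree across two positively correlated outputs pays only the lasso cost, while one whose coefficients disagree also pays the fusion cost.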
We propose a new, recursive model to generate realistic graphs that evolve over time. Our model has the following properties: it is (a) flexible, capable of generating the cross product of weighted/unweighted, directed/undirected, uni/bipartite graphs; (b) realistic, giving graphs that obey eleven static and dynamic laws that real graphs follow (we formally prove this for several of the (power) laws and estimate their exponents as a function of the model parameters); (c) parsimonious, requiring only four parameters; (d) fast, being linear in the number of edges; (e) simple, intuitively leading to the generation of macroscopic patterns. We empirically show that our model mimics two real-world graphs very well: Blognet (unipartite, undirected, unweighted) with 27K nodes and 125K edges; and Committee-to-Candidate campaign donations (bipartite, directed, weighted) with 23K nodes and 880K edges. We also show how to handle time so that edge/weight additions are bursty and self-similar.
Despite the widespread use of clustering, we are only beginning to understand the general unifying principles behind clustering functions. Questions like "Are there any principles governing all clustering paradigms?" and "How should a user choose an appropriate clustering algorithm for a particular task?" are almost completely unanswered by the existing body of clustering literature. We consider an axiomatic approach to the theory of clustering, adopting the framework of Kleinberg. By relaxing one of Kleinberg's clustering axioms, we sidestep his impossibility result and arrive at a consistent set of axioms. We use this set of axioms to put forward a theory of clustering that helps answer these questions.
We present a cyclical blockwise coordinate descent algorithm for the multi-task Lasso that efficiently solves problems with thousands of features and tasks. The main result shows that a closed-form Winsorization operator can be obtained for the sup-norm penalized least squares regression. This allows the algorithm to find solutions to very large-scale problems far more efficiently than existing methods. This result complements the pioneering work of Friedman, et al. (2007) for the single-task Lasso. As a case study, we use the multi-task Lasso as a variable selector to discover a semantic basis for predicting human neural activation. The learned solution outperforms the standard basis for this task on the majority of test participants, while requiring far fewer assumptions about cognitive neuroscience. We demonstrate how this learned basis can yield insights into how the brain represents the meanings of words.
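For intuition about what a Winsorization-style operator looks like, the sketch below computes the proximal operator of the sup-norm penalty, argmin_b 0.5 * ||b - v||^2 + lam * max_i |b_i|, for a single coefficient vector. By the Moreau decomposition this equals v minus its projection onto the L1 ball of radius lam, which amounts to clipping the entries of v at a data-dependent threshold t. This is a simplified stand-in that assumes an identity design and lam > 0; it is not necessarily the operator derived in the talk, and the function name is hypothetical.

```python
import math


def prox_sup_norm(v, lam):
    """Proximal operator of lam * max_i |b_i| evaluated at v (requires lam > 0).

    Returns v with every entry clipped to [-t, t], where t >= 0 satisfies
    sum_i max(|v_i| - t, 0) == lam; returns all zeros if sum_i |v_i| <= lam.
    """
    abs_v = [abs(x) for x in v]
    if sum(abs_v) <= lam:
        return [0.0] * len(v)

    # Find the clipping threshold by scanning magnitudes in descending order,
    # as in the usual sort-based projection onto the L1 ball.
    running = 0.0
    t = 0.0
    for k, a in enumerate(sorted(abs_v, reverse=True), start=1):
        running += a
        candidate = (running - lam) / k
        if a > candidate:
            t = candidate

    # Winsorize: shrink large magnitudes down to t, leave small ones unchanged.
    return [math.copysign(min(abs(x), t), x) for x in v]


if __name__ == "__main__":
    print(prox_sup_norm([3.0, 1.0], 1.0))
    print(prox_sup_norm([3.0, -3.0], 2.0))
```

Only the largest-magnitude entries are moved, and they all end up at the same magnitude. This is what ties coefficients together across tasks under a sup-norm penalty.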