# ML

Hamiltonian Monte Carlo (HMC) is a powerful approach to Markov Chain Monte Carlo, since it generates successive samples with low correlation even for distributions in high dimensions and with complex nonlinear support. For some target densities, however, HMC cannot be applied, since the likelihood and its gradients are not computable in closed form (for instance, in Gaussian process classification). On target densities where classical HMC is not an option, KMC adaptively learns the target's gradient structure by fitting an infinite dimensional exponential family model in a Reproducing Kernel Hilbert Space. Our talk addresses two topics: first, we describe the properties of the exponential family model, show consistency for a wide class of target densities, and provide convergence rates under smoothness assumptions. Second, we demonstrate two strategies to approximate the gradient of this model efficiently, and show how these approximate gradients may be used in constructing an adaptive Hamiltonian Monte Carlo sampler. We establish empirical performance of the kernel adaptive HMC sampler with experimental studies on both toy and real-world applications, including exact-approximate MCMC for Gaussian process classification.

—

Arthur Gretton is a Reader (Associate Professor) with the Gatsby Computational Neuroscience Unit, University College London. Arthur's research interests include machine learning, kernel methods, statistical learning theory, nonparametric hypothesis testing, nonparametric density estimation, and techniques for neural data analysis. He has been an Action Editor for JMLR since April 2013, an associate editor at IEEE Transactions on Pattern Analysis and Machine Intelligence from 2009 to 2013, a member of the NIPS Program Committee in 2008 and 2009, an Area Chair for ICML in 2011 and 2012, and a member of the COLT Program Committee in 2013. Arthur was co-chair of AISTATS in 2016. He received degrees in physics and systems engineering from the Australian National University, and a PhD with Microsoft Research and the Signal Processing and Communications Laboratory at the University of Cambridge.

Statistical Relational Learning (SRL) Models combine the powerful formalisms of probability theory and first-order logic to handle uncertainty in large, complex problems. While they provide a very effective representation paradigm due to their succinctness and parameter sharing, efficient learning is a significant problem in these models. First, I will discuss state-of-the-art learning method based on boosting that is representation independent. Our results demonstrate that learning multiple weak models can lead to a dramatic improvement in accuracy and efficiency.

One of the key attractive properties of SRL models is that they use a rich representation for modeling the domain that potentially allows for seam-less human interaction. However, in current SRL research, the human is restricted to either being a mere labeler or being an oracle who provides the entire model. I will present our recent work that allows for more reasonable human interaction where the human input is taken as “advice” and the learning algorithm combines this advice with data. Finally, I will discuss our work on employing SRL models for achieving transfer across seemingly unrelated domains allowing for more efficient learning in data-scarce domains.

—

Sriraam Natarajan is an Assistant professor at Indiana University. He was previously an Assistant Professor at Wake Forest School of Medicine, a post-doctoral research associate at University of Wisconsin-Madison and had graduated with his PhD from Oregon State University. His research interests lie in the field of Artificial Intelligence, with emphasis on Machine Learning, Statistical Relational Learning and AI, Reinforcement Learning, Graphical Models and Biomedical Applications. He has received the Young Investigator award from US Army Research Office. He is the organizer of the key workshops in the field of Statistical Relational Learning and has co-organized the AAAI 2010, the UAI 2012, AAAI 2013 and AAAI 2014 workshops on Statistical Relational AI (StarAI), ICML 2012 Workshop on Statistical Relational Learning, and the ECML PKDD 2011 and 2012 workshops on Collective Learning and Inference on Structured Data (Co-LISD). He was also the co-chair of the AAAI student abstract and posters at AAAI 2014 and AAAI 2015.

**Faculty Host:** Tom Mitchell

The desired output in many machine learning tasks is a structured object such as a tree, a clustering of nodes, or a sequence. Learning accurate prediction models for such problems requires training on large amounts of data, making use of expressive features and performing global inference that simultaneously assigns values to all interrelated nodes in the structure. All these contribute to significant scalability problems. We describe a collection of results that address several aspects of these problems – by carefully selecting and caching samples, structures, or latent items. Our results lead to efficient learning algorithms for structured prediction models and for online clustering models which, in turn, support reduction in problem size, improvements in training and evaluation speed and improved performance. We have used our algorithms to learn expressive models from large amounts of annotated data and achieve state-of-the art performance on several natural language processing tasks.

—

Kai-Wei Chang is a doctoral candidate advised by Prof. Dan Roth in the Department of Computer Science, University of Illinois at Urbana-Champaign. His research interests lie in designing practical machine learning techniques for large and complex data and applying them to real world applications. He has been working on various topics in Machine learning and Natural Language Processing, including large-scale learning, structured learning, coreference resolution, and relation extraction. Kai-Wei was awarded the KDD Best Paper Award (2010), the Yahoo! Key Scientific Challenges Award (2011), and the C.L. and Jane W-S. Liu Award (2013) from the Department of Computer Science at UIUC. He was one of the main contributors of a popular machine learning library, LIBLINEAR.

**Faculty Host:** Tom Mitchell

Topic Modeling is widely used. The general model inference problem is hard. Known provable algorithms need some strong assumptions on the model.

After a from-first-principles introduction, the talk will formulate a model with two natural assumptions: (i) each document has a dominant topic and (ii) each topic has dominant words. We provably solve the model fitting problem. The algorithm is based on three natural components: a thresholding step, Singular Value Decomposition and clustering. The proof crucially draws upon recent results on the convergence of another widely used tool: Lloyd's k-means algorithm. We test both the performance of the algorithm and the validity of the assumptions on real document corpora with success. The simple device of thresholding has other uses - we will see an exponential advantage for certain \planted" problems in terms of the signal-to-noise ratio.

*Joint with T. Bansal and C. Bhattacharyya, Indian Institute of Science.*

—

Ravindran (Ravi) Kannan is Principal Researcher in the Algorithms Research Group at Microsoft Research Bangalore. Previously he was a professor at CMU, MIT, and Yale, where he was the William Lanman Professor of Computer Science. His research areas span Algorithms, Optimization and Probability.

He is widely known for introducing several groundbreaking techniques in theoretical computer science, notably in the algorithmic geometry of numbers, sampling and volume computation in high dimension, and algorithmic linear algebra. He received the Knuth Prize in 2011, and the Fulkerson Prize in 1992. He is a distinguished alumnus of IIT Bombay.

**Faculty Host:** Nina Balcan

We consider parallel derivative-free global optimization of expensive-to-evaluate functions. We present a new decision-theoretic algorithm for this problem, which places a Bayesian prior distribution on the objective function, and chooses the set of points to evaluate next that provide the largest value of the information. This decision-theoretic approach was previously proposed by Ginsbourger and co-authors in 2008, but was deemed too difficult to actually implement in practice. Using stochastic approximation, we provide a practical algorithm implementing this approach, and demonstrate that it provides a significant speedup over the single-threaded expected improvement algorithm. We then describe how Yelp, the online business review company, uses this algorithm to optimize the content that their users see. An open source implementation, called the Metrics Optimization Engine (MOE), was co-developed with engineers at Yelp and is available at github.com/yelp/MOE.

—

Peter I. Frazier is an assistant professor in the School of Operations Research and Information Engineering at Cornell University, and received a Ph.D. in Operations Research and Financial Engineering from Princeton University in 2009. He is the recipient of an AFOSR Young Investigator Award, and an NSF CAREER Award. He is an associate editor for Operations Research, ACM Transactions on Modeling and Computer Simulation and IIE Transactions. His research interest is in dynamic programming and Bayesian statistics, focusing on the optimal acquisition of information and sequential design of experiments. He works on applications in simulation, optimization, operations management, medicine, and materials science.

**Faculty Host: ** Jeff Schneider

**Presented by the BRAIN Lab at Bloomberg (AI, NLP, and Knowledge)**

The problem of sharing reproducing extending , and using research plagues almost anyone who attempts to contribute to or leverage the state-of-the-art in AI. We waste hours, days, weeks, and months attempting to recreate results which should be available to us in minutes, let alone extending them and running our own experiments. In collaboration with leading minds in the field, we have developed and actively use NLP<Go> to alleviate these problems, bring order to our ideas, and discipline to our tasks. NLP<Go> is a highly scalable and flexible open framework that allows the modular development of complex pipelines in AI and NLP, and already implements many core algorithms. This talk reveals the why and the how. We want to invite and encourage your participation in this community, to make everyone's work more transparent, accessible, and valuable.

This is the first announcement of the NLP<Go> beta release.

*—*

This event will be followed by an optional hands-on workshop on Thursday, Feb. 12 from 10:00 am to 1:00 pm in Gates Hillman 8102, aimed at increasing the number of research contributions compatible with the project.If you are interested in the workshop (highly recommended) please provide your github username when confirming your attendance.

*With:* *James Hodson, Prabhanjan Kambadur, Shefaet Rahman*

—

Please register for the Workshop by contacting Estevam Hrushka and Sharon Cavlovich with your name and github username. If you need to register for github, please visit their site.

We motivate and develop an additive autoregressive hidden Markov model specifically designed to work on the task of energy disaggregation; that is, separating a whole-building electricity signal into its component device signals whose sum is the aggregate signal observed by a smart meter. This model assumes each device in the building operates as an individual autoregressive HMM, where hidden states represent the underlying power mode of the device and Gaussian emissions correspond to that device's power consumption. The additive property models the observed output (whole-building power signal) as the sum of the emissions of multiple hidden states (i.e., as the sum of individual consumptions of multiple devices in the building). The autoregressive property realistically models how many appliances consume energy. Finally, our model also includes a robust mixture component, via an L1-regularized noise term, that can absorb outliers arising in this setting from unknown or rarely-used devices. We extract the power signals and underlying state sequences of single devices in a stagewise fashion and illustrate the results of this process on the Reference Energy Disaggregation Dataset (REDD).

**Faculty Advisor:** Zico Kolter

The high-dimensional data setting is a challenging paradigm that appears in many real-world problems. In this scenario, leveraging multiple related sources or "views" of data in a principled manner can substantially help distinguish signal from noise. One approach to handling multi-view data is to perform subspace learning to estimate a set of latent features that capture most of the variance across all views. Existing subspace learning models typically assume that the data can be fully represented by its embedding in one or more latent subspaces. In this work, we argue that this assumption is not suitable for many high-dimensional datasets in which only some variables can easily be projected to a low-dimensional space while others cannot. Instead, we propose a hybrid dimensionality reduction technique in which some features are mapped to a low-dimensional subspace common to all views, while others remain in their original space. Our model leads to more accurate estimation of the latent space and lower reconstruction error. We present a simple optimization procedure for the resulting biconvex problem and show synthetic and real data results that demonstrate the advantages of our approach over existing methods.

**Advisor:** Erix P. Xing

Transfer of learning, which is the extent that one is able to apply prior experience and knowledge in a new but similar situation, has been studied extensively since the beginning of the last century. Recent brain imaging studies have provided new insight into how students are able to extend their previous problem solving skills to new but similar problems. It was still unclear, however, what the basis is of individual differences in their success at transfer. In this study, 75 participants had been trained to solve a set of mathematical problems before they were put into the fMRI scanner, where they were challenged to solve modified version of familiar problems. A hidden semi-Markov Model identified the sequential structure of thought when solving the problems. Analyzing the patterns of brain activity over the sequence of states identified by the model, we observed that participants who showed consistent brain patterns in solving different problems performed better. In particular, early consistency was predictive of later performance in the experiment. In addition to looking at activity across all brain voxels and in pre-defined brain regions, a data-driven approach over the whole brain was carried out as an alternative analysis that verified the same observation.

**Advisors:**

John Anderson

Rob Kass

When scaling up Reinforcement Learning (RL) to large continuous domains with imperfect representations and hierarchical structure, we often try applying algorithm that are proven to converge in small finite domains, and then just hope for the best. This talk will advocate instead designing algorithms that adhere to the constraints, and indeed take advantage of the opportunities, that might come with the problem at hand.

Drawing on several different research threads within the Learning Agents Research Group at UT Austin, I will touch on four types of issues that arise from these constraints and opportunities: 1) Representation - choosing the algorithm for the problem's representation and adapting the representation to fit the algorithm; 2) Interaction - with other agents and with human trainers; 3) Synthesis - of different algorithms for the same problem and of different concepts in the same algorithm; and 4) Mortality - dealing with the constraint that when the environment is large relative to the number of action opportunities available, one cannot explore exhaustively.

Within this context, I will focus on two specific RL approaches, namely the TEXPLORE algorithm for real-time sample-efficient reinforcement learning for robots; and layered learning, a hierarchical machine learning paradigm that enables learning of complex behaviors by incrementally learning a series of sub-behaviors. TEXPLORE has been implemented and tested on a full-size fully autonomous robot car, and layered learning was the key deciding factor in our RoboCup 2014 3D simulation league championship.

—

Dr. Peter Stone is an Alfred P. Sloan Research Fellow, Guggenheim Fellow, AAAI Fellow, Fulbright Scholar, and University Distinguished Teaching Professor in the Department of Computer Science at the University of Texas at Austin. He received his Ph.D in Computer Science in 1998 from Carnegie Mellon University. From 1999 to 2002 he was a Senior Technical Staff Member in the Artificial Intelligence Principles Research Department at AT&T Labs - Research. Peter's research interests include machine learning, multiagent systems, robotics, and e-commerce. In 2003, he won a CAREER award from the National Science Foundation for his research on learning agents in dynamic, collaborative, and adversarial multiagent environments. In 2004, he was named an ONR Young Investigator for his research on machine learning on physical robots. In 2007, he was awarded the prestigious IJCAI 2007 Computers and Thought award, given once every two years to the top AI researcher under the age of 35. In 2013 he was awarded the University of Texas System Regents' Outstanding Teaching Award and in 2014 he was inducted into the UT Austin Academy of Distinguished Teachers.

**Faculty Host: **Emma Brunskill