// about

Project Scientist in the SAILING Lab at Carnegie Mellon University, working with Eric Xing in the Machine Learning Department.

I am interested in problems at the intersection of high-dimensional statistics and machine learning, with a focus on developing algorithms, theory, and software for applications in computational biology and precision medicine. I am particularly interested in problems with non-i.i.d., nonconvex, heterogeneous, and hierarchical structures.

For more details on my projects, please see my research page.

// research interests

  • Nonparametric unsupervised learning
  • Network inference and graphical models
  • Statistical modeling for precision medicine
  • Confounding and heterogeneity in GWAS

// recent work

preprint

Motivated by problems in data clustering, we establish general conditions under which families of nonparametric mixture models are identifiable, by introducing a novel framework for clustering overfitted parametric (i.e. misspecified) mixture models. These conditions generalize existing conditions in the literature, and are flexible enough to include, for example, mixtures of Gaussian mixtures. In contrast to the recent literature on estimating nonparametric mixtures, we allow for general nonparametric mixture components, and instead impose regularity assumptions on the underlying mixing measure. As our primary application, we apply these results to partition-based clustering, generalizing the well-known notion of a Bayes optimal partition from classical model-based clustering to nonparametric settings. Furthermore, this framework is constructive in that it yields a practical algorithm for learning identified mixtures, which is illustrated through several examples. The key conceptual device in the analysis is the convex, metric geometry of probability distributions on metric spaces and its connection to optimal transport and the Wasserstein convergence of mixing measures. The result is a flexible framework for nonparametric clustering with formal consistency guarantees.

Keywords: Nonparametric statistics, mixture models, clustering, identifiability, optimal transport

journal / preprint / code

In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, many traditional methods for genomic analysis seek to estimate a single model that is shared by all samples in a population, ignoring this inter-sample heterogeneity entirely. To better understand patient heterogeneity, it is necessary to develop practical, personalized statistical models. To uncover this inter-sample heterogeneity, we propose a novel regularizer for patient-specific, personalized estimation. This regularizer operates by learning latent distance metrics between personalized parameters and clinical covariates, and attempting to match these distances as closely as possible. Crucially, we do not assume these distances are already known. Instead, we allow the data to dictate the structure of these latent distance metrics. Finally, we apply our method to learn patient-specific, interpretable models for a pan-cancer gene expression dataset containing samples from more than 30 distinct cancer types, and find strong evidence of personalization effects between cancer types as well as between individuals. Our analysis uncovers sample-specific aberrations that are overlooked by population-level methods, suggesting a promising new path for precision analysis of complex diseases such as cancer.

Keywords: Precision medicine, personalized regression, patient-specific modeling, distance-matching, TCGA

preprint / code

Learning graphical models from data is an important problem with wide applications, ranging from genomics to the social sciences. Nowadays datasets often have upwards of thousands---sometimes tens or hundreds of thousands---of variables and far fewer samples. To meet this challenge, we have developed a new R package called sparsebn for learning the structure of large, sparse graphical models with a focus on Bayesian networks. While there are many existing software packages for this task, this package focuses on the unique setting of learning large networks from high-dimensional data, possibly with interventions. As such, the methods provided place a premium on scalability and consistency in a high-dimensional setting. Furthermore, in the presence of interventions, the methods implemented here achieve the goal of learning a causal network from data. Additionally, the sparsebn package is fully compatible with existing software packages for network analysis.

Keywords: R, software, graphical modeling, directed acyclic graphs, structural equations

preprint

We study a family of regularized score-based estimators for learning the structure of a directed acyclic graph (DAG) for a multivariate normal distribution from high-dimensional data with p >> n. Our main results establish support recovery guarantees and deviation bounds for a family of penalized least-squares estimators under concave regularization, without assuming prior knowledge of a variable ordering. These results apply to a variety of practical situations that allow for arbitrary nondegenerate covariance structures, as well as many popular regularizers including the MCP, SCAD, L0, and L1. The proof relies on interpreting a DAG as a recursive linear structural equation model, which reduces the estimation problem to a series of neighbourhood regressions. We provide a novel statistical analysis of these neighbourhood problems, establishing uniform control over the superexponential family of neighbourhoods associated with a Gaussian distribution. We then apply these results to study the statistical properties of score-based DAG estimators, learning causal DAGs, and inferring conditional independence relations via graphical models. Our results yield---for the first time---finite-sample guarantees for structure learning of Gaussian DAGs in high dimensions via score-based estimation.

Keywords: Graphical modeling, high-dimensional statistics, concave regularization, directed acyclic graphs, structural equations, sparse regression

// research

// general interests

Broadly speaking, I work on statistical theory and algorithms for the following topics and problems:

  • Nonparametric unsupervised learning
    Nonparametric mixture models / identifiability / data clustering / unsupervised classification

  • Network inference and graphical models
    DAG learning / structural equation models / causal inference / gene networks / nonconvex optimization

  • Statistical modeling for precision medicine
    Sample-specific models / hierarchical inference / statistical theory for precision medicine

  • Confounding and heterogeneity in GWAS
    Mixed models / population structure / high-dimensional regression / model selection

More generally, I am also interested in open-source software development, reproducible research, and ethics and accountability in ML and AI. My work has been applied to both academic and industry problems.

// projects

  • Personalized and pan-cancer modeling.
    What makes personalized statistical models different from traditional statistical models? How do we overcome the loss of power when learning models that are specific to a single sample, patient, or customer? Our recent work on pan-cancer modeling provides a general framework for addressing some of these challenges.

  • Learning DAGs from data.
    Learning directed acyclic graphs (DAGs) is a notoriously difficult problem. We have developed several fast algorithms for learning DAGs from data based on block coordinate descent, black-box optimization, and neighbourhood regression. We have also released the open-source sparsebn project to make these methods available to the public.

  • Nonparametric unsupervised learning.
    Our work in unsupervised learning has developed new theory and tools for uncovering hidden structure in unlabeled data with limited samples. We have established new identifiability and estimability results for nonparametric mixture models and applied these to problems in data clustering.

  • Population stratification in GWAS.
    Contemporary genome-wide association studies are plagued by the non-i.i.d. and confounded nature of modern genetic data. To overcome these issues, mixed models have emerged as a promising tool. Nonetheless, there remain many practical challenges in their implementation and interpretation. Our work in this area seeks to provide statistical insight and practical algorithms for modern GWAS.

// publications

// preprints

Estimating the structure of directed acyclic graphs (DAGs, also known as Bayesian networks) is a challenging problem, since the search space of DAGs is combinatorial and scales superexponentially with the number of nodes. Existing approaches rely on various local heuristics for enforcing the acyclicity constraint---greedy search, branch-and-bound, etc.---and are not well-suited to solution by general-purpose optimization packages. In this paper, we introduce a fundamentally different strategy: we formulate the structure learning problem as a smooth, constrained optimization problem over real matrices that avoids this combinatorial constraint entirely. This is achieved by a novel characterization of acyclicity that is not only smooth but also exact. The resulting nonconvex, constrained program involves smooth functions whose gradients are easy to compute and only involve elementary matrix operations. Using existing black-box optimization routines, our method performs a global search for an optimal DAG; it can be implemented in about 50 lines of Python and outperforms existing methods without imposing any structural constraints.

Keywords: Directed acyclic graphs, Bayesian networks, constrained optimization, nonconvex optimization, augmented Lagrangian, black-box

Link to arXiv
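
For orientation, here is a sketch of the kind of smooth, exact acyclicity characterization the abstract refers to. The trace-exponential function below is the standard device in this line of work; it is stated here for intuition rather than quoted from the paper, so treat the details as illustrative:

    \[
      h(W) \;=\; \operatorname{tr}\!\big(e^{\,W \circ W}\big) - d \;\ge\; 0,
      \qquad
      h(W) = 0 \;\Longleftrightarrow\; \text{the graph of } W \text{ is acyclic},
    \]

where $W \in \mathbb{R}^{d \times d}$ is a weighted adjacency matrix and $\circ$ is the elementwise product. Since $\operatorname{tr}\big((W \circ W)^m\big)$ sums the weights of closed walks of length $m$, the exponential series detects cycles of every length at once. Structure learning then becomes the smooth, equality-constrained program $\min_W F(W)$ subject to $h(W) = 0$, with $F$ a smooth score such as least squares, which is exactly the form that augmented Lagrangian and other black-box solvers expect.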

Motivated by problems in data clustering, we establish general conditions under which families of nonparametric mixture models are identifiable, by introducing a novel framework for clustering overfitted parametric (i.e. misspecified) mixture models. These conditions generalize existing conditions in the literature, and are flexible enough to include, for example, mixtures of Gaussian mixtures. In contrast to the recent literature on estimating nonparametric mixtures, we allow for general nonparametric mixture components, and instead impose regularity assumptions on the underlying mixing measure. As our primary application, we apply these results to partition-based clustering, generalizing the well-known notion of a Bayes optimal partition from classical model-based clustering to nonparametric settings. Furthermore, this framework is constructive in that it yields a practical algorithm for learning identified mixtures, which is illustrated through several examples. The key conceptual device in the analysis is the convex, metric geometry of probability distributions on metric spaces and its connection to optimal transport and the Wasserstein convergence of mixing measures. The result is a flexible framework for nonparametric clustering with formal consistency guarantees.

Keywords: Nonparametric statistics, mixture models, clustering, identifiability, optimal transport

Link to arXiv
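
In symbols, the basic object can be sketched as follows (the notation is mine, chosen for illustration):

    \[
      P \;=\; \sum_{k=1}^{K} \lambda_k P_k,
      \qquad \lambda_k > 0, \quad \sum_{k=1}^{K} \lambda_k = 1,
    \]

where the components $P_k$ are general probability measures rather than members of a fixed parametric family. Identifiability asks when the mixing measure $\Gamma = \sum_{k} \lambda_k \delta_{P_k}$ is uniquely determined by the mixture $P$, and convergence of estimated mixing measures is quantified in Wasserstein distance, which is what connects the analysis to optimal transport.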

We define and study partial correlation graphs (PCGs) with variables in a general Hilbert space and their connections to generalized neighborhood regression, without making any distributional assumptions. Using operator-theoretic arguments, and especially the properties of projection operators on Hilbert spaces, we show that these neighborhood regressions have the algebraic structure of a lattice, which we call a neighborhood lattice. This lattice property significantly reduces the number of conditions one has to check when testing all partial correlation relations among a collection of variables. In addition, we generalize the notion of perfectness in graphical models for a general PCG to this Hilbert space setting, and establish that almost all Gram matrices are perfect. Under this perfectness assumption, we show how these neighborhood lattices may be "graphically" computed using separation properties of PCGs. We also discuss extensions of these ideas to directed models, which present unique challenges compared to their undirected counterparts. Our results have implications for multivariate statistical learning in general, including structural equation models, subspace clustering, and dimension reduction. For example, we discuss how to compute neighborhood lattices efficiently and furthermore how they can be used to reduce the sample complexity of learning directed acyclic graphs. Our work demonstrates that this abstract viewpoint via projection operators significantly simplifies existing ideas and arguments from the graphical modeling literature, and furthermore can be used to extend these ideas to more general nonparametric settings.

Keywords: Partial correlation graphs, graphical modeling, neighborhood regression, partial orthogonality, projection operators, Hilbert spaces

Link to arXiv
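
For readers less familiar with the operator-theoretic viewpoint, the basic construction is standard and can be sketched as follows (my notation): given variables $x, y$ in a Hilbert space and a conditioning set $S$, let $P_S$ denote the orthogonal projection onto the closed linear span of $S$. The partial correlation of $x$ and $y$ given $S$ is the correlation of the projection residuals,

    \[
      \rho(x, y \mid S) \;=\;
      \frac{\langle\, x - P_S x,\; y - P_S y \,\rangle}
           {\lVert x - P_S x \rVert \, \lVert y - P_S y \rVert},
    \]

and an edge is absent between $x$ and $y$ in the PCG precisely when this quantity vanishes. No distributional assumptions enter: everything is expressed through inner products and projections, which is what allows the lattice arguments to go through in full generality.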

We study a family of regularized score-based estimators for learning the structure of a directed acyclic graph (DAG) for a multivariate normal distribution from high-dimensional data with p >> n. Our main results establish support recovery guarantees and deviation bounds for a family of penalized least-squares estimators under concave regularization, without assuming prior knowledge of a variable ordering. These results apply to a variety of practical situations that allow for arbitrary nondegenerate covariance structures, as well as many popular regularizers including the MCP, SCAD, L0, and L1. The proof relies on interpreting a DAG as a recursive linear structural equation model, which reduces the estimation problem to a series of neighbourhood regressions. We provide a novel statistical analysis of these neighbourhood problems, establishing uniform control over the superexponential family of neighbourhoods associated with a Gaussian distribution. We then apply these results to study the statistical properties of score-based DAG estimators, learning causal DAGs, and inferring conditional independence relations via graphical models. Our results yield---for the first time---finite-sample guarantees for structure learning of Gaussian DAGs in high dimensions via score-based estimation.

Keywords: Graphical modeling, high-dimensional statistics, concave regularization, directed acyclic graphs, structural equations, sparse regression

Link to arXiv
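
Schematically, the estimators studied belong to the following family (a simplified rendering; the paper states the precise conditions). Writing the DAG as a recursive linear structural equation model $X_j = \sum_{k \neq j} \beta_{kj} X_k + \varepsilon_j$ with coefficient matrix $B = [\beta_{kj}]$ supported on an acyclic graph,

    \[
      \widehat{B} \;\in\; \arg\min_{B \,:\, \operatorname{supp}(B)\ \text{acyclic}}
      \;\frac{1}{2n}\,\lVert \mathbf{X} - \mathbf{X} B \rVert_F^2
      \;+\; \sum_{j,k} \rho_\lambda(\beta_{kj}),
    \]

where $\mathbf{X} \in \mathbb{R}^{n \times p}$ is the data matrix and $\rho_\lambda$ is a regularizer such as the MCP, SCAD, L0, or L1. Each column of $B$ defines a neighbourhood regression of one variable onto candidate parent sets, and the analysis controls all such regressions uniformly over the superexponentially many candidate neighbourhoods.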

// publications

In many applications, inter-sample heterogeneity is crucial to understanding the complex biological processes under study. For example, in genomic analysis of cancers, each patient in a cohort may have a different driver mutation, making it difficult or impossible to identify causal mutations from an averaged view of the entire cohort. Unfortunately, many traditional methods for genomic analysis seek to estimate a single model that is shared by all samples in a population, ignoring this inter-sample heterogeneity entirely. To better understand patient heterogeneity, it is necessary to develop practical, personalized statistical models. To uncover this inter-sample heterogeneity, we propose a novel regularizer for patient-specific, personalized estimation. This regularizer operates by learning latent distance metrics between personalized parameters and clinical covariates, and attempting to match these distances as closely as possible. Crucially, we do not assume these distances are already known. Instead, we allow the data to dictate the structure of these latent distance metrics. Finally, we apply our method to learn patient-specific, interpretable models for a pan-cancer gene expression dataset containing samples from more than 30 distinct cancer types, and find strong evidence of personalization effects between cancer types as well as between individuals. Our analysis uncovers sample-specific aberrations that are overlooked by population-level methods, suggesting a promising new path for precision analysis of complex diseases such as cancer.

Keywords: Precision medicine, personalized regression, patient-specific modeling, distance-matching, TCGA

Link to bioRxiv
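
Schematically, the distance-matching idea can be written as follows; this is a paraphrase for intuition, not the exact objective from the paper. Each sample $i$ receives its own parameter vector $\theta_i$ alongside covariates $c_i$, and with latent metrics $d_\Theta$ (on parameters) and $d_C$ (on covariates), both learned from data, the regularizer penalizes mismatch between the two pairwise-distance structures:

    \[
      \Omega(\theta_1, \ldots, \theta_n) \;=\;
      \sum_{i < j} \big( d_\Theta(\theta_i, \theta_j) - d_C(c_i, c_j) \big)^2 .
    \]

Because $d_\Theta$ and $d_C$ are estimated jointly with the $\theta_i$, the data, rather than the analyst, dictates which samples should share statistical strength.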

A fundamental challenge in modern datasets of ever-increasing dimensionality is variable selection, which has taken on renewed interest recently due to the growth of biological and medical datasets with complex, non-i.i.d. structures. Naively applying classical variable selection methods such as the Lasso to such datasets may lead to a large number of false discoveries. Motivated by genome-wide association studies in genetics, we study the problem of variable selection for datasets arising from multiple subpopulations, when this underlying population structure is unknown to the researcher. We propose a unified framework for sparse variable selection that adaptively corrects for population structure via a low-rank linear mixed model. Most importantly, the proposed method does not require prior knowledge of sample structure in the data and adaptively selects a covariance structure of the correct complexity. Through extensive experiments, we illustrate the effectiveness of this framework over existing methods. Further, we test our method on three different genomic datasets from plants, mice, and humans, and discuss the knowledge discovered with our method.

Keywords: GWAS, linear mixed models, heterogeneous data, confounding, population structure

Link to bioRxiv
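
The correction builds on the standard linear mixed model, sketched here in simplified form (the paper's low-rank formulation adds structure to $K$):

    \[
      y \;=\; X\beta + Zu + \varepsilon,
      \qquad
      u \sim \mathcal{N}(0, \sigma_g^2 K),
      \quad
      \varepsilon \sim \mathcal{N}(0, \sigma_e^2 I),
    \]

where $K$ is a similarity matrix capturing the unknown population structure. Sparse variable selection is carried out on the fixed effects $\beta$ while the random effect $u$ absorbs confounding due to relatedness; adaptively choosing the complexity (e.g. the rank) of $K$ is what removes the need for prior knowledge of the sample structure.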

Learning graphical models from data is an important problem with wide applications, ranging from genomics to the social sciences. Nowadays datasets often have upwards of thousands---sometimes tens or hundreds of thousands---of variables and far fewer samples. To meet this challenge, we have developed a new R package called sparsebn for learning the structure of large, sparse graphical models with a focus on Bayesian networks. While there are many existing software packages for this task, this package focuses on the unique setting of learning large networks from high-dimensional data, possibly with interventions. As such, the methods provided place a premium on scalability and consistency in a high-dimensional setting. Furthermore, in the presence of interventions, the methods implemented here achieve the goal of learning a causal network from data. Additionally, the sparsebn package is fully compatible with existing software packages for network analysis.

Keywords: R, software, graphical modeling, directed acyclic graphs, structural equations

Link to arXiv

We develop a penalized likelihood estimation framework to learn the structure of Gaussian Bayesian networks from observational data. In contrast to recent methods which accelerate the learning problem by restricting the search space, our main contribution is a fast algorithm for score-based structure learning which does not restrict the search space in any way and works on high-dimensional data sets with thousands of variables. Our use of concave regularization, as opposed to the more popular L0 (e.g. BIC) penalty, is new. Moreover, we provide theoretical guarantees which generalize existing asymptotic results when the underlying distribution is Gaussian. Most notably, our framework does not require the existence of a so-called faithful DAG representation, and as a result, the theory must handle the inherent nonidentifiability of the estimation problem in a novel way. Finally, as a matter of independent interest, we provide a comprehensive comparison of our approach to several standard structure learning methods using open-source packages developed for the R language. Based on these experiments, we show that our algorithm obtains higher sensitivity with comparable false discovery rates for high-dimensional data and scales efficiently as the number of nodes increases. In particular, the total runtime for our method to generate a solution path of 20 estimates for DAGs with 8000 nodes is around one hour.

Keywords: Bayesian networks, concave penalization, directed acyclic graphs, coordinate descent, nonconvex optimization

Link to arXiv
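
In outline, the framework optimizes a penalized Gaussian likelihood over DAG structures (a simplified statement for orientation):

    \[
      \min_{(B,\,\Omega) \,:\, \operatorname{supp}(B)\ \text{acyclic}}
      \; -\,\ell(B, \Omega \mid \mathbf{X})
      \;+\; \sum_{j,k} \rho_\lambda(\beta_{kj}),
    \]

where $\ell$ is the log-likelihood of the Gaussian structural equation model with coefficient matrix $B$ and noise variances $\Omega$, and $\rho_\lambda$ is a concave penalty such as the MCP. Coordinate descent updates one edge weight at a time, which is what makes solution paths over thousands of nodes computationally feasible.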

// theses

Research into graphical models is a rapidly developing enterprise, garnering significant interest from both the statistics and machine learning communities. A parallel thread in both communities has been the study of low-dimensional structures in high-dimensional models where $p\gg n$. Recently, there has been a surge of interest in connecting these threads in order to understand the behaviour of graphical models in high dimensions. Due to their relative simplicity, undirected models such as the Gaussian graphical model and Ising models have received most of the attention, whereas directed graphical models have received comparatively little attention. An important yet largely unresolved class of directed graphical models is Bayesian networks, or directed acyclic graphs (DAGs). These models have a wide variety of applications in artificial intelligence, machine learning, genetics, and computer vision, but estimation of Bayesian networks in high dimensions is not well-understood. The main focus of this dissertation is to address some fundamental questions about these models in high dimensions.

The primary goal is to develop both algorithms and theory for estimating continuous, linear Bayesian networks, capable of handling modern high-dimensional problems. Motivated by problems from the regression literature, we show how to adapt recent work in sparse learning and nonconvex optimization to the structure learning problem for Bayesian networks in order to estimate DAGs with several thousand nodes. We draw an explicit connection between linear Bayesian networks and so-called neighbourhood regression problems and show how this can be exploited in order to derive nonasymptotic performance bounds for penalized least squares estimators of directed graphical models.

On the algorithmic side, we develop a method for estimating Gaussian Bayesian networks based on convex reparametrization and cyclic coordinate descent. In contrast to recent methods which accelerate the learning problem by restricting the search space, we propose a method for score-based structure learning which does not restrict the search space. We do not require the existence of a so-called faithful DAG representation, and as a result, our methodology must handle the inherent nonidentifiability of the estimation problem in a novel way. On the theoretical side, we provide (a) finite-dimensional performance guarantees for local minima of the resulting nonconvex program, and (b) a general high-dimensional framework for global minima of the nonconvex program. Both the algorithms and theory apply to a general class of regularizers, including the MCP, SCAD, $\ell_1$ and $\ell_0$ penalties. Finally, as a matter of independent interest, we provide a comprehensive comparison of our approach to several standard structure learning methods using open-source packages developed for the \texttt{R} language.

Keywords: Bayesian networks, high-dimensional statistics, graphical models, sparse regression, concave regularization, nonconvex optimization

Ten years ago, Ehrlich and Sanchez produced a pointwise statement of the classical Bishop volume comparison theorem for so-called SCLV subsets of the causal future in a Lorentz manifold, while Petersen and Wei developed and proved an integral version for Riemannian manifolds. We apply Petersen and Wei's method to the SCLV sets, and verify that two essential differential equations from the Riemannian proof extend to the Lorentz setting. As a result, we obtain a volume comparison theorem for Lorentz manifolds with integral, rather than pointwise, bounds. We also briefly discuss the history of the problem, starting with Bishop's original theorem from 1963.

Keywords: Differential geometry, volume comparison, Lorentz manifolds

Link

// teaching

I will not be teaching in Spring 2018.

Past teaching assignments:

  • Machine Learning 10-821: Data Analysis Project Preparation (Fall 2017)
  • Statistics 10: Introduction to Statistical Reasoning (2015-2016)
  • Statistics 495A: Teaching College Statistics (Winter 2015)
  • Statistics 100A: Introduction to Probability (Spring 2014)
  • Statistics 101B: Introduction to Design and Analysis of Experiments (Winter 2014)
  • Statistics 102A: Introduction to Computational Statistics with R (Fall 2013)
  • PIC 20A: Principles of Java (Spring 2010)
  • PIC 10A: Introduction to C++ Programming (Winter 2010)
  • PIC 10A: Introduction to C++ Programming (Fall 2009)

// software

You can find more up-to-date information on my software projects by visiting my GitHub page.

// Personalized regression

This repository contains Python code for learning sample-specific, personalized regression models. The goal of personalized regression is to perform retrospective analysis by estimating simple models, each of which applies to a single sample. After estimating these sample-specific models, we obtain a matrix of model parameters that can then be analyzed however we wish.

source / paper

// sparsebn package for R

sparsebn is an R package for learning large-scale Bayesian networks from high-dimensional data. It allows users to incorporate mixed experimental and observational data with either continuous or discrete observations, and scales to datasets with many thousands of variables. The underlying framework is based on recent developments in sparse (e.g. L1) regularization, coordinate descent, and nonconvex optimization.

cran / source / paper
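
A minimal usage sketch in R, assuming the package's two documented entry points, sparsebnData() for wrapping raw data and estimate.dag() for fitting; the data frame name below is a placeholder:

    library(sparsebn)

    # Wrap raw observations (rows = samples, columns = variables);
    # type = "discrete" is also supported, and an 'ivn' argument can be
    # used to flag interventional samples.
    dat <- sparsebnData(gene_expression_df, type = "continuous")

    # Estimate a solution path of DAGs over a grid of regularization levels.
    dags <- estimate.dag(dat)

    # Inspect one estimate along the path.
    dags[[5]]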

// ccdr package for R

The source code for the CCDr algorithm described in Aragam and Zhou (2015) is freely available online through GitHub.

ccdr is an R package for structure learning of linear Bayesian networks from high-dimensional, Gaussian data. The underlying algorithm estimates a Bayesian network (also known as a DAG or belief network) using penalized maximum likelihood based on L1 or concave (MCP) regularization and observational data.

source / paper
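
A usage sketch, with the caveat that the driver function below is an assumption modeled on the package's successor, ccdrAlgorithm, rather than confirmed API:

    library(ccdr)

    # X: an n-by-p matrix of Gaussian observations (placeholder name).
    # ccdr.run is assumed to fit a path of DAG estimates over a grid of
    # regularization parameters, as in the successor package.
    fits <- ccdr.run(X, lambdas.length = 20)

    # Each element of the returned path is one estimated network.
    fits[[1]]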

// consulting

I have some limited availability for part-time and short-term consulting projects in analytics, data science, modeling, and/or simulation. I am experienced in many core languages as well as web-oriented and design-oriented software.

  • Data: R + Rcpp / SQL / PL/pgSQL / MapReduce
  • Numerics: Mathematica / MATLAB
  • Core: C++ / C / Python / Java / Visual Basic
  • Web: HTML(5) / XML / CSS / JavaScript / jQuery
  • Design: Photoshop / Digital photography / Video editing

I am particularly interested in the analysis of social data, including social networks, advertising, and marketing.

// availability

Please e-mail me for more details.

// contact

If you want to...

...e-mail me: naragam at cs dot cmu dot edu.

...find me on LinkedIn, click here.

...see my CV, click here.

I also got a little bored while designing this site so I hid some easter eggs here and there.