Research

I work on theoretical statistics with a focus on clustering, bootstrap, random matrices and networks. Here are some of my publications. For all publications look at the arxiv link.

PREPRINTS: Find me on Arxiv.

On clustering network-valued data. [Supplementary]
(Soumendu Mukherjee, Purnamrita Sarkar, Lizhen Lin)

   - Two clustering algorithms for clustering where each datapoint corresponds to a network.

   - NIPS 2017.

Statistical Convergence Analysis of Gradient EM on multi-component Gaussian Mixture Models. [Supplementary]
(Bowei Yan, Mingzhang Yin, Purnamrita Sarkar)

   - Local convergence guarantees for EM with general Gaussian Mixture Models

   - NIPS 2017.

On Mixed Memberships and Symmetric Nonnegative Matrix Factorizations. [Supplementary] [code]
(Xueyu Mao, Purnamrita Sarkar, Deepayan Chakrabarti)

   - Asymptotic guarantees for Symmetric Non-negative matrix factorization for the Mixed Membership Blockmodel.

   - ICML 2017.

On Robustness of Kernel Clustering. [arxiv] [code]
(Bowei Yan, Purnamrita Sarkar)

   - Asymptotic guarantees for a semidefinite relaxation of Kernel Clustering under outlier nodes.

   - NIPS 2016.

The Consistency of Common Neighbors for Link Prediction in Stochastic Blockmodels. [Supplementary]
(Purnamrita Sarkar, Deepayan Chakrabarti, Peter Bickel)

   - Asymptotic guarantees for link prediction with common neighbors .

   - NIPS 2015.

Scaling Up Crowd-Sourcing to Very Large Datasets: A Case for Active Learning.
(Barzan Mozafari, Purnamrita Sarkar, Michael Franklin, Michael Jordan, Samuel Madden)

   - Using the theory of nonparametric bootstrap, we propose an active learning scheme for a crowdsourced database.

   - To appear in VLDB 2015.

Role of normalization in spectral clustering.
(P. Sarkar and P. J. Bickel)

   - We theoretically investigate the effect of normalization for Spectral Clustering in Stochastic Blockmodels. A common practice in Spectral Clustering is to use the normalized adjacency matrix. Our results suggest that normalization improves the second order consistency properties of the principal eigenvectors of the adjacency matrix. Under the spectral embedding, normalization reduces the variance of points in a class while keeping the distance between cluster centers intact.

   - To appear in the Annals of Statistics.

Hypothesis Testing for Automated Community Detection in Networks.
(P. J. Bickel and P. Sarkar)

   - Most network clustering algorithms assume that the number of clusters k is known. We propose a hypothesis testing framework for automatically determining the number of clusters in networks generated from Stochastic Blockmodels. We theoretically derive the asymptotic distribution of the largest eigenvalue of a shifted and scaled adjacency matrix and use this for the test. Further we develop a recursive bipartitioning algorithm for generating a hierarchical clustering of the graph.

      - To appear in the Journal of Royal Statistical Society, Series B.

Crowdsourced Enumeration Queries. (International Conference on Data Engineering, ICDE 2013)
(B. Trushkowsky, T. Kraska, M. J. Franklin, and P. Sarkar)

   -We propose to use statistical species estimation tools for improving the quality of enumeration type queries in crowdsourced databases.

      - Received the Best Paper Award

Nonparametric Link Prediction in Dynamic Networks. (International Conference on Machine Learning, ICML 2012). [Appendix].
(P. Sarkar, D. Chakrabarti, and M. I. Jordan)

   -We propose a nonparametric time-series model for dynamic networks with non-linear linkage patterns. We use simple link prediction heuristics from static networks as features, and propose a scalable estimation procedure that uses locality sensitive hashing. We also prove asymptotic consistency of the estimator under the specified model.

         -Extended version: Nonparametric Link Prediction in Large Scale Dynamic Networks.
         -To appear in the Electronic Journal of Statistics.

Big Data Bootstrap. (International Conference on Machine Learning, ICML 2012)
(A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan)

   -With the huge amounts of noisy data generated from public and corporate sources everyday, estimation of confidence on a quantity of inferential interest is becoming more and more important. In this work we present "Bag of Little Bootstraps", a sampling scheme which obtains compact representations of bootstrap samples. This algorithm significantly reduces the computational burden of bootstrap without losing out on its asymptotic consistency properties.

         -Extended version: A Scalable Bootstrap for Massive Data.
         -Journal of Royal Statistical Society, Series B. In press.

Theoretical Justification of Popular Link Prediction Heuristics. (International Conference on Learning Theory, COLT 2010)
(P. Sarkar, D. Chakrabarti, and A. W. Moore)

- Link prediction is an interesting and important practical problem in social networks. There are many different graph-based heuristics for doing link prediction, namely number of common neighbors between two nodes, shortest path etc. By using a generative model for link formation, and geometric intuitions, we justify some popular link prediction heuristics. Here is the presentation.

      - Received the Mark Fulk Best Student Paper Award
      - Invited to the Best Paper Track, International Joint Conference on Artificial Intelligence (IJCAI), 2011.

Fast Nearest-neighbor Search in Disk-resident Graphs. (ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2010)
(P. Sarkar and A. W. Moore)

- We examine the problem of computing random walk based measures on large disk-resident graphs. We also examine the relationship between different proximity measures in order to analyze the effect of high degree nodes on random walks. The technical report can be found here.

Fast Dynamic Reranking in Large Graphs. (International Conference on World Wide Web, WWW 2009)
(P. Sarkar and A. W. Moore)

- In this paper we consider the problem of re-ranking search results by incorporating user feedback. We propose to use a graph theoretic measure for discriminating irrelevant results from relevant results using a few labeled examples provided by the user. We present efficient sampling and neighborhood expansion algorithms to compute the top k nodes under this measure. Here is the presentation.

Fast Incremental Proximity Search in Large Graphs. (International Conference on Machine Learning, ICML 2008)
(P. Sarkar, A. W. Moore, and A. Prakash)

- In this paper we combine random sampling with deterministic neighborhood expansion schemes to compute approximate nearest neighbors of a query under hitting and commute times. Resulting algorithms can process queries on the fly without any preprocessing. This is of crucial importance for graphs which change quickly over time. Here is the presentation.

A Tractable Approach to Finding Closest Truncated-commute-time Neighbors in Large Graphs. (Uncertainty in Artificial Intelligence, UAI 2007)
(P. Sarkar and A. W. Moore)

    - We present GRANCH, an algorithm to compute all interesting pairs of approximate nearest neighbors in truncated commute times in a graph, without computing it between all pairs. Our algorithm provably prunes away uninteresting pairs of nodes in the graph, and as a result quickly finds the "potential nearest neighbors".

A Latent Space Approach to Dynamic Embedding of Co-occurrence Data. (International Conference on Artificial Intelligence and Statistics, AI-STATS 2007)
(P. Sarkar, S. Siddiqi, and G. Gordon)

- We propose a graphical model to compute dynamic embeddings of co-occurrence data. We show how to make inference tractable by using Kalman Filters, which also provides distributional information of the embedding inferred.

Dynamic Social Network Analysis using Latent Space Models. (Advances in Neural Information Processing Systems, NIPS 2005)
(P. Sarkar and A. W. Moore)

- We present a dynamic model for social networks changing over time. We combine a new dynamic multidimentional scaling algorithm with a tractable local optimization technique for inference. Both of these algorithms are highly efficient (sub-quadratic in size of the social network).

     - [Extended version:] Appeared in SIGKDD Explorations: Special Edition on Link Mining 2005