On the statistical cost of foregoing maximum likelihood
February 15, 2023 (GHC 8102)

Abstract: Energy-based models are a recent class of probabilistic generative models in which the distribution being learned is parametrized only up to a constant of proportionality (the partition function). Fitting such models by maximum likelihood (i.e., finding the parameters that maximize the probability of the observed data) is computationally challenging, because evaluating the partition function involves a high-dimensional integral. Newer incarnations of this paradigm therefore minimize alternative losses that obviate the need to evaluate the partition function. Examples include score matching (in which we fit the score of the data distribution, i.e., the gradient of its log-density) and noise-contrastive estimation (in which we set up a classification problem to distinguish data from noise).
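To make the score matching idea concrete, here is a minimal sketch on a toy model that is not taken from the talk: a one-dimensional Gaussian with mean mu and precision lam, fit with the standard Hyvärinen form of the score matching loss (the average of half the squared score plus the second derivative of the log-density). The function names, the synthetic data, and the choice of model are all illustrative assumptions. The partition function never appears, because it vanishes when the log-density is differentiated with respect to x.

```python
import random


def score_matching_loss(xs, mu, lam):
    """Empirical Hyvarinen score matching loss for a 1D Gaussian model.

    Assumed model (illustrative only): log p(x) = -lam/2 * (x - mu)^2 - log Z.
    """
    total = 0.0
    for x in xs:
        score = -lam * (x - mu)   # d/dx log p(x); log Z drops out here
        second_deriv = -lam       # d^2/dx^2 log p(x)
        total += 0.5 * score ** 2 + second_deriv
    return total / len(xs)


def fit_score_matching(xs):
    """Closed-form minimizer of the loss above for this toy model.

    Setting the gradient to zero gives mu = sample mean and
    lam = 1 / (biased sample variance).
    """
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, 1.0 / var


random.seed(0)
# Synthetic data: true mean 2.0, true standard deviation 0.5 (precision 4.0).
data = [random.gauss(2.0, 0.5) for _ in range(5000)]
mu_hat, lam_hat = fit_score_matching(data)
print(mu_hat, lam_hat)
```

In this Gaussian case the score matching estimate coincides with the maximum likelihood estimate, so the example shows only how the loss avoids the partition function. It does not exhibit any gap in statistical efficiency between the two estimators.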

What's gained with these approaches is tractable gradient-based algorithms. What's lost is less clear: for example, since maximum likelihood is asymptotically optimal in terms of statistical efficiency, how suboptimal are losses like score matching and noise-contrastive estimation? We will provide partial answers to this question in the asymptotic limit, and in the process uncover connections between geometric properties of the distribution (Poincaré and isoperimetric constants) and the statistical efficiency of score matching.

Based primarily on https://arxiv.org/abs/2210.00726 and https://arxiv.org/abs/2210.00189.