Projection Pursuit, Gaussian Scale Mixtures, and the EM Algorithm

The EM algorithm for fitting Gaussian mixture models is one of the most widely used clustering methods. Yet, surprisingly little is known about its behavior. Under what conditions does it converge to a good local optimum? What are good ways to initialize it, and to merge/remove intermediate clusters? Such questions are difficult to answer with the mathematical tools that have traditionally been applied to EM.

I will describe an alternative, probabilistic way of analyzing EM. This reveals, first of all, that many common methods of initializing EM produce highly suboptimal results even in ideal clustering scenarios. On the other hand, I'll show that a particular variant of EM provably recovers a near-optimal clustering, provided that the clusters are adequately separated and that their distributions are of a certain fairly general form.
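To make the setting concrete, here is a minimal sketch of EM for a mixture of spherical Gaussians, written in NumPy. This is generic illustrative code, not the particular variant analyzed in the talk; the initialization from random data points is one of the common schemes whose behavior the talk examines.

```python
import numpy as np

def em_spherical_gmm(X, k, n_iter=50, seed=0):
    """Basic EM for a mixture of k spherical Gaussians (illustrative sketch).

    Returns estimated means, per-component variances, and mixing weights.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Initialize means at k random data points -- one of the common
    # initializations whose quality the talk calls into question.
    means = X[rng.choice(n, k, replace=False)]
    variances = np.full(k, X.var())
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] proportional to
        # w_j * N(x_i | mu_j, s_j^2 I), computed in log space for stability.
        sq = ((X[:, None, :] - means[None]) ** 2).sum(axis=-1)  # (n, k)
        log_r = (np.log(weights)
                 - 0.5 * d * np.log(2 * np.pi * variances)
                 - sq / (2 * variances))
        log_r -= log_r.max(axis=1, keepdims=True)
        r = np.exp(log_r)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances from the
        # soft assignments.
        nk = r.sum(axis=0)
        weights = nk / n
        means = (r.T @ X) / nk[:, None]
        sq = ((X[:, None, :] - means[None]) ** 2).sum(axis=-1)
        variances = (r * sq).sum(axis=0) / (d * nk)
    return means, variances, weights
```

On two well-separated spherical clusters this recovers the centers accurately; the interesting regime, as the abstract notes, is when separation is modest and initialization matters.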

The type of cluster distributions allowed is motivated by a new result in projection pursuit, along the lines of the (somewhat mistaken) folklore that "almost all projections of high-dimensional data look Gaussian".

Specifically, I'll show that for any D-dimensional distribution with finite second moments, there is a precise sense in which almost all of its linear projections into d < D dimensions look like a scale-mixture of spherical Gaussians: a mixture of spherical Gaussians that share a common center but have different variances.

The extent of this effect depends upon the ratio of d to D, and upon the "eccentricity" of the high-dimensional distribution.
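The phenomenon is easy to observe empirically. The sketch below (my own illustration, not code from the talk) projects decidedly non-Gaussian high-dimensional data, uniform on a hypercube, onto a random direction and checks that the projection's excess kurtosis is close to the Gaussian value of zero.

```python
import numpy as np

rng = np.random.default_rng(0)
D, n = 500, 20000

# High-dimensional, clearly non-Gaussian data: uniform on [-1, 1]^D.
# (A hypothetical example chosen for illustration.)
X = rng.uniform(-1, 1, size=(n, D))

# A random unit direction in R^D.
u = rng.normal(size=D)
u /= np.linalg.norm(u)

# One-dimensional projection of the data, standardized.
proj = X @ u
proj = (proj - proj.mean()) / proj.std()

# Excess kurtosis is 0 for a Gaussian; for this projection it comes out
# near zero even though each coordinate has excess kurtosis -1.2.
kurtosis = (proj ** 4).mean() - 3.0
print(f"excess kurtosis of random projection: {kurtosis:.3f}")
```

As the abstract says, how Gaussian the projection looks depends on d versus D and on the eccentricity of the original distribution; here d = 1 and D = 500, so the effect is strong.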

Speaker Bio

Sanjoy Dasgupta

Sanjoy Dasgupta is an Assistant Professor in the Department of Computer Science and Engineering at UC San Diego. He received his PhD from Berkeley, and spent two years at AT&T Research Labs before joining UCSD.

His area of research is algorithmic statistics, with a focus on unsupervised and minimally supervised learning. He has recently completed a textbook, "Algorithms" (with Christos Papadimitriou and Umesh Vazirani).