CMU Theory Lunch

Small Coresets and a Dimensionality Reduction for the k-means Problem

Feb 18, 2015

The k-means problem seeks a clustering that minimizes the sum of squared errors cost function: For input points P from the Euclidean space R^d and any solution consisting of k centers from R^d, the cost is the sum of the squared distances of any point to its closest center.

This talk is about reducing the size of an input set P while maintaining the k-means cost function up to an epsilon-fraction *for all choices of k centers*. Thus, the reduced version of P behaves approximately the same for any solution.

Dimensionality reduction refers to reducing the dimension of the input points, while the term coreset refers to computing a reduced version with less points.

We will see how to reduce the dimensionality to O(k/eps^2) and how compute a coreset with poly(k/eps) points. This means that inputs of arbitrary size can be replaced by a (weighted) point set whose size is independent of the input size.