Clustering High Dimensional Dynamic Data Streams
February 21, 2018 (GHC 6115 )

We first present a data streaming algorithms for the k-median problem in high-dimensional dynamic geometric data streams, i.e. streams allowing both insertions and deletions of points from a discrete Euclidean space {1,2,…Δ}^d. Our algorithms use k/eps^2 * poly(d log Δ) space/time and maintain with high probability a small weighted set of points (a coreset) such that for every set of k centers the k-median cost of the coreset (1+eps)-approximates the cost of the streamed point set. We can use this coreset to compute a (1+eps)-approximation for the k-median problem by any efficient offline k-median algorithm.

We then show a recent development of data streaming algorithms for the k-means in the same setting. Our new algorithm simulates the sensitivity-based offline sampling algorithm (developed by Feldman & Langberg (2011)) in a dynamic stream setting and provides coresets for k-means with k / poly(eps) * poly(d log Δ) space. This new framework can be generalized to various different clustering objectives.

All previous algorithms for computing a (1+eps)-approximation for the k-median/means problem over dynamic data streams required space exponential in d.

Based on paper:

https://arxiv.org/abs/1706.03887 and https://arxiv.org/abs/1802.00459

Joint work with:

Vladimir Braverman, Gereon Frahling, Harry Lang, Christian Sohler

Zhao Song and Peilin Zhong