What:  Theory Seminar
Date:  Friday, Oct. 15, 1999
Place: Wean Hall 7220
Time:  3:30-5:00

        =============================
        Clustering in Large Data Sets
        =============================

        Leonard J. Schulman, Georgia Tech

In recent years our ability to access and store large quantities of data has greatly surpassed our ability to meaningfully process the information it contains. This has led to concerted efforts to develop new algorithms for dealing with large data sets. A fundamental primitive in organizing large data sets is clustering: classifying the elements of the data set into clusters of similar elements (under some given notion of similarity). Clustering arises in a wide variety of areas, including computer vision (image segmentation), web document classification, time series analysis, and genomics.

Until recently, clustering was regarded as an ill-defined and difficult problem. Most research has focused on heuristics, which typically converge only to local optima and are useful primarily for very low-dimensional problems.

In this talk we will discuss a mathematical formulation of the clustering problem: minimization of the sum of squared Euclidean distances among points within each cluster. On the one hand, this formulation is quite rich: for instance, it can "host" other formulations, such as the sums of L_1 or L_2 distances (and the embeddings are efficiently computable). On the other hand, it is computationally tractable. We will outline an algorithm that computes a near-optimal clustering in linear time when the dimension is below (log n)/(log log n) (given n input points), and in time n^{log log n} for unrestricted input dimension. The algorithm itself is quite simple to describe and implement, but its design and analysis require a number of algorithmic techniques which have recently been distilled from disparate research fields (analysis, statistics, combinatorial geometry, and algorithmic algebra).

Host: Avrim Blum
Appts: daz@cs, 83779
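
A minimal sketch (not from the talk) of the objective described in the abstract: the cost of a clustering is the sum, over clusters, of the squared Euclidean distances between all pairs of points in the same cluster. The sketch assumes NumPy and uses the standard identity that this pairwise sum equals |C| times the sum of squared distances to the cluster centroid; the function name and example data are hypothetical.

    import numpy as np

    def clustering_cost(points, labels):
        """Sum over clusters of all pairwise squared Euclidean distances.

        points: (n, d) array of n points in dimension d
        labels: length-n array assigning each point to a cluster
        """
        cost = 0.0
        for c in np.unique(labels):
            cluster = points[labels == c]
            centroid = cluster.mean(axis=0)
            # Pairwise sum over {i < j} equals |C| * sum_i ||x_i - centroid||^2.
            cost += len(cluster) * np.sum((cluster - centroid) ** 2)
        return cost

    # Example: four points forming two well-separated pairs in the plane.
    pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
    print(clustering_cost(pts, np.array([0, 0, 1, 1])))  # 2.0   (natural split)
    print(clustering_cost(pts, np.array([0, 1, 0, 1])))  # 200.0 (poor split)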