What:  Theory Seminar
Date:  Friday, Oct. 15, 1999
Place: Wean Hall 7220
Time:  3:30-5:00

        =============================
        Clustering in Large Data Sets
        =============================

        Leonard J. Schulman, Georgia Tech

In recent years our ability to access and store large quantities of data has greatly surpassed our ability to meaningfully process the information it contains. This has led to concerted efforts to develop new algorithms for dealing with large data sets. A fundamental primitive in organizing large data sets is clustering: classifying the elements of the data set into clusters of similar elements (under some given notion of similarity). Clustering arises in a wide variety of areas, including computer vision (image segmentation), web document classification, time series analysis, and genomics.

Until recently, clustering was regarded as an ill-defined and difficult problem. Most research has focused on heuristics, which typically converge only to local optima and are useful primarily for very low-dimensional problems.

In this talk we will discuss a mathematical formulation of the clustering problem: minimization of the sum of squared Euclidean distances among points within each cluster. On the one hand, this formulation is quite rich: for instance, it can "host" other formulations, such as the sums of L_1 or L_2 distances (and the embeddings are efficiently computable). On the other hand, it is computationally tractable. We will outline an algorithm that computes a near-optimal clustering in linear time when the dimension is below (log n)/(log log n) (given n input points), and in time n^{log log n} for unrestricted input dimension. The algorithm itself is quite simple to describe and implement, but its design and analysis require a number of algorithmic techniques which have recently been distilled from disparate research fields (analysis, statistics, combinatorial geometry, and algorithmic algebra).

Host: Avrim Blum
Appts: daz@cs, 83779
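
A minimal sketch (not from the talk) of the objective described in the abstract: the cost of a clustering is the sum, over clusters, of the squared Euclidean distances between all pairs of points in the same cluster. The sketch assumes NumPy and uses the standard identity that this pairwise sum equals |C| times the sum of squared distances to the cluster centroid; the function name and example data are hypothetical.

    import numpy as np

    def clustering_cost(points, labels):
        """Sum over clusters of all pairwise squared Euclidean distances.

        points: (n, d) array of n points in dimension d
        labels: length-n array assigning each point to a cluster
        """
        cost = 0.0
        for c in np.unique(labels):
            cluster = points[labels == c]
            centroid = cluster.mean(axis=0)
            # Pairwise sum over {i < j} equals |C| * sum_i ||x_i - centroid||^2.
            cost += len(cluster) * np.sum((cluster - centroid) ** 2)
        return cost

    # Example: four points forming two well-separated pairs in the plane.
    pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
    print(clustering_cost(pts, np.array([0, 0, 1, 1])))  # 2.0   (natural split)
    print(clustering_cost(pts, np.array([0, 1, 0, 1])))  # 200.0 (poor split)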