Date: Wed, 12 Apr 2000 09:05:15 -0400
From: Avrim Blum
To: reinforcement@cs.cmu.edu
Subject: Theory Seminar this Friday on clustering and fast approx SVD
Santosh Vempala is giving this Friday's theory seminar, on clustering
via fast approximate SVD-type methods. If all goes well, he'll give a
demo on clustering web searches. Santosh is one of our former grad
students, now on the faculty at MIT. (He's in town for Adam Kalai's
thesis proposal.) -Avrim
=======================================================================
Friday, April 14, 2000
3:30, Wean Hall 5409
Santosh Vempala, MIT
Title: On Clusterings --- Good, Bad and Spectral.
Abstract:
Clustering, or partitioning into similar groups, is a classical problem
in mathematics as well as in the applied sciences. The availability of vast
amounts of data has revitalized research on the problem. There are two
aspects to analyzing a clustering algorithm: (1) How good is the clustering
produced? (quality) and (2) How fast can it be found? (speed).
On the practical side, several clever heuristics have been invented.
Theoreticians, meanwhile, have been busy analyzing quality measures that
are seductively simple to define (e.g. k-median, min sum, min diameter,
etc.). Both approaches have obvious shortcomings: the justification
provided by practitioners is typically case-by-case and experimental ("it
works well on my data"); the measures thus far analyzed by theoreticians
are easy to fool, i.e. there are simple examples where it is clear what the
right clustering is, but optimizing the measures defined produces
undesirable solutions.
We propose a new measure of the quality of a clustering. Then we analyze
the quality of a popular clustering heuristic, spectral clustering, under
this measure. Spectral clustering is the general technique of partitioning
the rows of a matrix (say) according to their components in the top few
singular vectors of the matrix. Our first result is that under some
reasonable assumptions, the spectral algorithm finds nearly optimal solutions.
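The row-partitioning step described above can be sketched in a few lines of
numpy. This is a hypothetical toy example, not the algorithm analyzed in the
talk: the data, the choice of k, and the sign-based grouping (a crude
stand-in for clustering in the projected space when there are only two
clusters) are all my assumptions.

```python
import numpy as np

# Toy data: 10 rows forming two planted clusters (an illustrative
# example, not data from the talk).
rng = np.random.default_rng(0)
A = np.vstack([
    rng.normal(scale=0.1, size=(5, 4)) + np.array([1.0, 1.0, 0.0, 0.0]),
    rng.normal(scale=0.1, size=(5, 4)) + np.array([0.0, 0.0, 1.0, 1.0]),
])

# Spectral step: project each row onto the top-k right singular vectors.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
projected = A @ Vt[:k].T   # shape (10, k): each row's spectral coordinates

# Partition the rows by the sign of the second spectral coordinate.
# With two balanced clusters this separates them; in general one would
# run k-means (or similar) on the projected rows instead.
labels = (projected[:, 1] > 0).astype(int)
```

On this toy input the first five rows land in one group and the last five in
the other, which is the planted partition.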
Our second result, that spectral clustering can be made very fast, is based
on randomized algorithms for approximating a given m x n matrix by a
low-rank matrix. Standard methods to compute low-rank approximations via
the singular value decomposition take time that is polynomial in m and n.
This can be prohibitively expensive for applications such as information
retrieval. In contrast, the running time of our first algorithm depends
only on the quality of the desired approximation and not on the size of the
matrix, but it assumes that the entries of the matrix can be sampled in a
specific manner. Our second algorithm needs no assumptions and has a
running time that is linear in the number of non-zero entries.
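For reference, the classical SVD-based low-rank approximation that the
randomized algorithms speed up can be sketched as follows. This shows only
the standard truncation step (the part that costs time polynomial in m and
n), not the sampling-based algorithms from the talk; the matrix and the
rank are my assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# A matrix that is close to rank 2: a rank-2 product plus small noise
# (illustrative data, not from the talk).
B = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 50))
A = B + 0.01 * rng.normal(size=(100, 50))

# Classical method: compute the full SVD, then keep the top k terms.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
A_k = (U[:, :k] * s[:k]) @ Vt[:k]   # best rank-k approximation of A

# The relative (Frobenius) error is tiny here because A is nearly rank 2.
rel_err = np.linalg.norm(A - A_k) / np.linalg.norm(A)
```

The full SVD is what makes this expensive for large m and n; the randomized
algorithms in the abstract replace it with sampling so that the cost depends
on the approximation quality (or on the number of non-zero entries) rather
than on the full dimensions.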
The talk will be self-contained and will include a real-time demo of
clustering web searches.
This is joint work with Alan Frieze (CMU), Ravi Kannan, Petros Drineas
(Yale), and V. Vinay (IISc).