Mixtures of Rectangles: Interpretable Soft Clustering

Dan Pelleg

Abstract

To be effective, data-mining has to conclude with a succinct description of the data. To this end, we explore a clustering technique that finds dense regions in data. By constraining our model in a specific way, we are able to represent the interesting regions as an intersection of intervals. This has the advantage of being easily read and understood by humans.

Specifically, we fit the data to a mixture model in which each component is a hyper-rectangle in M-dimensional space. Hyper-rectangles may overlap, meaning some points can have soft membership of several components. Each component is simply described by, for each attribute, lower and upper bounds of points in the cluster. The computational problem of finding a locally maximum-likelihood collection of k rectangles is made practical by allowing the rectangles to have soft ``tails'' in the early stages of an EM-like optimization scheme. Our method requires no user-supplied parameters except for the desired number of clusters. These advantages make it highly attractive for ``turn-key'' data-mining application. We demonstrate the usefulness of the method in subspace clustering for synthetic data, and in real-life datasets. We also show its effectiveness in a classification setting.


Charles Rosenberg
Last modified: Thu Jun 21 14:41:31 EDT 2001