Date: Tue, 05 Nov 1996 20:44:23 GMT Server: NCSA/1.5 Content-type: text/html Last-modified: Wed, 16 Oct 1996 19:07:56 GMT Content-length: 5371 BIRCH

BIRCH


Document Outline:


What is BIRCH

BIRCH means "Balanced Iterative Reducing and Clustering using Hierarchies". BIRCH is a system designed for doing clustering and density analysis for any large dataset with a limited resources (memory or time) while minimizing I/O cost.

What is new in BIRCH

All previous data clustering or density analysis methods treat each individual data point equally important. So when they are faced with very large datasets, they are all too expensive to scale up. Their running time, clustering quality and (in)sensitivity to data input order hence become very unstable.

Examples of what BIRCH can do

(1) Made-up example


The above picture represents an artificial dataset of 100,000 tuples. Within it there are 100 clusters, each of which is a normal distribution of 900 tuples, distributed along a sine curve. The rest 10,000 (i.e., 10% of total) tuples are uniformly distributed noises.

Now we apply BIRCH to the above dataset to search for the 100 clusters. With 80 kbytes of memory and less than 1 minute on a HP9000 unix workstation, we find the 100 clusters, each of which is plotted as a circle with the number of tuples in it labeled, its centroid as center, and its standard deviation as radius.

(2) Real-world application: pixel classification


The above are two similar images of trees with a partly cloudy sky as the background, taken in two different wavelengths. One is in the visible wavelength band (VIS), and another is in the near-infrared band (NIR). Each image contains 512x1024 pixels, and each pixel actually has a pair of brightness values corresponding to VIS and NIR. Soil scientists receive hundreds of such image pairs and try to first classify pixels into different categories such as background, sunlit leaves, shadows and branches, and then apply further statistical analysis.

BIRCH has been used to help soil scientists classify pixels efficiently. The scatter plot of (VIR,NIR) for all pairs of pixels in the image (512x1024 2-d tuples). With 400 kbytes of memory and less than 6 minutes on a HP9000 unix workstation, we can find 6 clusters corresponding to clouds, ordinary part of sky, very bright part of sky, tree branches, shadows on the tree, and sublit leaves. The following shows the pixels corresponding to sunlit leaves, tree branches and shadows on the trees classified with BIRCH.

Relevant Publications


Code

Last Modified: October 30, 1995 by Tian Zhang(zhang@cs.wisc.edu)