FastMix

Introduction
Downloads
Options
File Formats
Auxiliary Utilities
References

FastMix generates Gaussian mixture models for large datasets using efficient EM clustering algorithms developed by the Auton Project. The program automatically selects the number of Gaussians and the locations of the Gaussians using the KD-Clust algorithm. The algorithm uses kd-trees for two purposes: to accerate EM by caching sufficient statistics and to find regions where the model underpredicts the data. The regions of underprediction are used as candidates for new cluster locations.

FastMix takes as input a file of data points and outputs a set of Gaussian centers and covariances. The program is currently supported by Peter Sand (psand@cs.cmu.edu).

Downloads

fastmix-linux.tar.gz (Linux Version)
fastmix-sun4.tar.gz (Sun4 Version)
sample1.ds (sample data file)
sample2.ds (sample data file)
sample.mix (sample Gaussian mixture file)

Options

The standard command line syntax for FastMix includes an input file and an optional output file:

fastmix input-file output-file

If the output file is omitted, the standard output is used. Additional options may be specified on the command line, or in the fastmix.conf file. (Options specified on the command line override values specified in the configuration file.)

Option	Values	Default	Description
-t	1 to 100000	1000	running time in seconds
-s	aic, bic, testset	bic	scoring type; testset scoring uses the loglikelihood of a holdout set
-d	fast, slow	fast	scoring speed: determines whether the algorithm compares different models using a fast (approximate) or slow (precise) scoring method
-v	0.0 to 20.0	1.0	verbosity: a value of 0.0 will result in no on-screen output; 1.0 displays status each time the mixture score changes; 5.0 provides additional information; values over 10.0 may result in excessive output
-r	(integer)	137	seed for random number generator
-b	10 to 10000	50	max steps since best: if no improvement is made after modifying the model this number of times, the algorithm terminates; if running-time is not an issue, set this to be higher (e.g. 500)
-l	(filename)	(none)	the specified log file is updated each time the mixture score changes; if the log file is unspecified, no logging occurs
-T	0.0 to 10.0	0.03	tau: a parameter of the fast EM algorithm that determines how closely it should approximate standard (slow) EM; larger values increase the speed; smaller values increase the accuracy
-M	1e-9 to 1.0	0.001	min-box-width: the minimum size of the kd-tree's hyper-rectangles; larger values increase the speed; smaller values increase the accuracy
-f	1 to 10000	(none)	funnel mode: the search algorithm is funneled toward the specified number of gaussians; when the model has more than the given number of gaussians, FastMix tends to remove gaussians; when the model has less than the given number of gaussians, it tends to add gaussians; the final mixture model is the best model found with exactly the specified number of gaussians; if the funnel parameter is not specified, funnel mode is disabled

The following is a sample FastMix command line:

fastmix data.ds mix.cen -t 500 -s aic -d 300

File Formats

FastMix can load standard text-based data files with fields seperated by spaces or commas, and records seperated by line breaks. Any lines that are blank or start with the # character are ignored. The program also supports fast-loading data files. These fast-loading files (which must end with a .fds extension) can be generated using the fastmix-fds utility (see the utilities section).

FastMix outputs files containing a mixture of gaussians. The file starts with two header lines:

gaussians: num_gaussians
dimensions: num_dimensions

Each line after the header specifies a single gaussian. The values in each line are seperated by spaces and include the gaussian's mean, covariance, and probability (which sums to 1 over the entire mixture of gaussians). Each line contains fields in the following order (d denotes the number of dimensions):

gaussian-id
probability
mean₀ ... mean_d-1
cov_{0, 0} ... cov_{d-1, 0} ... cov_{d-1, d-1}

A FastMix log file contains lines of the following format:

secs-since-start score num-gaussians

Auxiliary Utilities

FastMix provides a utility that scores a mixture file using the specified dataset. The utility has the following syntax (valid values for score-type are aic and bic):

fastmix-score score-type dataset-file mixture-file

FastMix also includes a program that generates a graphical display of a dataset and a mixture file:

fastmix-show dataset-file mixture-file

Another utility generates a postscript rendering of a dataset and a mixture file:

fastmix-ps dataset-file mixture-file ps-file

The following program will convert a standard text data file into a fast-loading .fds file (with a .fds extension):

fastmix-fds input-file

Another utility updates a mixture model by running a sequence of EM steps:

fastmix-em dataset-file input-mixture-file output-mixture-file options

The following options may be used:

Option	Values	Default	Description
-c	(positive real)	1.0	convergence bound: the algorithm halts after the difference between subsequent scores falls below this value and does not improve after a specified number of steps (see below)
-b	0 to 100	5	steps since best: the algorithm halts after the score does not improve for this number of steps; this mechanism is necessary because of fluctuations that occur when fast scoring is enabled
-s	aic, bic	bic	scoring type
-d	fast, slow	fast	scoring speed: determines whether the algorithm compares different models using a fast (approximate) or slow (precise) scoring method
-v	0.0 to 20.0	1.0	verbosity: a value of 0.0 will result in no on-screen output; 1.0 displays status each time the mixture score changes; 5.0 provides additional information; values over 10.0 may result in excessive output
-l	(filename)	(none)	the specified log file is updated each time the mixture score changes; if the log file is unspecified, no logging occurs
-T	0.0 to 10.0	0.03	tau: a parameter of the fast EM algorithm that deterimes how closely it should approximate standard (slow) EM; larger values increase the speed; smaller values increase the accuracy
-M	1e-9 to 1.0	0.001	min-box-width: the minimum size of the kd-tree's hyper-rectangles; larger values increase the speed; smaller values increase the accuracy

The output mixure model is the highest scoring, not necessarily the last.

References

Andrew W. Moore, Very Fast EM-based Mixture Model Cluster using Multiresolution kd-trees, Advances in Neural Information Processing Systems 11, (Submitted May 1998, Proceedings published May 1999).

Peter Sand and Andrew W. Moore, Repairing Faulty Mixture Models using Density Estimation, International Conference on Machine Learning, 2001 (ICML2001)