FastMix
Introduction
Downloads
Options
File Formats
Auxiliary Utilities
References
Introduction
FastMix generates Gaussian mixture models for large datasets using efficient EM clustering algorithms developed by the Auton Project. The program automatically selects the number of Gaussians and the locations of the Gaussians using the KD-Clust algorithm. The algorithm uses kd-trees for two purposes: to accerate EM by caching sufficient statistics and to find regions where the model underpredicts the data. The regions of underprediction are used as candidates for new cluster locations.

FastMix takes as input a file of data points and outputs a set of Gaussian centers and covariances. The program is currently supported by Peter Sand (psand@cs.cmu.edu).
Downloads
fastmix-linux.tar.gz (Linux Version)
fastmix-sun4.tar.gz (Sun4 Version)
sample1.ds (sample data file)
sample2.ds (sample data file)
sample.mix (sample Gaussian mixture file)
Options
The standard command line syntax for FastMix includes an input file and an optional output file:

        fastmix   input-file   output-file

If the output file is omitted, the standard output is used. Additional options may be specified on the command line, or in the fastmix.conf file. (Options specified on the command line override values specified in the configuration file.)

Option Values Default Description
-t 1 to 100000 1000 running time in seconds
-s aic, bic, testset bic scoring type; testset scoring uses the loglikelihood of a holdout set
-d fast, slow fast scoring speed: determines whether the algorithm compares different models using a fast (approximate) or slow (precise) scoring method
-v 0.0 to 20.0 1.0 verbosity: a value of 0.0 will result in no on-screen output; 1.0 displays status each time the mixture score changes; 5.0 provides additional information; values over 10.0 may result in excessive output
-r (integer) 137 seed for random number generator
-b 10 to 10000 50 max steps since best: if no improvement is made after modifying the model this number of times, the algorithm terminates; if running-time is not an issue, set this to be higher (e.g. 500)
-l (filename) (none) the specified log file is updated each time the mixture score changes; if the log file is unspecified, no logging occurs
-T 0.0 to 10.0 0.03 tau: a parameter of the fast EM algorithm that determines how closely it should approximate standard (slow) EM; larger values increase the speed; smaller values increase the accuracy
-M 1e-9 to 1.0 0.001 min-box-width: the minimum size of the kd-tree's hyper-rectangles; larger values increase the speed; smaller values increase the accuracy
-f 1 to 10000 (none) funnel mode: the search algorithm is funneled toward the specified number of gaussians; when the model has more than the given number of gaussians, FastMix tends to remove gaussians; when the model has less than the given number of gaussians, it tends to add gaussians; the final mixture model is the best model found with exactly the specified number of gaussians; if the funnel parameter is not specified, funnel mode is disabled

The following is a sample FastMix command line:

        fastmix   data.ds   mix.cen   -t 500   -s aic   -d 300
File Formats
FastMix can load standard text-based data files with fields seperated by spaces or commas, and records seperated by line breaks. Any lines that are blank or start with the # character are ignored. The program also supports fast-loading data files. These fast-loading files (which must end with a .fds extension) can be generated using the fastmix-fds utility (see the utilities section).

FastMix outputs files containing a mixture of gaussians. The file starts with two header lines:

        gaussians: num_gaussians
        dimensions: num_dimensions

Each line after the header specifies a single gaussian. The values in each line are seperated by spaces and include the gaussian's mean, covariance, and probability (which sums to 1 over the entire mixture of gaussians). Each line contains fields in the following order (d denotes the number of dimensions):

        gaussian-id
        probability
        mean0 ... meand-1
        cov0, 0 ... covd-1, 0 ... covd-1, d-1


A FastMix log file contains lines of the following format:

        secs-since-start   score   num-gaussians
Auxiliary Utilities

FastMix provides a utility that scores a mixture file using the specified dataset. The utility has the following syntax (valid values for score-type are aic and bic):

        fastmix-score  score-type  dataset-file  mixture-file

FastMix also includes a program that generates a graphical display of a dataset and a mixture file:

        fastmix-show   dataset-file   mixture-file

Another utility generates a postscript rendering of a dataset and a mixture file:

        fastmix-ps   dataset-file   mixture-file   ps-file

The following program will convert a standard text data file into a fast-loading .fds file (with a .fds extension):

        fastmix-fds   input-file

Another utility updates a mixture model by running a sequence of EM steps:

        fastmix-em   dataset-file   input-mixture-file   output-mixture-file   options

The following options may be used:

Option Values Default Description
-c (positive real) 1.0 convergence bound: the algorithm halts after the difference between subsequent scores falls below this value and does not improve after a specified number of steps (see below)
-b 0 to 100 5 steps since best: the algorithm halts after the score does not improve for this number of steps; this mechanism is necessary because of fluctuations that occur when fast scoring is enabled
-s aic, bic bic scoring type
-d fast, slow fast scoring speed: determines whether the algorithm compares different models using a fast (approximate) or slow (precise) scoring method
-v 0.0 to 20.0 1.0 verbosity: a value of 0.0 will result in no on-screen output; 1.0 displays status each time the mixture score changes; 5.0 provides additional information; values over 10.0 may result in excessive output
-l (filename) (none) the specified log file is updated each time the mixture score changes; if the log file is unspecified, no logging occurs
-T 0.0 to 10.0 0.03 tau: a parameter of the fast EM algorithm that deterimes how closely it should approximate standard (slow) EM; larger values increase the speed; smaller values increase the accuracy
-M 1e-9 to 1.0 0.001 min-box-width: the minimum size of the kd-tree's hyper-rectangles; larger values increase the speed; smaller values increase the accuracy

The output mixure model is the highest scoring, not necessarily the last.
References
Andrew W. Moore, Very Fast EM-based Mixture Model Cluster using Multiresolution kd-trees, Advances in Neural Information Processing Systems 11, (Submitted May 1998, Proceedings published May 1999).

Peter Sand and Andrew W. Moore, Repairing Faulty Mixture Models using Density Estimation, International Conference on Machine Learning, 2001 (ICML2001)