Note that the directory you got this file from may be restricted by requesting domain and/or password. If you need access, it is fairly easy to get: just follow the instructions at http://www.cs.cmu.edu/~dpelleg/kmeans.html

The files under the "precompiled" directory are, well, precompiled binaries. The "precompiled" subdirectories are for different OS/architecture combinations. Note that these may contain outdated versions, so compiling the code yourself is the best way to get the latest and greatest version.

The file "demo-instructions" has instructions on how to run a demo of the K-means code and our accelerations to it.

The file "11class.unit.ds" is a sample data file. The data format is straightforward:

  - #'s start comments
  - there is a line with the attribute names (in the example, x0 and x1 for two dimensions)
  - next, there are N lines for N datapoints, each line holding the coordinates of one point
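For illustration, a tiny file in this format might look like the following (the values are made up, not taken from 11class.unit.ds, and whitespace-separated columns are assumed):

  # a made-up two-dimensional data file
  x0 x1
  0.10 0.20
  0.15 0.22
  0.80 0.75
  0.82 0.71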
If you want to build kmeans from source, fetch the file "kmeans.tar.gz" and untar it:

  gunzip kmeans.tar.gz
  tar -xvf kmeans.tar

Now, all you need to do is cd to the auton/kmeans directory and type "gmake". GNU make is required (it may be called "make" on your system: do a "make -v" to test). You can also type "gmake t:=fast" to compile a speed-optimized version (no debugging info and no runtime self-checks). Another thing you can do is compile with "gmake user_flags=-DUNIX_TTY_PLATFORM". This avoids compiling the graphics calls into the code. It may help if our build tools are not able to find the X11 libraries on your platform and you get compile errors. Of course, it will also remove all the fancy graphics from the program. For more options, have a look at the technical documentation at our website, http://www.autonlab.org/

If you compiled the sources yourself, you have an executable in auton/kmeans/kmeans. If you downloaded the binary file, you already have one. From now on we'll assume you have a file "kmeans" which you can run.

To work, this kmeans program needs a "universe" file that records the largest and smallest value any coordinate can take in each of the dimensions. To create such a file, just run kmeans like this:

  ./kmeans makeuni in 11class.unit.ds

This will create 11class.unit.ds.universe. Of course, you can reuse the file for other input files by renaming it to the new input file's name plus ".universe", but make sure there aren't any points outside of the box it describes. If you have known limits on your data, just write such a file yourself and re-use it. Otherwise, either pre-run "kmeans makeuni" on every file before you do clustering on it, or specify "-create_universe true" on the command line.

Now, you can see how x-means works, step by step, by running the following:

  ./kmeans kmeans -k 1 -D_SHOW_END_CENTERS -D_DRAWPOINTS -D_INTERACTIVE -D_SHOW_BVALUE -method blacklist -max_leaf_size 40 -min_box_width 0.03 -cutoff_factor 0.5 -max_iter 200 -num_splits 6 -max_ctrs 15 -in 11class.unit.ds

It will show you, graphically, how the centers split and so on. For your convenience, this line is included in the file RUNME.BAT.

Options explained:

As you can guess, the second "kmeans" tells the program to cluster stuff, as opposed to making a universe file (which is what it did with "makeuni" above). The option "-in 11class.unit.ds" tells it that the input file is 11class.unit.ds.

-method blacklist
  This tells it to do x-means.

-k 1
  This tells it to start with just one center. Try it with -k 3 and see the difference. If you have a known lower bound on the number of clusters in your data, use it here. Otherwise, using -k 1 won't hurt, and shouldn't affect runtime too much.

-num_splits 6
  Try to split the centers 6 times.

  [ Explanation of the connection between k, num_splits, and the optional parameter max_ctrs: num_splits is the exact number of splits. k is the initial number of centers. Each split can potentially double the number of centers, so the theoretical maximum is k*2^num_splits. In this example, this means 6 different configurations of centers will be tested, each with between 1 and 64 centers. The best one in terms of penalized score will be output. Alternatively, max_ctrs can be used instead of num_splits; the code then chooses the value for num_splits from max_ctrs and k. If you specify both k and max_ctrs, you're guaranteed the number of output centers will be between these two values. It's probably biased against the very high end of this range. Note that in both of these cases, not all values between k and max_ctrs (or k*2^num_splits) are tested. If that is what you want, use the (much slower) "BL-search" method. End of explanation ]

-max_leaf_size 40 -min_box_width 0.03 -cutoff_factor 0.5 -max_iter 200
  Various parameters. You can leave these as-is and not worry about them. If you're interested, here is a more detailed explanation: cutoff_factor is ignored; you can probably omit it entirely. Lower max_iter if you want it to run faster; values down to 50 won't change the accuracy that much, and 20 will also work pretty well. This is the number of local iterations it does before making the decision whether to split a center or not. For very big data files (over 100K points), try increasing max_leaf_size (say, to 80 or 100) and min_box_width (say, to 0.1). That shouldn't affect the results, only the speed, and the effect on speed will probably depend on your data. max_leaf_size and min_box_width are parameters for the construction of the kd-tree, BTW.

-D_SHOW_END_CENTERS -D_DRAWPOINTS -D_INTERACTIVE -D_SHOW_BVALUE
  These tell it to graphically show what it's doing, and to wait for you to hit enter after each step.

So, to run it all at once, without hitting enter and with no graphics, type:

  ./kmeans kmeans -k 1 -method blacklist -max_leaf_size 40 -min_box_width 0.03 -cutoff_factor 0.5 -max_iter 200 -num_splits 6 -max_ctrs 15 -in 11class.unit.ds -save_ctrs ctrs.out

Note the "-save_ctrs" option. With it, the resulting centers are saved in the named file ("ctrs.out").

If you add "-printclusters clust" to the command line, the points will be output to the file named "clust" together with cluster IDs. Each point is on a line of its own, and is represented by a single integer. This number is the "serial number" of the point in the input file, starting from zero (comment lines, empty lines, and the line with the column names do not add to the count). Such output should be post-processed with:

  ./membership clust

This will output a listing that has as many lines as the input file. In each line there is the index of the cluster to which that point was assigned. You can redirect the output to a file, read it in MATLAB, etc.

The option -D_SHOW_END_CENTERS makes the program output BIC and distortion, regardless of graphics options, like this:

  ##DISTORTION 21.249986
  ##BICSCORE -55.046467

If this runs too slowly for you, consider compiling a version for speed. The default version has all the graphics stuff in it, so I preferred to give it to you first. It's pretty fast as it is, but if you have huge data files, or very large numbers of clusters, you'll want the optimized version. See "gmake t:=fast" above for details.
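To recap, a complete non-interactive session, from universe creation to per-point cluster assignments, might look like this (all flags are as described above; the output file names "ctrs.out", "clust", and "assignments.txt" are arbitrary):

  ./kmeans makeuni in 11class.unit.ds
  ./kmeans kmeans -k 1 -method blacklist -max_leaf_size 40 -min_box_width 0.03 -cutoff_factor 0.5 -max_iter 200 -num_splits 6 -max_ctrs 15 -in 11class.unit.ds -save_ctrs ctrs.out -printclusters clust
  ./membership clust > assignments.txt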
See "gmake t:=fast" above for details. If you start kmeans with "-init_ctrs file", it will read the initial centroids from that file instead of determining the initial placement on its own. You can create this file yourself, or use the output from a previous run that had a "-save_ctrs file" in the command line. If you give it the option "-method BL-search" instead of "-method blacklist". It will try some k values in the range (see below), giving you the equivalent of running it with an external loop that ranges over k. In this mode it will not do any center-splitting (ie, a straight-up K-means, optimized by a kd-tree). The values it will try is determined by -num_splits (#samples), -k (lower bound) and -max_ctrs (upper bound). You want num_splits to be the difference between the other two values if you want it to try all possible values of k in the range. The ICML00 paper describing this work is: Dan Pelleg and Andrew Moore: X-means: Extending K-means with Efficient Estimation of the Number of Clusters. http://www.cs.cmu.edu/~dpelleg/download/xmeans.ps.gz If you want to know more about BIC, read the following TR: Larry Wasserman: Bayesian Model Selection and Model Averaging http://www.stat.cmu.edu/www/cmu-stats/tr/tr666/tr666.html Let me know if you need anything else! Dan Pelleg =========================================================== 2. Using the MATLAB wrapper under UNIX ====================================== Change directory to auton/kmeans/mex. Type "make". You might need to set the MATLAB variable to point to the location of MATLAB on your computer, or edit the makefile appropriately. If you need to do this, and are using bash or a similar shell, the command will look like: MATLAB=/usr/local/lib/matlab6/ gmake If you are a csh or tcsh user, do: setenv MATLAB /usr/local/lib/matlab6/ ; gmake Once the build is done, start up matlab and run "test". It will offer you a selection of data files to run on, and will also display the resulting clustering. 3. Using the MATLAB wrapper under Windows ========================================= The distribution also contains a makefile you can use under Windows to create a MATLAB-callable function. It is called Makefile.windows in the auton/kmeans/mex directory. This makefile is courtesy of Rudolph van der Merwe. Before you can use the Makefile, you'll need to install Cygwin and Gnumex. See http://www.mrc-cbu.cam.ac.uk/Imaging/gnumex20.html for details. 4. Using an alternative split statistic ======================================= x-means works by starting from a small number of clusters. Then it decides, locally for each center, whether to split it into two new centers. By default, the criterion used for the decision is BIC. However, if you prefer, you can use the Anderson-Darling statistic instead. This idea has been suggested by Gregory Hamerly and Charles Elkan: Learning the k in k-means (in NIPS 2003). The claim is that it works better for clusters which have non-spherical shape. If this is what you want to do, then add the "-S AD" switch to your kmeans command-line. NOTES: 1. Using the AD statistic implies the following change: After each global k-means step, local splitting decisions are made, and some centers may decide to split, while others won't. The number of resulting clusters can be anywhere from the original number (if none split), to twice that (if everybody splits). If the case is that all centers decide NOT to split, then the behavior was that some are split anyway, just to try and get out of a possible local minimum. 
NOTES:

1. Using the AD statistic implies the following change. After each global k-means step, local splitting decisions are made; some centers may decide to split, while others won't. The number of resulting clusters can be anywhere from the original number (if none split) to twice that (if everybody splits). With BIC, if all centers decide NOT to split, the behavior was to split some anyway, just to try to get out of a possible local minimum. The way this was done was to first rank the centers by how much a split would improve them (this value is just the difference between the BIC score of the tentative centers and the BIC score of the existing center), and then split the top half of the centers. This means that the x-means code will continue to search for better configurations even if it runs into a local minimum of the BIC score. The best-performing configuration (in terms of total BIC) is saved, and this is the one finally output.

   With the AD statistic, the behavior is different: again the decisions are local, and this time the value used is, naturally, the AD value. But if no centers want to split, no splits are forced. That is, with the AD statistic, a local minimum will terminate the search. This is the behavior described in the Hamerly & Elkan paper, and their results with it are very good. I suspect that the reason their version is so fast is partly that they do not check all of those extra configurations that the BIC-based version does (and apparently, this doesn't hurt the output model's quality).

   If you want, you can still choose to force-split centers even with the AD statistic. For example, you can specify that you want half the centers to split. Again they will be ranked by how well the split is expected to behave (this time sorted on the AD number). For this, specify the -forced_split_fraction command-line switch with a value of 0.5. In general, -forced_split_fraction can be set to anything between 0 and 1. If it's 0, no forced splits are done on a local minimum. If it's 0.5, half the centers (the top half) are forced-split on a local minimum. In these terms, it defaults to 0.5 if the split statistic is BIC, and to 0 if it's AD.

2. As said above, the configuration that performed best so far is saved, and this is the one output at the end. The way "best so far" is judged is by the BIC value of the whole model. This is still true even if you choose the AD split statistic (the reason is that I don't know how to compute AD for an entire model).

3. To change the confidence value for the AD split statistic, use the -split_conf_level command-line switch. The default is 0.9999. The corresponding critical value is computed by a table lookup, so it's possible for the value actually used to match a value which is close, but not identical, to the one you specified.

4. BEWARE: if you use the Anderson-Darling formula from the Hamerly and Elkan paper, note that it has an error. It says (formula 1):

     ...(2i - 1)[log(zi) - ( 1 - log(z(n+1-i)))]...

   where the correct term is:

     ...(2i - 1)[log(zi) + ( 1 - log(z(n+1-i)))]...

   The correct form appears in Hamerly's PhD dissertation, as well as in the original paper by Stephens mentioned there and in the NIPS paper. The error above was spotted in a draft version that NIPS made available on the web prior to publication of the proceedings. Hopefully it will be corrected in the final proceedings (due out "in the spring of 2004").

5. Adding constraints
=======================================

It is also possible to incorporate constraints into the k-means clustering. They come in two forms: "must-link" and "cannot-link". Each respective constraint is a directive to put two specific input points in either the same cluster or in different clusters. Despite the names, these are soft constraints; in particular, they may be violated if that will (locally) maximize the objective function.

To use constraints, generate an additional file. Each line in this file contains one constraint. The syntax is either:

  ML x y

or:

  CL x y

where ML means "must-link" and CL means "cannot-link". The indices x and y are the numbers of the points in the main input file. IMPORTANT: point indices start from zero.
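For illustration, a small constraint file might look like this (the point indices are made up):

  ML 0 5
  ML 2 7
  CL 3 12

This asks that points 0 and 5 (and likewise 2 and 7) be placed in the same cluster, and that points 3 and 12 be placed in different clusters.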
The k-means distribution contains a sample constraint file named "cons.txt".

The "-method" argument must also be changed (from "blacklist" or "BL-search") to one of the following values:

  -method slowcons-cvqe
  -method slowcons-lcvqe

...which will choose either the CVQE [1] algorithm or the LCVQE [2] algorithm, respectively. The constraint file is specified with the "-cons" argument. For example, to include constraints in the example above, change it as follows:

  ./kmeans kmeans -k 1 -D_SHOW_END_CENTERS -D_DRAWPOINTS -D_INTERACTIVE -D_SHOW_BVALUE -method slowcons-lcvqe -max_leaf_size 40 -min_box_width 0.03 -cutoff_factor 0.5 -max_iter 200 -num_splits 6 -max_ctrs 15 -in 11class.unit.ds -cons cons.txt

The code for CVQE and LCVQE constrained clustering has been generously contributed by IBM.

References:

1. Dan Pelleg and Dorit Baras: K-means with Large and Noisy Constraint Sets, ECML 2007. Full tech report:
   http://domino.research.ibm.com/library/cyberdig.nsf/1e4115aea78b6e7c85256b360066f0d4/08c39e5a60da9d278525731b00577324?OpenDocument

2. Davidson, I. and Ravi, S.S.: Clustering under Constraints: Feasibility Issues and the K-Means Algorithm, SDM 2005.