Partition.h and affiliated code Generalized Swendsen-Wang MCMC Sampler "Usage Manual" (scare quotes intentional) Tom Stepleton 9 February 2008 INTRODUCTION 1. This code implements the stochastic graph partition technique described in http://repositories.cdlib.org/cgi/viewcontent.cgi?article=1113&context=uclastat, among other places, in templated C++. 2. This code is not distributed as a complete product like, say, the Boost C++ libraries. You may find that the code does what you need it to do without modifications, but then again you might not. If not, you should dive into the code and make it work for you. This is normal. 3. Not only is this code is provided with no guarantees, it is furnished to you with malice aforethought! Beware! REQUIREMENTS 0. Requirements of this code include, but are not limited to: 1. The Boost C++ library (random number generation), 2. Maps.h (included), my accretion of convenience code around C++ Standard Template Library map containers, which will have to be modified if you're not using the Gnu C++ libraries (replace instances of __gnu_cxx as needed), 3. C++ compiler, 4. Computer. 5. Additionally, to compile and run the demo program, you will either need to have a UNIX machine with GNU Guile installed on your computer in /Users/tom/usr/bin/, or you will need to make modifications. Guile is used because the sampler can dump its own state out in a Lisp-like nested parentheses format, and Guile has a nice prettyprinter for such things. 6. You will also need OpenCV for the demo program. 7. And it's highly likely you'll have to hack the Makefile too. This code was developed and tested on a computer running MacOS 10.5 using GCC version 4.0.1. FILES 1. Files included in this source distribution are, by purpose: DOCUMENTATION README_Partition.txt CHANGELOG_Partition.txt PARTITIONING CODE Maps.h Partition.h SerialSource.h z_Samplers.h DEMONSTRATION PROGRAM CvMatHolder.h Makefile OCVPlot.h OCVPlot_more.h prettyprinter.scm test_Partition.cc 2. When using the code in your own project, it is only necessary to include Partition.h. THEORY OF OPERATION 1. Beyond the problem setting described in the tech report cited at top, know ye also that: 2. We assume a graphical relationship among entities called "objects" for reasons relating to the code's original use. Objects contain collections of attributes that are arbitrarily called "views". 3. The generalized Ising model used by this technique requires you to compute the probability of two objects belonging to the same subgraph in a partition. This code assumes that for two given objects, this probability is a function of only two views: one per object. The two views used to determine this probability are the view in object A and the view in object B that have the smallest "distance" between one another. The precise definition of the scalar, non-negative distance value is application-specific. 4. If this multi-view generalization of the technique is unsuitable or unnecessary for your application, note that the original technique as formulated in the tech report may be achieved through objects with only one view. 5. The code is not intended to store objects or views on its own. Rather, it is designed to maintain indices or references into data structures that you maintain. The types of these indices are template parameters, and there are requirements on these types: they must be equality comparable and hashable, among other things. 6. For certain applications (specifically those where two objects' views may be merged together to form a new object), your view indices must be unique across all objects. 7. This code was designed for a setting where views change over time. In order for the code to tell which views in different objects are the "closest", each modification to a view would require O(N) comparisons (N == number of views across all objects) to determine whether the the modified view remains or has become a view that determines object compatibility probability. Since the code's setting is one where view comparisons are computationally expensive, the code is designed to facilitate a straightforward O(M) approximation (M == number of objects; usually M << N). 8. Viz., each edge between objects is annotated with the views in both objects that are closest together, as well as the distance between those two views. When a view in an object is created or changed, your program can retrieve this edge information and compute the distances between the new or modified view and the views at the other end of each of the edges. If it is closer to any of them than the view designated by the existing edge, the edge will be replaced. 9. Since this approach does not guarantee that the two nearest views in two objects will always be selected as the edge, the system associates with the edge the number of times another view has "challenged" it as the best edge. Once the challenges surpass some number specified in the configuration, the edge becomes "stale", and the system will have to be explicitly told which two views should define the edge. 10. Lastly, the system also allows you to split an object's set of views into two separate objects, and also to merge two objects together into a single object. Splitting an object causes both new objects to have the same connectivity to the other objects as the original objects, along with an edge between themselves. Many of these new edges will be stale. 11. During all of these operations, the system will typically maintain a set of sampled graph partitions. When the graph is modified, the system computes a new set of samples if it can---if edges are stale, however, this is not possible. In any case, it is possible to suppress these resampling updates and then call for a resampling when desired. 12. Also during these operations, you will be required at times to pass in an argument function object with the following methods, where obj_index_T is the index type you use for objects and view_index_T is the index type you use for views: class MyClass { ... public: ... // Given an edge (NearNeighbor) between two objects (first and second // args), returns the probability of both belonging to the same subgraph. double operator()(const obj_index_T&, const obj_index_T&, const Partition::NearNeighbor&); // Given a partition of the graph, returns a value proportional to the // probability of the partition. double operator()(const Partition::Partition&); // Returns true iff the values returned by the methods above are // the natural logarithms of the probabilities. bool isLogarithmic(); ... }; Note that the first method requires the probability, while the second only requires a value proportional to the probability. This function object is typed in the code with the template argument SwendsenWangWeighter. 13. Here is pseudocode for a simple application of the system. Parenthesized items refer to actual class and method names in the code. This pseudocode is largely the structure of the example code in test_Partition.cc. BEGIN ;; ;; SYSTEM INITIALIZATION ;; Set configuration options (Partitions::Config) Initialize system (construct Partitions::Partition) Set up the SwendsenWangWeighter function object ;; ;; INITIAL DATA INSERTION ;; FOR o IN indices of all objects Insert object o, suppressing resampling (addObject) END FOR FOR v1 IN indices of all views in all objects FOR v2 IN indices of all views in all objects o1 := index of the object containing v1 o2 := index of the object containing v2 IF o1 != o2 THEN d := distance between v1 and v2 Suggest a potential edge v1<->v2 between objects o1 and o2, with distance d between them, suppressing resampling (suggestNeighbors) END IF END FOR END FOR ;; ;; INITIAL SAMPLING ;; Create an initial sampling of partitions by marking all links as not stale. (forceReady) ;; ;; UPDATES/MODIFYING VIEWS ;; FOR EVER v1 := index of a view that has changed o1 := index of the object containing v1 FOR e IN all edges incident on object o1 (getLinkPtr) o2 := index of the other object in e v2 := index of the view in other object in e d := distance between v1 and v2 Suggest a potential replacement edge v1<->v2 between objects o1 and o2, with distance d between them (suggestNeighbors) END FOR FOR e IN all stale edges (getStaleLinks) o1 := index of the first object in e o2 := index of the second object in e v1,v2 := index of the nearest views in o1,o2 respectively d := distance between v1 and v2 Declare the shortest edge v1<->v2 between objects o1 and o2, with distance d between them (declareNeighbors) END ;; ;; Your code for using the graph samples goes here ;; END FOR END. CONCLUSION 1. Whereof we cannot sample, thereof we must forever remain unrepresentative.