Probabilistic Modeling for Combinatorial Optimization

Shumeet Baluja & Scott Davies

This work originated in an attempt to create an explicit probabilistic model of the behavior of genetic algorithms [Goldberg, 1989][Holland, 1975][De Jong, 1975]. By maintaining a population of points, genetic algorithms (GAs) can be viewed as creating implicit probabilistic models of the solutions discovered as the search progresses. Rather than explicitly modeling the dependencies in the discovered solutions, GAs attempt to preserve the dependencies between parameters by using a crossover-based recombination operator to generate new sampling points.

One attempt towards making the GA's probabilistic model more explicit was the Population-Based Incremental Learning algorithm (PBIL) [Baluja, 1994][Baluja & Caruana, 1995]; this was further explored in [Juels, 1996][Kvasnicka et al., 1995][Hohfeld & Rudolph, 1997]. PBIL used a very simple probabilistic model that did not capture inter-parameter dependencies; each parameter was handled independently. The PBIL algorithm works as follows: instead of using recombination/crossover to create a new population, a real-valued probability vector, P, is maintained. Assuming the solutions are represented as bit strings, P specifies the probability of generating a 1 in each bit position. Initially, all values in P are set to 0.5, so that random, uniformly distributed solutions are generated. A number of solution vectors are generated by stochastically sampling P; each bit is sampled independently. The probability vector is then moved towards the solution vectors for which the evaluation function returns the best values, in a manner similar to the updates used in unsupervised competitive learning [Hertz et al., 1991]. This cycle is then repeated. The final result of the PBIL algorithm is the best solution generated throughout the search. This basic algorithm is similar to early work on cooperative systems of discrete learning automata [Thathachar & Sastry, 1987] and to the Bit-Based Simulated Crossover algorithm [Syswerda, 1993]. Numerous extensions to this basic algorithm are possible; many are similar to those commonly used with genetic algorithms, such as variable or constant mutation rates, maintenance of the best solution found, parallel searches, or local optimization heuristics.
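The loop below is a minimal Python sketch of this procedure for a maximization problem; the function name, parameter defaults, and the user-supplied evaluate callback are illustrative assumptions rather than the original implementation.

```python
import numpy as np

def pbil(evaluate, n_bits, n_samples=100, n_best=5,
         learning_rate=0.1, n_iterations=1000, rng=None):
    """Minimal PBIL sketch: keep a probability vector P over bit positions,
    sample candidate solutions from it, and shift P toward the best samples."""
    rng = rng or np.random.default_rng()
    P = np.full(n_bits, 0.5)                      # uniform distribution at the start
    best_solution, best_score = None, -np.inf

    for _ in range(n_iterations):
        # Each bit is sampled independently according to P.
        samples = (rng.random((n_samples, n_bits)) < P).astype(int)
        scores = np.array([evaluate(s) for s in samples])

        # The final result is the best solution generated throughout the search.
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_solution, best_score = samples[i].copy(), scores[i]

        # Move P toward the top-scoring samples (a competitive-learning style update).
        for s in samples[np.argsort(scores)[-n_best:]]:
            P = (1.0 - learning_rate) * P + learning_rate * s

    return best_solution, best_score
```

For example, pbil(lambda s: int(s.sum()), n_bits=32) would steer the probability vector toward the all-ones string on a trivial onemax-style problem.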

One of the goals of crossover in GAs is to combine "building blocks" from two different solutions to create new sampling points. Because all parameters are modeled independently, PBIL cannot propagate building blocks in the manner of standard GAs. Nonetheless, on a variety of standard benchmark problems used to test GAs and simulated annealing approaches, PBIL performed extremely well [Baluja, 1997]. However, we suspected that if we knew which parameters needed to be modeled together, even better solutions could be found. In GAs, knowledge of correlated parameters could be used to ensure that building blocks are propagated intact. Because the dependencies are not explicitly represented in the GA framework, the recombination operator combines random subsets of two "parent" solutions in the hope of not separating parameters that should be kept together.

An extension to the PBIL algorithm is to replace its naive probabilistic model with a more expressive one. In theory, with enough samples, a fully connected Bayesian network could be employed. With only a limited number of samples, however, more restricted networks are required. Much of our experimentation has used optimal dependency trees to model pair-wise interactions between parameters. As shown in [Chow & Liu, 1968], this produces the tree-shaped network that maximizes the likelihood of the data. This tree-generation algorithm runs in time O(n²), where n is the number of parameters in the solution encoding.
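As a rough illustration of how such a tree can be learned (the function name and interface are our own, not taken from the cited papers), the following sketch estimates the pairwise mutual information from a matrix of binary samples and then grows the maximum-weight spanning tree with Prim's algorithm; by the Chow & Liu result, this tree is the maximum-likelihood tree-shaped model of the samples.

```python
import numpy as np

def chow_liu_tree(samples, eps=1e-9):
    """Sketch of the Chow & Liu (1968) procedure on binary data: estimate
    pairwise mutual information and return the maximum-weight spanning tree
    as a parent array (the root has parent -1)."""
    m, n = samples.shape
    p1 = samples.mean(axis=0)                       # P(X_i = 1)
    mi = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # 2x2 joint distribution of (X_i, X_j) estimated from the samples.
            pij = np.array([[np.mean((samples[:, i] == a) & (samples[:, j] == b))
                             for b in (0, 1)] for a in (0, 1)])
            pi = np.array([1 - p1[i], p1[i]])
            pj = np.array([1 - p1[j], p1[j]])
            mi[i, j] = mi[j, i] = np.sum(
                pij * np.log((pij + eps) / (np.outer(pi, pj) + eps)))
    # Prim's algorithm for the maximum-weight spanning tree over MI weights.
    parent = [-1] * n
    in_tree = [0]
    best_link = {j: (mi[0, j], 0) for j in range(1, n)}
    while len(in_tree) < n:
        j = max(best_link, key=lambda k: best_link[k][0])
        parent[j] = best_link.pop(j)[1]
        in_tree.append(j)
        for k in best_link:
            if mi[j, k] > best_link[k][0]:
                best_link[k] = (mi[j, k], j)
    return parent
```

Given the pairwise statistics, the spanning-tree step considers each of the O(n²) candidate edges, which is where the quadratic running time quoted above comes from.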

The resulting optimization algorithm works as follows. After evaluating a set of sample solutions, the best members are used to update the probabilistic model from which the next generation's population will be generated. The best few samples from each generation are added to a dataset, termed S. Rather than recording the individual members of S, our algorithm maintains a set of sufficient statistics in an array A that contains a number A[Xi=a, Xj=b] for every pair of variables Xi and Xj and every combination of binary assignments to a and b. A[Xi=a, Xj=b] is an estimate of how many recently generated "good" bit strings (from S) have bit Xi=a and bit Xj=b. Since the probability distribution will gradually shift towards better bit strings, we put more weight on more recently generated bit strings by decaying the contributions of bit strings that were previously added to the dataset. All A[Xi=a, Xj=b] are initialized to some constant Cinit before the first iteration of the algorithm; this causes the algorithm's first set of bit strings to be generated from the uniform distribution. The basic framework for these algorithms is shown in the figure "Framework for using a probabilistic model to maintain statistics about the search space."
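A minimal sketch of this statistics update is given below; the array layout (shape (n, n, 2, 2)), the decay constant, and the function name are our own illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def update_pairwise_counts(A, best_samples, decay=0.99):
    """Decay the old pairwise counts so that recent good solutions carry more
    weight, then add the joint counts from the newest best samples, where
    A[i, j, a, b] approximates #(bit i = a and bit j = b) among good strings."""
    A *= decay
    for s in best_samples:
        onehot = np.stack([1 - s, s], axis=1)        # shape (n, 2): one-hot bits
        # Outer products of the one-hot encodings give all pairwise joint counts.
        A += np.einsum('ia,jb->ijab', onehot, onehot)
    return A

# Initializing every entry to C_init makes the first generation uniform.
n, C_init = 32, 1000.0
A = np.full((n, n, 2, 2), C_init)
```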

The values of A[Xi=a, Xj=b] at the beginning of an iteration may be thought of as specifying a prior probability distribution over "good" bit strings: the ratios of the values within A[Xi=a, Xj=b] specify the distribution, while the magnitudes of these values, scaled by the decay factor, specify an "equivalent sample size" reflecting how confident we are that this prior probability distribution is accurate.
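Under the same assumed array layout as above, this prior interpretation amounts to reading probabilities directly off the decayed counts; the hypothetical helper below makes the "equivalent sample size" explicit as the denominator of the estimate.

```python
def conditional_prob(A, i, j, a, b):
    """Estimated P(X_i = a | X_j = b) from the decayed counts; the denominator
    plays the role of the "equivalent sample size" behind this estimate."""
    support = A[i, j, 0, b] + A[i, j, 1, b]
    return A[i, j, a, b] / support
```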

The first extension to PBIL that captured pair-wise dependencies was termed Mutual Information Maximization for Input Clustering (MIMIC) [De Bonet et al., 1997]. MIMIC used a greedy search to generate a chain in which each variable is conditioned on the previous variable. While MIMIC was restricted to a greedy heuristic for finding chain-based models, the optimal dependency tree algorithm used in [Baluja & Davies, 1997] employed a broader class of models (trees) and found the optimal model within that class. The example dependency graphs in the figure "A: A noisy two-color graph coloring problem. B: the empty dependency graph used by PBIL. C: the graph learned by our implementation of the dependency chain algorithm. D: the graph learned by our dependency tree algorithm" illustrate the types of probability models learned by PBIL, a dependency-chain algorithm similar to MIMIC, and our dependency tree algorithm. We use Bayesian network notation for our graphs: an arrow from node Xp to node Xc indicates that Xc's probability distribution is conditionally dependent on the value of Xp. These models were learned while optimizing a noisy version of a two-color graph coloring problem (panel A of the figure) in which there is a 0.5 probability of adding 1 to the evaluation function for every edge constraint satisfied by the candidate solution.
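To generate the next generation's population from a learned tree, values can be drawn by ancestral sampling: the root is drawn from its marginal distribution, and every other variable from its distribution conditioned on the already-sampled value of its parent. The sketch below reuses the hypothetical parent array and (n, n, 2, 2) count layout from the earlier sketches; it is an illustration rather than the authors' code.

```python
import numpy as np

def sample_from_tree(parent, A, n_samples, rng=None):
    """Ancestral sampling of bit strings from a dependency tree whose
    conditional probabilities are estimated from the pairwise counts A."""
    rng = rng or np.random.default_rng()
    n = len(parent)
    # Order variables so that every node is sampled after its parent.
    order, placed = [], [False] * n
    while len(order) < n:
        for i in range(n):
            if not placed[i] and (parent[i] == -1 or placed[parent[i]]):
                order.append(i)
                placed[i] = True
    out = np.zeros((n_samples, n), dtype=int)
    for s in range(n_samples):
        for i in order:
            if parent[i] == -1:                      # root: use the marginal
                p1 = A[i, i, 1, 1] / (A[i, i, 0, 0] + A[i, i, 1, 1])
            else:                                    # condition on the parent's sampled value
                b = out[s, parent[i]]
                p1 = A[i, parent[i], 1, b] / (A[i, parent[i], 0, b] + A[i, parent[i], 1, b])
            out[s, i] = int(rng.random() < p1)
    return out
```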

We, and other researchers, have conducted numerous empirical comparisons of genetic algorithms and algorithms based on probabilistic modeling; see, for example, [Juels, 1996][Baluja, 1997][Baluja & Davies, 1997][Baluja & Davies, 1998][Greene, 1996]. On many standard benchmark problems, the probabilistic models perform significantly better than many variants of genetic algorithms. Surprisingly, on many of the larger, real-world problems, simple models which do not maintain dependency information (for example, PBIL) perform better than GAs which attempt to implicitly capture this information. On test problems designed to exhibit large amounts of inter-parameter dependency, PBIL's performance is less predictable [Eshelman et al., 1996]. However, in the large majority of these problems, the optimal dependency tree based models consistently outperformed the other techniques [Baluja & Davies, 1999].

Perhaps the most interesting result found in this line of study is that the performance of the combinatorial optimization algorithms consistently improved as the accuracy of their statistical models increased. Trees generally performed better than chains (such as those used in MIMIC), and chains generally performed better than the independent models used in PBIL. This suggests the possibility of using even more complex models. Unfortunately, once we move toward models in which variables can have more than one parent variable, the problem of finding an optimal network with which to model a set of data becomes NP-complete [Chickering et al., 1995]. Nonetheless, heuristics have been developed for automatically learning Bayesian networks from data (for example, [Heckerman et al., 1995]); this is one of the most exciting directions for future research.

 

References

Baluja, S. (1997) "Genetic Algorithms and Explicit Search Statistics," Advances in Neural Information Processing Systems 9, 1996. Mozer, M.C., Jordan, M.I., & Petsche, T. (Eds). MIT Press.

Baluja, S. (1994) "Population-Based Incremental Learning: A Method for Integrating Genetic Search Based Function Optimization and Competitive Learning," Technical Report CMU-CS-94-163, Carnegie Mellon University, Pittsburgh, PA.

Baluja, S. & Caruana, R. (1995) "Removing the Genetics from the Standard Genetic Algorithm," In Prieditis, A. & Russell, S. (Eds.), The International Conference on Machine Learning, 1995 (ML-95). Morgan Kaufmann Publishers, San Mateo, CA. pp. 38-46.

Baluja, S. & Davies, S. (1997) "Using Optimal Dependency-Trees for Combinatorial Optimization: Learning the Structure of the Search Space," In The International Conference on Machine Learning, 1997 (ML-97). Morgan Kaufmann Publishers, San Mateo, CA.

Baluja, S. & Davies, S. (1998) "Fast Probabilistic Modeling for Combinatorial Optimization," Proceedings of the 15th National Conference on Artificial Intelligence (AAAI-98). Madison, WI. pp. 469-476.

Chickering, D., Geiger, D., and Heckerman, D. (1995) "Learning Bayesian networks: Search methods and experimental results," Proc. of the Fifth Conference on Artificial Intelligence and Statistics.

Chow, C. and Liu, C. (1968) "Approximating discrete probability distributions with dependence trees," IEEE Trans. on Info. Theory, 14:462-467.

De Bonet, J., Isbell, C., and Viola, P. (1997) "MIMIC: Finding Optima by Estimating Probability Densities," Advances in Neural Information Processing Systems 9, 1996. Mozer, M.C., Jordan, M.I., & Petsche, T. (Eds). MIT Press.

Eshelman, L.J., Mathias, K.E., & Schaffer, J.D. (1996) "Convergence Controlled Variation," In Proc. Foundations of Genetic Algorithms 4. Morgan Kaufmann Publishers, San Mateo, CA.

Greene, J.R. (1996) "Population-Based Incremental Learning as a Simple Versatile Tool for Engineering Optimization". In Proceedings of the First International Conf. on EC and Applications. pp. 258-269.

Heckerman, D., Geiger, D., and Chickering, D. (1995) "Learning Bayesian networks: The combination of knowledge and statistical data," Machine Learning 20:197-243.

Hertz, J., Krogh, A., & Palmer, R.G. (1991) Introduction to the Theory of Neural Computation. Addison-Wesley, Reading, MA.

Hohfeld, M. & Rudolph, G. (1997) "Towards a Theory of Population-Based Incremental Learning", International Conference on Evolutionary Computation. pp. 1-5.

Juels, A. (1996) Topics in Black-box Combinatorial Optimization. Ph.D. Thesis, University of California - Berkeley.

Kvasnicka, V., Pelikan, M., & Pospichal, J. (1995) "Hill Climbing with Learning (An Abstraction of Genetic Algorithm)," In Proceedings of the First International Conference on Genetic Algorithms (MENDEL '95). pp. 65-73.

Syswerda, G. (1993) "Simulated Crossover in Genetic Algorithms," in (ed.) Whitley, D.L., Foundations of Genetic Algorithms 2, Morgan Kaufmann Publishers, San Mateo, CA. 239-255.

Thathachar, M. & Sastry, P.S. (1987) "Learning Optimal Discriminant Functions Through a Cooperative Game of Automata," IEEE Transactions on Systems, Man, and Cybernetics, Vol. 17, No. 1.