
Proposal

This thesis will investigate new machine learning algorithms for generating evaluation functions automatically. The algorithms are simulation-based: they gather data by simulating trajectories through the search space, and they use the outcomes of those trajectories to improve the evaluation function. The learned evaluation function may be applied either to guide further exploration of the same space or to improve performance in new problem spaces that share similar features.
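To make the simulation-based idea concrete, the following sketch shows the basic loop: simulate a trajectory by following the current evaluation function, then feed the observed outcome back as a training signal. The names (problem, eval_fn, and their methods) are illustrative placeholders under assumed interfaces, not algorithms or code from the thesis itself.

    # Hypothetical sketch of the simulation-based learning loop described above.
    # All interfaces (problem, eval_fn) are assumed placeholders.

    def learn_evaluation_function(problem, eval_fn, num_trajectories=1000):
        """Repeatedly simulate search trajectories and use their outcomes
        to improve the evaluation function."""
        for _ in range(num_trajectories):
            # 1. Simulate one trajectory by greedily following the current
            #    evaluation function from a random starting state.
            state = problem.random_start_state()
            trajectory = [state]
            while not problem.is_terminal(state):
                successors = problem.successors(state)
                state = min(successors, key=eval_fn.predict)  # lower score = better
                trajectory.append(state)

            # 2. Use the observed outcome (e.g. final solution cost) as a
            #    training target for every state visited along the way.
            outcome = problem.cost(trajectory[-1])
            for s in trajectory:
                eval_fn.update(s, target=outcome)

        return eval_fn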

Two general families of learning algorithms apply here: reinforcement learning and meta-optimization. The reinforcement learning approach, dating back to Samuel's checkers player [Samuel1959] but with more recent successes on backgammon [Tesauro1992] and job-shop scheduling [Zhang and Dietterich1995], is based on approximating a special evaluation function known as the value function. The value function is defined to predict the best possible long-term outcome expected from each state; a greedy search guided by it will produce an optimal result. Learning the value function in combinatorially large domains, however, is quite challenging. The currently popular approximation algorithms, Q-learning and TD(λ) with neural networks, run very slowly and may even be unstable [Boyan and Moore1995]. Section 2 of this proposal reviews the literature on value function approximation and describes an original algorithm for more efficient learning. Section 3 focuses specifically on the application of value function approximation to combinatorial optimization problems, and presents another new algorithm.
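As a concrete illustration of the value-function approach, the sketch below shows a standard TD(λ) update with a linear function approximator; it is a minimal example of the family of algorithms discussed above, not the algorithm developed in Section 2. The feature map phi and the environment interface (reset/step) are assumed placeholders.

    import numpy as np

    # Minimal TD(lambda) update with a linear value-function approximator.
    # phi(state) maps a state to a feature vector; env is an assumed
    # placeholder whose step() advances the simulation one move under the
    # current search policy and returns (next_state, reward, done).

    def td_lambda_episode(env, phi, w, alpha=0.01, gamma=1.0, lam=0.7):
        """Run one simulated trajectory and return the updated weight vector."""
        state = env.reset()
        trace = np.zeros_like(w)          # eligibility trace
        done = False
        while not done:
            next_state, reward, done = env.step(state)
            v = np.dot(w, phi(state))
            v_next = 0.0 if done else np.dot(w, phi(next_state))
            delta = reward + gamma * v_next - v        # TD error
            trace = gamma * lam * trace + phi(state)   # accumulate trace
            w = w + alpha * delta * trace              # gradient-style update
            state = next_state
        return w

Replacing the linear approximator with a neural network gives the kind of scheme used by Tesauro's backgammon player, at the cost of the slow and potentially unstable behavior noted above.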

The second approach, meta-optimization, is conceptually simpler: we assume a fixed parametric form for the evaluation function and optimize it directly. This approach was also used by Samuel [Samuel1959] and has been applied in the simulated annealing community [Ochotta1994]. Meta-optimization methods, lacking the theoretical advantages of dynamic programming, have been largely ignored by the reinforcement-learning community; however, recent advances in local optimization [Moore and Schneider1996] may help make them superior to reinforcement learning in practice. My research will lay out a framework for understanding both methods more formally; investigate improvements in both; and evaluate empirically which approach works better. Section 4 reviews the literature in meta-optimization and outlines the experiments I plan to carry out.
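The sketch below illustrates the meta-optimization idea under simple assumptions: a hill-climbing loop that perturbs the evaluation function's weight vector and keeps changes that improve simulated search performance. The routine run_search is a hypothetical stand-in for the underlying optimizer, returning an average solution cost (lower is better) for a given weight vector.

    import numpy as np

    # Illustrative sketch of meta-optimization: tune the parameters of a fixed
    # evaluation-function form directly on simulated search performance.
    # run_search(w) is an assumed placeholder that runs the underlying search
    # with weights w and returns an average solution cost over test instances.

    def meta_optimize(run_search, dim, iterations=200, step=0.1, seed=0):
        rng = np.random.default_rng(seed)
        w = rng.normal(size=dim)          # initial evaluation-function weights
        best_cost = run_search(w)
        for _ in range(iterations):
            candidate = w + step * rng.normal(size=dim)  # local perturbation
            cost = run_search(candidate)                 # noisy, expensive score
            if cost < best_cost:                         # keep only improvements
                w, best_cost = candidate, cost
        return w, best_cost

Any local optimizer could replace the simple hill-climbing step here; the advances in local optimization cited above [Moore and Schneider1996] target exactly this kind of noisy, expensive objective.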

I will apply these techniques to all or most of the following large-scale combinatorial domains: VLSI channel routing; treatment planning for radiation therapy; scheduling a factory production line; configuring the subsystems of the NASA Outer Planet Orbiter; and move selection in the game of backgammon. These applications are described in Section 5. I will also validate my new algorithms on standard problems from the optimization and control literatures. Finally, a byproduct of this research will be a software package for public use.


