
Literature Review

Any review of the literature on reinforcement learning and evaluation functions must begin with the pioneering work of Arthur Samuel on the game of checkers [Samuel1959, Samuel1967]. Samuel recognized the worth of the value function, saying that

we are attempting to make the score, calculated for the current board position, look like that calculated for the terminal board position of the chain of moves which most probably will occur during actual play. Of course, if one could develop a perfect system of this sort it would be the equivalent of always looking ahead to the end of the game. [Samuel1959, p. 219]
Samuel's program incrementally changed the coefficients of an evaluation polynomial so as to make each visited state's value closer to the value obtained from lookahead search.
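
In modern notation, the spirit of Samuel's update can be sketched as a correction of the coefficient vector toward the lookahead value. This is only an illustrative reconstruction, not Samuel's exact procedure; the symbols $\alpha$ (step size), $\vec{\phi}(s)$ (board features), and $V_{\text{search}}(s)$ (the value returned by lookahead search from state $s$) are labels introduced here:

\[
\vec{w} \;\leftarrow\; \vec{w} \;+\; \alpha \,\bigl( V_{\text{search}}(s) - \vec{w} \cdot \vec{\phi}(s) \bigr)\, \vec{\phi}(s)
\]

so that the linear evaluation $\vec{w}\cdot\vec{\phi}(s)$ moves toward the backed-up search value each time state $s$ is visited.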

After Samuel, work in evaluation function learning was sporadic until the 1980s. Christensen [Christensen1986] tried replacing Samuel's coefficient-tweaking procedure with least-squares regression, and was able to learn reasonable weights for the chess material-advantage function. Lee and Mahajan [Lee and Mahajan1988] trained a nonlinear evaluation function on expertly played games, yielding high-quality Othello play. But the biggest advance in the state of the art came more recently, when the reinforcement-learning community elaborated the connection between AI search and the Bellman equations [Barto et al. 1989, Watkins1989, Sutton1990]. This connection had gone unexplored despite the publication of an AI textbook by Richard Bellman himself [Bellman1978]!

Reinforcement learning's most celebrated success has also been in a game domain, the game of backgammon [Tesauro1992, Tesauro1994, Boyan1992]. Tesauro modified Sutton's TD($\lambda$) algorithm [Sutton1988], which is normally thought of as a model-free algorithm for learning to predict, into a model-based algorithm for learning to control. An implementation is given in Appendix A.1. When $\lambda = 0$, this algorithm reduces to the RTDP algorithm [Barto et al. 1995], which is closely related to VI; the key difference is that its backups are done along sample trajectories through the MDP, rather than in sweeps over the entire state space. Applying TD($\lambda$) with a standard multi-layer perceptron function approximator, Tesauro's program learned an evaluation function which produced expert-level backgammon play. Boyan [Boyan1992] replicated Tesauro's results and demonstrated improvements using modular neural networks.

Tesauro's combination of TD($\lambda$) and neural networks has been applied to other domains, including elevator control [Crites and Barto1996] and job-shop scheduling [Zhang and Dietterich1995]. (I will discuss the scheduling application in detail in Section 3.2.) Nevertheless, it is important to note that when function approximators are used, TD($\lambda$) provides no guarantees of optimality. In the case of uncontrolled Markov chains and linear function approximators, online TD($\lambda$) does converge [Sutton1988, Dayan1992, Tsitsiklis and Van Roy1996]--but even then, if $\lambda < 1$, convergence is not necessarily to a good approximation of the true value function $V^*$ [Bertsekas1995]. Moreover, in the general case of arbitrary function approximators and Markov Decision Problems, repeatedly applying one-step backups may propagate and enlarge approximation errors, leading to instability [Thrun and Schwartz1993, Boyan and Moore1995, Gordon1995].

