Value Function Approximation workshop at ML95

July 9, 1995


Introduction: Value Function Approximation
Andrew Moore (CMU)
The key computational idea underlying reinforcement learning is the iterative approximation of the value function---the mapping from states (or state-action pairs) to an estimate of the long-term future reward obtainable from that state. For large problems, the approximation of the value function must involve generalization from examples to reduce memory requirements and training time. Moreover, generalization may help beat the curse of dimensionality for problems in which the complexity of the value function increases sub-exponentially with the number of state variables. Generalizing function approximators have been used effectively in reinforcement learning as far back as Samuel's checker player, which used a linear approximator, and Michie and Chambers' BOXES system, which used state aggregation. Tesauro's TD-Gammon, which used a backpropagation network, provides a tantalizing recent demonstration of just how effective this can be.

However, almost all of the theory of reinforcement learning depends on the assumption of tabular representations of the value function, for which generalization is impossible. Moreover, several researchers have argued that there are serious hazards intrinsic to the approximation of value functions by reinforcement learning methods. Just how serious these hazards might be remains an important and still unresolved question. In this workshop we will survey the substantial recent and ongoing work pertaining to this question. We will seek to answer it or, failing that, to identify the remaining outstanding issues.


New Approaches to Dynamic Programming for Continuous Systems
Chris Atkeson (Georgia Tech)
This talk describes the use of local optimization and sophisticated data structures in dynamic programming for continuous systems. The techniques use local linear models of the system and local quadratic models of the optimization criterion, and are related to linear-quadratic (LQ) and differential dynamic programming (DDP) techniques. These techniques can be used to define regions where greedy local optimization is sufficient to produce globally optimal trajectories.
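As a rough illustration of the kind of LQ machinery involved, the sketch below performs a single backward (Riccati) dynamic-programming step for a locally linearized system x' = A x + B u with quadratic cost x^T Q x + u^T R u; the function name, the example matrices, and the numpy-based formulation are assumptions for illustration, not details from the talk.

    # A minimal sketch of one discrete-time LQ backup, assuming local linear
    # dynamics x' = A x + B u and a local quadratic cost x^T Q x + u^T R u.
    # All names and the example matrices are illustrative assumptions.
    import numpy as np

    def lq_backup(A, B, Q, R, P_next):
        """One dynamic-programming backup for a locally linear-quadratic model.

        Returns the quadratic cost-to-go matrix P and feedback gain K such
        that u = -K x is the greedy (locally optimal) action.
        """
        # Minimize over u: x^T Q x + u^T R u + (A x + B u)^T P_next (A x + B u)
        S = R + B.T @ P_next @ B
        K = np.linalg.solve(S, B.T @ P_next @ A)   # optimal local feedback gain
        P = Q + A.T @ P_next @ (A - B @ K)         # updated quadratic value model
        return P, K

    # Example: a double integrator, linearized about a point on a trajectory.
    dt = 0.1
    A = np.array([[1.0, dt], [0.0, 1.0]])
    B = np.array([[0.0], [dt]])
    P, K = lq_backup(A, B, Q=np.eye(2), R=np.array([[0.1]]), P_next=np.zeros((2, 2)))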

Fast convergence for value function approximation
Leemon Baird (USAF)
This talk will briefly motivate and describe residual gradient algorithms (which will also be presented in the main conference). It will then focus on why the slow convergence of residual gradient algorithms cannot be fixed easily through modification of the function approximation system, or through simple fixes such as Delta-Bar-Delta or Quickprop. This problem will not be discussed in the main conference, but it was the motivation for residual algorithms, which will not be described in the workshop but will be described in detail in the conference.
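For readers unfamiliar with the family of algorithms named above, here is a minimal sketch of the residual-gradient TD(0) update next to the conventional direct update, assuming a linear approximator V(s) = w . phi(s); this is only an illustration of the general idea, not Baird's exact formulation.

    # A hedged sketch contrasting the conventional (direct) TD(0) update with
    # the residual-gradient update, assuming a linear approximator
    # V(s) = w . phi(s).  Function names are illustrative assumptions.
    import numpy as np

    def td_direct_update(w, phi_s, phi_next, r, gamma, alpha):
        """Direct TD(0): follows the gradient with respect to V(s) only."""
        delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi_s)
        return w + alpha * delta * phi_s

    def residual_gradient_update(w, phi_s, phi_next, r, gamma, alpha):
        """Residual gradient: the true gradient of the squared Bellman
        residual, which also adjusts V(s'); convergent, but often slow."""
        delta = r + gamma * np.dot(w, phi_next) - np.dot(w, phi_s)
        return w + alpha * delta * (phi_s - gamma * phi_next)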

Robust VFA Algorithms Not Based On Value Iteration
Justin Boyan (CMU)
In value iteration and related TD algorithms for learning value functions, values all across the state space must be repeatedly and recursively re-estimated, which can cause instability in the presence of function-approximation error. A more robust alternative is to generate optimal values by working strictly backwards from goal states, since such values, once found, can be held fixed. I will give the conditions under which such working-backwards is possible, discuss the issues involved in scaling up to continuous state spaces with function approximation, and present a pair of promising algorithms.
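As a concrete picture of the working-backwards idea, the tabular sketch below assigns values outward from the goal states of a deterministic minimum-cost-path problem, never revising a value once it is finalized; the data structures are assumptions for illustration, not the algorithms presented in the talk.

    # A minimal tabular sketch of working strictly backwards from goal states
    # in a deterministic minimum-cost-path problem: values are finalized in
    # order of increasing cost-to-go and never recursively re-estimated.
    # The data structures here are illustrative assumptions.
    import heapq

    def backward_values(goal_states, predecessors):
        """predecessors[s] -> list of (s_prev, cost) pairs for edges s_prev -> s."""
        values = {g: 0.0 for g in goal_states}
        frontier = [(0.0, g) for g in goal_states]
        heapq.heapify(frontier)
        while frontier:
            v, s = heapq.heappop(frontier)
            if v > values.get(s, float("inf")):
                continue                  # stale queue entry; s is already finalized
            # v is now the optimal cost-to-go from s; it is never revised again.
            for s_prev, cost in predecessors.get(s, []):
                if v + cost < values.get(s_prev, float("inf")):
                    values[s_prev] = v + cost
                    heapq.heappush(frontier, (v + cost, s_prev))
        return values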

Reinforcement Learning With Compact Function Approximation: A Case Study from the Elevator Dispatching Domain
Bob Crites (U. Mass.)
This talk describes the application of reinforcement learning (RL) to elevator dispatching, a difficult problem with a very large state space, where lookup table representations are completely infeasible. As in Tesauro's successful TD-Gammon system, neural networks trained with backpropagation were used as compact function approximators. The elevator system operates in a continuous state space and in continuous time as a discrete event dynamic system. The state is not fully observable and there is non-stationarity due to changing passenger arrival rates. In addition, we used a team of RL agents, each of which was responsible for controlling one elevator car. In spite of all of these complications, we show results that surpass the best of the heuristic elevator control algorithms of which we are aware. These results provide evidence that rather than being an isolated event, the success of RL with compact function approximation in backgammon may be repeatable for many difficult problems.

Value Function Approximations and Job Shop Scheduling
Tom Dietterich (Oregon State)
We use TD(lambda) with a standard multilayer sigmoidal network to learn a value function for iterative repair job-shop scheduling. The learned value function is very good, and the performance of the scheduler beats the best previous method on this task. Measurements of the learning process show that TD(lambda) is succeeding in value iteration---information about the final reinforcement is reflected in preceding states. Value function approximation works fine in this domain. Indeed, it might work well even without doing value iteration---just by locally smoothing the objective function.
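For reference, the shape of the TD(lambda) update used here is sketched below with accumulating eligibility traces; for brevity the sketch assumes a linear approximator rather than the multilayer sigmoidal network from the talk, so it illustrates only the form of the update.

    # A minimal sketch of TD(lambda) with accumulating eligibility traces.
    # For brevity this assumes a linear approximator V(s) = w . phi(s); the
    # work described above used a multilayer sigmoidal network, so treat this
    # only as the shape of the update.
    import numpy as np

    def td_lambda_episode(w, episode, gamma=1.0, lam=0.7, alpha=0.01):
        """episode: list of (phi_s, reward, phi_next) transitions, with
        phi_next set to None at the terminal state."""
        e = np.zeros_like(w)                       # eligibility trace
        for phi_s, r, phi_next in episode:
            v_s = np.dot(w, phi_s)
            v_next = 0.0 if phi_next is None else np.dot(w, phi_next)
            delta = r + gamma * v_next - v_s       # TD error
            e = gamma * lam * e + phi_s            # decay and accumulate trace
            w = w + alpha * delta * e              # credit earlier states as well
        return w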

AHC for Stochastic Minimum Cost Path Problems Using Neural Net Function Approximators
Stuart Dreyfus (Berkeley IEOR)
I plan to begin my presentation by suggesting that the adaptive heuristic critic (AHC) idea, implemented using neural network value and action function approximators, seems a plausible explanation of learned skillful coping behavior in animals. I will note that the output of the function approximators should depend not only on the current stimulus (state), but also on some internal state of the approximator that in turn depends on the prior trajectory of the process. (An Elman net is a particular realization of this requirement.) This automatically introduces sensitivity of value and action to whatever in the past is relevant to the dynamics and reinforcement. Having touched on this intriguing conceptual matter, I will quickly turn to an unrelated, mundane but practical, concern.

Suppose that, in an AHC algorithm, the value and action mappings are approximated via a parametrized functional form (e.g., a neural network), and that the parameters are adjusted to fit learned values and actions, using TD(0) for the value adjustment. Let squared error = (value of the current step + value of finishing given the next state - value of finishing given the current state)^2. Changing the mapping parameters affects the second and third terms of this squared error.

On a small toy problem that is a special case of a Markovian decision problem, if the dynamics are deterministic, the parameters are adjusted in the negative gradient direction based on the above observation about parameter effects, and the step size slowly goes to zero, the right answer appears to result. Interestingly, if one neglects the effect of the parameters on the second term of the squared error when computing the "gradient", all still goes well, perhaps even more rapidly.

Now, and this is the point of my presentation, if the dynamics are stochastic, using the wrong gradient (where the effect of parameter changes on the second term is neglected) seems to yield almost, but not quite, correct expected values, while using the right gradient, including the effect on the second term, yields very wrong results. While I have some intuition about why the right gradient gives the wrong result for a stochastic problem, I have none concerning why the wrong gradient appears to give the right result in the deterministic case and almost, but not quite, correct results in the stochastic case. The listener is encouraged to use the wrong gradient in his or her AHC computations for stochastic problems, and hopefully to figure out why it almost works. More crucially, since both gradients are probably provably wrong, one wonders whether any procedure using a parametrized function approximator is provably correct, or whether "approximately optimal" is the best one can hope for concerning Markovian decision models.
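To make the distinction concrete, the sketch below computes both updates for a generic differentiable approximator V(s; w); the function signatures and the scalar "step value" are assumptions for illustration only.

    # A sketch of the two parameter updates discussed above, for a generic
    # differentiable approximator V(s, w) with gradient grad_V(s, w), where w
    # and the gradients are numpy arrays.  The "wrong" update neglects the
    # effect of w on the next-state term; the "right" one follows the full
    # negative gradient of the squared error.  Signatures are assumptions.

    def ahc_value_updates(w, s, s_next, step_value, V, grad_V, alpha):
        # error = step value + V(s_next; w) - V(s; w)
        err = step_value + V(s_next, w) - V(s, w)
        wrong = w + alpha * err * grad_V(s, w)                        # effect on 2nd term neglected
        right = w + alpha * err * (grad_V(s, w) - grad_V(s_next, w))  # full gradient
        return wrong, right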

Online Fitted Reinforcement Learning
Geoff Gordon (CMU)
My paper in the main portion of the conference deals with fitted value iteration or Q-learning for offline problems, i.e., those where we have a model of the environment so that we can examine arbitrary transitions in arbitrary order. The same techniques also allow us to do Q-learning for an online problem, i.e., one where we have no model but must instead perform experiments inside the MDP to gather data. I will describe how.
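As a minimal illustration of the offline setting mentioned above, the sketch below performs one round of fitted Q-iteration from a batch of sampled transitions, using any supervised regressor with fit/predict methods; this assumed interface and all details are illustrative, not the algorithms from the paper.

    # A minimal sketch of one round of offline fitted Q iteration from a batch
    # of transitions, assuming a supervised regressor with fit/predict methods
    # (an illustrative interface, not the paper's algorithms).
    import numpy as np

    def fitted_q_round(regressor, transitions, actions, gamma=0.95):
        """transitions: list of (state, action, reward, next_state) tuples.
        Assumes the regressor has already been fitted at least once (e.g., to
        the immediate rewards), so that predict() is usable."""
        X, y = [], []
        for s, a, r, s_next in transitions:
            # Greedy one-step backup through the current approximation.
            q_next = max(regressor.predict([np.append(s_next, b)])[0] for b in actions)
            X.append(np.append(s, a))
            y.append(r + gamma * q_next)       # Bellman target as a regression label
        regressor.fit(X, y)                    # refit the approximator to the targets
        return regressor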

Solving Partially Observable Markov Decision Processes Via VFA
Michael L. Littman (Brown)
Partially observable Markov decision processes (POMDPs) generalize MDPs to the case where the decision maker does not necessarily have perfect information as to its current state. It is well known that, given a complete description of a POMDP environment, such problems can be formulated as continuous-space MDPs whose states are probability distributions over the discrete states of the environment. Solving POMDPs in this way can be viewed as one of the early, tentatively successful applications of VFA. Using some knowledge of the special structure of POMDP value functions, we have used a simple VFA scheme to generate high-quality solutions to larger problems than are possible using traditional analytic techniques. In my talk, I will summarize the work in this area by Parr and Russell at Berkeley, and that of Littman, Cassandra, and Kaelbling at Brown.
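For concreteness, the belief-state update underlying the continuous-space formulation is sketched below; the array layouts (T[a][s, s'] for transitions, O[a][s', o] for observations) are assumptions for illustration.

    # A sketch of the belief-state update that turns a POMDP into a
    # continuous-space MDP over probability distributions.  T[a][s, s'] is
    # the transition probability and O[a][s', o] the observation probability;
    # both array layouts are illustrative assumptions.
    import numpy as np

    def belief_update(belief, a, o, T, O):
        """Bayes rule: b'(s') is proportional to O[a][s', o] * sum_s T[a][s, s'] * b(s)."""
        predicted = belief @ T[a]              # push the belief through the dynamics
        unnormalized = predicted * O[a][:, o]  # weight by the observation likelihood
        return unnormalized / unnormalized.sum()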

[Special note: I am very interested in seeing how well various approaches to VFA handle POMDP domains. If you have a general approach to finding good value functions for continuous space MDPs and would be willing to let us try your algorithm on our problem, please contact me: mlittman@cs.brown.edu.]

Learning Smooth Control by Fitting Value Function Derivatives
Satinder Singh (MIT)
The Hamilton-Jacobi-Bellman (HJB) equation for continuous optimal control problems is expressed solely in terms of derivatives of the value function*. Discretizing such problems leads to the familiar Bellman equation that forms the basis for the "backups" in both dynamic programming and reinforcement learning algorithms. We propose a new method for solving continuous control problems that attempts to train a function approximator to directly represent the derivatives of the value function, without discretizing the state space. The goal of the training is to satisfy as well as possible the constraints imposed by the HJB equation (and the curl condition). We also use the crucial fact that it is possible to do policy iteration based on these derivatives alone. Note that there is nothing analogous to the usual backup in this method. We tested our method by approximating the derivatives of the value function as a linearly weighted combination of local conservative vector fields, derived as gradients of standard radial basis functions. The method has so far been applied only to simple problems and may have several limitations (including difficulty in dealing with discontinuities in the value function).
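One standard form of the HJB equation referred to above, written here for deterministic dynamics xdot = f(x, u) and an undiscounted total-cost criterion with running cost l(x, u) (the talk's exact setting may differ), is

    0 = \min_u \left[ \ell(x, u) + \nabla_x V(x)^{\top} f(x, u) \right],

in which only the gradient \nabla_x V of the value function appears; this is what makes it possible to fit the derivatives directly.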

* (Note that derivatives of the value function are closely related to the difference between Q (or state-action) values and V (or state) values, called "advantages" by Baird. However, there does not seem to be any way of learning the relative quantities (advantages) without also learning absolute quantities in the general discrete MDP. We show that at least in certain continuous problems, it is possible to learn the relative quantities (derivatives of value functions) alone.)

On The Virtues of Linear Learning and Online Training
Rich Sutton (U. Mass.)
In contrast to recent work by Boyan and Moore, I have obtained excellent results applying conventional VFA methods to control tasks such as "Mountain Car" and "Acrobot". The difference in our results is due to the form of function approximator and the nature of the training. I used a local function approximator known as a CMAC---essentially a linear learner using a sparse, coarse-coded representation of the state. Although not a complete solution, I argue that this approach solves the "curse of dimensionality" as well as one can expect, and enables VFA systems to generalize as well as other artificial learning systems. Also, I trained online, using actual, experienced state trajectories and a form of the TD(lambda) algorithm. A theorem by Dayan assures stability of TD(lambda) when trained online. Gordon, and Tsitsiklis and Van Roy, however, have recently shown that TD(lambda) can be unstable when trained offline, as by Boyan and Moore. I conclude by discussing possibilities for extending Dayan's theorem to prove tight bounds on the accuracy of approximation under conditions of online training.
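A rough sketch of the kind of sparse, coarse-coded (CMAC-style) representation described above, combined with a linear TD(0) update, appears below; the tiling layout, the assumption of a normalized two-dimensional state, and all names are illustrative, not the talk's exact setup.

    # A minimal sketch of a CMAC-style representation: several offset tilings
    # over a state in [0, 1)^2, each contributing one active binary feature,
    # combined with a linear TD(0) update.  The tiling layout and names are
    # illustrative assumptions, not the talk's exact setup.
    import numpy as np

    def active_tiles(x, y, n_tilings=8, tiles_per_dim=10):
        """Return one active feature index per tiling for a state in [0, 1)^2."""
        indices = []
        for t in range(n_tilings):
            offset = t / (n_tilings * tiles_per_dim)       # shift each tiling slightly
            xi = int((x + offset) * tiles_per_dim) % tiles_per_dim
            yi = int((y + offset) * tiles_per_dim) % tiles_per_dim
            indices.append(t * tiles_per_dim ** 2 + xi * tiles_per_dim + yi)
        return indices

    def td0_update(w, s, s_next, r, gamma=1.0, alpha=0.1, n_tilings=8):
        """Linear TD(0) over the sparse features: V(s) is the sum of active weights."""
        tiles_s = active_tiles(*s, n_tilings=n_tilings)
        tiles_next = active_tiles(*s_next, n_tilings=n_tilings)
        delta = r + gamma * sum(w[i] for i in tiles_next) - sum(w[i] for i in tiles_s)
        for i in tiles_s:
            w[i] += (alpha / n_tilings) * delta            # step size shared across tilings
        return w

    # Example: weights for 8 tilings of a 10x10 grid.
    w = np.zeros(8 * 10 * 10)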

Counterintuitive aspects of TD-Gammon
Gerry Tesauro (IBM)
As most attendees at this workshop probably know, a number of surprising and counterintuitive results have been obtained in applying TD learning and a neural network function approximator to the domain of backgammon, results that by and large have not yet been obtained in other applications. This talk will focus on some of these counterintuitive findings, with particular emphasis on new data obtained in the domain of "Hypergammon," i.e., 3-checker backgammon. Hypergammon is a much simpler yet still non-trivial variant of backgammon, and it has the advantage that TD-trained neural nets can be compared to the exact solution, which is commercially available as a database on CD-ROM. Specific topics to be covered include: (a) why relative error (i.e., the quality of the policy) is substantially better than absolute error (i.e., the quality of the value function); (b) scaling and parameter tuning, and the apparent insensitivity of training to the particular value of lambda; (c) why the quality of play is surprisingly strong even with a "raw" input description; (d) why early-game play is significantly better than end-game play.

[Effects of Overestimation in Q-learning with Function Approximation]
Sebastian Thrun (Bonn)
[abstract]

Input Representation Engineering
Paul Utgoff (U. Mass.)
The effectiveness of value-function approximation depends in large part on the ease with which a good approximation can be represented over the input variables. I will illustrate some of the input representation engineering that was needed to enable a TD(lambda) learner to find a high-quality evaluation function for Othello.
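As a purely hypothetical illustration of what input representation engineering can look like, the sketch below computes a few common hand-crafted Othello features; these are not necessarily the features from the talk.

    # A hypothetical illustration of hand-engineered input features for an
    # Othello evaluation function.  These particular features are common
    # hand-crafted choices, not necessarily the representation from the talk.
    def othello_features(board, player):
        """board: 8x8 list of lists with +1 (player), -1 (opponent), 0 (empty)."""
        corners = [board[0][0], board[0][7], board[7][0], board[7][7]]
        my_discs = sum(cell == player for row in board for cell in row)
        opp_discs = sum(cell == -player for row in board for cell in row)
        empties = sum(cell == 0 for row in board for cell in row)
        return [
            my_discs - opp_discs,                 # raw material balance
            sum(c == player for c in corners),    # corners held by the player
            sum(c == -player for c in corners),   # corners held by the opponent
            empties,                              # rough indicator of game phase
        ]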

PANEL DISCUSSION
Baird-Boyan-Moore-Singh-Sutton and audience
Some questions for discussion include:
  1. Are there function approximators especially suited to VFA?
  2. What does online learning (backups on real trajectories) buy you?
  3. Are there alternatives to TD/Backup-based methods?
  4. Do we need value functions at all (what about Parti-game, valueless approaches, or optimization approaches)?
  5. What are the roles of models? Note that the success stories described by Tesauro, Crites and Dietterich have all applied to problems in which the model is known a priori.
  6. Have these problems already been attacked in other fields? Games? Control? OR?
  7. How important would the consequences be if VFA-based RL became a widely used, reliable method?


Maintainer:
Justin.Boyan@cs.cmu.edu