I have presented two reasons why working strictly backwards may be
desirable: efficiency, because updates need only be done on the
``frontier'' rather than all over state space; and robustness, because
correct
values, once assigned, need never again be changed. I
have therefore investigated generalizations of the Dijkstra and DAG-SP
algorithms specifically modified to accommodate huge state spaces and
value function approximation. My variant of Dijkstra's algorithm,
called Grow-Support, was presented in [Boyan and Moore, 1995] and will not
be explored further in this thesis. My variant of DAG-SP is an
algorithm called ROUT [Boyan and Moore, 1996], which this thesis will
develop further.
In the huge domains for which ROUT is designed, DAG-SP's key
preprocessing step--topologically sorting the entire state space--is
no longer tractable. Instead, ROUT must expend some extra effort to
identify states on the current frontier. Once identified (as
described below), a frontier state is assigned its optimal
value
by a simple one-step backup, and this (state, value) pair
is added to a training set for a function approximator. Thus, ROUT's
main loop consists of identifying a frontier state; determining its
value; and retraining the approximator (see
Appendix A.2). The training set, constructed
adaptively, grows backwards from the goal.
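To make this loop concrete, the following Python-style sketch mirrors the three steps just described. It is illustrative only: the `mdp` and `approx` interfaces, the deterministic-successor assumption, and the `hunt_frontier_state` routine (sketched after Figure 1) are placeholders rather than the pseudocode of Appendix A.2.

```python
# Illustrative sketch of ROUT's main loop, not the pseudocode of Appendix A.2.
# Assumed interfaces: `mdp` provides start_state, actions(s), successor(s, a),
# cost(s, a), and is_terminal(s) for a deterministic acyclic domain; `approx`
# is any regression model with fit(states, values) and predict(state).

def one_step_backup(mdp, approx, s):
    """Back up s's value from its successors' (approximated) values."""
    def value(s2):
        return 0.0 if mdp.is_terminal(s2) else approx.predict(s2)
    return min(mdp.cost(s, a) + value(mdp.successor(s, a))
               for a in mdp.actions(s))

def rout(mdp, approx, max_iters=1000):
    training_set = []        # grows backwards from the goal
    seen = set()             # states already added (assumes hashable states)
    for _ in range(max_iters):
        x = hunt_frontier_state(mdp, approx)     # sketched after Figure 1
        if x in seen:
            break            # no new frontier state: coverage reached the start
        seen.add(x)
        y = one_step_backup(mdp, approx, x)      # x's value by one-step backup
        training_set.append((x, y))
        approx.fit([s for s, _ in training_set],
                   [v for _, v in training_set]) # retrain on the whole set
    return approx
```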
ROUT's key subroutine, HUNTFRONTIERSTATE, is responsible for identifying a good state x to add to the training set. In particular, it hunts for a state from which all sampled trajectories are self-consistent under the current approximator, as illustrated in Figure 1.
Figure 1: A schematic of ROUT working on an acyclic two-dimensional
navigation domain in which only three actions are allowed.
Suppose that ROUT has thus far established training values for the
optimal value function at the triangles, and that the function
approximator has successfully generalized these values throughout the
shaded region.
Now, when HUNTFRONTIERSTATE generates a trajectory from the start state to
termination (solid line), it finds that several states along
that trajectory are inconsistent (marked by crosses). The last
such cross becomes the new starting point for HUNTFRONTIERSTATE. From there,
all trajectories generated (dashed lines) are fully
self-consistent, so that state gets added to ROUT's training
set. When the function approximator is re-trained, the shaded
region of validity should grow, backwards from the goal.
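The procedure illustrated in the figure might be rendered roughly as follows. This is a reconstruction from the description above, not the thesis's own pseudocode: the random rollout policy, the residual test on states strictly after the candidate, and the reuse of `one_step_backup` from the earlier sketch are all assumptions.

```python
import random

def bellman_residual(mdp, approx, s):
    """How far the approximator's value at s is from a one-step backup."""
    return abs(approx.predict(s) - one_step_backup(mdp, approx, s))

def hunt_frontier_state(mdp, approx, H=20, eps=0.05):
    """Hunt for a state from which H rollouts are fully self-consistent."""
    x = mdp.start_state
    consistent_runs = 0
    while consistent_runs < H:
        # One rollout from x to termination, noting any inconsistent states
        # strictly after x (x itself may be inconsistent; that is exactly
        # why it is the candidate to be added).
        s, last_inconsistent = x, None
        while not mdp.is_terminal(s):
            s = mdp.successor(s, random.choice(mdp.actions(s)))
            if not mdp.is_terminal(s) and bellman_residual(mdp, approx, s) > eps:
                last_inconsistent = s            # the latest "cross" on this run
        if last_inconsistent is None:
            consistent_runs += 1                 # one more clean trajectory
        else:
            x = last_inconsistent                # restart the hunt further along
            consistent_runs = 0
    return x                                     # H clean rollouts certify x
```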
The parameters of the ROUT algorithm are H, the number of
trajectories generated to certify a state's readiness, and $\epsilon$,
the tolerated Bellman residual. ROUT's convergence to the optimal
value function, assuming the function approximator can fit the training
set perfectly, can be guaranteed in the limiting case where
$H \to \infty$ (assuring exploration of all states reachable from
x) and $\epsilon = 0$. In practice, of course, we want to be tolerant
of some approximation error. Typical settings I used were H=20 and
$\epsilon$ equal to roughly 5% of the range of the value function.
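In terms of the sketches above, those settings would be passed in roughly like this; `estimated_value_range` is a hypothetical helper for the value function's range, not part of ROUT.

```python
# Typical settings from the text: H = 20 certification rollouts and a
# Bellman-residual tolerance of about 5% of the value function's range.
value_range = estimated_value_range(mdp)   # hypothetical helper, not in ROUT
frontier = hunt_frontier_state(mdp, approx, H=20, eps=0.05 * value_range)
```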
In experiments on two medium-sized domains, ROUT learned evaluation
functions that were as good as or better than those learned by
TD($\lambda$) with the same neural-net approximator, and it used an
order of magnitude less training data to do so. These results are
detailed in Appendix B.1.
When a function approximator is capable of fitting the optimal value
function, ROUT will,
in the limit, find it. However, for ROUT to be efficient, the
frontier must grow backward from the goal quickly, and this depends on
good extrapolation from the training set. When good extrapolation
does not occur, ROUT may become stuck, repeatedly adding points near
the goal region and never progressing backwards. Some function
approximators may be especially well-suited to ROUT's required
extrapolation from accurate training data, and I plan to explore this.
I will also explore algorithmic refinements to make ROUT more robust
in general. One such refinement is to adapt the tolerance level
$\epsilon$, thereby guaranteeing progress at the expense of accuracy.
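As a purely speculative sketch of that refinement (not part of ROUT as published), the tolerance might be relaxed whenever the hunt fails to make backward progress; the growth factor and the progress test are illustrative assumptions.

```python
def hunt_with_adaptive_eps(mdp, approx, seen, eps, growth=1.5):
    """Relax the tolerance until the hunt yields a state not already certified."""
    while True:
        x = hunt_frontier_state(mdp, approx, H=20, eps=eps)
        if x not in seen:
            return x, eps      # progress made at the current tolerance
        eps *= growth          # trade accuracy for guaranteed backward progress
```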
With such improvements, I hope to develop ROUT into the algorithm of
choice for learning evaluation functions for large-scale acyclic
domains. I will validate it empirically on the control problem of
factory scheduling (Section 5.3), the game of
backgammon, and a variety of
combinatorial optimization problems formulated as acyclic MDPs as
described below.