
Markov Decision Problems

 

Formally, let our search space be represented as a Markov decision problem (MDP), defined by a finite set of states $X$; a finite set of actions $A$; a transition model $P(y \mid x,a)$, giving the probability that taking action $a$ in state $x$ leads to state $y$; and a reward function $R(x,a)$, giving the expected immediate reward for taking action $a$ in state $x$.

An agent in an MDP environment observes its current state $x \in X$, selects an action $a \in A$, and as a result receives a reward and moves probabilistically to another state $y \in X$. The model is flexible enough to represent AI planning problems, stochastic games (e.g. backgammon) against a fixed opponent, and combinatorial optimization search spaces. With natural extensions, it can also represent continuous stochastic control domains, two-player games, and many other problem formulations [Littman and Szepesvári, 1996; Harmon et al., 1995].
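To make this definition concrete, the following is a minimal sketch (mine, not the thesis's; all names and the toy numbers are illustrative) of how such a finite MDP can be encoded as lookup tables: P[x][a] holds the transition distribution and R[x][a] the expected immediate reward.

  # Hypothetical tabular MDP encoding (illustrative only).
  # P[x][a] is a list of (probability, next_state) pairs summing to 1;
  # R[x][a] is the expected immediate reward for taking action a in state x.
  # State 2 is terminal: it offers no actions.
  P = {
      0: {0: [(1.0, 1)], 1: [(0.5, 1), (0.5, 2)]},
      1: {0: [(1.0, 2)], 1: [(1.0, 2)]},
      2: {},
  }
  R = {
      0: {0: -1.0, 1: -1.0},
      1: {0: -1.0, 1: -2.0},
      2: {},
  }
  gamma = 1.0   # undiscounted, as in most of this thesis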

Specifying a fixed deterministic policy $\pi : X \rightarrow A$ reduces an MDP to a Markov chain. The policy value function $V^\pi$ is the expected long-term reward accumulated over trajectories through the chain:

    V^\pi(x) \;=\; E\left[ \sum_{t=0}^{\infty} \gamma^t\, R(x_t, \pi(x_t)) \;\Big|\; x_0 = x \right]

Here, $\gamma \in [0,1]$ is a discount factor which determines how important long-term rewards are; my work in this thesis predominantly uses the undiscounted case of $\gamma = 1$. The form of $V^\pi$ above is convenient because it satisfies this linear system of Bellman equations for prediction:

    V^\pi(x) \;=\; R(x, \pi(x)) \;+\; \gamma \sum_{y \in X} P(y \mid x, \pi(x))\, V^\pi(y)  \qquad \forall x \in X      (3)
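Since these prediction equations are linear in the unknowns $V^\pi(x)$, a small tabular MDP can be solved for $V^\pi$ directly. The sketch below is illustrative only (it assumes numpy and the hypothetical P/R encoding from the earlier sketch), not code from the thesis:

  import numpy as np

  def policy_value(P, R, pi, gamma):
      """Solve the linear Bellman prediction equations (Equation 3) for a
      fixed deterministic policy pi (a dict mapping state -> action).
      P[x][a] is a list of (probability, next_state) pairs, R[x][a] is the
      expected immediate reward; states with no actions are terminal and
      are pinned to value 0."""
      states = sorted(P)
      idx = {x: i for i, x in enumerate(states)}
      A = np.eye(len(states))
      b = np.zeros(len(states))
      for x in states:
          if not P[x]:                     # terminal: V(x) = 0
              continue
          a = pi[x]
          b[idx[x]] = R[x][a]
          for prob, y in P[x][a]:
              A[idx[x], idx[y]] -= gamma * prob
      V = np.linalg.solve(A, b)
      return {x: V[idx[x]] for x in states}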

The optimal solution to an MDP is a policy $\pi^*$ which maximizes $V^\pi$. The policy value function of $\pi^*$ is the optimal value function $V^*$. It satisfies the Bellman equations for control:

    V^*(x) \;=\; \max_{a \in A} \left[ R(x, a) \;+\; \gamma \sum_{y \in X} P(y \mid x, a)\, V^*(y) \right]  \qquad \forall x \in X      (4)

From the value function $V^*$, it is easy to infer the optimal policy: at any state $x$, any action which instantiates the max in Equation 4 is an optimal choice [Bellman, 1957]. This formalizes the notion that $V^*$ is an ideal evaluation function.
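As an illustration of this policy-extraction step (again a hypothetical sketch, not the thesis's code), the greedy action at each state is simply one that instantiates the max of Equation 4 under the current value estimates:

  def greedy_policy(P, R, V, gamma):
      """Extract a greedy policy from a value function V: at each state,
      pick an action that instantiates the max in Equation 4.
      P[x][a] is a list of (probability, next_state) pairs and R[x][a] is
      the expected immediate reward."""
      pi = {}
      for x, actions in P.items():
          if not actions:                  # terminal state: no action to choose
              continue
          pi[x] = max(actions, key=lambda a: R[x][a] +
                      gamma * sum(p * V[y] for p, y in actions[a]))
      return pi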

For small problems, those with few enough discrete states that the value function can be stored in a lookup table, $V^*$ can be computed by any of a variety of algorithms, including linear programming [Denardo, 1982], policy iteration [Howard, 1960], and value iteration [Bellman, 1957]. Value iteration (VI) has been the basis for the most important simulation-based algorithms for learning $V^*$, so I focus on it here.

VI computes $V^*$ by repeatedly sweeping over the state space, applying Equation 4 as an assignment statement (this is called a ``one-step backup'') at each state in parallel. If the lookup table is initialized with all 0's, then after the $i^{\rm th}$ sweep of VI, the table will represent the maximum expected return of a path of length $i$ from each state. For certain goal-oriented domains, this corresponds to the intuition that VI works by propagating correct $V^*$ values backwards, by one step per iteration, from the terminal states.
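A tabular implementation of such synchronous sweeps might look like the following sketch (illustrative only; the convergence tolerance and sweep limit are my assumptions, not part of the thesis):

  def value_iteration(P, R, gamma, sweeps=1000, tol=1e-8):
      """Tabular value iteration: repeatedly apply the Bellman control
      backup (Equation 4) at every state, starting from all zeros."""
      V = {x: 0.0 for x in P}
      for _ in range(sweeps):
          delta, newV = 0.0, {}
          for x, actions in P.items():
              if not actions:              # terminal state keeps value 0
                  newV[x] = 0.0
                  continue
              newV[x] = max(R[x][a] +
                            gamma * sum(p * V[y] for p, y in actions[a])
                            for a in actions)
              delta = max(delta, abs(newV[x] - V[x]))
          V = newV                         # synchronous ("parallel") update
          if delta < tol:
              break
      return V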

More precisely, there are two main classes of MDPs for which correct $V^*$ values can be assigned by working strictly backwards from terminal states:

  1. deterministic domains with no positive-reward cycles and with every state able to reach at least one terminal state. This class includes shortest-path and minimum cost-to-go problems.
  2. (possibly stochastic) acyclic domains: domains where no legal trajectory can pass through the same state twice. Many problems naturally have this property (e.g. games like tic-tac-toe and Connect-Four; combinatorial optimization domains as formulated in Section 3 below; any finite-horizon problem for which time is a component of the state).
Using VI to solve MDPs belonging to either of these special classes can be quite inefficient, since VI performs backups over the entire space, whereas the only backups useful for improving $V^*$ are those on the ``frontier'' between already-correct and not-yet-correct $V^*$ values. In fact, for small discrete problems there are classical algorithms for both problem classes which compute $V^*$ more efficiently by explicitly working backwards: for the deterministic class, Dijkstra's shortest-path algorithm; and for the acyclic class, DIRECTED-ACYCLIC-GRAPH-SHORTEST-PATHS (DAG-SP) [Cormen et al., 1990]. DAG-SP first topologically sorts the MDP, producing a linear ordering of the states in which every state $x$ precedes all states reachable from $x$. Then, it runs through that list in reverse, performing one backup per state. Worst-case bounds for VI, Dijkstra, and DAG-SP in deterministic domains with $X$ states and $A$ actions per state are $O(AX^2)$, $O(AX \log X)$, and $O(AX)$, respectively.
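For the acyclic class, the backwards-working strategy can be sketched as follows (an illustrative toy implementation using a depth-first topological sort, not the reference DAG-SP routine of Cormen et al.): states are visited so that all successors of a state are backed up before the state itself, giving one backup per state.

  def dag_sp_values(P, R, gamma=1.0):
      """Compute V* for an acyclic tabular MDP with one backup per state:
      visit states in reverse topological order (all successors first),
      applying the Bellman control backup once at each state."""
      order, seen = [], set()

      def visit(x):                        # depth-first post-order traversal
          if x in seen:
              return
          seen.add(x)
          for a in P[x]:
              for _, y in P[x][a]:
                  visit(y)
          order.append(x)                  # appended after all successors

      for x in P:
          visit(x)

      V = {}
      for x in order:                      # successors already have values
          acts = P[x]
          V[x] = 0.0 if not acts else max(
              R[x][a] + gamma * sum(p * V[y] for p, y in acts[a])
              for a in acts)
      return V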

Another difference between VI and working backwards is that VI repeatedly re-estimates the values at every state, using old predictions to generate new training values. By contrast, Dijkstra and DAG-SP are always explicitly aware of which states have their $V^*$ values already known, and can hold those values fixed. This will be important when we introduce generalization and the possibility of approximation error.

The VI, Dijkstra, and DAG-SP algorithms all apply exclusively to MDPs for which the state space can be exhaustively enumerated and the value function represented as a lookup table. For the quantized high-dimensional state spaces characteristic of real-world control tasks, the curse of dimensionality makes such enumeration intractable. Computing $V^*$ then requires generalization: one natural technique is to encode the states as real-valued feature vectors and to use a function approximator to fit $V^*$ over this feature space. This technique goes by the names Neuro-Dynamic Programming [Bertsekas and Tsitsiklis, 1996] and Value Function Approximation (VFA) [Boyan et al., 1995].
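One simple instance of this idea, given here only as an illustrative sketch under assumed names (it is not the specific VFA algorithm studied later in this thesis), is fitted value iteration with a linear approximator over state features: repeatedly back up the values at a set of sample states and refit the approximator to the backed-up targets.

  import numpy as np

  def fitted_value_iteration(sample_states, features, backup, iterations=50):
      """Approximate V* as a linear function w . features(x).
      `sample_states` is a list of representative states, `features(x)`
      returns a real-valued feature vector, and `backup(x, V)` returns the
      one-step Bellman backup at x given the current value estimate V
      (a callable from states to values)."""
      Phi = np.array([features(x) for x in sample_states])
      w = np.zeros(Phi.shape[1])
      for _ in range(iterations):
          V = lambda x: float(np.dot(w, features(x)))
          targets = np.array([backup(x, V) for x in sample_states])
          # least-squares fit of the approximator to the backed-up targets
          w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)
      return w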

