Anytime State-Based Solution Methods

The main drawback of traditional dynamic programming algorithms such as policy iteration [32] is that they explicitly enumerate all states that are reachable from in the entire MDP. There has been interest in other state-based solution methods, which may produce incomplete policies, but only enumerate a fraction of the states that policy iteration requires. Let denote the envelope of policy , that is the set of states that are reachable (with a non-zero probability) from the initial state under the policy. If is defined at all , we say that the policy is complete, and that it is incomplete otherwise. The set of states in at which is undefined is called the fringe of the policy. The fringe states are taken to be absorbing, and their value is heuristic. A common feature of anytime state-based algorithms is that they perform a forward search, starting from and repeatedly expanding the envelope of the current policy one step forward by adding one or more fringe states. When provided with admissible heuristic values for the fringe states, they eventually converge to the optimal policy without necessarily needing to explore the entire state space. In fact, since planning operators are used to compactly represent the state space, they may not even need to construct more than a small subset of the MDP before returning the optimal policy. When interrupted before convergence, they return a possibly incomplete but often useful policy. These methods include the envelope expansion algorithm [17], which deploys policy iteration on judiciously chosen larger and larger envelopes, using each successive policy to seed the calculation of the next. The more recent LAO algorithm [28] which combines dynamic programming with heuristic search can be viewed as a clever implementation of a particular case of the envelope expansion algorithm, where fringe states are given admissible heuristic values, where policy iteration is run up to convergence between envelope expansions, and where the clever implementation only runs policy iteration on the states whose optimal value can actually be affected when a new fringe state is added to the envelope. Another example is a backtracking forward search in the space of (possibly incomplete) policies rooted at [45], which is performed until interrupted, at which point the best policy found so far is returned. Real-time dynamic programming (RTDP) [8] is another popular anytime algorithm which is to MDPs what learning real-time A [36] is to deterministic domains, and which has asymptotic convergence guarantees. The RTDP envelope is made up of sample paths which are visited with a frequency determined by the current greedy policy and the transition probabilities in the domain. RTDP can be run on-line, off-line for a given number of steps or until interrupted. A variant called LRTDP [9] incorporates mechanisms that focus the search on states whose value has not yet converged, resulting in convergence speed up and finite time convergence guarantees. The FLTL translation we are about to present targets these anytime algorithms, although it could also be used with more traditional methods such as value and policy iteration.
Sylvie Thiebaux 2006-01-20