next up previous
Next: Previous Work Related to Up: Related Work Previous: Point-Based DP Update and

Point-Based Value Iteration and Value Function Approximation

Point-based value iteration starts with a set of vectors and generates a sequence of vector sets by repeatedly applying point-based update. The last set can be used to approximate the optimal value function.

Various methods for approximating the optimal value function have been developed previously.gif We will compare them against point-based value iteration along two dimensions: (1) Whether they map one set of vectors to another, that is whether the can be interleaved with standard updates, and (2) if they do, whether they can guarantee convergence when interleaved with standard updates.

Lovejoy (1993) proposes to approximate the optimal value function tex2html_wrap_inline1068 of a POMDP using the optimal value function of the underlying Markov decision process (MDP). The latter is a function over the state space. So tex2html_wrap_inline1068 is being approximated by one vector. Littman et al. (1995b) extend this idea and approximate tex2html_wrap_inline1068 using tex2html_wrap_inline1776 vectors, each of which corresponds to a Q-function of the underlying MDP. A further extension is recently introduced by Zubek and Dietterich (2000). Their idea is to base the approximation not on the underlying MDP, rather on a so-called even-odd POMDP that is identical to the original POMDP except that the state is fully observable at even time steps. Platzman (1980) suggests approximating tex2html_wrap_inline1068 using the value functions of one or more fixed suboptimal policies that are constructed heuristically. Those methods do not start with a set of vectors and hence do not map a set of vectors to another. However, they can easily be adapted to do so. However, they all put a predetermined limit on the number of output vectors. Consequently, convergence is not guaranteed when they are interleaved with standard updates.

Fast informed bound (Hauskrecht 1997a), Q-function curve fitting (Littman et al. 1995b), and softmax curve fitting (Parr and Russell 1995) do map a set of vectors to another. However, they differ drastically from point-based value iteration and from each other in their ways of deriving the next set of vectors from the current one. Regardless of the size of the current set, fast informed bound and Q-function curve fitting always produces tex2html_wrap_inline1776 vectors, one for each action. In softmax curve fitting, the number of vectors is also determined a priori, although it is not necessarily related to the number of actions. Those methods can be interleaved with standard DP updates. Unlike point-based value iteration, they themselves may not converge (Hauskrecht 2000). Even in cases where they do converge themselves, the algorithms resulting from interleaving them with standard updates do not necessarily converge due to the a priori limits on the number of vectors.

Grid-based interpolation/extrapolation methods (Lovejoy 1991, Brafman 1997, Hauskrecht 1997b) approximate value functions by discretizing the belief space using a fixed or variable grid and by maintaining values only for the grid points. Values at non-grid points are estimated by interpolation/extrapolation when needed. Such methods cannot be interleaved with standard DP updates because they do not work with sets of vectors.

There are grid-based methods that work with sets of vectors. Lovejoy's method to lower bound the optimal value function (Lovejoy 1991), for instance, falls into this category. This method is actually identical to point-based value iteration except for the way it derives the next set of vectors from the current one. Instead of using point-based update, it backs up on grid points in a regular grid. Convergence of this method is not guaranteed. The algorithm resulting from interleaving it with standard updates may not converge either.

The incremental linear-function method (Hauskrecht 2000) roughly corresponds to a variation of point-based value iteration that uses non-LP point-based update (Section 7.2) augmented with the Gauss-Seidel asynchronous update. The method does not have access to witness point. So it starts, for the purpose of backup, with extreme points of the belief space and supplement them with projected points. This choice of points appears poor because it leads to a large number of vectors and consequently the backup process is ``usually stopped well before" convergence (Hauskrecht 2000).

next up previous
Next: Previous Work Related to Up: Related Work Previous: Point-Based DP Update and

Dr. Lian Wen Zhang
Thu Feb 15 14:47:09 HKT 2001