As indicated in the introduction, we propose to perform point-based DP update a number of times in between two standard DP updates. To be more specific, we propose to modify value iteration in the way as shown in Figure 3. Note that the only change is at line 5. Instead of assigning directly to , we pass it to a subroutine POINT-BASED-VI and assign the output of the subroutine to . The subroutine functions in the same way as value iteration, except that it performs point-based DP updates rather than standard DP updates. Hence we call it point-based value iteration.
Figure 4 illustrates the basic idea behind modified value iteration in contrast to value iteration. When the initial value function is properly selected, the sequence of value functions produced by value iteration converges monotonically to the optimal value function. Convergence usually takes a long time partially because standard DP updates, indicated by fat upward arrows, are computationally expensive. Modified value iteration interleaves standard DP updates with point-based DP updates, which are indicated by the thin upward arrows. Point-based DP update does not improve a value function as much as standard DP update. However, its complexity is much lower. As a consequence, modified value iteration can hopefully converge in less time.
The idea of interleaving standard DP updates with approximate updates that back up at a finite number of belief points is due to Cheng (1988). Our work differs from Cheng's method mainly in the way we select the belief points. A detailed discussion of the differences will be given in Section 8.
Figure 4: Illustration of the Basic Idea behind Modified Value Iteration.
The modified value iteration algorithm raises three issues. First, what stopping criterion do we use for point-based value iteration? Second, how can we guarantee the stopping criterion can eventually be satisfied? Third, how do we guarantee the convergence of the modified value iteration algorithm itself? To address those issues, we introduce the concept of uniformly improvable value functions.