The basic idea of our modified value iteration algorithm VI1 is to insert, between two consecutive standard updates, operations that are inexpensive. The hope is that these operations can significantly improve the quality of a vector set and hence reduce the number of standard updates.
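Schematically, the interleaving scheme can be sketched as follows. The function names `standard_update`, `cheap_improvement`, and `converged` are hypothetical placeholders for the expensive dynamic-programming update, the inexpensive in-between operations, and a stopping test; they are not routines from this paper.

```python
# Hypothetical skeleton of the interleaving scheme described above.
# All three callables are placeholders, not the paper's actual routines.

def value_iteration_with_interleaving(vectors, standard_update,
                                      cheap_improvement, converged):
    """Alternate expensive standard updates with inexpensive
    improvement steps applied to the current vector set."""
    while True:
        vectors = standard_update(vectors)    # expensive DP update
        if converged(vectors):
            return vectors
        vectors = cheap_improvement(vectors)  # cheap in-between operation
```

The different algorithms discussed below all fit this skeleton; they differ only in what `cheap_improvement` does.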
Several previous algorithms work in the same fashion. The differences lie in the operations that are inserted between standard updates. The reward revision algorithm (White et al. 1989) constructs, at each iteration, a second POMDP based on the current set of vectors. It runs value iteration on the second POMDP for a predetermined number of steps. The output is used to modify the current set of vectors and the resulting set of vectors is fed to the next standard update.
Why is reward revision expected to speed up value iteration? Let V be the value function represented by the current set of vectors. The second POMDP is constructed in such a way that it shares the same optimal value function as the original POMDP if V is optimal. As such, one would expect the two POMDPs to have similar optimal value functions if V is close to optimal. Consequently, running value iteration on the second POMDP should improve the current value function. Moreover, doing so is inexpensive because the second POMDP is fully observable.
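To see why the inner iterations are cheap, note that value iteration on a fully observable model manipulates a single value vector rather than sets of vectors: each sweep is just a maximization over actions of r + γPV. A minimal sketch on an arbitrary toy two-state, two-action MDP (the numbers are illustrative, not taken from the paper):

```python
import numpy as np

# Value iteration on a fully observable toy MDP: one array of |S| values
# per sweep, no vector sets and no linear programs.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],   # P[a, s, s'] transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[1.0, 0.0],                 # r[a, s] immediate rewards
              [0.0, 2.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Q[a, s] = r[a, s] + gamma * sum_s' P[a, s, s'] * V[s']
    Q = r + gamma * (P @ V)
    V_new = Q.max(axis=0)
    done = np.max(np.abs(V_new - V)) < 1e-8
    V = V_new
    if done:
        break
```

Each sweep costs O(|A||S|^2) arithmetic operations, in contrast to the standard POMDP update, whose cost grows with the number of vectors.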
Reward revision is conceptually much more complex than VI1 and seems to be less efficient. According to White et al. (1989), reward revision can, on average, reduce the number of standard updates by 80% and computational time by 85%. From Tables 1 and 2, we see that the reductions VI1 achieves over VI are much larger.
The iterative discretization procedure (IDP) proposed by Cheng (1988) is very similar to VI1. There are two main differences. First, while VI1 uses point-based update, IDP uses non-LP point-based update. Second, while point-based update in VI1 backs up on witness points and belief points found by solving linear programs, non-LP point-based update in IDP backs up on extreme points of witness regions, which are obtained as by-products of Cheng's linear-support or relaxed region algorithms.
Cheng conducted extensive experiments to determine the effectiveness of IDP in accelerating value iteration. He found that IDP can cut the number of standard updates by as much as 55% and the amount of time by as much as 80%. Those reductions are much less significant than the ones presented in Tables 1 and 2.
Hansen's policy iteration (PI) algorithm maintains a policy in the form of a finite-state controller. Each node in the controller represents a vector. At each iteration, a standard update is performed on the set of vectors represented in the current policy. The resulting set of vectors is used to improve the current policy and the improved policy is evaluated by solving a system of linear equations. This gives rise to a third set of vectors, which is fed to the next standard update.
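The evaluation step can be sketched as solving one linear system over all (node, state) pairs: each node n fixes an action `act[n]` and a successor node `succ[n, z]` for each observation z, and its vector satisfies a fixed-point equation. The representation below (`act`, `succ`, and the model arrays) is an illustrative assumption, not Hansen's implementation.

```python
import numpy as np

# Policy evaluation for a finite-state controller: solve the linear system
#   alpha_n(s) = R[act[n], s]
#     + gamma * sum_{z, s'} T[act[n], s, s'] * O[act[n], s', z] * alpha_{succ[n, z]}(s')
# for all nodes n and states s simultaneously.

def evaluate_controller(act, succ, T, O, R, gamma):
    """Return an (N, S) array: one vector per controller node."""
    N = len(act)
    S = T.shape[1]
    Z = O.shape[2]
    A_mat = np.eye(N * S)
    b_vec = np.zeros(N * S)
    for n in range(N):
        a = act[n]
        b_vec[n * S:(n + 1) * S] = R[a]
        for z in range(Z):
            m = succ[n, z]
            # Coefficient of alpha_m(s') in the equation for alpha_n(s);
            # '-=' accumulates when several z lead to the same node m.
            A_mat[n * S:(n + 1) * S, m * S:(m + 1) * S] -= gamma * T[a] * O[a, :, z]
    return np.linalg.solve(A_mat, b_vec).reshape(N, S)
```

The system has N·|S| unknowns, so evaluation is a single direct solve rather than an iterative dynamic-programming computation.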
We compared the performance of Hansen's PI algorithm to VI1. Table 6 shows, for each test problem, the number of standard updates and the amount of time the algorithm took. Comparing these with the statistics for VI1 (Table 4), we see that PI performed more standard updates than VI1. This indicates that policy improvement/evaluation is less effective than point-based value iteration at cutting down the number of standard updates. In terms of time, PI was more efficient than VI1 on the first three problems but significantly less efficient on all the other problems.
It might be possible to combine VI1 and PI. To be more specific, one can probably insert a policy improvement/evaluation step between two point-based updates in point-based value iteration (Figure 2). This should accelerate point-based value iteration and hence VI1. This possibility and its benefits are yet to be investigated.