The basic idea of our modified value iteration algorithm VI1 is to insert, between two consecutive standard updates, operations that are inexpensive. The hope is that these operations can significantly improve the quality of a vector set and hence reduce the number of standard updates.
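This interleaving scheme can be sketched generically as follows. The callables `standard_update`, `cheap_improvements`, and `converged` are hypothetical placeholders for the POMDP-specific operations, not part of VI1 itself:

```python
def modified_value_iteration(V, standard_update, cheap_improvements,
                             converged, max_iters=1000):
    """Interleave inexpensive improvement steps between standard updates.

    V                  : initial value-function representation
    standard_update    : the expensive exact DP update
    cheap_improvements : the inexpensive operations inserted in between
    converged          : stopping test on two consecutive iterates
    All four arguments are hypothetical stand-ins for illustration.
    """
    for _ in range(max_iters):
        V_new = standard_update(V)       # expensive exact update
        if converged(V, V_new):
            return V_new
        V = cheap_improvements(V_new)    # cheap steps before the next update
    return V
```

The scheme pays off whenever the cheap steps move the value function far enough that fewer expensive updates are needed overall.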

Several previous algorithms work in the same fashion.
The differences lie in the operations that are inserted between standard updates.
The reward revision algorithm (White *et al.* 1989)
constructs, at each iteration, a second POMDP based
on the current set of vectors.
It runs value iteration on the second
POMDP for a predetermined number of steps. The output
is used to modify the current set of vectors and the resulting set of
vectors is fed to the next standard update.

Why is reward revision expected to speed up value iteration?
Let *V* be the value function represented by the current set of
vectors.
The second POMDP is constructed in such a way that it
shares the same optimal value function as the original POMDP if
*V* is optimal. As such, one would expect
the two POMDPs to have similar optimal value functions
if *V* is close to optimal. Consequently, running value iteration
on the second POMDP should improve the current value function.
And it is inexpensive to do so because the second POMDP is fully observable.
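To illustrate why this step is cheap: on a fully observable model, value iteration is ordinary MDP value iteration, which costs only a polynomial amount of work per sweep rather than the cost of an exact POMDP update. A minimal sketch, in which the array layout `R[s][a]`, `P[s][a][s']` is an assumption made for illustration:

```python
def mdp_value_iteration(R, P, gamma, tol=1e-8, max_iters=10000):
    """Value iteration for a fully observable MDP.

    R[s][a]     : immediate reward for action a in state s
    P[s][a][t]  : probability of moving from state s to state t under a
    (This layout is assumed for illustration only.)
    """
    n = len(R)
    V = [0.0] * n
    for _ in range(max_iters):
        # Bellman update: V(s) = max_a [ r(s,a) + gamma * sum_t p(t|s,a) V(t) ]
        V_new = [
            max(R[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
                for a in range(len(R[s])))
            for s in range(n)
        ]
        if max(abs(V_new[s] - V[s]) for s in range(n)) < tol:
            return V_new
        V = V_new
    return V
```

Each sweep touches only the (finitely many) states, which is why a predetermined number of such steps adds little to the cost of an iteration.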

Reward revision is conceptually much more complex than
VI1 and seems to be less efficient. According to
White *et al.* (1989), reward revision can, on average, reduce the number
of standard updates by 80% and computational time by 85%.
From Tables 1 and 2, we see that the differences between VI1 and VI
are much larger.

The iterative discretization procedure (IDP) proposed by Cheng (1988) is very similar to VI1. There are two main differences. While VI1 uses point-based update, IDP uses non-LP point-based update. While point-based update in VI1 backs up on witness points and belief points found by linear programs, non-LP point-based update in IDP backs up on extreme points of witness regions, found as by-products of Cheng's linear-support or relaxed region algorithms.
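In both algorithms, the core operation is the same: a DP backup restricted to a single belief point; only the origin of the points differs. A sketch of one such backup, where the model layout `R[a][s]`, `P[a][s][s']`, `O[a][s'][z]` is a hypothetical encoding chosen for illustration:

```python
def point_based_backup(b, Gamma, R, P, O, gamma):
    """One DP backup at a single belief point b.

    b          : belief state, b[s]
    Gamma      : current set of vectors (each a list indexed by state)
    R[a][s]    : immediate reward        (hypothetical layout)
    P[a][s][t] : transition probability  (hypothetical layout)
    O[a][t][z] : observation probability (hypothetical layout)
    Returns the vector that is optimal at b after one update.
    """
    S, A, Z = len(b), len(R), len(O[0][0])
    best, best_val = None, float("-inf")
    for a in range(A):
        vec = list(R[a])
        for z in range(Z):
            # Back-project each candidate vector through (a, z).
            def g(alpha):
                return [sum(P[a][s][t] * O[a][t][z] * alpha[t]
                            for t in range(S)) for s in range(S)]
            # Keep the candidate with the largest value at b.
            gz = max((g(alpha) for alpha in Gamma),
                     key=lambda v: sum(b[s] * v[s] for s in range(S)))
            vec = [vec[s] + gamma * gz[s] for s in range(S)]
        val = sum(b[s] * vec[s] for s in range(S))
        if val > best_val:
            best, best_val = vec, val
    return best
```

Because the maximizations are over a finite vector set and a finite point set, a sweep of such backups is cheap relative to a standard update.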

Cheng has conducted extensive experiments to determine the effectiveness of IDP in accelerating value iteration. It was found that IDP can cut the number of standard updates by as much as 55% and the amount of time by as much as 80%. These reductions are much less significant than those presented in Tables 1 and 2.

Hansen's policy iteration (PI) algorithm maintains a policy in the form of a finite-state controller. Each node in the controller represents a vector. At each iteration, a standard update is performed on the set of vectors represented in the current policy. The resulting set of vectors is used to improve the current policy and the improved policy is evaluated by solving a system of linear equations. This gives rise to a third set of vectors, which is fed to the next standard update.
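The evaluation step solves, for a controller with nodes n (each carrying an action a_n and an observation-to-successor map η), the linear system V(n,s) = r(s,a_n) + γ Σ_{s',z} p(s'|s,a_n) p(z|s',a_n) V(η(n,z),s') in |N|·|S| unknowns. A minimal sketch, in which the encoding of `nodes` and the model layout are hypothetical:

```python
def gauss_solve(A, c):
    """Solve A x = c by Gaussian elimination with partial pivoting."""
    n = len(c)
    M = [row[:] + [c[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for k in range(col, n + 1):
                M[r][k] -= f * M[col][k]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (M[i][n] - sum(M[i][k] * x[k]
                              for k in range(i + 1, n))) / M[i][i]
    return x

def evaluate_controller(nodes, R, P, O, gamma):
    """Evaluate a finite-state controller by solving its linear system.

    nodes      : list of (action, eta) pairs; eta[z] is the successor node
                 after observation z (hypothetical encoding)
    R[a][s], P[a][s][t], O[a][t][z] : model components (hypothetical layout)
    Returns V[n], the vector represented by controller node n.
    """
    S, N = len(P[0]), len(nodes)
    dim = N * S
    A = [[0.0] * dim for _ in range(dim)]
    c = [0.0] * dim
    for n, (a, eta) in enumerate(nodes):
        for s in range(S):
            i = n * S + s
            A[i][i] += 1.0
            c[i] = R[a][s]
            for t in range(S):
                for z in range(len(eta)):
                    A[i][eta[z] * S + t] -= gamma * P[a][s][t] * O[a][t][z]
    x = gauss_solve(A, c)
    return [x[n * S:(n + 1) * S] for n in range(N)]
```

Solving this system exactly is what makes each policy iteration contract the value function more than a single standard update would.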

We compared the performance of Hansen's PI algorithm to VI1. Table 6 shows, for each test problem, the number of standard updates and the amount of time the algorithm took. Compared with the statistics for VI1 (Table 4), we see that PI performed more standard updates than VI1. This indicates that policy improvement/evaluation is less effective than point-based value iteration in cutting down the number of standard updates. In terms of time, PI is more efficient than VI1 on the first three problems but significantly less efficient on all other problems.

Table 6: Number of standard updates and time taken by PI on each test problem.

| | 4x3CO | Cheese | 4x4 | Paint | Tiger | Shuttle | Network | Aircraft |
|---|---|---|---|---|---|---|---|---|
| Standard updates | 3 | 7 | 7 | 10 | 14 | 9 | 18 | 9 |
| Time | .14 | .87 | 3.4 | 3.8 | 4.5 | 60 | 1,109 | 66,964 |

It might be possible to combine VI1 and PI. To be more specific, one can probably insert a policy improvement/evaluation step between two point-based updates in point-based value iteration (Figure 2). This should accelerate point-based value iteration and hence VI1. This possibility and its benefits are yet to be investigated.
