Computing POMDP policies

The Exponential-family Principal Components Analysis model gives us a way to find a low-dimensional representation of the beliefs that occur in any particular problem. For the two real-world navigation problems we have tried, the algorithm proved to be effective at finding very low-dimensional representations, showing reductions from states and states down to 5 or 6 bases. A 5 or 6 dimensional belief space will allow much more tractable computation of the value function, and so we will be able to solve much larger POMDPs than we could have solved previously.

Unfortunately, we can no longer use conventional POMDP value iteration to find the optimal policy given the low-dimensional set of belief space features. POMDP value iteration depends on the fact that the value function is convex over the belief space. When we compute a non-linear transformation of our beliefs to recover their coordinates on the low-dimensional belief surface, we lose the convexity of the value function (compare Figure 3 and Figure 6 to see why). As a result, the value function cannot be expressed as the supremum of a set of hyperplanes over the low-dimensional belief space.

So, instead of using POMDP value iteration, we will build a low-dimensional
discrete belief space MDP and use MDP value iteration. Since we do not know
the form of the value function, we will turn to function approximation.
Gordon [Gor95] proved that the *fitted value iteration* algorithm
is guaranteed to find a bounded-error approximation to a (possibly discounted)
MDP's value function, so long as we use it in combination with a function
approximator that is an *averager*. Averagers are function approximators
which are non-expansions in max-norm; that is, they do not exaggerate errors
in their training data. In our experiments below, we use regular grids as well
as irregular, variable-resolution grids based on -nearest-neighbour
discretization, represented by a set of low-dimensional beliefs
,

(29) |

Both of these approximations are averagers; other averagers include linear interpolation, -nearest-neighbours, and local weighted averaging. We will not focus in detail on the exact mechanism for discretizing the low-dimensional space, as this is outside the scope of this paper. The resolution of the regular grid in all cases was chosen empirically; in section 7 we describe a specific variable resolution discretization scheme that worked well empirically. The reader can consult Munos & Moore [MM02] or Zhou & Hansen [ZH01] for more sophisticated representations.

The fitted value iteration algorithm uses the following update rule to compute a -step lookahead value function from a -step lookahead value function :

Here and are approximate reward and transition functions based on the dynamics of our POMDP, the result of our E-PCA, and the finite set of low-dimensional belief samples that we are using as our function approximator. Note that in all problems described in this paper, the problem did not require discounting ( ). The following sections describe how to compute the model parameters and .

The original reward function represents the immediate reward
of taking action at state . We cannot know, given either
a low-dimensional or high-dimensional belief, what the immediate
reward *will* be, but we can compute the *expected*
reward. We therefore represent the reward as the expected value of the
immediate reward of the full model, under the current belief:

Equation (32) requires us to recover the high-dimensional belief from the low-dimensional representation , as shown in equation (28).

For many problems, the reward function will have the effect of giving a low immediate reward for belief states with high entropy. That is, for many problems the planner will be driven towards beliefs that are centred on high-reward states and have low uncertainty. This property is intuitively desirable: in such beliefs the robot does not have to worry about an immediate bad outcome.

Computing the low-dimensional transition function
is not as simple as computing the
low-dimensional reward function
: we need to
consider pairs of low-dimensional beliefs,
and
. In the original high-dimensional belief space, the
transition from a prior belief to a posterior belief is
described by the Bayes filter equation:

Here is the action we selected and is the observation we saw; is the original POMDP transition probability distribution, and is the original POMDP observation probability distribution.

Equation (33) describes a deterministic transition conditioned upon a
prior belief, an action and an observation. The transition to the
posterior is stochastic when the observation is not known; that
is, the transition from to occurs only when a specific
is generated, and the probability of this transition is the
probability of generating observation . So, we can separate the
full transition process into a deterministic transition to , the
belief after acting but before sensing, and a stochastic transition to
, the full posterior:

Equations 34 and 35 describe the transitions of the high-dimensional beliefs for the original POMDP. Based on these high-dimensional transitions, we can compute the transitions in our low-dimensional approximate belief space MDP. Figure 14 depicts the process.

As the figure shows, we start with a low-dimensional belief . From we reconstruct a high-dimensional belief according to equation (28). Then we apply an action and an observation as described in equation (34) and equation (35) to find the new belief . Once we have we can compress it to a low-dimensional representation by iterating equation (26). Finally, since may not be a member of our sample of low-dimensional belief states, we map to a nearby according to our function approximator.If our function approximator is a grid, the last step above means replacing by a prototypical which shares its grid cell. More generally, our function approximator may represent as a combination of several states, putting weight on each . (For example, if our approximator is -nearest-neighbour, for each of the closest samples in .) In this case we replace the transition from to with several transitions, each from to some , and scale the probability of each one by .

For each transition we can assign a probability

The total transition probability is the sum, over all observations , of . Step 3 in Table 2 performs this computation, but shares work between the computation of for different posterior beliefs which are reachable from the same prior belief under action .

With the reward and transition functions computed in the previous sections, we can use value iteration to compute the value function for our belief space MDP. The full algorithm is given in Table 2.

- Generate the discrete low-dimensional belief space using E-PCA (cf. Table 1)
- Compute the low-dimensional reward function :
- For each
- Recover from
- Compute .

- Compute the low-dimensional transition function :
- For each
- Compute the value function for
- t = 0
- For each
- do
- change = 0
- For each :
- change = change +
- while change