The Exponential-family Principal Components Analysis model gives us a way to find a low-dimensional representation of the beliefs that occur in any particular problem. For the two real-world navigation problems we have tried, the algorithm proved to be effective at finding very low-dimensional representations, showing reductions from states and states down to 5 or 6 bases. A 5 or 6 dimensional belief space will allow much more tractable computation of the value function, and so we will be able to solve much larger POMDPs than we could have solved previously.
Unfortunately, we can no longer use conventional POMDP value iteration to find the optimal policy given the low-dimensional set of belief space features. POMDP value iteration depends on the fact that the value function is convex over the belief space. When we compute a non-linear transformation of our beliefs to recover their coordinates on the low-dimensional belief surface, we lose the convexity of the value function (compare Figure 3 and Figure 6 to see why). As a result, the value function cannot be expressed as the supremum of a set of hyperplanes over the low-dimensional belief space.
So, instead of using POMDP value iteration, we will build a low-dimensional discrete belief space MDP and use MDP value iteration. Since we do not know the form of the value function, we will turn to function approximation. Gordon [Gor95] proved that the fitted value iteration algorithm is guaranteed to find a bounded-error approximation to a (possibly discounted) MDP's value function, so long as we use it in combination with a function approximator that is an averager. Averagers are function approximators which are non-expansions in max-norm; that is, they do not exaggerate errors in their training data. In our experiments below, we use regular grids as well as irregular, variable-resolution grids based on -nearest-neighbour discretization, represented by a set of low-dimensional beliefs ,
The fitted value iteration algorithm uses the following update rule to compute a -step lookahead value function from a -step lookahead value function :
The original reward function represents the immediate reward
of taking action at state . We cannot know, given either
a low-dimensional or high-dimensional belief, what the immediate
reward will be, but we can compute the expected
reward. We therefore represent the reward as the expected value of the
immediate reward of the full model, under the current belief:
For many problems, the reward function will have the effect of giving a low immediate reward for belief states with high entropy. That is, for many problems the planner will be driven towards beliefs that are centred on high-reward states and have low uncertainty. This property is intuitively desirable: in such beliefs the robot does not have to worry about an immediate bad outcome.
Computing the low-dimensional transition function
is not as simple as computing the
low-dimensional reward function
: we need to
consider pairs of low-dimensional beliefs,
. In the original high-dimensional belief space, the
transition from a prior belief to a posterior belief is
described by the Bayes filter equation:
Equation (33) describes a deterministic transition conditioned upon a
prior belief, an action and an observation. The transition to the
posterior is stochastic when the observation is not known; that
is, the transition from to occurs only when a specific
is generated, and the probability of this transition is the
probability of generating observation . So, we can separate the
full transition process into a deterministic transition to , the
belief after acting but before sensing, and a stochastic transition to
, the full posterior:
Equations 34 and 35 describe the transitions of the high-dimensional beliefs for the original POMDP. Based on these high-dimensional transitions, we can compute the transitions in our low-dimensional approximate belief space MDP. Figure 14 depicts the process.
If our function approximator is a grid, the last step above means replacing by a prototypical which shares its grid cell. More generally, our function approximator may represent as a combination of several states, putting weight on each . (For example, if our approximator is -nearest-neighbour, for each of the closest samples in .) In this case we replace the transition from to with several transitions, each from to some , and scale the probability of each one by .
For each transition we can assign a probability
With the reward and transition functions computed in the previous sections, we can use value iteration to compute the value function for our belief space MDP. The full algorithm is given in Table 2.