Next: Solving Large POMDPs Up: Finding Approximate POMDP Solutions Previous: E-PCA Performance


Computing POMDP policies

The Exponential-family Principal Components Analysis model gives us a way to find a low-dimensional representation of the beliefs that occur in any particular problem. For the two real-world navigation problems we have tried, the algorithm proved effective at finding very low-dimensional representations, reducing problems of $ \approx 800$ states and $ \approx 2,000$ states down to 5 or 6 bases. A 5- or 6-dimensional belief space allows much more tractable computation of the value function, and so we will be able to solve much larger POMDPs than we could have solved previously.

Unfortunately, we can no longer use conventional POMDP value iteration to find the optimal policy given the low-dimensional set of belief space features. POMDP value iteration depends on the fact that the value function is convex over the belief space. When we compute a non-linear transformation of our beliefs to recover their coordinates on the low-dimensional belief surface, we lose the convexity of the value function (compare Figure 3 and Figure 6 to see why). As a result, the value function cannot be expressed as the supremum of a set of hyperplanes over the low-dimensional belief space.

So, instead of using POMDP value iteration, we will build a low-dimensional discrete belief space MDP and use MDP value iteration. Since we do not know the form of the value function, we will turn to function approximation. Gordon [Gor95] proved that the fitted value iteration algorithm is guaranteed to find a bounded-error approximation to a (possibly discounted) MDP's value function, so long as we use it in combination with a function approximator that is an averager. Averagers are function approximators which are non-expansions in max-norm; that is, they do not exaggerate errors in their training data. In our experiments below, we use regular grids as well as irregular, variable-resolution grids based on $ 1$-nearest-neighbour discretization, represented by a set of low-dimensional beliefs $ \tilde{B}^*$,

$\displaystyle \tilde{B}^* = \{ \tilde{b}^*_1, \tilde{b}^*_2, \ldots, \tilde{b}^*_{\vert\tilde{B}^*\vert}\}.$ (29)

Both of these approximations are averagers; other averagers include linear interpolation, $ k$-nearest-neighbours, and local weighted averaging. We will not focus in detail on the exact mechanism for discretizing the low-dimensional space, as this is outside the scope of this paper. The resolution of the regular grid in all cases was chosen empirically; in section 7 we describe a specific variable resolution discretization scheme that worked well empirically. The reader can consult Munos & Moore [MM02] or Zhou & Hansen [ZH01] for more sophisticated representations.
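As a concrete illustration, the $ 1$-nearest-neighbour discretization can be written as an averager that places all of its weight on the closest sample in $ \tilde{B}^*$. The following NumPy sketch is our own illustration; the function name and array layout are assumptions, not part of the paper's implementation:

```python
import numpy as np

def nearest_neighbour_weights(b_tilde, B_star):
    """Averager weights for a 1-nearest-neighbour discretization.

    b_tilde: query belief coordinates, shape (d,)
    B_star:  the sampled beliefs in B~*, shape (n, d)

    Returns a weight vector that sums to 1 with all mass on the
    closest sample; because the output is a fixed convex combination
    of training values, the approximator is a max-norm non-expansion.
    """
    distances = np.linalg.norm(B_star - b_tilde, axis=1)
    weights = np.zeros(len(B_star))
    weights[np.argmin(distances)] = 1.0
    return weights
```

Replacing the `argmin` with the $ k$ smallest distances, each at weight $ 1/k$, gives the $ k$-nearest-neighbours averager mentioned above.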

The fitted value iteration algorithm uses the following update rule to compute a $ t$-step lookahead value function $ V^t$ from a $ (t-1)$-step lookahead value function $ V^{t-1}$:

$\displaystyle V^t(\tilde{b}^*_i) = \max_a \left ( \tilde{R}^*(\tilde{b}^*_i, a) + \gamma \sum^{\vert\tilde{B}^*\vert}_{j=1} \tilde{T}^*(\tilde{b}^*_i, a, \tilde{b}^*_j) \cdot V^{t-1}(\tilde{b}^*_j) \right )$ (30)

Here $ \tilde{R}^*$ and $ \tilde{T}^*$ are approximate reward and transition functions based on the dynamics of our POMDP, the result of our E-PCA, and the finite set of low-dimensional belief samples $ \tilde{B}^*$ that we are using as our function approximator. Note that none of the problems described in this paper required discounting ( $ \gamma = 1$). The following sections describe how to compute the model parameters $ \tilde{R}^*$ and $ \tilde{T}^*$.
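A single sweep of the update in equation (30) over all sampled beliefs can be sketched as below, assuming $ \tilde{R}^*$ and $ \tilde{T}^*$ have already been tabulated as arrays (the names and array shapes here are our own illustrative choices):

```python
import numpy as np

def fitted_backup(V_prev, R, T, gamma=1.0):
    """One sweep of equation (30) over all belief samples.

    V_prev: V^{t-1}, shape (n,)
    R:      R~*(b*_i, a),        shape (n, num_actions)
    T:      T~*(b*_i, a, b*_j),  shape (num_actions, n, n)

    Returns V^t, shape (n,).
    """
    # Q[i, a] = R[i, a] + gamma * sum_j T[a, i, j] * V_prev[j]
    Q = R + gamma * np.einsum('aij,j->ia', T, V_prev)
    return Q.max(axis=1)
```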

Computing the Reward Function

The original reward function $ R(s, a)$ represents the immediate reward of taking action $ a$ at state $ s$. We cannot know, given either a low-dimensional or high-dimensional belief, what the immediate reward will be, but we can compute the expected reward. We therefore represent the reward as the expected value of the immediate reward of the full model, under the current belief:

$\displaystyle \tilde{R}^*(\tilde{b}^*, a)$ $\displaystyle =$ $\displaystyle {\rm E}_{b} (R(s, a))$ (31)
  $\displaystyle =$ $\displaystyle \sum^{\vert\mathcal{S}\vert}_{i=1}R(s_i, a)\, b(s_i).$ (32)

Equation (32) requires us to recover the high-dimensional belief $ b$ from the low-dimensional representation $ \tilde{b}^*$, as shown in equation (28).

For many problems, the reward function $ \tilde{R}^*$ will have the effect of giving a low immediate reward for belief states with high entropy. That is, for many problems the planner will be driven towards beliefs that are centred on high-reward states and have low uncertainty. This property is intuitively desirable: in such beliefs the robot does not have to worry about an immediate bad outcome.
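In code, equation (32) is a single dot product once the high-dimensional belief has been recovered; this sketch assumes the reward table is stored as an $ \vert\mathcal{S}\vert \times \vert\mathcal{A}\vert$ array:

```python
import numpy as np

def low_dim_reward(b, R):
    """Equation (32): expected immediate reward under belief b.

    b: reconstructed high-dimensional belief, shape (num_states,)
    R: reward table R(s, a), shape (num_states, num_actions)

    Returns R~*(b*, a) for every action a, shape (num_actions,).
    """
    return b @ R
```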

Computing the Transition Function

Computing the low-dimensional transition function $ \tilde{T}^* = p(\tilde{b}^*_j \vert a, \tilde{b}^*_i)$ is not as simple as computing the low-dimensional reward function $ \tilde{R}^*$: we need to consider pairs of low-dimensional beliefs, $ \tilde{b}^*_i$ and $ \tilde{b}^*_j$. In the original high-dimensional belief space, the transition from a prior belief $ b_i$ to a posterior belief $ b_j$ is described by the Bayes filter equation:

$\displaystyle b_j(s) = \alpha\, O(s, a, z) \sum^{\vert\mathcal{S}\vert}_{k=1} T(s_k, a, s)\, b_i(s_k)$ (33)

Here $ a$ is the action we selected and $ z$ is the observation we saw; $ T$ is the original POMDP transition probability distribution, and $ O$ is the original POMDP observation probability distribution.

Equation (33) describes a deterministic transition conditioned upon a prior belief, an action and an observation. The transition to the posterior $ b_j$ is stochastic when the observation is not known; that is, the transition from $ b_i$ to $ b_j$ occurs only when a specific $ z$ is generated, and the probability of this transition is the probability of generating observation $ z$. So, we can separate the full transition process into a deterministic transition to $ b_a$, the belief after acting but before sensing, and a stochastic transition to $ b_j$, the full posterior:

$\displaystyle b_a(s)$ $\displaystyle =$ $\displaystyle \sum^{\vert\mathcal{S}\vert}_{j=1} T(s_j, a, s) b_i(s_j)$ (34)
$\displaystyle b_j(s)$ $\displaystyle =$ $\displaystyle \alpha   O(s, a, z) b_a(s).$ (35)

Equations (34) and (35) describe the transitions of the high-dimensional beliefs for the original POMDP. Based on these high-dimensional transitions, we can compute the transitions in our low-dimensional approximate belief space MDP. Figure 14 depicts the process.
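The two-stage update of equations (34) and (35) can be sketched as follows; the array layouts ($ T[a, k, s] = p(s \vert s_k, a)$ and $ O[a, s, z] = p(z \vert s, a)$) are assumptions of this illustration:

```python
import numpy as np

def belief_update(b, T, O, a, z):
    """Bayes filter split into equations (34) and (35).

    b: prior belief b_i, shape (num_states,)
    T: T[a, k, s] = p(s | s_k, a), shape (num_actions, num_states, num_states)
    O: O[a, s, z] = p(z | s, a),   shape (num_actions, num_states, num_obs)

    Returns the posterior belief b_j and the observation likelihood
    p(z | b_a), whose reciprocal is the normalizer alpha.
    """
    b_a = b @ T[a]                   # equation (34): after acting
    unnormalized = O[a][:, z] * b_a  # equation (35), before normalizing
    p_z = unnormalized.sum()         # p(z | b_a)
    return unnormalized / p_z, p_z
```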

Figure 14: The process of computing a single transition probability.
As the figure shows, we start with a low-dimensional belief $ \tilde{b}^*_i$. From $ \tilde{b}^*_i$ we reconstruct a high-dimensional belief $ b$ according to equation (28). Then we apply an action $ a$ and an observation $ z$ as described in equation (34) and equation (35) to find the new belief $ b'$. Once we have $ b'$ we can compress it to a low-dimensional representation $ \tilde{b}'$ by iterating equation (26). Finally, since $ \tilde{b}'$ may not be a member of our sample $ \tilde{B}^*$ of low-dimensional belief states, we map $ \tilde{b}'$ to a nearby $ \tilde{b}^*_j \in \tilde{B}^*$ according to our function approximator.

If our function approximator is a grid, the last step above means replacing $ \tilde{b}'$ by a prototypical $ \tilde{b}^*_j$ which shares its grid cell. More generally, our function approximator may represent $ \tilde{b}'$ as a combination of several states, putting weight $ w(\tilde{b}^*_j, \tilde{b}')$ on each $ \tilde{b}^*_j$. (For example, if our approximator is $ k$-nearest-neighbour, $ w(\tilde{b}^*_j, \tilde{b}')=\frac{1}{k}$ for each of the closest $ k$ samples in $ \tilde{B}^*$.) In this case we replace the transition from $ \tilde{b}^*_i$ to $ \tilde{b}'$ with several transitions, each from $ \tilde{b}^*_i$ to some $ \tilde{b}^*_j$, and scale the probability of each one by $ w(\tilde{b}^*_j, \tilde{b}')$.

For each transition $ \tilde{b}^*_i\rightarrow b\rightarrow b_a \rightarrow b' \rightarrow \tilde{b}' \rightarrow \tilde{b}^*_j$ we can assign a probability

$\displaystyle p(z,j\vert i,a) = p(z\vert b_a)\, w(\tilde{b}^*_j, \tilde{b}') = w(\tilde{b}^*_j, \tilde{b}')\sum^{\vert\mathcal{S}\vert}_{l=1} p(z \vert s_l)\, b_a(s_l)$ (36)

The total transition probability $ \tilde{T}^*(\tilde{b}^*_i, a, \tilde{b}^*_j)$ is the sum, over all observations $ z$, of $ p(z,j\vert i,a)$. Step 3 in Table 2 performs this computation, but shares work between the computation of $ \tilde{T}^*(\tilde{b}^*_i, a, \tilde{b}^*_j)$ for different posterior beliefs $ \tilde{b}^*_j$ which are reachable from the same prior belief $ \tilde{b}^*_i$ under action $ a$.
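Putting the pieces together, one row of $ \tilde{T}^*$ (a fixed $ \tilde{b}^*_i$ and $ a$, all $ \tilde{b}^*_j$) can be sketched as below. The `reconstruct`, `compress`, and `weights` arguments stand in for equations (28) and (26) and the function approximator respectively; they are assumed interfaces for this illustration, not the paper's code:

```python
import numpy as np

def transition_row(b_tilde_i, a, B_star, reconstruct, compress, T, O, weights):
    """T~*(b*_i, a, .): sum of p(z, j | i, a) over all observations z.

    reconstruct: b~* -> high-dimensional belief b      (equation 28)
    compress:    b   -> low-dimensional coordinates b~ (equation 26)
    weights:     (b~, B_star) -> averager weights w
    T: T[a, k, s] = p(s | s_k, a); O: O[a, s, z] = p(z | s, a)
    """
    row = np.zeros(len(B_star))          # T~*(b*_i, a, .)
    b = reconstruct(b_tilde_i)
    b_a = b @ T[a]                       # equation (34)
    for z in range(O.shape[2]):
        unnormalized = O[a][:, z] * b_a
        p_z = unnormalized.sum()         # p(z | b_a)
        if p_z == 0.0:
            continue                     # observation cannot occur
        b_next = unnormalized / p_z      # equation (35)
        w = weights(compress(b_next), B_star)
        row += p_z * w                   # equation (36), summed over z
    return row
```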

Computing the Value Function

With the reward and transition functions computed in the previous sections, we can use value iteration to compute the value function for our belief space MDP. The full algorithm is given in Table 2.

  1. Generate the discrete low-dimensional belief space $ \tilde{B}^*$ using E-PCA (cf. Table 1)
  2. Compute the low-dimensional reward function $ \tilde{R}^*$:
  3. For each $ \tilde{b}^* \in \tilde{B}^*, a \in \mathcal{A}$
    1. Recover $ b$ from $ \tilde{b}^*$
    2. Compute $ \tilde{R}^*(\tilde{b}^*, a) = \sum^{\vert\mathcal{S}\vert}_{i=1}R(s_i, a) b(s_i)$.
  4. Compute the low-dimensional transition function $ \tilde{T}^*$:
  5. For each $ \tilde{b}^*_i \in \tilde{B}^*, a \in \mathcal{A}$
    1. For each $ \tilde{b}^*_j$: $ \tilde{T}^*(\tilde{b}^*_i, a, \tilde{b}^*_j) = 0$
    2. Recover $ b_i$ from $ \tilde{b}^*_i$
    3. For each observation $ z$
    4. Compute $ b_j$ from the Bayes filter equation (33) and $ b_i$.
    5. Compute $ \tilde{b}'$ from $ b_j$ by iterating equation (26).
    6. For each $ \tilde{b}^*_j$ with $ w(\tilde{b}^*_j, \tilde{b}') > 0$
    7. Add $ p(z,j\vert i,a)$ from equation (36) to $ \tilde{T}^*(\tilde{b}^*_i, a, \tilde{b}^*_j)$
  6. Compute the value function for $ \tilde{B}^*$
    1. $ t = 0$
    2. For each $ \tilde{b}^*_i \in \tilde{B}^*: V^0 (\tilde{b}^*_i) = 0$
    3. do
    4. $ t = t + 1$; change = 0
    5. For each $ \tilde{b}^*_i \in \tilde{B}^*$:
    6. $ V^t(\tilde{b}^*_i) = \max_a \left ( \tilde{R}^*(\tilde{b}^*_i, a) + \gamma \sum^{\vert\tilde{B}^*\vert}_{j=1} \tilde{T}^*(\tilde{b}^*_i, a, \tilde{b}^*_j) \cdot V^{t-1}(\tilde{b}^*_j) \right )$
    7. change = change + $ \vert V^t(\tilde{b}^*_i) - V^{t-1}(\tilde{b}^*_i)\vert$
    8. while change $ > 0 $
Table 2: Value Iteration for an E-PCA POMDP
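Step 6 of Table 2 is a standard value iteration loop over the tabulated model; a minimal NumPy sketch follows. With $ \gamma = 1$ the loop stops only when the backup reaches an exact fixed point, so this illustration allows a small tolerance as the stopping threshold:

```python
import numpy as np

def value_iteration(R, T, gamma=1.0, tol=1e-6):
    """Value iteration for the belief space MDP (step 6 of Table 2).

    R: R~*(b*_i, a),        shape (n, num_actions)
    T: T~*(b*_i, a, b*_j),  shape (num_actions, n, n)

    Returns the converged value function V, shape (n,).
    """
    V = np.zeros(R.shape[0])             # V^0 = 0
    while True:
        # Q[i, a] = R[i, a] + gamma * sum_j T[a, i, j] * V[j]
        Q = R + gamma * np.einsum('aij,j->ia', T, V)
        V_new = Q.max(axis=1)            # equation (30)
        change = np.abs(V_new - V).sum()
        V = V_new
        if change <= tol:
            return V
```

A greedy policy for the belief space MDP can then be read off by taking the argmax over actions of the final `Q` array.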

Nicholas Roy 2005-01-16