
Solving Large POMDPs

In this section, we present the application of our algorithm to finding policies for large POMDPs.

Toy problem

We first tested the E-PCA belief features using a regular grid representation on a version of the toy problem described earlier. To ensure that we only needed a small set of belief samples $\{\tilde{b}^*_i\}$, we made the goal region larger. We also used a coarser discretization of the underlying state space (40 states instead of 200) to allow us to compute the low-dimensional model more quickly.
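For readers who want a concrete picture of the regular grid representation, the sketch below shows one way to map the low-dimensional coordinates of sampled beliefs onto grid cells. The helper names are hypothetical, and epca_project stands in for the E-PCA projection of a belief onto the learned bases; this is only an illustration of the discretization, not the implementation used in our experiments.

import numpy as np

def build_regular_grid(low_dim_beliefs, cells_per_dim=10):
    """Map low-dimensional belief coordinates onto a regular grid.

    low_dim_beliefs: array of shape (num_samples, num_bases), e.g. the
    E-PCA coordinates of the sampled beliefs.  Returns a function that
    maps a coordinate vector to its grid-cell index.
    """
    lo = low_dim_beliefs.min(axis=0)
    hi = low_dim_beliefs.max(axis=0)
    width = np.where(hi > lo, (hi - lo) / cells_per_dim, 1.0)

    def cell_index(coords):
        idx = np.floor((coords - lo) / width).astype(int)
        # Clip so that coordinates on the upper boundary fall in the last cell.
        return tuple(np.clip(idx, 0, cells_per_dim - 1))

    return cell_index

# Hypothetical usage:
# coords = np.array([epca_project(b) for b in sampled_beliefs])
# cell_of = build_regular_grid(coords)
# occupied_cells = {cell_of(c) for c in coords}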

Figure 15 shows a comparison of the policies from the different algorithms. The E-PCA policy does approximately twice as well as the Maximum-Likelihood heuristic; this heuristic guesses which corridor it is in, and is correct only about half the time. The ``AMDP Heuristic'' is the Augmented MDP algorithm reported by Roy & Thrun [RT99]. This controller attempts to find the policy that reaches the goal while keeping the belief entropy as low as possible. It does very poorly because it is unable to distinguish between a unimodal belief that knows which corridor it is in but not its position within the corridor, and a bimodal belief that knows its position but not which corridor. The results in Figure 15 are averaged over 10,000 trials.
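The Maximum-Likelihood heuristic referred to above can be summarized in a few lines. The sketch below is a minimal illustration with hypothetical names; it assumes a policy mdp_policy computed for the fully observable MDP, and simply acts as if the most probable state under the current belief were the true state (which is why it effectively guesses which corridor it is in).

import numpy as np

def maximum_likelihood_action(belief, mdp_policy):
    """Maximum-Likelihood heuristic: act as if the most probable state is true.

    belief: array of probabilities over the discrete states.
    mdp_policy: function mapping a state index to an action, computed for
    the fully observable MDP.
    """
    most_likely_state = int(np.argmax(belief))
    return mdp_policy(most_likely_state)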

Figure 15: A comparison of policy performance using different numbers of bases, for 10,000 trials, with regular grid discretization. Policy performance was given by total reward accumulated over trials.

It should be noted that this problem is sufficiently small that conventional PCA fares reasonably well. In the next sections, we will see problems where the PCA representation does poorly compared to E-PCA.

Robot Navigation

We tested the E-PCA POMDP algorithm on simulated robot navigation problems in two example environments, the Wean Hall corridor shown in Figure 16 and the Longwood retirement facility shown in Figure 1(b). The model parameters are given by robot navigation models (see [FBT99]).

We evaluated the policy for the relatively simple problem depicted in Figure 16. We set the robot's initial belief such that it may have been at one of two locations in the corridor, with the objective of getting to within $ 0.1m$ of the goal state (each grid cell is $ 0.2m \times 0.2m$). The controller received a reward of $ +1000$ for arriving at the goal state and taking an at_goal action; a reward of $ -1000$ was given for (incorrectly) taking this action at a non-goal state, and a reward of $ -1$ for each motion. The states used for planning in this example were the 500 states along the corridor, and the actions were forward and backward motion.
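To make the reward structure concrete, a minimal sketch of the reward model is shown below. The encoding of states as cell indices and the names are hypothetical; the numeric values are those given above.

CELL_SIZE_M = 0.2       # each grid cell is 0.2 m x 0.2 m
GOAL_TOLERANCE_M = 0.1  # "within 0.1 m of the goal state"

def reward(state, action, goal_state):
    """Reward model for the corridor navigation problem (hypothetical encoding).

    state, goal_state: indices of the 500 states along the corridor.
    action: one of 'forward', 'backward', 'at_goal'.
    """
    if action == 'at_goal':
        at_goal = abs(state - goal_state) * CELL_SIZE_M <= GOAL_TOLERANCE_M
        return 1000 if at_goal else -1000
    return -1  # each motion action costs -1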

Figure 16 shows a sample robot trajectory using the E-PCA policy and 5 basis functions. Notice that the robot drives past the goal to the lab door in order to verify its orientation before returning to the goal; the robot does not know its true position, and cannot know that it is in fact passing the goal. If the robot had started at the other end of the corridor, its orientation would have become apparent on its way to the goal.

Figure 16: An example robot trajectory, using the policy learned using 5 basis functions. On the left are the start conditions and the goal. On the right is the robot trajectory. Notice that the robot drives past the goal to the lab door to localize itself, before returning to the goal.

Figure 17 shows the average policy performance for three different techniques. The Maximum-Likelihood heuristic could not distinguish orientations, and therefore approximately 50% of the time declared the goal in the wrong place. We also evaluated a policy learned using the best 5 bases from conventional PCA. This policy performed substantially better than the Maximum-Likelihood heuristic in that the controller did not incorrectly declare that the robot had arrived at the goal. However, this representation could not detect when the robot was at the goal, and also regularly chose motion actions that were sub-optimal with respect to the E-PCA policy. The E-PCA outperformed the other techniques in this example because it was able to model its belief accurately, in contrast to the result in Figure 15, where PCA had a sufficiently expressive representation to perform as well as or better than E-PCA.

Figure 17: A comparison of policy performance using E-PCA, conventional PCA and the Maximum Likelihood heuristic, for 1,000 trials.

Figure 18(a) shows a second example of navigation in simulation. Notice that the initial belief for this problem is bi-modal; a good policy will take actions to disambiguate the modes before proceeding to the goal. Using a sample set of 500 beliefs, we computed the low-dimensional belief space $ \tilde{B}$. Figure 18(b) shows the average KL divergence between the original and reconstructed beliefs. The improvement in the KL divergence error measure slowed down substantially around 6 bases; we therefore used 6 bases to represent the belief space.
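The choice of 6 bases follows a simple elbow criterion on the reconstruction error. The sketch below illustrates the computation under stated assumptions: epca_fit and epca_reconstruct are hypothetical stand-ins for the E-PCA decomposition and reconstruction described in earlier sections.

import numpy as np

def average_kl(beliefs, reconstructions, eps=1e-12):
    """Average KL divergence KL(b || b_reconstructed) over sampled beliefs."""
    total = 0.0
    for b, b_hat in zip(beliefs, reconstructions):
        b = np.asarray(b, dtype=float) + eps
        b_hat = np.asarray(b_hat, dtype=float) + eps
        b /= b.sum()
        b_hat /= b_hat.sum()
        total += np.sum(b * np.log(b / b_hat))
    return total / len(beliefs)

# Hypothetical usage: sweep the number of bases and look for the elbow.
# for num_bases in range(1, 11):
#     model = epca_fit(sampled_beliefs, num_bases)
#     recon = [epca_reconstruct(model, b) for b in sampled_beliefs]
#     print(num_bases, average_kl(sampled_beliefs, recon))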

Figure 18: (a) The sample navigation problem in Longwood, cf. Figure 2; this problem involves multi-modal distributions. (b) The average KL divergence between the sample beliefs and their reconstructions against the number of bases used, for 500 sample beliefs for a navigating mobile robot in this environment. (c) The complete trajectory of the E-PCA policy. (d) A comparison of policy performance using E-PCA, conventional MDP and the AMDP heuristic.
[Panels: (a) Initial Distribution; (b) Reconstruction Performance; (c) The Complete Trajectory; (d) Policy Performance]

Figure 18(c) shows an example execution of the policy computed using the E-PCA. The reward parameters were the same as in the previous navigation example. The robot parameters were a maximum laser range of $ 2m$ and high motion model variance. The first action the policy chose was to turn the robot around and move it closer to the nearest wall, which had the effect of eliminating the second distribution mode on the right. The robot then followed an essentially ``coastal'' trajectory up the left-hand wall in order to stay localized, although the uncertainty in the $ y$ direction became relatively pronounced. Once the uncertainty resolved itself at the top of the image, the robot moved to the goal.

It is interesting to note that this policy shares a ``coastal'' attribute with some heuristic policies (e.g., the Entropy heuristic and the AMDP, [CKK96,RT99]). However, unlike these heuristics, the E-PCA policy was able to reach the goal more accurately (that is, get closer to the goal), because its representation captured the beliefs, and the effects of actions on the beliefs, more accurately.

Finding People

Figure 19: The performance of PCA and E-PCA on a sample belief. The map is $ 238 \times 85$ grid cells, at a $ 0.2m$ resolution. (a) A sample belief. (b) The PCA reconstruction, using 40 bases. (c) The E-PCA reconstruction, using 6 bases.
[Panels: (a) Original Belief; (b) Reconstruction with PCA; (c) Reconstruction with E-PCA]

In addition to the synthetic problem and the robot navigation problems described in the previous sections, we also tested our algorithm on a more complicated POMDP problem: finding a person or object moving around in an environment. This problem is motivated by the Nursebot domain, where residents experiencing cognitive decline can sometimes become disoriented and start to wander. In order to make better use of the health-care providers' time, we would like to use a robot such as Pearl (Figure 1(a)) to find the residents quickly. We assume that the person is not adversarial.

The state space in this problem is much larger than in the previous robot navigation problems: it is the cross-product of the person's position and the robot's position. However, we assume for simplicity that the robot's position is known, and therefore the belief distribution is only over the person's position. The person's position evolves according to Brownian motion with a fixed, known velocity; that is, the person's motion is random and independent of the robot's position. (If the person were moving to avoid being ``captured'' by the robot, a different transition model would be required.) We assume that the position of the person is unobservable until the robot is close enough to see the person (when the robot has line-of-sight to the person, up to some maximum range, usually 3 metres); the observation model has 1% false negatives and no false positives. The reward function is maximal when the person and the robot are in the same location.
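A minimal sketch of the person-state models described above is given below. The names are hypothetical, line_of_sight stands in for a ray-casting test against the occupancy map, and the Brownian step standard deviation is an assumption standing in for the fixed, known velocity.

import numpy as np

MAX_SENSOR_RANGE_M = 3.0
FALSE_NEGATIVE_RATE = 0.01  # 1% false negatives, no false positives

def person_transition(person_pos, step_std, rng):
    """Brownian-motion model of the person's movement.

    person_pos: np.array([x, y]) in metres.
    """
    return person_pos + rng.normal(0.0, step_std, size=2)

def observation_probability(saw_person, robot_pos, person_pos, grid_map):
    """P(observation | robot position, person position).

    line_of_sight() is a hypothetical ray-casting test against the map.
    """
    visible = (np.linalg.norm(person_pos - robot_pos) <= MAX_SENSOR_RANGE_M
               and line_of_sight(robot_pos, person_pos, grid_map))
    if visible:
        return 1.0 - FALSE_NEGATIVE_RATE if saw_person else FALSE_NEGATIVE_RATE
    return 0.0 if saw_person else 1.0  # no false positives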

Figure 19(a) shows an example probability distribution that can occur in this problem (not shown is the robot's position). The grey dots are particles drawn from the distribution of where the person could be in the environment. The distribution is initially uniform over the reachable areas (inside the black walls). After the robot receives sensor data, the probability mass is extinguished within the sensor range of the robot. As the robot moves around, more of the probability mass is extinguished, focusing the distribution on the remaining places the person can be. However, the probability distribution starts to recover mass in places the robot visits but then leaves. In the particle filter, this can be visualized as particles leaking into areas that were previously emptied out.
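The behaviour described above corresponds to a standard particle-filter update over the person's position. The sketch below reuses the hypothetical models from the previous listing: diffusing each particle under the Brownian motion model is what lets probability mass leak back into previously cleared areas, while weighting by the observation model is what extinguishes mass inside the robot's sensor range.

import numpy as np

def particle_filter_update(particles, robot_pos, saw_person, grid_map,
                           step_std, rng):
    """One predict/correct step for the belief over the person's position.

    particles: array of shape (N, 2) of (x, y) samples of the person's position.
    """
    # Predict: diffuse each particle under the Brownian motion model.
    particles = np.array([person_transition(p, step_std, rng) for p in particles])

    # Correct: weight each particle by the observation likelihood (zero inside
    # the sensor range if the person was not seen), then resample.
    weights = np.array([observation_probability(saw_person, robot_pos, p, grid_map)
                        for p in particles])
    if weights.sum() == 0.0:
        return particles  # degenerate case: keep the prediction (an assumption)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return particles[idx]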

We collected a set of 500 belief samples using a heuristic controller that drives the robot to the maximum likelihood location of the person, and used E-PCA to find a good low-dimensional representation of the beliefs. Figure 19(b) shows the reconstruction of the example belief in Figure 19(a), using conventional PCA and 40 bases. This figure should reinforce the idea that PCA performs poorly at representing probability distributions. Figure 19(c) shows the reconstruction using E-PCA and 6 bases, which is a qualitatively better representation of the original belief.

Recall from Section 6 that we use a function approximator to represent the value function. In the preceding examples we used a regular grid over the low-dimensional surface, which performed well for finding good policies. However, the problem of finding people empirically requires a finer-resolution representation than would be computationally tractable with a regular grid. We therefore turn to a different function approximator: the $ 1$-nearest-neighbour variable resolution representation. We add new low-dimensional belief states to the model by periodically re-evaluating the model at each grid cell, and splitting the grid cell into smaller discrete cells where a statistic predicted from the model disagrees with the statistic computed from experience. A number of different statistics have been suggested for testing the model against data from the real world [MM99], such as reduction in reward variance, or value function disagreement. We have opted instead for a simpler criterion of transition probability disagreement; a sketch of this test is given below. We examine the policy computed using a fixed representation, and also the policy computed using an incrementally refined representation.

Note that we have not fully explored the effect of different variable resolution representations of the value function, e.g., using $ k$-nearest-neighbour interpolations such as those described by Hauskrecht [Hau00]. These experiments are beyond the scope of this paper, as our focus is on the utility of the E-PCA decomposition. No variable resolution representation of a value function has been shown to scale effectively beyond a few tens of dimensions at best [MM02].
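The splitting test mentioned above can be sketched as follows. The data structures and the threshold are hypothetical; each grid cell is assumed to store both the successor-cell transition probabilities predicted by the low-dimensional model and empirical visit counts gathered from experience.

def should_split(predicted_probs, observed_counts, threshold=0.1):
    """Split a grid cell when the model's predicted transition distribution
    disagrees with the empirical distribution estimated from experience.

    predicted_probs: dict mapping successor cell -> probability under the model.
    observed_counts: dict mapping successor cell -> visit count from experience.
    threshold: maximum tolerated L1 disagreement (a hypothetical value).
    """
    total = sum(observed_counts.values())
    if total == 0:
        return False  # no experience yet for this cell
    cells = set(predicted_probs) | set(observed_counts)
    l1 = sum(abs(predicted_probs.get(c, 0.0) - observed_counts.get(c, 0) / total)
             for c in cells)
    return l1 > threshold

# Periodically: for every cell where should_split(...) holds, subdivide the
# cell into smaller cells and recompute the value function over the refined grid.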

Figure 20: An example of a suboptimal person finding policy. The grey particles are drawn from the distribution of where the person might be, initially uniformly distributed in (a). The black dot is the true (unobservable) position of the person. The open circle is the observable position of the robot. Through the robot's poor action selection, the person is able to escape into previously explored areas.

This problem shares many attributes with the robot navigation problem, but Figures 19, 20 and 21 show that it generates spatial distributions of higher complexity. It is somewhat surprising that E-PCA is able to find a good representation of these beliefs using only 6 bases; indeed, the average KL divergence is generally higher than for the robot navigation task. Nevertheless, we are able to find good controllers, and this is an example of a problem where PCA performs very poorly even with a large number of bases.

Figure 21: The policy computed using the E-PCA representation. The initial conditions in panel (a) are the same as in Figure 20. Notice that, unlike the previous figure, this strategy ensures that the probability mass is located in one place, allowing the robot to find the person with significantly higher probability.

Figure 20 shows an example trajectory from the heuristic control strategy, driving the robot to the maximum likelihood location of the person at each time step. The open circle is the robot position, starting at the far right. The solid black circle is the position of the person, which is unobservable by the robot until within a $ 3m$ range. The person starts in the room above the corridor (a), and then moves down into the corridor once the robot has moved to the far end of the corridor (b). As the robot returns to search inside the room (c) and (d), the person moves unobserved into the previously searched corridor (e). Although we have deliberately chosen an example where the heuristic performs poorly, the person is not following an unlikely or adversarial trajectory: at all times the solid black circle remains in regions of high probability. The robot's belief accurately reflects the possibility that the person will slip past, but the heuristic control algorithm has no way to take this possibility into account.

Using the policy found for the low-dimensional belief space as described in previous sections, we are able to find a much better controller. A sample trajectory for this controller is shown in Figure 21. The robot travels from the right-most position in the corridor (a) to only part-way down the corridor (b), and then returns to explore the room (c) and (d). (In this example, the person's starting position was different from the one in the previous example; had the person started from the same initial conditions as before, the E-PCA policy would have found the person at this point.) After exploring the room and eliminating the possibility that the person is inside it (e), the policy has reduced the possible locations of the person to the left-hand end of the corridor, and is able to find the person reliably at that location.

Note that Figures 20 and 21 place the target person in the worst-case start position for each planner. If the person were in the same start position in Figure 21 as in Figure 20, the policy would have found the person by panel (d). Similarly, if the person had started at the end of the corridor as in Figure 21, the policy shown in Figure 20 would have found the person by panel (b).

Figure 22: A comparison of 6 policies for person finding in a simple environment. The baseline is the fully observable (i.e., cheating) solution, shown as the solid line. The E-PCA policy uses a fixed variable resolution discretization. The Refined E-PCA policy uses a discretization to which additional belief samples have been added. The PCA policy was approximately 6 times worse than the best E-PCA policy.

Figure 22 shows a quantitative comparison of the E-PCA policy against a number of other heuristic controllers in simulation, in terms of the average time to find the person. The solid line depicts the baseline performance, using a controller that has access to the true state of the person at all times (i.e., a fully observable lower bound on the best possible performance). The travel time in this case is solely a function of the distance to the person; no searching is necessary or performed. This controller is, of course, not realizable in practice. The other controllers are:

The performance of the best E-PCA controller is surprisingly close to the theoretical best performance in terms of time to find the person, but this result also demonstrates the need for a careful choice of discretization of the belief space when computing the value function. The initial variable resolution representation proved to be a poor function approximator; however, using the iteratively refined variable resolution discretization, we were able to improve performance substantially. The controller using the conventional PCA representation was computed over a fixed discretization of the low-dimensional representation, using 40 bases and 500 grid points. The quality of the belief representation under PCA was so poor that we did not investigate more complex policy approximators.

