Chrisman showed how the forward-backward algorithm for learning HMMs could be adapted to learning POMDPs. He, and later McCallum, also gave heuristic state-splitting rules to attempt to learn the smallest possible model for a given environment. The resulting model can then be used to integrate information from the agent's observations in order to make decisions.
Figure 10 illustrates the basic structure for a perfect-memory controller. The component on the left is the state estimator, which computes the agent's belief state, b, as a function of the old belief state, the last action, a, and the current observation, i. In this context, a belief state is a probability distribution over states of the environment, indicating the likelihood, given the agent's past experience, that the environment is actually in each of those states. The state estimator can be constructed straightforwardly using the estimated world model and Bayes' rule.
Figure 10: Structure of a POMDP agent.
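For concreteness, the belief-state update amounts to a single application of Bayes' rule: the old belief is pushed through the estimated transition model and then reweighted by the likelihood of the current observation. The following sketch assumes the estimated world model is stored as per-action transition and observation matrices; the function name and array layout are illustrative choices, not part of the original formulation.

```python
import numpy as np

def update_belief(b, a, i, T, O):
    """Bayes-rule update of the belief state.

    b    -- old belief: vector of probabilities over environment states
    a    -- last action taken
    i    -- current observation
    T[a] -- estimated transition matrix, T[a][s, s'] = P(s' | s, a)
    O[a] -- estimated observation matrix, O[a][s', i] = P(i | s', a)
    (this array layout is an assumption made for the sketch)
    """
    predicted = b @ T[a]                      # P(s' | b, a) = sum_s b(s) T(s, a, s')
    unnormalized = predicted * O[a][:, i]     # weight by likelihood of observation i
    return unnormalized / unnormalized.sum()  # normalize; the sum is Pr(i | a, b)
```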
Now we are left with the problem of finding a policy mapping belief states into action. This problem can be formulated as an MDP, but it is difficult to solve using the techniques described earlier, because the input space is continuous. Chrisman's approach does not take into account future uncertainty, but yields a policy after a small amount of computation. A standard approach from the operations-research literature is to solve for the optimal policy (or a close approximation thereof) based on its representation as a piecewise-linear and convex function over the belief space. This method is computationally intractable, but may serve as inspiration for methods that make further approximations [20, 65].
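As a rough illustration of how such a piecewise-linear and convex representation is used, the value function can be stored as a finite set of alpha-vectors, one linear facet per vector: the value of a belief state is the maximum of the corresponding dot products, and the greedy action is the one attached to the maximizing vector. The array layout and the bookkeeping that ties each vector to an action below are assumptions made for this sketch.

```python
import numpy as np

def value(b, alpha_vectors):
    """Evaluate a piecewise-linear, convex value function at belief b.
    alpha_vectors has shape (num_vectors, num_states); each row is one
    linear facet, and the value is the maximum over the facets."""
    return float(np.max(alpha_vectors @ b))

def greedy_action(b, alpha_vectors, vector_action):
    """Return the action attached to the maximizing alpha-vector.
    vector_action[k] is the action associated with alpha_vectors[k]
    (a bookkeeping convention assumed for this sketch)."""
    return vector_action[int(np.argmax(alpha_vectors @ b))]
```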