Some improvement can be gained by considering stochastic policies; these are mappings from observations to probability distributions over actions. If there is randomness in the agent's actions, it will not get stuck in the hall forever. Jaakkola, Singh, and Jordan [50] have developed an algorithm for finding locally-optimal stochastic policies, but finding a globally optimal policy is still NP hard.

In our example, it turns out that the optimal stochastic policy is for the agent, when in a state that looks like a hall, to go east with probability and west with probability . This policy can be found by solving a simple (in this case) quadratic program. The fact that such a simple example can produce irrational numbers gives some indication that it is a difficult problem to solve exactly.

Wed May 1 13:19:13 EDT 1996