
Certainty Equivalent Methods


We begin with the most conceptually straightforward method: first, learn the T and R functions by exploring the environment and keeping statistics about the results of each action; next, compute an optimal policy using one of the methods of Section 3. This method is known as certainty equivalence [57].
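The two phases can be sketched concretely. In this minimal sketch, the tabular layout (counts indexed by state, action, and successor), the uniform fallback for unvisited state-action pairs, and the choice of value iteration as the Section 3 solver are illustrative assumptions, not details fixed by the text:

```python
import numpy as np

def estimate_model(transitions, n_states, n_actions):
    """Phase 1: estimate T(s, a, s') and R(s, a) from experience.

    transitions: list of (s, a, r, s_next) tuples gathered by exploring.
    Unvisited (s, a) pairs fall back to a uniform T and zero reward
    (an arbitrary choice; the text does not specify one).
    """
    counts = np.zeros((n_states, n_actions, n_states))
    reward_sums = np.zeros((n_states, n_actions))
    for s, a, r, s_next in transitions:
        counts[s, a, s_next] += 1
        reward_sums[s, a] += r
    visits = counts.sum(axis=2)                      # N(s, a)
    T = np.where(visits[:, :, None] > 0,
                 counts / np.maximum(visits, 1)[:, :, None],
                 1.0 / n_states)
    R = reward_sums / np.maximum(visits, 1)
    return T, R

def value_iteration(T, R, gamma=0.9, tol=1e-6):
    """Phase 2: solve the estimated MDP with value iteration."""
    n_states, n_actions = R.shape
    V = np.zeros(n_states)
    while True:
        Q = R + gamma * (T @ V)                      # Q[s, a]
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=1)           # values, greedy policy
        V = V_new
```

The estimated model is treated as if it were certain to be correct — hence the name — and the policy is computed once, after exploration ends.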

There are some serious objections to this method: it makes an arbitrary division between the learning phase and the acting phase, and it leaves open the question of how the agent should gather data about the environment in the first place. Random exploration, in particular, can be immensely inefficient, as the following example shows.

Figure: In this environment, due to Whitehead [130], random exploration would take O(2^n) steps to reach the goal even once, whereas a more intelligent exploration strategy (e.g. ``assume any untried action leads directly to goal'') would require only O(n^2) steps.

A variation on this idea is to apply certainty equivalence continually: the model is learned throughout the agent's lifetime and, at each step, the current model is used to compute an optimal policy and value function. This makes very effective use of the available data, but it still ignores the question of exploration and is extremely computationally demanding, even for fairly small state spaces. Fortunately, there are a number of other model-based algorithms that are more practical.
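The source of the computational expense is visible in a sketch of a single step of the continual variant: every new transition triggers a full re-solve of the estimated MDP. The function name, the count-based model representation, and the use of value iteration as the solver are assumptions for illustration:

```python
import numpy as np

def continual_ce_step(counts, reward_sums, s, a, r, s_next, gamma=0.9):
    """One step of continual certainty equivalence.

    Folds the latest transition (s, a, r, s_next) into the running
    model statistics, then re-solves the whole MDP from scratch and
    returns the greedy policy. counts has shape (S, A, S') and
    reward_sums has shape (S, A); both are updated in place.
    """
    counts[s, a, s_next] += 1
    reward_sums[s, a] += r
    visits = np.maximum(counts.sum(axis=2), 1)
    T = counts / visits[:, :, None]
    # Unvisited (s, a) pairs: fall back to uniform transitions.
    T[counts.sum(axis=2) == 0] = 1.0 / counts.shape[2]
    R = reward_sums / visits
    # Full value iteration on every step -- this per-step re-solve is
    # what makes the method so computationally demanding.
    V = np.zeros(counts.shape[0])
    for _ in range(1000):
        Q = R + gamma * (T @ V)
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < 1e-6:
            break
        V = V_new
    return Q.argmax(axis=1)
```

Solving the MDP costs on the order of a full dynamic-programming sweep per real-world step, which is why this approach is only feasible for small state spaces.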

Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996