
Value Iteration

One way, then, to find an optimal policy is to find the optimal value function. It can be determined by a simple iterative algorithm called value iteration that can be shown to converge to the correct V* values [10, 13].

 
initialize V(s) arbitrarily
loop until policy good enough
    loop for s ∈ S
        loop for a ∈ A
            Q(s,a) := R(s,a) + γ Σ_{s'∈S} T(s,a,s') V(s')
        V(s) := max_a Q(s,a)
    end loop
end loop
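To make the backup concrete, here is a minimal sketch in Python of one full sweep of the loop above, assuming a tabular MDP stored as hypothetical arrays R[s, a] of expected rewards and T[s, a, s'] of transition probabilities (these names are not from the original):

import numpy as np

def value_iteration_sweep(V, R, T, gamma):
    """One sweep of full backups over all states.

    Assumed (hypothetical) representation:
      R: |S| x |A| array of expected immediate rewards R(s, a)
      T: |S| x |A| x |S| array of transition probabilities T(s, a, s')
      V: length-|S| array holding the current value estimates
    """
    num_states = R.shape[0]
    V_new = np.empty(num_states)
    for s in range(num_states):
        # Q(s, a) := R(s, a) + gamma * sum_{s'} T(s, a, s') V(s')
        Q = R[s] + gamma * T[s] @ V
        # V(s) := max_a Q(s, a)
        V_new[s] = Q.max()
    return V_new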

It is not obvious when to stop the value iteration algorithm. One important result bounds the performance of the current greedy policy as a function of the Bellman residual of the current value function [134]. It says that if the maximum difference between two successive value functions is less than ε, then the value of the greedy policy (the policy obtained by choosing, in every state, the action that maximizes the estimated discounted reward, using the current estimate of the value function) differs from the value function of the optimal policy by no more than 2εγ/(1-γ) at any state. This provides an effective stopping criterion for the algorithm. Puterman [90] discusses another stopping criterion, based on the span semi-norm, which may result in earlier termination. Another important result is that the greedy policy is guaranteed to be optimal in some finite number of steps even though the value function may not have converged [13]. In practice, the greedy policy is often optimal long before the value function has converged.
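As an illustrative sketch (not from the original), this Bellman-residual test can be wrapped around the sweep defined above; epsilon plays the role of ε in the bound, and the array names are the same assumptions as before:

import numpy as np

def value_iteration(R, T, gamma, epsilon):
    """Iterate full-backup sweeps until the Bellman residual is small.

    When max_s |V_new(s) - V(s)| < epsilon, the value of the greedy
    policy is within 2*epsilon*gamma/(1 - gamma) of optimal at every
    state (the Williams & Baird bound quoted in the text).
    Uses value_iteration_sweep from the previous sketch.
    """
    V = np.zeros(R.shape[0])
    while True:
        V_new = value_iteration_sweep(V, R, T, gamma)
        residual = np.max(np.abs(V_new - V))
        V = V_new
        if residual < epsilon:
            break
    # Extract the greedy policy from the final value estimates.
    Q = R + gamma * T @ V            # shape |S| x |A|
    policy = Q.argmax(axis=1)
    return V, policy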

Value iteration is very flexible. The assignments to V need not be done in strict order as shown above, but instead can occur asynchronously in parallel provided that the value of every state gets updated infinitely often on an infinite run. These issues are treated extensively by Bertsekas [16], who also proves convergence results.
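A small sketch of this flexibility, reusing the assumed arrays from above, backs up one state at a time, in place, under an arbitrary schedule:

import random
import numpy as np

def asynchronous_backups(V, R, T, gamma, num_updates, seed=0):
    """Apply full backups to individual states, in place and in arbitrary order.

    Convergence only requires that every state keeps being selected
    (infinitely often, on an infinite run).
    """
    rng = random.Random(seed)
    num_states = R.shape[0]
    for _ in range(num_updates):
        s = rng.randrange(num_states)            # arbitrary schedule
        V[s] = np.max(R[s] + gamma * T[s] @ V)   # in-place backup of state s
    return V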

Updates based on Equation 1 are known as full backups since they make use of information from all possible successor states. It can be shown that updates of the form

Q(s,a) := (1 - α) Q(s,a) + α ( r + γ max_{a'} Q(s',a') )

can also be used as long as each pairing of a and s is updated infinitely often, s' is sampled from the distribution T(s,a,s'), r is sampled with mean R(s,a) and bounded variance, and the learning rate α is decreased slowly. This type of sample backup [111] is critical to the operation of the model-free methods discussed in the next section.
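A single sample backup of this form might look like the following sketch (the function and argument names are assumptions, continuing the tabular representation used above):

import numpy as np

def sample_backup(Q, s, a, r, s_next, gamma, alpha):
    """One sample backup of the state-action pair (s, a).

    r and s_next are a sampled reward and successor state observed after
    taking action a in state s; alpha is the (decreasing) learning rate.
    """
    target = r + gamma * np.max(Q[s_next])            # r + gamma * max_a' Q(s', a')
    Q[s, a] = (1 - alpha) * Q[s, a] + alpha * target
    return Q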

The computational complexity of the value-iteration algorithm with full backups, per iteration, is quadratic in the number of states and linear in the number of actions. Commonly, the transition probabilities T(s,a,s') are sparse. If there are, on average, a constant number of next states with non-zero probability, then the cost per iteration is linear in the number of states and linear in the number of actions. The number of iterations required to reach the optimal value function is polynomial in the number of states and the magnitude of the largest reward if the discount factor is held constant. However, in the worst case the number of iterations grows polynomially in 1/(1-γ), so the convergence rate slows considerably as the discount factor approaches 1 [66].
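To illustrate the effect of sparsity, here is a sketch (with assumed, hypothetical names) that stores, for each (s, a) pair, only the successors with non-zero probability; with a bounded number of successors per pair, the cost per sweep is linear in both the number of states and the number of actions:

def sweep_sparse(V, R, successors, gamma):
    """One sweep of full backups using a sparse successor representation.

    Assumed (hypothetical) representation:
      R[s][a]          expected immediate reward R(s, a)
      successors[s][a] list of (s_next, prob) pairs with prob > 0
    """
    num_states = len(V)
    V_new = [0.0] * num_states
    for s in range(num_states):
        V_new[s] = max(
            R[s][a] + gamma * sum(p * V[s2] for s2, p in successors[s][a])
            for a in range(len(R[s]))
        )
    return V_new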

