Value iteration works by producing successive approximations of the optimal value function. Each iteration can be performed in steps, or faster if there is sparsity in the transition function. However, the number of iterations required can grow exponentially in the discount factor [27]; as the discount factor approaches 1, the decisions must be based on results that happen farther and farther into the future. In practice, policy iteration converges in fewer iterations than value iteration, although the per-iteration costs of can be prohibitive. There is no known tight worst-case bound available for policy iteration [66]. Modified policy iteration [91] seeks a trade-off between cheap and effective iterations and is preferred by some practictioners [96].

Linear programming [105] is an extremely general problem, and MDPs can be solved by general-purpose linear-programming packages [35, 34, 46]. An advantage of this approach is that commercial-quality linear-programming packages are available, although the time and space requirements can still be quite high. From a theoretic perspective, linear programming is the only known algorithm that can solve MDPs in polynomial time, although the theoretically efficient algorithms have not been shown to be efficient in practice.

Wed May 1 13:19:13 EDT 1996