
Value Iteration

To explain value iteration, we need to consider how the belief state evolves over time. Let b be the current belief state. The belief state at the next point in time is determined by the current belief state b, the current action a, and the next observation z. We denote it by $b^a_z$. For any state s', $b^a_z(s')$ is given by

\[ b^a_z(s') \;=\; \frac{P(z|s',a)\, P(s'|b,a)}{P(z|b,a)}, \]

where $P(s'|b,a) = \sum_s P(s'|s,a)\, b(s)$ and $P(z|b,a) = \sum_{s'} P(z|s',a)\, P(s'|b,a)$ is the renormalization constant. As the notation suggests, the constant can also be interpreted as the probability of observing z after taking action a in belief state b.
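The belief update above can be sketched in a few lines of code. This is a minimal illustration, assuming a hypothetical list-based POMDP encoding (`trans[a][s][sp]` for P(s'|s,a) and `obs[a][sp][z]` for P(z|s',a)); the function name `belief_update` and all concrete numbers are invented for this example, not taken from the text.

```python
def belief_update(b, a, z, trans, obs):
    """Return (b^a_z, P(z|b,a)) for belief b, action a, observation z."""
    n = len(b)
    # Unnormalized next belief: P(z|s',a) * sum_s P(s'|s,a) b(s)
    unnorm = [obs[a][sp][z] * sum(trans[a][s][sp] * b[s] for s in range(n))
              for sp in range(n)]
    p_z = sum(unnorm)                     # renormalization constant P(z|b,a)
    return [x / p_z for x in unnorm], p_z

# Made-up two-state, two-action, two-observation POMDP:
trans = [[[0.9, 0.1], [0.2, 0.8]],       # action 0: trans[0][s][s'] = P(s'|s,0)
         [[0.5, 0.5], [0.5, 0.5]]]       # action 1
obs = [[[0.8, 0.2], [0.3, 0.7]],         # action 0: obs[0][s'][z] = P(z|s',0)
       [[0.6, 0.4], [0.6, 0.4]]]         # action 1

b_next, p_z = belief_update([0.5, 0.5], 0, 0, trans, obs)
# b_next ≈ [0.7652, 0.2348], p_z = 0.575
```

Note that the updated belief sums to one by construction, and the returned constant `p_z` is exactly the probability of the observation, matching the interpretation given above.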

Define an operator T that takes a value function V and returns another value function TV as follows:

\[ TV(b) \;=\; \max_a \Big[\, r(b,a) + \lambda \sum_z P(z|b,a)\, V(b^a_z) \,\Big], \]

where $r(b,a) = \sum_s r(s,a)\, b(s)$ is the expected immediate reward for taking action a in belief state b. For a given value function V, a policy $\pi$ is said to be V-improving if

\[ \pi(b) \;\in\; \arg\max_a \Big[\, r(b,a) + \lambda \sum_z P(z|b,a)\, V(b^a_z) \,\Big] \quad \mbox{for all } b. \]
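Evaluating (TV)(b) at a single belief point follows directly from the definition of T. The sketch below assumes the same hypothetical list-based encoding as before, with `rewards[a][s]` standing in for r(s,a) and `gamma` for the discount factor $\lambda$; the function name `backup`, V passed as an arbitrary Python callable on beliefs, and all numbers are illustrative assumptions.

```python
def backup(b, V, trans, obs, rewards, gamma):
    """Return (TV)(b) = max_a [ r(b,a) + gamma * sum_z P(z|b,a) V(b^a_z) ]."""
    n = len(b)
    best = float("-inf")
    for a in range(len(trans)):
        # Expected immediate reward r(b,a) = sum_s r(s,a) b(s)
        value = sum(rewards[a][s] * b[s] for s in range(n))
        for z in range(len(obs[a][0])):
            # Unnormalized next belief and its mass P(z|b,a)
            unnorm = [obs[a][sp][z] * sum(trans[a][s][sp] * b[s] for s in range(n))
                      for sp in range(n)]
            p_z = sum(unnorm)
            if p_z > 0:
                b_next = [x / p_z for x in unnorm]
                value += gamma * p_z * V(b_next)
        best = max(best, value)
    return best

# Made-up two-state POMDP, reward 1 for action 0 in state 0:
trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
obs = [[[0.8, 0.2], [0.3, 0.7]], [[0.6, 0.4], [0.6, 0.4]]]
rewards = [[1.0, 0.0], [0.0, 0.0]]
```

With the zero value function, `backup([0.5, 0.5], lambda b: 0.0, trans, obs, rewards, 0.95)` reduces to the best one-step expected reward, here 0.5.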
Value iteration is an algorithm for finding $\epsilon$-optimal policies. It starts with an initial value function $V_0$ and iterates using the following formula:

\[ V_n \;=\; T V_{n-1}. \]

It is known (e.g., Puterman 1990, Theorem 6.9) that $V_n$ converges to $V^*$ as n goes to infinity. Value iteration terminates when the Bellman residual $\max_b |V_n(b) - V_{n-1}(b)|$ falls below $\epsilon(1-\lambda)/2\lambda$. When it does, a $V_n$-improving policy is $\epsilon$-optimal (e.g., Puterman 1990).
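The loop structure of value iteration, including the Bellman-residual stopping test, can be sketched as follows. This is a deliberately crude illustration that discretizes the belief simplex of a two-state POMDP onto a uniform grid and reads off-grid values from the nearest grid point; it is an approximation for exposition only, not the finite exact representation the text turns to next. The encoding (`trans`, `obs`, `rewards`) and all numbers are hypothetical, as before.

```python
def value_iteration(trans, obs, rewards, gamma, eps, grid_size=101):
    """Approximate value iteration for a two-state POMDP on a belief grid."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]   # values of b(0)
    V = [0.0] * grid_size                                    # V_0 = 0
    while True:
        V_new = []
        for p in grid:
            b = [p, 1.0 - p]
            best = float("-inf")
            for a in range(len(trans)):
                # r(b,a) = sum_s r(s,a) b(s)
                val = sum(rewards[a][s] * b[s] for s in range(2))
                for z in range(len(obs[a][0])):
                    unnorm = [obs[a][sp][z] *
                              sum(trans[a][s][sp] * b[s] for s in range(2))
                              for sp in range(2)]
                    p_z = sum(unnorm)                        # P(z|b,a)
                    if p_z > 0:
                        b0 = unnorm[0] / p_z
                        # Nearest-grid-point stand-in for V(b^a_z)
                        val += gamma * p_z * V[round(b0 * (grid_size - 1))]
                best = max(best, val)
            V_new.append(best)
        residual = max(abs(x - y) for x, y in zip(V_new, V))
        V = V_new
        if residual < eps * (1 - gamma) / (2 * gamma):       # Bellman residual test
            return V

trans = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
obs = [[[0.8, 0.2], [0.3, 0.7]], [[0.6, 0.4], [0.6, 0.4]]]
rewards = [[1.0, 0.0], [0.0, 0.0]]
V = value_iteration(trans, obs, rewards, gamma=0.9, eps=0.1)
```

Because each sweep is a max-norm contraction with factor $\lambda$ (the grid lookup is a nonexpansion), the residual shrinks geometrically and the loop terminates.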

Since there are infinitely many belief states, value functions cannot be explicitly represented. Fortunately, the value functions that one encounters in the process of value iteration admit implicit finite representations. Before explaining why, we first introduce several technical concepts and notations.

Dr. Lian Wen Zhang
Thu Feb 15 14:47:09 HKT 2001