To explain value iteration, we need to consider how the belief state evolves over time. Let $b$ be the current belief state. The belief state at the next point in time is determined by the current belief state, the current action $a$, and the next observation $z$. We denote it by $b^a_z$. For any state $s'$, $b^a_z(s')$ is given by
\[
b^a_z(s') = \frac{P(z|s',a)\sum_{s} P(s'|s,a)\, b(s)}{P(z|b,a)},
\]
where $P(z|b,a) = \sum_{s'} P(z|s',a) \sum_{s} P(s'|s,a)\, b(s)$ is the renormalization constant. As the notation suggests, this constant can also be interpreted as the probability of observing $z$ after taking action $a$ in belief state $b$.
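As a concrete illustration of this update, the following NumPy sketch computes $b^a_z$ together with the constant $P(z|b,a)$. The array layout (transition probabilities indexed as P_trans[a][s, s'] for $P(s'|s,a)$ and observation probabilities as P_obs[a][s', z] for $P(z|s',a)$) is an assumption made purely for illustration and is not notation from this paper.

\begin{verbatim}
import numpy as np

def belief_update(b, a, z, P_trans, P_obs):
    # Unnormalized next belief over s': P(z|s',a) * sum_s P(s'|s,a) b(s)
    unnormalized = P_obs[a][:, z] * (b @ P_trans[a])
    # Renormalization constant P(z|b,a), i.e. the probability of
    # observing z after taking action a in belief state b
    p_z = unnormalized.sum()
    if p_z == 0.0:
        return np.zeros_like(b), 0.0  # observation z impossible under (b, a)
    return unnormalized / p_z, p_z
\end{verbatim}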
Define an operator $T$ that takes a value function $V$ and returns another value function $TV$ as follows:
\[
TV(b) = \max_{a} \Big[ r(b,a) + \lambda \sum_{z} P(z|b,a)\, V(b^a_z) \Big],
\]
where $r(b,a) = \sum_{s} r(s,a)\, b(s)$ is the expected immediate reward for taking action $a$ in belief state $b$ and $\lambda$ is the discount factor. For a given value function $V$, a policy $\pi$ is said to be $V$-improving if
\[
\pi(b) = \arg\max_{a} \Big[ r(b,a) + \lambda \sum_{z} P(z|b,a)\, V(b^a_z) \Big] \quad \text{for all } b.
\]
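The operator $T$ and the $V$-improving choice of action can be sketched in the same style. Given any callable V that evaluates a value function at a belief vector, the code below computes $TV(b)$ together with a maximizing action at $b$. The function name backup, the reward layout r[a][s] = $r(s,a)$, and the reuse of belief_update from the sketch above are illustrative assumptions.

\begin{verbatim}
def backup(b, V, r, P_trans, P_obs, discount):
    # Compute TV(b) and a V-improving action at belief b.
    best_value, best_action = -np.inf, None
    for a in range(len(P_trans)):
        value = b @ r[a]  # expected immediate reward r(b, a)
        for z in range(P_obs[a].shape[1]):
            next_b, p_z = belief_update(b, a, z, P_trans, P_obs)
            if p_z > 0.0:  # impossible observations contribute nothing
                value += discount * p_z * V(next_b)
        if value > best_value:
            best_value, best_action = value, a
    return best_value, best_action
\end{verbatim}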
Value iteration is an algorithm for finding $\epsilon$-optimal policies. It starts with an initial value function $V_0$ and iterates using the following formula:
\[
V_n = T V_{n-1}.
\]
It is known (e.g., Puterman 1990, Theorem 6.9) that $V_n$ converges to the optimal value function $V^*$ as $n$ goes to infinity. Value iteration terminates when the Bellman residual $\max_b |V_n(b) - V_{n-1}(b)|$ falls below $\epsilon(1-\lambda)/2\lambda$. When it does, a $V_n$-improving policy is $\epsilon$-optimal (e.g., Puterman 1990).
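Because the belief space is continuous, the iteration cannot be run literally as written; the sketch below restricts it to a fixed finite grid of belief points and evaluates the value function at $b^a_z$ by nearest-neighbor lookup, purely to illustrate the update $V_n = T V_{n-1}$ and the stopping rule. This grid approximation is an assumption for illustration only; it is not the representation used in this paper, which is introduced next.

\begin{verbatim}
def value_iteration_on_grid(beliefs, r, P_trans, P_obs, discount, epsilon):
    # beliefs: list of belief vectors used as a crude stand-in for belief space
    values = np.zeros(len(beliefs))  # V_0 = 0 at every grid point
    threshold = epsilon * (1.0 - discount) / (2.0 * discount)
    while True:
        # Evaluate the current value function at an arbitrary belief by
        # nearest-neighbor lookup on the grid (an illustrative approximation)
        def V(b):
            i = min(range(len(beliefs)),
                    key=lambda j: np.abs(beliefs[j] - b).sum())
            return values[i]
        new_values = np.array([backup(b, V, r, P_trans, P_obs, discount)[0]
                               for b in beliefs])
        residual = np.max(np.abs(new_values - values))  # Bellman residual on grid
        values = new_values
        if residual < threshold:  # stop once residual < eps(1-lambda)/(2 lambda)
            return values
\end{verbatim}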
Since there are infinitely many belief states, value functions cannot be explicitly represented. Fortunately, the value functions that one encounters in the process of value iteration admit implicit finite representations. Before explaining why, we first introduce several technical concepts and notations.