To explain value iteration, we need to consider
how the belief state evolves over time.
Let *b* be the current belief state.
The belief state at the next point in time
is determined by the current belief state *b*,
the current action *a*, and the next
observation *z*.
We denote it by $b^a_z$.
For any state *s'*, $b^a_z(s')$ is given by

$$ b^a_z(s') = \frac{P(z|s',a) \sum_{s} P(s'|s,a)\, b(s)}{P(z|b,a)}, $$

where

$$ P(z|b,a) = \sum_{s'} P(z|s',a) \sum_{s} P(s'|s,a)\, b(s) $$

is the renormalization constant.
As the notation suggests, the constant $P(z|b,a)$
can also be interpreted as the probability of
observing *z* after taking action *a* in belief state *b*.
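The belief update above can be sketched in a few lines of NumPy. The transition and observation probabilities below are illustrative assumptions for a hypothetical 2-state, 1-action, 2-observation model, not values from the text:

```python
import numpy as np

# Hypothetical model (illustrative numbers only):
# T[a][s, s'] = P(s'|s,a), O[a][s', z] = P(z|s',a)
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3],
                  [0.4, 0.6]])}

def belief_update(b, a, z):
    """Return the next belief b^a_z and the renormalization
    constant P(z|b,a)."""
    # O(z|s',a) * sum_s P(s'|s,a) b(s), for every s'
    unnorm = O[a][:, z] * (b @ T[a])
    p_z = unnorm.sum()          # P(z|b,a)
    return unnorm / p_z, p_z

b = np.array([0.5, 0.5])
b_next, p_z = belief_update(b, a=0, z=0)
```

Note that `p_z` is exactly the constant $P(z|b,a)$: normalizing by it makes `b_next` a proper probability distribution over states.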

Define an operator *T* that takes a
value function *V* and returns another value function
*TV* as follows:

$$ TV(b) = \max_{a} \Big[ r(b,a) + \gamma \sum_{z} P(z|b,a)\, V(b^a_z) \Big], $$

where $r(b,a) = \sum_{s} b(s)\, r(s,a)$ is the
expected immediate reward for taking action *a* in
belief state *b*.
For a given value function *V*,
a policy $\pi$ is said to be *V-improving*
if

$$ \pi(b) \in \arg\max_{a} \Big[ r(b,a) + \gamma \sum_{z} P(z|b,a)\, V(b^a_z) \Big] \quad \text{for all beliefs } b. $$

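To make the backup operator concrete, here is a minimal sketch of computing *TV* and a *V*-improving action at a single belief point. The model arrays, discount factor, and the value function passed in are all illustrative assumptions, not from the text:

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP (illustrative):
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[a, s, s'] = P(s'|s,a)
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.7, 0.3], [0.4, 0.6]],     # O[a, s', z] = P(z|s',a)
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0], [0.2, 0.2]])      # R[a, s] = r(s, a)
gamma = 0.95

def backup(b, V):
    """Return (TV(b), a V-improving action at b).
    V is any function from beliefs to reals."""
    q = np.empty(T.shape[0])
    for a in range(T.shape[0]):
        q[a] = b @ R[a]                     # r(b, a)
        for z in range(O.shape[2]):
            unnorm = O[a][:, z] * (b @ T[a])
            p_z = unnorm.sum()              # P(z|b,a)
            if p_z > 0:
                q[a] += gamma * p_z * V(unnorm / p_z)
    return q.max(), int(q.argmax())

# Example with an arbitrary stand-in value function:
tv, a_star = backup(np.array([0.5, 0.5]), lambda b: b.max())
```

The maximizing index `a_star` is precisely the action a *V*-improving policy would prescribe at this belief.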
Value iteration is an algorithm for finding $\epsilon$-optimal policies. It starts with an initial value function $V_0$ and iterates using the following formula:

$$ V_{n+1} = T V_n. $$
It is known (e.g., Puterman 1990, Theorem 6.9) that
$V_n$ converges to $V^*$ as *n* goes to infinity.
Value iteration terminates when the *Bellman residual*
$\max_b |V_{n+1}(b) - V_n(b)|$
falls below $\epsilon(1-\gamma)/2\gamma$.
When it does,
a $V_{n+1}$-improving policy is $\epsilon$-optimal
(e.g., Puterman 1990).
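The iteration and its stopping rule can be sketched as follows. Since belief space is continuous, this sketch crudely discretizes it onto a grid of belief points (a real solver would use the finite representations discussed next); the model numbers are the same illustrative assumptions as above, not from the text:

```python
import numpy as np

# Hypothetical 2-state, 2-action, 2-observation POMDP (illustrative):
T = np.array([[[0.9, 0.1], [0.2, 0.8]],     # T[a, s, s'] = P(s'|s,a)
              [[0.5, 0.5], [0.5, 0.5]]])
O = np.array([[[0.7, 0.3], [0.4, 0.6]],     # O[a, s', z] = P(z|s',a)
              [[0.5, 0.5], [0.5, 0.5]]])
R = np.array([[1.0, 0.0], [0.2, 0.2]])      # R[a, s] = r(s, a)
gamma, eps = 0.9, 1e-3

grid = np.linspace(0.0, 1.0, 101)           # belief (p, 1-p) indexed by p

def V_of(values, b):
    # Piecewise-linear interpolation between grid values.
    return np.interp(b[0], grid, values)

V = np.zeros_like(grid)
while True:
    V_new = np.empty_like(V)
    for i, p in enumerate(grid):            # apply T at each grid belief
        b = np.array([p, 1.0 - p])
        q = []
        for a in range(2):
            val = b @ R[a]                  # r(b, a)
            for z in range(2):
                unnorm = O[a][:, z] * (b @ T[a])
                p_z = unnorm.sum()
                if p_z > 0:
                    val += gamma * p_z * V_of(V, unnorm / p_z)
            q.append(val)
        V_new[i] = max(q)                   # V_{n+1} = T V_n
    residual = np.abs(V_new - V).max()      # Bellman residual on the grid
    V = V_new
    if residual < eps * (1 - gamma) / (2 * gamma):
        break
```

On termination, reading off the maximizing action at each belief yields (an approximation to) a $V_{n+1}$-improving, hence $\epsilon$-optimal, policy.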

Since there are infinitely many belief states, value functions cannot be explicitly represented. Fortunately, the value functions that one encounters in the process of value iteration admit implicit finite representations. Before explaining why, we first introduce several technical concepts and notations.
