Since the current observation does not fully reveal the identity
of the current state, the agent
needs to consider all previous observations
and actions when choosing an action.
Information about the current state contained in the current
observation, previous observations, and previous actions
can be summarized by a probability distribution over the
state space (Åström 1965). The probability distribution is sometimes
called a *belief state* and denoted by *b*.
For any possible state *s*, *b*(*s*) is the probability that
the current state is *s*.
The set of all possible belief states is called
the *belief space*. We denote it by $\mathcal{B}$.
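
To make the definition concrete, the sketch below maintains such a distribution for a small finite state space. The model arrays `T` and `O`, the function name `update_belief`, and the numbers in the usage example are illustrative assumptions rather than anything taken from the text; the update rule shown is the standard Bayesian one for folding a new action and observation into a belief.

```python
import numpy as np

# Hypothetical finite model (not part of the text):
#   T[a][s, s'] = P(next state s' | current state s, action a)
#   O[a][s', o] = P(observation o | next state s',   action a)

def update_belief(b, a, o, T, O):
    """Fold one action a and one observation o into the belief b.

    b is a probability vector with b[s] = probability that the current
    state is s; the result is again a probability vector.
    """
    predicted = T[a].T @ b              # distribution over next states before seeing o
    unnormalized = O[a][:, o] * predicted
    return unnormalized / unnormalized.sum()

# Tiny usage example: 2 states, 1 action, 2 observations.
T = {0: np.array([[0.9, 0.1],
                  [0.2, 0.8]])}
O = {0: np.array([[0.7, 0.3],
                  [0.1, 0.9]])}
b = np.array([0.5, 0.5])                # uniform initial belief
b = update_belief(b, a=0, o=1, T=T, O=O)
print(b)                                # approximately [0.289, 0.711]
```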

A *policy* prescribes an action for each
possible belief state.
In other words, it is a mapping from the belief space $\mathcal{B}$
to the set of actions $\mathcal{A}$. Associated
with a policy $\pi$ is its *value function* $V^{\pi}$.
For each belief state *b*, $V^{\pi}(b)$ is the expected total discounted
reward that the agent receives by following the policy starting
from *b*, that is

$$V^{\pi}(b) = E\!\left[\,\sum_{t=0}^{\infty} \lambda^{t} r_{t} \;\Big|\; \pi, b\,\right],$$

where $r_{t}$ is the reward received at
time *t* and $\lambda$ ($0 \le \lambda < 1$) is the *discount factor*.
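
As an illustration of this definition, the following sketch estimates $V^{\pi}(b)$ by simulation. The function `simulate` is a hypothetical stand-in for an environment simulator (it is not defined in the text), and truncating the infinite sum at a finite horizon is an approximation whose error is bounded by the tail of the geometric series.

```python
def discounted_return(rewards, lam):
    """Truncated version of sum_t lam^t * r_t for one sampled run."""
    return sum((lam ** t) * r for t, r in enumerate(rewards))

def estimate_value(b, policy, simulate, lam=0.95, episodes=1000, horizon=200):
    """Monte Carlo estimate of V^pi(b).

    `simulate(b, policy, horizon)` is assumed to draw an initial state from
    the belief b, run the policy for `horizon` steps, and return the list of
    rewards received.  Averaging the truncated discounted returns over many
    runs approximates the expectation in the definition above.
    """
    returns = [discounted_return(simulate(b, policy, horizon), lam)
               for _ in range(episodes)]
    return sum(returns) / len(returns)
```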

It is known that there exists a policy $\pi^{*}$ such
that $V^{\pi^{*}}(b) \ge V^{\pi}(b)$ for any other policy $\pi$ and
any belief state *b* (Puterman 1990).
Such a policy is called an *optimal policy*.
The value function of an optimal policy is called the *optimal
value function*. We denote it by $V^{*}$.
For any positive number $\epsilon$, a policy $\pi$ is
*$\epsilon$-optimal*
if

$$V^{\pi}(b) \ge V^{*}(b) - \epsilon \quad \text{for every belief state } b.$$
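
As a small illustration, the fragment below tests this condition at a finite set of belief points. The callables `V_pi` and `V_star` (the policy's value function and the optimal value function) are hypothetical; since the belief space is continuous, checking sampled points can only falsify $\epsilon$-optimality, not establish it everywhere.

```python
def eps_optimal_at(beliefs, V_pi, V_star, eps):
    """Check V^pi(b) >= V*(b) - eps at each belief point in `beliefs`.

    Returns True if the policy meets the eps-optimality condition at every
    supplied point, False if some point violates it.
    """
    return all(V_pi(b) >= V_star(b) - eps for b in beliefs)
```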