Since the current observation does not fully reveal the identity of the current state, the agent needs to consider all previous observations and actions when choosing an action. Information about the current state contained in the current observation, previous observations, and previous actions can be summarized by a probability distribution over the state space (Åström 1965). The probability distribution is sometimes called a belief state and denoted by $b$. For any possible state $s$, $b(s)$ is the probability that the current state is $s$. The set of all possible belief states is called the belief space. We denote it by $\mathcal{B}$.
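As a minimal sketch of how such a belief state can be maintained from the history, the Python fragment below applies Bayes' rule after one action and one observation. The names `update_belief`, `T`, and `O`, as well as the two-state toy model, are illustrative assumptions and do not come from the text; `T[a][s][s2]` is assumed to hold the probability of moving from state `s` to state `s2` under action `a`, and `O[a][s2][z]` the probability of observing `z` after arriving in `s2`.

```python
def update_belief(belief, action, observation, T, O):
    """Return the belief state after taking `action` and seeing `observation`.

    belief      -- list of probabilities b(s), one entry per state
    T[a][s][s2] -- assumed transition model: P(next state s2 | state s, action a)
    O[a][s2][z] -- assumed observation model: P(observation z | next state s2, action a)
    """
    n = len(belief)
    # Unnormalized posterior: observation likelihood times predicted state probability.
    unnormalized = [
        O[action][s2][observation]
        * sum(T[action][s][s2] * belief[s] for s in range(n))
        for s2 in range(n)
    ]
    total = sum(unnormalized)
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    # Normalize so the result is again a probability distribution over states.
    return [p / total for p in unnormalized]


# Hypothetical toy model: two states, one action (index 0), two observations.
T = [[[0.9, 0.1], [0.2, 0.8]]]
O = [[[0.8, 0.2], [0.3, 0.7]]]

b0 = [0.5, 0.5]  # uniform initial belief
b1 = update_belief(b0, action=0, observation=0, T=T, O=O)
```

Because the updated belief depends on the history only through the previous belief, the latest action, and the latest observation, the agent does not need to store the full sequence of past observations and actions.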
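A policy prescribes an action for each possible belief state. In other words, it is a mapping from the belief space $\mathcal{B}$ to the set of actions $\mathcal{A}$. Associated with a policy $\pi$ is its value function $V^{\pi}$. For each belief state $b$, $V^{\pi}(b)$ is the expected total discounted reward that the agent receives by following the policy starting from $b$, that is,
$$V^{\pi}(b) = E_{\pi,b}\left[\sum_{t=0}^{\infty} \lambda^{t} r_t\right],$$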
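where $r_t$ is the reward received at time $t$ and $\lambda$ ($0 \le \lambda < 1$) is the discount factor. It is known that there exists a policy $\pi^{*}$ such that $V^{\pi^{*}}(b) \ge V^{\pi}(b)$ for any other policy $\pi$ and any belief state $b$ (Puterman 1990). Such a policy is called an optimal policy. The value function of an optimal policy is called the optimal value function. We denote it by $V^{*}$. For any positive number $\epsilon$, a policy $\pi$ is $\epsilon$-optimal if $V^{\pi}(b) + \epsilon \ge V^{*}(b)$ for every belief state $b \in \mathcal{B}$.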