A decision process with non-Markovian rewards is identical to an MDP
except that the domain of the reward function is . The idea is
that if the process has passed through state sequence up
to stage , then the reward is received at stage
. Figure 1 gives an example. Like the reward
function, a policy for an NMRDP depends on history, and is a mapping
from to . As before, the value of policy is the
expectation of the discounted cumulative reward over an infinite
horizon:
For a decision process
and a state , we let
stand for the set of state sequences rooted at
that are feasible under the actions in , that is:
. Note that the definition of
does not depend on and therefore applies
to both MDPs and NMRDPs.
Figure 1:
A Simple NMRDP
|
Sylvie Thiebaux
2006-01-20