A decision process with non-Markovian rewards is identical to an MDP except that the domain of the reward function is $S^{*}$. The idea is that if the process has passed through state sequence $\Gamma(i)$ up to stage $i$, then the reward $R(\Gamma(i))$ is received at stage $i$. Figure 1 gives an example. Like the reward function, a policy for an NMRDP depends on history, and is a mapping from $S^{*}$ to $A$. As before, the value of policy $\pi$ is the expectation of the discounted cumulative reward over an infinite horizon:

\begin{displaymath}
V_{\pi}(s_{0}) = \lim_{n\rightarrow\infty} E\bigg[\sum_{i=0}^{n} \beta^{i} R(\Gamma(i)) \;\bigg\vert\; \pi, \Gamma_{0} = s_{0}\bigg]
\end{displaymath}

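As a concrete illustration of this formula, the following Python sketch (not from the paper; names such as \texttt{estimate\_value}, \texttt{transition}, and \texttt{policy} are hypothetical) estimates $V_{\pi}(s_{0})$ by truncated Monte Carlo simulation. The point to note is that both the policy and the reward are applied to the history prefix $\Gamma(i)$, not merely to the current state.

\begin{verbatim}
import random

def estimate_value(s0, transition, reward, policy,
                   beta=0.95, horizon=200, episodes=1000):
    """Estimate V_pi(s0) = E[ sum_i beta^i R(Gamma(i)) | pi, Gamma_0 = s0 ].

    transition(s, a) -> list of (next_state, probability) pairs
    reward(history)  -> reward received for the state sequence seen so far
    policy(history)  -> action chosen given the history (a mapping S* -> A)

    The infinite horizon is truncated at `horizon`; with beta < 1 the
    neglected tail is geometrically small.
    """
    total = 0.0
    for _ in range(episodes):
        history = [s0]
        ret = 0.0
        for i in range(horizon):
            # Reward at stage i depends on the whole prefix Gamma(i).
            ret += (beta ** i) * reward(tuple(history))
            # The policy is likewise a function of the history, not the state.
            a = policy(tuple(history))
            nexts, probs = zip(*transition(history[-1], a))
            history.append(random.choices(nexts, weights=probs)[0])
        total += ret
    return total / episodes
\end{verbatim}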
For a decision process $D=\langle S,s_0,A,\Pr,R\rangle$ and a state $s\in S$, we let $\widetilde{D}(s)$ stand for the set of state sequences rooted at $s$ that are feasible under the actions in $D$, that is: $\widetilde{D}(s) = \{ \Gamma\in S^\omega \mid \Gamma_0 = s \mbox{ and } \forall i\ \exists a \in A(\Gamma_i)\ \Pr(\Gamma_{i},a,\Gamma_{i+1}) > 0 \}$. Note that the definition of $\widetilde{D}(s)$ does not depend on $R$ and therefore applies to both MDPs and NMRDPs.
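To connect this definition to code, here is a small Python sketch (again with hypothetical interfaces, not part of the paper) that checks whether a given finite prefix could be extended to some $\Gamma \in \widetilde{D}(s)$: every consecutive pair of states must be linked by at least one applicable action with positive transition probability. As in the definition, the reward function $R$ plays no role.

\begin{verbatim}
def feasible_prefix(prefix, applicable, transition):
    """prefix          : list of states [Gamma_0, Gamma_1, ...]
       applicable(s)   -> iterable of actions A(s) applicable in state s
       transition(s, a)-> list of (next_state, probability) pairs
    """
    for s, s_next in zip(prefix, prefix[1:]):
        # Some action in A(s) must reach s_next with positive probability.
        if not any(p > 0 and t == s_next
                   for a in applicable(s)
                   for t, p in transition(s, a)):
            return False
    return True
\end{verbatim}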
Figure 1: A Simple NMRDP
