NMRDPs

A decision process with non-Markovian rewards is identical to an MDP except that the domain of the reward function is $S^{*}$. The idea is that if the process has passed through the state sequence $\Gamma(i) = \langle \Gamma_0,\ldots,\Gamma_i\rangle$ up to stage $i$, then the reward $R(\Gamma(i))$ is received at stage $i$. Figure 1 gives an example. Like the reward function, a policy for an NMRDP depends on history, and is a mapping from $S^{*}$ to $A$. As before, the value of a policy $\pi$ is the expectation of the discounted cumulative reward over an infinite horizon:

\begin{displaymath}
V_{\pi}(s_{0}) = \lim_{n\rightarrow\infty} \mathrm{E}\bigg[\sum_{i=0}^{n} \beta^{i}\, R(\Gamma(i)) \;\bigg|\; \pi, \Gamma_{0} = s_{0}\bigg]
\end{displaymath}
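
To make this definition concrete, the following Python sketch estimates $V_{\pi}(s_0)$ by Monte Carlo simulation of truncated trajectories. The two-state process, its transition probabilities, the history-dependent reward, and the constant policy are illustrative assumptions, not the example of Figure 1.

\begin{verbatim}
import random

# Hypothetical two-state NMRDP used only for illustration: the states,
# transition probabilities Pr(s,a,s'), reward, and policy are assumptions.
PR = {("s0", "a"): {"s0": 0.5, "s1": 0.5},
      ("s1", "a"): {"s0": 1.0}}
BETA = 0.9  # discount factor

def reward(history):
    """Non-Markovian reward R(Gamma(i)): here, 1.0 the first time s1 is
    reached along the history, and 0.0 otherwise."""
    return 1.0 if history[-1] == "s1" and "s1" not in history[:-1] else 0.0

def policy(history):
    """A policy maps finite histories in S* to actions; this one is constant."""
    return "a"

def estimate_value(s0, n_stages=200, n_runs=5000):
    """Monte Carlo estimate of V_pi(s0): average the discounted cumulative
    reward over truncated sampled trajectories."""
    total = 0.0
    for _ in range(n_runs):
        history = [s0]
        ret = 0.0
        for i in range(n_stages):
            ret += (BETA ** i) * reward(history)       # reward for prefix Gamma(i)
            dist = PR[(history[-1], policy(history))]  # Pr(Gamma_i, a, .)
            states, probs = zip(*dist.items())
            history.append(random.choices(states, weights=probs)[0])
        total += ret
    return total / n_runs

print(estimate_value("s0"))
\end{verbatim}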

For a decision process $D=\langle S,s_0,A,\Pr,R\rangle$ and a state $s\in S$, we let $\widetilde{D}(s)$ stand for the set of state sequences rooted at $s$ that are feasible under the actions in $D$, that is: $\widetilde{D}(s) = \{ \Gamma\in S^\omega \mid \Gamma_0 = s \mbox{~and~} \forall i~\exists a \in A(\Gamma_i)~~\Pr(\Gamma_{i},a,\Gamma_{i+1}) > 0 \}$. Note that the definition of $\widetilde{D}(s)$ does not depend on $R$ and therefore applies to both MDPs and NMRDPs.
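
As a small illustrative check of this condition, restricted to finite prefixes and again using an assumed transition table and applicable-action map, one can test whether a finite state sequence could be an initial segment of some $\Gamma \in \widetilde{D}(s)$:

\begin{verbatim}
# Sketch of the feasibility condition defining D~(s), restricted to finite
# prefixes: each consecutive pair (Gamma_i, Gamma_{i+1}) must be linked by
# some applicable action with positive probability. PR and APPLICABLE (the
# map A(s) of applicable actions) are assumed for illustration.
PR = {("s0", "a"): {"s0": 0.5, "s1": 0.5},
      ("s1", "a"): {"s0": 1.0}}
APPLICABLE = {"s0": ["a"], "s1": ["a"]}

def feasible_prefix(prefix):
    """True iff the finite sequence could start some Gamma in D~(prefix[0])."""
    for cur, nxt in zip(prefix, prefix[1:]):
        if not any(PR.get((cur, a), {}).get(nxt, 0.0) > 0.0
                   for a in APPLICABLE[cur]):
            return False
    return True

print(feasible_prefix(["s0", "s0", "s1", "s0"]))  # True under the table above
print(feasible_prefix(["s0", "s1", "s1"]))        # False: s1 never loops to s1
\end{verbatim}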
\begin{figure}\centering
% [picture content lost in extraction: diagram of a simple two-state NMRDP]
\caption{A Simple NMRDP}
\end{figure}

