NMRDPs

A decision process with non-Markovian rewards is identical to an MDP except that the domain of the reward function is $S^{*}$. The idea is that if the process has passed through the state sequence $\Gamma(i) = \langle \Gamma_0,\ldots,\Gamma_i\rangle$ up to stage $i$, then the reward $R(\Gamma(i))$ is received at stage $i$. Figure 1 gives an example. Like the reward function, a policy for an NMRDP depends on history, and is a mapping from $S^{*}$ to $A$. As before, the value of a policy $\pi$ is the expectation of the discounted cumulative reward over an infinite horizon:

\begin{displaymath}
V_{\pi}(s_{0}) = \lim_{n\rightarrow\infty} \mathrm{E}\bigg[\sum_{i=0}^{n} \beta^{i}\, R(\Gamma(i)) \;\bigg|\; \pi, \Gamma_{0} = s_{0}\bigg]
\end{displaymath}
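
To make this definition concrete, the following Python sketch estimates $V_{\pi}(s_0)$ by Monte Carlo simulation of truncated trajectories. The two-state process, its transition probabilities, the history-dependent reward, and the constant policy are illustrative assumptions, not the example of Figure 1.

\begin{verbatim}
import random

# Hypothetical two-state NMRDP used only for illustration: the states,
# transition probabilities Pr(s,a,s'), reward, and policy are assumptions.
PR = {("s0", "a"): {"s0": 0.5, "s1": 0.5},
      ("s1", "a"): {"s0": 1.0}}
BETA = 0.9  # discount factor

def reward(history):
    """Non-Markovian reward R(Gamma(i)): here, 1.0 the first time s1 is
    reached along the history, and 0.0 otherwise."""
    return 1.0 if history[-1] == "s1" and "s1" not in history[:-1] else 0.0

def policy(history):
    """A policy maps finite histories in S* to actions; this one is constant."""
    return "a"

def estimate_value(s0, n_stages=200, n_runs=5000):
    """Monte Carlo estimate of V_pi(s0): average the discounted cumulative
    reward over truncated sampled trajectories."""
    total = 0.0
    for _ in range(n_runs):
        history = [s0]
        ret = 0.0
        for i in range(n_stages):
            ret += (BETA ** i) * reward(history)       # reward for prefix Gamma(i)
            dist = PR[(history[-1], policy(history))]  # Pr(Gamma_i, a, .)
            states, probs = zip(*dist.items())
            history.append(random.choices(states, weights=probs)[0])
        total += ret
    return total / n_runs

print(estimate_value("s0"))
\end{verbatim}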

For a decision process $D=\langle S,s_0,A,\Pr,R\rangle$ and a state $s\in S$, we let $\widetilde{D}(s)$ stand for the set of state sequences rooted at $s$ that are feasible under the actions in $D$, that is: $\widetilde{D}(s) = \{ \Gamma\in S^\omega \mid \Gamma_0 = s \mbox{~and~} \forall i~\exists a \in A(\Gamma_i)~~\Pr(\Gamma_{i},a,\Gamma_{i+1}) > 0 \}$. Note that the definition of $\widetilde{D}(s)$ does not depend on $R$ and therefore applies to both MDPs and NMRDPs.
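
As a small illustrative check of this condition, restricted to finite prefixes and again using an assumed transition table and applicable-action map, one can test whether a finite state sequence could be an initial segment of some $\Gamma \in \widetilde{D}(s)$:

\begin{verbatim}
# Sketch of the feasibility condition defining D~(s), restricted to finite
# prefixes: each consecutive pair (Gamma_i, Gamma_{i+1}) must be linked by
# some applicable action with positive probability. PR and APPLICABLE (the
# map A(s) of applicable actions) are assumed for illustration.
PR = {("s0", "a"): {"s0": 0.5, "s1": 0.5},
      ("s1", "a"): {"s0": 1.0}}
APPLICABLE = {"s0": ["a"], "s1": ["a"]}

def feasible_prefix(prefix):
    """True iff the finite sequence could start some Gamma in D~(prefix[0])."""
    for cur, nxt in zip(prefix, prefix[1:]):
        if not any(PR.get((cur, a), {}).get(nxt, 0.0) > 0.0
                   for a in APPLICABLE[cur]):
            return False
    return True

print(feasible_prefix(["s0", "s0", "s1", "s0"]))  # True under the table above
print(feasible_prefix(["s0", "s1", "s1"]))        # False: s1 never loops to s1
\end{verbatim}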
\begin{figure}\centering
% [picture content lost in extraction: diagram of a simple two-state NMRDP]
\caption{A Simple NMRDP}
\end{figure}

