Translation Into MDP

We now exploit the compact representation of a non-Markovian reward function as a reward function specification to translate an NMRDP into an equivalent MDP amenable to state-based anytime solution methods. Recall from Section 2 that each e-state in the MDP is labelled by a state of the NMRDP and by history information sufficient to determine the immediate reward. In the case of a compact representation as a reward function specification $\phi_0$, this additional information can be summarised by the progression of $\phi_0$ through the sequence of states passed through. So an e-state will be of the form $\langle s, \phi\rangle$, where $s\in S$ is a state, and $\phi \subseteq \mbox{\mbox{\$}FLTL}\ \times \mbox{I$\!$R}$ is a reward function specification (obtained by progression). Two e-states $\langle s, \phi\rangle$ and $\langle t,\psi \rangle$ are equal if $s=t$, the immediate rewards are the same, and the results of progressing $\phi$ and $\psi$ through $s$ are semantically equivalent.8

Definition 5   Let $D=\langle S,s_{0},A,\Pr ,R\rangle$ be an NMRDP, and $\phi_{0}$ be a reward function specification representing $R$ (i.e., $R_{\phi_0} = R$, see Definition 4). We translate $D$ into the MDP $D'=\langle S',s'_{0},A',\Pr',R'\rangle$ defined as follows:
  1. $S' \subseteq S\times 2^{\mbox{\mbox{\$}FLTL}\ \times \mbox{I$\!$R}}$
  2. $s'_{0} = \langle s_{0},\phi_0 \rangle$
  3. $A'(\langle s,\phi \rangle) = A(s)$
  4. If $a\in A'(\langle s, \phi\rangle)$, then $\Pr'(\langle
s,\phi\rangle,a,\langle s', \phi'\rangle) =$ $\left\{ \begin{array}{ll}
\Pr(s,a,s') & \mbox{if~} \phi' = \mbox{{\em R}\mbox{\em Prog}}(s,\phi)\\ 0 &
\end{array} \right.$
    If $a\not\in A'(\langle s, \phi\rangle)$, then $\Pr'(\langle
s,\phi\rangle,a,\bullet)$ is undefined
  5. $R'(\langle s,\phi\rangle) = {\displaystyle \sum_{(f:r) \in \phi}
\{r \mid \mbox{\em Rew}(s,f)\}}$
  6. For all $s' \in S'$, $s'$ is reachable under $A'$ from $s'_0$.

Item 1 says that the e-states are labelled by a state and a reward function specification. Item 2 says that the initial e-state is labelled with the initial state and with the original reward function specification. Item 3 says that an action is applicable in an e-state if it is applicable in the state labelling it. Item 4 explains how successor e-states and their probabilities are computed. Given an action $a$ applicable in an e-state $\langle s, \phi\rangle$, each successor e-state will be labelled by a successor state $s'$ of $s$ via $a$ in the NMRDP and by the progression of $\phi$ through $s$. The probability of that e-state is $\Pr(s,a,s')$ as in the NMRDP. Note that the cost of computing $\Pr'$ is linear in that of computing $\Pr $ and in the sum of the lengths of the formulae in $\phi$. Item 5 has been motivated before (see Section 3.6). Finally, since items 1-5 leave open the choice of many MDPs differing only in the unreachable states they contain, item 6 excludes all such irrelevant extensions. It is easy to show that this translation leads to an equivalent MDP, as defined in Definition 1. Obviously, the function $\tau$ required for Definition1 is given by $\tau(\langle s,\phi\rangle) = s$, and then the proof is a matter of checking conditions. In our practical implementation, the labelling is one step ahead of that in the definition: we label the initial e-state with $\mbox{R}\mbox{Prog}(s_0,\phi_0)$ and compute the current reward and the current reward specification label by progression of predecessor reward specifications through the current state rather than through the predecessor states. As will be apparent below, this has the potential to reduce the number of states in the generated MDP. Figure 7 shows the equivalent MDP produced for the $FLTL version of our NMRDP example in Figure 3. Recall that for this example, the PLTL reward formula was $q \wedge \circleddash \circleddash p$. In $FLTL, the allocation of rewards is described by $\mbox{$\Box$}((p \wedge
\raisebox{0.6mm}{$\scriptstyle \bigcirc$}\raisebox{0.6m...}{$\scriptstyle \bigcirc$}\raisebox{0.6mm}{$\scriptstyle \bigcirc$}\mbox{\$})$. The figure also shows the relevant formulae labelling the e-states, obtained by progression of this reward formula. Note that without progressing one step ahead, there would be 3 e-states with state $\{p\}$ on the left-hand side, labelled with $\{f_1\}$, $\{f_1,f_2\}$, and $\{f_1,f_2,f_3\}$, respectively.
Figure 7: Equivalent MDP Produced by FLTL
$\textstyle \parbox{0.55\textwidth}{\includegraphics[height=0.4\textheight]{figures/pqfltl}}$$\textstyle \parbox{0.4\textwidth}{\footnotesize
The following formulae lab...
....6mm}{$\scriptstyle \bigcirc$}\mbox{\$}$\\
$f_3: q \rightarrow \mbox{\$}$\\
Sylvie Thiebaux 2006-01-20