The PLTLSTR translation can be seen as a symbolic version of PLTLSIM. The set $T$ of added temporal variables contains the purely temporal subformulae $\mbox{PTSub}(F)$ of the reward formulae in $F$, to which the $\circleddash $ modality is prepended (unless already there): $T =
\{ \circleddash \psi \mid \psi \in \mbox{PTSub}(F), \psi \neq \circleddash \psi' \mbox{ for any } \psi'\} \cup
\{\circleddash \psi \mid \circleddash \psi \in \mbox{PTSub}(F)\}$. By repeatedly applying the equivalence $f_1 \mathbin{\mbox{\sf S}}f_2 \equiv f_2 \vee (f_1 \wedge \circleddash (f_1
\mathbin{\mbox{\sf S}}f_2))$ to any subformula in $\mbox{PTSub}(F)$, we can express its current value, and hence that of the reward formulae, as a function of the current values of the formulae in $T$ and of the state variables, as required by the compact representation of the transition and reward models. For our NMRDP example in Figure 3, the set of purely temporal subformulae is $\mbox{PTSub}(F) = \{\circleddash p,\circleddash \circleddash p\}$, and $T$ is identical to $\mbox{PTSub}(F)$. Figure 6 shows some of the ADDs forming part of the symbolic MDP produced by PLTLSTR: the ADDs describing the dynamics of the temporal variables, i.e., the effects of the actions $a$ and $b$ on their respective values, and the ADD describing the reward.
Figure 6: ADDs Produced by PLTLSTR. prv (previously) stands for $\circleddash $
\includegraphics[height=0.12\textheight]{figures/pqpltlstructprvp} \includegraphics[height=0.12\textheight]{figures/pqpltlstructprvprvp} \includegraphics[height=0.2\textheight]{figures/pqpltlstructreward}
1. dynamics of $\circleddash p$ 2. dynamics of $\circleddash \circleddash p$ 3. reward
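The construction of $T$ and the expansion of reward formulae via the $\mathbin{\mbox{\sf S}}$ equivalence can be sketched as follows. This is a minimal illustration under our own tuple encoding of formulae, not the PLTLSTR implementation:

```python
# Minimal sketch (our own encoding, not the PLTLSTR code) of building the
# temporal variables T and expanding a reward formula with the equivalence
#   f1 S f2  ==  f2 or (f1 and prev(f1 S f2)).
# Formulae are nested tuples: ('atom', x), ('true',), ('not', f),
# ('and', f, g), ('or', f, g), ('prev', f), ('since', f1, f2).

def temporal_vars(ptsub):
    """T: prepend 'prev' to each purely temporal subformula,
    unless its main connective is already 'prev'."""
    return {f if f[0] == 'prev' else ('prev', f) for f in ptsub}

def expand(f):
    """Rewrite f so that temporal subformulae occur only under a
    leading 'prev', i.e. only as members of T."""
    op = f[0]
    if op == 'since':
        f1, f2 = f[1], f[2]
        # f1 S f2 == f2 or (f1 and prev(f1 S f2))
        return ('or', expand(f2), ('and', expand(f1), ('prev', f)))
    if op == 'not':
        return ('not', expand(f[1]))
    if op in ('and', 'or'):
        return (op, expand(f[1]), expand(f[2]))
    return f  # atoms, 'true', and 'prev ...' (already a variable in T)

# The example of Figure 3: PTSub(F) = {prev p, prev prev p}, and T = PTSub(F)
# because both formulae already have 'prev' as their main connective.
p = ('atom', 'p')
ptsub = {('prev', p), ('prev', ('prev', p))}
print(temporal_vars(ptsub) == ptsub)
```

Applying `expand` to a formula $f_1 \mathbin{\mbox{\sf S}} f_2$ repeatedly unrolls the equivalence, so the result mentions only state variables and `('prev', ...)` terms, exactly the function of $T$ and state variables required by the compact model.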
As a more complex illustration, consider this example [3] in which

\begin{displaymath}F = \{\top \mathbin{\mbox{\sf S}}\ (p \mathbin{\mbox{\sf S}}\ (q \vee \circleddash r))\}\end{displaymath}

We have that

\begin{displaymath}\mbox{PTSub}(F) = \{\top \mathbin{\mbox{\sf S}}\ (p \mathbin{\mbox{\sf S}}\ (q \vee \circleddash r)),\ p \mathbin{\mbox{\sf S}}\ (q \vee \circleddash r),\ \circleddash r\}\end{displaymath}

and so the set of temporal variables used is

\begin{displaymath}T = \{ t_1: \circleddash (\top \mathbin{\mbox{\sf S}}\ (p \mathbin{\mbox{\sf S}}\ (q \vee \circleddash r))),\ t_2: \circleddash (p \mathbin{\mbox{\sf S}}\ (q \vee \circleddash r)),\ t_3: \circleddash r\}\end{displaymath}

Using the equivalences, the reward can be decomposed and expressed by means of the propositions $p,q$ and the temporal variables $t_1, t_2, t_3$ as follows:

\begin{displaymath}\top \mathbin{\mbox{\sf S}}\ (p \mathbin{\mbox{\sf S}}\ (q \vee \circleddash r)) \;\equiv\; (q \vee t_3) \vee (p \wedge t_2) \vee t_1\end{displaymath}
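The decomposition can be checked against the direct PLTL semantics by exhaustive simulation. The following is a self-contained sketch under our own encoding of state sequences, not the paper's code:

```python
from itertools import product

# Check that  (q or t3) or (p and t2) or t1  computes the reward formula
# T S (p S (q or prev r)) incrementally along any state sequence, where
# t1, t2, t3 carry the previous-step values of the three formulae in T.

def since(f1, f2, i):
    """f1 S f2 at step i: f2 held at some j <= i, f1 at all k, j < k <= i."""
    for j in range(i, -1, -1):
        if f2[j]:
            return True
        if not f1[j]:
            return False
    return False

def direct_reward(states, i):
    """Evaluate T S (p S (q or prev r)) at step i from the semantics."""
    ps = [s["p"] for s in states]
    prev_r = [False] + [s["r"] for s in states[:-1]]
    f2 = [s["q"] or prev_r[j] for j, s in enumerate(states)]
    inner = [since(ps, f2, j) for j in range(len(states))]
    return any(inner[: i + 1])          # T S inner

def incremental_rewards(states):
    """Evaluate the same formula via the temporal variables t1, t2, t3."""
    t1 = t2 = t3 = False                # prev-formulae are false initially
    out = []
    for s in states:
        inner = (s["q"] or t3) or (s["p"] and t2)   # p S (q or prev r) now
        rew = inner or t1                           # T S (...) now
        out.append(rew)
        t1, t2, t3 = rew, inner, s["r"]             # shift to the next step
    return out

# Exhaustive comparison over all 2^12 sequences of length 4.
for bits in product([False, True], repeat=12):
    states = [dict(p=bits[3*i], q=bits[3*i+1], r=bits[3*i+2]) for i in range(4)]
    inc = incremental_rewards(states)
    assert all(inc[i] == direct_reward(states, i) for i in range(4))
print("decomposition agrees with the direct PLTL semantics")
```

The update `t1, t2, t3 = rew, inner, s["r"]` is precisely the dynamics of the temporal variables that the ADDs of the symbolic MDP encode: the next value of each $\circleddash \psi$ is the current value of $\psi$, expressed over $T$ and the state variables.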

As with PLTLSIM, the underlying MDP produced by PLTLSTR is far from minimal: the encoded history features do not even vary from one state to the next. However, size is not as problematic as with state-based approaches, because structured solution methods do not enumerate states and are able to dynamically ignore some of the variables that become irrelevant during policy construction. For instance, when solving the MDP, they may be able to determine that some temporal variables have become irrelevant because the situation they track, although possible in principle, is too costly to be realised under a good policy. This dynamic analysis of rewards contrasts with PLTLMIN's static analysis [2], which must encode enough history to determine the reward at all reachable future states under any policy. One question that arises is that of the circumstances under which this analysis of irrelevance by structured solution methods, especially its dynamic aspects, is really effective. This is another question our experimental analysis will try to address.
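The kind of irrelevance at stake can be illustrated on the reward expression of the running example: once the solver establishes that some variables can never hold in the states it still considers, the reward ceases to depend on others. The sketch below makes this concrete by brute-force enumeration; the helper `relevant_vars` is hypothetical and for illustration only, whereas structured methods obtain the same effect symbolically on the ADDs:

```python
from itertools import product

# Reward of the example [3], over propositions p, q and temporal
# variables t1, t2, t3 (see the decomposition in the text):
def reward(p, q, t1, t2, t3):
    return (q or t3) or (p and t2) or t1

def relevant_vars(f, names, constraint):
    """Hypothetical helper: the variables f still depends on, over the
    assignments satisfying `constraint` (a predicate on dicts)."""
    deps = set()
    for bits in product([False, True], repeat=len(names)):
        a = dict(zip(names, bits))
        if not constraint(a):
            continue
        for v in names:
            b = dict(a)
            b[v] = not b[v]
            if constraint(b) and f(**a) != f(**b):
                deps.add(v)
    return deps

names = ["p", "q", "t1", "t2", "t3"]

# Unconstrained, the reward depends on every variable.
print(relevant_vars(reward, names, lambda a: True))

# If the states still under consideration never satisfy q, and r never
# held previously (so t3 stays false), the reward reduces to
# (p and t2) or t1: q and t3 have become irrelevant.
print(relevant_vars(reward, names, lambda a: not a["q"] and not a["t3"]))
```

This is only a crude model of the phenomenon: it checks dependence under a fixed reachability constraint, while a structured solver discovers such constraints, including cost-based ones, in the course of policy construction.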
Sylvie Thiebaux 2006-01-20