Language and Semantics

Compactly representing non-Markovian reward functions reduces to compactly representing the behaviours of interest, where by behaviour we mean a set of finite sequences of states (a subset of $S^{*}$), e.g. the set $\{\langle s_0,s_1\rangle, \langle s_0,s_0,s_1\rangle, \langle s_0,s_0,s_0,s_1\rangle, \ldots\}$ in Figure 1. Recall that the reward is issued at the end of any prefix $\Gamma(i)$ in that set. Once behaviours are compactly represented, it is straightforward to represent non-Markovian reward functions as mappings from behaviours to real numbers; we defer looking at this until Section 3.6. To represent behaviours compactly, we adopt a version of future linear temporal logic (FLTL) (see [20]), augmented with a propositional constant `\$', intended to be read `The behaviour we want to reward has just happened' or `The reward is received now'. The language \$FLTL begins with a set of basic propositions ${\cal P}$ giving rise to literals:

\[
{\cal L} ::= {\cal P} \;\mid\; \neg {\cal P} \;\mid\; \top \;\mid\; \bot \;\mid\; \$
\]

where $\top$ and $\bot$ stand for `true' and `false', respectively. The connectives are classical $\wedge$ and $\vee$, and the temporal modalities $\bigcirc$ (next) and $\mathbin{\mbox{\sf U}}$ (weak until), giving formulae:

\[
{\cal F} ::= {\cal L} \;\mid\; {\cal F}\wedge{\cal F} \;\mid\; {\cal F}\vee{\cal F} \;\mid\; \bigcirc\,{\cal F} \;\mid\; {\cal F} \mathbin{\mbox{\sf U}} {\cal F}
\]

Our `until' is weak: $f_1 \mathbin{\mbox{\sf U}} f_2$ means $f_{1}$ will be true from now on until $f_{2}$ is, if ever. Unlike the more commonly used strong `until', this does not imply that $f_2$ will eventually be true. It allows us to define the useful operator $\Box$ (always): $\Box f \equiv f \mathbin{\mbox{\sf U}} \bot$ ($f$ will always be true from now on). We also adopt the notations $\bigcirc^{k} f$ for $k$ iterations of the $\bigcirc$ modality ($f$ will be true in exactly $k$ steps), $\Diamond_{k} f$ for $\bigvee_{i=1}^{k} \bigcirc^{i} f$ ($f$ will be true within the next $k$ steps), and $\Box_{k} f$ for $\bigwedge_{i=1}^{k} \bigcirc^{i} f$ ($f$ will be true throughout the next $k$ steps).

Although negation officially occurs only in literals, i.e., the formulae are in negation normal form (NNF), we allow ourselves to write formulae involving it in the usual way, provided that they have an equivalent in NNF. Not every formula has such an equivalent, because there is no such literal as $\neg\$$ and because eventualities (`$f$ will be true some time') are not expressible. These restrictions are deliberate. If we were to use our notation and logic to theorise about the allocation of rewards, we would indeed need the means to say when rewards are not received, or to express features such as liveness (`always, there will be a reward eventually'); but in fact we are using them only as a mechanism for ensuring that rewards are given where they should be, and for this restricted purpose eventualities and the negated dollar are not needed. Indeed, including them would create technical difficulties in relating formulae to the behaviours they represent.

The semantics of this language is similar to that of FLTL, with an important difference: because the interpretation of the constant $\$$ depends on the behaviour $B$ we want to reward (whatever that is), the modelling relation $\models$ must be indexed by $B$.
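As an executable illustration (ours, not the paper's), the derived operators $\bigcirc^{k}$, $\Diamond_{k}$ and $\Box_{k}$ can be expanded syntactically in Python, representing formulae as nested tuples; the constructor names `Next`, `next_k`, `diamond_k` and `box_k` are our own conventions.

```python
# A minimal sketch of $FLTL formulae as nested tuples, with the derived
# operators expanded syntactically exactly as in the definitions above.
# Atomic propositions are plain strings.

def Next(f):
    """O f : f is true at the next stage."""
    return ('next', f)

def And(*fs):
    """n-ary conjunction."""
    return ('and',) + fs

def Or(*fs):
    """n-ary disjunction."""
    return ('or',) + fs

def next_k(k, f):
    """O^k f : f is true in exactly k steps (k iterations of 'next')."""
    for _ in range(k):
        f = Next(f)
    return f

def diamond_k(k, f):
    """Diamond_k f = OR_{i=1..k} O^i f : f is true within the next k steps."""
    return Or(*(next_k(i, f) for i in range(1, k + 1)))

def box_k(k, f):
    """Box_k f = AND_{i=1..k} O^i f : f is true throughout the next k steps."""
    return And(*(next_k(i, f) for i in range(1, k + 1)))
```

For example, `diamond_k(2, 'p')` expands to `('or', ('next', 'p'), ('next', ('next', 'p')))`, mirroring $\Diamond_{2}\,p = \bigcirc p \vee \bigcirc\bigcirc p$.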
We therefore write $(\Gamma,i)\models_B f$ to mean that formula $f$ holds at the $i$-th stage of an arbitrary sequence $\Gamma\in S^{\omega}$, relative to behaviour $B$. Defining $\models_B$ is the first step in our description of the semantics:

$(\Gamma,i)\models_B \$$ iff $\Gamma(i)\in B$

$(\Gamma,i)\models_B \top$

$(\Gamma,i)\not\models_B \bot$

$(\Gamma,i)\models_B p$, for $p\in {\cal P}$, iff $p\in\Gamma_i$

$(\Gamma,i)\models_B \neg p$, for $p\in {\cal P}$, iff $p\not\in\Gamma_i$

$(\Gamma,i)\models_B f_1 \wedge f_2$ iff $(\Gamma,i)\models_B f_1$ and $(\Gamma,i)\models_B f_2$

$(\Gamma,i)\models_B f_1 \vee f_2$ iff $(\Gamma,i)\models_B f_1$ or $(\Gamma,i)\models_B f_2$

$(\Gamma,i)\models_B \bigcirc f$ iff $(\Gamma,i+1)\models_B f$

$(\Gamma,i)\models_B f_1 \mathbin{\mbox{\sf U}} f_2$ iff for all $k\geq i$, if (for all $j$, $i\leq j\leq k$, $(\Gamma,j)\not\models_B f_2$) then $(\Gamma,k)\models_B f_1$

Note that except for the subscript $B$ and the first rule, this is just the standard FLTL semantics, and therefore \$-free formulae keep their FLTL meaning. As with FLTL, we say $\Gamma \models_B f$ iff $(\Gamma,0)\models_B f$, and $\models_B f$ iff $\Gamma \models_B f$ for all $\Gamma\in S^{\omega}$. The modelling relation $\models_B$ can be seen as specifying when a formula holds, on which reading it takes $B$ as input. Our next and final step is to use the $\models_B$ relation to define, for a formula $f$, the behaviour $B_f$ that it represents; for this we must instead assume that $f$ holds, and then solve for $B$. For instance, let $f$ be $\Box(p \rightarrow \$)$, i.e., we get rewarded every time $p$ is true. We would like $B_f$ to be the set of all finite sequences ending with a state containing $p$. For an arbitrary $f$, we take $B_f$ to be the set of prefixes that have to be rewarded if $f$ is to hold in all sequences:
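A hedged sketch of how these clauses might be executed: the Python evaluator below (our own construction, not the paper's) implements the rules above over a *finite* trace, approximating $S^{\omega}$ by quantifying the until clause over the remaining stages only and treating $\bigcirc f$ as false at the last stage. Formulae are nested tuples in NNF, states are frozensets of true propositions, and $B$ is a set of rewarded prefixes (tuples of states).

```python
def holds(trace, i, f, B):
    """(trace, i) |=_B f over a finite trace: a list of frozensets of
    true propositions. B is a set of rewarded prefixes (tuples of states)."""
    if f == '$':                              # reward constant
        return tuple(trace[:i + 1]) in B      # prefix Gamma(i) is rewarded
    if f == 'true':                           # top
        return True
    if f == 'false':                          # bottom
        return False
    if isinstance(f, str):                    # atomic proposition p
        return f in trace[i]
    op = f[0]
    if op == 'not':                           # negation on literals only (NNF)
        return f[1] not in trace[i]
    if op == 'and':
        return all(holds(trace, i, g, B) for g in f[1:])
    if op == 'or':
        return any(holds(trace, i, g, B) for g in f[1:])
    if op == 'next':                          # finite-trace choice: false at end
        return i + 1 < len(trace) and holds(trace, i + 1, f[1], B)
    if op == 'until':                         # weak until, finite horizon
        f1, f2 = f[1], f[2]
        for k in range(i, len(trace)):
            if all(not holds(trace, j, f2, B) for j in range(i, k + 1)):
                if not holds(trace, k, f1, B):
                    return False
        return True
    raise ValueError(f)
```

As a usage check, $\Box(p \rightarrow \$)$ in NNF is $(\neg p \vee \$) \mathbin{\mbox{\sf U}} \bot$, i.e. `('until', ('or', ('not', 'p'), '$'), 'false')`; on the trace $\langle \{\},\{p\},\{\}\rangle$ it holds with $B = \{\langle \{\},\{p\}\rangle\}$ and fails with $B = \emptyset$.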

Definition 2   $B_f \;\equiv\; \bigcap\{B \mid\; \models_B f\}$

To understand Definition 2, recall that $B$ contains the prefixes at the end of which we get a reward and $\$$ evaluates to true. Since $f$ is supposed to describe the way rewards will be received in an arbitrary sequence, we are interested in behaviours $B$ which make $\$$ true in such a way as to make $f$ hold without imposing constraints on the evolution of the world. However, there may be many behaviours with this property, so we take their intersection, ensuring that $B_f$ will only reward a prefix if it has to, because that prefix is in every behaviour satisfying $f$. In all but pathological cases (see Section 3.4), this makes $B_f$ coincide with the (set-inclusion) minimal behaviour $B$ such that $\models_B f$. The reason for this `stingy' semantics, making rewards minimal, is that $f$ does not actually say that rewards are allocated to more prefixes than are required for its truth. For instance, $\Box(p \rightarrow \$)$ says only that a reward is given every time $p$ is true, even though a more generous distribution of rewards would be consistent with it.
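Definition 2 can be checked by brute force on a toy universe. The Python sketch below (ours, with an ad hoc finite horizon standing in for $S^{\omega}$) unfolds $\models_B \Box(p \rightarrow \$)$ by hand for this particular $f$, enumerates every behaviour $B$ satisfying it, and intersects them; the result is exactly the minimal behaviour, the set of prefixes ending in a state containing $p$.

```python
from itertools import chain, combinations, product

P, E = frozenset({'p'}), frozenset()          # the two states: {p} and {}
HORIZON = 3                                   # ad hoc stand-in for S^omega
TRACES = list(product([E, P], repeat=HORIZON))
PREFIXES = {t[:i + 1] for t in TRACES for i in range(HORIZON)}

def satisfies(B):
    # |=_B Box(p -> $), unfolded by hand for this f: at every stage of
    # every trace, if p holds there, the current prefix must be in B.
    return all(t[:i + 1] in B
               for t in TRACES for i in range(HORIZON) if 'p' in t[i])

def powerset(xs):
    xs = list(xs)
    return chain.from_iterable(combinations(xs, r) for r in range(len(xs) + 1))

# Definition 2: B_f is the intersection of all behaviours satisfying f.
B_f = set.intersection(*(set(B) for B in powerset(PREFIXES)
                         if satisfies(set(B))))

# It coincides with the minimal behaviour: prefixes ending in a p-state,
# as the 'stingy' semantics intends.
assert B_f == {g for g in PREFIXES if 'p' in g[-1]}
```

Many behaviours satisfy $f$ here (any superset of the minimal one), which is precisely why the intersection, rather than any single satisfying $B$, pins down the intended rewards.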
Sylvie Thiebaux 2006-01-20