Representing Rewards with PLTL

The syntax of PLTL, the language chosen to represent rewarding behaviours, is that of propositional logic, augmented with the operators $\circleddash $ (previously) and $\mathbin{\mbox{\sf S}}$ (since), see e.g., [20]. Whereas a classical propositional logic formula denotes a set of states (a subset of $S$), a PLTL formula denotes a set of finite sequences of states (a subset of $S^*$). A formula without temporal modality expresses a property that must be true of the current state, i.e., the last state of the finite sequence. $\circleddash f$ specifies that $f$ holds in the previous state (the state one before the last). $f_1 \mathbin{\mbox{\sf S}}f_2$ requires $f_2$ to have been true at some point in the sequence, and, unless that point is the present, $f_1$ to have held ever since. More formally, the modelling relation $\models$ stating whether a formula $f$ holds of a finite sequence $\Gamma(i) = \langle \Gamma_0, \ldots, \Gamma_i \rangle$ is defined recursively as follows:

$\Gamma(i) \models p$ iff $p$ holds at $\Gamma_i$, for $p$ an atomic proposition;
$\Gamma(i) \models \neg f$ iff $\Gamma(i) \not\models f$;
$\Gamma(i) \models f_1 \wedge f_2$ iff $\Gamma(i) \models f_1$ and $\Gamma(i) \models f_2$;
$\Gamma(i) \models \circleddash f$ iff $i > 0$ and $\Gamma(i-1) \models f$;
$\Gamma(i) \models f_1 \mathbin{\mbox{\sf S}}f_2$ iff there exists $j \leq i$ such that $\Gamma(j) \models f_2$ and $\Gamma(k) \models f_1$ for all $j < k \leq i$.

From $\mathbin{\mbox{\sf S}}$, one can define the useful operators $\makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}f \equiv \mbox{$\top$}\mathbin{\mbox{\sf S}}f$, meaning that $f$ has been true at some point, and $\boxminus f \equiv \neg\makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}\neg f$, meaning that $f$ has always been true. E.g., $g \wedge \neg\circleddash \makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}g$ denotes the set of finite sequences ending in a state where $g$ is true for the first time in the sequence. Other useful abbreviations are $\circleddash ^k$ ($k$ times ago), for $k$ iterations of the $\circleddash $ modality, $\makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}_k f$ for $\bigvee_{i=1}^{k} \circleddash ^i f$ ($f$ was true at some of the last $k$ steps), and $\boxminus _k f$ for $\bigwedge_{i=1}^{k} \circleddash ^i f$ ($f$ was true at all of the last $k$ steps).
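
To make the semantics concrete, the following is a minimal illustrative sketch, not part of the formalism above, of how the relation $\models$ could be evaluated over finite state sequences in Python. The tuple-based formula representation and the names holds, sometime and always are our own choices.

\begin{verbatim}
# Illustrative sketch (not from the paper): evaluating PLTL formulae over a
# finite state sequence.  A state is a set of atomic propositions, a sequence
# is a list of states, and a formula is a nested tuple, e.g.
# ("since", ("atom", "p"), ("atom", "q")).  All names are hypothetical.

def holds(seq, i, f):
    """Does formula f hold of the prefix seq[0..i], i.e. Gamma(i) |= f ?"""
    op = f[0]
    if op == "true":                      # the constant top
        return True
    if op == "atom":                      # atomic proposition: check last state
        return f[1] in seq[i]
    if op == "not":
        return not holds(seq, i, f[1])
    if op == "and":
        return holds(seq, i, f[1]) and holds(seq, i, f[2])
    if op == "prev":                      # "previously": f held one step ago
        return i > 0 and holds(seq, i - 1, f[1])
    if op == "since":                     # f1 S f2: f2 held at some j <= i and
        for j in range(i, -1, -1):        # f1 held at every step after j
            if holds(seq, j, f[2]):
                return all(holds(seq, k, f[1]) for k in range(j + 1, i + 1))
        return False
    raise ValueError("unknown operator: " + op)

def sometime(f):                          # "f has been true at some point"
    return ("since", ("true",), f)        # defined as  top S f

def always(f):                            # "f has always been true"
    return ("not", sometime(("not", f)))

# g true for the first time: g holds now and was never previously true.
first_g = ("and", ("atom", "g"), ("not", ("prev", sometime(("atom", "g")))))
seq = [set(), {"g"}, {"g"}]
print(holds(seq, 1, first_g))   # True: g first becomes true at step 1
print(holds(seq, 2, first_g))   # False: g was already true earlier
\end{verbatim}
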
Non-Markovian reward functions are described with a set of pairs $(f_i : r_i)$ where $f_i$ is a PLTL reward formula and $r_i$ is a real, with the semantics that the reward assigned to a sequence in $S^*$ is the sum of the $r_i$'s for which that sequence is a model of $f_i$. Below, we let $F$ denote the set of reward formulae $f_i$ in the description of the reward function.
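
The summation semantics is simple enough to state as a one-line computation. The following minimal sketch, again our own construction with hypothetical names, treats each reward formula abstractly as a predicate over finite prefixes and adds up the rewards of the formulae that the prefix models.

\begin{verbatim}
# Minimal sketch (our construction): the reward earned by a finite prefix
# under a specification {(f_i : r_i)} is the sum of the r_i whose formula
# the prefix models.  Each reward formula is modelled here as a Python
# predicate over the prefix (a list of states, each a set of propositions).

def prefix_reward(prefix, spec):
    """spec is a list of (formula_predicate, reward) pairs."""
    return sum(r for models, r in spec if models(prefix))

# Example: a Markovian reward (p : 5.0), i.e. p true in the last state,
# and a "q has been true at some point" reward (1.0).
spec = [
    (lambda pre: "p" in pre[-1], 5.0),
    (lambda pre: any("q" in s for s in pre), 1.0),
]
print(prefix_reward([{"q"}, set(), {"p"}], spec))   # prints 6.0
\end{verbatim}
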
Bacchus et al. [2] give a list of behaviours which it might be useful to reward, together with their expression in PLTL. For instance, where $f$ is an atemporal formula, $(f: r)$ rewards with $r$ units the achievement of $f$ whenever it happens. This is a Markovian reward. In contrast, $(\makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}f : r)$ rewards every state following (and including) the first achievement of $f$, while $(f \wedge \neg\circleddash \makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}f: r)$ only rewards the first occurrence of $f$. $(f \wedge \boxminus _k \neg f: r)$ rewards the occurrence of $f$ at most once every $k$ steps. $(\circleddash ^n \neg \circleddash \mbox{$\top$}: r)$ rewards the $n^{\mbox{th}}$ state, independently of its properties. $(\circleddash ^2 f_1 \wedge \circleddash f_2 \wedge f_3 :r)$ rewards the occurrence of $f_1$ immediately followed by $f_2$ and then $f_3$. In reactive planning, so-called response formulae, which describe that the achievement of $f$ is triggered by a condition (or command) $c$, are particularly useful. These can be written as $(f \wedge \makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}c: r)$ if every state in which $f$ is true following the first issue of the command is to be rewarded. Alternatively, they can be written as $(f \wedge \circleddash (\neg f \mathbin{\mbox{\sf S}}c) : r)$ if only the first occurrence of $f$ after each command is to be rewarded. It is common to reward the achievement of $f$ only within $k$ steps of the trigger; we write for example $(f \wedge \makebox[8pt][c]{\makebox[0pt][c]{$\Diamond$}\makebox[0pt][c]{\raisebox{0.5pt}{-}}}_k c : r)$ to reward all such states in which $f$ holds (two of these behaviours are sketched at the end of this section).

From a theoretical point of view, it is known [38] that the behaviours representable in PLTL are exactly those corresponding to star-free regular languages. Non-star-free behaviours such as $(pp)^*$ (reward an even number of states, all containing $p$) are therefore not representable. Nor, of course, are non-regular behaviours such as $p^n q^n$ (e.g., reward taking equal numbers of steps to the left and to the right). We shall not speculate here on how severe a restriction this is for the purposes of planning.
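
To illustrate the catalogue of behaviours above, here is a hedged sketch, our own encoding rather than the authors', of two of them as predicates over finite prefixes: rewarding only the first occurrence of $f$, and rewarding the achievement of $f$ within $k$ steps of a command $c$. The helper names are hypothetical.

\begin{verbatim}
# Illustrative sketch (our own encoding) of two behaviours from the list
# above, as predicates over prefixes (lists of states, each a set of
# propositions).  Helper names are hypothetical.

def first_occurrence(f):
    # "f holds now and never held before" -- rewards only the first occurrence
    return lambda pre: f in pre[-1] and all(f not in s for s in pre[:-1])

def response_within(f, c, k):
    # "f holds now and c was issued at one of the previous k steps"
    return lambda pre: f in pre[-1] and any(c in s for s in pre[-1 - k:-1])

states = [set(), {"c"}, set(), {"f"}, {"f"}]
reward_first_f = first_occurrence("f")
reward_response = response_within("f", "c", 3)
for i in range(len(states)):
    prefix = states[:i + 1]
    print(i, reward_first_f(prefix), reward_response(prefix))
# The prefix ending at index 3 satisfies both predicates: f holds for the
# first time and c was issued two steps earlier.  At index 4 only the
# response predicate still holds (c remains within the 3-step window);
# the first-occurrence reward is not repeated.
\end{verbatim}
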