The syntax of PLTL, the language chosen to represent rewarding
behaviours, is that of propositional logic, augmented with the
operators $\ominus$ (previously) and $\mathbin{S}$ (since),
see e.g., [20]. Whereas a classical propositional logic formula
denotes a set of states (a subset of $S$), a PLTL formula denotes a
set of finite sequences of states (a subset of $S^*$). A formula
without temporal modality expresses a property that must be true of
the current state, i.e., the last state of the finite sequence.
$\ominus f$ specifies that $f$ holds in the previous state (the state
one before the last).
$f_1 \mathbin{S} f_2$ requires $f_2$ to have been
true at some point in the sequence, and, unless that point is the present,
$f_1$ to have held ever since. More formally, the modelling relation
$\models$, stating whether a formula $f$ holds of a finite sequence
$\Gamma(i) = \langle \Gamma_0, \ldots, \Gamma_i \rangle \in S^*$,
is defined recursively as follows:

$\Gamma(i) \models p$ iff $p \in \Gamma_i$, for $p \in \mathcal{P}$, the set
of atomic propositions

$\Gamma(i) \models \neg f$ iff $\Gamma(i) \not\models f$

$\Gamma(i) \models f_1 \wedge f_2$ iff $\Gamma(i) \models f_1$
and $\Gamma(i) \models f_2$

$\Gamma(i) \models \ominus f$ iff $i > 0$ and $\Gamma(i-1) \models f$

$\Gamma(i) \models f_1 \mathbin{S} f_2$ iff $\exists j$, $0 \le j \le i$,
such that $\Gamma(j) \models f_2$ and $\forall k$, $j < k \le i$, $\Gamma(k) \models f_1$
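As a concrete illustration, the recursive definition above translates directly into a satisfaction check over finite sequences. The tuple encoding of formulae, the representation of states as sets of atomic propositions, and the function name below are our own illustrative choices, not part of the formalism:

```python
# A minimal sketch of the PLTL modelling relation: a formula is a nested
# tuple, a state is the set of atomic propositions true in it, and a
# finite sequence Gamma(i) is a Python list seq whose last index is i.
# (Encoding and names are illustrative, not from the text.)

def models(seq, i, f):
    """True iff the prefix <seq[0], ..., seq[i]> is a model of f."""
    op = f[0]
    if op == "atom":                    # Gamma(i) |= p iff p in Gamma_i
        return f[1] in seq[i]
    if op == "not":
        return not models(seq, i, f[1])
    if op == "and":
        return models(seq, i, f[1]) and models(seq, i, f[2])
    if op == "prev":                    # "previously": i > 0 and Gamma(i-1) |= f
        return i > 0 and models(seq, i - 1, f[1])
    if op == "since":                   # f1 S f2, scanning backwards:
        for j in range(i, -1, -1):      # f2 at some j, f1 at all k, j < k <= i
            if models(seq, j, f[2]):
                return True
            if not models(seq, j, f[1]):
                return False
        return False
    raise ValueError("unknown operator: %r" % op)
```

The backward scan for $\mathbin{S}$ returns as soon as $f_2$ is found, having already checked $f_1$ at every strictly later position, which matches the "$f_2$ once, $f_1$ ever since" reading.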
From
$\mathbin{S}$, one can define the useful operators
$\Diamond^{-} f \equiv \top \mathbin{S} f$, meaning that $f$ has been true at some point,
and $\Box^{-} f \equiv \neg \Diamond^{-} \neg f$, meaning that $f$ has
always been true. E.g., $f \wedge \neg \ominus \Diamond^{-} f$ denotes the
set of finite sequences ending in a state where $f$ is true for the
first time in the sequence. Other useful abbreviations are
$\ominus^k f$ ($k$ times ago), for $k$ iterations of the $\ominus$ modality,
$\Diamond^{-}_k f$ for
$\bigvee_{i=0}^{k} \ominus^i f$ ($f$ was true at some
of the last $k$ steps), and $\Box^{-}_k f$ for
$\bigwedge_{i=0}^{k} \ominus^i f$ ($f$ was true at all the last $k$ steps).
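These derived operators are plain syntactic sugar over $\ominus$ and $\mathbin{S}$, which a tuple encoding of formulae makes explicit. The truth constant, the encoding, and all names below are our own illustrative choices:

```python
# Derived PLTL operators expanded into the core connectives.
# Formulae are nested tuples; ("true",) is taken as a primitive
# constant here (an assumption of this sketch).

TRUE = ("true",)

def once(f):
    """'f has been true at some point': true S f."""
    return ("since", TRUE, f)

def always_been(f):
    """'f has always been true': not once(not f)."""
    return ("not", once(("not", f)))

def prev_k(f, k):
    """k iterations of the 'previously' modality."""
    for _ in range(k):
        f = ("prev", f)
    return f

def first_time(f):
    """f is true for the first time: f and not previously once(f)."""
    return ("and", f, ("not", ("prev", once(f))))
```

Expanding the sugar this way keeps the satisfaction check restricted to the five core cases of the recursive definition.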
Non-Markovian reward functions are described with a set of pairs $(f_i : r_i)$ where $f_i$ is a PLTL reward formula and $r_i$ is a real, with
the semantics that the reward assigned to a sequence in $S^*$ is the
sum of the $r_i$'s for which that sequence is a model of $f_i$. Below,
we let $F$ denote the set of reward formulae in the description
of the reward function. Bacchus et al. [2] give a list of behaviours which it might be useful to reward, together
with their expression in PLTL. For instance, where $g$ is an
atemporal formula, $(g : r)$ rewards with $r$ units the achievement of $g$
whenever it happens. This is a Markovian reward. In contrast,
$(\Diamond^{-} g : r)$ rewards every state following (and including)
the achievement of $g$, while $(g \wedge \neg \ominus \Diamond^{-} g : r)$
only rewards the first occurrence of $g$.
$(g \wedge \bigwedge_{i=1}^{k-1} \ominus^i \neg g : r)$ rewards the occurrence of $g$ at most once every $k$
steps.
$(\ominus^{k-1} \neg \ominus \top : r)$ rewards the $k$th
state, independently of its properties.
$(c \wedge \ominus b \wedge \ominus \ominus a : r)$ rewards the occurrence of $a$ immediately
followed by $b$ and then $c$. In reactive planning, so-called
response formulae, which describe that the achievement of $g$ is
triggered by a condition (or command) $c$, are particularly useful.
These can be written as $g \wedge \Diamond^{-} c$ if every state
in which $g$ is true following the first issue of the command is to be
rewarded. Alternatively, they can be written as
$g \wedge \ominus(\neg g \mathbin{S} c)$ if only the first occurrence of $g$ is to be
rewarded after each command. It is common to only reward the
achievement within $k$ steps of the trigger; we write for example
$g \wedge \Diamond^{-}_k c$ to reward all such states in which $g$
holds.
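Putting the pieces together, the reward earned by a finite sequence is the sum of the rewards of the formulae it models. The sketch below pairs a compact satisfaction check with that summation and applies it to a response behaviour; the tuple encoding and all names are our own illustrative choices:

```python
# Sketch of the non-Markovian reward semantics: a reward function is a
# set of (formula : r) pairs, and a finite sequence earns the sum of
# the r's whose formula it models. (Encoding and names are ours.)

def sat(seq, i, f):
    """Compact PLTL satisfaction check over the prefix ending at i."""
    op = f[0]
    if op == "atom":
        return f[1] in seq[i]
    if op == "not":
        return not sat(seq, i, f[1])
    if op == "and":
        return sat(seq, i, f[1]) and sat(seq, i, f[2])
    if op == "prev":
        return i > 0 and sat(seq, i - 1, f[1])
    if op == "since":                   # f2 at some j, f1 ever since
        for j in range(i, -1, -1):
            if sat(seq, j, f[2]):
                return True
            if not sat(seq, j, f[1]):
                return False
        return False
    raise ValueError(op)

def reward(seq, pairs):
    """Reward assigned to the full sequence seq under pairs (f_i, r_i)."""
    i = len(seq) - 1
    return sum(r for f, r in pairs if sat(seq, i, f))

# Response behaviour: reward the first occurrence of g after each
# command c, encoded as  g and previously(not g  S  c).
first_g_after_c = ("and", ("atom", "g"),
                   ("prev", ("since", ("not", ("atom", "g")), ("atom", "c"))))
```

For instance, under the pair `(first_g_after_c, 5.0)` the sequence $\langle\{c\},\{\},\{g\}\rangle$ earns 5.0, while $\langle\{c\},\{g\},\{g\}\rangle$ earns nothing, since its final state is not the first occurrence of $g$ after the command.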
From a theoretical point of view, it is known
[38] that the behaviours representable in PLTL
are exactly those corresponding to star-free regular languages. Non-star-free
behaviours such as $(pp)^*$ (reward an even number of states,
all containing $p$) are therefore not representable. Nor, of course,
are non-regular behaviours such as $a^n b^n$ (e.g., reward taking equal
numbers of steps to the left and right). We shall not speculate here
on how severe a restriction this is for the purposes of planning.
Sylvie Thiebaux
2006-01-20