### Representing Rewards with PLTL

The syntax of PLTL, the language chosen to represent rewarding behaviours, is that of propositional logic, augmented with the operators $\ominus$ (previously) and $\mathsf{S}$ (since), see e.g., [20]. Whereas a classical propositional logic formula denotes a set of states (a subset of $S$), a PLTL formula denotes a set of finite sequences of states (a subset of $S^*$). A formula without temporal modality expresses a property that must be true of the current state, i.e., the last state of the finite sequence. $\ominus\phi$ specifies that $\phi$ holds in the previous state (the state one before the last). $\phi_1 \,\mathsf{S}\, \phi_2$ requires $\phi_2$ to have been true at some point in the sequence and, unless that point is the present, $\phi_1$ to have held ever since. More formally, the modelling relation $\models$, stating whether a formula $\phi$ holds of a finite sequence of states $\Gamma(i) = \langle s_0, \ldots, s_i \rangle$, is defined recursively as follows:
• $\Gamma(i) \models p$ iff $p \in s_i$, for $p \in \mathcal{P}$, the set of atomic propositions
• $\Gamma(i) \models \neg\phi$ iff $\Gamma(i) \not\models \phi$
• $\Gamma(i) \models \phi_1 \wedge \phi_2$ iff $\Gamma(i) \models \phi_1$ and $\Gamma(i) \models \phi_2$
• $\Gamma(i) \models \ominus\phi$ iff $i > 0$ and $\Gamma(i-1) \models \phi$
• $\Gamma(i) \models \phi_1 \,\mathsf{S}\, \phi_2$ iff there exists $j \leq i$ such that $\Gamma(j) \models \phi_2$ and, for all $k$ with $j < k \leq i$, $\Gamma(k) \models \phi_1$
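
This recursive definition translates directly into executable form. The following is a minimal sketch in Python, offered as our own illustration rather than anything from the paper: states are represented as sets of atomic propositions, formulae as nested tuples, and the function `holds` (an illustrative name) mirrors the five clauses above.

```python
from typing import List, Set, Tuple, Union

# Illustrative encoding (not from the paper): a state is the set of atomic
# propositions true in it; a formula is either a proposition string or a
# nested tuple ("not", f), ("and", f1, f2), ("prev", f), ("since", f1, f2).
State = Set[str]
Formula = Union[str, tuple]

def holds(seq: List[State], i: int, phi: Formula) -> bool:
    """Does phi hold of the prefix <s_0, ..., s_i> of seq?
    Each branch mirrors one clause of the modelling relation above."""
    if isinstance(phi, str):                      # atomic proposition p
        return phi in seq[i]
    op = phi[0]
    if op == "not":                               # negation
        return not holds(seq, i, phi[1])
    if op == "and":                               # conjunction
        return holds(seq, i, phi[1]) and holds(seq, i, phi[2])
    if op == "prev":                              # "previously": i > 0, prefix one shorter
        return i > 0 and holds(seq, i - 1, phi[1])
    if op == "since":                             # some j <= i: phi2 at j, phi1 at j+1..i
        return any(holds(seq, j, phi[2])
                   and all(holds(seq, k, phi[1]) for k in range(j + 1, i + 1))
                   for j in range(i + 1))
    raise ValueError(f"unknown operator: {op!r}")
```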
From $\mathsf{S}$, one can define the useful operators $\Diamond^{-}\phi \equiv \top \,\mathsf{S}\, \phi$, meaning that $\phi$ has been true at some point, and $\Box^{-}\phi \equiv \neg\Diamond^{-}\neg\phi$, meaning that $\phi$ has always been true. E.g., $\phi \wedge \neg\ominus\Diamond^{-}\phi$ denotes the set of finite sequences ending in a state where $\phi$ is true for the first time in the sequence. Other useful abbreviations are $\ominus^{k}\phi$ ($k$ times ago) for $k$ iterations of the $\ominus$ modality, $\Diamond^{-}_{k}\phi \equiv \bigvee_{i=0}^{k}\ominus^{i}\phi$ ($\phi$ was true at some of the last $k$ steps), and $\Box^{-}_{k}\phi \equiv \bigwedge_{i=0}^{k}\ominus^{i}\phi$ ($\phi$ was true at all the last $k$ steps).

Non-Markovian reward functions are described with a set of pairs $\phi_i : r_i$, where $\phi_i$ is a PLTL reward formula and $r_i$ is a real, with the semantics that the reward assigned to a sequence in $S^*$ is the sum of the $r_i$'s for which that sequence is a model of $\phi_i$. Below, we let $\Phi$ denote the set $\{\phi_i\}$ of reward formulae in the description of the reward function.

Bacchus et al. [2] give a list of behaviours which it might be useful to reward, together with their expression in PLTL. For instance, where $g$ is an atemporal formula, the formula $g$ rewards with $r$ units the achievement of $g$ whenever it happens. This is a Markovian reward. In contrast, $\Diamond^{-}g$ rewards every state following (and including) the achievement of $g$, while $g \wedge \neg\ominus\Diamond^{-}g$ only rewards the first occurrence of $g$. $g \wedge \ominus\Box^{-}_{k-1}\neg g$ rewards the occurrence of $g$ at most once every $k$ steps ($g$ is rewarded only if it was false at each of the $k$ preceding steps). $\ominus^{k-1}\neg\ominus\top$ rewards the $k$th state, independently of its properties ($\neg\ominus\top$ holds exactly at the first state of a sequence). $g_3 \wedge \ominus(g_2 \wedge \ominus g_1)$ rewards the occurrence of $g_1$ immediately followed by $g_2$ and then $g_3$.

In reactive planning, so-called response formulae, which describe that the achievement of $g$ is triggered by a condition (or command) $c$, are particularly useful. These can be written as $g \wedge \Diamond^{-}c$ if every state in which $g$ is true following the first issue of the command is to be rewarded. Alternatively, they can be written as $g \wedge (\neg g \,\mathsf{S}\, c)$ if only the first occurrence of $g$ is to be rewarded after each command. It is common to only reward the achievement of $g$ within $k$ steps of the trigger; we write for example $g \wedge \Diamond^{-}_{k}c$ to reward all such states in which $g$ holds.

From a theoretical point of view, it is known [38] that the behaviours representable in PLTL are exactly those corresponding to star-free regular languages. Non star-free behaviours such as $(pp)^{*}$ (reward an even number of states, all containing $p$) are therefore not representable. Nor, of course, are non-regular behaviours such as $p^{n}q^{n}$ (e.g., reward taking equal numbers of steps to the left and right). We shall not speculate here on how severe a restriction this is for the purposes of planning.
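
Continuing the illustrative evaluator sketched earlier, the derived operators and the summation semantics of the reward function can be written as short helpers. All names here (`top`, `sometime`, `first_time`, `response`, `reward`) are hypothetical, chosen for this sketch rather than taken from the paper or from Bacchus et al.

```python
def top() -> Formula:
    # Truth constant, encoded as the tautology not-(q and not-q)
    # for an arbitrary proposition q.
    return ("not", ("and", "q", ("not", "q")))

def sometime(phi: Formula) -> Formula:
    # "top since phi": phi has been true at some point in the past.
    return ("since", top(), phi)

def first_time(phi: Formula) -> Formula:
    # phi, and not previously sometime(phi): phi true now for the first time.
    return ("and", phi, ("not", ("prev", sometime(phi))))

def response(g: Formula, c: Formula) -> Formula:
    # g, and not-g since c: first occurrence of g after each command c.
    return ("and", g, ("since", ("not", g), c))

def reward(seq: List[State], spec: List[Tuple[Formula, float]]) -> float:
    # The reward of a sequence is the sum of the r_i whose formula it models.
    last = len(seq) - 1
    return sum(r for phi, r in spec if holds(seq, last, phi))

# Example: 5 units for the first achievement of g, plus 1 unit for every
# state from the achievement of g onwards.
spec = [(first_time("g"), 5.0), (sometime("g"), 1.0)]
print(reward([set(), {"g"}], spec))          # 6.0: g achieved for the first time
print(reward([set(), {"g"}, set()], spec))   # 1.0: g was already achieved earlier
```

As the example shows, the non-Markovian character is entirely in the formulae: the same final state can earn different rewards depending on the history that leads to it.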