## Language and Semantics

Compactly representing non-Markovian reward functions reduces to compactly representing the behaviours of interest, where by behaviour we mean a set of finite sequences of states (a subset of ), e.g. the in Figure 1. Recall that the reward is issued at the end of any prefix in that set. Once behaviours are compactly represented, it is straightforward to represent non-Markovian reward functions as mappings from behaviours to real numbers - we shall defer looking at this until Section 3.6. To represent behaviours compactly, we adopt a version of future linear temporal logic (FLTL) (see [20]), augmented with a propositional constant $', intended to be read The behaviour we want to reward has just happened' or The reward is received now'. The language$FLTL begins with a set of basic propositions giving rise to literals:

where and stand for true' and false', respectively. The connectives are classical and , and the temporal modalities (next) and (weak until), giving formulae:

Our until' is weak: means will be true from now on until is, if ever. Unlike the more commonly used strong until', this does not imply that will eventually be true. It allows us to define the useful operator (always): ( will always be true from now on). We also adopt the notations for iterations of the modality (f will be true in exactly steps), for (f will be true within the next steps), and for ( will be true throughout the next steps). Although negation officially occurs only in literals, i.e., the formulae are in negation normal form (NNF), we allow ourselves to write formulae involving it in the usual way, provided that they have an equivalent in NNF. Not every formula has such an equivalent, because there is no such literal as and because eventualities (f will be true some time') are not expressible. These restrictions are deliberate. If we were to use our notation and logic to theorise about the allocation of rewards, we would indeed need the means to say when rewards are not received or to express features such as liveness (always, there will be a reward eventually'), but in fact we are using them only as a mechanism for ensuring that rewards are given where they should be, and for this restricted purpose eventualities and the negated dollar are not needed. In fact, including them would create technical difficulties in relating formulae to the behaviours they represent. The semantics of this language is similar to that of FLTL, with an important difference: because the interpretation of the constant depends on the behaviour we want to reward (whatever that is), the modelling relation must be indexed by . We therefore write to mean that formula holds at the -th stage of an arbitrary sequence , relative to behaviour . Defining is the first step in our description of the semantics:

iff

, for , iff

, for , iff

) Note that except for subscript and for the first rule, this is just the standard FLTL semantics, and that therefore $-free formulae keep their FLTL meaning. As with FLTL, we say iff , and iff for all . The modelling relation can be seen as specifying when a formula holds, on which reading it takes as input. Our next and final step is to use the relation to define, for a formula , the behaviour that it represents, and for this we must rather assume that holds, and then solve for . For instance, let be , i.e., we get rewarded every time is true. We would like to be the set of all finite sequences ending with a state containing . For an arbitrary , we take to be the set of prefixes that have to be rewarded if is to hold in all sequences: To understand Definition 2, recall that contains prefixes at the end of which we get a reward and$ evaluates to true. Since is supposed to describe the way rewards will be received in an arbitrary sequence, we are interested in behaviours which make \$ true in such a way as to make hold without imposing constraints on the evolution of the world. However, there may be many behaviours with this property, so we take their intersection,3ensuring that will only reward a prefix if it has to because that prefix is in every behaviour satisfying . In all but pathological cases (see Section 3.4), this makes coincide with the (set-inclusion) minimal behaviour such that . The reason for this stingy' semantics, making rewards minimal, is that does not actually say that rewards are allocated to more prefixes than are required for its truth. For instance, says only that a reward is given every time is true, even though a more generous distribution of rewards would be consistent with it.
Sylvie Thiebaux 2006-01-20