$FLTL Formula Progression

Having defined a language to represent behaviours to be rewarded, we now turn to the problem of computing, given a reward formula, a minimum allocation of rewards to the states actually encountered in an execution sequence, in such a way as to satisfy the formula. Because we ultimately wish to use anytime solution methods which generate state sequences incrementally via forward search, this computation is best done on the fly, while the sequence is being generated. We therefore devise an incremental algorithm based on a model-checking technique normally used to check whether a state sequence is a model of an FLTL formula [4]. This technique is known as formula progression because it `progresses' or `pushes' the formula through the sequence.

Our progression technique is shown in Algorithm 1. In essence, it computes the modelling relation $\models_B$ given in Section 3.2. However, unlike the definition of $\models_B$, it is designed to be useful when states in the sequence become available one at a time, in that it defers the evaluation of the part of the formula that refers to the future until the next state becomes available. Let $s$ be a state, say $\Gamma_i$, the last state of the sequence prefix $\Gamma(i)$ generated so far, and let $b$ be a boolean which is true iff $\Gamma(i)$ is in the behaviour $B$ to be rewarded. Let the \$FLTL formula $f$ describe the allocation of rewards over all possible futures. Then the progression of $f$ through $s$ given $b$, written $\mbox{Prog}(b,s,f)$, is a new formula which describes the allocation of rewards over all possible futures of the next state, given that we have just passed through $s$. Crucially, the function $\mbox{Prog}$ is Markovian, depending only on the current state and the single boolean value $b$. Note that $\mbox{Prog}$ is computable in time linear in the length of $f$, and that for \$-free formulae it collapses to FLTL formula progression [4], regardless of the value of $b$. We assume that $\mbox{Prog}$ incorporates the usual simplifications for the sentential constants $\bot$ and $\top$: $f \wedge \bot$ simplifies to $\bot$, $f \wedge \top$ simplifies to $f$, etc.
\begin{algorithm}
\caption{\$FLTL Progression}
\begin{tabular}{lcl}
$\mbox{Prog}(b,s,\top)$ & = & $\top$ \\
$\mbox{Prog}(b,s,\bot)$ & = & $\bot$ \\
$\mbox{Prog}(b,s,p)$ & = & $\top$ if $p \in s$, $\bot$ otherwise \\
$\mbox{Prog}(b,s,\$)$ & = & $\top$ if $b$, $\bot$ otherwise \\
$\mbox{Prog}(b,s,\neg f)$ & = & $\neg \mbox{Prog}(b,s,f)$ \\
$\mbox{Prog}(b,s,f_1 \wedge f_2)$ & = & $\mbox{Prog}(b,s,f_1) \wedge \mbox{Prog}(b,s,f_2)$ \\
$\mbox{Prog}(b,s,\bigcirc f)$ & = & $f$ \\
$\mbox{Prog}(b,s,f_1 \mathbin{\mathsf{U}} f_2)$ & = & $\mbox{Prog}(b,s,f_2) \vee (\mbox{Prog}(b,s,f_1) \wedge f_1 \mathbin{\mathsf{U}} f_2)$ \\[1ex]
$\mbox{Rew}(s,f)$ & = & true if $\mbox{Prog}(\mbox{false},s,f) \equiv \bot$, false otherwise \\
$\mbox{\$Prog}(s,f)$ & = & $\mbox{Prog}(\mbox{Rew}(s,f),s,f)$
\end{tabular}
\end{algorithm}
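To make the progression rules concrete, the following is a minimal Python sketch of the functions in Algorithm 1. It is purely illustrative and not the implementation described in this paper: formulae are represented as nested tuples, states as sets of atom names, and the identifiers \texttt{prog}, \texttt{rew}, \texttt{dollar\_prog} and \texttt{simp} are our own.
\begin{verbatim}
# Formulae are nested tuples: ('true',), ('false',), ('atom', name),
# ('reward',) for $, ('not', f), ('and', f, g), ('or', f, g),
# ('next', f), and ('until', f, g).  States are sets of atom names.

TRUE, FALSE = ('true',), ('false',)

def simp(f):
    # The usual simplification for the sentential constants top/bottom.
    op = f[0]
    if op == 'not':
        if f[1] == TRUE:  return FALSE
        if f[1] == FALSE: return TRUE
    elif op == 'and':
        if FALSE in (f[1], f[2]): return FALSE
        if f[1] == TRUE: return f[2]
        if f[2] == TRUE: return f[1]
    elif op == 'or':
        if TRUE in (f[1], f[2]): return TRUE
        if f[1] == FALSE: return f[2]
        if f[2] == FALSE: return f[1]
    return f

def prog(b, s, f):
    # Progress formula f through state s, with b true iff the current
    # prefix is taken to be rewarded.
    op = f[0]
    if op in ('true', 'false'):
        return f
    if op == 'atom':
        return TRUE if f[1] in s else FALSE
    if op == 'reward':                      # the $ constant
        return TRUE if b else FALSE
    if op == 'not':
        return simp(('not', prog(b, s, f[1])))
    if op == 'and':
        return simp(('and', prog(b, s, f[1]), prog(b, s, f[2])))
    if op == 'or':
        return simp(('or', prog(b, s, f[1]), prog(b, s, f[2])))
    if op == 'next':                        # O f progresses to f
        return f[1]
    if op == 'until':                       # f1 U f2
        return simp(('or', prog(b, s, f[2]),
                     simp(('and', prog(b, s, f[1]), f))))
    raise ValueError("unknown connective: %s" % op)

def rew(s, f):
    # Reward the current prefix iff progression with the flag set to
    # false collapses to bottom (after simplification).
    return prog(False, s, f) == FALSE

def dollar_prog(s, f):
    # $Prog(s, f) = Prog(Rew(s, f), s, f)
    return prog(rew(s, f), s, f)
\end{verbatim}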
The fundamental property of $\mbox{Prog}$ is the following. Where $b \Leftrightarrow (\Gamma(i) \in B)$:

Property 1   $(\Gamma,i) \models_B f$ iff $(\Gamma,i+1) \models_B \mbox{\em Prog}(b,\Gamma_i,f)$

Proof: See Appendix B. $\Box$
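As a simple sanity check of Property 1 (using the clauses of Algorithm 1 and the informal reading of $\$$ above, namely that $\$$ holds at stage $i$ exactly when the prefix $\Gamma(i)$ is rewarded), take $f = \$$. If $b$ is true then $\mbox{Prog}(b,\Gamma_i,\$) = \top$ and the right-hand side holds trivially; if $b$ is false then $\mbox{Prog}(b,\Gamma_i,\$) = \bot$ and the right-hand side fails trivially. In either case Property 1 reduces to the claim that $(\Gamma,i) \models_B \$$ iff $\Gamma(i) \in B$, i.e. iff the prefix $\Gamma(i)$ is rewarded, as expected.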
Like $\models_B$, the function $\mbox{Prog}$ seems to require $B$ (or at least $b$) as input, but of course when progression is applied in practice we only have $f$ and one new state of $\Gamma$ at a time, and what we really want to do is compute the appropriate $B$, namely that represented by $f$. So, as in Section 3.2, we now turn to the second step, which is to use $\mbox{Prog}$ to decide on the fly whether a newly generated sequence prefix $\Gamma(i)$ is in $B_f$ and so should be allocated a reward. This is the purpose of the functions $\mbox{\$Prog}$ and $\mbox{Rew}$, also given in Algorithm 1. Given $\Gamma$ and $f$, the function $\mbox{\$Prog}$ defines an infinite sequence of formulae $\langle f_0, f_1, \ldots \rangle$ in the obvious way:

$f_0 = f$

$f_{i+1} = \mbox{\$Prog}(\Gamma_i,f_i)$
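In terms of the illustrative Python sketch given after Algorithm 1 (again our own names, not the paper's implementation), this iteration amounts to a simple loop over the states generated so far:
\begin{verbatim}
def progress(f, states):
    # Yield, for each state of the (here finite) sequence, whether the
    # prefix ending in that state is rewarded, together with the next
    # formula f_{i+1} = $Prog(Gamma_i, f_i).
    for s in states:
        r = rew(s, f)
        f = prog(r, s, f)
        yield r, f
\end{verbatim}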
To decide whether a prefix $\Gamma(i)$ of $\Gamma$ is to be rewarded, $\mbox{Rew}$ first tries progressing the formula $f_i$ through $\Gamma_i$ with the boolean flag set to `false'. If that gives a consistent result, we need not reward $\Gamma(i)$ and we continue without doing so; but if the result is $\bot$, then we know that $\Gamma(i)$ must be rewarded in order for $\Gamma$ to satisfy $f$. In that case, to obtain $f_{i+1}$ we must progress $f_i$ through $\Gamma_i$ again, this time with the boolean flag set to `true'. To sum up, the behaviour corresponding to $f$ is $\{\Gamma(i) \mid \mbox{Rew}(\Gamma_i,f_i)\}$.

To illustrate the behaviour of \$FLTL progression, consider the formula $f = \neg p \mathbin{\mathsf{U}} (p \wedge \$)$, stating that a reward will be received the first time $p$ is true. Let $s$ be a state in which $p$ holds. Then $\mbox{Prog}(\mbox{false},s,f) = \bot \vee (\bot \wedge \neg p \mathbin{\mathsf{U}} (p \wedge \$)) \equiv \bot$. Therefore, since the formula has progressed to $\bot$, $\mbox{Rew}(s,f)$ is true and a reward is received. $\mbox{\$Prog}(s,f) = \mbox{Prog}(\mbox{true},s,f) = \top \vee (\bot \wedge \neg p \mathbin{\mathsf{U}} (p \wedge \$)) \equiv \top$, so the reward formula fades away and will not affect subsequent progression steps. If, on the other hand, $p$ is false in $s$, then $\mbox{Prog}(\mbox{false},s,f) = \bot \vee (\top \wedge \neg p \mathbin{\mathsf{U}} (p \wedge \$)) \equiv \neg p \mathbin{\mathsf{U}} (p \wedge \$)$. Therefore, since the formula has not progressed to $\bot$, $\mbox{Rew}(s,f)$ is false and no reward is received. $\mbox{\$Prog}(s,f) = \mbox{Prog}(\mbox{false},s,f) = \neg p \mathbin{\mathsf{U}} (p \wedge \$)$, so the reward formula persists unchanged for subsequent progression steps. The following theorem states that, under weak assumptions, rewards are correctly allocated by progression:
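The two cases of this example can be reproduced with the Python sketch given after Algorithm 1 (purely illustrative; the tuple encoding of $f$ is our own):
\begin{verbatim}
# f = ~p U (p & $): reward the first time p becomes true.
f = ('until', ('not', ('atom', 'p')),
              ('and', ('atom', 'p'), ('reward',)))

print(rew({'p'}, f), dollar_prog({'p'}, f))
# True ('true',)         -- p holds: reward now, formula fades to top

print(rew(set(), f), dollar_prog(set(), f))
# False ('until', ...)   -- p false: no reward, formula persists as is
\end{verbatim}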

Theorem 1   Let $f$ be reward-normal, and let $\langle f_0, f_1, \ldots \rangle$ be the result of progressing it through the successive states of a sequence $\Gamma$ using the function $\mbox{\$\em Prog}$. Then, provided no $f_i$ is $\bot$, for all $i$, $\mbox{\em Rew}(\Gamma_i,f_i)$ iff $\Gamma(i) \in B_f$.

Proof: See Appendix B. $\Box$
The premise of the theorem is that $f$ never progresses to $\bot$. Indeed, if $f_i = \bot$ for some $i$, it means that even rewarding $\Gamma(i)$ does not suffice to make $f$ true, so something must have gone wrong: at some earlier stage, the boolean $\mbox{Rew}$ was made false where it should have been made true. The usual explanation is that the original $f$ was not reward-normal. For instance $\bigcirc p \rightarrow \$$, which is reward-unstable, progresses to $\bot$ in the next state if $p$ is true there: regardless of $\Gamma_0$, $f_0 = \bigcirc p \rightarrow \$ = \bigcirc \neg p \vee \$$, $\mbox{Rew}(\Gamma_0,f_0) = \mbox{false}$, and $f_1 = \neg p$, so if $p \in \Gamma_1$ then $f_2 = \bot$. However, other (admittedly bizarre) possibilities exist: for example, although $\bigcirc p \rightarrow \$$ is reward-unstable, its substitution instance $\bigcirc \bigcirc \top \rightarrow \$$, which also progresses to $\bot$ in a few steps, is logically equivalent to $\$$ and is reward-normal.

If the progression method were to deliver the correct minimal behaviour in all cases (even in all reward-normal cases), it would have to backtrack on the choice of values for the boolean flags. In the interest of efficiency, we choose not to allow backtracking. Instead, our algorithm raises an exception whenever a reward formula progresses to $\bot$, and informs the user of the sequence which caused the problem. The onus is thus placed on the domain modeller to select sensible reward formulae so as to avoid possible progression to $\bot$. It should be noted that in the worst case, detecting reward-normality cannot be easier than the decision problem for \$FLTL, so it is not to be expected that there will be a simple syntactic criterion for reward-normality. In practice, however, commonsense precautions such as avoiding making rewards depend explicitly on future-tense expressions suffice to keep things normal in all routine cases. For a generous class of syntactically recognisable reward-normal formulae, see Appendix A.
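In the illustrative Python sketch, the reward-unstable formula $\bigcirc p \rightarrow \$$ behaves exactly as described: no reward is allocated at stage 0, and if $p$ holds at stage 1 the formula then progresses to $\bot$, at which point an exception can be raised (the error handling below is our own sketch, not the paper's implementation):
\begin{verbatim}
# (next p) -> $, rewritten as (next ~p) v $
f0 = ('or', ('next', ('not', ('atom', 'p'))), ('reward',))

f1 = dollar_prog(set(), f0)    # Rew is false at stage 0, f1 = ~p
f2 = dollar_prog({'p'}, f1)    # p holds at stage 1, so f2 = bottom
if f2 == FALSE:
    raise RuntimeError("reward formula progressed to bottom; "
                       "it is probably not reward-normal")
\end{verbatim}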