Influence of Reward Types

The type of reward appears to have a stronger influence on performance than the dynamics. This is unsurprising, as the reward type significantly affects the size of the generated MDP: certain rewards increase the size of the minimal equivalent MDP only by a constant number of states or a constant factor, while others increase it by a factor exponential in the length of the formula. Table 1 illustrates this; its third column reports the size of the minimal equivalent MDP induced by the formulae on the left-hand side. A legitimate question is whether there is a direct correlation between this size increase and the (in)appropriateness of the different methods. For instance, we might expect the state-based methods to do particularly well in conjunction with reward types inducing a small MDP, and otherwise badly in comparison with the structured methods. Interestingly, this is not always the case. The last two columns of Table 1 report the fastest and slowest methods over the range of hand-coded domains with $1 \leq n \leq 12$, and the first row already contradicts that expectation. Moreover, although PLTLSTR is fastest in the last row, for larger values of $n$ (not represented in the table) it aborts through lack of memory, unlike the other methods.

Table 1: Influence of Reward Type on MDP Size and Method Performance
type | formula | size | fastest | slowest
first time all $p_i$s | $(\wedge_{i=1}^{n} p_i) \wedge (\neg\circleddash \Diamond^{-} \wedge_{i=1}^{n} p_i)$ | ${\cal O}(1)\,\vert S\vert$ | PLTLSTR(A) | PLTLMIN
$p_i$s in sequence from start state | $(\wedge_{i=1}^{n}\circleddash^{i} p_i) \wedge \circleddash^{n} \neg\circleddash \top$ | ${\cal O}(n)\,\vert S\vert$ | FLTL | PLTLSTR
two consecutive $p_i$s | $\vee_{i=1}^{n-1} (\circleddash p_i \wedge p_{i+1})$ | ${\cal O}(n^{k})\,\vert S\vert$ | PLTLSTR | FLTL
all $p_i$s $n$ times ago | $\circleddash^{n} \wedge_{i=1}^{n} p_i$ | ${\cal O}(2^n)\,\vert S\vert$ | PLTLSTR | PLTLMIN
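The exponential factor in the last row can be understood via the standard history-tracking argument (a sketch of ours, not spelled out in the text; the bit vector $b$ is our notation): to evaluate $\circleddash^{n} \wedge_{i=1}^{n} p_i$ at the current state, an equivalent MDP must remember, for each of the last $n$ steps, whether the conjunction held at that step, since the window must be shifted forward at every transition.

```latex
% b_j records whether the rewarded condition held j steps ago:
\[
  b_j =
  \begin{cases}
    1 & \text{if } \bigwedge_{i=1}^{n} p_i \text{ held } j \text{ steps ago},\\
    0 & \text{otherwise,}
  \end{cases}
  \qquad (b_1,\dots,b_n) \in \{0,1\}^{n}.
\]
```

Up to $2^n$ such history vectors must be distinguished per underlying state, matching the ${\cal O}(2^n)$ factor in the table.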

The most obvious observation arising from these experiments is that PLTLSTR is nearly always the fastest, until it runs out of memory. Perhaps the most interesting results are those in the second row, which expose the inability of methods based on PLTL to deal with rewards specified as long sequences of events. In converting the reward formula to a set of subformulae, these methods lose information about the order of events, which then has to be recovered laboriously by reasoning. $FLTL progression, in contrast, takes the events one at a time, preserving the relevant structure at each step. Further experimentation led us to observe that all PLTL-based algorithms perform poorly where the reward is specified using formulae of the form $\circleddash^{k} f$, $\Diamond^{-}_{k} f$, and $\boxminus_{k} f$ ($f$ was true exactly $k$ steps ago, at some point within the last $k$ steps, or at all of the last $k$ steps, respectively).
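To make the contrast concrete, here is a minimal, hypothetical sketch (not the authors' implementation) of formula progression for a toy future-time fragment with atoms, conjunction, and a "next" operator; the `prog` function and the tuple encoding of formulae are our own illustrative assumptions. Progressing a sequence reward through a trajectory peels off one event per step, while the residual formula preserves the order of the remaining events:

```python
# Minimal formula progression for a fragment with atoms, "and", and "next".
# prog(f, state) returns the formula that must hold at the next step
# for f to hold now, given the current state (a set of true atoms).

def prog(f, state):
    kind = f[0]
    if kind == "atom":                        # ("atom", "p")
        return ("true",) if f[1] in state else ("false",)
    if kind in ("true", "false"):
        return f
    if kind == "and":                         # ("and", f1, f2)
        l, r = prog(f[1], state), prog(f[2], state)
        if ("false",) in (l, r):
            return ("false",)
        if l == ("true",):
            return r
        if r == ("true",):
            return l
        return ("and", l, r)
    if kind == "next":                        # ("next", f1): f1 holds one step later
        return f[1]
    raise ValueError(f"unknown operator {kind}")

# The sequence reward "p1 then p2 then p3": p1 AND next p2 AND next next p3.
seq = ("and", ("atom", "p1"),
              ("and", ("next", ("atom", "p2")),
                      ("next", ("next", ("atom", "p3")))))

# Progress through the trajectory {p1}, {p2}, {p3}: each step consumes
# one event while keeping the order of the rest intact.
f = seq
for s in [{"p1"}, {"p2"}, {"p3"}]:
    f = prog(f, s)
print(f)   # ("true",) -- the sequence was observed in order
```

A PLTL-style decomposition into subformulae would instead track the truth of each subformula separately and reconstruct the ordering by reasoning; progression keeps the ordering in the residual formula for free.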
Sylvie Thiebaux 2006-01-20