The type of reward appears to have a stronger influence on performance
than dynamics. This is unsurprising, as the reward type significantly
affects the size of the generated MDP: certain rewards only make the
size of the minimal equivalent MDP increase by a constant number of
states or a constant factor, while others make it increase by a factor
exponential in the length of the formula. Table 1 illustrates this. The third column reports the size of the minimal
equivalent MDP induced by the formulae on the left hand
side.^{12}
A legitimate question is whether there is a direct correlation between
size increase and (in)appropriateness of the different methods. For
instance, we might expect the state-based methods to do particularly
well in conjunction with reward types inducing a small MDP and
otherwise badly in comparison with structured methods. Interestingly,
this is not always the case. For instance, in Table 1 whose last two columns report the fastest and slowest methods over the
range of hand-coded domains where
, the first row
contradicts that expectation. Moreover, although PLTLSTR is
fastest in the last row, for larger values of (not represented in
the table), it aborts through lack of memory, unlike the other
methods.
Table 1:
Influence of Reward Type on MDP Size and Method Performance
type
formula
size
fastest
slowest
first time all s
PLTLSTR(A)
PLTLMIN
s in sequence from start state
FLTL
PLTLSTR
two consecutive s
PLTLSTR
FLTL
all s times ago
PLTLSTR
PLTLMIN
Figure 12:
Changing the Syntax
The most obvious observations arising out of these experiments is that
PLTLSTR is nearly always the fastest - until it runs out of
memory. Perhaps the most interesting results are those in the second
row, which expose the inability of methods based on PLTL to deal with
rewards specified as long sequences of events. In converting the
reward formula to a set of subformulae, they lose information about
the order of events, which then has to be recovered laboriously by
reasoning. $FLTL progression in contrast takes the events one at a
time, preserving the relevant structure at each step. Further
experimentation led us to observe that all PLTL based algorithms
perform poorly where reward is specified using formulae of the form
,
, and
( has been
true steps ago, within the last steps, or at all of the last
steps).