Discussion

Existing approaches [2,3] advocate the use of PLTL over a finite past to specify non-Markovian rewards. In the PLTL style of specification, we describe the past conditions under which we get rewarded now, while with $FLTL we describe the conditions on the present and future under which future states will be rewarded. While the behaviours and rewards may be the same in each scheme, the naturalness of thinking in one style or the other depends on the case. Letting the kids have a strawberry dessert because they have been good all day fits naturally into a past-oriented account of rewards, whereas promising that they may watch a movie if they tidy their room (indeed, making sense of the whole notion of promising) goes more naturally with $FLTL. One advantage of the PLTL formulation is that it trivially enforces the principle that present rewards do not depend on future states. In $FLTL, this responsibility is placed on the domain modeller. The best we can offer is an exception mechanism to recognise mistakes when their effects appear, or syntactic restrictions. On the other hand, the greater expressive power of $FLTL opens the possibility of considering a richer class of decision processes, e.g. with uncertainty as to which rewards are received (the dessert or the movie) and when (some time next week, before it rains). At any rate, we believe that $FLTL is better suited than PLTL to solving NMRDPs using anytime state-based solution methods. While the PLTLSIM translation could be easily embedded in such a solution method, it loses the structure of the original formulae when considering subformulae individually. Consequently, the expanded state space easily becomes exponentially bigger than the blind-minimal one. This is problematic with the solution methods we consider, because size severely affects their performance in solution quality. The pre-processing phase of PLTLMIN uses PLTL formula regression to find sets of subformulae as potential labels for possible predecessor states, so that the subsequent generation phase builds an MDP representing all and only the histories which make a difference to the way actually feasible execution sequences should be rewarded. Not only does this recover the structure of the original formula, but in the best case, the MDP produced is exponentially smaller than the blind-minimal one. However, the prohibitive cost of the pre-processing phase makes it unsuitable for anytime solution methods. We do not consider that any method based on PLTL and regression will achieve a meaningful relaxed notion of minimality without a costly pre-processing phase. FLTL is an approach based on $FLTL and progression which does precisely that, letting the solution method resolve the tradeoff between quality and cost in a principled way intermediate between the two extreme suggestions above. The structured representation and solution methods targeted by Bacchus et al. [3] differ from the anytime state-based solution methods FLTL primarily aims at, in particular in that they do not require explicit state enumeration at all. Here, non-minimality is not as problematic as with the state-based approaches. In virtue of the size of the MDP produced, the PLTLSTR translation is, as PLTLSIM, clearly unsuitable to anytime state-based methods.9 In another sense, too, FLTL represents a middle way, combining the advantages conferred by state-based and structured approaches, e.g. by PLTLMIN on one side, and PLTLSTR on the other. From the former FLTL inherits a meaningful notion of minimality. As with the latter, approximate solution methods can be used and can perform a restricted dynamic analysis of the reward formulae. In particular, formula progression enables even state-based methods to exploit some of the structure in `$FLTL space'. However, the gap between blind and true minimality indicates that progression alone is insufficient to always fully exploit that structure. There is a hope that PLTLSTR is able to take advantage of the full structure of the reward function, but also a possibility that it will fail to exploit even as much structure as FLTL, as efficiently. An empirical comparison of the three approaches is needed to answer this question and identify the domain features favoring one over the other.
Sylvie Thiebaux 2006-01-20