Existing Approaches

Both existing approaches to NMRDPs [2,3] use a temporal logic of the past (PLTL) to compactly represent non-Markovian rewards, and exploit this compact representation to translate the NMRDP into an MDP amenable to off-the-shelf solution methods. However, they target different classes of MDP representations and solution methods, and consequently adopt different styles of translation.

Bacchus et al. [2] target state-based MDP representations. The equivalent MDP is first generated in its entirety: this involves enumerating all e-states and all transitions between them. It is then solved using traditional dynamic programming methods such as value or policy iteration. Because these methods are extremely sensitive to the number of states, attention is paid to producing a minimal equivalent MDP (one with the fewest states). A first, simple translation, which we call PLTLSIM, produces a large MDP that can be post-processed for minimisation before being solved. Another, which we call PLTLMIN, directly yields a minimal MDP, but relies on an expensive pre-processing phase.

The second approach [3], which we call PLTLSTR, targets structured MDP representations: the transition model, policies, reward and value functions are represented in compact form, e.g. as trees or algebraic decision diagrams (ADDs) [29,12]. For instance, the probability of a given proposition (state variable) being true after the execution of an action is specified by a tree whose internal nodes are labelled with the state variables on whose previous values the given variable depends, whose arcs are labelled with the possible previous values ($\top$ or $\bot$) of these variables, and whose leaves are labelled with probabilities. The translation amounts to augmenting the structured MDP with new temporal variables tracking the relevant properties of state sequences, together with a compact representation of (1) their dynamics, e.g. as trees over the previous values of the relevant variables, and (2) the non-Markovian reward function in terms of the variables' current values. Structured solution methods such as structured policy iteration or the SPUDD algorithm are then run on the resulting structured MDP. Neither the translation nor the solution methods explicitly enumerate the states. We now review these approaches in some detail; the reader is referred to the respective papers for additional information.
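To make the tree-structured representation concrete, the following is a minimal Python sketch, not taken from the papers themselves; all names (`Leaf`, `Node`, `prob_true`, `t_next`, the variables `p`, `q`, `t`, and the probabilities) are illustrative assumptions. It shows a decision-tree CPT giving the probability of a variable being true after an action as a function of the previous values of the variables it depends on, and a temporal variable `t` tracking the PLTL property "p has held at some past state", whose current value suffices to define the (otherwise non-Markovian) reward.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    prob: float               # P(variable = True after the action)

@dataclass
class Node:
    var: str                  # state variable tested at this internal node
    if_true: "Tree"           # subtree when var was True before the action
    if_false: "Tree"          # subtree when var was False before the action

Tree = Union[Leaf, Node]

def prob_true(tree: Tree, prev_state: dict) -> float:
    """Read off the probability that the variable is True after the
    action, by descending the tree on the variables' previous values."""
    while isinstance(tree, Node):
        tree = tree.if_true if prev_state[tree.var] else tree.if_false
    return tree.prob

# Hypothetical CPT for variable 'q' under some action: its next value
# depends only on the previous value of 'p'.
q_tree = Node("p", if_true=Leaf(0.9), if_false=Leaf(0.1))

# Temporal variable t tracking "p has held at some past state".  Its
# dynamics are deterministic and tree-like over previous values:
# t is True now iff p was True previously or t was already True.
def t_next(prev_state: dict) -> bool:
    return prev_state["p"] or prev_state["t"]

# The non-Markovian reward becomes Markovian in the augmented state,
# expressed on the *current* value of the temporal variable alone.
def reward(state: dict) -> float:
    return 5.0 if state["t"] else 0.0

prev = {"p": True, "t": False}
assert prob_true(q_tree, prev) == 0.9   # descend the 'p'-branch
assert t_next(prev) is True             # property now holds forever after
```

In the actual PLTLSTR construction the dynamics of such temporal variables are, of course, represented as trees or ADDs alongside the other state variables rather than as Python functions; the sketch only conveys the shape of the augmentation.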

Sylvie Thiebaux 2006-01-20