Optimality Criteria

We have shown how to construct an MDP from the PPDDL encoding of a planning problem. The planning objective is to maximize the expected reward for the MDP. This objective can be interpreted in different ways, for example as expected discounted reward or expected total reward. Which interpretation is suitable depends on the problem. For process-oriented planning problems (for example, the “Coffee Delivery” problem), discounted reward is typically desirable, while total reward is often the interpretation chosen for goal-oriented problems (for example, the “Bomb and Toilet” problem). PPDDL does not include any facility for enforcing a given interpretation or for specifying a discount factor.
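To illustrate how the two criteria can differ, consider a hypothetical process-oriented problem (not one of the competition problems) in which the agent earns a reward of 1 on every step, for example one successful delivery per step. The sketch below shows that the discounted value converges to 1 / (1 − γ), while the undiscounted total reward grows without bound:

```python
def discounted_reward(per_step_reward, gamma, horizon):
    """Finite-horizon approximation: sum of gamma^t * r for t = 0..horizon-1."""
    return sum(per_step_reward * gamma**t for t in range(horizon))

gamma = 0.9

# Finite-horizon approximations of the infinite-horizon discounted value:
v_100 = discounted_reward(1.0, gamma, 100)
v_1000 = discounted_reward(1.0, gamma, 1000)

# Both approach the closed-form limit 1 / (1 - gamma) = 10.
print(round(v_100, 4), round(v_1000, 4))

# With gamma = 1 (no discounting), the sum just keeps growing with the horizon.
print(discounted_reward(1.0, 1.0, 1000))  # 1000.0
```

This is why, without discounting, the problem itself must be designed so that the optimal expected total reward stays bounded, which motivates the restrictions discussed below.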

For the competition, we used expected total reward as the optimality criterion. Without discounting, some care is required in the design of planning problems to ensure that the expected total reward is bounded for the optimal policy. The following restrictions were imposed on the problems used in the planning competition:

  1. Each problem had a goal statement, identifying a set of absorbing goal states.
  2. A positive reward was associated with transitioning into a goal state.
  3. A negative reward (cost) was associated with each action.
  4. A “done” action was available in all states, which could be used to end further accumulation of reward.

These conditions ensure that the MDP model of a planning problem is a positive bounded model (Puterman, 1994). The only positive reward is for transitioning into a goal state. Since goal states are absorbing (that is, they cannot be left once entered), the maximum value of any state is bounded by the goal reward. Furthermore, the “done” action ensures that in every state there is an action available that guarantees a non-negative future reward.
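The four restrictions can be checked concretely on a toy problem. The sketch below (a hypothetical two-state chain, not one of the competition problems) runs value iteration on an MDP with an absorbing goal, a per-action cost, a positive goal reward, and a zero-reward "done" action; the resulting optimal values are non-negative and bounded by the goal reward:

```python
# Hypothetical goal-oriented MDP: s0 -> s1 -> goal. A "move" action costs 1
# and succeeds with probability 0.5 (otherwise the agent stays put); the
# transition into the goal state earns reward +10. A zero-cost "done" action
# ends further accumulation of reward, guaranteeing non-negative future reward.

GOAL_REWARD = 10.0
ACTION_COST = 1.0
P_SUCCESS = 0.5

def value_iteration(n_iters=1000):
    # Values of the non-absorbing states; the goal (and the state reached by
    # "done") is absorbing, so its future value is 0.
    v = {"s0": 0.0, "s1": 0.0}
    for _ in range(n_iters):
        new = {}
        for s, nxt in (("s0", "s1"), ("s1", "goal")):
            # Reward on entering the goal, otherwise the successor's value.
            succ_value = GOAL_REWARD if nxt == "goal" else v[nxt]
            move = -ACTION_COST + P_SUCCESS * succ_value + (1 - P_SUCCESS) * v[s]
            done = 0.0  # stop accumulating reward
            new[s] = max(move, done)
        v = new
    return v

v = value_iteration()
# Fixed points: V(s1) = -1 + 0.5*10 + 0.5*V(s1)  =>  V(s1) = 8
#               V(s0) = -1 + 0.5*8  + 0.5*V(s0)  =>  V(s0) = 6
print(v)
```

Note that if the success probability were made small enough, the "move" value would drop below zero and the optimal policy would choose "done" instead, which is exactly the mechanism that keeps every state's value non-negative.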

Håkan L. S. Younes