### MDPs

A Markov decision process of the type we consider is a 5-tuple $\langle S, s_0, A, \Pr, R \rangle$, where $S$ is a finite set of fully observable states, $s_0 \in S$ is the initial state, $A$ is a finite set of actions ($A(s)$ denotes the subset of actions applicable in $s \in S$), $\{\Pr(s,a,\cdot) \mid s \in S, a \in A(s)\}$ is a family of probability distributions over $S$, such that $\Pr(s,a,s')$ is the probability of being in state $s'$ after performing action $a$ in state $s$, and $R : S \mapsto \mathbb{R}$ is a reward function such that $R(s)$ is the immediate reward for being in state $s$. It is well known that such an MDP can be compactly represented using dynamic Bayesian networks [18,11] or probabilistic extensions of traditional planning languages, see e.g., [37,45,50]. A stationary policy for an MDP is a function $\pi : S \mapsto A$, such that $\pi(s) \in A(s)$ is the action to be executed in state $s$. The value $V(\pi)$ of the policy at $s_0$, which we seek to maximise, is the sum of the expected future rewards over an infinite horizon, discounted by how far into the future they occur:

$$V(\pi) = \lim_{n \to \infty} E\!\left[\, \sum_{i=0}^{n} \beta^{i} R(\Gamma_i) \;\middle|\; \pi,\ \Gamma_0 = s_0 \right]$$

where $0 \le \beta < 1$ is the discount factor controlling the contribution of distant rewards and $\Gamma_i$ is the random variable denoting the state of the system at stage $i$.
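
To make the definition concrete, here is a minimal sketch in Python of iterative policy evaluation, computing $V(\pi)$ as the fixed point of $V(s) = R(s) + \beta \sum_{s'} \Pr(s, \pi(s), s')\, V(s')$, which the discounted sum above satisfies. The data structures and the toy two-state MDP are illustrative assumptions, not taken from the paper.

```python
def evaluate_policy(states, policy, transition, reward, beta=0.9, tol=1e-8):
    """Iterative policy evaluation for a fixed stationary policy:
    repeatedly apply V(s) <- R(s) + beta * sum_s' Pr(s, pi(s), s') * V(s')
    until the largest change falls below tol."""
    V = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            a = policy[s]  # pi(s), assumed to be in A(s)
            v = reward[s] + beta * sum(
                p * V[s2] for s2, p in transition[(s, a)].items()
            )
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

# Hypothetical two-state MDP, purely for illustration:
states = ["s0", "s1"]
reward = {"s0": 0.0, "s1": 1.0}
policy = {"s0": "go", "s1": "stay"}
transition = {
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},  # Pr(s0, go, .)
    ("s1", "stay"): {"s1": 1.0},             # Pr(s1, stay, .)
}
V = evaluate_policy(states, policy, transition, reward, beta=0.9)
print(V["s0"])  # value of the policy at the initial state s0
```

Since the state and action sets are finite and $\beta < 1$, the update is a contraction, so the loop converges to the unique value of the policy regardless of the initial estimate.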