A Markov decision process of the type we consider is a 5-tuple $\langle S,s_0,A,\Pr ,R\rangle$, where $S$ is a finite set of fully observable states, $s_{0}\in S$ is the initial state, $A$ is a finite set of actions ($A(s)$ denotes the subset of actions applicable in $s\in S$), $\{\Pr(s,a,\bullet) \mid s\!\in\!S, a\!\in\!A(s)\}$ is a family of probability distributions over $S$ such that $\Pr(s,a,s')$ is the probability of being in state $s'$ after performing action $a$ in state $s$, and $R: S \rightarrow \mathbb{R}$ is a reward function such that $R(s)$ is the immediate reward for being in state $s$. It is well known that such an MDP can be compactly represented using dynamic Bayesian networks [18,11] or probabilistic extensions of traditional planning languages, see e.g. [37,45,50]. A stationary policy for an MDP is a function $\pi : S \rightarrow A$ such that $\pi(s)\in A(s)$ is the action to be executed in state $s$. The value $V_\pi$ of the policy at $s_0$, which we seek to maximise, is the expected sum of future rewards over an infinite horizon, discounted by how far into the future they occur:

\[
V_\pi(s_0) \;=\; \lim_{n\rightarrow\infty} \mathbb{E}\bigg[\,\sum_{i=0}^{n} \beta^{i} R(\Gamma_i) \;\Big|\; \pi,\ \Gamma_{0} = s_{0}\bigg]
\]

where $0 \leq \beta < 1$ is the discount factor controlling the contribution of distant rewards.
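Since $0 \leq \beta < 1$, the limit above exists and $V_\pi$ is the unique fixed point of the equation $V_\pi(s) = R(s) + \beta \sum_{s'} \Pr(s,\pi(s),s')\, V_\pi(s')$, which can be computed by iterating this backup to convergence. The following sketch illustrates this for a hypothetical two-state MDP (the states, actions, and numbers are made up for illustration and are not from the text):

```python
# Iterative evaluation of a stationary policy, a minimal sketch assuming the
# state-based reward model of the text: V(s) = R(s) + beta * sum_s' Pr(s,pi(s),s') V(s').
# The two-state MDP below ('stay'/'go' actions) is purely illustrative.

S = [0, 1]                        # states; s0 = 0 is the initial state
R = {0: 0.0, 1: 1.0}              # immediate reward for being in a state
beta = 0.9                        # discount factor, 0 <= beta < 1

# Pr[(s, a)] maps each successor state to its probability
Pr = {
    (0, 'go'):   {0: 0.2, 1: 0.8},
    (0, 'stay'): {0: 1.0},
    (1, 'go'):   {1: 1.0},
    (1, 'stay'): {0: 0.5, 1: 0.5},
}

pi = {0: 'go', 1: 'go'}           # a stationary policy: one action per state

def evaluate(pi, eps=1e-10):
    """Iterate the fixed-point equation until successive iterates differ by < eps."""
    V = {s: 0.0 for s in S}
    while True:
        V_new = {s: R[s] + beta * sum(p * V[t] for t, p in Pr[(s, pi[s])].items())
                 for s in S}
        if max(abs(V_new[s] - V[s]) for s in S) < eps:
            return V_new
        V = V_new

V = evaluate(pi)
# Here V[1] solves V(1) = 1 + 0.9*V(1), i.e. V(1) = 10,
# and V[0] solves V(0) = 0.9*(0.2*V(0) + 0.8*10), i.e. V(0) = 7.2/0.82.
```

The same fixed point could instead be obtained by solving the $|S| \times |S|$ linear system directly; iteration is shown here because it mirrors the limit defining $V_\pi$.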
Sylvie Thiebaux 2006-01-20