next up previous
Next: Computing Optimal Policies by Up: Learning an Optimal Policy: Previous: Q-learning

Model-free Learning With Average Reward

As described, Q-learning can be applied to discounted infinite-horizon MDPs. It can also be applied to undiscounted problems as long as the optimal policy is guaranteed to reach a reward-free absorbing state and the state is periodically reset.

Schwartz [106] examined the problem of adapting Q-learning to an average-reward framework. Although his R-learning algorithm seems to exhibit convergence problems for some MDPs, several researchers have found the average-reward criterion closer to the true problem they wish to solve than a discounted criterion and therefore prefer R-learning to Q-learning [69].

With that in mind, researchers have studied the problem of learning optimal average-reward policies. Mahadevan [70] surveyed model-based average-reward algorithms from a reinforcement-learning perspective and found several difficulties with existing algorithms. In particular, he showed that existing reinforcement-learning algorithms for average reward (and some dynamic programming algorithms) do not always produce bias-optimal policies. Jaakkola, Jordan and Singh [50] described an average-reward learning algorithm with guaranteed convergence properties. It uses a Monte-Carlo component to estimate the expected future reward for each state as the agent moves through the environment. In addition, Bertsekas presents a Q-learning-like algorithm for average-case reward in his new textbook [14]. Although this recent work provides a much needed theoretical foundation to this area of reinforcement learning, many important problems remain unsolved.

Leslie Pack Kaelbling
Wed May 1 13:19:13 EDT 1996