Gradient Reinforcement Learning of POMDP Policy Graphs
Optimal planning for partially observable Markov decision processes (POMDPs) is at least PSPACE-hard, even when a model of the environment is given. To act optimally without a model, a controller must maintain internal state to remember important past events. To date, most model-free methods either assume a finite window of memory over past observations, or use internal-state representations that do not maximize the average reward.
This talk will discuss a method of using policy-gradient reinforcement learning algorithms to learn policies with a finite amount of internal state. We learn the optimal n-node policy graph directly, bypassing the problem of learning value functions over belief states. The internal state can remember events from arbitrarily far in the past, and only events that help maximize the reward will be remembered. An important practical limitation of this method will be discussed, and a possible solution demonstrated on a simplified Heaven-Hell POMDP problem.
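To make the idea concrete, here is a minimal sketch (not the talk's exact algorithm) of a stochastic policy graph trained by a GPOMDP-style policy-gradient update with eligibility traces: the controller's parameters are transition logits between graph nodes and action logits at each node, and both are nudged in the direction of the log-probability gradients, weighted by reward. All names, sizes, and the toy environment are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
N_NODES, N_OBS, N_ACTIONS = 2, 2, 2            # illustrative sizes

phi = np.zeros((N_NODES, N_OBS, N_NODES))      # internal-state transition logits
theta = np.zeros((N_NODES, N_OBS, N_ACTIONS))  # action-selection logits

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def controller_step(node, obs):
    """Sample the next graph node and an action; also return the
    log-probability gradients needed for the eligibility traces."""
    p_node = softmax(phi[node, obs])
    next_node = int(rng.choice(N_NODES, p=p_node))
    g_phi = np.zeros_like(phi)
    g_phi[node, obs] -= p_node                 # d/dphi log p(next_node)
    g_phi[node, obs, next_node] += 1.0

    p_act = softmax(theta[next_node, obs])
    action = int(rng.choice(N_ACTIONS, p=p_act))
    g_theta = np.zeros_like(theta)
    g_theta[next_node, obs] -= p_act           # d/dtheta log p(action)
    g_theta[next_node, obs, action] += 1.0
    return next_node, action, g_phi, g_theta

# Toy stand-in environment: observations are random bits, and the
# reward is 1 when the chosen action matches the observation.
beta, alpha = 0.5, 0.1                         # trace decay, step size
z_phi, z_theta = np.zeros_like(phi), np.zeros_like(theta)
node = 0
for t in range(2000):
    obs = int(rng.integers(N_OBS))
    node, action, g_phi, g_theta = controller_step(node, obs)
    reward = float(action == obs)
    z_phi = beta * z_phi + g_phi               # discounted eligibility traces
    z_theta = beta * z_theta + g_theta
    phi += alpha * reward * z_phi              # gradient ascent on reward
    theta += alpha * reward * z_theta
```

Note that no belief state or value function appears anywhere: the gradient is estimated directly from sampled trajectories, which is the appeal of learning the policy graph directly.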
Time allowing, I will also discuss a technique for adding prior knowledge to reduce the variance of gradient estimates in any RL algorithm that uses an eligibility trace. Such techniques are especially important for policy-gradient algorithms, which are often criticized for slow convergence due to high-variance gradient estimates.
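For readers unfamiliar with the mechanism in question, here is a background sketch of a tabular TD(lambda) update, showing the eligibility trace that such variance-reduction techniques operate on; the prior-knowledge trick from the talk itself is not reproduced here, and all parameter values are illustrative.

```python
import numpy as np

N_STATES = 5
gamma, lam, alpha = 0.9, 0.8, 0.1   # discount, trace decay, step size
V = np.zeros(N_STATES)              # state-value estimates
z = np.zeros(N_STATES)              # eligibility traces

def td_lambda_step(s, r, s_next, done):
    """One accumulating-trace TD(lambda) update."""
    delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
    z[:] *= gamma * lam             # decay every trace
    z[s] += 1.0                     # mark the visited state as eligible
    V[:] += alpha * delta * z       # spread the TD error over recent states
    if done:
        z[:] = 0.0                  # reset traces between episodes
```

The trace `z` is what lets a single reward update the values of states visited many steps earlier, and it is also the main source of the variance that the talk's technique targets.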
Doug is a third-year Ph.D. student at the Australian National University, working under Jonathan Baxter at WhizBang! Labs. He is visiting CMU for a year to work with Jonathan and sponge knowledge and pizza off CMU folk. His main research interest is internal-state RL algorithms and their application to speech processing and other practical problems. His secondary interest is in large, high-performance neural nets, especially on Beowulf-style clusters.