Policy Search Methods for Robustness and Exploration in Reinforcement Learning

Drew Bagnell


In many reinforcement learning problems of interest, particularly applications to robotics and other physically-embodied systems, it is of paramount importance that controllers act "safely" during the learning process. Traditional model-based reinforcement learning algorithms make a certainty equivalence assumption on their learned models, and calculate optimal policies for a maximumlikelihood Markovian model. In our work, we consider algorithms that evaluate and synthesize controllers under probability distributions of Markovian models- explicitly taking into account the uncertainty introduced when learning a model from real data. Without paying tribute to this uncertainty, optimal control synthesis algorithms that lie at the heart of model-based reinforcement learning often generate highly aggressive controllers that perform quite poorly, and potentially catastrophically [2], on the actual dynamics of the system.

Unfortunately, maintaining distributions over models induces a learning problem that is inherently non-Markovian, even when the underlying system is Markovian. As this model uncertainty leads to a partially-observable learning problem, it is very difficult to apply well-founded methods for learning safe controllers via value-function approximation. We thus focus on synthesizing structured controllers with a class of policy search methods, related to Ng and Jordan's [1] Pegasus algorithm, which naturally generalizes to this problem.

By explicitly modeling uncertainty due to learning, we can consider robust learning criteria, where synthesis algorithms generate controllers that, with high probability, will achieve acceptable performance. We show this consideration of model confidence can lead to controllers that are more robust than those resulting from the certainty-equivalence assumption. We validate the approach by demonstrating the presented algorithm flying Carnegie Mellon's autonomous helicopter.

Hierarchical Policy Constraint for POMDP Planning and Execution

Joelle Pineau


POMDPs provide a useful framework for decision-making in the presence of uncertainty, however finding an optimal policy is computationally infeasible for large problems. We will present a hierarchical approach to POMDPs which takes advantage of structure in the domain to find policies for complex tasks. The idea is to divide the action set along domain-specific task decompositions. A POMDP problem is thereby broken-down into a hierachically related set of smaller problems (i.e. subtasks). A value function is learned locally (i.e. related to the local set of actions) for each subtask, thereby yielding local policies.

This extends the work on hierarchical reinforcement learning (Dietterich, NIPS'99) to the POMDP framework; the task decomposition and state abstraction are similar, however we propose significantly different planning and execution algorithms to ensure consistent belief states between subtasks, and to accomodate partial observability of subtask completion. This work is applied to the problem of dialogue management, for which we will present results showing that planning time was significantly reduced compared to a conventional POMDP, while the policy performed much better than when using a greedy heuristic approximation.

Charles Rosenberg
Last modified: Mon Nov 20 21:55:06 EST 2000