Machine Learning Thesis Proposal
- Gates Hillman Centers
- CHRISTOPH DANN
- Ph.D. Student
- Machine Learning Department
- Carnegie Mellon University
Strategic Exploration in Reinforcement Learning - New Algorithms and Learning Guarantees
Reinforcement learning (RL) focuses on an essential aspect of intelligent behavior – how an agent can learn to make good decisions given experience and rewards in a stochastic world. Yet popular RL algorithms that have enabled exciting successes in domains with good simulators (Go, Atari, etc.) still often fail to learn in other domains because they rely on simple heuristics for exploration. These failures provide additional empirical motivation for a central question in RL: how to design algorithms that learn in a provably efficient manner through strategic exploration in any considered domain. This thesis provides new algorithms and theory that enable good performance with respect to existing theoretical frameworks for evaluating RL algorithms (specifically, probably approximately correct guarantees) and introduces new, stronger evaluation criteria that may be of particular interest as RL is applied to more real-world problems.
For the first line of work, on probably approximately correct (PAC) RL algorithms, we introduce an algorithm that has a substantially better PAC sample complexity bound than prior methods for episodic tabular domains. Lookup tables are required by most sample-efficient and computationally tractable algorithms but cannot represent many practical domains. We therefore also present a new RL algorithm that can learn a good policy in environments with high-dimensional observations and hidden deterministic states; unlike its predecessors, this algorithm provably explores in a manner that is not only statistically but also computationally efficient, assuming access to function classes with efficient optimization oracles.
To make progress, it is critical to have the right measures of success. While empirical demonstrations are quite clear, we find that, for theoretical properties, two of the most commonly used learning frameworks, PAC guarantees and regret guarantees, each allow undesirable algorithm behavior (e.g., ignoring new observations that could improve the policy). We present a new, stronger learning framework called Uniform-PAC that unifies the existing frameworks and prevents undesirable algorithm properties. One limitation of all existing learning frameworks is that, for any particular episode, we do not know how well the algorithm will perform. To address this, we introduce the IPOC framework, which requires algorithms to provide a certificate before each episode bounding how suboptimal the current policy can be. Such certificates may be of substantial interest in high-stakes scenarios, where an organization may wish to track or even pause an online RL system should the bound on expected performance drop below a required level.
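As a rough illustration of how these frameworks relate, the guarantees can be sketched as follows. The notation is an assumption introduced here and does not appear in the abstract: $\Delta_k$ denotes the optimality gap of the policy played in episode $k$, $N_\varepsilon$ counts the episodes whose policy is more than $\varepsilon$-suboptimal, $\varepsilon_k$ is the certificate issued before episode $k$, and $F$ stands for a polynomial bound that also depends on the problem parameters. The thesis itself may state these conditions differently.

```latex
% Assumed notation (not from the abstract):
%   \Delta_k = V^\star - V^{\pi_k}   optimality gap of the policy in episode k
%   N_\varepsilon = \sum_{k=1}^{\infty} \mathbf{1}\{\Delta_k > \varepsilon\}

% (\varepsilon,\delta)-PAC: the accuracy \varepsilon is fixed in advance.
\mathbb{P}\bigl( N_\varepsilon > F_{\mathrm{PAC}}(1/\varepsilon, \log(1/\delta)) \bigr) \le \delta

% Uniform-PAC: one high-probability event covers all accuracies at once.
\mathbb{P}\bigl( \exists\, \varepsilon > 0 :\;
    N_\varepsilon > F_{\mathrm{UPAC}}(1/\varepsilon, \log(1/\delta)) \bigr) \le \delta

% Policy certificates (IPOC-style validity condition): every certificate
% \varepsilon_k, issued before episode k, upper-bounds the true gap.
\mathbb{P}\bigl( \forall k :\; \Delta_k \le \varepsilon_k \bigr) \ge 1 - \delta
```

Under this sketch, the only difference between the first two conditions is that the quantifier over $\varepsilon$ sits inside the probability in the Uniform-PAC case, so a single run must keep improving for every accuracy level rather than stopping once a pre-specified $\varepsilon$ is reached. The last line captures only the validity of the certificates; a full guarantee would also have to bound how large the certificates can be.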
Emma Brunskill (Chair, Stanford University)
Benjamin Recht (UC Berkeley)
Benjamin Van Roy (Stanford University)
Rémi Munos (DeepMind)