Thursday 6/23/94, 1:30, WeH 4601
PAC Analysis of Reinforcement Learning
Nicolas Fiechter, University of Pittsburgh
Abstract:
In this paper we propose a new formal model for studying reinforcement
learning, based on Valiant's PAC framework.
In our model the learner does not have direct access to every state of
the environment. Instead, every sequence of experiments starts in a
fixed initial state and the learner is provided with a ``reset''
operation that interrupts the current sequence of experiments and
starts a new one (from the initial state).
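This access model can be made concrete with a toy resettable environment. The sketch below is purely illustrative: the class and method names (`ResettableEnv`, `reset`, `step`) and the simple chain dynamics are assumptions for exposition, not part of the paper.

```python
class ResettableEnv:
    """Toy chain environment illustrating the access model: every
    sequence of experiments starts in a fixed initial state, and
    reset() interrupts the current sequence and starts a new one.
    All names and dynamics here are illustrative assumptions."""

    def __init__(self, n_states=5):
        self.n_states = n_states
        self.initial_state = 0
        self.state = self.initial_state

    def reset(self):
        # Abort the current sequence and return to the fixed initial state.
        self.state = self.initial_state
        return self.state

    def step(self, action):
        # Action 0 moves right along the chain; any other action stays put.
        # Reward 1.0 is received only while in the last state.
        if action == 0 and self.state < self.n_states - 1:
            self.state += 1
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        return self.state, reward
```

Note that the learner never chooses its start state: the only way to leave the current trajectory is `reset()`, which always returns to the same initial state.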
We do not require the agent to learn the optimal policy but only a
good approximation of it with high probability. More precisely, we
require the learner to produce a policy whose expected value from the
initial state is $\epsilon$-close to that of the optimal policy, with
probability no less than $1-\delta$.
For this model, we describe an algorithm that produces such an
$(\epsilon,\delta)$-optimal policy, for any environment, in time
polynomial in $N$, $K$, $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$ and
$r_{\max}$, where $N$ is the number of states of the environment, $K$ is
the maximum number of actions in a state, $\gamma$ is the discount factor
and $r_{\max}$ is the maximum reward on any transition.
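One ingredient behind such $(\epsilon,\delta)$ guarantees is that a policy's expected discounted return from the initial state can itself be estimated to within $\epsilon$ with probability at least $1-\delta$, using only resets and a polynomial number of truncated rollouts. The sketch below shows this Monte Carlo step via a (conservative) Hoeffding bound; it is an illustration under assumed interfaces, not the paper's algorithm.

```python
import math

def estimate_value(env, policy, gamma, epsilon, delta, r_max):
    """Monte Carlo estimate of a policy's expected discounted return
    from the initial state, using reset() between rollouts.
    Illustrative sketch, not the paper's algorithm: the env interface
    (reset() -> state, step(action) -> (state, reward)) is assumed.
    Returns an estimate within epsilon of the true value with
    probability at least 1 - delta."""
    # Truncate each rollout once the discounted tail r_max * gamma^H / (1 - gamma)
    # is at most epsilon / 2; solving for H gives the horizon below.
    horizon = max(1, math.ceil(
        math.log(epsilon * (1 - gamma) / (2 * r_max)) / math.log(gamma)))
    # Returns are bounded by v_max, so a Hoeffding bound gives an
    # (epsilon/2)-accurate sample mean with probability >= 1 - delta.
    # (This sample count is conservative.)
    v_max = r_max / (1 - gamma)
    n = math.ceil(2 * v_max ** 2 * math.log(2 / delta) / (epsilon / 2) ** 2)
    total = 0.0
    for _ in range(n):
        s = env.reset()          # every rollout starts at the initial state
        ret, disc = 0.0, 1.0
        for _ in range(horizon):
            s, r = env.step(policy(s))
            ret += disc * r
            disc *= gamma
        total += ret
    return total / n
```

Both the horizon and the sample count are polynomial in $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$ and $r_{\max}$, which is the shape of the bound stated in the abstract.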