Thursday 6/23/94, 1:30, WeH 4601

PAC Analysis of Reinforcement Learning

Nicolas Fiechter, University of Pittsburgh

Abstract: In this paper we propose a new formal model for studying reinforcement learning, based on Valiant's PAC framework. In our model the learner does not have direct access to every state of the environment. Instead, every sequence of experiments starts in a fixed initial state, and the learner is provided with a ``reset'' operation that interrupts the current sequence of experiments and starts a new one from the initial state. We do not require the agent to learn the optimal policy, only a good approximation of it with high probability. More precisely, we require the learner to produce a policy whose expected value from the initial state is $\epsilon$-close to that of the optimal policy, with probability no less than $1-\delta$. For this model, we describe an algorithm that produces such an ($\epsilon$,$\delta$)-optimal policy, for any environment, in time polynomial in $N$, $K$, $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$ and $r_{\max}$, where $N$ is the number of states of the environment, $K$ is the maximum number of actions in a state, $\gamma$ is the discount factor and $r_{\max}$ is the maximum reward on any transition.
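As a sketch of the criterion, using standard value-function notation that is assumed here rather than taken from the announcement: writing $V^{\pi}(s_0)$ for the expected discounted return of policy $\pi$ from the fixed initial state $s_0$, $\pi^*$ for the optimal policy, and $\hat{\pi}$ for the policy output by the learner, the ($\epsilon$,$\delta$)-optimality requirement reads

\[
  \Pr\bigl[\, V^{\pi^*}(s_0) - V^{\hat{\pi}}(s_0) \le \epsilon \,\bigr] \;\ge\; 1 - \delta ,
\]

with the learner's total running time bounded by a polynomial in $N$, $K$, $1/\epsilon$, $1/\delta$, $1/(1-\gamma)$ and $r_{\max}$.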