% outstanding issues/questions:
%  we don't discuss the implicit assumption Plan must work
% on ``subworlds''
%  fix LOS figure such that x's are more visible
\documentclass{article}
\usepackage{graphics}
\usepackage{icml2003}
\usepackage{times}
\usepackage{amsmath}
\newtheorem{thm}{Theorem}[section]
\newtheorem{lem}[thm]{Lemma}
\newcommand{\qed}{\rule{5pt}{7pt}\bigskip}
\title{Exploration in Metric State Spaces}
\begin{document}
\twocolumn[
\icmltitle{Exploration in Metric State Spaces}
\icmlauthor{Sham Kakade}{sham@gatsby.ucl.ac.uk}
\icmladdress{Gatsby Unit \,
University College London \,
London, England}
\icmlauthor{Michael Kearns}{mkearns@cis.upenn.edu}
\icmladdress{Department of Computer and Information Science\,
University of Pennsylvania,
Philadelphia, Pennsylvania}
\icmlauthor{John Langford}{jcl@cs.cmu.edu}
\icmladdress{Math.\ Sci.\ Dept.\,
I.B.M. T.~J.~Watson Research Center,
Yorktown Heights, NY 10598}
\vskip 0.3in
\vskip 0.3in
]
\begin{abstract}
We present Metric-$E^3$, a provably near-optimal algorithm for
reinforcement learning in Markov decision processes in which there is
a natural {\em metric\/} on the state space that allows the
construction of accurate local models. The algorithm is a
generalization of the $E^3$ algorithm of Kearns and Singh, and assumes
a black box for approximate planning. Unlike the original $E^3$,
Metric-$E^3$ finds a near-optimal policy in an amount of time that does
not directly depend on the size of the state space, but instead
depends on the {\em covering\/} number of the state space.
Informally, the covering number is the number of neighborhoods
required for accurate local modeling.
\end{abstract}
\vspace{0.15in}
\section{Introduction, Motivation, and Background}
\label{section:intro}
Recent years have seen the introduction and study of a number of
representational approaches to Markov Decision Processes (MDPs) with
very large or infinite state spaces. These include the broad family
known as function approximation, in which a parametric functional form
is used to approximate value functions, and direct models of the
underlying dynamics and rewards, such as factored or Dynamic Bayes Net
(DBN) MDPs. For each of these approaches, there are now at least
plausible heuristics, and sometimes formal analysis, for problems of
planning \cite{Approxplan} and learning.
Less studied and more elusive has been the problem of {\em global
exploration\/}, or managing the exploration-exploitation
tradeoff. Here the goal is to learn a (globally) near-optimal
$T$-step policy in an amount of time that has no direct dependence on
the state space size, but only on the complexity of the chosen
representation. Global exploration was first solved in the
deterministic finite-state setting \cite{sven,thrun}, and then progress
slowed. It is only recently that provably correct and efficient
algorithms for exploration in small nondeterministic state spaces
became known (such as the $E^3$ algorithm \cite{E3} and its
generalizations \cite{Rmax}). This approach has been generalized to
factored MDPs under certain assumptions \cite{DBN}, but there remain
many unresolved questions regarding efficient exploration in large
MDPs, including whether model-based approaches are required~\footnote{Recent
work on gradient methods for approximate planning \cite{grad1,grad2}
does not address exploration in the strong sense of interest here, but
instead examines convergence to policies that small amounts of {\em
random\/} exploration cannot improve (local optimality). In general,
effective exploration may require the careful {\em planning\/} of a
long sequence of steps that might never be encountered by a random
walk. See \cite{CPI} for further discussion.}.
In general, it is intuitively clear that any general exploration
algorithm has a polynomial dependence on the size of the state space (see
\cite{sham} for a more formal statement). Hence, to obtain near-optimal
algorithms with sublinear dependence on the size of the state space,
further assumptions and restrictions on the MDP must be made. The
factored $E^3$ algorithm \cite{DBN} considers one restriction where
the MDP is represented in terms of a factored graph ({\em i.e.\/}, a
dynamic Bayes net). Here, the number of steps the agent must act in
the MDP in order to obtain a $T$-step near-optimal policy is
polynomial in the representation size of the factored graph.
In this work, we examine the problem of exploration in environments in
which there is a {\em metric\/} on state-action pairs with the
property that ``nearby'' state-actions can be useful in predicting
state-action dynamics. Such conditions are common for navigation or
control problems, but may be more broadly applicable as well. Given
sufficient ``nearby'' experience to predict outcomes, we have an
implicit nonparametric model of the dynamics in a neighborhood of the
state-action space. These implicit models can be ``pieced together''
and used for planning on a subset of the global space.
One natural approach in the large state space setting is aggregate-state
methods, which group states together and assume Markov dynamics
on these aggregate states \cite{Dean,aggregation}. Clearly, this
approach is useful only if a compatible set of aggregate states can be
found which preserves the Markov dynamics on these aggregate states and
where the size of the aggregate state space is considerably smaller than
that of the underlying state space. A benefit of this approach is
that planning under this model can be done with traditional dynamic
programming approaches on the aggregate states. Unfortunately, in many
navigation domains, it appears that nontrivial state aggregation often
destroys the Markov assumption required for planning in aggregate
state methods (and we provide one such example later).
The local modeling assumption is \emph{not} equivalent to an aggregate
state method since we do not group any states together and do not
assume a Markov property holds for aggregate states. In fact, under
this assumption (as in factored $E^3$), the size of the state space is
not diminished in any real way, unlike in aggregate state
methods. Hence, the computational problem of planning remains with us
in full force. As with factored $E^3$, we assume a ``black box'' planning
algorithm to abstract away the difficulty of planning from that of
exploration. This assumption is not meant to trivialize the planning
problem, but is made in order to isolate and quantify the difficulty
of exploration.
Given the ability to plan, we prove that the local modeling assumption
implies that the time required for global exploration depends only on the
metric resolution and {\em not\/} on the size of the state space.
More precisely, we give a generalization of the $E^3$ algorithm for
metric MDPs which learns a (globally) approximately optimal $T$-step
policy in time depending only on the {\em covering numbers\/}, a
natural and standard notion of the resolution required for local
modeling under the metric.
Metric MDPs are a natural complement to more direct parametric
assumptions on value functions and dynamics. These results provide
evidence that, as for factored environments \cite{DBN}, effective
exploration mechanisms are available for metric MDPs.
\vspace{0.15in}
\section{Definitions and Assumptions}
\label{sections:defs}
\vspace{0.15in}
We work in the standard MDP setting. Let \( P(s'|a,s) \) be the
probability of transitioning to state \( s' \) given an action \( a \) and state \( s
\). Let \( R(s) \) be the reward received in state $s$. For
simplicity, assume that all rewards are deterministic and fall in the
interval $[0,1]$. Define \( V_{M}(\pi,s)\equiv E_{(s_{1},s_{2},\ldots,
s_{T})\sim \pi,s,M}\left[\frac{1}{T}\sum_{t=1}^{T}R(s_{t})\right] \) to be the
average reward received over \( T \) steps starting from state \( s \)
while acting under $\pi$ in MDP $M$.
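When only sampling access to $M$ is available, $V_{M}(\pi,s)$ can be approximated by Monte Carlo rollouts. A minimal sketch in Python (the \texttt{step} and \texttt{policy} callables are illustrative stand-ins for $M$ and $\pi$, and reward is accrued per sampled transition):

```python
def estimate_value(step, policy, s0, T, n_rollouts):
    """Monte Carlo estimate of V_M(pi, s0): the average per-step reward
    over T steps, averaged across n_rollouts independent trajectories.

    `step(s, a)` samples (next_state, reward) from the MDP's dynamics;
    `policy(s)` returns an action.  Both are illustrative stand-ins for
    the MDP M and the policy pi of the text."""
    total = 0.0
    for _ in range(n_rollouts):
        s = s0
        for _ in range(T):
            a = policy(s)
            s, r = step(s, a)
            total += r
    return total / (T * n_rollouts)
```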
We first formalize the assumption that there is a notion of distance
that permits local modeling of dynamics. Thus, let $d((s',a'),(s,a))$
measure the ``distance'' between two state-action pairs. The results
require that this metric obey $d((s,a),(s,a)) = 0$ for all $(s,a)$,
and symmetry (\emph{i.e.,} $d((s,a),(s',a')) = d((s',a'),(s,a))$ for
all $(s,a),(s',a')$), but they do \emph{not} require the triangle
inequality. This is fortunate since demanding the triangle inequality
limits the applicability of the notion in several natural scenarios.
Let $t_{\mbox{metric}}$ denote the time required to evaluate the
metric.
We now provide a standard definition of coverings under a metric. An
{\em $\alpha$-cover\/} is a set $C$ of state-action pairs with the
property that for any $(s,a)$, there exists $(s',a') \in C$ such that
$d((s,a),(s',a')) \leq \alpha$. Let $N(\alpha)$ be the size of the
\emph{largest minimal} $\alpha$-cover---that is, the largest
$\alpha$-cover $C$ such that the removal of any $(s,a)$ would render
$C$ no longer a cover.
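To make the definition concrete, a minimal $\alpha$-cover of a finite sample of state-action pairs can be built greedily: keep any pair not already within $\alpha$ of a kept pair. A sketch in Python (the points and metric used below are purely illustrative):

```python
def greedy_cover(points, d, alpha):
    """Greedily build a minimal alpha-cover of `points` under metric `d`:
    keep a point only if no previously kept point is within alpha of it.
    Every point then has a cover element within alpha, and each kept
    element covers its own location and no other kept element's, so
    removing any element would leave the cover incomplete (minimality)."""
    cover = []
    for p in points:
        if all(d(p, c) > alpha for c in cover):
            cover.append(p)
    return cover
```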
Our first assumption is that the metric permits local modeling of
dynamics of an MDP $M$ with transition model $P$ and reward function
$R$:
\textbf{Local Modeling Assumption.} There exists an algorithm
\emph{Model} for the MDP $M$ such that, for any $(s,a)$, if
\emph{Model} is given $m$ transitions $(s',a')\rightarrow s''$ and
rewards $R(s')$ distributed in accordance with $M$ and in which all
$d((s,a),(s',a')) \leq \alpha$, then \emph{Model} outputs a state
$\hat{s} \sim \hat{P}(\hat{s}|s,a)$ and a reward $\hat{R}$, where
$\sum_{\hat{s}}|\hat{P}(\hat{s}|s,a) - P(\hat{s}|s,a)|\leq
\alpha$ and $|R(\hat{s}) - \hat{R}| \leq \alpha$. Let
$t_{\mbox{model}}$ be the maximum running time of \emph{Model}.
Thus, with a sufficient number $m$ of local state-action experiences,
\emph{Model} can form an accurate approximation of the local
environment. Note that there is {\em no\/} requirement that a {\em
destination\/} state $\hat{s}$ be in the neighborhood of $(s,a)$---we
ask only that nearby state-actions permit generalization in
next-state distributions, not that these distributions be on nearby
states. The next subsection provides natural examples where the Local
Modeling Assumption can be met, but we expect there are many rather
different ones as well.
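One illustrative way \emph{Model} might be instantiated (nothing in the analysis requires this particular form) is the empirical model that resamples from the $m$ nearby observed transitions:

```python
import random

def empirical_model(experiences, rng=random):
    """A candidate instantiation of Model: given m experiences
    (s', a', s'', R(s'')) gathered within alpha of some (s, a), sample
    a next state uniformly from the observed outcomes s'' and report
    the average observed reward.  If nearby dynamics really do
    resemble those at (s, a), this empirical distribution is close to
    P(.|s, a) in the sense of the Local Modeling Assumption."""
    outcome = rng.choice([s2 for (_, _, s2, _) in experiences])
    reward = sum(r for (_, _, _, r) in experiences) / len(experiences)
    return outcome, reward
```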
In addition to an assumption about the ability to build local
(generative) models, we need an assumption about the ability to use
such models in planning.
\textbf{Approximate Planning Assumption.} There exists an algorithm,
\emph{Plan}, which given a generative model for an unknown MDP $M$ and
a state $s$, returns a policy \( \pi \) whose average reward \( V
_{M}(\pi,s) \) satisfies $V_{M}(\pi,s)>V_{M}(\pi^{*},s)-\beta$,
where \( \pi^{*} \) is the optimal $T$-step policy from $s$. Let
$t_{\mbox{plan}}$ upper bound the running time of \emph{Plan} and
$c_{\mbox{gen}}$ upper bound the number of calls it makes to the generative model.
Note that the Local Modeling Assumption does not reduce the state
space size, so for an arbitrary and large MDP, great computational
resources may be required to meet the Approximate Planning Assumption.
The purpose is not to falsely diminish the difficulty of this task,
but to abstract it away from the problem of exploration-exploitation.
The same approach was necessary in analyzing factored-$E^3$.
There are at least three broad scenarios where this assumption might
be met. The first is settings where specialized planning heuristics
can do approximate planning due to strong parametric constraints on
the state dynamics. For example, the recent work on planning
heuristics for factored MDPs is of this form. The second is the {\em
sparse sampling\/} \cite{Sparse} approach, in which it has been shown
that the Approximate Planning Assumption can in fact be met for
arbitrary finite-action MDPs by a policy that uses a generative model
as a subroutine. Here the sample complexity $c_{\mbox{gen}}$ is
exponential in $T$ per state visited (see \cite{Sparse}), but has {\em
no\/} dependence on the state space size. The third setting requires
a regression algorithm that is capable of accurately estimating the
value of a given policy. This algorithm can be used iteratively to
find a near-optimal policy \cite{CPI}.
At a high level, then, we have introduced the notion of a metric over
state-actions, an assumption that this metric permits the construction
or inference of local models, and an assumption that such models
permit planning. We believe these assumptions are broadly consistent
with many of the current proposals on large state spaces. We now
provide an example that demonstrates the role of covering numbers, and
then show that these assumptions are sufficient for solving the
exploration-exploitation problem in time depending not on the size of
the state space, but on the (hopefully much smaller) covering numbers
under the metric.
\subsection{An Example}
We can imagine at least two natural scenarios in which the Local
Modeling Assumption might be met. One of these is where there is
sufficient sensor information and advance knowledge of the expected
effects of actions that the local modeling assumption can be satisfied
even with $m=1$. As a simple example, people can typically
predict the approximate effects of most physical actions available to
them immediately upon entering a room and seeing its layout and
content (e.g., if I go left I will exit through that door; if I go
straight I will hit that wall). They could not make such predictions
for unfamiliar distant rooms. Consider the MDP where the state space
is the Euclidean maze world shown in Figure \ref{figmaze}.(a), and
where the agent is equipped with a vision sensor. In this world, it is
plausible that the local dynamics can be predicted at any ``seen''
location. To apply this analysis, we must first specify a metric. The
obvious choice is \( d_{\textrm{sight}}((s,a),(s',a'))=0 \) if there
exists line-of-sight between \( s \) and \( s' \), and \( \infty \)
otherwise. Note that this metric satisfies symmetry, but not the
triangle inequality (which would be somewhat unnatural in this
setting). For any $\alpha \geq 0$, the covering number $N(\alpha)$ is
the maximum number of points which can be positioned in the space so
that no pair has line-of-sight. One maximal set is given by the dots
in Figure \ref{figmaze}.(b). Note that even though this is a continuous
state space, the covering number is much smaller, and naturally
determined by the geometric properties of the domain.
\begin{figure}
{\centering (a)\resizebox*{0.27\columnwidth}{!}{\includegraphics{maze.eps}}
(b)\resizebox*{0.27\columnwidth}{!}{\includegraphics{cover.eps}}
(c)\resizebox*{0.27\columnwidth}{!}{\includegraphics{metric_cover_2.eps}}
}
\caption{\label{figmaze} (a) A maze world. (b) A largest minimal cover
for the line-of-sight metric. (c) A largest minimal cover for the
line-of-sight + Euclidean distance metric. }
\end{figure}
It is unrealistic to assume that local dynamics are modeled as well at
distant locations as at nearby locations, which implies that modeling
error grows with distance. In this case, a reasonable alternative is to
define \(
d((s,a),(s',a'))=d_{\textrm{sight}}((s,a),(s',a'))+c\,d_{\textrm{euclidean}}((s,a),(s',a'))\),
where \( c \) is a constant controlling the rate of modeling error
with Euclidean distance. Using this metric, the covers shown in Figure
\ref{figmaze}.(c) might naturally arise. Note that (in general) we
are free to use actions as well as states in defining the metric.
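In a two-dimensional maze, this combined metric can be sketched directly, approximating line-of-sight by testing sample points along the segment between the two states (the coordinate representation, the \texttt{blocked} predicate, and the constants below are all illustrative):

```python
import math

def d_combined(s, t, blocked, c=0.1, samples=100):
    """d_sight + c * d_euclidean for points s, t in a 2-D maze.
    `blocked(x, y)` says whether a point lies inside a wall.
    Line-of-sight is approximated by testing `samples` evenly spaced
    points on the segment s-t; if any is blocked, d_sight is infinite.
    The metric is symmetric but need not satisfy the triangle
    inequality, matching the requirements of the text."""
    (x0, y0), (x1, y1) = s, t
    for i in range(samples + 1):
        u = i / samples
        if blocked(x0 + u * (x1 - x0), y0 + u * (y1 - y0)):
            return math.inf  # no line of sight
    return c * math.hypot(x1 - x0, y1 - y0)
```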
The above examples are applicable to the $m=1$ case of the Local
Modeling Assumption. The second natural case is the more general
``learning'' setting, in which the next-state dynamics permit some
parameterization that is smooth with respect to the distance metric,
thus allowing a finite sample of an environment to provide enough data
to fit a parametric next-state distribution for the neighborhood. For
instance, if reward appeared stochastically in some region, it might
be necessary to visit nearby states a number of times before this
distribution is learned. Alternatively, the dynamics could be
different in different parts of the state space. For instance, a skier
moving down a hill has dynamics dependent on the terrain conditions,
such as slope, snow type, and other factors.
\begin{figure}
{\centering
\resizebox*{0.9\columnwidth}{!}{\includegraphics{instability.eps}}
}
\caption{\label{fignoagg} An example showing how simple state space
aggregation does not work because the precise location within the
aggregate state $C_1$ influences the next (aggregate) state outcome of
an action (to $C_2$ or $C_3$).}
\end{figure}
Incidentally, Figure \ref{fignoagg} illustrates the reason why
standard state space aggregation techniques \cite{Dean} do not work
here. In particular, for a partitioning induced by a cover of a
Euclidean space, there exist ``corners'' where 3 (or more) sets meet.
When taking actions ``toward'' this corner from within one of the
sets, the distribution over the next aggregate state set is inherently
unstable.
\vspace{0.15in}
\section{Metric\protect\( E^{3}\protect \)}
\vspace{0.15in}
The algorithm, Metric-\(E^{3}\), is a direct generalization of the
$E^3$ algorithm \cite{E3}. We first outline this original algorithm. A
crucial notion in $E^3$ is that of a ``known'' state---a state
visited often enough such that the dynamics and rewards are accurately
modeled at this state. When the agent is not in the current set of
known states, the agent wanders randomly to obtain new
information. While at a known state, it must decide whether to explore
or exploit---a decision which can be made efficiently. Intuitively,
the decision to explore is made by determining how much potential
reward the agent can obtain by ``escaping'' the known states to get
maximal reward elsewhere. If this number is sufficiently large, the
agent explores. This number can be computed by \emph{planning} to
``escape'' in a fictitious MDP $M_{\textrm{explore}}$ which provides
maximal reward for entering an unknown state. The crucial step in the
proof of $E^3$ is showing that either the agent exploits for
near-optimal reward, or it can explore {\em quickly\/}, which results in
increasing the size of the set of known states. Since the size of the
known set is bounded, the algorithm eventually exploits and obtains
near-optimal reward.
Metric-$E^3$ has a few key differences. Here, a ``known'' state-action
is a pair $(s,a)$ meeting the antecedent of the Local Modeling
Assumption---namely, any pair $(s,a)$ for which the algorithm has
obtained at least $m$ $\alpha$-close experiences
$(s',a',s'',R(s''))$. Unlike in $E^3$, our algorithm does not
explicitly enumerate this set of known state-actions, but rather is only able
to decide if a particular state-action is known. Thus, in the most
general version of our algorithm, our model of the MDP is represented
simply by a list of all prior experience.
As in the original $E^3$, a key step in Metric-$E^3$ is the creation of
the {\em known\/} MDP---a model for just that part of the global MDP that we
can approximate well. Here the known MDP at any moment is given as a generative
model that ``patches together,'' in a particular way, the local generative
models provided by \emph{Model} at known state-actions.
More precisely, the {\em approximate known MDP generative model\/}
takes any state-action $(s,a)$ and a flag bit {\em exploit\/} and operates as follows:
\begin{enumerate}
\item If $(s,a)$ is not a known state-action, output ``fail'' and halt.
\item Else give $(s,a)$ and the $m$ prior experiences $(s',a',s'',R(s''))$ in the
$\alpha$-neighborhood of $(s,a)$ to algorithm \emph{Model}; let
the resulting outputs be $\hat{s}$ and $\hat{r}$.
\item If {\em exploit\/} is 1, set $r \leftarrow \hat{r}$ and $q \leftarrow 0$;
otherwise set $r \leftarrow 0$ and $q \leftarrow 1$.
\item If for some action $\hat{a}$, the pair $(\hat{s},\hat{a})$ is itself a known
state-action, output $\hat{s}$ and $r$ and halt.
\item Else output a special state $z$ and reward $q$ and halt.
\end{enumerate}
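The five numbered steps can be sketched as a thin wrapper around \emph{Model}. The sketch below assumes prior experience is stored as a flat list, as in the most general version of the algorithm; all names and data structures are illustrative:

```python
def known_mdp_step(s, a, exploit, history, d, model, m, alpha, actions):
    """One call to the approximate known MDP generative model.
    `history` is the list of prior experiences (s', a', s'', R(s''));
    `model` plays the role of Model from the Local Modeling Assumption;
    `d` is the metric and `actions` the available action set."""
    def near(s_, a_):
        return [e for e in history if d((s_, a_), (e[0], e[1])) <= alpha]

    nbrs = near(s, a)
    if len(nbrs) < m:                    # step 1: not a known state-action
        return 'fail'
    s_hat, r_hat = model(nbrs[:m])       # step 2: query the local model
    r, q = (r_hat, 0.0) if exploit else (0.0, 1.0)   # step 3
    if any(len(near(s_hat, a_)) >= m for a_ in actions):
        return (s_hat, r)                # step 4: destination is known
    return ('z', q)                      # step 5: absorbing state z
```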
Intuitively, we have described a generative model for two MDPs with
identical transition dynamics, but differing rewards according to the
value of the {\em exploit\/} bit. In both models, all transitions that
end in a state with no known actions are ``redirected'' to a single,
special absorbing state $z$, while all other transitions of the global
MDP are preserved. Thus initially the known MDP dynamics are a small
subset of the global MDP, but over time may cover much or all of the
global state space. For rewards, when {\em exploit} is 1, rewards from
the real environment are preserved, whereas when {\em exploit} is 0,
reward is obtained only at the absorbing state, thus rewarding (rapid)
exploration (escape from known state-actions).
We shall use $\hat{M}_{\textrm{exploit}}$ to denote the MDP
corresponding to the generative model above when the {\em exploit}
input bit is set to 1, and $\hat{M}_{\textrm{explore}}$ to denote the
MDP generated by setting {\em exploit} to 0.
Note that under our assumptions, we can always simulate the
approximate known MDP generative model. We can also view it as being
an approximate (hence the name) generative model for what we shall
call the {\em true known MDP}  the MDP whose generative model is
exactly the same as described above, except where the local modeling
algorithm \emph{Model} is perfect (that is, in the Local Modeling
Assumption, $d((s,a),(s',a')) \leq \alpha$ implies $\sum
_{\hat{s}}|\hat{P}(\hat{s}|s,a) - P(\hat{s}|s,a)| = 0$ and
$|R(\hat{s}) - \hat{R}| = 0$). This may still be only a partial model
of the global MDP, but it has the true probabilities for all known
stateactions. We shall use ${M}_{\textrm{exploit}}$ to denote the
MDP corresponding to the generative model above with a perfect
\emph{Model} and the {\em exploit} input bit set to 1, and
${M}_{\textrm{explore}}$ to denote the MDP generated with a perfect
\emph{Model} and {\em exploit} set to 0.
Now we outline the full Metric-$E^3$ algorithm. It is important to
emphasize that this algorithm \emph{never} needs to explicitly
enumerate the set of known state-actions.
\textbf{Algorithm Metric\( E^{3} \)\\
Input: $d(\cdot,\cdot)$, \emph{Model}, \emph{Plan}\\
Output: A policy $\pi$}
\begin{enumerate}
\item Use random moves until encountering a state $s$ with at least one known action $a$
(that is, where there are at least \( m \) \( \alpha \)-close
previous experiences to $(s,a)$).
\item Execute \emph{Plan} twice, once using
the generative model for \( \hat{M}_{\textrm{exploit}} \)
and once using the generative model for \( \hat{M}_{\textrm{explore}} \).
Let the resulting policies be
\( \pi _{\textrm{exploit}} \)
and \( \pi _{\textrm{explore}} \), respectively.
\item If \( V _{\hat{M}_{\textrm{explore}}}(\pi _{\textrm{explore}},s)>\epsilon \),
execute \( \pi _{\textrm{explore}} \) for the next $T$
steps, then go to Step 1.
\item Else, HALT and output \( \pi _{\textrm{exploit}} \).
\end{enumerate}
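The loop structure of the algorithm reads as follows in Python, with every subroutine injected as a stand-in for the corresponding abstraction above (this is an illustrative skeleton, not an implementation of \emph{Plan} or \emph{Model}):

```python
def metric_e3(observe_known_state, plan, value_explore, execute,
              epsilon, max_iters=1000):
    """Skeleton of Metric-E^3.  All arguments are stand-ins:
    observe_known_state() wanders randomly until some (s, a) is known
    and returns s; plan(s, exploit) invokes Plan on the appropriate
    known-MDP generative model; value_explore(pi, s) evaluates the
    escape policy in M-hat_explore; execute(pi) runs pi in the real
    MDP for T steps."""
    for _ in range(max_iters):
        s = observe_known_state()                   # step 1
        pi_exploit = plan(s, exploit=True)          # step 2
        pi_explore = plan(s, exploit=False)
        if value_explore(pi_explore, s) > epsilon:  # step 3: keep exploring
            execute(pi_explore)
        else:                                       # step 4: halt and exploit
            return pi_exploit
    raise RuntimeError('exceeded iteration budget')
```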
The claim is that this algorithm finds a near-optimal policy, in
sample complexity and running time that depend only on the covering
number under the metric. We now turn to the analysis.
\vspace{0.15in}
\section{Metric\protect\( E^{3}\protect \) Analysis}
\vspace{0.15in}
\noindent
We first state the main theorems\footnote{The form of these claims
differs from the original $E^3$ statement because the results hold
\emph{without} an assumption of a mixing MDP. Theorems similar to the
original $E^3$ can be constructed in the metric case by making an
additional assumption of mixing. The ``mixing-free'' form stated here
is subject to fewer assumptions, and is therefore more general. See
\cite{sham} for details.} of the paper.
In the following theorems, we use:
\begin{enumerate}
\item $\pi^*$ is an optimal policy in $M$
\item $T$ is the time horizon
\item $m$ and $\alpha$ are the sample complexity and precision defined
in the Local Modeling Assumption
\item $\beta$ is the precision defined in the Approximate Planning Assumption
\item $\epsilon$ is an accuracy parameter
\item $\delta$ is a confidence parameter.
\end{enumerate}
\begin{thm} (Sample Complexity)
\label{thm1} Suppose $\epsilon > \alpha(T+1)$.
With probability $1-\delta$, after at most
$\frac{TmN(\alpha)}{\epsilon - \alpha
(T+1)}\ln({1}/{\delta})+mN(\alpha)$ actions in $M$, Metric-\( E^{3} \)
halts in a state $s$, and outputs a policy $\pi$ such that $V
_{M}(\pi,s)\geq V_{M}(\pi^{*},s)-\epsilon-2\beta-2\alpha(T+1)$.
\end{thm}
This shows that the sample complexity (the number of actions required) is
bounded in terms of the covering number $N(\alpha)$ (and not the size of the
state space). In addition to bounding the sample complexity, we bound
the time complexity.
\noindent
\begin{thm} (Time Complexity)
\label{thm2} Let \( k \) be the overall sample complexity. Metric-\( E^{3} \) runs
in time at most \( \frac{k(k-1)}{2} t_{\mbox{metric}} + 2k
(t_{\mbox{plan}} + c_{\mbox{gen}} t_{\mbox{model}}) + O(k) \).
\end{thm}
A few lemmas are useful in the proofs.
First we define
$\hat{M}$ to be an $\alpha$-approximation of $M$ if for all states
$s$ and actions $a$, $\sum_{s'}|\hat{P}(s'|s,a) - P(s'|s,a)|\leq \alpha$ and
$|R(s) - \hat{R}(s)| \leq \alpha$. The original Simulation Lemma for $E^3$ had a dependence
on the size of the state space that we cannot tolerate in our setting, so we
first need an improved version:
\begin{lem}
\label{lemsim}(Simulation Lemma)
If \( \hat{M} \) is an $\alpha$-approximation of \( M \), then for any
initial state \( s \), any horizon \( T \), and any policy \( \pi
\),
\[
|V_{\hat{M}}(\pi,s) - V_{M}(\pi,s)| \leq \alpha(T+1).\]
\end{lem}
\emph{Proof.} Let $H_t=\{(s_1,s_2,\ldots,s_t)\}$ be the set of
length-$t$ paths. For $h\in H_t$, let $h_t$ be the $t$-th state in $h$,
and let $A_t(h)$ and $\hat{A}_t(h)$ be the probability of $h$ in $M$
and $\hat{M}$, respectively. Let $Q(s'|s)$ and $\hat{Q}(s'|s)$ be the
transition probabilities under $\pi$ in $M$ and $\hat{M}$,
respectively. Since $\hat{M}$ is an $\alpha$-approximation, for any
state $s'$, $\sum_{s}|Q(s|s')-\hat{Q}(s|s')|\leq\alpha$. Then
\begin{eqnarray*}
& & \sum _{h\in H_{t+1}} |A_{t+1}(h)-\hat{A}_{t+1}(h)| \\
&= & \sum _{h\in H_{t},s}|A_t(h)Q(s|h_t)-\hat{A}_t(h)\hat{Q}(s|h_t)| \\
& \leq & \sum _{h\in H_{t},s}
|A_t(h)Q(s|h_t)-\hat{A}_t(h)Q(s|h_t)| \\
& &+ \,|\hat{A}_t(h)Q(s|h_t)-\hat{A}_t(h)\hat{Q}(s|h_t)| \\
& = & \sum _{h\in H_{t},s}
Q(s|h_t)\,|A_t(h)-\hat{A}_t(h)| \\
& &+ \,\hat{A}_t(h)\,|Q(s|h_t)-\hat{Q}(s|h_t)| \\
& \leq & \sum _{h\in H_{t}} |A_t(h)-\hat{A}_t(h)| + \alpha
\end{eqnarray*}
where we have used the triangle inequality and linearity of
expectation. Induction on \( t \) implies that:
\[
\sum _{\textrm{paths}}\left|\Pr _{s_{i}\sim M,\pi,s_0 }(s_{1},\ldots,s_{T})
- \Pr _{s_{i}\sim \hat{M},\pi,s_0 }(s_{1},\ldots,s_{T})\right|\leq \alpha T.
\]
Since the rewards $\hat{R}$ in $\hat{M}$ are also $\alpha$-accurate,
\[
\left|V_{\hat{M}}(\pi ,s)-E_{\textrm{length-$T$ paths in
}\hat{M}}\left[\frac{1}{T}\sum_{t=1}^{T}R(s_{t})\right]\right|\leq
\alpha \, .
\]
The result follows using the previous two equations.
\qed
Now we restate the ``Explore-or-Exploit'' lemma from \cite{E3}.
\noindent
\begin{lem}
\label{lemexpexp}(Explore or Exploit) Let $\pi^*$ be the optimal
policy for the global MDP \( M \), and let
\( \pi^*_{\textrm{exploit}} \) be the optimal policy for the true known MDP
\( M_{\textrm{exploit}} \) described above.
Then for any state $s$ of
\( M_{\textrm{exploit}} \) and
for any \( 0<\epsilon <1 \),
either
\[
V _{M_{\textrm{exploit}}}(\pi _{\textrm{exploit}}^{*},s)>V _{M}(\pi ^{*},s)-\epsilon \]
or the optimal policy \( \pi^*_{\textrm{explore}} \) for $M_{\textrm{explore}}$
has probability at least \( \epsilon \) of leaving the known states within \( T \)
steps in $M$.
\end{lem}
There is one subtle distinction from the original $E^3$ algorithm:
here, although the algorithm plans to reach some unknown state, by the
time this state is reached, it might already be known due to the
Local Modeling Assumption. Note that in the maze world example, the
agent might plan to escape by moving around a corner. However, when
actually executing this escape policy, the states around the corner
could become known before they are reached in $T$ steps, if they come
into line of sight beforehand.
We now establish that Metric\( E^{3} \) ceases to explore in a reasonable
amount of time. In the original \( E^{3} \) this was a consequence of
the Pigeonhole Principle applied to the number of states. A similar
statement holds here, but now we use the size of a cover under the
metric. It is important to note that this lemma holds whether or not
the covering number
\( N(\alpha) \) is known.
\begin{lem}
\label{lemexplimit}(Exploration Bound) Metric-$E^{3}$ encounters
at most $mN(\alpha)$ unknown state-actions.
\end{lem}
\emph{Proof.}
First, consider the $m=1$ case. We construct a set $C$ as follows: the
state-action $(s,a)$ at time $t$ is added to the set $C$ if
\[
\forall (s',a') \in C:\,\,\, d((s,a),(s',a')) > \alpha \, .
\]
Note that the state-action at time $t$ is unknown if and only if
\[
\forall (s',a') \in \textrm{\{earlier state-actions\}}:\,\,\, d((s,a),(s',a')) > \alpha ,
\]
and so if $(s,a)$ is unknown, then it is added to $C$. Thus, the size
of $C$ at time $t$ is an upper bound on the number of unknown
state-action pairs encountered by the algorithm before time $t$. Since
no element of $C$ covers another element in $C$, $C$ is minimal: if
any element is removed from $C$, the set of state-actions
covered by $C$ is reduced. It follows that for all $t$ the size
of $C$ is at most $N(\alpha)$, and hence the algorithm cannot
encounter more than $N(\alpha)$ unknown state-actions.
For the general $m$ case, consider constructing $m$ different sets,
$C_1, \ldots,C_m$. The state-action at time $t$ is added to (only) the
first set $C_i$ containing no $\alpha$-close element. By
an analogous argument, if a state-action is unknown, it is added to
some $C_i$, and so the sum of the sizes of the $C_i$ bounds the number of
unknown state-actions encountered by the algorithm before time
$t$. Again, by construction, each $C_i$ is minimal for all $t$. Hence,
the size of each $C_i$ is bounded by $N(\alpha)$, and so the number of
unknown state-actions encountered by the algorithm is bounded by
$mN(\alpha)$. \qed
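The counting argument can be replayed mechanically: maintain the sets $C_1,\ldots,C_m$ while streaming state-actions, and check that every unknown encounter lands in some $C_i$ (the stream and metric below are illustrative):

```python
def count_unknown(stream, d, alpha, m):
    """Replays a stream of state-actions, maintaining the sets
    C_1..C_m of the proof: each arrival is added to the first C_i
    containing no alpha-close element, if any.  A state-action is
    'unknown' when fewer than m alpha-close pairs were seen earlier,
    and every unknown arrival is added to some C_i, so the number of
    unknown encounters is at most sum |C_i| <= m * N(alpha)."""
    C = [[] for _ in range(m)]
    unknown = 0
    seen = []
    for p in stream:
        if sum(1 for q in seen if d(p, q) <= alpha) < m:
            unknown += 1                 # p is an unknown state-action
        for Ci in C:                     # first set not covering p
            if all(d(p, q) > alpha for q in Ci):
                Ci.append(p)
                break
        seen.append(p)
    return unknown, sum(len(Ci) for Ci in C)
```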
We now provide the proofs of the main theorems.
\emph{Proof of Theorem \ref{thm1}.} The Exploration Bound of Lemma
\ref{lemexplimit} implies that the algorithm encounters at most
$mN(\alpha)$ unknown state-actions, which bounds the number of
successful exploration attempts. Each attempted exploration occurs
when
$V_{\hat{M}_{\textrm{explore}}}(\pi_{\textrm{explore}},s)>\epsilon $,
and so, by the Simulation Lemma, $V
_{M_{\textrm{explore}}}(\pi_{\textrm{explore}},s)>\epsilon-\alpha
(T+1)$. By definition of $M_{\textrm{explore}}$, the chance of
successful exploration on each attempt is therefore greater than
$\epsilon-\alpha (T+1)$. Hence, with probability of error at most
$\delta$, all successful explorations occur within \(
\frac{TmN(\alpha) }{\epsilon - \alpha (T+1)}\ln({1}/{\delta}) \)
actions. The total number of actions before halting is at most the sum
of these exploration actions from known states and the $mN(\alpha)$
actions taken at unknown state-actions.
The decision to halt occurs when
$V_{\hat{M}_{\textrm{explore}}}(\pi_{\textrm{explore}},s)\leq\epsilon
$, which implies
$V_{M_{\textrm{explore}}}(\pi_{\textrm{explore}}^*,s)\leq\epsilon+\alpha
(T+1) +\beta$
due to planning and simulation error. By the Explore-or-Exploit
lemma,
\[
V _{M_{\textrm{exploit}}}(\pi_{\textrm{exploit}}
^{*},s)>V_{M}(\pi ^{*},s)-\epsilon - \alpha (T+1) - \beta \, .
\]
Due to simulation and planning error in computing an optimal policy in $M_{\textrm{exploit}}$,
\[
V _{M_{\textrm{exploit}}}(\pi _{\textrm{exploit}},s)
>V_{M}(\pi ^{*},s)-\epsilon - 2\alpha (T+1)-2\beta \, .
\]
The result follows since a policy in $M$ has no less reward
than in $M_{\textrm{exploit}}$.
\qed
\emph{Proof of Theorem \ref{thm2}.} It is never necessary to evaluate the
metric between two samples more than once. There are at most \(
\frac{k(k-1)}{2} \) pairs of samples, so step 1 of Metric-$E^3$ takes
at most $ \frac{k(k-1)}{2}\, t_{\mbox{metric}} $ computation time. Step
2 is executed at most $k$ times, since at least one transition occurs
before re-entering step 2. One call to \emph{Plan} requires time at
most $t_{\mbox{plan}} + c_{\mbox{gen}} t_{\mbox{model}}$, so the total
time spent on step 2 is at most $2k (t_{\mbox{plan}} + c_{\mbox{gen}}
t_{\mbox{model}})$. Step 3 takes total time at most $O(k)$. The
result follows by adding these times. \qed
\section{Discussion}
It is difficult to quantify the exact scaling improvements of
Metric-$E^3$ over $E^3$ because the improvements are inherently
dependent upon the exact form of the local modeling assumption. In
the extreme case where the state-action space is continuous and
$N(\alpha)$ is finite, $E^3$ has an infinite sample complexity while
Metric-$E^3$ has a finite sample complexity. In less extreme cases,
the advantage of Metric-$E^3$ is (naturally) less extreme. It is
worth noting that the extreme case is not too unusual: many control
problems are modeled using continuous (or virtually continuous)
parameters.
The Metric-$E^3$ analysis implies that local modeling requires weaker
assumptions about the behavior of the world than state aggregation.
It is not necessary for aggregations of states to have Markovian
dynamics in order to engage in successful exploration. Instead, all
that we need is the ability to generalize via local modeling. Of
course, when aggregations of states \emph{do} have Markovian dynamics,
state aggregation may work well.
\small
\begin{thebibliography}{1}
\bibitem{grad2} J. Baxter and P. L. Bartlett. ``Direct Gradient-Based Reinforcement
Learning: I. Gradient Estimation Algorithms''. Technical report,
Australian National University, 1999.
\bibitem{Approxplan} C. Guestrin, R. Patrascu, and D. Schuurmans. ``Algorithm-Directed Exploration for
Model-Based Reinforcement Learning in Factored MDPs''. Proceedings of ICML, 2002, pages 235--242.
\bibitem{DBN}M. Kearns and D. Koller. ``Efficient
Reinforcement Learning in Factored MDPs''. Proceedings of IJCAI, 1999.
\bibitem{E3}M. Kearns and S. Singh. ``Near-Optimal Reinforcement Learning in
Polynomial Time''. Proceedings of ICML, 1998.
\bibitem{Rmax}R. I. Brafman and M. Tennenholtz. ``R-max---A
General Polynomial Time Algorithm for Near-Optimal Reinforcement
Learning''. Proceedings of IJCAI, 2001.
\bibitem{Sparse}M. Kearns, Y. Mansour, and A. Ng. ``A Sparse Sampling Algorithm
for Near-Optimal Planning in Large Markov Decision Processes''. Proceedings
of IJCAI, 1999.
\bibitem{sham}S. Kakade. PhD thesis, University College London, 2003.
\bibitem{CPI}S. Kakade and J. Langford. ``Approximately Optimal Approximate Reinforcement
Learning''. Proceedings of ICML, 2002.
\bibitem{sven}S. Koenig and R. Simmons. ``Complexity Analysis of Real-Time Reinforcement Learning''. Proceedings of the National Conference on Artificial Intelligence, pages 99--105, 1993.
\bibitem{thrun}S. Thrun. ``Efficient Exploration in Reinforcement Learning''. Technical Report CMU-CS-92-102, Carnegie Mellon University, Computer Science Department, Pittsburgh, PA, 1992.
\bibitem{Partigame}A. W. Moore. ``The Parti-game Algorithm for Variable Resolution Reinforcement
Learning in Multidimensional State-spaces''. In NIPS 6, 1993.
\bibitem{Dean}T. Dean and R. Givan. ``Model Minimization in Markov
Decision Processes''. In AAAI, 1997.
\bibitem{aggregation} J. Rust. ``A Comparison of Policy Iteration
Methods for Solving Continuous-State, Infinite-Horizon Markovian
Decision Problems Using Random, Quasi-random, and Deterministic
Discretizations''.
http://econwpa.wustl.edu:8089/eps/comp/papers/9704/9704001.ps
\bibitem{grad1} R. Sutton, D. McAllester, S. Singh and
Y. Mansour. ``Policy Gradient Methods for Reinforcement Learning with
Function Approximation''. In NIPS 13, 2000.
\end{thebibliography}
\end{document}