\documentclass[english]{amsart}
\usepackage[T1]{fontenc}
\usepackage{babel}
\makeatletter
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% LyX specific LaTeX commands.
\providecommand{\LyX}{L\kern-.1667em\lower.25em\hbox{Y}\kern-.125emX\@}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% Textclass specific LaTeX commands.
\theoremstyle{plain}
\newtheorem{thm}{Theorem}[section]
\numberwithin{equation}{section} %% Comment out for sequentially-numbered
\numberwithin{figure}{section} %% Comment out for sequentially-numbered
\theoremstyle{plain}
\newtheorem{cor}[thm]{Corollary} %%Delete [thm] to re-start numbering
\theoremstyle{plain}
\newtheorem{lem}[thm]{Lemma} %%Delete [thm] to re-start numbering
\theoremstyle{definition}
\newtheorem{defn}[thm]{Definition}
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% User specified LaTeX commands.
\usepackage{icml2002}
\usepackage{epsf}
\makeatother
\begin{document}
\twocolumn[
\icmltitle{Competitive Analysis of the Explore/Exploit Tradeoff}
\icmlauthor{John Langford}{jcl+@cs.cmu.edu}
\icmladdress{Computer Science Department, Carnegie-Mellon University,
5000 Forbes Avenue, Pittsburgh, PA 15213}
\icmlauthor{Martin Zinkevich}{maz+@cs.cmu.edu}
\icmladdress{Computer Science Department, Carnegie-Mellon University,
5000 Forbes Avenue, Pittsburgh, PA 15213}
\icmlauthor{Sham Kakade}{sham@gatsby.ucl.ac.uk}
\icmladdress{Gatsby Computational Neuroscience Unit, UCL,
London WC1N 3AR, UK}
]
\renewcommand{\baselinestretch}{0.95} \normalsize
\begin{abstract}
We investigate the explore/exploit trade-off in reinforcement learning
using competitive analysis applied to an abstract model. We state and
prove lower and upper bounds on the competitive ratio. The essential
conclusion of our analysis is that optimizing the explore/exploit
trade-off is much easier with a few pieces of extra knowledge such as
the stopping time or upper and lower bounds on the value of the
optimal exploitation policy.
\end{abstract}
\section{Introduction}
It is well known that efficient reinforcement learning algorithms must
use some means for active exploration (see for example
\cite{explore}). We consider an abstract class of algorithms where an
\emph{explicit} decision between exploration and exploitation is
made. In this framework, we use competitive analysis to analyze the
trade-off so as to maximize the cumulative reward the agent receives.
There are two relevant pieces of prior work that have well-defined
notions of optimality in the on-line setting. First, the \( E^{3} \)
algorithm of Kearns and Singh \cite{kearns98nearoptimal} shows that,
given certain assumptions about the world, the agent can achieve
cumulative reward close to optimal in polynomial time (in the number
of states). The notion of optimality used is the largest possible
payoff among those policies that ``mix'' within some pre-chosen
time. The time required to achieve near optimal payoff is then
polynomial in this ``mixing time''. The other notion of optimality is
in the competitive analysis of the N-armed bandit problem
\cite{auer95gambling}\cite{Duff}. In this analysis we start with a box
containing several levers each of which gives some reward when
pulled. The goal is to choose the right lever in order to maximize
reward and the expected reward is compared with the ``optimal''
algorithm which chooses the one lever which returns the largest
reward.
Both of these notions of ``optimality'' are reasonable, and their
respective algorithms specify means to get close to their respective
notion of optimality. However, both notions assume the optimal
algorithm always chooses actions under a policy that achieves the
maximal reward. This is unrealistic since any practical
algorithm must engage in exploration first before it finds this
optimal policy. Hence, any feasible algorithm will behave in a manner
quite different from optimal behavior.
We focus on analyzing \emph{only} the exploration/exploitation
trade-off. Informally, in the model we consider, the agent only has
to decide between exploring and exploiting. We assume that if \( n \)
exploration steps are carried out during the course of the game then
this results in the same policy improvement, regardless of which state
the agent is in or when this exploration was carried out. This is
clearly an abstraction since exploration is, in general, state
dependent. We also make a weaker assumption: that exploitation does
not result in exploration.
The explore/exploit trade-off problem is a search problem.
Competitive analysis has been previously applied to search problems
(see \cite{online} for examples) although this particular search
problem appears new. The most strongly related previous problems are
competitive financial decision making problems where the goal is to
choose the optimal trading time \cite{EFKT}\cite{BSZ}. The techniques
applied to these problems are used here although our model is
different enough such that no direct reduction exists.
In competitive analysis, the performance of our algorithm is compared
with the performance that we could have achieved given all available
information before decisions are made. This comparison is made over the
worst-case sequence, so a good algorithm is one that is not much worse
than the optimal algorithm. This paper analyzes the worst-case gap
between how well an algorithm \emph{could} have done and how well an
algorithm \emph{did}.
This paper has the following results:
\begin{enumerate}
\item An abstract model for analyzing algorithms which make a trade-off between exploration
and exploitation.
\item Upper and lower bounds on competitive optimality for general algorithms in this
model.
\item Upper and lower bounds on the competitive optimality for algorithms which are
given the stopping time, \( T \), or upper and lower bounds (\( R_{\textrm{min}} \)
and \( R_{\textrm{max}} \)) on the average reward the optimal algorithm can
achieve.
\end{enumerate}
\subsection{The abstract model}
The common notion of optimal policy is one which does not need to
engage in exploration: it simply always chooses the best actions from
the start of the game. Any practical algorithm must engage in
exploration and can thus never reasonably behave in a manner similar
to an ``optimal'' algorithm for this notion of optimal. Our aim is to
construct an abstract model in which an optimal algorithm must also
explore. We construct an abstract model where the ``optimal''
algorithm engages in exploration and compare our algorithm to this
(more realistic) notion of ``optimal''.
Our goal is to analyze the trade-off between exploration and
exploitation. Suppose we have an agent which can take two
``meta-actions'', labeled ``explore'' and ``exploit''. In addition,
the agent knows the current value of its ``exploit'' action. Let this
\emph{exploitation value} be \( R \). When the ``exploit'' button is
pressed, the agent receives reward \( R \). When the ``explore''
button is pressed, the agent receives \( 0 \)
reward but the number \( R \) monotonically increases due to the
information gained from exploration. The goal of the agent is simply
to maximize the total reward that it receives.
The motivation for this problem is that it is an abstraction of a very
common situation: an agent must often choose between exploiting with
the best method derivable from its current knowledge of the world, and
sacrificing some exploitation to gather information which (hopefully)
allows better exploitation in the future.
We insist that the \emph{exploitation value} \( R \) increase monotonically
with exploration simply because extra information acquired during exploration
phases hopefully improves the exploitation efficiency. It is important
to note that we do not require that the reward \( R \) increase \emph{strictly.}
An \emph{algorithm} is some method to switch between exploration and exploitation,
which can be based on the previous exploitation values. We define the \emph{value}
of an explore/exploit algorithm to be the cumulative reward obtained in \( T \)
steps, and the \emph{optimal algorithm} is one which maximizes this cumulative
reward.
An example sequence of changing exploitation values, \( R_{i} \), might be:
{\par\centering 0 0 0 0 1 1 2 3 9 9 9 9 9 10\par}
Here, every time we explore we advance one increment in this sequence, and the
starting exploitation value is \( R_{0} \). Thus, after exploring 3 times,
our exploitation policy has value 0. After 4 explorations, 1, and after 8, 9.
We also assume that any exploitation does not advance any increments.
Let us consider the situation where the game ends in \( T=13 \) rounds. In
particular, we might imagine a strategy which explores 4 rounds, then exploits
for 9 rounds, and a strategy which explores 12 rounds, then exploits for one
round. The first strategy has cumulative reward:
\[
1\times 9=9\]
and the second:
\[
9\times 1=9\]
The optimal strategy explores 8 times and then exploits 5 times to achieve
value:
\[
9\times 5=45\] Note here that we have achieved the goal of a realistic
notion of optimal: our optimal algorithm engages in exploration. Also
note that monotonicity implies that the optimal strategy always does
all of its exploration before any exploitation. The problem of
optimizing the trade-off is then simply finding the best switching
time.
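The example above can be put in code (a small Python illustration of ours; the variable names are not from the paper): `R[n]` is the exploitation value after `n` explorations, and the explore-\( n \)-times-then-exploit strategy earns `R[n] * (T - n)`.

```python
# Exploitation values from the example sequence: R[n] after n explorations.
R = [0, 0, 0, 0, 1, 1, 2, 3, 9, 9, 9, 9, 9, 10]
T = 13  # the game ends after T = 13 rounds

def value(n):
    """Cumulative reward of exploring n times, then exploiting T - n times."""
    return R[n] * (T - n)

print(value(4))   # explore 4, exploit 9: reward 1 * 9 = 9
print(value(12))  # explore 12, exploit 1: reward 9 * 1 = 9
best = max(range(T + 1), key=value)
print(best, value(best))  # optimal switch: 8 explorations, reward 9 * 5 = 45
```

Scanning all \( T+1 \) switching times recovers the optimal reward of 45 at 8 explorations.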
\subsection{The analysis}
\label{sec-comp}
We wish to compare the best possible strategy (given prior knowledge
of the monotonically increasing reward sequence) with an algorithm
that makes explore/exploit decisions based upon only observed values
\( R \). We do this in the common ``competitive analysis'' setting
\cite{online}. This comparison is done with a worst case over all
problems.
Competitive analysis typically compares an algorithm with the ``best
possible'' algorithm (given all information). Let \( C_{*} \) be the
cumulative reward of the optimal algorithm. We want to use a
definition which allows randomized algorithms for trading off between
exploration and exploitation. A randomized algorithm has a source of
uniform random bits independent of the problem. In general, any
randomized algorithm \( A \) can be regarded as a deterministic algorithm
which takes its random bits as an additional argument. If a
randomized algorithm \( A \) which uses random bits \( r \) achieves
total cumulative reward \( C_{A,r} \), then the competitive ratio is
typically defined as:
\[
\max _{\textrm{problems}}\frac{C_{*}}{E_{r}C_{A,r}}\] Here \( r \) is
all the random bits used by the algorithm \( A \) and \( E_r \) is the
expectation over the random bits $r$.
\section{Competitive Analysis}
Before starting the analysis, we prove a meta-theorem which implies
that we need only consider optimal algorithms which switch from
exploration to exploitation, but not vice versa. The implication of
this theorem is that the space of possible optimal algorithms
is much smaller than expected.
\begin{lem}
\label{lem-monotone} (explore then exploit) For all sequences \( R_{i} \)
and all algorithms which complete \( n \) explorations and \( m \)
exploitations, the algorithm with the largest possible reward first
does \( n \) explorations then \( m \) exploitations.
\end{lem}
\begin{proof}
Consider any algorithm which exploits during round \( t \) and
explores during some later round \( t' \). Consider the altered
algorithm which explores during round \( t \) and exploits during
round \( t' \). This alteration leaves unaffected the values of
exploitations at rounds earlier than \( t \) or later than \( t' \),
and it monotonically improves the reward received from any
exploitations in the interval \( (t,t'] \). The lemma follows by
induction on all pairs \( (t,t') \) of exploitation followed by
exploration.
\end{proof}
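The swapping argument can be checked exhaustively for small cases (a sketch of ours, not from the paper: the helper `reward` and the random monotone sequence are illustrative choices).

```python
import itertools
import random

# Exhaustive check of the explore-then-exploit lemma: among all orderings
# of n "explore" (E) and m "exploit" (X) actions, doing all explorations
# first maximizes cumulative reward on a monotone value sequence.
def reward(actions, R):
    explored, total = 0, 0
    for a in actions:
        if a == "E":
            explored += 1          # advance one increment in the sequence
        else:
            total += R[explored]   # exploit at the current value
    return total

rng = random.Random(0)
# A random monotonically non-decreasing exploitation value sequence.
R = [0]
for _ in range(12):
    R.append(R[-1] + rng.randrange(3))

n, m = 4, 5
orders = set(itertools.permutations("E" * n + "X" * m))
best = max(reward(o, R) for o in orders)
assert best == reward("E" * n + "X" * m, R)  # explore-first is optimal
```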
This lemma implies that the optimal algorithm (which already knows the
value of exploration) \emph{always} does exploration followed by
exploitation. Our algorithm may or may not follow this
behavior. Later, we prove that the best algorithm which knows the
stopping time \( T \) does all exploration before exploitation.
The optimal algorithm explores first then exploits for the
remaining time. Let \( R_{\textrm{opt}} \) be the exploitation value
per time step, and let \( t_{\textrm{opt}} \) be the number of time
steps the optimal algorithm exploits. The cumulative reward received by optimal
is then \( R_{\textrm{opt}}t_{\textrm{opt}} \) and the competitive
ratio is:
\[
\max _{\textrm{problems}}\frac{R_{\textrm{opt}}(\textrm{problem})t_{\textrm{opt}}(\textrm{problem})}{E_{r}C_{A,r}(\textrm{problem})}\, .\]
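On a single problem, this ratio is easy to evaluate for any switching distribution (a Python sketch of ours; `opt_reward`, `expected_reward`, and the uniform distribution are illustrative, not from the paper).

```python
# Competitive ratio on one problem: p[n] is the probability the
# algorithm switches to exploitation after n explorations.
R = [0, 0, 0, 0, 1, 1, 2, 3, 9, 9, 9, 9, 9, 10]  # example sequence
T = 13

def opt_reward(R, T):
    # Optimal: best explore-then-exploit switching time (by the lemma).
    return max(R[n] * (T - n) for n in range(T + 1))

def expected_reward(R, T, p):
    return sum(p[n] * R[n] * (T - n) for n in range(T + 1))

uniform = [1.0 / (T + 1)] * (T + 1)   # switch uniformly at random
ratio = opt_reward(R, T) / expected_reward(R, T, uniform)
print(ratio)  # ratio of uniform switching on this one sequence
```

The worst case in the definition is then a maximum of this quantity over all monotone sequences.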
\subsection{Results}
We now present upper and lower bounds on the competitive ratio for
deterministic and randomized algorithms with varying pieces of
available information. In a randomized algorithm the decision to
explore or exploit is stochastic. In competitive analysis, an upper
bound of \( X \) means that there exists an algorithm with competitive
ratio at most \( X \). Our proofs of the upper bounds are constructive:
we explicitly define such an algorithm.
Let \( T \) be the number of rounds before termination and let \( R_{\textrm{max}} \)
and \( R_{\textrm{min}} \) be upper and lower bounds on \( R_{\textrm{opt}} \).
Note that we assume \( 0 \) is the minimum possible average reward.
We consider algorithms with knowledge of:
\begin{enumerate}
\item nothing
\item \( T \)
\item \( R_{\textrm{max}} \) and \( R_{\textrm{min}}>0 \)
\item \( T \), \( R_{\textrm{max}} \), and \( R_{\textrm{min}}>0 \)
\end{enumerate}
Here is a table of our results for deterministic algorithms:
\vspace{0.2cm}
{\centering \begin{tabular}{|c|c|c|}
\hline
&
Upper bound&
Lower Bound\\
\hline
\hline
Nothing&
\( \infty \)&
\( \infty \)\\
\hline
\( R_{\textrm{max}},R_{\textrm{min}} \)&
\( \frac{R_{\textrm{max}}}{R_{\textrm{min}}} \)&
\( \frac{R_{\textrm{max}}}{2R_{\textrm{min}}} \)\\
\hline
\( T \)&
\( T \)&
\( T \)\\
\hline
\( T,R_{\textrm{max}},R_{\textrm{min}} \)&
\( \min \left( T,\frac{R_{\textrm{max}}}{R_{\textrm{min}}}\right) \)&
\( \min \left( T,\frac{R_{\textrm{max}}}{2R_{\textrm{min}}}\right) \)\\
\hline
\end{tabular}\par}
\vspace{0.3cm} Note that with no information, the deterministic
competitive ratio is infinite.
For randomized algorithms, it is
convenient to define \( f(x)=\Omega (\frac{\log x}{\log \log x}) \)
and \( g(x)=1+\ln x \). Note that \( g(x) \) differs by only a \( \log
\log x \) factor from \( \frac{\log x}{\log \log x} \) implying that
$g(x)$ and $f(x)$ have similar behavior. The table of results for
randomized algorithms is considerably more encouraging:
\vspace{0.2cm}
{\centering \begin{tabular}{|c|c|c|}
\hline
&
Upper bound&
Lower Bound\\
\hline
\hline
Nothing&
\( \forall \epsilon>0 \,\,\,\, \frac{1+\epsilon }{\epsilon }T^{1+\epsilon } \)&
\( T-1 \)\\
\hline
\( R_{\textrm{max}},R_{\textrm{min}} \)&
\( g\left( \frac{R_{\textrm{max}}}{R_{\textrm{min}}}\right) \)&
\( f\left( \frac{R_{\textrm{max}}}{R_{\textrm{min}}}\right) \)\\
\hline
\( T \)&
\( g(T) \)&
\( f(T) \)\\
\hline
\( T,R_{\textrm{max}},R_{\textrm{min}} \)&
\( g(\min (T,\frac{R_{\textrm{max}}}{R_{\textrm{min}}})) \)&
\( f\left( \min \left( T,\frac{R_{\textrm{max}}}{R_{\textrm{min}}}\right) \right) \)\\
\hline
\end{tabular}\par}
\vspace{0.3cm}
The competitive analysis of reinforcement learning shows that optimizing the
explore/exploit trade-off is, in general, quite difficult. The analysis shows
several important things:
\begin{enumerate}
\item Randomized algorithms achieve significantly better performance
than deterministic algorithms by amortizing worst-case situations.
\item The lower bounds are near to the upper bounds (up to \( \log
\log \) factors).
\item Knowledge of \( T \) (the termination time) implies there exists
an algorithm with a competitive ratio of \( \log T \).
\item Knowledge of \( R_{\textrm{max}} \) and \( R_{\textrm{min}} \),
which are upper and lower bounds on the exploitation value \(
R_{\textrm{opt}} \) of the optimal policy, implies there exists an
algorithm with a competitive ratio of \( \ln
\frac{R_{\textrm{max}}}{R_{\textrm{min}}} \).
\end{enumerate}
\subsection{No knowledge }
The worst case is that our algorithm knows nothing. Deterministic
algorithms perform remarkably poorly here.
\begin{cor}
\label{th-infinite}(No Knowledge, deterministic lower bound) All deterministic
algorithms have a competitive ratio of \( \infty \).
\end{cor}
\begin{proof}
The proof is done by taking the limit as \( R_{\textrm{max}}\rightarrow \infty \)
of the competitive ratio resulting from Theorem \ref{th-bound-lower}, which
we prove later.
\end{proof}
The situation with no knowledge alters radically when randomization is
allowed and we consider expected competitive ratios: randomization
greatly reduces both the upper and lower bounds.
\begin{thm}
(No Knowledge, randomized upper bound) For all \( \epsilon >0 \), there exists
a randomized algorithm with a competitive ratio of \( \frac{1+\epsilon }{\epsilon }T^{1+\epsilon } \).
\end{thm}
\begin{proof}
Consider the randomized algorithm which switches from exploration to exploitation
at time step \( t \) with probability:
\[
\Pr (t)\propto \frac{1}{t^{1+\epsilon }}\]
The normalization for this distribution is bounded. In particular:
\[
\sum _{t=1}^{\infty }\frac{1}{t^{1+\epsilon }}=1+\sum _{t=2}^{\infty }\frac{1}{t^{1+\epsilon }}\leq 1+\int _{t=1}^{\infty }\frac{1}{t^{1+\epsilon }}dt=\frac{1+\epsilon }{\epsilon }\]
The optimal algorithm switches from exploration to exploitation at some time
\( t_{\textrm{opt}} \). Our randomized algorithm switches at time \( t_{\textrm{opt}} \)
with probability at least \( \frac{\epsilon }{1+\epsilon }\frac{1}{t_{\textrm{opt}}^{1+\epsilon }} \).
This implies that the expected value is at least \( \frac{\epsilon }{1+\epsilon }\frac{C_{\textrm{opt}}}{t_{\textrm{opt}}^{1+\epsilon }} \),
and since \( t_{\textrm{opt}}\leq T \), the competitive ratio is bounded by:
\[
\frac{1+\epsilon }{\epsilon }T^{1+\epsilon }\]
\end{proof}
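The normalization bound in the proof is easy to confirm numerically (our own check, not part of the paper; the truncation at 200,000 terms is an implementation choice).

```python
# Numeric check of the normalization bound used in the proof:
#   sum_{t >= 1} 1 / t^(1+eps)  <=  (1 + eps) / eps
for eps in (0.5, 1.0, 2.0):
    partial = sum(1.0 / t ** (1 + eps) for t in range(1, 200_000))
    bound = (1 + eps) / eps
    assert partial <= bound
    print(f"eps={eps}: partial sum {partial:.4f} <= bound {bound:.4f}")
```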
This upper bound is near to the following randomized lower bound.
\begin{thm}
(No Knowledge, randomized lower bound) For all randomized algorithms, the competitive
ratio is at least \( T-1 \).
\end{thm}
\begin{proof}
Since the competitive ratio is defined in a worst case scenario, we choose a
particular sequence and show how all randomized algorithms have a competitive
ratio of at least \( T-1 \) on this sequence.
Consider the sequence which has an \( i \)th entry of \( R_{i}=(4i+4)^{i} \).
This sequence has the property that the optimal policy always explores except
for the final round where it exploits.
Let \( p_{i} \) be the chance the algorithm first switches from exploration
to exploitation on the \( i \)th round in this sequence. Since \( \sum _{t=1}^{\infty }\frac{1}{t}=\infty \)
and probability distributions are normalized, we know that there must exist
some time \( t \) where \( p_{t}\leq \frac{1}{2t} \). Let us choose the stopping
time to be \( T=t+1 \) and let \( C_{i} \) be the total cumulative reward
obtained if the algorithm switches at time \( i \). Since a property of the
sequence is that \( C_{i}\geq C_{i-1} \) and since \( C_{i}=0 \) for \( i\geq t+1 \),
the payoff is
\begin{eqnarray*}
EA & = & \sum ^{t}_{i=1}p_{i}C_{i}\\
& \leq & C_{t}p_{t}+C_{t-1}\sum ^{t-1}_{i=1}p_{i}\\
& \leq & \frac{C_{t}}{2t}+C_{t-1}\\
& = & \frac{R_{t}}{2t}+2R_{t-1}\\
& \leq & \frac{R_{t}}{2t}+\frac{R_{t}}{2t}
\end{eqnarray*}
where the last step follows since by definition of the sequence, \( R_{t-1}\leq \frac{R_{t}}{4t} \).
The optimal algorithm achieves total reward \( C_{\textrm{opt}}=R_{t} \)
so the competitive ratio is at least \( t=T-1 \).
\end{proof}
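The two properties of the adversarial sequence claimed in the proof can be verified directly (a small check of ours; the ranges tested are arbitrary).

```python
# Properties of the sequence R_i = (4i + 4)^i used in the proof:
# (1) R_{t-1} <= R_t / (4t), and (2) for any horizon T, the optimal
# strategy explores every round except the last.
def R(i):
    return (4 * i + 4) ** i

for t in range(1, 30):
    assert 4 * t * R(t - 1) <= R(t)       # property (1), integer form

for T in range(2, 12):
    best = max(range(T), key=lambda n: R(n) * (T - n))
    assert best == T - 1                   # explore T-1 times, exploit once
```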
\subsection{Knowledge of \protect\( T\protect \) only}
The upper and lower bounds shift dramatically when \( T \) is known. The deterministic
competitive ratio drops from \( \infty \) to \( T \).
\begin{thm}
(Know \( T \), deterministic upper bound) Given knowledge of \( T \), there
exists a deterministic algorithm with competitive ratio of \( T \).
\end{thm}
\begin{proof}
Consider the algorithm which explores until the last step and then exploits
once. By monotonicity, this single round of exploitation yields reward at
least \( R_{\textrm{opt}} \), while the optimal algorithm obtains at most
\( R_{\textrm{opt}}T \); hence the competitive ratio is at most \( T \).
\end{proof}
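This strategy is illustrated on the running example below (our own sketch, reusing the sequence from the introduction).

```python
# "Explore until the last round, exploit once" on the example sequence:
# its single exploitation earns at least R_opt, so its reward is within
# a factor T of the optimum.
R = [0, 0, 0, 0, 1, 1, 2, 3, 9, 9, 9, 9, 9, 10]
T = 13
last_step_reward = R[T - 1] * 1  # explore T-1 rounds, exploit once
opt = max(R[n] * (T - n) for n in range(T + 1))
assert opt <= T * last_step_reward   # competitive ratio at most T
print(last_step_reward, opt, opt / last_step_reward)  # ratio here: 5.0
```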
A lower bound of \( T \) also applies.
\begin{cor}
(Know \( T \), deterministic lower bound) All deterministic algorithms have
a competitive ratio of at least \( T \).
\end{cor}
\begin{proof}
Optimization of the full knowledge deterministic lower bound (Theorem \ref{th-full-knowledge-det}).
\end{proof}
Randomized algorithms also improve their competitive ratio significantly.
\begin{thm}
(Know \( T \), randomized upper bound) There exists a randomized algorithm
which achieves a competitive ratio of \( 1+\ln T \) given knowledge of \( T \).
\end{thm}
\begin{proof}
For any sequence, let \( t_{\textrm{opt}} \) be the number of times that the
optimal algorithm exploits. Also let \( R_{\textrm{opt}} \) be the reward received
each turn the optimal algorithm exploits.
We analyze a randomized algorithm which starts out doing exploration
and randomly decides the amount of time \( t \) which it exploits. If
we guess \( t\leq t_{\textrm{opt}} \), then we receive a return on
each round of exploitation which is at least as large as \(
R_{\textrm{opt}} \) because we explore at least as often as the
optimal algorithm. If we guess a \( t>t_{\textrm{opt}} \), then we
could obtain \( 0 \) reward, since not enough exploration was
conducted. Hence, the expected cumulative reward of our algorithm is
bounded below by:
\begin{eqnarray*}
E_{r}[C_{r}] & \geq & E_{r}[C_{r}|t\leq t_{\textrm{opt}}]\Pr _{r}[t\leq t_{\textrm{opt}}]\\
& \geq & E_{r}[R_{\textrm{opt}}t|t\leq t_{\textrm{opt}}]\Pr _{r}[t\leq t_{\textrm{opt}}]\\
& \geq & R_{\textrm{opt}}E_{r}[t|t\leq t_{\textrm{opt}}]\Pr _{r}[t\leq t_{\textrm{opt}}]
\end{eqnarray*}
Now we specify how to choose this time. First, choose \( v \) from the real
numbers in \( [1,T] \) according to the following cumulative distribution:
\[
\forall x\in [1,T],\Pr _{r}[v\leq x]=\frac{1+\ln x}{1+\ln T}\]
Notice that this cumulative distribution places a point mass on \( v=1 \).
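As an aside, this distribution can be sampled by inverse transform (a sketch of ours, not part of the paper; `sample_switch_time` is an illustrative name).

```python
import math
import random

def sample_switch_time(T, rng):
    """Sample v on [1, T] with CDF Pr[v <= x] = (1 + ln x)/(1 + ln T)
    (point mass at v = 1), then switch at t = ceil(v)."""
    u = rng.random() * (1 + math.log(T))   # uniform on [0, 1 + ln T]
    v = 1.0 if u <= 1.0 else math.exp(u - 1.0)
    return math.ceil(v)

rng = random.Random(0)
T = 100
samples = [sample_switch_time(T, rng) for _ in range(20000)]
assert all(1 <= t <= T for t in samples)
# The fraction with t = 1 should be near 1/(1 + ln T), about 0.178 here.
print(sum(t == 1 for t in samples) / len(samples))
```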
Let the switching time be \( t=\left\lceil v\right\rceil \). Since \( t_{\textrm{opt}} \)
is an integer, \( t\leq t_{\textrm{opt}} \) if and only if \( v\leq t_{\textrm{opt}} \).
Therefore, \( E_{r}[t|t\leq t_{\textrm{opt}}]\geq E_{r}[v|v\leq t_{\textrm{opt}}] \).
Also, \( \Pr _{r}[t\leq t_{\textrm{opt}}]=\Pr _{r}[v\leq t_{\textrm{opt}}] \).
It follows that:
\begin{eqnarray*}
 &  & E_{r}[C_{r}]\\
 & \geq & R_{\textrm{opt}}E_{r}[v|v\leq t_{\textrm{opt}}]\Pr _{r}[v\leq t_{\textrm{opt}}]\\
 & = & R_{\textrm{opt}}E_{r}[v|v=1]\Pr _{r}[v=1]\\
 &  & +R_{\textrm{opt}}E_{r}[v|1<v\leq t_{\textrm{opt}}]\Pr _{r}[1<v\leq t_{\textrm{opt}}]\\
 & = & R_{\textrm{opt}}\left( \frac{1}{1+\ln T}+\int _{1}^{t_{\textrm{opt}}}x\frac{dx}{x(1+\ln T)}\right) \\
 & = & \frac{R_{\textrm{opt}}t_{\textrm{opt}}}{1+\ln T}=\frac{C_{\textrm{opt}}}{1+\ln T}
\end{eqnarray*}
Thus the competitive ratio is at most \( 1+\ln T \).
\end{proof}
This upper bound is matched, up to a \( \log \log \) factor, by the
following randomized lower bound.
\begin{thm}
(Know \( T \), randomized lower bound) For all randomized algorithms
with knowledge of \( T \), the competitive ratio is at least
\( f(T)=\Omega (\frac{\log T}{\log \log T}) \).
\end{thm}
\begin{proof}
Fix any \( l \) and any \( R_{\textrm{min}}>0 \).
We define a set of \( l \) exploitation value sequences and show that
all algorithms (randomized or deterministic) perform poorly on these
sequences. Let \( R_{kj} \) denote the \( j \)th exploitation value of the
\( k \)th sequence. For all \( i\in \{1,\ldots ,l\} \) let \( T_{i}=T-l^{l-i} \).
For all \( k\in \{1,\ldots ,l\} \), and \( j\in \{1,\ldots ,T\} \) define
\[
\begin{array}{rcll}
R_{kj} & = & 0 & \textrm{for all }j<T_{1}\\
R_{kj} & = & R_{\textrm{min}}l^{2i} & \textrm{for all }T_{i}\leq j<T_{i+1},\, i<k\\
R_{kj} & = & R_{\textrm{min}}l^{2k} & \textrm{for all }j\geq T_{k}
\end{array}\]
Let \( p^{\prime }_{li} \) be the probability that the algorithm
switches from exploration to exploitation at some round in
\( [T_{i},T_{i+1}) \) (taking \( T_{l+1}=T \)) when run on sequence
\( l \). Since sequence \( k \) is identical to sequence \( l \) on
all rounds before \( T_{k+1} \), these switching probabilities also
describe the algorithm's behavior on sequence \( k \) until the
sequences diverge. For sequence \( k \), we get:
\[
\begin{array}{cc}
& E_{r}C_{r}\\
\leq & \sum _{i=1}^{k}p^{\prime }_{li}R^{\prime}_{ki}l^{l-i}+R^{\prime}_{kk}l^{l-k-1}\sum ^{l}_{i=k+1}p^{\prime }_{li}\\
= & R_{\textrm{min}}[p^{\prime }_{lk}l^{l+k}+l^{l+k-1}\sum _{i=k+1}^{l}p^{\prime }_{li}+\sum _{i=1}^{k-1}p^{\prime }_{li}l^{l+i}]\\
\leq & R_{\textrm{min}}[p^{\prime }_{lk}l^{l+k}+l^{l+k-1}]\\
= & R_{\textrm{min}}(p^{\prime }_{lk}+\frac{1}{l})l^{l+k}\\
\end{array}\]
Since \( \sum _{i}p^{\prime }_{li}=1 \) there exists an \( i \) for which
\( p^{\prime }_{li}\leq \frac{1}{l} \). On sequence \( i \), we have an expected
reward of at most \( E_{r}C_{r}\leq 2R_{\textrm{min}}l^{l+i-1} \). On sequence
\( i \), the optimal algorithm explores until round \( T_{i} \) and then
exploits for the remaining \( T-T_{i} \) rounds, receiving reward:
\[
C_{\textrm{opt}}=R_{\textrm{min}}l^{2i}(T-T_{i})=R_{\textrm{min}}l^{l+i}\]
Consequently, the competitive ratio is at least \( \frac{C_{\textrm{opt}}}{E_{r}C_{r}}\geq \frac{l}{2} \),
establishing a lower bound of \( \Omega (l) \). To achieve the theorem, choose
\( l=f(T) \) and note that \( f(x)^{f(x)}\leq x \), so the construction fits
within the \( T \) available rounds.
\end{proof}
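The arithmetic behind the construction checks out mechanically (our own sanity check with an arbitrary small \( l \); the specific values are illustrative).

```python
# With T_i = T - l**(l - i) and exploitation value R_min * l**(2*i) from
# round T_i on, the optimal reward on sequence i is
#   R_min * l**(2*i) * (T - T_i) = R_min * l**(l + i).
l, R_min = 4, 1
T = l ** l + l  # any horizon large enough that every T_i is positive
for i in range(1, l + 1):
    T_i = T - l ** (l - i)
    assert R_min * l ** (2 * i) * (T - T_i) == R_min * l ** (l + i)
print("checked l =", l)
```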
\section{Application}
\subsection{Application to Markov Decision Processes}
The model in this paper, in which an agent does either pure
information gathering or pure exploitation, does not quite fit the
standard Markov Decision Process used to formalize the reinforcement
learning setting. We now outline where its specific assumptions do
not hold.
The four assumptions implicitly made in defining our abstract algorithm are:
\begin{enumerate}
\item The value of the exploitation policy (given the exploration done) is known and
a ``sub-algorithm'' exists which monotonically increases the exploitation policy
with more exploration.
\item The sequence of exploitation values is fixed.
\item Exploration results in zero reward while exploitation results in zero exploration.
\item Any information gained through exploitation does not result in policy improvement.
\end{enumerate}
Assumption 1 does not hold in stochastic policies, though it might
still hold with high probability in some situations.
Realizing assumption 2 is difficult in a multi-state world. The
potential benefit of exploration in general depends on the state the
agent is in, so the decision to explore or exploit should be state
dependent. However, if each round of the algorithm represents a game
which ends in some finite time, then the agent might decide whether it
explores or exploits during the course of each game. The criterion is
then to maximize the cumulative reward in the \( T \) games being
played.
The impact of assumption 3 is unclear in the more general
framework. Exploration in general results in some reward (otherwise
the exploitation policy could not improve), but the effect this
has on the analysis is not clear.
The impact of assumption 4 is, in general, problem dependent.
Fully analyzing the exact connections between this abstract analysis
and the reinforcement learning setting is an open problem, and one
that is important for practical algorithms.
\subsection{Application to computational search problems}
There are many examples of planning and search problems where the
computational difficulty of finding a good solution is much greater
than the computational difficulty of verifying a good solution. This
class of problems includes the NP-complete problems as well as
others. Many of these problems have known methods for
approximation---in particular it may be possible to find an
approximate solution quickly and the quality of the solution can be
refined with additional computation. For a discussion of some of these
situations see \cite{search}.
When the utility of an approximate solution decreases with time used
to find such a solution, these problems have an inherent
exploration/exploitation trade-off. A meta-algorithm can either
decide to ``explore'' (and thus find a better solution) or ``exploit''
and execute the current solution, locking in its value. At any
round, it is often easy to calculate the optimal ``exploitation''
value. Furthermore, our assumption that exploitation does not provide
exploration is often satisfied since using an approximate solution
often does not aid in the calculation of a better approximate
solution.
\section{Conclusion}
We analyzed the explore/exploit tradeoff with an abstract model and
showed that a small amount of extra information (knowledge of the number
of timesteps \( T \) or upper and lower bounds on \( R_{\textrm{opt}} \)) can greatly
improve the competitive ratio. The lower bounds and the upper bounds
in the analysis are tight up to \( \log \log \) factors.
There are two undesirable properties of our analysis which can be
addressed with a refined analysis. Although we show that we can
achieve something near optimal in expectation, the variance on the
reward achieved can be quite large. This problem has cropped up
elsewhere in competitive analysis (see \cite{soda} for an example) and
has been successfully addressed.
Another undesirable property of our analysis is that every upper bound is
instantiated with an algorithm that does all exploration before
exploitation. An alteration of the model removes this unintuitive
result. In particular, if the algorithm knows the \emph{distribution}
of ending times, there are examples where it is desirable to switch
from exploitation back to exploration.
\subsection*{Acknowledgments}
Thanks to Avrim Blum for many helpful suggestions and good advice.
\begin{thebibliography}{1}
\bibitem{auer95gambling}P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. ``Gambling
in a rigged casino: the adversarial multi-armed bandit problem'', Proc.
of the 36th Annual Symposium on Foundations of Computer Science, 1995.
\bibitem{BSZ}A. Blum, T. Sandholm, and M. Zinkevich. ``Online Algorithms
for Market Clearing'', Thirteenth Annual ACM-SIAM Symposium on
Discrete Algorithms, 2002.
\bibitem{online}Allan Borodin and Ran El-Yaniv. ``Online Computation and Competitive Analysis'',
Cambridge University Press, 1998.
\bibitem{Duff}Michael Duff, ``Q-Learning for Bandit Problems'', Proc. 12th International Conf. on Machine Learning, 1995.
\bibitem{EFKT}R. El-Yaniv, A. Fiat, R. M. Karp, and G. Turpin, ``Competitive Analysis of Financial Games'', Proc. of the 33rd Annual Symposium on Foundations of Computer Science, 1992, pages 327--333.
\bibitem{kearns98nearoptimal}M. Kearns and S. Singh. ``Near-optimal reinforcement learning in polynomial
time'', Proc. 15th International Conf. on Machine Learning, 1998.
\bibitem{search}K. Larson and T. Sandholm. ``Bargaining with limited computation: Deliberation
equilibrium'' Artificial Intelligence 2001, 132(2): 183--217.
\bibitem{soda}Stefano Leonardi, Alberto Marchetti-Spaccamela, Alessio Presciutti, and Adi Ros\'{e}n,
``On-line Randomized Call Control Revisited'', SODA (Symposium on Discrete
Algorithms), 1998.
\bibitem{explore}S. D. Whitehead, ``Complexity and cooperation in Q-learning'', Proc. 8th International
Conf. on Machine Learning, pages 363--367, 1991.
\end{thebibliography}
\end{document}