%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   %                                                     %
%  #####   # #      % File:     nrdp.tex                                  %
%    #     # #      % Author:   Davies, Ng, Moore                         %
%    # ###  #       %                                                     %
%    # #   # #      %                                                     %
%    # ### # #      %                                                     %  
%      #            %                                                     %
%      ###          %                                                     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Things to do:
% Make certain it is clear it's min-COST not just min-time to goal

\documentstyle[psfig,aaai]{article}
%\begin{document}

%\renewcommand{\baselinestretch}{2}

\newcommand{\myindex}[1]{\mbox{\scriptsize #1}}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   Prologue                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\twopixwidth}{2.5in}
\newcommand{\threepixunitwidth}{1.6in}
\newcommand{\threecapwidth}{1.6in}

\def\astar {{A^{\ast}}}
\def\mountaincar {{\sc modified-car}}
\def\cartpole {{\sc move-cart-pole}}
\def\acrobot {{\sc acrobot}}
\def\slider {{\sc slider}}

\newcounter{my-figures-ctr}

% \input{../bib/include.tex}
\input{twopix}
\input{threepix}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   Title                  %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%scottd: I think maybe we should emphasize the continuous-state
%  aspect of the paper, since that's where the novel stuff in the paper
%  is.  
\title{Applying Online Search Techniques to Continuous-State Reinforcement Learning}
%\title{Applying Model-based Search Techniques to Reinforcement Learning}


\iffalse	% AAAI98 reviews are anonymous
\author{Scott Davies$^{\ast}$ \hbox to 0.3in{}  Andrew Y. Ng$^{\dag}$
		\hbox to 0.3in{}  Andrew Moore$^{\ast}$ \\ \\
\parbox{3.5in}{
\centerline{
$^{\ast}$
School of Computer Science}
\centerline{Carnegie-Mellon University}
\centerline{Pittsburgh, PA 15213}
}
\parbox{3.5in}{
\centerline{
$^{\dag}$
Artificial Intelligence Lab}
\centerline{Massachusetts Institute of Technology}
\centerline{Cambridge, MA 02139} 
} \\
}
\fi

\author{} 	% AAAI98 reviews are anonymous

\begin{document}

\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   Text                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\uppara}{\ \\

\vspace{-17mm}

\ \\}

% I couldn't figure out how to make LaTeX do a footnote without 
% the footnote index number.
\footnotetext{Copyright notice will appear here in final version.}

\begin{abstract}
%\begin{quote}


%scottd: added ``particularly in continuous-state domains''
In reinforcement learning it is frequently necessary to resort to an
approximation to the true optimal value function.
Here we investigate the benefits of online search in such cases,
particularly in continuous-state domains.  We examine ``local''
searches, where the agent performs a finite-depth lookahead search,
and ``global'' searches, where the agent performs a search for a trajectory
all the way from the current state to a goal state.  The key to the
success of these methods lies in taking a value function, which gives
a rough solution to the hard problem of finding good trajectories 
from every single state, and combining that with online search, which 
then gives an accurate solution to the easier problem of finding a
good trajectory {\em specifically} from the current state.

%\end{quote}
\end{abstract}

% NOTE: keywords must be selected from the list supplied by AAAI
{\small
\begin{verbatim}
Content Area: Machine Learning, Reinforcement Learning
Additional Keywords: search, planning and control
\end{verbatim}
}

\section{Introduction}
A common approach to Reinforcement Learning involves approximating 
the value function, and then executing the greedy policy with respect to
the learned value function.  But particularly in continuous high-dimensional 
state spaces, it is often the case that even when our agent is 
given a perfect model of the world, it can be computationally 
expensive to fit a highly accurate value function. This problem is 
even worse when the agent is learning a model of the world and is 
repeatedly updating its dynamic programming solution online. What's
to be done?

In this paper, which is restricted to deterministic domains, we
investigate the idea that, rather than executing greedy policies with
respect to approximated value functions, we can use search techniques
to find better trajectories.  Briefly, we may perform a ``local''
search for a good short multi-step path, or a ``global'' search for a
trajectory all the way from the current state to a goal state, where
either search may be guided by the learned value function.
With such searches, we perform online computation that is directed towards
finding a good trajectory {\em from the current state}; this is in
contrast to, say, offline learning of a value function, which tries to
solve the much harder problem of learning a good policy for every
point in the state space.
%scottd: following sentence added
We discuss the application of such search
techniques to continuous state spaces, and demonstrate that these
techniques often dramatically improve the quality of solutions found,
at relatively little computational expense.

\section{Local Search}
	\label{section-localsearch}

Given a value function, agents typically execute a greedy
policy using a one-step lookahead search, possibly using a learned
model for the lookahead. The computational cost per step of this approach
is $O(|A|)$, where $A$ is the set of actions. This can be thought of as
performing a depth-1 search for the 1-step trajectory $T$ that
gives the highest $R_T + \gamma V(s_T)$, where $R_T$ is the reinforcement, 
$s_T$ is the state (possibly according to a learned world model) reached upon 
executing $T$, and $\gamma$ is the discount factor. 
A natural extension is then to perform a search of depth $d$, 
to find the trajectory that maximizes $R_T + \gamma^d V(s_T)$,
where discounting is incorporated in the natural way into $R_T$. The 
computational expense is $O(d |A|^d)$.
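As a concrete sketch of this depth-$d$ search (our illustration, not the implementation used in the experiments below), assume a deterministic (possibly learned) model {\tt step(s, a)} that returns the successor state and one-step reward:

```python
from itertools import product

def depth_d_lookahead(s0, actions, step, V, d, gamma):
    """Exhaustive depth-d lookahead: return the first action of the
    length-d action sequence maximizing R_T + gamma^d * V(s_T), where
    step(s, a) is a deterministic model returning (next_state, reward)
    and discounting is folded into R_T."""
    best_val, best_first = float("-inf"), None
    for seq in product(actions, repeat=d):      # all |A|^d sequences
        s, total, disc = s0, 0.0, 1.0
        for a in seq:
            s, r = step(s, a)
            total += disc * r
            disc *= gamma
        total += disc * V(s)                    # disc == gamma^d here
        if total > best_val:
            best_val, best_first = total, seq[0]
    return best_first, best_val
```

The outer loop over all $|A|^d$ action sequences makes the exponential cost explicit.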

To make deeper searches computationally cheaper, we might 
consider only a subset of these trajectories. 
Especially in dynamic control, an optimal trajectory often repeatedly
selects and then {\em holds} a certain action for some time, as
suggested by~(Boone 1997);\nocite{bn:mnmm}
a natural subset of the $|A|^d$ trajectories, therefore,
consists of trajectories that switch their actions rarely.
(The sequence $a_1,a_1,a_2,a_2,a_2$ has one ``switch,''
whereas $a_1,a_2,a_2,a_2,a_1$ has two ``switches.'')
When we constrain the number of switches between actions
to be at most $s$,
the time for such a search is
$O\left( d {d \choose s} |A|^{s+1} \right)$, considerably
cheaper than a full search if $s \ll d$.
%ayn UNCOMMENTED next 4 lines
We also suggest that $s$ is easily chosen for a particular domain by 
an expert, by asking how often action switches can reasonably be 
expected in an optimal trajectory, and then picking $s$ accordingly 
to allow an appropriate number of switches in a trajectory of length $d$. 
Figure~\ref{fig-example-localsearch} shows a search performed in the
{\mountaincar} task (described later) using $d=20$ and $s=1$. 
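The restricted trajectory set is easy to enumerate directly. The sketch below (an illustration under the same assumptions as above, not the experimental code) generates exactly the length-$d$ sequences with at most $s$ switches; their number is $|A| \sum_{i=0}^{s} {d-1 \choose i} (|A|-1)^i$, far smaller than $|A|^d$ when $s \ll d$:

```python
def sequences_with_few_switches(actions, d, s):
    """Enumerate the length-d action sequences containing at most s
    switches (a 'switch' is a position where the action differs from
    its predecessor, so a1,a1,a2 has one switch)."""
    out = []
    def extend(seq, switches):
        if len(seq) == d:
            out.append(tuple(seq))
            return
        for a in actions:
            # count one more switch if a differs from the previous action
            sw = switches + (len(seq) > 0 and a != seq[-1])
            if sw <= s:
                seq.append(a)
                extend(seq, sw)
                seq.pop()
    extend([], 0)
    return out
```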

%\begin{figure}[h]
%\hspace*{-0.5in}
\parbox{2.80in}
{
\centerline{\psfig{figure=localSearch.ps,width=2.70in,height=2.25in}}
\refstepcounter{my-figures-ctr}
  \label{fig-example-localsearch}
 \centerline{\parbox{2.55in} 
	{%\footnotesize 
	Figure~\ref{fig-example-localsearch}: Local Search example: a twenty-step search with at most one switch in actions}}
}
%\nolinebreak
%\parbox{0.20in}
%{ \hbox to 0.01in {} }
%\nolinebreak
\smallskip

During execution, the local search algorithm iteratively
finds the best trajectory $T$ of length $d$ with the search algorithm above, 
executes the first action on that trajectory, and then does a 
new search from the resulting state.  
If $B$ is the ``parallel backup operator''~(Bertsekas 1995)\nocite{bersekas95:dpaoc1} so
that $BV(s) = \max_{a \in A} \left[ R(s,a) + \gamma V(\delta(s,a)) \right]$, then executing
the full $|A|^d$ search is formally equivalent to executing the greedy
policy with respect to the value function $B^{d-1}V$. Noting that,
under mild regularity assumptions, as $k \rightarrow \infty$, $B^kV$
becomes the optimal value function, we can generally expect $B^{d-1}V$
to be a better value function than $V$. For example, in discounted 
problems, if the largest absolute error in $V$ is $\varepsilon$, the 
largest absolute error in $B^{d-1}V$ is at most $\gamma^{d-1}\varepsilon$. Lastly, 
for well chosen $s$, we expect the cheaper, restricted search to approximate 
this full search well.
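This contraction property is easy to verify numerically. The following sketch (a toy deterministic MDP of our own construction, not one of the domains studied here) checks that a single application of $B$ shrinks the worst-case error relative to $V^{\ast}$ by at least a factor of $\gamma$:

```python
import random

def backup(V, states, actions, R, delta, gamma):
    """One application of the parallel backup operator B:
    (BV)(s) = max_a [ R(s, a) + gamma * V(delta(s, a)) ]."""
    return {s: max(R[s, a] + gamma * V[delta[s, a]] for a in actions)
            for s in states}

# Toy deterministic MDP: 4 states on a ring, 2 actions (step left/right),
# reward 1 for entering state 0.
states, actions, gamma = range(4), (+1, -1), 0.9
delta = {(s, a): (s + a) % 4 for s in states for a in actions}
R = {(s, a): 1.0 if delta[s, a] == 0 else 0.0 for s in states for a in actions}

# Value iteration to a (numerically) optimal V*.
Vstar = {s: 0.0 for s in states}
for _ in range(500):
    Vstar = backup(Vstar, states, actions, R, delta, gamma)

# Perturb V* and measure the sup-norm error before and after one backup.
rng = random.Random(0)
V = {s: Vstar[s] + rng.uniform(-1, 1) for s in states}
err_before = max(abs(V[s] - Vstar[s]) for s in states)
err_after = max(abs(backup(V, states, actions, R, delta, gamma)[s] - Vstar[s])
                for s in states)
```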

This approach, a form of receding-horizon control, has most famously
been applied in minimax game-playing programs~(Russell and Norvig 1995)\nocite{rssl:artf} and
has also been used for single agents in small discrete domains (e.g., (Korf 1990)\nocite{krf:rltm}).

%[ayn: Would this section be clearer if written in terms of search over sequences 
%of actions rather than trajectories?]
%%%awm I like it the way you already have it

%Planning Time:
%Develop an approximate value function by dynamic programming. In our
%examples our function approximator is ????~\ite{davies}. 

\section{Global Search}

Local search is not the only way to improve an approximate value
function. Here, we describe global search for solving
least-cost-to-goal problems in continuous state spaces with
non-negative costs.  We assume the set of goal states is known.

\subsection{Uninformed Global Search}

%Here, we also use local searches to maintain a position on a
%trajectory and to increase the power of the high level searcher.

Instead of searching locally, why not continue growing a search tree
until it finds a goal state? The answer is clear---the combinatorial
explosion would be devastating.  In order to deal with this problem,
we borrow a technique from robot motion planning~(Latombe 1991).\nocite{ltmb:rbtm}
%scottd: changed, then I changed my mind and changed it back
We first divide the state space up into a fine uniform grid. 
%We first divide the state space up into partitions --- in this
%paper, we use a fine uniform grid.
A local search procedure
is then used to find paths from one grid element to
%%%AWM This idea is unique I believe. There should be at least one experiment
%%%AWM demoing that this works.
another.  Multiple trajectories entering the same grid element are
pruned, keeping only the least-cost trajectory into that grid element
(breaking ties arbitrarily). The rationale for the pruning is an
assumed similarity among points in the same grid element. In
this manner, the algorithm attempts to build a complete trajectory to
the goal using the learned or provided world model.  
%[Move this down to the next section!]
%It is worth noting that
%Uninformed Global Search is equivalent to Informed Global Search
%(described below) using a constant 0 heuristic evaluation function.
%scottd UNCOMMENTED
When the planner
finds a trajectory to the goal, it is executed in its entirety.

%scottd ADDED next sentence
The overall procedure is essentially a lowest-cost-first search
over a graph structure in which the graph nodes correspond to grid
elements, and in which the edges between graph nodes correspond to
trajectories between grid elements as found by the local search procedure.
A graph showing such a search for the
{\mountaincar} domain (to be described shortly) is depicted in
Figure~\ref{fig-example-uninformed-global}. 
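Stripped to its essentials, the procedure is a lowest-cost-first (Dijkstra-style) search over grid elements. The sketch below is our illustration rather than the experimental implementation; for brevity it uses a depth-1 local search between elements, where {\tt step} is the deterministic model and {\tt grid\_of} maps a continuous state to its grid element:

```python
import heapq

def uninformed_global_search(s0, actions, step, cost, grid_of, is_goal,
                             max_expansions=100000):
    """Lowest-cost-first search over grid elements.  Multiple
    trajectories entering the same element are pruned, keeping only the
    cheapest; each element keeps one 'representative state'.  Returns
    the action sequence reaching a goal state, or None."""
    start = grid_of(s0)
    best = {start: 0.0}          # cheapest known cost into each element
    rep = {start: (s0, [])}      # representative state + actions to reach it
    frontier = [(0.0, 0, start)]
    tie = 1
    while frontier and max_expansions > 0:
        c, _, g = heapq.heappop(frontier)
        if c > best.get(g, float("inf")):
            continue             # stale queue entry
        s, path = rep[g]
        if is_goal(s):
            return path          # execute this trajectory in its entirety
        max_expansions -= 1
        for a in actions:        # depth-1 "local search" between elements
            s2 = step(s, a)
            g2, c2 = grid_of(s2), c + cost(s, a)
            if c2 < best.get(g2, float("inf")):
                best[g2] = c2
                rep[g2] = (s2, path + [a])
                heapq.heappush(frontier, (c2, tie, g2))
                tie += 1
    return None
```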

%\hspace*{-0.35in}
\smallskip
\parbox{2.8in}
{
\centerline{\psfig{figure=noVFGlobalSearch.grid.ps,width=2.70in,height=2.25in}}
\refstepcounter{my-figures-ctr}
  \label{fig-example-uninformed-global}
  \centerline{\parbox{2.55in}
	{%\footnotesize 
	Figure~\ref{fig-example-uninformed-global}:
	Uninformed Global Search example.
Velocity on the $x$-axis, car position on the $y$-axis.  Large black dot
	is starting state; the small dots are grid
	elements' ``representative states.'' }}
}
\smallskip

\parbox{2.8in}
{
\centerline{\psfig{figure=crappyVFGlobalSearch.grid.ps,width=2.70in,height=2.25in}}
\refstepcounter{my-figures-ctr}
  \label{fig-example-informed-global}
 \centerline{\parbox{2.55in} 
	{%\footnotesize 
	Figure~\ref{fig-example-informed-global}:
		Informed Global Search example
		 on {\mountaincar}, with a crudely approximated value function.}}
}
%\end{figure}
\smallskip

\parbox{2.8in}
{
\centerline{\psfig{figure=goodVFGlobalSearch.grid.ps,width=2.70in,height=2.25in}}
\refstepcounter{my-figures-ctr}
  \label{fig-example-informed-global2}
 \centerline{\parbox{2.55in} 
	{%\footnotesize 
	Figure~\ref{fig-example-informed-global2}:
		Informed Global Search example
		 on {\mountaincar}, with a more accurately approximated value function.}}
}
%\end{figure}
\smallskip


%%%awm The below paragraph is now redundant and can be eliminated provided
%%%awm hng:plnn is given credit somewhere
%This algorithm was inspired by the non-holonomic path planner
%of~\cite{ltmb:rbtm}, where instead of searching through a dynamic
%statespace, the authors search configuration space using
%distance-to-goal as a heuristic.  Similar techniques were employed
%by~\cite{bn:mnmm}, which discusses this and other search approaches
%for the Acrobot in more detail, and~\cite{hng:plnn}.

%scottd RERECOMMENTED, DAMMIT :)
%  (First sentence uncommented elsewhere; other stuff moved to the
%  ``learning a model online'' section)
%ayn REUNCOMMENTED
%scottd RECOMMENTED
%ayn UNCOMMENTED
%When the planner finds a trajectory to the goal, it is executed in its
%entirety. But in the case where we are learning a model of the world,
%it is possible for the agent to successfully plan a continuous 
%trajectory using the learned world model, but to fail to reach the 
%goal when it 
%tries to follow the planned trajectory.  In this case, failure to follow 
%the successfully planned trajectory can directly be attributed to
%inaccuracies in the agent's model; and in executing the path anyway, the
%agent will naturally reach the area where the actual trajectory
%diverges from the predicted/planned trajectory and thereby improve its
%model of the world in that area.

%Using these trajectories can significantly improve the agent's online performance.
%Describe how the planning vs execution aspect of this is organized.
%Mention we don't need an approximate eval function at all. But we do
%need a goal or some criterion for stopping the search.

\subsection{Informed Global Search}

We can modify Uninformed
Global Search by using an approximated value function to {\em guide} the 
search expansions in the style of $\astar$ search~(Nilsson 1971)\nocite{nlss:prbl}
(the full algorithm is written out below).
% from grid element to grid element. 
With the perfect value function, this causes the search to traverse 
exactly the optimal path to the goal; with only an approximation to 
the value function, it can still dramatically reduce the fraction of 
the state space that is searched. 
%One other
%modification is that in the pruning of multiple trajectories entering
%the same grid element, we now keep the trajectory that maximizes $R_T
%+ V(s_T)$ (unlike Uninformed Global Search, which does not use an
%$V(s_T)$ term,) and this finer pruning can cause better trajectories
%than Uninformed Global Search to be found.
%%%AWM Is there any chance of an experiment demoing the above point?
%%%ayn: Yup. Done by Scott in Section 5.

The search uses a sparse representation so that only grid
cells that are visited take up memory.\footnote{This has a flavor not
dissimilar to the hashed sparse coarse encodings of
(Sutton 1996)\nocite{sutton96:girl}.}
Notice also that, like Local
Search, this is online computation in the sense that
we perform a search only once we know the ``current state,''
and only to find a trajectory specifically from that state;
this is 
in contrast to offline computation for finding a value
function, which tries to solve the much more difficult
problem of finding a good trajectory to the goal from every single point in
the state space. 

Written out in full, the search algorithm is:


{% \small
\begin{enumerate}
\item Suppose $g(s_0)$ is the grid element containing the current state $s_0$.
Set $g(s_0)$'s ``representative state'' to be $s_0$, and add $g(s_0)$ to
a priority queue $P$ with priority $V(s_0)$, where $V$ is an
approximated value function. 
\item Until a goal state has been found, or $P$ is empty:
\begin{itemize}
\item Remove a grid element $g$ from the top of $P$. Suppose $s$ is $g$'s
``representative state.''  
\item Starting from $s$, perform a ``local'' search as in
Section~\ref{section-localsearch}, except that each search trajectory is cut
off as soon as it reaches a state $s'$ in some other grid element $g' \neq g$.
If $g'$ has not been visited before, add $g'$ to $P$ with a priority
$p(g') = R_T(s_0, \ldots, s')+ \gamma^{|T|} V(s')$, where $R_T$ is the reward accumulated
along the recorded trajectory $T$ from $s_0$ to $s'$, and set $g'$'s
``representative state'' to $s'$.
Similarly, if $g'$ has been visited before, but 
$p(g') \leq R_T(s_0, \ldots, s') + \gamma^{|T|} V(s')$,
then update $p(g')$ to the latter quantity and set $g'$'s
``representative state'' to $s'$.  Either way, if $g'$'s 
``representative state'' was set to $s'$, record the sequence 
of actions required to get from $s$ to $s'$, and set $s'$'s 
predecessor to $s$.
\end{itemize}
\item If a goal state has been found, execute the trajectory.
Otherwise, the search has failed, because our
grid was too coarse, our state-transition model was inaccurate,
or the problem was insoluble.
\end{enumerate}}
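The loop above can be sketched compactly as follows. This is our illustration rather than the experimental implementation: it simplifies the local search to depth 1 and stores full action paths instead of the predecessor bookkeeping described in step 2:

```python
import heapq

def informed_global_search(s0, actions, step, V, gamma, grid_of, is_goal,
                           max_expansions=100000):
    """Each grid element keeps one representative state; elements are
    expanded in order of priority p(g) = R_T + gamma^|T| * V(rep(g)),
    the (discounted) reward accumulated along the recorded trajectory
    plus the estimated value of the representative state."""
    g0 = grid_of(s0)
    info = {g0: (V(s0), s0, [], 0.0, 0)}  # priority, rep, path, R_T, |T|
    P = [(-V(s0), 0, g0)]                 # max-priority via negation
    tie = 1
    while P and max_expansions > 0:
        negp, _, g = heapq.heappop(P)
        p, s, path, RT, n = info[g]
        if -negp < p:
            continue                      # stale queue entry
        if is_goal(s):
            return path                   # execute trajectory in full
        max_expansions -= 1
        for a in actions:                 # local search of depth 1
            s2, r = step(s, a)
            g2 = grid_of(s2)
            if g2 == g:
                continue                  # cut off on leaving the element
            RT2, n2 = RT + (gamma ** n) * r, n + 1
            p2 = RT2 + (gamma ** n2) * V(s2)
            if g2 not in info or p2 > info[g2][0]:
                info[g2] = (p2, s2, path + [a], RT2, n2)
                heapq.heappush(P, (-p2, tie, g2))
                tie += 1
    return None
```

Running the same code with $V \equiv 0$ and rewards equal to negated costs recovers Uninformed Global Search.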

%scottd ADDED
The above procedure is very similar to a standard $\astar$ search,
with two important differences.  First, the ``heuristic function''
used here is an automatically generated approximate value function
rather than a hand-coded heuristic.  This has the advantage of being a
more ``autonomous'' approach requiring relatively little hand-encoding
of domain-specific knowledge.  On the other hand, it also typically
means that the heuristic function used here may sometimes overestimate
the cost required to get from some points to the goal, which can
sometimes lead to suboptimal solutions; that is, the approximated
value function is not necessarily an {\em optimistic} or {\em
admissible} heuristic.  {\em However, within the context of the search
procedure used above, inadmissible but relatively accurate value
functions can lead to much better solutions than those found with
optimistic but inaccurate heuristics.}  (Note that the Uninformed
Global Search algorithm above is essentially Informed Global
Search with an optimistic but inaccurate heuristic that estimates a
remaining cost-to-go of 0 everywhere.)  This is due to the second
important difference between our search procedure above and a standard
$\astar$ search in a discrete-state domain: the search procedure above
uses the approximated value function not only to decide what grid
element to search from next, but also from what particular point in
that grid element it will search for local trajectories to neighboring
grid elements.
%scottd: I'm not sure about this next sentence.  
Overall, the search procedure described above can be viewed as a
combination of (1) ``classic AI'' $\astar$ search; (2)
value-function approximation techniques normally used in reinforcement
learning; and (3) trajectory-based search techniques taken from robot
motion planning.
%(Something like ``marriage'' would sound cooler than ``combination'',
%but there are three things here and the French phrase for such situations
%would probably be a bit racy)

An example of the search performed by this algorithm on the
{\mountaincar} domain is shown in
Figure~\ref{fig-example-informed-global}.  The value function was
approximated with a simplex-based
interpolation~(Davies 1997)\nocite{davies97:mtaifrl:nips} on a coarse 7 by 7 grid,
with all other parameters the same as in
Figure~\ref{fig-example-uninformed-global}.  Much less of the state
space is searched than by Uninformed Global Search.
%scottd: added next paragraph along with corresponding figure.
%Do we want this?

In Figure~\ref{fig-example-informed-global2}, the value function was
approximated more accurately with a simplex-based interpolation on a
21 by 21 grid.  With such an accurate value function, the ``search''
simply goes straight to the goal without searching down any suboptimal
paths.  Naturally, in such a situation the ``search'' is actually
unnecessary, since merely greedily following the approximated value
function would have produced the same solution.  Because we can easily
afford to compute an approximated value function with a 21 by 21 grid,
the search techniques discussed here are not necessary in practice for
a domain as simple as the {\mountaincar} domain; we only use the
{\mountaincar} example here since it is two-dimensional and thus
easily used for illustrations.  However, when we move to
higher-dimensional problems, such as problems examined in the next
section, high-resolution approximated value functions become
prohibitively expensive to calculate, and the search procedure above
can be a very cost-effective way of improving performance.

\section{Experiments}

We tested our algorithms on the following 
domains\footnote{C code for all 4 domains (implemented with numerical 
integration and smooth dynamics) will shortly be made available on the 
Web.}:

%{\small
\begin{itemize}
\item {\mountaincar} (2 dimensional): A car has to reach and park at the top of a 
	one-dimensional hill; the car needs to back up first in order to gather 
	enough momentum to get to the goal.
	Note this is slightly more difficult than the normal mountain car, 
	as we require a velocity near 0 at the top of the hill~(Moore and Atkeson 1995).\nocite{mr:thpr2}
	State consists of $x$-position 
	and velocity. Actions are accelerate forward or backward.
\item {\acrobot} (4 dimensional): An acrobot is a two-link planar robot acting
	in the vertical plane under gravity with only one weak actuator at its elbow joint.
	% and an unactuated shoulder. 
	The goal is to raise the hand at least one link's height above 
	the shoulder~(Sutton 1996).\nocite{sutton96:girl}
	State consists of joint angles and angular velocities at the 
	shoulder and elbow. Actions are positive or negative torque.
\item {\cartpole} (4 dimensional): A cart-and-pole system starting
	with the pole upright is to be moved some distance to a goal
	state, keeping the pole upright (harder than the stabilization
	problem). The trial terminates with a huge penalty ($-10^6$) if the
	pole falls over. 
        State consists of the cart position and velocity, and the pole 
	angle and angular velocity.  Actions are accelerate left or right.
\item {\slider} (4 dimensional): Like a two-dimensional mountain car, where a
	``slider'' has to reach a goal region in a two-dimensional terrain.
	The terrain's contours are shown in Figure~\ref{fig-slider-beta}.
	State is two dimensional position and two dimensional velocity.
	Actions are acceleration in the NE, NW, SW, or SE directions. 
\end{itemize}
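For readers who wish to experiment, a stand-in for the {\mountaincar} dynamics can be assembled from the classic discrete-time mountain-car update (as in Sutton 1996). Note that our experiments instead use smooth dynamics with numerical integration, so the constants and goal tolerance below are assumptions, not the values used here:

```python
import math

# Standard mountain-car constants (assumptions, not the paper's values).
X_MIN, X_MAX, V_MAX = -1.2, 0.6, 0.07

def car_step(x, v, a):
    """One step of classic mountain-car dynamics; a is +1 (accelerate
    forward) or -1 (accelerate backward)."""
    v = v + 0.001 * a - 0.0025 * math.cos(3 * x)
    v = max(-V_MAX, min(V_MAX, v))
    x = x + v
    if x <= X_MIN:               # inelastic collision with the left wall
        x, v = X_MIN, 0.0
    x = min(x, X_MAX)
    return x, v

def is_goal(x, v, v_tol=0.02):
    """modified-car goal: parked near the hilltop with small speed."""
    return x >= X_MAX and abs(v) <= v_tol
```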
%}

\parbox{2.80in}
{
\centerline{\psfig{figure=beta.ps,width=2.70in,height=2.25in}}
\refstepcounter{my-figures-ctr}
  \label{fig-slider-beta}
  \centerline{\parbox{2.55in} {%\footnotesize 
	Figure~\ref{fig-slider-beta}:
		{\slider}'s terrain. Goal at upper left.}}
}
%\nolinebreak
\smallskip

\begin{table*}
\centerline{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|} \hline
$d$ & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \hline
cost & 49994& 42696& 31666& 14386& 10339& 27766& 11679& 8037& 9268& 10169 \\ \hline
time &0.66& 0.64& 1.24& 1.02& 1.13& 2.07& 3.32& 3.84& 7.30& 15.50  \\ \hline
%time &0.657& 0.637& 1.244& 1.021& 1.127& 2.065& 3.316& 3.839& 7.301& 15.496 \\ \hline
\end{tabular}}
\caption{Local search on {\cartpole}}
	\label{table-ls-cartpole}
\end{table*}

\begin{table*}
\centerline{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|} \hline
$d$  & 1 & 2 & 3 & 4 & 6 & 8 & 12 & 16 & 24  \\ \hline \hline
cost & 187 & 180 & 188 & 161 & 140 & 133 & 133 & 134 & 112 \\ \hline
%time & 0.023 & 0.047 & 0.101 & 0.158 & 0.356 & 0.701 & 2.078 & 4.624 & 12.435 \\ \hline
time & 0.02 & 0.05 & 0.10 & 0.16 & 0.36 & 0.70 & 2.08 & 4.62 & 12.44 \\ \hline
\end{tabular}}
\caption{Local search on {\mountaincar}}
	\label{table-ls-mountaincar}
\end{table*}


All four are undiscounted tasks. {\cartpole}'s cost on each step
is quadratic in distance to
goal. The other three domains cost a constant $-1$ per step.
All results are averages over
1000 trials with start states chosen uniformly at random
in the state space, with the exception of the \cartpole, in which
only the pole's initial distance from its goal configuration is varied.
%%%awm comment that this too is harder than usual formulations of
%%%awm acrobot, car-on-hill, carpole?
% ayn: I think this actually makes it easier for acrobot or caronhill
%  (for most offline grid-base VI DP methods at least. Maybe not Sutton's' 
%   stuff.)
The algorithm we
used to learn a value function is the simplex-interpolation algorithm
described in~(Davies 1997).\nocite{davies97:mtaifrl:nips} But note that
the choice of a function approximator is orthogonal to the
search techniques we describe here, which can readily be applied to
any other value function computed by any other algorithm, such as an
LQR solution to a linearized problem or a neural net value function
computed with TD.

For now, we consider only the case where we are given a model of the
world, and leave the model-learning case to the next section.

\subsection{Local Search}


Here, we look at the effects of different parameter settings for Local
Search.  We first consider {\cartpole}. Empirically, 
a good trajectory ``switches'' actions very often, and we therefore 
chose not to assume much ``action-holding,'' and set $s=d-1$.
The approximate value function was found using a four-dimensional 
simplex-interpolation grid with quantization $13^4$, which is about 
the finest resolution simplex-grid that we could reasonably afford
to use. See Table~\ref{table-ls-cartpole}; as we increase the 
depth of the search from 1 (greedy policy 
with respect to $V$) up to 10 (greedy policy with respect to $B^9V$), 
we see that performance is significantly improved, but with CPU 
time per trial (on a 100MHz HP C300 9000, given in seconds) 
increasing exponentially.

The next experiment we consider here is {\mountaincar} on a coarse
($7^2$) grid.  
Empirically, entire trajectories (of $> 100$ steps) to the goal can 
often be executed with 2 or 3 action switches, and the optimal trajectory 
to the goal from the bottom of the hill at rest requires only 
about 3 switches. Thus, for the depth of searches we performed, 
we very conservatively chose $s=2$.
In Table~\ref{table-ls-mountaincar}, our
experimental results again show solution quality significantly
increased by Local Search, but with running times growing much more slowly
with $d$ than before.

%scottd: section name tweaked
%\subsection{Experimental Results}
\subsection{Comparative Experiments}

%The table below shows the average cost incurred per trial, as well as
%the average CPU time required per trial, for each algorithm in each of
%the four domains.  The average number of local searches (\#LS) per trial
%performed by the global search algorithms is also provided.  

Table~\ref{table-results-summary} summarizes our experimental 
results\footnote{The parameters for the 4 domains were, in order:
value function interpolation grid resolution: $7^2, 13^4,13^4,13^4$,
Local Search: $d=6,s=2, d=5,s=4, d=5,s=4, d=10,s=1$, 
Global Search Grid resolution: $50^2, 50^4, 50^4, 20^4$,
Local search within Global search: $d=20,s=1$ for all 4.}.
``cost'' is average
cost per trial, ``time'' is average CPU seconds per trial, and \#LS is 
the average number of local searches performed by the global search 
algorithms (which indicates the amount of state space considered).

\begin{table*}
\centerline{
\begin{tabular}{|l|c|c||c|c||c|c|c||c|c|c|} \hline
& \multicolumn{2}{c||}{No Search} & \multicolumn{2}{c||}{Local Search} & 
		\multicolumn{3}{c||}{Uninformed Global} & \multicolumn{3}{c|}{Informed Global} \\ 
& \bf cost & time & \bf cost & time & \bf cost & \#LS & time & \bf cost & \#LS & time \\ \hline \hline
\mountaincar & \bf 187 & 0.02 & \bf 140 & 0.36 & \bf FAIL      & -- & -- & \bf 151              & 259 & 0.14 \\ \hline 
\acrobot & \bf 454 & 0.10 & \bf 305  & 1.2 & \bf 407 & 14250 & 5.8 & \bf 198 & 914 & 0.47  \\  \hline 
\cartpole & \bf 49993 & 0.66 & \bf 10339 & 1.13 &  \bf 3164           & 7605 & 3.45 & \bf 5073               & 1072 & 0.64 \\ \hline
\slider    & \bf 212 & 1.9 & \bf 197         & 51.72 & \bf 104          & 23690 & 94 & \bf 54          & 533 & 2.0 \\  \hline 
\end{tabular}
}
\caption{Summary of experimental results}
  \label{table-results-summary}
\end{table*}


Several trends deserve attention. Local Search consistently beats No
Search, but at the cost of increased computation time. Informed Global
Search significantly beats No Search; it also searches {\em much} less of
the state space than Uninformed Global Search, resulting in
correspondingly faster running times.
In fact, because the solutions found by Informed Global Search are often
much shorter than those found using no search at all, the computational
time per trial is sometimes essentially {\em the same} for Informed
Global Search and No Search, while the quality of the solution found by
Informed Global Search is many times better --- for example, a factor of
4 in the {\slider} domain. (It performs a factor of 10 better in the
{\cartpole} domain, but that is largely a function of the particular
penalty associated with the pole falling all the way down.)
Also note that because of the sparse
representation of Global Search grids, we can comfortably use 
grid resolutions as high as $50^4$ without running out of memory.

While {\mountaincar} is too simple a problem for the search techniques
described here to be useful in practice (it would have been easy simply
to calculate a much more accurate value function), its results
demonstrate interesting phenomena that carry over to higher-dimensional
problems.
Despite the use of a $50^2$ grid for the global search, Uninformed
Global Search surprisingly often fails to find a path to the goal,
whereas Informed Global Search, despite searching much less of the
state space, succeeds.
This is because Informed Global Search uses a value function to guide
its pruning of multiple trajectories entering the same grid cell, and
therefore makes a better selection of ``representative states'' for grid
elements.
(In particular, it normally recognized high-velocity states
as ``better'' in the appropriate parts of the state space, whereas
Uninformed Global Search often erroneously pruned these states in
favor of lower-velocity states in the same grid elements.)
This also helps explain why Informed Global Search
beats Uninformed Global Search on 2 of the 3 four-dimensional domains.
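The cell-pruning rule just described can be sketched as an $\astar$-like
loop over a simulated continuous system. This is an illustrative sketch
only: the helpers `step`, `V`, `cell_of`, and `goal` are assumptions
standing in for the paper's simulator, value function, grid
discretization, and goal test.

```python
import heapq

def informed_global_search(start, step, actions, V, cell_of, goal,
                           max_expansions=10000):
    """A*-like search with priority = accumulated cost + V(state).
    At most one representative state per grid cell is expanded, chosen
    by the value function V (the 'informed' pruning discussed above)."""
    frontier = [(V(start), 0.0, start, [])]   # (priority, cost so far, state, actions)
    best_in_cell = {}                         # cell -> V of its best representative
    while frontier and max_expansions > 0:
        max_expansions -= 1
        _, g, state, path = heapq.heappop(frontier)
        if goal(state):
            return path
        cell = cell_of(state)
        if cell in best_in_cell and best_in_cell[cell] <= V(state):
            continue                          # a better representative already expanded
        best_in_cell[cell] = V(state)
        for a in actions:
            s2, c = step(state, a)
            heapq.heappush(frontier, (g + c + V(s2), g + c, s2, path + [a]))
    return None                               # search failed within the budget
```

An uninformed variant corresponds to replacing `V` with the constant-0
function: every cell then keeps whichever trajectory reached it first,
which is exactly the pruning that discarded the high-velocity
{\mountaincar} states.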

When the Global Search
grid resolution is increased to $100^2$ for \mountaincar, both global
searches consistently succeed. However, Uninformed Global Search
(mean cost 109) now beats Informed Global Search (mean cost 138).
The finer grid makes good selection of ``representative
states'' less important; meanwhile, inaccuracies in the value
function guiding Informed Global Search cause it to ``miss''
certain good trajectories. This phenomenon often occurs
in $\astar$-like searches when the heuristic evaluation function
is not strictly ``optimistic''~(Russell and Norvig 1995).\nocite{rssl:artf} It is not a problem
for Uninformed Global Search, which effectively uses the maximally
optimistic ``constant 0'' evaluation function.
It is interesting to note that in the {\cartpole} domain, in which
Uninformed Global Search found better solutions than Informed Global
Search, the step size was large enough and the dynamics nonlinear
enough that single steps often crossed multiple grid elements, and
each grid element typically had only one previously found trajectory
entering it when it was expanded.  Thus, this was a case in which
Informed Global Search's ability to discriminate between ``good'' and
``bad'' states within the same grid element was not relevant.


\section{Learning a Model Online}

\newcommand{\kdtree}{{\it k}d-tree}

Occasionally, the state transition function is not known but rather must be 
learned online.  This does not preclude the use of online search techniques;
as a toy example, Figure~\ref{fig-learn} shows cumulative cost learning curves 
for {\mountaincar}.
For each action, a {\kdtree}-based 1-nearest-neighbor learner is used
to model the state transitions; to encourage exploration, states
sufficiently far from the points stored in the trees are optimistically
assumed to be zero-cost absorbing states.  A $7^2$ value function approximator
is updated online with the changing state transition model. Without search, 
the learner eventually attains an average cost per trial of about 212; with 
Informed Global Search (search grid resolution $50^2$), it quickly 
(after about 5 trials) achieves 
an average cost of 155; with Local Search ($d=20, s=1$), it 
achieves an average cost of 127 (also after about 5 trials).
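The optimistic learned model can be sketched as follows. A brute-force
nearest-neighbor lookup stands in for the {\kdtree}, and the `radius`
threshold and record format are assumptions, not the paper's exact
implementation.

```python
import math

class LearnedModel:
    """Per-action 1-nearest-neighbor transition model (brute-force
    stand-in for a kd-tree).  States farther than `radius` from every
    stored point are optimistically treated as zero-cost absorbing
    states, which encourages exploration of unvisited regions."""

    def __init__(self, radius):
        self.radius = radius
        self.data = {}          # action -> list of (state, next_state, cost)

    def observe(self, state, action, next_state, cost):
        self.data.setdefault(action, []).append((state, next_state, cost))

    def predict(self, state, action):
        """Return a predicted (next_state, cost); optimistic if no nearby data."""
        samples = self.data.get(action, [])
        if samples:
            s, s2, c = min(samples, key=lambda rec: math.dist(state, rec[0]))
            if math.dist(state, s) <= self.radius:
                return s2, c    # reuse the nearest observed transition
        return state, 0.0       # unknown territory: zero-cost absorbing state
```

The zero-cost absorbing assumption makes unexplored states look
attractive to the planner, so the searches naturally steer the agent
toward regions where the model lacks data.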

\parbox{2.80in}
{
\centerline{\psfig{figure=learn.ps,width=2.70in,height=2.25in}}
\refstepcounter{my-figures-ctr}
  \label{fig-learn}
  \centerline{\parbox{2.55in} {%\footnotesize 
	Figure~\ref{fig-learn}:
		Cumulative cost curves on {\mountaincar} with model learning. 
%		(Shallow gradients are good.)
		}}
}
\smallskip


As before, when the planner finds a trajectory to the goal, it is
executed in its entirety. When learning a model of the world, however,
the agent may successfully plan a continuous trajectory using the
learned model yet fail to reach the goal when it follows that
trajectory.  In this case, the failure can be attributed directly to
inaccuracies in the agent's model; by executing the path anyway, the
agent naturally reaches the area where the actual trajectory diverges
from the predicted trajectory and thereby improves its model of the
world in that area.
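This execute-anyway behavior can be sketched as a loop that records
every real transition, so the model improves exactly where its
predictions diverge from reality. The `plan`, `world_step`, and `goal`
helpers here are hypothetical, and `model` is any object with the
`predict`/`observe` interface described above.

```python
def execute_with_replanning(state, plan, model, world_step, goal):
    """Execute a planned action sequence in the real world, feeding
    every observed transition back into the learned model."""
    for a in plan(state, model):
        predicted, _ = model.predict(state, a)      # what the model expects
        next_state, cost = world_step(state, a)     # what actually happens
        # Learning from every step means the model is corrected precisely
        # where `predicted` diverges from `next_state`.
        model.observe(state, a, next_state, cost)
        state = next_state
        if goal(state):
            break
    return state
```

If the goal is not reached, the next planning pass runs against a model
that is now more accurate along the executed trajectory.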

However, several interesting issues do arise when 
the state transition function is being approximated online.  
Inaccuracies in the
model may cause the Global Searches to fail in cases where more
accurate models would have let them find paths to the goal.
Optimistic exploration policies can be used to help the system
gather enough data to reduce these inaccuracies, but in
even moderately high-dimensional spaces such exploration becomes
very expensive.
Furthermore, trajectories supposedly found during search will
certainly not be followed exactly by an open-loop controller; adaptive
closed-loop controllers may help alleviate this problem to some
extent.  
Finally, using the models to predict state transitions should be 
computationally cheap, since we will be using them to update the 
approximated value function with the changing model, as well as 
to perform searches.


\section{Future Research}

How well will these techniques extend to non-deterministic systems?
They may work for problems in which certain regularity assumptions are
reasonable, but more sophisticated state transition function approximators
may be required when learning a model online.

How useful is Local Search in comparison with building a local linear
controller for trajectories? During execution some combination
of the two may be best. Local Search also plays an important role
in the inner loop of global search: it is unclear how local linear
control could do the same.

The experiments presented here are low-dimensional. It is encouraging
that informed search permits us to survive $50^4$ grids, but to properly
thwart the curse of dimensionality we conclude that:
\begin{enumerate}
\item Informed Global Search is often much more tractable 
         than Uninformed Global Search, even with relatively
         crudely approximated value functions.
\item However, more accurate (yet computationally tractable) value function 
approximators may be needed than the simplex-grid-based approximators
used here.
\item Variable resolution methods (e.g. extensions
to~(Moore and Atkeson 1995))\nocite{mr:thpr2} would probably be needed for 
the Global Search's state-space partitions rather than the uniform grids
used here.
\end{enumerate}

Lastly, algorithms to learn reasonably accurate yet 
consistently ``optimistic''~(Russell and Norvig 1995)\nocite{rssl:artf} value functions 
might be helpful for Informed Global Search.

\bibliography{all-awm,cv-bib}
\bibliographystyle{aaai}
\end{document}
