%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   %                                                     %
%  #####   # #      % File:     nrdp.tex                                  %
%    #     # #      % Author:   Davies, Ng, Moore                         %
%    # ###  #       %                                                     %
%    # #   # #      %                                                     %
%    # ### # #      %                                                     %  
%      #            %                                                     %
%      ###          %                                                     %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

% Changes on 4/7/98 by Scott: search for ``ascottd'' in comments.

% Changes on 1/18/98 by Andrew: search for ``jawm'' at the start
% of comments.

% Things to do:
% Make certain it is clear it's min-COST not just min-time to goal

\documentstyle[psfig,aaai]{article}
% jawm: above is what it SHOULD be
% \documentstyle[psfig,nips97]{article}

%\begin{document}

%\renewcommand{\baselinestretch}{2}

\newcommand{\myindex}[1]{\mbox{\scriptsize #1}}


%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   Prologue                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\twopixwidth}{2.5in}
\newcommand{\threepixunitwidth}{1.6in}
\newcommand{\threecapwidth}{1.6in}

\def\astar {{A^{\ast}}}
% \def\mountaincar {{\sc modified-car}}
% jscottd: changed to ``mountain-parking''
\def\mountaincar {{\sc mountain-parking}}
\def\cartpole {{\sc move-cart-pole}}
\def\acrobot {{\sc acrobot}}
\def\slider {{\sc slider}}

\newcounter{my-figures-ctr}

% \input{../bib/include.tex}
% jawm commented out \input{twopix}
% jawm commented out \input{threepix}

\newcommand{\normalfig}[2] {\begin{figure}
\centerline{\psfig{file=#1.ps,width=0.37\textwidth}}
\begin{center}
\begin{minipage}{0.35\textwidth}
{\footnotesize
\refstepcounter{figure}
\label{#1}
\noindent
Figure~\ref{#1}: #2}
\end{minipage}
\end{center}
\end{figure}}

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   Title                  %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

%scottd: I think maybe we should emphasize the continuous-state
%  aspect of the paper, since that's where the novel stuff in the paper
%  is.  
% jawm: Absolutely agree
\title{Applying Online Search Techniques to Continuous-State Reinforcement Learning}
%\title{Applying Model-based Search Techniques to Reinforcement Learning}

\author{Scott Davies$^{\ast}$ \hbox to 0.3in{}  Andrew Y. Ng$^{\dag}$
		\hbox to 0.3in{}  Andrew Moore$^{\ast}$ \\ \\
\parbox{3.5in}{
\centerline{
$^{\ast}$
School of Computer Science}
\centerline{Carnegie-Mellon University}
\centerline{Pittsburgh, PA 15213}
}
\parbox{3.5in}{
\centerline{
$^{\dag}$
Artificial Intelligence Lab}
\centerline{Massachusetts Institute of Technology}
\centerline{Cambridge, MA 02139} 
} \\
}

\begin{document}

\maketitle

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%                   Text                   %
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\uppara}{\ \\

\vspace{-17mm}

\ \\}

% I couldn't figure out how to make LaTeX do a footnote without 
% the footnote index number.
%\footnotetext{Copyright notice will appear here in final version.}
%ascottd: changed copyright
\insert\footins{\footnotesize{\noindent Copyright \copyright\ 1998, American Association for Artificial Intelligence (www.aaai.org).  All rights reserved.}}

\begin{abstract}
%\begin{quote}

%The goal of Reinforcement Learning is to find sequences of actions
%that maximize long-term reward.  Other fields of research have, at an
%abstract level, been concerned largely with the same problem: for
%example, the AI/game playing literature discusses algorithms for
%selecting moves that will eventually lead to victory, and robot motion
%planning is used to find series of actuator torques that will place
%robots in desired configurations.  Both of these areas are rich with
%search techniques.  In this paper, we investigate the application of
%similar online search algorithms to model-based Reinforcement Learning
%domains.  Our search algorithms fall into two categories: first,
%``local'' searches, where the agent performs a finite-depth lookahead
%search over possible trajectories; and second, ``global'' searches,
%where the agent performs a search for a trajectory all the way from
%the current state to a goal state.  The key to the success of these
%methods lies in combining the use of on-line search to find good
%trajectories specifically from the current state with the use of
%an approximate value function which gives a rough solution to the much
%harder problem of finding good trajectories from all possible states.

%In reinforcement learning it is frequently necessary to resort to an
%approximation to the true optimal value function. Here we investigate
%the benefits of online search in such cases.  We examine ``local''
%searches, where the agent performs a finite-depth lookahead search,
%and ``global'' searches, where the agent performs a search for a trajectory
%all the way from the current state to a goal state.  The key to the
%success of these methods lies in combining the use of a rough solution
%to the hard problem of finding good trajectories from all possible
%states, with an accurate solution to the easier problem of finding a
%good trajectory specifically from the current state.

%scottd: added ``particularly in continuous-state domains''
%jawm: altered to bring the main points right up front
%jawm: here and elsewhere I have also tried to reduce the amount
%      of double-quoting of words.
%jscottd: altered wording so first two sentences can be parsed by mere mortals
%This paper extends the efficiency of computing and the quality of
%reinforcement learning solutions for continuous-state 
%learning of optimal control. We provide new algorithms
%that exploit online search in various ways that
%boost the power of very approximate value functions
%discovered by traditional value function approximation.

In this paper, we describe methods for efficiently computing better
solutions to control problems in continuous state spaces.  We
% ascottd: removed ``new'' to tone down claims of originality
% provide new algorithms that exploit online search to boost the power
provide algorithms that exploit online search to boost the power
of very approximate value functions discovered by traditional
reinforcement learning techniques.  
We examine local
searches, where the agent performs a finite-depth lookahead search,
and global searches, where the agent performs a search for a trajectory
all the way from the current state to a goal state.  

The key to the
success of the local methods lies in taking a value function, which gives
a rough solution to the hard problem of finding good trajectories 
from every single state, and combining that with online search, which 
then gives an accurate solution to the easier problem of finding a
good trajectory {\em specifically} from the current state.

% jawm: New...
The key to the success of the global methods lies in using
aggressive state-space search techniques such as uniform-cost search and
$\astar$, tamed into a tractable form by exploiting neighborhood 
relations and trajectory constraints that arise from continuous-space
dynamic control.
%\end{quote}
\end{abstract}

% NOTE: keywords must be selected from the list supplied by AAAI
%{\small
%\begin{verbatim}
%Content Area: Machine Learning, Reinforcement Learning
%Additional Keywords: search, planning and control
%Tracking Number: A472
%\end{verbatim}
%}

\section{Introduction}
A common approach to Reinforcement Learning involves approximating 
the value function, and then executing the greedy policy with respect to
the learned value function.  

% jscottd: tweaked next sentence
%But particularly in continuous high-dimensional 
%state spaces, it is often the case that even when our agent is 
%given a perfect model of the world, it can be computationally 
%expensive to fit a highly accurate value function. 
However, particularly in high-dimensional continuous state
spaces, it can often be computationally expensive to fit a highly
accurate value function, even when our agent is given a perfect
model of the world.
This problem is 
even worse when the agent is learning a model of the world and is 
repeatedly updating its dynamic programming solution online. What's
to be done?

% ascottd: reworded following sentence to emphasize continuous-state
%  stuff
%In this paper, restricted to deterministic domains, we investigate the
%idea that rather than executing greedy policies with respect to
%approximated value functions, we can use online search techniques 
%to find better trajectories. 
In this paper, we investigate the idea that rather than executing greedy
policies with respect to approximated value functions in continuous-state
domains, we can use online search techniques to find better trajectories.
We restrict our attention to deterministic domains.
The paper consists of a progression of 
improvements to conventional action-selection from value functions, 
along the way using techniques from value function 
approximation~(Davies, 1997)\nocite{dvs:mltd},
%real-time search~\cite{korf,gametree},
real-time search~(Korf, 1990)\nocite{krf:rltm},
%constrained trajectories~\cite{optimalcontrol}, 
constrained trajectories~(Burghes and Graham 1980)\nocite{brgh:intr},
%and robot planning~\cite{latombe,mooreVFAworkshop,booneAstar}.
%jscottd: what is mooreVFAworkshop supposed to refer to --
%   a particular paper presented, or the whole workshop?
and robot motion planning~(Latombe 1991\nocite{ltmb:rbtm}, Boyan {\it et al.} 1995\nocite{vfa95}, Boone 1997\nocite{bn:mnmm}).
All of the algorithms perform search online to find a good trajectory
from some current state.  Briefly, the progression is as follows:

\begin{itemize}
\item {\bf LS: Local Search.}
Takes a forward-dynamics model and an approximate value function,
and performs a limited-depth lookahead search of possible 
trajectories from the current state before suggesting an action.

\item {\bf CLS: Constrained Local Search.}
%jscottd: reworded next sentence
%Does the same job as LS, but is able to search deeper or faster
%by only considering trajectories with few action-changes.
Does a similar job to LS, but considers only trajectories in which
the action is changed infrequently. This results in substantial 
computational savings that allow it to search much deeper or faster.

\item {\bf UGS: Uninformed Global Search.}
For least-cost-to-goal problems, takes a forward-dynamics model 
and plans an approximately-least-cost path from the current state
all the way to the goal, using LS or CLS along with a 
neighborhood-based pruning technique to permit tractable searches
even when they cover large areas of the continuous state space.

\item {\bf IGS: Informed Global Search.}
% jscottd: changed wording
% ascottd: added mention of A*, along with mention of significantly improved
% solution quality
Does a similar job to UGS, but uses an approximate value function 
to guide the search in a manner very similar to $\astar$~(Nilsson 1971)\nocite{nlss:prbl}, thereby vastly reducing the search time and (somewhat
surprisingly) often dramatically improving the solution quality as well.

\item {\bf LLS: Learning Local Search.}
% ascottd: changed wording
%{\em Learns} a forward-dynamics model, and uses that with an approximate 
%value function for the LS and CLS approaches.
{\em Learns} a forward-dynamics model, and uses it to generate approximate 
value functions for the LS and CLS approaches.

\item {\bf LGS: Learning Global Search.}
% ascottd: changed wording to reflect fact that value function is being tweaked
%{\em Learns} a forward-dynamics model, and uses it with an approximate 
%value function for the UGS and IGS approaches.
{\em Learns} a forward-dynamics model, and uses it to generate approximate
value functions for the UGS and IGS approaches.
\end{itemize}

% jawm new:
% jscottd: ``any reinforcement learning algorithm''?  Only works with
%   model-based algorithms.  Added ``model-based''
%In this paper, the approximate value functions are obtained by $k$-dimensional
%simplex interpolation combined with value iteration~(Davies 1997)\nocite{dvs:mltd}, but
%the approaches are applicable for accelerating any reinforcement learning
%algorithm that produces approximate value functions.

% jscottd: I'm also not convinced that this isn't out of place here
In this paper, the approximate value functions are obtained by $k$-dimensional
simplex interpolation combined with value iteration~(Davies 1997)\nocite{dvs:mltd}, but
the approaches are applicable for accelerating any model-based reinforcement learning
algorithm that produces approximate value functions,
% jscottd: moved below clause from a later part of the paper, where
%   redundant talk of simplex interpoltation has been commented out
such as an
LQR solution to a linearized problem or a neural net value function
computed with TD.

With such searches, we perform online computation that is directed towards
finding a good trajectory {\em from the current state}; this is in
contrast to, say, offline learning of a value function, which tries to
solve the much harder problem of learning a good policy for every
point in the state space.
% ascottd: added following sentence
However, the global search algorithms go beyond shallow lookahead
methods and instead use a pruned ``bush'' of search trajectories to find
continuous trajectories all the way to the goal.

% jscottd: Reworded following sentence.
%At the end of the paper, we discuss the application of such search
%techniques to continuous state spaces, and demonstrate that these
%techniques often dramatically improve the quality of solutions found
%at relatively little computation expense.
We apply these search techniques to several continuous-state problems,
and demonstrate that they often dramatically improve the quality of
solutions at relatively little computational expense.

% jawm: If we've got space, it is worth making sure people
%       understand what we mean by a continuous space dynamic
%       control task, and the significance of our state-space
%       diagrams early.

% jscottd: got rid of ``puck'' terminology everywhere
\subsection{{\Large\bf\sc mountain-parking}: An example of a continuous-space
dynamic control task}
\label{mountaincar}
Figure~\ref{hillcar} depicts a car (idealized as a frictionless puck)
on a very steep hill. The car can accelerate forward or backward with a
limited maximum thrust.  The goal is to park the car in a space
near the top of the hill (that is, occupy the space while having a near-zero
velocity).  Because of gravity, there is a region near
the center of the hill at which the maximum forward thrust is not
strong enough to accelerate up the slope. This is depicted on the
two-dimensional diagram in Figure~\ref{phase}. Thus if the goal is at
the top of the slope, a strategy that proceeded by greedily choosing
actions to thrust towards the goal would get stuck. 
% jscottd: added next sentence
Figure~\ref{optimal} shows a sample minimum-time path for one
possible initial state.
This task,
although trivial to solve by dynamic programming on a very fine grid,
will be used as an illustration during the exposition because its
state space can be drawn as a two-dimensional diagram. In the
% jscottd: the section reference wasn't working, so I just did it by hand
Experiments section we will see empirical results for problems that
would be intractably expensive for dynamic programming on a fine grid.

%\normalfig{hillcar}{A
%frictionless puck acted on by gravity and a horizontal thruster. The
%puck must get to the goal as quickly as possible. There are bounds on
%the maximum thrust.}

\normalfig{hillcar}{A car acted on by gravity and limited forward/backward
thrust.
The car must park in the goal area as quickly as possible.}

\normalfig{phase}{The state transition function for a car
constantly thrusting to the right with maximum thrust. A point on the diagram
represents a state of the car. Horizontal position denotes the physical
car position; vertical position denotes the car's velocity.}

\normalfig{optimal}{A minimum-time path
 for the car on the hill. The optimal value function is shown by
dots. The shorter the time to goal, the larger the black
dot.}

\section{LS: Local Search}
\label{section-localsearch}

Given a value function, agents typically execute a greedy
policy using a one-step lookahead search, possibly using a learned
model for the lookahead. The computational cost per step of this scheme
is $O(|A|)$, where $A$ is the set of actions. This can be thought of as 
performing a depth 1 search for the 1-step trajectory $T$ that 
gives the highest $R_T + \gamma V(s_T)$, where $R_T$ is the reinforcement, 
$s_T$ is the state (possibly according to a learned world model) reached upon 
executing $T$, and $\gamma$ is the discount factor. 
A natural extension is then to perform a search of depth $d$, 
to find the trajectory that maximizes $R_T + \gamma^d V(s_T)$,
where discounting is incorporated in the natural way into $R_T$. The 
computational expense is $O(d |A|^d)$.
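As a concrete sketch, the depth-$d$ lookahead can be written as a brute-force enumeration of the $|A|^d$ trajectories (a minimal illustration only; the names {\tt step}, {\tt reward}, and {\tt V} are our own, standing in for the forward-dynamics model, one-step reinforcement, and approximate value function):

```python
import itertools

def local_search(s0, actions, step, reward, V, gamma, d):
    """Depth-d lookahead: return the first action of the trajectory T
    maximizing R_T + gamma^d V(s_T), where step(s, a) is the (possibly
    learned) deterministic forward model and reward(s, a) the one-step
    reinforcement.  Illustrative sketch, not the paper's code."""
    best_score, best_first = float('-inf'), None
    for T in itertools.product(actions, repeat=d):   # all |A|^d trajectories
        s, R, disc = s0, 0.0, 1.0
        for a in T:                                  # roll the model forward
            R += disc * reward(s, a)                 # discounting inside R_T
            s = step(s, a)
            disc *= gamma
        score = R + disc * V(s)                      # disc == gamma^d here
        if score > best_score:
            best_score, best_first = score, T[0]
    return best_first
```

In use, the agent would call this once per step, execute the returned action, and search again from the resulting state.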

% jscottd: following material had been in the CLS section, which
%   seemed wrong.  

During execution, the LS algorithm iteratively
finds the best trajectory $T$ of length $d$ with the search algorithm above, 
executes the first action on that trajectory, and then does a 
new search from the resulting state.  
If $B$ is the ``parallel backup operator''~(Bertsekas 1995)\nocite{bersekas95:dpaoc1} so
that $BV(s) = \max_{a \in A} R(s,a) + \gamma V(\delta(s,a))$, then executing
the full $|A|^d$ search is formally equivalent to executing the greedy
policy with respect to the value function $B^{d-1}V$. Noting that,
under mild regularity assumptions, as $k \rightarrow \infty$, $B^kV$
becomes the optimal value function, we can generally expect $B^{d-1}V$
to be a better value function than $V$. For example, in discounted 
problems, if the largest absolute error in $V$ is $\varepsilon$, the 
largest absolute error in $B^{d-1}V$ is $\gamma^{d-1}\varepsilon$. 
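The $\gamma^{d-1}\varepsilon$ bound follows in one line from the standard fact that $B$ is a $\gamma$-contraction under the sup norm whose fixed point is the optimal value function $V^{\ast}$ (a sketch, not spelled out in the text above):

```latex
% Since BV^* = V^* and \|BV - BV'\|_\infty \le \gamma\,\|V - V'\|_\infty:
\[
\|B^{d-1}V - V^{\ast}\|_{\infty}
   = \|B^{d-1}V - B^{d-1}V^{\ast}\|_{\infty}
   \le \gamma^{d-1}\,\|V - V^{\ast}\|_{\infty}
   \le \gamma^{d-1}\varepsilon .
\]
```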

This approach, a form of receding horizon control, has most famously 
been applied to minimax game playing programs~(Russell and Norvig 1995)\nocite{rssl:artf} and 
% jscottd: added ``systems''
has also been used in single-agent systems on 
% jawm: removed "small"
discrete domains (e.g. (Korf 1990)\nocite{krf:rltm}).
% ascottd: added following sentence
In game-playing scenarios it has also been used in conjunction with
automatically learned value functions, such as in Samuel's celebrated
checkers program~(Samuel 1959)\nocite{sml:smst0} and Tesauro's backgammon
player~(Tesauro and Galperin, 1997)\nocite{tsr:olpiusmc}.


% jawm: added new section heading
\section{CLS: Constrained Local Search}

To make deeper searches computationally cheaper, we might 
consider only a subset of all possible trajectories of depth $d$. 
Especially for dynamic control, often an optimal trajectory repeatedly 
selects and then {\em holds} a certain action for some time, such as
suggested by~(Boone 1997)\nocite{bn:mnmm}.
Therefore, a natural subset of the $|A|^d$ possible trajectories
are trajectories that switch their actions rarely.
When we constrain the number of switches between actions
to be at most $s$, the time for such a search is
$O\!\left( d {d \choose s} |A|^{s+1}\right)$---considerably
cheaper than a full search if $s \ll d$.
(The sequence $a_1,a_1,a_2,a_2,a_2$ has one ``switch,''
whereas $a_1,a_2,a_2,a_2,a_1$ has two ``switches.'')
%ayn UNCOMMENTED next 4 lines
We also suggest that $s$ is easily chosen for a particular domain by 
an expert, by asking how often action switches can reasonably be 
expected in an optimal trajectory, and then picking $s$ accordingly 
to allow an appropriate number of switches in a trajectory of length $d$. 
Figure~\ref{localsearch} shows CLS performed in the
{\mountaincar} task using $d=20$ and $s=1$. 
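One way to enumerate only this restricted trajectory set directly (rather than filtering all $|A|^d$ sequences) is to choose the switch positions first and then an action for each constant segment. A hedged sketch, with names of our own invention:

```python
import itertools

def constrained_trajectories(actions, d, s):
    """Yield every action sequence of length d containing at most s
    switches (a switch = two consecutive steps whose actions differ).
    Illustrative sketch of the CLS trajectory set, not the paper's code."""
    for k in range(s + 1):                            # exactly k switches
        for cuts in itertools.combinations(range(1, d), k):
            bounds = (0,) + cuts + (d,)               # segment boundaries
            for acts in itertools.product(actions, repeat=k + 1):
                # adjacent segments must use different actions, or the
                # sequence would really have fewer than k switches
                if any(a == b for a, b in zip(acts, acts[1:])):
                    continue
                traj = []
                for i, a in enumerate(acts):
                    traj += [a] * (bounds[i + 1] - bounds[i])
                yield tuple(traj)
```

Each sequence with exactly $k$ switches is generated once, giving $\sum_{k=0}^{s} {d-1 \choose k} |A| (|A|-1)^k$ candidates in total.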

% jscottd: added next sentence.  Is this OK?
Since LS is the same as CLS with the maximum-number-of-switches
parameter $s$ set to $d-1$, we may use ``LS'' or ``local search'' to refer
generically to both CLS and LS at certain points throughout the rest
of the paper.


% jscottd: munged this to use the same style as the first figures
%\begin{figure}[h]
%\hspace*{-0.5in}
%\parbox{2.80in}
%{
%\centerline{\psfig{figure=localSearch.ps,width=2.70in,height=2.25in}}
%\refstepcounter{my-figures-ctr}
%  \label{fig-example-localsearch}
% \centerline{\parbox{2.55in} 
%	{%\footnotesize 
%	Figure~\ref{fig-example-localsearch}: Constrained Local Search (CLS) example: a twenty-step search with at most one switch in actions}}
%}
%\nolinebreak
%\parbox{0.20in}
%{ \hbox to 0.01in {} }
%\nolinebreak
%\smallskip
%\normalfig{fig-example-localsearch}{Constrained Local Search (CLS) example: a twenty-step search with at most one switch in actions}
\normalfig{localsearch}{Constrained Local Search (CLS) example: a twenty-step search with at most one switch in actions}

% jawm: So we're not using LaTeX's citation capabilities. What's that
%       about?

%[ayn: Would this section be clearer if written in terms of search over sequences 
%of actions rather than trajectories?]
%%%awm I like it the way you already have it

%Planning Time:
%Develop an approximate value function by dynamic programming. In our
%examples our function approximator is ????~\ite{davies}. 

\section{UGS: Uninformed Global Search}

% jscottd: changed wording a tad since we're not really ``improving
%   the value function''
%Local searches (LS and CLS) are not the only way to improve an
%approximate value function. Here, we describe global search for
Local searches (LS and CLS) are not the only way to more effectively
use an approximated value function.  Here, we describe global search for
solving least-cost-to-goal problems in continuous state spaces with
non-negative costs.  We assume the set of goal states is known.

%\subsection{Uninformed Global Search}

%Here, we also use local searches to maintain a position on a
%trajectory and to increase the power of the high level searcher.

Why not continue growing a search tree
until it finds a goal state? The answer is clear---the combinatorial
explosion would be devastating.  In order to deal with this problem,
we borrow a technique from robot motion
planning~(Latombe 1991)\nocite{ltmb:rbtm}.
%scottd: changed, then I changed my mind and changed it back
We first divide the state space up into a fine uniform grid. 
% jscottd: following sentence was moved from IGS section
A sparse representation is used so that only grid cells that
are visited take up memory\footnote{This has a flavor not dissimilar
to the hashed sparse coarse encodings of (Sutton
1996)\nocite{sutton96:girl}.}.
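The sparse representation can be as simple as a hash map keyed by cell-index tuples, so that memory grows only with the number of cells actually visited. A minimal sketch, assuming axis-aligned cells of uniform width (all names are our own):

```python
def cell_of(state, lower, width):
    """Map a continuous state (tuple of floats) to the index tuple of
    the uniform grid cell containing it; `lower` gives each axis's
    lower bound and `width` the cell side length.  (Illustrative
    helper, not the paper's code.)"""
    return tuple(int((x - lo) // width) for x, lo in zip(state, lower))

# Only visited cells occupy memory: a dict from cell index to the
# cell's "representative state".
representatives = {}
s = (0.37, -0.52)
representatives.setdefault(cell_of(s, (-1.0, -1.0), 0.1), s)
```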


%We first divide the state space up into partitions --- in this
%paper, we use a fine uniform grid.
A local search procedure (LS or CLS)
is then used to find paths from one grid element to
%%%AWM This idea is unique I believe. There should be at least one experiment
%%%AWM demoing that this works.
% jawm: Have we done the above?
% jscottd: I just now (1/21/98) ran some experiments in which s 
%   (max # action switches) was reduced from 1 to 0 on the
%   slider and acrobot domains.  It made no significant difference 
%   in terms of solution quality :-(
another.  Multiple trajectories entering the same grid element are
pruned, keeping only the least-cost trajectory into that grid element
(breaking ties arbitrarily).  The point at which this least-cost trajectory
first enters a grid element is used as the grid element's ``representative
state,'' and acts as the starting point for the local search.  The rationale for the pruning is an
assumed similarity among points in the same grid element. In
this manner, the algorithm attempts to build a complete trajectory to
the goal using the learned or provided world model.  
%[Move this down to the next section!]
%It is worth noting that
%Uninformed Global Search is equivalent to Informed Global Search
%(described below) using a constant 0 heuristic evaluation function.
%scottd UNCOMMENTED
When the planner
finds a trajectory to the goal, it is executed in its entirety.

%scottd ADDED next sentence
The overall procedure is essentially a lowest-cost-first search
over a graph structure in which the graph nodes correspond to grid
elements, and in which the edges between graph nodes correspond to
trajectories between grid elements as found by the CLS procedure.
A graph showing such a search for the
{\mountaincar} domain is depicted in
Figure~\ref{uninformed}. 

% jscottd: canonicalized figure
%\hspace*{-0.35in}
%\smallskip
%\parbox{2.8in}
%{
%\centerline{\psfig{figure=noVFGlobalSearch.grid.ps,width=2.70in,height=2.25in}}
%\refstepcounter{my-figures-ctr}
%  \label{fig-example-uninformed-global}
%  \centerline{\parbox{2.55in}
%	{%\footnotesize 
%	Figure~\ref{fig-example-uninformed-global}:
%	Uninformed Global Search (UGS) example.
%	Velocity on $x$-axis, car position on $y$ axis.  Large black dot
%	is starting state; the small dots are grid
%	elements' ``representative states.'' }}
%}
%\smallskip

\normalfig{uninformed}
{Uninformed Global Search (UGS) example.
	Velocity on $x$-axis, car position on $y$-axis.  Large black dot
	is starting state; the small dots are grid
	elements' ``representative states.'' }


%%%awm The below paragraph is now redundant and can be eliminated provided
%%%awm hng:plnn is given credit somewhere
%This algorithm was inspired by the non-holonomic path planner
%of~\cite{ltmb:rbtm}, where instead of searching through a dynamic
%statespace, the authors search configuration space using
%distance-to-goal as a heuristic.  Similar techniques were employed
%by~\cite{bn:mnmm}, which discusses this and other search approaches
%for the Acrobot in more detail, and~\cite{hng:plnn}.

%scottd RERECOMMENTED, DAMMIT :)
%  (First sentence uncommented elsewhere; other stuff moved to the
%  ``learning a model online'' section)
%ayn REUNCOMMENTED
%scottd RECOMMENTED
%ayn UNCOMMENTED
%When the planner finds a trajectory to the goal, it is executed in its
%entirety. But in the case where we are learning a model of the world,
%it is possible for the agent to successfully plan a continuous 
%trajectory using the learned world model, but to fail to reach the 
%goal when it 
%tries to follow the planned trajectory.  In this case, failure to follow 
%the successfully planned trajectory can directly be attributed to
%inaccuracies in the agent's model; and in executing the path anyway, the
%agent will naturally reach the area where the actual trajectory
%diverges from the predicted/planned trajectory and thereby improve its
%model of the world in that area.

%Using these trajectories can significantly improve the agent's online performance.
%Describe how the planning vs execution aspect of this is organized.
%Mention we don't need an approximate eval function at all. But we do
%need a goal or some criterion for stopping the search.

\section{IGS: Informed Global Search}

We can modify Uninformed Global Search (UGS) by using an approximated
value function to {\em guide} the search expansions in the style of
$\astar$ search~(Nilsson 1971)\nocite{nlss:prbl}, as written out in
% ascottd: should we remove Nilsson ref. now that it's been mentioned 
% in intro?
detail below.  The search proceeds from the most promising-looking states
first, where the ``promise'' of a state is the cost to get to the state
(along previously searched trajectories) plus the remaining-cost-to-go
as estimated with the value function.
% from grid element to grid element. 
With the perfect value function, this causes the search to traverse 
exactly the optimal path to the goal; with only an approximation to 
the value function, it can still dramatically reduce the fraction of 
the state space that is searched. 
%One other
%modification is that in the pruning of multiple trajectories entering
%the same grid element, we now keep the trajectory that maximizes $R_T
%+ V(s_T)$ (unlike Uninformed Global Search, which does not use an
%$V(s_T)$ term,) and this finer pruning can cause better trajectories
%than Uninformed Global Search to be found.
%%%AWM Is there any chance of an experiment demoing the above point?
%%%ayn: Yup. Done by Scott in Section 5.

% jscottd: moved next sentence to UGS section
%The search uses a sparse representation so that only grid cells that
%are visited take up memory\footnote{This has a flavor not dissimilar
%to the hashed sparse coarse encodings of (Sutton
%1996).~\nocite{sutton96:girl}}.  
As in UGS, the grid is represented sparsely.
Notice also that like LS and CLS, we
are performing online computation in the sense that we are performing
a search only when we know the ``current state,'' and to find a
trajectory specifically from the current state;
%rather than 
this is 
in contrast to offline computation for finding a value
function, which tries to solve the much more difficult
problem of finding a good trajectory to the goal from every single point in
the state space. 

Written out in full, the search algorithm is:

%\begin{itemize}
%\item Let $V'(s)$ be an approximate value function for the state space $s
%\in S$
%\item Let $G$ be a sparse array of grid elements; let $G(s)$ represent the
%grid element containing a state $s$.  
%\item Let $P$ be a priority queue of grid elements in $G$.
%\item Initialize $P$ to contain $g(s_0)$, where s_0 is the current state
%\item Until we find a goal state, or $P$ is empty:
%\begin{itemize}
%\item Let $g(s)$
%\end{itemize}
%\end{itemize}

{ \small
\begin{enumerate}
\item Suppose $g(s_0)$ is the grid element containing the current state $s_0$.
Set $g(s_0)$'s ``representative state'' to be $s_0$, and add $g(s_0)$ to
a priority queue $P$ with priority $V(s_0)$, where $V$ is an
approximated value function. 
\item Until a goal state has been found, or $P$ is empty:
\begin{itemize}
\item Remove a grid element $g$ from the top of $P$. Suppose $s$ is $g$'s
``representative state.''  
% jscottd: reference wasn't working so I fixed it by hand
\item Starting from $s$, perform LS or CLS as described in the
Local Search section, except search trajectories are pruned
once they reach a state in a different grid element $g'$.
%it reaches a state $s'$ in some other grid element $g' \neq g$, we cut 
%the search off at $s'$.
If $g'$ has not been visited before, add $g'$ to $P$ with a priority
$p(g') = R_T(s_0, \ldots, s')+ \gamma^{|T|} V(s')$, where $R_T$ is the reward accumulated
along the recorded trajectory $T$ from $s_0$ to $s'$, and set $g'$'s
``representative state'' to $s'$.
Similarly, if $g'$ has been visited before, but 
$p(g') \leq R_T(s_0, \ldots, s') + \gamma^{|T|} V(s')$,
then update $p(g')$ to the latter quantity and set $g'$'s
``representative state'' to $s'$.  Either way, if $g'$'s 
``representative state'' was set to $s'$, record the sequence 
of actions required to get from $s$ to $s'$, and set $s'$'s 
predecessor to $s$.
\end{itemize}
\item If a goal state has been found, execute the trajectory.
Otherwise, the search has failed, because our
grid was too coarse, our state-transition model was inaccurate,
or the problem was insoluble.
\end{enumerate}}
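In code, the loop above is close to a textbook best-first search over grid cells. The sketch below simplifies to undiscounted costs and a min-priority queue, so the value function plays the role of estimated cost-to-go, and a heuristic of zero everywhere recovers UGS. All function names are our own illustration, not the paper's implementation:

```python
import heapq

def global_search(s0, h, local_succ, cell_of, is_goal):
    """Best-first IGS sketch for least-cost-to-goal problems.
    local_succ(s) yields (s2, cost, acts): the entry state, trajectory
    cost, and action sequence of each local-search trajectory leaving
    s's grid cell.  h(s) estimates the remaining cost-to-go (h = 0
    everywhere gives UGS).  Illustrative, not the paper's code."""
    rep = {cell_of(s0): (0.0, s0, [])}   # cell -> (cost so far, rep state, plan)
    pq = [(h(s0), cell_of(s0))]          # priority = cost so far + cost-to-go
    while pq:
        _, g = heapq.heappop(pq)
        cost, s, plan = rep[g]           # always read the latest entry
        if is_goal(s):
            return plan                  # executed in its entirety
        for s2, c, acts in local_succ(s):
            g2 = cell_of(s2)
            # keep only the cheapest trajectory into each grid cell
            if g2 != g and (g2 not in rep or cost + c < rep[g2][0]):
                rep[g2] = (cost + c, s2, plan + acts)
                heapq.heappush(pq, (cost + c + h(s2), g2))
    return None                          # grid too coarse or goal unreachable
```

Stale priority-queue entries are handled lazily: a cell may be pushed more than once, but each pop consults the current best entry in `rep`.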

%scottd ADDED
The above procedure is very similar to a standard $\astar$ search,
with two important differences.  First, the heuristic function used
here is an automatically generated approximate value
function
% jscottd: footnote removed since the simplex stuff is mentioned
%    elsewhere
%\footnote{Obtained from dynamic programming using an
%approximate value function interpolated from tessellated
%simplices~(Davies 97)\nocite{dvs:mltd}.} 
rather than a hand-coded heuristic.  This
approach is more autonomous, requiring relatively little
hand-encoding of domain-specific knowledge.  On the
other hand, the heuristic function used here may
overestimate the cost required to get from some
points to the goal, which can lead to suboptimal solutions
-- that is, the approximated value function is not necessarily an {\em
optimistic} or {\em admissible} heuristic~(Russell and Norvig 1995)\nocite{rssl:artf}.  {\em However, within the
context of the search procedure used above, inadmissible but
relatively accurate value functions can lead to much better solutions
than those found with optimistic but inaccurate heuristics.}  (Note
that UGS is essentially IGS with an optimistic but inaccurate
heuristic ``value function'' of 0 everywhere.)
This is due to the second important difference between our search
procedure above and a standard $\astar$ search in a discrete-state
domain: IGS uses the approximated value function not only to decide
what grid element to search from next, but also from what particular
point in that grid element it will search for local trajectories to
neighboring grid elements.
%scottd: I'm not sure about this next sentence.  
% jscottd: took it back out again
%Overall, IGS might be looked at as a
%combination of (1) ``classic AI'' $\astar$ search; (2)
%value function approximation techniques normally used in reinforcement
%learning, and (3) trajectory-based search techniques taken from robot
%motion planning.
% ascottd: added comparison to Chris's work

The above algorithm is also similar in spirit to algorithms presented
in~(Atkeson 1994)\nocite{atks:usng}.  Atkeson's algorithms also
found continuous trajectories from start states to goals.  The search
for such trajectories was performed either within the context of a
regular grid (as in the algorithm above) or with a pair of constant-cost
contours gradually grown out from the start and goal states.  Our
algorithms differ from Atkeson's in that they work with a
small set of discrete actions and can handle some discontinuities in
the dynamics, whereas Atkeson's algorithms require smooth dynamics
(continuous first and second derivatives) with continuous actions.
Unlike Atkeson's work, our algorithm does not yet locally optimize the
trajectories found by our search algorithms.  However, also unlike
Atkeson's work, we first compute a crude but quick approximation to
the value function (except in the case of uninformed global search),
and using this approximate value function speeds up the search
considerably.

An example of IGS on the
{\mountaincar} domain is shown in
Figure~\ref{crappyvfglobal}.  The value function was
approximated with a simplex-based
interpolation~(Davies 1997)\nocite{dvs:mltd} on a coarse 7 by 7 grid,
with all other parameters the same as in
Figure~\ref{uninformed}.  Much less of state
space is searched than by UGS.

% jscottd: Canonicalized figure
%\parbox{2.8in}
%{
%\centerline{\psfig{figure=crappyVFGlobalSearch.grid.ps,width=2.70in,height=2.25in}}
%\refstepcounter{my-figures-ctr}
%  \label{fig-example-informed-global}
% \centerline{\parbox{2.55in} 
%	{%\footnotesize 
%	Figure~\ref{fig-example-informed-global}:
%		Informed Global Search (IGS) example
%		 on {\mountaincar}, with a crudely approximated value function.}}
%}
%%\end{figure}
%\smallskip
\normalfig{crappyvfglobal}{Informed Global Search (IGS) example
		 on {\mountaincar}, with a crudely approximated value
	 function.}


%\parbox{2.8in}
%{
%\centerline{\psfig{figure=goodVFGlobalSearch.grid.ps,width=2.70in,height=2.25in}}
%\refstepcounter{my-figures-ctr}
%  \label{fig-example-informed-global2}
% \centerline{\parbox{2.55in} 
%	{%\footnotesize 
%	Figure~\ref{fig-example-informed-global2}:
%		Informed Global Search (IGS) example
%		 on {\mountaincar}, with a more accurately approximated value function.}}
%}
%%\end{figure}
%\smallskip
\normalfig{goodvfglobal}{Informed Global Search (IGS) example
 on {\mountaincar}, with a more accurately approximated value function.}

In Figure~\ref{goodvfglobal}, the value function was
approximated more accurately with a simplex-based interpolation on a
21 by 21 grid.  With this accurate a value function, the search
goes straight down a near-optimal path to the goal.
Naturally, in such a situation the search is actually
unnecessary, since merely greedily following the approximated value
function would have produced the same solution.  
% jscottd: commented out stuff below
%Because we can easily
%afford to compute an approximated value function with a 21 by 21 grid,
%the search techniques discussed here are not necessary in practice for
%a domain as simple as the {\mountaincar} domain; we only use the
%{\mountaincar} example here since it is two-dimensional and thus
%easily used for illustrations.  
However, when we move to
higher-dimensional problems, such as problems examined in the next
section, high-resolution approximated value functions become
prohibitively expensive to calculate, and IGS
can be a very cost-effective way of improving performance.
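The greedy alternative mentioned above is simple to state precisely.  A
minimal sketch, assuming a one-step model interface that is a placeholder
rather than our actual implementation:

```python
def greedy_action(s, V, actions, model, gamma=1.0):
    """One-step greedy policy with respect to an approximated value function V.

    model(s, a) -> (next_state, reward) is an assumed interface; the policy
    simply picks the action maximizing immediate reward plus discounted value.
    """
    def score(a):
        s2, r = model(s, a)
        return r + gamma * V(s2)
    return max(actions, key=score)
```

With a sufficiently accurate value function this policy traces the same
near-optimal path that the search would find, which is why search adds
little in that regime.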

\section{Experiments}

We tested our algorithms on the following 
domains\footnote{C code for all 4 domains (implemented with numerical 
integration and smooth dynamics) will shortly be made available on the 
Web.}:

%{\small
\begin{itemize}
\item {\mountaincar} (2 dimensional): As described in 
        the Introduction. This is slightly more difficult
	than the normal mountain-car problem, as we require a velocity near 0
	at the top of the hill~(Moore and Atkeson
	1995).\nocite{mr:thpr2} State consists of $x$-position and
	velocity. Actions are accelerate forward or backward.
\item {\acrobot} (4 dimensional): An acrobot is a two-link planar robot acting
	in the vertical plane under gravity with only one weak actuator at its elbow joint.
	% and an unactuated shoulder. 
	The goal is to raise the hand at least one link's height above 
	the shoulder~(Sutton 1997).\nocite{sutton96:girl}
	State consists of joint angles and angular velocities at the 
	shoulder and elbow. Actions are positive or negative torque.
% jscottd: added following reference
\item {\cartpole} (4 dimensional): A cart-and-pole system~(Barto {\it et al.} 1983)\nocite{brt:nrnl} starting
	with the pole upright is to be moved some distance to a goal
	state, keeping the pole upright (harder than the stabilization
	problem). 
The trial terminates with a huge penalty ($-10^6$) if the
pole falls over.
        State consists of the cart position and velocity, and the pole 
	angle and angular velocity.  Actions are accelerate left or right.
\item {\slider} (4 dimensional): Like a two-dimensional mountain car, where a
	``slider'' has to reach a goal region in a two-dimensional terrain.
	The terrain's contours are shown in Figure~\ref{beta}.
	State is two-dimensional position and two-dimensional velocity.
	Actions are acceleration in the NE, NW, SW, or SE directions. 
\end{itemize}
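As an illustration of the first domain, the following sketch uses dynamics in
the style of the standard mountain-car formulation; the constants are
conventional illustrative choices, not necessarily the exact values used in
our experiments:

```python
from math import cos

# Illustrative constants in the style of the standard mountain-car task
X_MIN, X_MAX = -1.2, 0.6
V_MIN, V_MAX = -0.07, 0.07
GOAL_X, GOAL_VTOL = 0.5, 0.02    # "parking": require near-zero velocity at the top

def step(x, v, a):
    """One simulation step; action a is +1 (accelerate forward) or -1 (backward)."""
    v = min(max(v + 0.001 * a - 0.0025 * cos(3 * x), V_MIN), V_MAX)
    x = x + v
    if x <= X_MIN:
        x, v = X_MIN, 0.0        # inelastic collision with the left wall
    elif x >= X_MAX:
        x = X_MAX
    return x, v

def is_goal(x, v):
    """Goal test: at the top of the hill AND nearly stopped."""
    return x >= GOAL_X and abs(v) <= GOAL_VTOL
```

Under these dynamics the engine is weaker than gravity, so full forward
throttle from rest at the bottom never reaches the goal; the car must first
back up the left slope.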
%}

%\parbox{2.80in}
%{
%\centerline{\psfig{figure=beta.ps,width=2.70in,height=2.25in}}
%\refstepcounter{my-figures-ctr}
%  \label{fig-slider-beta}
%  \centerline{\parbox{2.55in} {%\footnotesize 
%	Figure~\ref{fig-slider-beta}:
%		{\slider}'s terrain. Goal at upper left.}}
%}
%%\nolinebreak
%\smallskip
\normalfig{beta}{{\slider}'s terrain.  Goal at upper left.}

\begin{table*}
\centerline{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|c|} \hline
$d$ & 1 & 2 & 3 & 4 & 5 & 6 & 7 & 8 & 9 & 10 \\ \hline \hline
cost & 49994& 42696& 31666& 14386& 10339& 27766& 11679& 8037& 9268& 10169 \\ \hline
time &0.66& 0.64& 1.24& 1.02& 1.13& 2.07& 3.32& 3.84& 7.30& 15.50  \\ \hline
%time &0.657& 0.637& 1.244& 1.021& 1.127& 2.065& 3.316& 3.839& 7.301& 15.496 \\ \hline
\end{tabular}}
\caption{Local search (LS) on {\cartpole}}
	\label{table-ls-cartpole}
\end{table*}

\begin{table*}
\centerline{
\begin{tabular}{|l|c|c|c|c|c|c|c|c|c|} \hline
$d$  & 1 & 2 & 3 & 4 & 6 & 8 & 12 & 16 & 24  \\ \hline \hline
cost & 187 & 180 & 188 & 161 & 140 & 133 & 133 & 134 & 112 \\ \hline
%time & 0.023 & 0.047 & 0.101 & 0.158 & 0.356 & 0.701 & 2.078 & 4.624 & 12.435 \\ \hline
time & 0.02 & 0.05 & 0.10 & 0.16 & 0.36 & 0.70 & 2.08 & 4.62 & 12.44 \\ \hline
\end{tabular}}
\caption{Constrained Local search (CLS) on {\mountaincar}}
	\label{table-ls-mountaincar}
\end{table*}


All four are undiscounted tasks. {\cartpole}'s cost on each step
is quadratic in the distance to the
goal. The other three domains cost a constant $-1$ per step.
All results are averages of
1000 trials with a start state chosen uniformly at random
in the state space, with the exception of the \cartpole, in which
only the pole's initial distance from its goal configuration is varied.
%%%awm comment that this too is harder than usual formulations of
%%%awm acrobot, car-on-hill, carpole?
% ayn: I think this actually makes it easier for acrobot or caronhill
%  (for most offline grid-base VI DP methods at least. Maybe not Sutton's' 
%   stuff.)
% jscottd: commented following sentence out, since it's been mentioned before
%The algorithm we
%used to learn a value function is the simplex-interpolation algorithm
%described in~(Davies 1997).\nocite{dvs:mltd} But note that
%the choice of a function approximator is orthogonal to the
%search techniques we describe here, which can readily be applied to
%any other value function computed by any other algorithm, such as an
%LQR solution to a linearized problem or a neural net value function
%computed with TD.

For now, we consider only the case where we are given a model of the
world, and leave the model-learning case to the next section.
% ascottd: added following sentence
In this case, the value functions used during search (except by
Uninformed Global Search) are calculated using the
simplex-interpolation algorithm described
in~(Davies 1997)\nocite{dvs:mltd}; once generated, they
need not be updated during the search process.
 
\subsection{Local Search}

%Here, we look at the effects of different parameter settings for Local
%Search.  Consider {\cartpole} with no restriction on the
%number of switches (setting $s = d-1$).
%The approximate value function was found using a four-dimensional
%simplex-interpolation grid with quantization $13^4$, which is about 
%the finest resolution simplex-grid that we could reasonably afford
%to use. As we
%increase the depth of the search from 1 (greedy policy with respect to
%$V$) up to 10 (greedy policy with respect to $B^9V$), we see that
%performance is significantly improved.
%Mean CPU time per complete trial, in seconds, on a 100MHz HP is also shown.

Here, we look at the effects of different parameter settings for Local
Search.  We first consider {\cartpole}. Empirically, good trajectories
in this domain ``switch'' actions very often; therefore, we chose
not to assume much ``action-holding,'' and set $s=d-1$.  The
approximate value function was found using a four-dimensional
simplex-interpolation grid with quantization $13^4$, which is about
the finest resolution simplex-grid that we could reasonably afford to
use. 
% ascottd: added following sentence
(Calculating the approximate value function even with this seemingly low 
resolution can take minutes of CPU time and most of the system's memory.)
See Table~\ref{table-ls-cartpole}; as we increase the depth of
the search from 1 (greedy policy with respect to $V$) up to 10 (greedy
policy with respect to $B^9V$), we see that performance is
significantly improved, but with CPU time per trial (on a 100MHz HP
C300 9000, given in seconds) increasing exponentially.
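The depth- and switch-limited search just described can be sketched as
follows.  The model and value-function interfaces are assumed placeholders;
at depth $d$ the returned action is greedy with respect to $B^{d-1}V$:

```python
def local_search(state, V, actions, model, d, s_max, gamma=1.0):
    """Depth-d local search with at most s_max action switches (CLS).

    Setting s_max = d - 1 recovers unconstrained LS.  Assumed interfaces:
    model(s, a) -> (next_state, reward), V(s) -> approximate value.
    Returns (best backed-up value, first action of the best sequence).
    """
    def rec(s, depth, last_a, switches):
        if depth == d:
            return V(s), None
        best, best_a = float('-inf'), None
        for a in actions:
            sw = switches + (1 if last_a is not None and a != last_a else 0)
            if sw > s_max:
                continue            # switch budget exhausted: prune this branch
            s2, r = model(s, a)
            v2, _ = rec(s2, depth + 1, a, sw)
            val = r + gamma * v2
            if val > best:
                best, best_a = val, a
        return best, best_a
    return rec(state, 0, None, 0)
```

The switch constraint is what keeps the running time from growing as
$|A|^d$: with small $s_{max}$ most of the $|A|^d$ action sequences are
pruned immediately.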

The next experiment we consider here is {\mountaincar} on a coarse
($7^2$) grid.  
Empirically, entire trajectories (of $> 100$ steps) to the goal can 
often be executed with 2 or 3 action switches, and the optimal trajectory 
to the goal from the bottom of the hill at rest requires only 
about 3 switches. Thus, for the depth of searches we performed, 
we very conservatively chose $s=2$.
In Table~\ref{table-ls-mountaincar}, our
experimental results again show solution quality significantly
increased by Local Search, but with running times growing much more slowly
with $d$ than before.

%scottd: section name tweaked
%\subsection{Experimental Results}
\subsection{Comparative Experiments}
\label{results}

%The table below shows the average cost incurred per trial, as well as
%the average CPU time required per trial, for each algorithm in each of
%the four domains.  The average number of local searches (\#LS) per trial
%performed by the global search algorithms is also provided.  

Table~\ref{table-results-summary} summarizes our experimental
results\footnote{The parameters for the 4 domains were, in order:
value function simplex-interpolation grid resolution: $7^2, 13^4, 13^4, 13^4$;
Local Search: $d=6,s=2, d=5,s=4, d=5,s=4, d=10,s=1$; Global Search
Grid resolution: $50^2, 50^4, 50^4, 20^4$; Local search within Global
search: $d=20,s=1$ for all 4.}.  {\em cost} is average cost per trial,
{\em time} is average CPU seconds per trial, and {\em \#LS} is the
average number of local searches performed by the global search
algorithms (which indicates the amount of state space considered).

\begin{table*}
\centerline{
\begin{tabular}{|l|c|c||c|c||c|c|c||c|c|c|} \hline
& \multicolumn{2}{c||}{No Search} & \multicolumn{2}{c||}{Local Search} & 
		\multicolumn{3}{c||}{Uninformed Global} & \multicolumn{3}{c|}{Informed Global} \\ 
& \bf cost & time & \bf cost & time & \bf cost & \#LS & time & \bf cost & \#LS & time \\ \hline \hline
\mountaincar & \bf 187 & 0.02 & \bf 140 & 0.36 & \bf FAIL      & -- & -- & \bf 151              & 259 & 0.14 \\ \hline 
%             &     &     & & & ($50^2$ grid) &   &   & ($50^2$ grid) & & \\ \hline
%%             &     &     & & & Search grid: $100^2$ & & & Search grid: $100^2$ & &   \\
%             &     &     & & & 109 & 3870 & .82 & 138 & .29 & 1150  \\ \hline \hline
\acrobot & \bf 454 & 0.10 & \bf 305  & 1.2 & \bf 407 & 14250 & 5.8 & \bf 198 & 914 & 0.47  \\  \hline 
%         &     &     & ($s=4,d=5$)        & & ($50^4$ grid) & & & ($50^4$  grid) & & \\ \hline 
\cartpole & \bf 49993 & 0.66 & \bf 10339 & 1.13 &  \bf 3164           & 7605 & 3.45 & \bf 5073               & 1072 & 0.64 \\ \hline
%          & & & $s=s,d=d$  & &  ($50^4$ grid)     & & & ($50^4$ grid)           & &  \\ \hline \hline
\slider    & \bf 212 & 1.9 & \bf 197         & 51.72 & \bf 104          & 23690 & 94 & \bf 54          & 533 & 2.0 \\  \hline 
%           &    &      & ($s=s,d=d$) & & ($20^4$ grid) &       &    & ($20^4$ grid)  & &   \\ \hline
\end{tabular}
}
\caption{Summary of comparative experimental results}
  \label{table-results-summary}
\end{table*}

%in such cases, we naively chose an action at random and re-tried the
%search from the resulting state, which usually worked.
%(Cleverer methods of dealing with such failures can certainly be
%imagined, but are not investigated here.)

Trends we draw attention to: Local Search consistently beats No Search,
but at the cost of increased computational time. Informed Global Search (IGS)
significantly beats No Search; and it also searches {\em much} less of
state space than Uninformed Global Search (UGS), resulting in
correspondingly faster running times.
%scottd: added next two sentences
In fact, because the solutions
found by IGS are often of much shorter length than when using
no search at all, the computational time per trial is sometimes essentially
{\em the same} for IGS and No Search, while the quality of
the solution found by IGS is many times
better --- for example,  a factor of 4 in the {\slider} domain.
(It performs a factor of 10 better in the {\cartpole} domain,
but that is largely a function of the particular penalty associated with
the pole falling over.)
Also note that because of the sparse
representation of Global Search grids, we can comfortably use 
grid resolutions as high as $50^4$ without running out of memory.
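The sparse representation amounts to hashing grid-element index tuples, so
memory scales with the number of cells actually visited rather than with the
full resolution.  A minimal sketch (the mapping below is illustrative; our
implementation details may differ):

```python
def cell_index(s, lo, hi, res):
    """Map a continuous state s to the index tuple of its grid element.

    lo and hi give the per-dimension bounds of the state space and res the
    per-dimension resolution.  Storing only visited cells in a dict keyed by
    these tuples makes a 50^4 grid affordable: cost is proportional to the
    number of cells reached, not to 50^4.
    """
    return tuple(min(int((x - l) / (h - l) * res), res - 1)
                 for x, l, h in zip(s, lo, hi))
```

A search then keeps, say, `visited[cell_index(s, lo, hi, res)] = representative`,
allocating nothing for the vast majority of cells it never touches.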

%scottd: changed next sentence
%While relatively simple, {\mountaincar} demonstrates interesting phenomena. 
% jscottd: changed it back since similar apologetics now appear
%   earlier in the paper
%While too simple a problem for the search techniques described here to
%be useful in practice (since it would have been easy to have simply
%calculated a much more accurate value function), the {\mountaincar}
%results demonstrate interesting phenomena that carry over to
%higher-dimensional problems.  
While relatively~simple, {\mountaincar} demonstrates interesting phenomena. 
Despite the use of a $50^2$ grid for the
global search, UGS surprisingly often fails to find a path to the
goal, whereas IGS, despite searching much less of the
state space, succeeds.  This is because IGS uses a
value function to guide its pruning of multiple trajectories entering
the same grid cell, and therefore makes a better selection of
``representative states'' for grid elements.
%scottd: added next sentence
% jscottd: then took it back out again
%(In particular, it normally recognized high-velocity states as better
%in the appropriate parts of the state space, whereas UGS often
%erroneously pruned these states in favor of lower-velocity states in
%the same grid elements.)  
This also helps explain why IGS finds better
solutions than UGS on two of the three four-dimensional domains.

%scottd: added paragraph break here
When the Global Search grid resolution is increased to $100^2$ for
\mountaincar, both UGS and IGS consistently succeed.  But UGS (mean
cost 109) now finds better solutions than IGS (mean cost 138).  The
finer search grid makes good selection of representative states
less important; meanwhile, inaccuracies in the value function guiding
Informed Global Search cause it to miss certain good
trajectories. This is a phenomenon that often occurs in $\astar$-like
searches when one's heuristic evaluation function is not strictly
optimistic~(Russell and Norvig 1995).\nocite{rssl:artf} This is not a
problem for UGS, which is effectively using the maximally optimistic
``constant 0'' evaluation function.
%scottd: added stuff below
It is interesting to note that in the {\cartpole} domain, in which UGS
found better solutions than IGS, the step size was large
enough and the dynamics nonlinear enough that single steps often
crossed multiple grid elements, and each grid element was typically
reached no more than once during the search.
Thus, this was a case in which IGS's ability
to discriminate between good and bad states within the same
grid element was not relevant.

%100^2:  with cost about 109.
%furthermore, while the average cost of the informed searcher's
%trajectories only decreased from 151 to 138, the uninformed searcher
%managed an average cost of only 109.  Because the approximated value

%function is not ``optimistic'' -- that is, it is not an admissible
%heuristic for $\astar$ search -- the informed searcher still finds
%significantly suboptimal trajectories to the goal.  Since the
%uninformed searcher effectively has an ``optimistic'' heuristic,
%however, it finds trajectories which are close to optimal.
%
% ayn: i think it's actually pruning, not optimism, that is the issue. 
% i.e. the "constant 0" value function is optimistic, but doesn't do
% such a great job because of the pruning.

\section{LLS and LGS: Learning a Model Online}

\newcommand{\kdtree}{{\it k}d-tree}

Occasionally, the state transition function is not known but rather
must be learned online.  This does not preclude the use of online
search techniques; as a toy example, Figure~\ref{learn} shows
cumulative reward learning curves for {\mountaincar}.  For each action,
%a {\kdtree} implementation of 1-nearest-neighbor~(Moore {\it et al.} 1997)\nocite{mr:lwpr} is used to learn the state
a {\kdtree} implementation of 1-nearest-neighbor~(Friedman {\it et al.} 1977)\nocite{bntl:anal} is used to learn the state
%neighbor~\cite{MooreSchneiderDeng97} is used to learn the state
transitions, and to encourage exploration, states sufficiently far
from points stored in both trees are optimistically assumed to be
zero-cost absorbing states.  
% ascottd: changed next sentence
%A $7^2$ value function approximator 
A 7-by-7 simplex-interpolation grid, used as the value function approximator,
is updated online with the changing state transition model.  Without
search, the learner eventually attains an average cost per trial of
about 212; with Learning Global Search (LGS) (search grid resolution
$50^2$), it quickly (after about 5 trials) achieves an average cost of
155; with Learning Local Search (LLS) ($d=20, s=1$), it achieves an
average cost of 127 (also after about 5 trials).
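The learned model just described can be sketched as follows.  A plain linear
scan stands in for the {\kdtree} here, and the distance threshold is an
assumed parameter, not a value from our experiments:

```python
class NNModel:
    """1-nearest-neighbor transition model for a single action.

    A linear scan stands in for the kd-tree used in the paper.  Queries
    sufficiently far from all stored experience are treated optimistically
    as zero-cost absorbing transitions, which encourages exploration of
    unvisited regions of the state space.
    """

    def __init__(self, radius):
        self.radius = radius
        self.data = []                    # (state, next_state, reward) triples

    def add(self, s, s2, r):
        self.data.append((s, s2, r))

    def predict(self, s):
        def dist2(entry):
            return sum((a - b) ** 2 for a, b in zip(s, entry[0]))
        if not self.data:
            return s, 0.0                 # no data yet: optimistic absorbing state
        nearest = min(self.data, key=dist2)
        if dist2(nearest) > self.radius ** 2:
            return s, 0.0                 # too far from experience: optimistic
        return nearest[1], nearest[2]
```

One such model is kept per action; the searches and the value-function
updates both query `predict`, so nearest-neighbor lookups must be cheap.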

%\parbox{2.80in}
%{
%\centerline{\psfig{figure=learn.ps,width=2.70in,height=2.25in}}
%\refstepcounter{my-figures-ctr}
%  \label{fig-learn} \centerline{\parbox{2.55in} {%\footnotesize
%  Figure~\ref{fig-learn}: Cumulative reward curves on {\mountaincar}
%  with model learning.
%%		(Shallow gradients are good.)
%		}}
%}
%\smallskip
\normalfig{learn}{Cumulative reward curves on {\mountaincar} with 
model learning.  (Shallow gradients are good.)}

%scottd REUNCOMMENTED (e.g., basically moved back here from the ``Uninformed
%  Global Search'' section)
%scottd RECOMMENTED
%ayn UNCOMMENTED

As before, when the planner finds a trajectory to the goal, it is
executed in its entirety in an open-loop fashion.  But when we are
learning a model of the world, the agent may successfully plan a
continuous trajectory using the learned model yet fail to reach the
goal when it tries to follow that trajectory.  Such a failure can be
attributed directly to inaccuracies in the agent's model; by
executing the path anyway, the agent naturally reaches the area
where the actual trajectory diverges from the planned
trajectory, and thereby improves its model of the world in that area.

However, several interesting issues do arise when 
the state transition function is being approximated online.  
Inaccuracies in the
model may cause the Global Searches to fail in cases where more
accurate models would have let them find paths to the goal.
%scottd: added next sentence
Optimistic exploration policies can be used to help the system
gather enough data to reduce these inaccuracies, but in
even moderately high-dimensional spaces such exploration would
become very expensive.
Furthermore, trajectories supposedly found during search will
certainly not be followed exactly by an open-loop controller; adaptive
closed-loop controllers may help alleviate this problem to some
extent.  
Finally, using the models to predict state transitions should be 
computationally cheap, since we will be using them to update the 
approximated value function with the changing model, as well as 
to perform searches.

%Interesting There is no space to do more than summarize the interesting issues
%when using search with online model-learning.  Model inaccuracies may
%cause the Global Search to fail. Trajectories supposedly found by
%Global Search will not be followed open loop so closed-loop trajectory
%trackers should be used. Model predictions must be very cheap, as they
%are heavily used in search and value function approximation.

\section{Future Research}

How well will these techniques extend to non-deterministic systems?
They may work for problems in which certain regularity assumptions are
reasonable, but more sophisticated state transition function approximators
may be required when learning a model online.

How useful is Local Search in comparison with building a local linear
controller for trajectories? During execution some combination
of the two may be best. Local Search also plays an important role
in the inner loop of global search; it is unclear how local linear
control could do the same.
%ayn: A lot! I still believe non-holonomy will kill local controllers.
% Plus, local search applies not only to dynamical systems, so I'd rather
% not being up local controllers.
%%%awm I will go and find a large eel to slap you around the head with.
%%%awm What you say might be true, but it must be said, else Atkeson-like
%%%awm reviewers will simply whine about us not mentioning this. They'll
%%%awm think we're just a bunch of computer scientists who don't know
%%%awm a thing about real control theory. (Which is sadly somewhat true).

The experiments presented here are low-dimensional.  It is encouraging
that informed search permits us to survive $50^4$ grids, but to properly
thwart the curse of dimensionality, we conclude that
%scottd: changed to list format, rearranged the order a bit, and reordered
%stuff
\begin{enumerate}
\item Informed Global Search (IGS) is often much more tractable 
         than Uninformed Global Search (UGS), even with relatively
         crudely approximated value functions.
\item However, more accurate (yet computationally tractable) value function 
approximators may be needed than the simplex-grid-based approximators
used here.
\item Variable resolution methods (e.g. extensions
to~(Moore and Atkeson 1995))\nocite{mr:thpr2} would probably be needed for 
the Global Search's state-space partitions rather than the uniform grids
used here.
\end{enumerate}
% ayn: changed "will" to "might" since NNs do perfectly well in high dims.
%
%ayn: choice of DP solver is orthogonal to search
%%%awm More reasonable, but again let's be sure the reviewers know this too.
%%%awm A suggested addition of mentioning TD and LQR above will solve this.

% ascottd: added following 
The algorithms tested in this paper calculated the approximate value
functions used by their search procedures independently of any
particular trajectories that were subsequently searched or executed.
However, it might be better to use points along such trajectories
to further update the value function in order to concentrate computational
time and value function approximator accuracy on the most relevant
parts of the state space.  The resulting algorithm would be reminiscent
of Korf's $RTA^{\ast}$~(Korf 1990)\nocite{krf:rltm} and Barto's
$RTDP$~(Barto {\it et al.} 1995)\nocite{brt:rltm}.

% ascottd: added following
The trajectories found by the algorithms described in this paper
use a small discrete set of actions, and do not always switch between
these actions in a completely locally optimal manner.  In domains
where the action space is actually continuous, it would be useful to
use a local trajectory optimization routine such as that used
in~(Atkeson 1994)\nocite{atks:usng} in order to fine-tune the
discovered trajectories.

Lastly, algorithms to learn reasonably accurate yet consistently
``optimistic''~(Russell and Norvig 1995)\nocite{rssl:artf} value
functions might be helpful for Informed Global Search.

%\vskip 0.01in{}

{\footnotesize
\bibliography{all-awm,cv-bib}
%\bibliography{nrdp}
%\bibliographystyle{named}}
\bibliographystyle{aaai}
}
\end{document}
