CONTENTS
Goals
----- Products
Initial Paper Goals
Focus
Example problems we'll use:
**** Main technologies to be implemented....
Discrete large, non-deterministic-but-sparse, MDPs
----- Value Iteration
----- Policy Iteration
----- Puterman's Modified Policy Iteration
----- Momentum
----- Aitken Acceleration
----- Successive Over-relaxation
----- TD / Q-learning
----- Linear Programming
----- State Aggregation
----- Priority Queues and other fancy backup scheduling methods
----- Clever Stopping criteria
Finding Policies for Continuous State-space dynamic problems
----- Convert to Discrete MDP by gridding
----- Convert to Discrete MDP by bilinear interpolation
----- Convert to Discrete MDP by triangulation
----- Multigrid
----- Neural Net value function approximation
----- Variable Resolution Partitioning
Finding (locally optimal) trajectories (to goal regions)
----- Latombe-style search
----- Local trajectory optimization
----- Atkebubbles
Bibliography
Calendar
Backup Scheduling
----- Backup Ordering Challenge
----- Re: Backup Ordering Challenge
----- Subject: Re: Backup Ordering Challenge
----- Subject: Re: Backup Ordering Challenge
Goals
This is a project involving Scott Davies, Gary Boone, Andrew Moore, Chris Atkeson, Jeff Schneider, Anthony Dilello and hopefully other interested parties over the Summer 96 to implement and improve and apply practical techniques for Dynamic Programming, Optimal Control and Reinforcement Learning problems.
Products
We hope to have a thorough bibliography of the area.
We hope to have a survey of the important literature of the area.
We hope to have a readable, useful, "Numerical Recipes in C style" description of the most useful DP techniques, which will become a widely used journal paper in the area.
We will have a good, blazingly fast, C library of DP / RL / Optimal control techniques. Possibly embedded in a Matlab toolbox or other suitable delivery system. Delivered free across the Web and elsewhere as a resource for researchers and industry.
We will have a set of case studies demonstrating the use of the methods and some of the trade-offs.
Ideally, (but there may not be time this summer), we will also attack Model Learning problems (same framework as the recent plethora of Acrobot papers). This includes combining a learned model with DP and Exploration.
Initial Paper Goals
(June10)
Chris suggests:
I think it should be two papers:
Paper 1: Numerical Recipes for Discrete State Problems
Paper 2: Numerical Recipes for Continuous State Problems
In addition to truly continuous state methods, Paper 2 also talks about transforming continuous state problems into discrete state problems, but refers to Paper 1 for material on how to solve the transformed problem efficiently.
Focus
I guess we're most interested in continuous-state dynamic control problems specified by goals and/or arbitrary reward functions. We expect to focus on 1-5 dimensional state spaces, but with an eye to coping with higher dimensional problems that are "simple" in some respects.
Example problems we'll use:
Puddle-world (obstacle avoidance on a 2-d grid)
Double Integrator (2-d space)
Car-on-hill (2-d)
Non-holonomic vehicles (3-d) (Is this too different from the rest?)
Cart-pole stabilization (4-d)
Cart-pole minimal-time maneuvers (4-d)
Two joint arm Ball-tracking/catching/hitting (4-d .. 8-d)
(June10) Anthony now has a nice, fairly realistic simulation of this, and a primitive arm-controlling, ball-hitting algorithm. And a good animation. Those of you who want to see it should do the following:
The executable is named basket and is in the basket directory. There are no special command line arguments.
Acrobot Sutton-task (4-d)
Acrobot swing-up and balance vertically (4-d)
Aircraft Landing (using Navy simulation) (?-d)
Food cooking (?-d)
**** Main technologies to be implemented....
Discrete large, non-deterministic-but-sparse, MDPs
These are the kinds of systems you end up needing to solve when you use an interpolating function approximator in a continuous-state problem. Possible technologies are...
Value Iteration
We all know about this... I think?
Policy Iteration
We know the algorithm, but have we ever used it? (When) does it work? Key question: how do we practically do the policy evaluation step? Ideas:
(1) Gauss-Seidel Sweeps -- we think this never buys anything you wouldn't have got with value iteration. Maybe if there were thousands of actions per state it would be a win. (June10) no change in our thinking here
(2) Direct Matrix Inversion: Too expensive above say 30 states (June10) no change in our thinking here
(3) Fancy sparse/stochastic matrix inversion. Is SOR the thing? (June10) We looked at \cite{Young88} which is a pretty good description of iterative (as opposed to direct) methods for inverting sparse matrices. Bottom line is that SOR looks fairly hopeless because it requires a more special form of the matrix than we can guarantee. There are obscure references to general SOR techniques \cite{YoungEidson70,YoungHuang83,Huang83} that Scott is tracking.
(June10) Or a conjugate gradient thing? Yes, looks promising! Scott is conversing with Numerical Recipes in C at this moment. We need to be sure that NRIC implementation doesn't need O(N^2) matrix storage space. We think this should be implemented and tested (Candidate for Auton/GT seal of approval).
(4) Monte Carlo simulation of the value of a policy by, well, running it a lot. Could this sometimes be important? (June10) Still at the level of something to list, but we're not excited by it. Any evidence of literature about this? (June11) Actually, this could be great for not-very-non-deterministic problems, e.g. the result of an interpolated value function. Here, for each grid corner, you do one policy simulation and compute the costs. Total cost of policy evaluation = (Number of Grid Points) * (Typical Simulation Length). Compare to cost of Matrix Inversion Policy evaluation, which could be O(Number of Grid Points cubed), and this might look attractive.
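For concreteness, here is a minimal MATLAB sketch of option (4), Monte Carlo policy evaluation by rollout. It assumes the fixed policy's transition matrix P (row s is the successor distribution from state s), a per-step cost vector c, a discount factor gamma, and rollout parameters n_rollouts and max_len; all of these names are illustrative, not a committed design.

function V = mc_policy_eval(P, c, gamma, n_rollouts, max_len)
  % Estimate V(s) for a *fixed* policy by simulating from each state and
  % averaging the (truncated, discounted) costs along the rollouts.
  S = length(c);
  V = zeros(S, 1);
  for s0 = 1:S                              % one estimate per state / grid point
    total = 0;
    for r = 1:n_rollouts
      s = s0;  disc = 1;  G = 0;
      for t = 1:max_len                     % truncate long simulations
        G = G + disc * c(s);
        disc = disc * gamma;
        cp = cumsum(P(s, :));  cp(end) = 1; % guard against rounding
        s = find(rand <= cp, 1);            % sample the successor state
      end
      total = total + G;
    end
    V(s0) = total / n_rollouts;
  end
end

With n_rollouts = 1 the total work is roughly (Number of Grid Points) * (Typical Simulation Length), which is the cost comparison made above against an O(N^3) matrix solve.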
Puterman's Modified Policy Iteration
We need Puterman's book. (June10) We have Puterman's book. It's in the lab. Looks good. We must ask ourselves "What will our survey do that this book doesn't?". I suspect the answer strongly lies in the computational issues, where we can become the de facto experts.
Scott's evaluation of Puterman's initial MPI paper reveals some strangeness in the results (plus it wasn't implemented on any really excitingly large problems). We should implement and test it.
Also, look into action elimination procedures. (Puterman's book suggests large speedups, though again this seems a bit surprising.)
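To make the comparison with value iteration and full policy iteration concrete, here is a minimal MATLAB sketch of the modified-policy-iteration idea, assuming cost minimization with a cell array P{a} of transition matrices, a cost matrix c(s,a), a discount gamma, and m partial-evaluation sweeps per improvement; the names and structure are illustrative, not Puterman's exact pseudocode.

function [V, pol] = mpi_sketch(P, c, gamma, m, n_iter)
  % Modified policy iteration: greedy improvement, then only m sweeps of
  % successive approximation under the fixed policy instead of an exact
  % linear solve.  m = 0 reduces to value iteration; large m approaches
  % full policy iteration.
  [S, A] = size(c);
  V = zeros(S, 1);
  for k = 1:n_iter
    Q = zeros(S, A);
    for a = 1:A
      Q(:, a) = c(:, a) + gamma * P{a} * V;   % one-step lookahead
    end
    [V, pol] = min(Q, [], 2);                 % policy improvement (and one backup)
    Ppol = zeros(S, S);  cpol = zeros(S, 1);
    for s = 1:S
      Ppol(s, :) = P{pol(s)}(s, :);           % model under the chosen actions
      cpol(s) = c(s, pol(s));
    end
    for j = 1:m                               % partial policy evaluation
      V = cpol + gamma * Ppol * V;
    end
  end
end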
Momentum
Andrew will try to find literature that discusses whether a momentum term in value iteration helps. Related to over-relaxation. (June10) Yes: this type of technique looks promising and worth implementing and testing. But let's save the discussion for Aitken, below.
Aitken Acceleration
See \cite{ConteDeboor80} for a nice discussion (pages 95-100). Basic idea: Suppose you have a sequence of scalars that you are generating x(1), x(2), x(3), ... by some iteration
x(n+1) = g(x(n))
and you're expecting them to converge linearly (i.e. number of decimal places of accuracy is approx linear in N) to a final value z, where z = g(z).
Suppose you have an estimate of g'(z).
Then beginning with the following
g(z) = z
x(n+1) = g(x(n))
g'(z) =approximately= ( g(x(n)) - g(x(n-1)) ) / ( x(n) - x(n-1) )
g(z+h) =approximately= g(z) + h g'(z)
Finally gets you (with elementary algebra) to an estimate of the z you're converging to from x(n-1) , x(n) and x(n+1):
z =approximately= zestimate =
x(n+1) - (x(n+1) - x(n))^2 / (x(n+1) - 2x(n) + x(n-1))
and zestimate is a much better estimate of z than x(n+1) subject to all sorts of weaselly assumptions.
How do you make this accelerate the convergence of a series? One way is Steffensen Iteration:
Start at point xx(0)
For n = 0 , 1 , ....
x0 = xx(n)
x1 = g(x0)
x2 = g(x1)
xx(n+1) = x2 - (x2 - x1)^2 / (x2 - 2x1 + x0)
Until xx() series has converged (testing this is a whole different story).
I propose we try accelerating the sequence of values for each discrete state of a value iteration in this way. This could be done for the Gauss-Seidel policy evaluation step inside policy iteration. Or it could be done within a full value iteration.
(June10) As described above, looks unpromising for whole matrices, but could be done individually for values of individual states during value iteration. This is equivalent to using momentum in neural net training. Easy to investigate, and worth doing so.
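Here is a minimal MATLAB sketch of that proposal: the Steffensen update applied componentwise to the value vector. backup is assumed to be a function handle performing one full Bellman backup of V (value iteration, or the Gauss-Seidel policy evaluation step); the names and the fixed outer loop are illustrative only.

function V = steffensen_vi(backup, V, n_outer)
  % Per-state Aitken/Steffensen acceleration: take two ordinary backups,
  % then jump each state's value toward its estimated fixed point.
  for k = 1:n_outer
    x0 = V;
    x1 = backup(x0);
    x2 = backup(x1);
    denom = x2 - 2*x1 + x0;
    safe = abs(denom) > 1e-12;               % avoid dividing by ~zero
    V = x2;                                  % default: plain backup result
    V(safe) = x2(safe) - (x2(safe) - x1(safe)).^2 ./ denom(safe);
  end
end

In practice a convergence test on successive iterates would replace the fixed n_outer loop, and the correction could be clipped rather than skipped when the denominator is small.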
TD / Q-learning
We think this is useless for this case of a discrete MDP with a known model. I bet Rich (Sutton) would disagree. We should ask him.
Linear Programming
Suggested by Mike Trick at GSIA (CMU). (He claims that contrary to conventional wisdom, this is practical). (June10) Far too complex for us to try. We may be able to convince Justin, who is turning into a serious LPhead, to do this for us.
State Aggregation
(June10) Sounds surprisingly naive. <<<Jeff, I'm running out of time. Could you describe it, and why it seems so naive, here? This is in g/doc/dp.doc>>>
Priority Queues and other fancy backup scheduling methods
(June10) Jeff has started a fun line of research here. The email concerning this is in the section below.
Finding Policies for Continuous State-space dynamic problems
Convert to Discrete MDP by gridding
Convert to Discrete MDP by bilinear interpolation
Convert to Discrete MDP by triangulation
Multigrid
Neural Net value function approximation
Variable Resolution Partitioning
Finding (locally optimal) trajectories (to goal regions)
Latombe-style search
Local trajectory optimization
Atkebubbles
Bibliography
(Folks; start making a bibliography here...or else suggest an already-under-construction bibliography we should use).
Calendar
Tuesday June 4th, 3.30pm: Scott, Andrew, Jeff will have reviewed some of the least-understood methods above and will report their findings.
(June10) Tues June 11th, 3.30pm: Scott, Andrew, Jeff meet. Start talking about C-coding.
Backup Scheduling
Backup Ordering Challenge
Hi Guys,
This little chunk of MATLAB code computes the optimal values:
%% V(0) = 0;
%% V(1) = 1 + 1/3 V(2)
%% V(i) = 1 + 1/3 V(i+1) + 2/3 V(i-1)
%% V(25) = 1 + V(24)
%% Solve the above equations. Ignore V(0).
v1 = ones(1,25);
v2 = -2/3 * ones(1,24);
v3 = -1/3 * ones(1,24);
A = diag(v1,0) + diag(v2,-1) + diag(v3,1);
A(25,24) = -1.0;
b = ones(25,1);
v = inv(A)*b
The values for 1-25 (V(0) = 0.0) are:
3.0000 6.0000 9.0000 12.0000 15.0000 18.0000 21.0000
24.0000 26.9999 29.9999 32.9998 35.9995 38.9990 41.9980
44.9961 47.9922 50.9844 53.9688 56.9375 59.8750 62.7500
65.5000 68.0000 70.0000 71.0000
Jeff.
Subject: Re: Backup Ordering Challenge
From: Jeff.Schneider@J.GP.CS.CMU.EDU
To: Andrew.W.Moore@J.GP.CS.CMU.EDU, Jeff.Schneider@J.GP.CS.CMU.EDU,
Scott.Davies@J.GP.CS.CMU.EDU
Note the long interval between this and my previous note (as compared with the interval between the previous 2). That should tell you something about how I did on my predictions :-)
So here's my summary:
Randomly choose backups: 5568 iterations (2)
Loop through the states 1-25 and repeat: 2375 iterations (0)
Loop through the states 25-1 and repeat: 2951 iterations (1)
Alternate forward and backward (as in SSOR): 2775 iterations (4)
Priority sweeps (priority is the max of the
impacts from a change in either neighbor): 3682 iterations (3)
I was disgusted by that last one so I started to wonder about the right way to compute priority. The method I used above was just one possibility. Finally, I decided there was somewhat of an implicit assumption that the right thing to do was to find the one that would have the largest change.
So I implemented that cheat: 4265 iterations!! (3)
So at this point, I'm throwing my hands in the air and asking for your reactions. I can't believe that last number, but if it's true it certainly makes a statement about this kind of prioritization for iterative solutions on known models.
I'll offer my code to anyone who wants to see this stuff in action for themselves (or implement wiser backup orderings). It is called wip3. The command line arguments are:
print_interval - (default 100) print the value function values out every
print_interval steps. 0 means never print.
show_interval - (default 0) as in print_interval, except it's a graphic
illustration of the value and priority functions. The top
plot is the value function and does not erase, so you can see
it filling itself in. The bottom plot is the priority values
and they do erase and redraw each time. There is also a
pathetic attempt to keep them at a reasonable scale so they
bounce around a bit.
backup_type - this is the important one. The code describes which types
are which. Those cryptic numbers in parentheses after the
results are the corresponding types.
priority_type - 0 means use the priority scheme I described first. 1 means
use the cheat of changing the one that will change the most.
This parameter only matters for backup_type 3, but the plot
shows its correct values even when not using backup_type 3.
seed - random seed for the random chooser
verbosity - at 0 or lower, the plotting goes fast to make an animation.
at higher levels, there is a wait_for_key after each plot.
One final request. The result with priority_type 1 is bad enough, but I would like someone to explain the appearance of the value function plot with it. The fastest way to see it is:
wip3 backup_type 3 show_interval 1 print_interval 0 priority_type 1
Maybe it's just some artifact of the pixel spacing, but none of the others look like this. It makes me suspicious of the code, but I've checked it a couple of times.
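For anyone who wants the gist without running wip3, here is a minimal MATLAB sketch (not Jeff's code) of the "cheating" prioritized scheme on the 25-state chain above: always back up the state whose value would change the most right now. All names are illustrative.

function [V, backups] = chain_priority_sketch()
  % Priority of state i = magnitude of the change a backup of i would make.
  N = 25;  V = zeros(1, N);  tol = 1e-4;  backups = 0;
  pri = zeros(1, N);
  for i = 1:N
    pri(i) = abs(chain_backup(V, i, N) - V(i));
  end
  [p, i] = max(pri);
  while p >= tol
    V(i) = chain_backup(V, i, N);            % perform the highest-priority backup
    backups = backups + 1;
    pri(i) = 0;                              % its own priority drops to zero
    for j = [i-1, i+1]                       % only its neighbours' priorities change
      if j >= 1 && j <= N
        pri(j) = abs(chain_backup(V, j, N) - V(j));
      end
    end
    [p, i] = max(pri);
  end
end

function v = chain_backup(V, i, N)
  % One Bellman backup for the chain MDP above (V(0) = 0 at the goal).
  if i == N
    v = 1 + V(N-1);
  elseif i == 1
    v = 1 + (1/3) * V(2);
  else
    v = 1 + (1/3) * V(i+1) + (2/3) * V(i-1);
  end
end

Counting backups from this loop against a plain 1-to-25 sweep gives the kind of comparison reported in the results above.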
Subject: Re: Backup Ordering Challenge
From: Scott Davies <scottd@lump.sp.cs.cmu.edu>
To: schneide@cs.cmu.edu, awm@cs.cmu.edu
Many thanks to Jeff for providing his code. The animations are fun to watch. :-)
I've been playing with a few off-the-wall heuristics over the last day, and I haven't had too much luck either. The one thing I did find that did marginally better than Gauss-Seidel was to use something close to Jeff's "cheating" prioritized sweeping, but dividing that weight by the square of the state number, thus concentrating backups close to the goal at the beginning. (Result: 2270 backups, for a whopping savings of 105 backups over Gauss-Seidel. Wheee! Not to mention the fact that this is obviously a completely domain-specific hack.) Most of the screwier ideas I had before running these experiments had to be modified or tossed -- not just any priority heuristic will do, since there's an excellent chance that the system will get wedged into repeating zero-change backups unless you make absolutely sure that that can't happen.
What code I've kept is in the sdwip directory; the executable is wip3. It uses priority_type 2, which sticks the magnitude of the last backup in P; the actual backup heuristics, 7 through 9, may use various things other than P to figure out their actual priorities. (Putting the magnitudes of the last backups in P makes for some moderately informative animations.) Heuristic 9 is the one that gave the improvement; the others were pretty miserable compared to Gauss-Seidel, though not much worse than Jeff's prioritized sweeping stuff. (7 is a multiply-the-"cheating"-prioritized-sweeping-value-by- a-difference-between-the-magnitudes-of-the-neighbors'-backups hack. 8 is a divide-the-"cheating"-prioritized-sweeping-value-by- the-second-moment-of-the-magnitudes-of-backups-taken-around-the-state- in-question hack. Don't ask me why I thought those were good ideas.)
It is pretty annoying that Gauss-Seidel is so hard to beat, but I suspect that's largely an aspect of this particular domain. I'd bet that in larger, more uneven state spaces, prioritized sweeping-type heuristics would probably fare comparatively better. (But, I wouldn't bet too much.)
-Scott
Subject: Re: Backup Ordering Challenge
From: Jeff.Schneider@J.GP.CS.CMU.EDU
To: awm@CS.CMU.EDU, schneide@CS.CMU.EDU, scottd@LUMP.SP.CS.CMU.EDU
One extra thing I tried yesterday (which I already mentioned to Andrew).
Given the failure of various "smart" schemes and the relative success of sweeping through the states forward and/or backward, I wondered if the only thing that really mattered was that you hit all the states somewhat frequently. So I implemented a method that randomly orders the 25 states and then repeatedly cycles through that ordering. The result of 10000 trials:
avg: 2664
std dev: 36
range: [2536 2798]
compare to:
sweep through backward from goal: 2375
" " forward to goal: 2951
So that average is almost exactly between the two special cases, and the variance is small. That leads me to my most recent hypothesis: the most important thing is hitting all the states frequently, and the order in which you do so isn't particularly important.
Jeff.