Dynamic Programming Recipes in C

CONTENTS

Goals
----- Products
Initial Paper Goals
Focus
Example problems we'll use:
**** Main technologies to be implemented....
Discrete, large, non-deterministic-but-sparse MDPs
----- Value Iteration
----- Policy Iteration
----- Puterman's Modified Policy Iteration
----- Momentum
----- Aitken Acceleration
----- Successive Over-relaxation
----- TD / Q-learning
----- Linear Programming
----- State Aggregation
----- Priority Queues and other fancy backup scheduling methods
----- Clever Stopping criteria
Finding Policies for Continuous State-space dynamic problems
----- Convert to Discrete MDP by gridding
----- Convert to Discrete MDP by bilinear interpolation
----- Convert to Discrete MDP by triangulation
----- Multigrid
----- Neural Net value function approximation
----- Variable Resolution Partitioning
Finding (locally optimal) trajectories (to goal regions)
----- Latombe-style search
----- Local trajectory optimization
----- Atkebubbles
Bibliography
Calendar
Backup Scheduling
----- Backup Ordering Challenge
----- Re: Backup Ordering Challenge
----- Subject: Re: Backup Ordering Challenge
----- Subject: Re: Backup Ordering Challenge

Goals

This is a project involving Scott Davies, Gary Boone, Andrew Moore, Chris Atkeson, Jeff Schneider, Anthony Dilello, and hopefully other interested parties over Summer 96 to implement, improve, and apply practical techniques for Dynamic Programming, Optimal Control, and Reinforcement Learning problems.

Products

We hope to have a thorough bibliography of the area.

We hope to have a survey of the important literature of the area.

We hope to have a readable, useful, "Numerical Recipes in C"-style description of the most useful DP techniques, which will become a widely used journal paper in the area.

We will have a good, blazingly fast, C library of DP / RL / Optimal control techniques. Possibly embedded in a Matlab toolbox or other suitable delivery system. Delivered free across the Web and elsewhere as a resource for researchers and industry.

We will have a set of case studies demonstrating the use of the methods and some of the trade-offs.

Ideally (but there may not be time this summer), we will also attack Model Learning problems (the same framework as the recent plethora of Acrobot papers). This includes combining a learned model with DP and Exploration.

Initial Paper Goals

(June10)

Chris suggests:

I think it should be two papers:

Paper 1: Numerical Recipes for Discrete State Problems

Paper 2: Numerical Recipes for Continuous State Problems

In addition to truly continuous-state methods, Paper 2 also talks about transforming continuous-state problems into discrete-state problems, but refers to Paper 1 for material on how to solve the transformed problem efficiently.

Focus

I guess we're most interested in continuous-state dynamic control problems specified by goals and/or arbitrary reward functions. We expect to focus on 1-5 dimensional state spaces, but with an eye to coping with higher-dimensional problems that are "simple" in some respects.

Example problems we'll use:

Puddle-world (obstacle avoidance on a 2-d grid)

Double Integrator (2-d space)

Car-on-hill (2-d)

Non-holonomic vehicles (3-d) (Is this too different from the rest?)

Cart-pole stabilization (4-d)

Cart-pole minimal-time maneuvers (4-d)

Two joint arm Ball-tracking/catching/hitting (4-d .. 8-d)

(June10) Anthony now has a nice, fairly realistic simulation of this, and a primitive arm-controlling, ball-hitting algorithm. And a good animation. Those of you who want to see it should do the following:

The executable is named basket and is in the basket directory. There are no special command line arguments.

Acrobot Sutton-task (4-d)

Acrobot swing-up and balance vertically (4-d)

Aircraft Landing (using Navy simulation) (?-d)

Food cooking (?-d)

**** Main technologies to be implemented....

Discrete, large, non-deterministic-but-sparse MDPs

These are the kinds of systems you end up needing to solve when you use an interpolating function approximator in a continuous-state problem. Possible technologies are...

Value Iteration

We all know about this... I think?
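
For concreteness, here is a minimal sketch of one value-iteration sweep over a sparse discrete MDP. The data structures and the GAMMA constant are illustrative assumptions, not a settled library interface; values are treated as costs to be minimized.

      /* Minimal value-iteration sweep for a sparse discrete MDP.
       * Struct names and GAMMA are illustrative, not a settled interface. */

      #include <math.h>

      #define GAMMA 0.99          /* discount factor (assumed) */

      typedef struct {            /* one successor of a (state, action) pair */
          int    next;            /* successor state index */
          double prob;            /* transition probability */
      } succ_t;

      typedef struct {            /* sparse model of one action at one state */
          double  cost;           /* immediate cost of taking the action */
          int     n_succ;         /* number of nonzero-probability successors */
          succ_t *succ;
      } action_t;

      typedef struct {
          int       n_actions;
          action_t *act;
      } state_t;

      /* One full sweep; returns the largest value change (the Bellman
       * residual), the natural quantity for a stopping criterion. */
      double vi_sweep(state_t *mdp, double *v, double *v_new, int n_states)
      {
          int i, a, k;
          double max_change = 0.0;

          for (i = 0; i < n_states; i++) {
              double best = HUGE_VAL;
              for (a = 0; a < mdp[i].n_actions; a++) {
                  action_t *ac = &mdp[i].act[a];
                  double q = ac->cost;
                  for (k = 0; k < ac->n_succ; k++)
                      q += GAMMA * ac->succ[k].prob * v[ac->succ[k].next];
                  if (q < best)
                      best = q;
              }
              v_new[i] = best;
              if (fabs(best - v[i]) > max_change)
                  max_change = fabs(best - v[i]);
          }
          return max_change;
      }

Call it with two buffers and swap them between sweeps; writing into v in place instead gives the Gauss-Seidel variant that comes up under Policy Iteration below.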

Policy Iteration

We know the algorithm, but have we ever used it? (When) does it work? Key question: how do we practically do the policy evaluation step? Ideas:

(1) Gauss Seidel Sweeps -- we think this never buys anything you wouldn't have got with value iteration. Maybe if there were thousands of actions per state it would be a win. (A sketch of the evaluation sweep appears after this list.) (June10) no change in our thinking here

(2) Direct Matrix Inversion: Too expensive above, say, 30 states. (June10) no change in our thinking here

(3) Fancy sparse/stochastic matrix inversion. Is SOR the thing? (June10) We looked at \cite{Young88}, which is a pretty good description of iterative (as opposed to direct) methods for inverting sparse matrices. Bottom line is that SOR looks fairly hopeless because it requires a more special form of the matrix than we can guarantee. There are obscure references to general SOR techniques \cite{YoungEidson70,YoungHuang83,Huang83} that Scott is tracking.

(June10) Or a conjugate gradient thing? Yes, looks promising! Scott is conversing with Numerical Recipes in C at this moment. We need to be sure that the NRIC implementation doesn't need O(N^2) matrix storage space. We think this should be implemented and tested (Candidate for the Auton/GT seal of approval).

(4) Monte Carlo simulation of the value of a policy by, well, running it a lot. Could this sometimes be important? (June10) Still at the level of something to list, but we're not excited by it. Any evidence of literature about this? (June11) Actually, this could be great for not-very-non-deterministic problems, e.g. the result of an interpolated value function. Here, for each grid corner, you do one policy simulation and compute the costs. Total cost of policy evaluation = (Number of Grid Points) * (Typical Simulation Length). Compare to cost of Matrix Inversion Policy evaluation, which could be O(Number of Grid Points cubed), and this might look attractive.
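
Here is the Gauss-Seidel evaluation sweep referred to in idea (1), a sketch only: it reuses the state_t / action_t structures and GAMMA from the value-iteration sketch above, and policy[i] is the action index the current policy picks at state i.

      /* Gauss-Seidel policy evaluation (idea 1): in-place backups under
       * a fixed policy, repeated until the largest change drops below
       * tol.  Reuses state_t / action_t / GAMMA from the earlier sketch. */
      void gs_policy_eval(state_t *mdp, int *policy, double *v,
                          int n_states, double tol)
      {
          int i, k;
          double max_change;

          do {
              max_change = 0.0;
              for (i = 0; i < n_states; i++) {
                  action_t *ac = &mdp[i].act[policy[i]];
                  double q = ac->cost;
                  for (k = 0; k < ac->n_succ; k++)
                      q += GAMMA * ac->succ[k].prob * v[ac->succ[k].next];
                  /* in-place: later states in this sweep see the new value */
                  if (fabs(q - v[i]) > max_change)
                      max_change = fabs(q - v[i]);
                  v[i] = q;
              }
          } while (max_change > tol);
      }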

Puterman's Modified Policy Iteration

We need Puterman's book. (June10) We have Puterman's book. It's in the lab. Looks good. We must ask ourselves "What will our survey do that this book doesn't?" I suspect the answer strongly lies in the computational issues, where we can become the de facto experts.

Scott's evaluation of Puterman's initial MPI paper reveals some strangeness in the results (plus it wasn't implemented on any really excitingly large problems). We should implement and test it.
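
As a starting point for that implementation, here is a sketch of the basic scheme as we understand it (not checked against the details in Puterman's book): alternate a greedy improvement pass with m partial-evaluation sweeps under the fixed policy. m = 1 recovers value iteration, and large m approaches full policy iteration.

      /* Modified policy iteration sketch.  Reuses state_t / action_t /
       * GAMMA from the value-iteration sketch. */
      void modified_policy_iteration(state_t *mdp, int *policy, double *v,
                                     int n_states, int m, double tol)
      {
          int i, a, k, sweep;
          double max_change;

          do {
              /* Improvement: make the policy greedy wrt the current v. */
              max_change = 0.0;
              for (i = 0; i < n_states; i++) {
                  double best = HUGE_VAL;
                  for (a = 0; a < mdp[i].n_actions; a++) {
                      action_t *ac = &mdp[i].act[a];
                      double q = ac->cost;
                      for (k = 0; k < ac->n_succ; k++)
                          q += GAMMA * ac->succ[k].prob * v[ac->succ[k].next];
                      if (q < best) { best = q; policy[i] = a; }
                  }
                  if (fabs(best - v[i]) > max_change)
                      max_change = fabs(best - v[i]);
                  v[i] = best;
              }
              /* Partial evaluation: m in-place sweeps, policy held fixed. */
              for (sweep = 0; sweep < m; sweep++) {
                  for (i = 0; i < n_states; i++) {
                      action_t *ac = &mdp[i].act[policy[i]];
                      double q = ac->cost;
                      for (k = 0; k < ac->n_succ; k++)
                          q += GAMMA * ac->succ[k].prob * v[ac->succ[k].next];
                      v[i] = q;
                  }
              }
          } while (max_change > tol);
      }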

Also, look into action elimination procedures. (Puterman's book suggests large speedups, though again this seems a bit surprising.)

Momentum

Andrew will try to find literature that discusses whether a momentum term in value iteration helps. Related to over-relaxation. (June10) Yes: this type of technique looks promising and worth implementing and testing. But let's save the discussion for Aitken, below.

Aitken Acceleration

See \cite{ConteDeboor80} for a nice discussion (pages 95-100). Basic idea: Suppose you have a sequence of scalars x(1), x(2), x(3), ... that you are generating by some iteration

      x(n+1) = g(x(n))

and you're expecting them to converge linearly (i.e. number of decimal places of accuracy is approx linear in N) to a final value z, where z = g(z).

Suppose you have an estimate of g'(z).

Then beginning with the following

      g(z) = z
      x(n+1) = g(x(n))
      g'(z) =approximately= g(x(n)) - g(x(n-1))
                            -------------------
                               x(n) - x(n-1)
      
      g(z+h) =approximately= g(z) + h g'(z)

Elementary algebra then gets you an estimate of the z you're converging to, from x(n-1), x(n), and x(n+1):

      z =approximately= zestimate =
                        x(n+1) - (x(n+1) - x(n))^2 / (x(n+1) - 2x(n) + x(n-1))

and zestimate is a much better estimate of z than x(n+1) subject to all sorts of weaselly assumptions.

How do you make this accelerate the convergence of a series? One way is Steffensen Iteration:

      Start at point xx(0)
      
      For n = 0 , 1 , ....
      
          x0 = xx(n)
          x1 = g(x0)
          x2 = g(x1)
      
          xx(n+1) = x2 - (x2 - x1)^2 / (x2 - 2x1 + x0)
      
      Until xx() series has converged (testing this is a whole different story).

I propose we try accelerating the sequence of values for each discrete state of a value iteration in this way. This could be done for the Gauss-Seidel policy evaluation step inside policy iteration. Or it could be done within a full value iteration.
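
Here is a sketch of the core extrapolation step, guarded against the near-zero denominator where the weaselly assumptions bite (the guard threshold is an arbitrary choice):

      /* Aitken extrapolation from three successive iterates x0, x1, x2
       * of a linearly converging sequence.  Falls back on the plain
       * iterate when the denominator is too small to trust. */
      #include <math.h>

      double aitken(double x0, double x1, double x2)
      {
          double denom = x2 - 2.0 * x1 + x0;
          if (fabs(denom) < 1e-12)
              return x2;
          return x2 - (x2 - x1) * (x2 - x1) / denom;
      }

Per-state acceleration then amounts to keeping the last three backed-up values of each state (the buffer names are illustrative) and periodically replacing v[i] with aitken(v_old2[i], v_old1[i], v[i]) -- Steffensen iteration applied componentwise.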

Successive Over-relaxation

(June10) As described above, looks unpromising for whole matrices, but could be done state by state during value iteration. This is equivalent to using momentum in neural net training. Easy to investigate, and worth doing so.
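
The per-state version is a one-line change to an in-place backup; a sketch, with q standing for the ordinary min-over-actions backup computed as in the vi_sweep sketch earlier, and w a relaxation weight we would have to tune empirically:

      /* Over-relaxed in-place backup: step past the ordinary backed-up
       * value q by a factor w.  w = 1 is a plain Gauss-Seidel backup;
       * 1 < w < 2 over-relaxes. */
      double sor_update(double v_old, double q, double w)
      {
          return v_old + w * (q - v_old);
      }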

TD / Q-learning

We think this is useless for this case of a discrete MDP with a known model. I bet Rich (Sutton) would disagree. We should ask him.

Linear Programming

Suggested by Mike Trick at GSIA (CMU). (He claims that, contrary to conventional wisdom, this is practical.) (June10) Far too complex for us to try. We may be able to convince Justin, who is turning into a serious LPhead, to do this for us.
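
For the record, the textbook LP formulation of a discounted cost-minimizing MDP (standard material; we have not checked Trick's practicality claim) is:

      maximize    sum_i V(i)

      subject to  V(i) <= cost(i,a) + gamma * sum_j P(j|i,a) V(j)
                  for every state i and every action a

with one variable per state and one constraint per state-action pair, which is why the conventional wisdom calls it impractical for large MDPs.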

State Aggregation

(June10) Sounds surprisingly naive <<<Jeff, I'm running out of time. Could you describe it, and why it seems so naive here. This is in g/doc/dp.doc>>>

Priority Queues and other fancy backup scheduling methods

(June10) Jeff has started a fun line of research here. The email concerning this is in the Backup Scheduling section below.
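
For when we implement this, here is a sketch of the supporting data structure: a binary max-heap with a position index, so a queued state's priority can be raised cheaply when one of its successors changes. The heap itself is generic; the priority rule is exactly what's under debate in the email thread. All names are illustrative.

      /* Binary max-heap keyed on priority, with pos[] tracking where
       * each state sits in the heap (0 = not queued).  The caller
       * allocates items[] and pos[] and zeroes pos[].  Slot 0 of
       * items[] is unused so parent/child arithmetic stays simple. */
      typedef struct {
          int    state;
          double pri;
      } heap_item;

      typedef struct {
          heap_item *items;      /* items[1..n] hold the queue */
          int       *pos;        /* pos[state] = index in items, 0 if absent */
          int        n;
      } heap_t;

      static void heap_swap(heap_t *h, int a, int b)
      {
          heap_item t = h->items[a];
          h->items[a] = h->items[b];
          h->items[b] = t;
          h->pos[h->items[a].state] = a;
          h->pos[h->items[b].state] = b;
      }

      static void heap_up(heap_t *h, int i)
      {
          while (i > 1 && h->items[i].pri > h->items[i / 2].pri) {
              heap_swap(h, i, i / 2);
              i /= 2;
          }
      }

      static void heap_down(heap_t *h, int i)
      {
          for (;;) {
              int c = 2 * i;
              if (c > h->n)
                  return;
              if (c + 1 <= h->n && h->items[c + 1].pri > h->items[c].pri)
                  c++;
              if (h->items[i].pri >= h->items[c].pri)
                  return;
              heap_swap(h, i, c);
              i = c;
          }
      }

      /* Insert a state, or raise its priority if it is already queued. */
      void heap_promote(heap_t *h, int state, double pri)
      {
          int i = h->pos[state];
          if (i == 0) {                          /* not present: insert */
              i = ++h->n;
              h->items[i].state = state;
              h->items[i].pri = pri;
              h->pos[state] = i;
          } else if (pri <= h->items[i].pri) {
              return;                            /* existing priority wins */
          } else {
              h->items[i].pri = pri;
          }
          heap_up(h, i);
      }

      /* Remove and return the state with the largest priority. */
      int heap_pop(heap_t *h)
      {
          int s = h->items[1].state;
          h->pos[s] = 0;
          h->items[1] = h->items[h->n--];
          if (h->n > 0) {
              h->pos[h->items[1].state] = 1;
              heap_down(h, 1);
          }
          return s;
      }

The backup loop then pops the highest-priority state, backs it up, and calls heap_promote on each predecessor with whatever priority rule is being tested.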

Clever Stopping criteria

Finding Policies for Continuous State-space dynamic problems

Convert to Discrete MDP by gridding

Convert to Discrete MDP by bilinear interpolation

Convert to Discrete MDP by triangulation

Multigrid

Neural Net value function approximation

Variable Resolution Partitioning

Finding (locally optimal) trajectories (to goal regions)

Latombe-style search

Local trajectory optimization

Atkebubbles

Bibliography

(Folks; start making a bibliography here...or else suggest an already-under-construction bibliography we should use).

Calendar

Tuesday June 4th, 3.30pm: Scott, Andrew, Jeff will have reviewed some of the least-understood methods above and will report their findings.

(June10) Tues June 11th, 3.30pm: Scott, Andrew, Jeff meet. Start talking about C-coding.

Backup Scheduling

Backup Ordering Challenge

Hi Guys,

This little chunk of MATLAB code computes the optimal values:

      %% V(0) = 0;
      %% V(1) = 1 + 1/3 V(2)
      %% V(i) = 1 + 1/3 V(i+1) + 2/3 V(i-1)
      %% V(25) = 1 + V(24)
      %% Solve the above equations.  Ignore V(0).
      v1 = ones(1,25);
      v2 = -2/3 * ones(1,24);
      v3 = -1/3 * ones(1,24);
      A = diag(v1,0) + diag(v2,-1) + diag(v3,1);
      A(25,24) = -1.0;
      b = ones(25,1);
      v = inv(A)*b
      
      
      The values for 1-25 (V(0) = 0.0) are:
          3.0000    6.0000    9.0000   12.0000   15.0000   18.0000   21.0000
         24.0000   26.9999   29.9999   32.9998   35.9995   38.9990   41.9980
         44.9961   47.9922   50.9844   53.9688   56.9375   59.8750   62.7500
         65.5000   68.0000   70.0000   71.0000

Jeff.

Re: Backup Ordering Challenge

      From: Jeff.Schneider@J.GP.CS.CMU.EDU
      To: Andrew.W.Moore@J.GP.CS.CMU.EDU, Jeff.Schneider@J.GP.CS.CMU.EDU,
          Scott.Davies@J.GP.CS.CMU.EDU

Note the long interval between this and my previous note (as compared with the interval between the previous 2). That should tell you something about how I did on my predictions :-)

So here's my summary:

      Randomly choose backups:                           5568 iterations  (2)
      
      Loop through the states 1-25 and repeat:           2375 iterations  (0)
      
      Loop through the states 25-1 and repeat:           2951 iterations  (1)
      
      Alternate forward and backward (as in SSOR):       2775 iterations  (4)
      
      Priority sweeps (priority is the max of the 
      impacts from a change in either neighbor):         3682 iterations  (3)

I was disgusted by that last one, so I started to wonder about the right way to compute priority. The method I used above was just one possibility. Finally, I decided there was an implicit assumption that the right thing to do was to find the one that would have the largest change.

      So I implemented that cheat:              4265 iterations!! (3)

So at this point, I'm throwing my hands in the air and asking for your reactions. I can't believe that last number, but if it's true it certainly makes a statement about this kind of prioritization for iterative solutions on known models.

I'll offer my code to anyone who wants to see this stuff in action for themselves (or implement wiser backup orderings). It is called wip3. The command line arguments are:

      print_interval - (default 100) print the value function values out every 
                       print_interval steps.  0 means never print.
      show_interval  - (default 0) as in print_interval, except it's a graphic
                       illustration of the value and priority functions.  The top
                       plot is the value function and does not erase, so you can
                       see it filling itself in.  The bottom plot is the priority
                       values, and they do erase and redraw each time.  There is
                       also a pathetic attempt to keep them at a reasonable scale,
                       so they bounce around a bit.
      backup_type    - this is the important one.  The code describes which types
                       are which.  Those cryptic numbers in parentheses after the
                       results are the corresponding types.
      priority_type  - 0 means use the priority scheme I described first.  1 means
                       use the cheat of changing the one that will change the most.
                       This parameter only matters for backup_type 3, but the plot
                       shows the correct values even when not using backup_type 3.
      seed           - random seed for the random chooser
      verbosity      - at 0 or lower, the plotting goes fast to make an animation.
                       At higher levels, there is a wait_for_key after each plot.

One final request. The result with priority_type 1 is bad enough, but I would like someone to explain the appearance of the value function plot with it. The fastest way to see it is:

      wip3 backup_type 3 show_interval 1 print_interval 0 priority_type 1

Maybe it's just some artifact of the pixel spacing, but none of the others look like this. It makes me suspicious of the code, but I've checked it a couple of times.

Subject: Re: Backup Ordering Challenge

      From: Scott Davies <scottd@lump.sp.cs.cmu.edu>
      To: schneide@cs.cmu.edu, awm@cs.cmu.edu

Many thanks to Jeff for providing his code. The animations are fun to watch. :-)

I've been playing with a few off-the-wall heuristics over the last day, and I haven't had too much luck either. The one thing I did find that did marginally better than Gauss-Seidel was to use something close to Jeff's "cheating" prioritized sweeping, but dividing that weight by the square of the state number, thus concentrating backups close to the goal at the beginning. (Result: 2270 backups, for a whopping savings of 105 backups over Gauss-Seidel. Wheee! Not to mention the fact that this is obviously a completely domain-specific hack.) Most of the screwier ideas I had before running these experiments had to be modified or tossed -- not just any priority heuristic will do, since there's an excellent chance that the system will get wedged into repeating zero-change backups unless you make absolutely sure that that can't happen.

What code I've kept is in the sdwip directory; the executable is wip3. It uses priority_type 2, which sticks the magnitude of the last backup in P; the actual backup heuristics, 7 through 9, may use various things other than P to figure out their actual priorities. (Putting the magnitudes of the last backups in P makes for some moderately informative animations.) Heuristic 9 is the one that gave the improvement; the others were pretty miserable compared to Gauss-Seidel, though not much worse than Jeff's prioritized sweeping stuff. (7 is a multiply-the-"cheating"-prioritized-sweeping-value-by-a-difference-between-the-magnitudes-of-the-neighbors'-backups hack. 8 is a divide-the-"cheating"-prioritized-sweeping-value-by-the-second-moment-of-the-magnitudes-of-backups-taken-around-the-state-in-question hack. Don't ask me why I thought those were good ideas.)

It is pretty annoying that Gauss-Seidel is so hard to beat, but I suspect that's largely an aspect of this particular domain. I'd bet that in larger, more uneven state spaces, prioritized sweeping-type heuristics would probably fare comparatively better. (But, I wouldn't bet too much.)

-Scott

Subject: Re: Backup Ordering Challenge

      From: Jeff.Schneider@J.GP.CS.CMU.EDU
      To: awm@CS.CMU.EDU, schneide@CS.CMU.EDU, scottd@LUMP.SP.CS.CMU.EDU

One extra thing I tried yesterday (which I already mentioned to Andrew).

Given the failure of various "smart" schemes and the relative success of sweeping through the states forward and/or backward, I wondered if the only thing that really mattered was that you hit all the states somewhat frequently. So I implemented a method that randomly orders the 25 states and then repeatedly cycles through that ordering. The result of 10000 trials:

      avg: 2664
      std dev: 36
      range: [2536 2798]
      
      compare to:
      sweep through backward from goal: 2375
        "      "    forward to goal:    2951

So that average is almost exactly between the two special cases, and the variance is small. That supports my most recent hypothesis: the most important thing is hitting all the states frequently, and the order in which you do so isn't particularly important.

Jeff.