Dynamic Programming Recipes in C

CONTENTS

Goals
Products
Initial Paper Goals
Focus
Example problems we'll use
Main technologies to be implemented
Discrete large, non-deterministic-but-sparse MDPs
    Value Iteration
    Policy Iteration
    Puterman's Modified Policy Iteration
    Momentum
    Aitken Acceleration
    Successive Over-relaxation
    TD / Q-learning
    Linear Programming
    State Aggregation
    Priority Queues and other fancy backup scheduling methods
    Clever Stopping criteria
Finding Policies for Continuous State-space dynamic problems
    Convert to Discrete MDP by gridding
    Convert to Discrete MDP by bilinear interpolation
    Convert to Discrete MDP by triangulation
    Multigrid
    Neural Net value function approximation
    Variable Resolution Partitioning
Finding (locally optimal) trajectories (to goal regions)
    Latombe-style search
    Local trajectory optimization
    Atkebubbles
Bibliography
Calendar
Backup Scheduling (the Backup Ordering Challenge emails)

Goals

This is a project involving Scott Davies, Gary Boone, Andrew Moore, Chris Atkeson, Jeff Schneider, Anthony Dilello, and hopefully other interested parties, over Summer 96, to implement, improve, and apply practical techniques for Dynamic Programming, Optimal Control and Reinforcement Learning problems.

Products

We hope to have a thorough bibliography of the area.

We hope to have a survey of the important literature of the area.

We hope to have a readable, useful, "Numerical Recipes in C"-style description of the most useful DP techniques, which will become a widely used journal paper in the area.

We will have a good, blazingly fast C library of DP / RL / Optimal Control techniques, possibly embedded in a Matlab toolbox or other suitable delivery system, delivered free across the Web and elsewhere as a resource for researchers and industry.

We will have a set of case studies demonstrating the use of the methods and some of the trade-offs.

Ideally (but there may not be time this summer), we will also attack Model Learning problems (the same framework as the recent plethora of Acrobot papers). This includes combining a learned model with DP and Exploration.

Initial Paper Goals

(June10) Chris suggests: I think it should be two papers:

Paper 1: Numerical Recipes for Discrete State Problems
Paper 2: Numerical Recipes for Continuous State Problems

In addition to truly continuous state methods, Paper 2 also talks about transforming continuous state problems into discrete state problems, but refers to Paper 1 for material on how to solve the transformed problem efficiently.

Focus

I guess we're most interested in continuous-state dynamic control problems specified by goals and/or arbitrary reward functions. We expect to focus on 1-5 dimensional state spaces, but with an eye to coping with higher dimensional problems that are "simple" in some respects.

Example problems we'll use:

Puddle-world (obstacle avoidance on a 2-d grid)
Double Integrator (2-d state space)
Car-on-hill (2-d)
Non-holonomic vehicles (3-d) (Is this too different from the rest?)
Cart-pole stabilization (4-d)
Cart-pole minimal-time maneuvers (4-d)
Two-joint arm
Ball-tracking/catching/hitting (4-d to 8-d)
(June10) Anthony now has a nice, fairly realistic simulation of this, and a primitive arm-controlling, ball-hitting algorithm. And a good animation. Those of you who want to see it should do the following: the executable is named basket and is in the basket directory. There are no special command line arguments.

Acrobot Sutton-task (4-d)
Acrobot swing-up and balance vertically (4-d)
Aircraft Landing (using Navy simulation) (?-d)
Food cooking (?-d)

Main technologies to be implemented

Discrete large, non-deterministic-but-sparse MDPs

These are the kinds of systems you end up needing to solve when you use an interpolating function approximator in a continuous state problem. Possible technologies are...

Value Iteration

We all know about this... I think? (A minimal C sketch of one value-iteration sweep over a sparse MDP appears after the Momentum subsection below.)

Policy Iteration

We know the algorithm, but have we ever used it? (When) does it work? Key question: how do we practically do the policy evaluation step? Ideas:

(1) Gauss-Seidel sweeps -- we think this never buys you anything you wouldn't have got with value iteration. Maybe if there were thousands of actions per state it would be a win. (June10) No change in our thinking here.

(2) Direct matrix inversion: too expensive above, say, 30 states. (June10) No change in our thinking here.

(3) Fancy sparse/stochastic matrix inversion. Is SOR the thing? (June10) We looked at \cite{Young88}, which is a pretty good description of iterative (as opposed to direct) methods for inverting sparse matrices. Bottom line is that SOR looks fairly hopeless because it requires a more special form of the matrix than we can guarantee. There are obscure references to general SOR techniques \cite{YoungEidson70,YoungHuang83,Huang83} that Scott is tracking. (June10) Or a conjugate gradient thing? Yes, looks promising! Scott is conversing with Numerical Recipes in C at this moment. We need to be sure the NRIC implementation doesn't need O(N^2) matrix storage space. We think this should be implemented and tested (candidate for the Auton/GT seal of approval).

(4) Monte Carlo simulation of the value of a policy by, well, running it a lot. Could this sometimes be important? (June10) Still at the level of something to list, but we're not excited by it. Any evidence of literature about this? (June11) Actually, this could be great for not-very-non-deterministic problems, e.g. the kind of MDP that results from an interpolated value function. Here, for each grid corner, you do one policy simulation and compute the costs. Total cost of policy evaluation = (Number of Grid Points) * (Typical Simulation Length). Compare that to the cost of matrix-inversion policy evaluation, which could be O(Number of Grid Points cubed), and this might look attractive. (A rough C sketch of this also appears after the Momentum subsection below.)

Puterman's Modified Policy Iteration

We need Puterman's book. (June10) We have Puterman's book. It's in the lab. Looks good. We must ask ourselves "What will our survey do that this book doesn't?". I suspect the answer strongly lies in the computational issues, where we can become the de facto experts. Scott's evaluation of Puterman's initial MPI paper reveals some strangeness in the results (plus it wasn't implemented on any really excitingly large problems). We should implement and test it. Also, look into action elimination procedures. (Puterman's book suggests large speedups, though again this seems a bit surprising.)

Momentum

Andrew will try to find literature that discusses whether a momentum term in value iteration helps. Related to over-relaxation. (June10) Yes: this type of technique looks promising and worth implementing and testing. But let's save the discussion for Aitken Acceleration, below.
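To keep the discussion above concrete, here is a minimal C sketch of one value-iteration sweep over a sparse MDP. This is not the agreed library interface; the sparse_mdp layout, GAMMA, and all the names are placeholders. The policy-evaluation variants above are small modifications of the same loop (fix the action to policy[s], or update v in place for Gauss-Seidel sweeps).

#include <math.h>

#define GAMMA 0.99   /* discount factor; placeholder value */

/* Sparse MDP: for each (state, action) pair we store only the successors
   with nonzero probability.  Layout is purely illustrative. */
typedef struct {
  int     n_states, n_actions;
  int     **n_succ;   /* n_succ[s][a]: number of successors of (s,a)      */
  int     ***succ;    /* succ[s][a][k]: index of the k-th successor state */
  double  ***prob;    /* prob[s][a][k]: probability of that successor     */
  double  **cost;     /* cost[s][a]: expected one-step cost               */
} sparse_mdp;

/* One Jacobi-style value-iteration sweep (costs are minimized): reads
   v_old, writes v_new, and returns the max-norm change so the caller can
   decide when to stop iterating. */
double vi_sweep(const sparse_mdp *m, const double *v_old, double *v_new)
{
  double max_change = 0.0;
  for (int s = 0; s < m->n_states; s++) {
    double best = HUGE_VAL;
    for (int a = 0; a < m->n_actions; a++) {
      double q = m->cost[s][a];
      for (int k = 0; k < m->n_succ[s][a]; k++)
        q += GAMMA * m->prob[s][a][k] * v_old[m->succ[s][a][k]];
      if (q < best) best = q;
    }
    v_new[s] = best;
    double change = fabs(best - v_old[s]);
    if (change > max_change) max_change = change;
  }
  return max_change;
}

The caller repeats vi_sweep (swapping v_old and v_new) until the returned residual drops below some tolerance; what tolerance to use is exactly the "Clever Stopping criteria" question below.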
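In the same spirit, here is a rough sketch of idea (4) from the Policy Iteration subsection: Monte Carlo policy evaluation by a single rollout per state, reusing the sparse_mdp layout and GAMMA from the sketch above. Again, every name here is illustrative rather than settled interface, and a real version would average several rollouts and stop early at absorbing states.

#include <stdlib.h>

/* Estimate V(s0) under a fixed policy by simulating the policy from s0
   and accumulating discounted cost (idea (4) above). */
double mc_policy_value(const sparse_mdp *m, const int *policy,
                       int s0, int max_steps)
{
  double total = 0.0, discount = 1.0;
  int s = s0;
  for (int t = 0; t < max_steps; t++) {
    int a = policy[s];
    total += discount * m->cost[s][a];
    discount *= GAMMA;
    /* Sample the successor from the sparse transition list. */
    double u = rand() / (RAND_MAX + 1.0);
    double cum = 0.0;
    int next = m->succ[s][a][m->n_succ[s][a] - 1];   /* fallback */
    for (int k = 0; k < m->n_succ[s][a]; k++) {
      cum += m->prob[s][a][k];
      if (u <= cum) { next = m->succ[s][a][k]; break; }
    }
    s = next;
  }
  return total;
}

The total work is (number of states) * (rollout length), which is the comparison against O(N^3) direct inversion made in the (June11) note above.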
Aitken Acceleration

See \cite{ConteDeboor80} for a nice discussion (pages 95-100). Basic idea: suppose you have a sequence of scalars x(1), x(2), x(3), ... generated by some iteration x(n+1) = g(x(n)), and you expect them to converge linearly (i.e. the number of decimal places of accuracy is approximately linear in n) to a final value z, where z = g(z). Suppose you also have an estimate of g'(z). Then, starting from

    g(z) = z
    x(n+1) = g(x(n))
    g'(z) ~= (g(x(n)) - g(x(n-1))) / (x(n) - x(n-1))
    g(z + h) ~= g(z) + h g'(z)

elementary algebra gets you an estimate of the z you are converging to, using only x(n-1), x(n) and x(n+1):

    z ~= zestimate = x(n+1) - (x(n+1) - x(n))^2 / (x(n+1) - 2 x(n) + x(n-1))

and zestimate is a much better estimate of z than x(n+1), subject to all sorts of weaselly assumptions.

How do you make this accelerate the convergence of a series? One way is Steffensen iteration:

    Start at point xx(0)
    For n = 0, 1, ...
        x0 = xx(n)
        x1 = g(x0)
        x2 = g(x1)
        xx(n+1) = x2 - (x2 - x1)^2 / (x2 - 2 x1 + x0)
    until the xx() series has converged (testing this is a whole different story).

I propose we try accelerating the sequence of values for each discrete state of a value iteration in this way. This could be done for the Gauss-Seidel policy evaluation step inside policy iteration, or within a full value iteration. (A small C sketch of the extrapolation step appears just before the Backup Scheduling section below.)

Successive Over-relaxation

(June10) As described above, this looks unpromising for whole matrices, but it could be applied to the values of individual states during value iteration. That is equivalent to using momentum in neural net training. Easy to investigate, and worth doing.

TD / Q-learning

We think this is useless for this case of a discrete MDP with a known model. I bet Rich (Sutton) would disagree. We should ask him.

Linear Programming

Suggested by Mike Trick at GSIA (CMU). (He claims that, contrary to conventional wisdom, this is practical.) (June10) Far too complex for us to try. We may be able to convince Justin, who is turning into a serious LP-head, to do this for us.

State Aggregation

(June10) Sounds surprisingly naive.

Priority Queues and other fancy backup scheduling methods

(June10) Jeff has started a fun line of research here. The email concerning this is collected in the Backup Scheduling section below.

Clever Stopping criteria

Finding Policies for Continuous State-space dynamic problems

Convert to Discrete MDP by gridding

Convert to Discrete MDP by bilinear interpolation

Convert to Discrete MDP by triangulation

Multigrid

Neural Net value function approximation

Variable Resolution Partitioning

Finding (locally optimal) trajectories (to goal regions)

Latombe-style search

Local trajectory optimization

Atkebubbles

Bibliography

(Folks: start making a bibliography here... or else suggest an already-under-construction bibliography we should use.)

Calendar

Tuesday June 4th, 3.30pm: Scott, Andrew and Jeff will have reviewed some of the least-understood methods above and will report their findings.

(June10) Tuesday June 11th, 3.30pm: Scott, Andrew and Jeff meet. Start talking about C-coding.
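Before the backup-scheduling emails below, here is a minimal C sketch of the extrapolation step proposed in the Aitken Acceleration section above. It is only an illustration of the formula (with a guard against a near-zero denominator); the function name and the 1e-12 threshold are arbitrary choices, and whether the trick actually pays off per-state inside value iteration is exactly what we need to test.

#include <math.h>

/* Aitken extrapolation: given three successive iterates x0, x1, x2 of a
   (roughly) linearly converging sequence, estimate its limit
       z ~= x2 - (x2 - x1)^2 / (x2 - 2*x1 + x0).
   Falls back to x2 when the denominator is too small to trust. */
double aitken_estimate(double x0, double x1, double x2)
{
  double denom = x2 - 2.0 * x1 + x0;
  if (fabs(denom) < 1e-12)
    return x2;
  double d = x2 - x1;
  return x2 - d * d / denom;
}

In a Steffensen-style scheme the caller applies the underlying backup twice to get x1 and x2 from x0, replaces x0 by aitken_estimate(x0, x1, x2), and repeats; applied per state, that just means keeping each state's previous two values around during the sweeps.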
Backup Scheduling

Backup Ordering Challenge

Hi Guys,

This little chunk of MATLAB code computes the optimal values:

%% V(0) = 0;
%% V(1) = 1 + 1/3 V(2)
%% V(i) = 1 + 1/3 V(i+1) + 2/3 V(i-1)
%% V(25) = 1 + V(24)
%% Solve the above equations.  Ignore V(0).
v1 = ones(1,25);
v2 = -2/3 * ones(1,24);
v3 = -1/3 * ones(1,24);
A = diag(v1,0) + diag(v2,-1) + diag(v3,1);
A(25,24) = -1.0;
b = ones(25,1);
v = inv(A)*b

The values for states 1-25 (V(0) = 0.0) are:

3.0000 6.0000 9.0000 12.0000 15.0000 18.0000 21.0000 24.0000 26.9999
29.9999 32.9998 35.9995 38.9990 41.9980 44.9961 47.9922 50.9844 53.9688
56.9375 59.8750 62.7500 65.5000 68.0000 70.0000 71.0000

Jeff.

Re: Backup Ordering Challenge
From: Jeff.Schneider@J.GP.CS.CMU.EDU
To: Andrew.W.Moore@J.GP.CS.CMU.EDU, Jeff.Schneider@J.GP.CS.CMU.EDU, Scott.Davies@J.GP.CS.CMU.EDU

Note the long interval between this and my previous note (as compared with the interval between the previous two). That should tell you something about how I did on my predictions :-) So here's my summary:

Randomly choose backups: 5568 iterations (2)
Loop through the states 1-25 and repeat: 2375 iterations (0)
Loop through the states 25-1 and repeat: 2951 iterations (1)
Alternate forward and backward (as in SSOR): 2775 iterations (4)
Priority sweeps (priority is the max of the impacts from a change in either neighbor): 3682 iterations (3)

I was disgusted by that last one, so I started to wonder about the right way to compute priority. The method I used above was just one possibility. Finally, I decided there was somewhat of an implicit assumption that the right thing to do was to find the one that would have the largest change. So I implemented that cheat: 4265 iterations!! (3)

So at this point I'm throwing my hands in the air and asking for your reactions. I can't believe that last number, but if it's true it certainly makes a statement about this kind of prioritization for iterative solutions on known models. I'll offer my code to anyone who wants to see this stuff in action for themselves (or implement wiser backup orderings). It is called wip3. The command line arguments are:

print_interval - (default 100) print the value function values out every print_interval steps. 0 means never print.
show_interval - (default 0) as print_interval, except it's a graphic illustration of the value and priority functions. The top plot is the value function and does not erase, so you can see it filling itself in. The bottom plot is the priority values, and they do erase and redraw each time. There is also a pathetic attempt to keep them at a reasonable scale, so they bounce around a bit.
backup_type - this is the important one. The code describes which types are which. Those cryptic numbers in parentheses after the results above are the corresponding types.
priority_type - 0 means use the priority scheme I described first. 1 means use the cheat of changing the one that will change the most. This parameter only matters for backup_type 3, but the plot shows its correct values even when not using backup_type 3.
seed - random seed for the random chooser.
verbosity - at 0 or lower, the plotting goes fast to make an animation. At higher levels, there is a wait_for_key after each plot.

One final request. The result with priority_type 1 is bad enough, but I would like someone to explain the appearance of the value function plot with it. The fastest way to see it is:

wip3 backup_type 3 show_interval 1 print_interval 0 priority_type 1

Maybe it's just some artifact of the pixel spacing, but none of the others look like this. It makes me suspicious of the code, but I've checked it a couple of times.

Subject: Re: Backup Ordering Challenge
From: Scott Davies
To: schneide@cs.cmu.edu, awm@cs.cmu.edu

Many thanks to Jeff for providing his code. The animations are fun to watch.
:-)

I've been playing with a few off-the-wall heuristics over the last day, and I haven't had too much luck either. The one thing I did find that did marginally better than Gauss-Seidel was to use something close to Jeff's "cheating" prioritized sweeping, but dividing that weight by the square of the state number, thus concentrating backups close to the goal at the beginning. (Result: 2270 backups, for a whopping savings of 105 backups over Gauss-Seidel. Wheee! Not to mention the fact that this is obviously a completely domain-specific hack.)

Most of the screwier ideas I had before running these experiments had to be modified or tossed -- not just any priority heuristic will do, since there's an excellent chance that the system will get wedged into repeating zero-change backups unless you make absolutely sure that that can't happen.

What code I've kept is in the sdwip directory; the executable is wip3. It uses priority_type 2, which sticks the magnitude of the last backup in P; the actual backup heuristics, 7 through 9, may use various things other than P to figure out their actual priorities. (Putting the magnitudes of the last backups in P makes for some moderately informative animations.) Heuristic 9 is the one that gave the improvement; the others were pretty miserable compared to Gauss-Seidel, though not much worse than Jeff's prioritized sweeping stuff. (7 is a multiply-the-"cheating"-prioritized-sweeping-value-by-a-difference-between-the-magnitudes-of-the-neighbors'-backups hack. 8 is a divide-the-"cheating"-prioritized-sweeping-value-by-the-second-moment-of-the-magnitudes-of-backups-taken-around-the-state-in-question hack. Don't ask me why I thought those were good ideas.)

It is pretty annoying that Gauss-Seidel is so hard to beat, but I suspect that's largely an aspect of this particular domain. I'd bet that in larger, more uneven state spaces, prioritized sweeping-type heuristics would probably fare comparatively better. (But I wouldn't bet too much.)

-Scott

Subject: Re: Backup Ordering Challenge
From: Jeff.Schneider@J.GP.CS.CMU.EDU
To: awm@CS.CMU.EDU, schneide@CS.CMU.EDU, scottd@LUMP.SP.CS.CMU.EDU

One extra thing I tried yesterday (which I already mentioned to Andrew). Given the failure of various "smart" schemes and the relative success of sweeping through the states forward and/or backward, I wondered if the only thing that really mattered was that you hit all the states somewhat frequently. So I implemented a method that randomly orders the 25 states and then repeatedly cycles through that ordering. The result of 10000 trials:

avg: 2664   std dev: 36   range: [2536 2798]

Compare to:

sweep through backward from goal: 2375
sweep through forward to goal: 2951

So that average is almost exactly between the two special cases, and the variance is small. That supports my most recent hypothesis that the most important thing is hitting all the states frequently, and the order you do so in isn't particularly important.

Jeff.
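As a concrete companion to the thread above, here is a small self-contained C sketch of the 25-state chain from Jeff's challenge, solved by repeated in-place (Gauss-Seidel-style) sweeps from state 1 to 25. The stopping tolerance is an arbitrary choice, so the backup count it prints will not match the numbers quoted in the emails exactly; it is only meant as a starting point for anyone who wants to try other backup orderings.

#include <stdio.h>
#include <math.h>

#define N   26      /* states 0..25; state 0 is the absorbing goal, V(0) = 0 */
#define TOL 1e-6    /* arbitrary convergence threshold */

int main(void)
{
  double v[N] = {0.0};     /* start all values at zero */
  long   backups = 0;
  double max_change;

  do {
    max_change = 0.0;
    for (int i = 1; i <= 25; i++) {   /* one in-place sweep, states 1..25 */
      double new_v;
      if (i == 25)
        new_v = 1.0 + v[24];                            /* V(25) = 1 + V(24) */
      else
        new_v = 1.0 + v[i + 1] / 3.0 + 2.0 * v[i - 1] / 3.0;
                                        /* V(i) = 1 + 1/3 V(i+1) + 2/3 V(i-1) */
      double change = fabs(new_v - v[i]);
      if (change > max_change) max_change = change;
      v[i] = new_v;
      backups++;
    }
  } while (max_change > TOL);

  printf("converged after %ld backups\n", backups);
  for (int i = 1; i <= 25; i++)
    printf("V(%2d) = %8.4f\n", i, v[i]);
  return 0;
}

Swapping the sweep for a different ordering (backward, alternating, a random permutation, or a priority queue keyed on the size of the last change) is a one-loop change, which is what makes this a convenient little testbed for the orderings discussed above.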