15-381 Artificial Intelligence
Homework 4: Q-learning for Real-time Control

Prof. Tuomas Sandholm

In this homework you may use any sources you want, but you must cite the sources you use. You may not use code written by someone else. Teamwork is not allowed. Due Friday, 12/6/2002, 11:59pm (the write-up can be put under Prof. Sandholm's door). The maximum number of points is 100.

Pat Riley (pfr@cs.cmu.edu) is the main TA for this assignment.

General

This homework studies Q-learning for real-time control. The system to be controlled is a cart that moves on a track, with a stiff, tall pole hinged to the cart (pointing straight up to start). The objective is to keep moving the cart so that the pole does not fall, i.e., to keep the pole balanced.

You should program a Q-learning agent that applies a force to the cart at every time step. The magnitude of the force is fixed, but the agent chooses whether to apply it to the left or to the right. The agent must exert the force at every step.

The code for the dynamics of the cart and the pole is given, as is the code that runs the simulation. You need only program the Q-learning agent. If you do the homework correctly, you should only need about 50 lines of code (possibly fewer) in your methods. However, think carefully about how to initialize the Q-values and how to choose actions. You should use $\alpha=0.3$ (learning rate) and $\gamma=0.999$ (discount factor).
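For reference, $\alpha$ and $\gamma$ enter through the standard one-step tabular Q-learning update, where $r$ is the reward received and $s'$ is the state that results from taking action $a$ in state $s$:

$Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]$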

This homework should take well under 9 hours to complete. Your program should run quite quickly: taking random actions, our solution runs through 100,000 failures in 23 seconds. Our best learning code balances the pole in well under a minute, and other reasonable strategies may take somewhat longer. If your program runs for more than 20 minutes, there is probably a problem.

When you try your program, you should use a variety of random seeds (see The Code section below for how) to ensure that performance is consistent over different random sequences. If your learner only successfully balances the pole for a small fraction of the random seeds, it is not an acceptable solution.


The Code

The code can be found in /afs/andrew/course/15/381/hw4. JavaDoc-generated documentation is available in the doc subdirectory.

We provide you with the following files: CartPoleState.java, I_QLearner.java, and SimRunner.java.

When you run your program, invoke it as java SimRunner seed, where seed is the random seed to use. If you do not specify a seed, a random one is chosen for you.
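For example, java SimRunner 42 runs the simulation with random seed 42 (the seed value here is arbitrary; try several different values).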

Items to turn in

The following items should be in your write-up. Please turn in a hard copy in class or under Prof. Sandholm's door by the due date. It is generally not acceptable to email us the write-up.

  1. Explanation of how you initialize the Q-values (and why you use the method you use).

  2. Explanation of how you choose actions (and why you use the method you use).

  3. Explanation of how you update the Q-values.

  4. A plot that shows (for some random number seed) how the lengths of the epochs (trials) change as more trials are seen. You probably don't want to plot every epoch individually; instead, average together some number of consecutive epochs so that the graph is readable (one way to do this is sketched after this list).

  5. A discussion (1 or 2 paragraphs) of the convergence properties and speed of learning. This discussion should be based on what you learned while trying to find a good method for choosing actions. Make sure that for every case you run the algorithm on at least 10 different random number seeds. Things you might want to discuss: What did you try that did not work? Why do you think your algorithm works? Is there anything you can think of (but didn't implement) that would probably work better? Don't be too wordy here; we just want to know that you have been thinking about these issues.

  6. A list of any sources, outside of your own head, the course textbook, and the course notes, that you used to complete this assignment.

  7. A printout of your code. You do not need to turn in the files we gave you: CartPoleState.java, I_QLearner.java, and SimRunner.java.
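Regarding item 4, here is a minimal sketch of one way to average epoch lengths into bins so the learning curve is readable. It assumes you record each trial's length in an array; the class and variable names are illustrative only and do not come from the provided code.

    public class EpochBinner {
        // Average consecutive epoch (trial) lengths into bins of size binSize
        // so the learning curve is readable.  epochLengths[i] is assumed to
        // hold the number of steps the pole stayed up in trial i.
        public static double[] binAverages(int[] epochLengths, int binSize) {
            int numBins = (epochLengths.length + binSize - 1) / binSize;
            double[] averages = new double[numBins];
            for (int b = 0; b < numBins; b++) {
                int start = b * binSize;
                int end = Math.min(start + binSize, epochLengths.length);
                long sum = 0;
                for (int i = start; i < end; i++) {
                    sum += epochLengths[i];
                }
                averages[b] = (double) sum / (end - start);
            }
            return averages;
        }
    }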

Feel free to experiment with the parameters (e.g., the learning rate $\alpha$, the discount factor $\gamma$, MAX_FAILURES, and MAX_STEPS) and with different initialization and action selection methods if you like.
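To make the pieces above concrete, here is a minimal sketch of a tabular Q-learner with optimistic initialization and epsilon-greedy action selection. This is one possible approach, not necessarily a good one for this task. The use of a discretized state index, the number of state boxes, the exploration rate, and the method names are all assumptions made for illustration; the interface your learner must actually implement is defined in I_QLearner.java.

    import java.util.Random;

    // Sketch of a tabular Q-learner for the cart-pole task.
    // ASSUMPTIONS (not taken from the assignment code): states are already
    // discretized into NUM_STATES boxes, there are 2 actions (push left or
    // push right), and the method names below are illustrative only -- the
    // real interface is defined in I_QLearner.java.
    public class SketchQLearner {
        static final int NUM_STATES = 162;   // assumed size of a state discretization
        static final int NUM_ACTIONS = 2;    // push left or push right
        static final double ALPHA = 0.3;     // learning rate (from the assignment)
        static final double GAMMA = 0.999;   // discount factor (from the assignment)
        static final double EPSILON = 0.05;  // exploration rate (an assumption)

        double[][] q = new double[NUM_STATES][NUM_ACTIONS];
        Random rng = new Random();

        // Optimistic initialization encourages early exploration.
        public SketchQLearner() {
            for (int s = 0; s < NUM_STATES; s++)
                for (int a = 0; a < NUM_ACTIONS; a++)
                    q[s][a] = 1.0;
        }

        // Epsilon-greedy action selection.
        public int chooseAction(int state) {
            if (rng.nextDouble() < EPSILON)
                return rng.nextInt(NUM_ACTIONS);
            return q[state][0] >= q[state][1] ? 0 : 1;
        }

        // Standard one-step Q-learning update.
        public void update(int state, int action, double reward, int nextState) {
            double best = Math.max(q[nextState][0], q[nextState][1]);
            q[state][action] += ALPHA * (reward + GAMMA * best - q[state][action]);
        }
    }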
