
Task 3: Multi-armed Bandit Problem

Our third test problem is to compute the optimal policy for a finite-horizon k-armed bandit [Berry and Fristedt, 1985]. While an optimal solution in the infinite-horizon case can be found efficiently using Gittins indices, solving the finite-horizon problem is equivalent to solving a large acyclic, stochastic MDP in belief space [Berry and Fristedt, 1985]. We show results for k=3 arms and a horizon of n=25 pulls, where the resulting MDP has 736,281 states. Solving this MDP by DAG-SP produces the optimal exploration policy, which has an expected reward of 0.6821 per pull.
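
To make the belief-space formulation concrete, the sketch below computes the optimal finite-horizon value by backward induction over per-arm success/failure counts; DAG-SP performs the equivalent computation as a single sweep over the acyclic state graph. The sketch assumes Bernoulli arms with uniform Beta(1,1) priors, so the posterior mean payoff of an arm with w wins and l losses is (w+1)/(w+l+2); these modeling assumptions are ours and need not match the exact formulation used in the experiments. Under this encoding, the reachable count vectors (six non-negative counts summing to at most 25) number C(31,6) = 736,281, which agrees with the state count above.

    from functools import lru_cache

    K = 3   # number of arms (Bernoulli payoffs assumed)
    N = 25  # horizon: total number of pulls

    @lru_cache(maxsize=None)
    def value(counts, pulls_left):
        # counts: tuple of (wins, losses) per arm; the belief about arm i's
        # unknown payoff probability is Beta(1 + wins_i, 1 + losses_i),
        # i.e. a uniform prior updated by the observed outcomes (an assumption).
        if pulls_left == 0:
            return 0.0
        best = 0.0
        for i, (w, l) in enumerate(counts):
            p = (w + 1) / (w + l + 2)  # posterior mean payoff probability of arm i
            win  = counts[:i] + ((w + 1, l),) + counts[i + 1:]
            loss = counts[:i] + ((w, l + 1),) + counts[i + 1:]
            best = max(best, p * (1.0 + value(win, pulls_left - 1))
                             + (1.0 - p) * value(loss, pulls_left - 1))
        return best

    start = tuple((0, 0) for _ in range(K))
    print(value(start, N) / N)  # optimal expected reward per pull

The memoized recursion visits each belief state exactly once, which is the same amount of work as the shortest-path sweep over the acyclic graph.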

We encoded each state as a six-dimensional feature vector (w1, l1, w2, l2, w3, l3), the numbers of payoffs and non-payoffs observed so far on each of the three arms, and attempted to learn a neural network approximation to the optimal value function J* with TD(0), TD(1), and ROUT. Again, the parameters for all algorithms were tuned by hand.
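
For reference, a minimal sketch of the kind of temporal-difference update involved is shown below. It uses a linear approximator over the six count features rather than the neural network used in the experiments, and the feature layout, step size, and function names are illustrative assumptions.

    import numpy as np

    def features(counts):
        # Six-dimensional encoding: the (wins, losses) pair observed on each arm.
        return np.array([c for pair in counts for c in pair], dtype=float)

    def td0_step(theta, counts, reward, next_counts, alpha=0.01, terminal=False):
        # One TD(0) update for a linear approximation V(s) = theta . phi(s);
        # no discounting is used, since the horizon is finite.
        phi = features(counts)
        target = reward + (0.0 if terminal else float(theta @ features(next_counts)))
        return theta + alpha * (target - float(theta @ phi)) * phi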

The results are shown in Table 1. All methods do spectacularly well, although the TD methods again require more trajectories and more evaluations. Careful inspection of the problem reveals that a globally linear value function, extrapolated from the states close to the end, has low Bellman residual and performs very nearly optimally. Both ROUT and TD successfully exploit this linearity.
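
To illustrate what such an inspection might look like, the sketch below measures the Bellman residual of a candidate linear value function under the same Bernoulli/uniform-prior model as the dynamic-programming sketch above; the weights theta are placeholders, not values taken from the experiments, and a bias feature (omitted here for brevity) would normally be included.

    import numpy as np

    def features(counts):
        # Same six count features as in the TD(0) sketch above.
        return np.array([c for pair in counts for c in pair], dtype=float)

    def bellman_residual(theta, counts, pulls_left):
        # |V_theta(s) - max_a E[r + V_theta(s')]| for a linear V_theta.
        v = float(theta @ features(counts))
        if pulls_left == 0:
            return abs(v)
        backed_up = 0.0
        for i, (w, l) in enumerate(counts):
            p = (w + 1) / (w + l + 2)  # posterior mean payoff of arm i
            win  = counts[:i] + ((w + 1, l),) + counts[i + 1:]
            loss = counts[:i] + ((w, l + 1),) + counts[i + 1:]
            q = p * (1.0 + float(theta @ features(win))) \
                + (1.0 - p) * float(theta @ features(loss))
            backed_up = max(backed_up, q)
        return abs(v - backed_up)

Sweeping this check over all states (or a sample of them) and taking the maximum gives a direct measure of how consistent a linear fit is with the Bellman backup.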


