Gittins gives an ``allocation index'' method for finding the optimal
choice of action at each step in *k*-armed bandit
problems [40]. The technique applies only under the
discounted expected-reward criterion. For each action, consider the
number of times it has been chosen, *n*, versus the number of times it
has paid off, *w*. For certain discount factors, there are published
tables of ``index values,'' *I*(*n*,*w*) for each pair of *n* and *w*.
Look up the index value for each action *i*, *I*(*n_i*,*w_i*). It
represents a comparative measure of the combined value of the expected
payoff of action *i* (given its history of payoffs) and the value of
the information that we would get by choosing it. Gittins has shown
that choosing the action with the largest index value guarantees the
optimal balance between exploration and exploitation.
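The selection rule above can be sketched as a simple table lookup. In this sketch, the index table and its values are illustrative placeholders, not a real published table of Gittins index values, and the function name is our own:

```python
# Sketch of Gittins-index action selection for a k-armed Bernoulli bandit.
# `index_table` maps (n, w) -> I(n, w), where n is how often an arm has been
# chosen and w how often it has paid off. The values used below are
# illustrative placeholders, NOT actual published Gittins index values.

def select_action(counts, wins, index_table):
    """Return the arm with the largest index value I(n_i, w_i)."""
    best_arm, best_index = 0, float("-inf")
    for arm, (n, w) in enumerate(zip(counts, wins)):
        idx = index_table[(n, w)]
        if idx > best_index:
            best_arm, best_index = arm, idx
    return best_arm

# Two arms: arm 0 chosen 3 times with 1 payoff; arm 1 chosen once with 1 payoff.
counts = [3, 1]
wins = [1, 1]
# Placeholder indices: the less-explored but so-far-successful arm gets the
# larger value, reflecting the information bonus the index encodes.
index_table = {(3, 1): 0.45, (1, 1): 0.80}
chosen = select_action(counts, wins, index_table)
```

Here the rule selects arm 1: its index combines a comparable empirical payoff rate with greater information value, since it has been tried less often.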

Because of the guarantee of optimal exploration and the simplicity of the technique (given the table of index values), this approach holds a great deal of promise for use in more complex applications. This method proved useful in an application to robotic manipulation with immediate reward [98]. Unfortunately, no one has yet been able to find an analog of index values for delayed reinforcement problems.