Gittins gives an ``allocation index'' method for finding the optimal choice of action at each step in k-armed bandit problems. The technique applies only under the discounted expected-reward criterion. For each action, consider the number of times it has been chosen, n, versus the number of times it has paid off, w. For certain discount factors, there are published tables of ``index values,'' I(n,w), for each pair of n and w. Look up the index value for each action i. It represents a comparative measure of the combined value of the expected payoff of action i (given its history of payoffs) and the value of the information that we would gain by choosing it. Gittins has shown that choosing the action with the largest index value guarantees the optimal balance between exploration and exploitation.
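The selection rule above can be sketched as a simple table lookup. This is a minimal illustration, not an implementation of the index computation itself: the table values below are made-up placeholders, since real Gittins indices depend on the discount factor and come from published tables or numerical computation.

```python
# Sketch of Gittins-index action selection for a k-armed Bernoulli bandit.
# The index table maps (n, w) -- times chosen, times paid off -- to an
# index value I(n, w). The values here are illustrative placeholders only.

def select_action(counts, wins, index_table):
    """Choose the arm whose (n, w) history has the largest index value."""
    indices = [index_table[(n, w)] for n, w in zip(counts, wins)]
    return max(range(len(indices)), key=lambda i: indices[i])

# Hypothetical table fragment (NOT actual Gittins indices):
index_table = {(1, 0): 0.55, (1, 1): 0.82, (2, 1): 0.71, (2, 2): 0.88}

# Arm 0 was pulled once and paid off once; arm 1 was pulled twice with one payoff.
counts = [1, 2]
wins = [1, 1]
print(select_action(counts, wins, index_table))  # arm 0: I(1,1) > I(2,1)
```

After the chosen arm is pulled, its entries in `counts` and `wins` are updated and the lookup is repeated, so the rule is computationally trivial once the table is in hand.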
Because of the guarantee of optimal exploration and the simplicity of the technique (given the table of index values), this approach holds a great deal of promise for use in more complex applications. The method has proved useful in an application to robotic manipulation with immediate reward. Unfortunately, no one has yet found an analog of index values for delayed-reinforcement problems.