 
  
  
   
If the agent is going to be acting for a total of $h$ steps,
it can use basic Bayesian reasoning to solve for an optimal
strategy [12].  This requires an assumed prior joint
distribution for the parameters $\{p_i\}$, the most natural of
which is that each $p_i$ is independently uniformly distributed
between 0 and 1.  We compute a mapping from belief states
(summaries of the agent's experiences during this run) to actions.
Here, a belief state can be represented as a tabulation of action
choices and payoffs: $\{n_1, w_1, n_2, w_2, \ldots, n_k, w_k\}$
denotes a state of play in which each arm $i$ has been pulled
$n_i$ times with $w_i$ payoffs.  We write
$V^*(n_1, w_1, \ldots, n_k, w_k)$ for the expected payoff
remaining, given that a total of $h$ pulls are available and the
remaining pulls are used optimally.
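If $\sum_i n_i = h$, then there are no remaining pulls, and
$V^*(n_1, w_1, \ldots, n_k, w_k) = 0$.  This is the basis of a
recursive definition. If we know the $V^*$ value for all belief
states with $t$ pulls remaining, we can compute the $V^*$ value of
any belief state with $t+1$ pulls remaining, counting each payoff
as one unit:

\[
\begin{aligned}
V^*(n_1, w_1, \ldots, n_k, w_k)
  &= \max_i E\left[\text{future payoff if the agent takes action } i
     \text{, then acts optimally for the remaining pulls}\right] \\
  &= \max_i \Big( \rho_i \big(1 + V^*(n_1, w_1, \ldots, n_i + 1, w_i + 1, \ldots, n_k, w_k)\big) \\
  &\qquad\quad + (1 - \rho_i)\, V^*(n_1, w_1, \ldots, n_i + 1, w_i, \ldots, n_k, w_k) \Big)
\end{aligned}
\]
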
where $\rho_i$ is the posterior subjective probability of action
$i$ paying off given $n_i$, $w_i$, and our prior probability. For
the uniform priors, which result in a beta distribution,
$\rho_i = \frac{w_i + 1}{n_i + 2}$.
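The closed form for $\rho_i$ is the mean of the posterior beta
distribution. A brief derivation, writing $B(\cdot,\cdot)$ for the
beta function (notation introduced here for illustration, not taken
from the original text), runs as follows:

\[
\rho_i = E[p_i \mid n_i, w_i]
       = \frac{\int_0^1 p \cdot p^{w_i} (1-p)^{n_i - w_i} \, dp}
              {\int_0^1 p^{w_i} (1-p)^{n_i - w_i} \, dp}
       = \frac{B(w_i + 2,\; n_i - w_i + 1)}{B(w_i + 1,\; n_i - w_i + 1)}
       = \frac{w_i + 1}{n_i + 2}
\]

For example, an arm pulled three times with two payoffs has a
posterior payoff probability of 3/5, and an arm that has never been
pulled has 1/2.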
The expense of filling in the table of $V^*$ values in this way for
all attainable belief states is linear in the number of belief states
times actions, and thus exponential in the horizon.
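The recursion can be sketched in a few lines of Python. This is a
minimal illustration rather than a definitive implementation: it
assumes uniform priors and payoffs worth exactly one unit, and the
names bandit_value and v are hypothetical, not taken from the text.
It is practical only for small numbers of arms and short horizons.

```python
from functools import lru_cache


def bandit_value(k, h):
    """Return (expected payoff, best first arm) for k arms and h total pulls."""

    @lru_cache(maxsize=None)
    def v(state):
        # state is ((n_1, w_1), ..., (n_k, w_k)): pulls and payoffs per arm.
        if sum(n for n, _ in state) == h:
            return 0.0, None  # no pulls remain, so no payoff remains
        best_value, best_arm = float("-inf"), None
        for i, (n, w) in enumerate(state):
            rho = (w + 1) / (n + 2)  # posterior mean under a uniform prior
            win = state[:i] + ((n + 1, w + 1),) + state[i + 1:]
            lose = state[:i] + ((n + 1, w),) + state[i + 1:]
            # Assumption: a payoff is worth exactly 1 unit.
            value = rho * (1.0 + v(win)[0]) + (1.0 - rho) * v(lose)[0]
            if value > best_value:  # ties go to the lowest-numbered arm
                best_value, best_arm = value, i
        return best_value, best_arm

    return v(tuple((0, 0) for _ in range(k)))


if __name__ == "__main__":
    value, arm = bandit_value(k=2, h=2)
    print(value, arm)
```

Memoizing on the belief state means each attainable state is
evaluated once, which mirrors filling in the table of values
described above.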