Internal Reinforcement through Observation


As in any RL approach, the reward function plays a large role in determining what policy is learned. One possible reward function is based entirely upon reaching the ultimate goal. Although goals scored are the true rewards in this domain, such events are very sparse. In order to increase the feedback from actions taken, it is useful to use an internal reinforcement function, which provides feedback based on intermediate states towards the goal. Without exploring the space of possible such functions, we created one reward function R.

R gives rewards for goals scored. However, players also receive rewards if the ball goes out of bounds, or else after a fixed period of time based on the ball's average lateral position on the field. In particular, when a player takes action in state s such that [...], the player records the time t at which the action was taken as well as the x-coordinate of the ball's position at time t, $x_t$. The reward function R takes as input the observed ball position over time (a subset of [...]) and outputs a reward r. Since the ball position over time depends also on other agents' actions, the reward is stochastic and non-stationary. Under the following conditions, the player fixes the reward r:

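[Enumerated list of three conditions missing. From the text that follows: case 1 applies when the ball goes out of bounds; cases 2 and 3 each end an interval over which the ball's x-position is averaged.]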

In case 1, the reward r is based on the out-of-bounds value indicated in Figure 1(b): [...]. Thus, the farther in the future the ball goes out of bounds (i.e., the larger the elapsed time), the smaller the absolute value of r. This scaling by time is akin to the discount factor used in Q-learning. We use [...] and [...].
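The text above fixes only one property of the case 1 reward: its magnitude shrinks the later the ball goes out of bounds. A minimal sketch of one scaling with that property follows. The functional form, the names `r_o`, `t_o`, `t_lim`, and `phi`, and the default values are all assumptions made for illustration; none of them is taken from the text.

```python
def out_of_bounds_reward(r_o: float, t_o: float,
                         t_lim: float = 30.0, phi: float = 10.0) -> float:
    """Scale a base out-of-bounds value by how late the ball left the field.

    r_o   -- base value for the out-of-bounds event (hypothetical name)
    t_o   -- time elapsed between the action and the ball going out (hypothetical)
    t_lim -- longest delay considered (assumed default, illustrative only)
    phi   -- attenuation at t_o == t_lim (assumed default, illustrative only)
    """
    if not 0.0 <= t_o <= t_lim:
        raise ValueError("t_o must lie in [0, t_lim]")
    if phi < 1.0:
        raise ValueError("phi must be >= 1 so that later events count less")
    # Assumed form: full value at t_o == 0, value / phi at t_o == t_lim.
    return r_o / (1.0 + (phi - 1.0) * t_o / t_lim)


if __name__ == "__main__":
    for delay in (0.0, 10.0, 30.0):
        print(delay, out_of_bounds_reward(-1.0, delay))
```

Under these assumptions an immediate out-of-bounds event receives the full base value, and one at the time limit receives that value divided by `phi`, which plays a role similar to a discount factor.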

In cases 2 and 3, the reward r is based on the average x-position of the ball from time t to the end time given by case 2 or case 3. Over that entire time span, the player samples the x-coordinate of the ball at fixed, periodic intervals and computes the average $x_{avg}$ over the times at which the ball position is known. Then if $x_{avg} > x_t$, r is based on the ratio $(x_{avg} - x_t)/(x_{og} - x_t)$, where $x_{og}$ is the x-coordinate of the opponent goal (the right goal in Figure 1(b)). Otherwise, if $x_{avg} \le x_t$, r is based on $-(x_t - x_{avg})/(x_t - x_{lg})$, where $x_{lg}$ is the x-coordinate of the learner's goal. Thus, the reward is the fraction of the available field by which the ball was advanced, on average, over the time period in question. Note that a backwards pass can lead to positive reward if the ball then moves forward in the near future.
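The averaging step for cases 2 and 3 can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the function and parameter names are hypothetical, unobserved samples are represented as `None`, the `scale` constant stands in for any factor the text does not specify, and returning 0 when the ball was never observed is an assumption.

```python
from typing import Optional, Sequence


def average_position_reward(x_t: float,
                            samples: Sequence[Optional[float]],
                            x_opp_goal: float,
                            x_own_goal: float,
                            scale: float = 1.0) -> float:
    """Reward from the ball's average x-position after an action.

    x_t        -- ball x-coordinate when the action was taken
    samples    -- periodic x-coordinate samples; None where the ball was not seen
    x_opp_goal -- x-coordinate of the opponent goal
    x_own_goal -- x-coordinate of the learner's goal
    scale      -- hypothetical constant factor (assumption, not from the text)
    """
    known = [x for x in samples if x is not None]
    if not known:
        return 0.0  # assumption: no observations yields no reward
    x_avg = sum(known) / len(known)
    if x_avg > x_t:
        available = x_opp_goal - x_t  # field remaining ahead of the ball
        return scale * (x_avg - x_t) / available if available > 0 else 0.0
    available = x_t - x_own_goal      # field remaining behind the ball
    return -scale * (x_t - x_avg) / available if available > 0 else 0.0


if __name__ == "__main__":
    # Backwards pass followed by forward movement: the average is still ahead of x_t.
    print(average_position_reward(0.0, [-10.0, 5.0, 20.0], 50.0, -50.0))
```

Because only the average matters, the first sample in the usage example lies behind the starting position, yet the reward is positive. Samples marked `None` are skipped, so gaps in the player's view of the ball change which positions enter the average.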

Figure 1: (a) The black and white dots represent the players attacking the right and left goals respectively. Arrows indicate a single player's action options when in possession of the ball. The player kicks the ball towards a fixed set of markers around the field, including the corner flags and the goals. (b) The component of the reward function R based on the circumstances under which the ball went out of bounds. For kick-ins, the reward varies linearly with the x position of the ball.

The reward r is based on direct environmental feedback. It is a domain-dependent internal reinforcement function based upon heuristic knowledge of progress towards the goal. Notice that it relies solely upon the player's own impression of the environment. If the player fails to observe the ball's position for a period of time, the internal reward is affected. However, players can track the ball much more easily than they can deduce the internal states of other players, as they would have to do were they to determine future team state transitions.

As teammates learn concurrently, the concept to be learned by each individual agent changes over time. We address this problem by gradually increasing exploitation as opposed to exploration in all teammates and by using a learning rate of 0.02 (see Equation 1). Thus, even though we are averaging several reward values for taking an action in a given state, each new example accounts for 2% of the updated Q-value: rewards gained while teammates were acting more randomly are weighted less heavily.
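The weighting described above can be illustrated with a small sketch. Equation 1 is not reproduced in this section, so the update below assumes the common form in which the Q-value moves a fraction `alpha` of the way towards each new reward; the function names and the choice of `alpha = 0.02` (matching the 2% figure) are assumptions for illustration.

```python
from typing import Iterable


def update_q(q: float, reward: float, alpha: float = 0.02) -> float:
    """Move q a fraction alpha of the way towards the new reward (assumed form)."""
    return q + alpha * (reward - q)


def fold_rewards(rewards: Iterable[float], q0: float = 0.0,
                 alpha: float = 0.02) -> float:
    """Apply update_q to a sequence of rewards, oldest first."""
    q = q0
    for r in rewards:
        q = update_q(q, r, alpha)
    return q


def weight_of_reward(age: int, alpha: float = 0.02) -> float:
    """Share of the current Q-value contributed by a reward seen `age` updates ago."""
    return alpha * (1.0 - alpha) ** age


if __name__ == "__main__":
    print(fold_rewards([1.0, 0.0]))
    print(weight_of_reward(0), weight_of_reward(100))
```

With this form the newest reward contributes a share of `alpha`, and each earlier reward's share shrinks by a factor of `1 - alpha` per later update, so rewards collected early, while teammates were exploring more, fade gradually.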


Peter Stone
Fri Feb 27 18:45:43 EST 1998