
Internal Reinforcement through Observation

As in any RL approach, the reward function plays a large role in determining what policy is learned. One possible reward function is based entirely upon reaching the ultimate goal. Although goals scored are the true rewards in this domain, such events are very sparse. To increase the feedback from actions taken, it is useful to use an internal reinforcement function, which provides feedback based on intermediate states on the way to the goal. We did not explore the space of such functions; instead, we created a single reward function R.

R gives rewards for goals scored. However, players also receive rewards if the ball goes out of bounds, or else after a fixed period of time $t_{lim}$, based on the ball's average lateral position on the field. In particular, when a player takes action tex2html_wrap_inline2383 in state s such that tex2html_wrap_inline2387 , the player records the time t at which the action was taken as well as the x-coordinate of the ball's position at time t, $x_t$. The reward function R takes as input the observed ball position over time tex2html_wrap_inline2395 (a subset of tex2html_wrap_inline2397 ) and outputs a reward r. Since the ball position over time also depends on other agents' actions, the reward is stochastic and non-stationary. Under the following conditions, the player fixes the reward r:


In case 1, the reward r is based on the value $r_o$ as indicated in Figure 1(b): tex2html_wrap_inline2417 . Thus, the farther in the future the ball goes out of bounds (i.e., the larger the elapsed time $t_o$), the smaller the absolute value of r. This scaling by time is akin to the discount factor used in Q-learning. We use tex2html_wrap_inline2423 and tex2html_wrap_inline2425 .
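
The case 1 scaling can be illustrated with a minimal Python sketch. The functional form below is an assumption, chosen only because it has the property described above (the magnitude of r shrinks as the ball goes out of bounds later). The names `r_o`, `t_o`, `t_lim`, and `phi`, and the numeric values in the usage lines, are hypothetical and are not taken from the text.

```python
def scaled_out_of_bounds_reward(r_o: float, t_o: float, t_lim: float, phi: float) -> float:
    """Scale a base out-of-bounds reward r_o by how late the event occurred.

    Assumed form (hypothetical, not confirmed by the text):
        r = r_o / (1 + (phi - 1) * t_o / t_lim)
    With t_o = 0 the reward is r_o; with t_o = t_lim it is r_o / phi.
    """
    if t_lim <= 0 or phi < 1:
        raise ValueError("expected t_lim > 0 and phi >= 1")
    if not 0 <= t_o <= t_lim:
        raise ValueError("expected 0 <= t_o <= t_lim")
    return r_o / (1.0 + (phi - 1.0) * t_o / t_lim)


# Illustrative values only: base reward -1.0, a 100-step limit, phi = 4.
early = scaled_out_of_bounds_reward(-1.0, t_o=10, t_lim=100, phi=4.0)
late = scaled_out_of_bounds_reward(-1.0, t_o=90, t_lim=100, phi=4.0)
assert abs(late) < abs(early)  # later events earn a smaller-magnitude reward
```

Under this assumed form, `phi` plays a role similar to a discount: it bounds how far the reward can shrink by the end of the time window.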

In cases 2 and 3, the reward r is based on the average x-position of the ball from time t to time tex2html_wrap_inline2431 or tex2html_wrap_inline2433 . Over that entire time span, the player samples the x-coordinate of the ball at fixed, periodic intervals and computes the average $x_{avg}$ over the times at which the ball position is known. Then, if $x_{avg} > x_t$, where $x_t$ is the ball's x-coordinate at time t, $r = \frac{x_{avg} - x_t}{x_{og} - x_t}$, where $x_{og}$ is the x-coordinate of the opponent goal (the right goal in Figure 1(b)). Otherwise, if $x_{avg} \le x_t$, $r = -\frac{x_t - x_{avg}}{x_t - x_{lg}}$, where $x_{lg}$ is the x-coordinate of the learner's goal. Thus, the reward is the fraction of the available field by which the ball was advanced, on average, over the time period in question. Note that a backwards pass can lead to positive reward if the ball then moves forward in the near future.
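
The computation for cases 2 and 3 can be sketched in Python as follows. This is a minimal illustration, assuming the reward is exactly the fraction of available field described above. The function and parameter names, the use of `None` for samples at which the ball was not seen, the neutral reward of 0 when no sample is available, and the field coordinates in the usage lines (goal lines at x = -52.5 and x = 52.5) are all assumptions made for the example.

```python
from typing import Optional, Sequence


def average_position_reward(
    x_t: float,
    samples: Sequence[Optional[float]],
    x_opp_goal: float,
    x_own_goal: float,
) -> float:
    """Reward from the ball's average x-position after an action.

    x_t        : ball x-coordinate when the action was taken
    samples    : periodic ball x-coordinates; None where the ball was not seen
    x_opp_goal : x-coordinate of the opponent goal
    x_own_goal : x-coordinate of the learner's own goal
    """
    known = [x for x in samples if x is not None]
    if not known:
        return 0.0  # assumption: no observations gives a neutral reward
    x_avg = sum(known) / len(known)

    if x_avg > x_t:
        available = x_opp_goal - x_t  # field left in front of the ball
        return (x_avg - x_t) / available if available > 0 else 0.0
    available = x_t - x_own_goal  # field left behind the ball
    return -(x_t - x_avg) / available if available > 0 else 0.0


# Hypothetical field: own goal at x = -52.5, opponent goal at x = 52.5.
# A pass that first goes backward, after which the ball moves forward.
r = average_position_reward(0.0, [-5.0, None, 10.0, 25.0], 52.5, -52.5)
assert r > 0
```

With these assumed coordinates, the backward pass still earns a positive reward because the known samples average ahead of the position at which the action was taken.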

Figure 1: (a) The black and white dots represent the players attacking the right and left goals respectively. Arrows indicate a single player's action options when in possession of the ball. The player kicks the ball towards a fixed set of markers around the field, including the corner flags and the goals. (b) The component $r_o$ of the reward function R based on the circumstances under which the ball went out of bounds. For kick-ins, the reward varies linearly with the x position of the ball.

The reward r is based on direct environmental feedback. It is a domain-dependent internal reinforcement function based upon heuristic knowledge of progress towards the goal. Notice that it relies solely upon the player's own impression of the environment. If the player fails to observe the ball's position for a period of time, the internal reward is affected. However, players can track the ball much more easily than they can deduce the internal states of other players, which they would have to do in order to determine future team state transitions.

As teammates learn concurrently, the concept to be learned by each individual agent changes over time. We address this problem by gradually shifting all teammates from exploration towards exploitation and by using a learning rate $\alpha = 0.02$ (see Equation 1). Thus, even though we are averaging several reward values for taking an action in a given state, each new example accounts for 2% of the updated Q-value, so rewards gained while teammates were acting more randomly are weighted less heavily.
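
The effect of the small learning rate can be sketched in Python. This assumes the common incremental form Q <- Q + alpha * (r - Q) with alpha = 0.02, which is consistent with each new example contributing 2% of the updated value. The function names and the sample rewards are hypothetical.

```python
def update_q(q_old: float, reward: float, alpha: float = 0.02) -> float:
    """Blend a newly observed reward into the running Q-value estimate."""
    return q_old + alpha * (reward - q_old)


def reward_weight(age: int, alpha: float = 0.02) -> float:
    """Weight carried in the current Q-value by a reward seen `age` updates ago."""
    return alpha * (1.0 - alpha) ** age


# Hypothetical reward sequence for one (state, action) pair, oldest first.
rewards = [1.0, -0.5, 0.25]
q = 0.0
for r in rewards:
    q = update_q(q, r)

# The same value, written as an explicitly weighted sum of past rewards
# (the initial Q-value of 0 contributes nothing).
weighted = sum(
    reward_weight(len(rewards) - 1 - i) * r for i, r in enumerate(rewards)
)
assert abs(q - weighted) < 1e-12
```

Because the weight of a reward decays geometrically with its age, examples gathered early in training, when teammates were exploring more, gradually lose influence on the estimate.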


Peter Stone
Fri Feb 27 18:45:43 EST 1998