
State Generalization Using a Learned Feature

In the soccer example, we applied TPOT-RL to enable each teammate to simultaneously learn a high-level action policy. The policy is a function that determines what an agent should do when it has possession of the ball. The input of the policy is the agent's perception of the current world state; the output is a target destination for the ball in terms of a location on the field, e.g. the opponent's goal. In our experiment, each agent has 8 possible actions, as illustrated in Figure 1(a). Since a player may not be able to tell the results of other players' actions, or even when those players are able to act, the domain is opaque-transition.
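As a rough illustration, the sketch below renders this policy interface in Python. The names (ACTIONS, BallPossessionPolicy, perceived_state) are hypothetical and introduced only for this sketch; the 8 target destinations themselves are specified in Figure 1(a), not by this code.

    from typing import Protocol, Sequence

    # The 8 candidate ball destinations of Figure 1(a), indexed 0..7 here.
    ACTIONS: Sequence[int] = range(8)

    class BallPossessionPolicy(Protocol):
        """Interface for the learned high-level policy: queried only when the
        agent has the ball, it maps the agent's own (partial) perception of
        the world to a target destination for the ball.  Because the domain
        is opaque-transition, the agent never observes the state transitions
        caused by its teammates' actions."""
        def __call__(self, perceived_state) -> int: ...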

A team formation is divided into 11 positions (m = 11), as also shown in Figure 1(a) [16]. Thus, the partition function returns the player's current position. Using our layered learning approach, we use the previously trained DT as e. Each possible pass is classified as either a likely success or a likely failure with a confidence factor. Outputs of the DT could be clustered based on these confidence factors; in our experiments, we cluster them into only two sets, indicating success and failure. Therefore |U| = 2 and |A| = 8, so the feature space has |V| = |U|^|A| * m = 2^8 * 11 = 2816 elements. This feature space is immensely smaller than the original state space, which contains an astronomically large number of states. Thus, even though each agent only gets about 10 training examples per 10-minute game and the reward function shifts as teammate policies improve, the learning task becomes feasible. Since e indicates the likely success or failure of each possible action, at action-selection time we consider only the actions that are likely to succeed (|W| = 1). Therefore, each player learns 8 Q-values, for a total of 88 learned by the team as a whole. Even with sparse training and shifting concepts, such a learning task is tractable.
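To make the bookkeeping above concrete, here is a minimal sketch, assuming hypothetical names (dt_confidence, e, f, Q, select_action) and an illustrative clustering cutoff, of how the previously trained DT supplies the action-dependent feature, how the 88 team-wide Q-values arise, and how action selection restricts attention to passes labeled likely successes (|W| = 1). It illustrates the scheme described here rather than reproducing the authors' implementation.

    import random

    NUM_ACTIONS = 8          # pass/shoot targets of Figure 1(a)
    NUM_POSITIONS = 11       # team positions, m = 11
    SUCCESS, FAILURE = 1, 0  # |U| = 2: DT outputs clustered into two sets

    def dt_confidence(state, action):
        """Hypothetical stand-in for the previously trained decision tree:
        returns the confidence that passing toward `action` would succeed."""
        raise NotImplementedError  # supplied by the lower learned layer

    def e(state, action):
        """Action-dependent feature: cluster the DT output into two sets.
        The 0.5 cutoff is illustrative only."""
        return SUCCESS if dt_confidence(state, action) >= 0.5 else FAILURE

    def f(state, position):
        """State generalization: one feature per action plus the player's
        position, giving |U|**NUM_ACTIONS * NUM_POSITIONS feature vectors."""
        return tuple(e(state, a) for a in range(NUM_ACTIONS)), position

    # Each player maintains Q-values only for its own position and only uses
    # those of actions labeled likely successes: 8 per player, 88 team-wide.
    Q = {(pos, a): 0.0 for pos in range(NUM_POSITIONS) for a in range(NUM_ACTIONS)}

    def select_action(state, position, epsilon=0.1):
        """Consider only actions whose feature value is SUCCESS (|W| = 1)."""
        features, _ = f(state, position)
        candidates = [a for a in range(NUM_ACTIONS) if features[a] == SUCCESS]
        if not candidates:             # nothing looks promising: fall back to all
            candidates = list(range(NUM_ACTIONS))
        if random.random() < epsilon:  # occasional exploration
            return random.choice(candidates)
        return max(candidates, key=lambda a: Q[(position, a)])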


