
Introduction

Reinforcement learning (RL) is an effective paradigm for training an artificial agent to act in its environment in pursuit of a goal. RL techniques rely on the premise that an agent's action policy affects its overall reward over time. As surveyed in [5], several popular RL techniques use dynamic programming to enable a single agent to learn an effective control policy as it traverses a stationary (Markovian) environment.

Dynamic programming requires that an agent have or learn at least an approximate model of the state transitions resulting from its actions. Q-values encode the future rewards attainable from neighboring states. A single agent can keep track of state transitions as its actions move it from state to state.
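To make the dependence on neighboring states concrete, the following is a minimal sketch of a standard tabular Q-learning update. It is given only as background illustration and is not the method developed in this paper. The function name, the state and action labels, and the values of the learning rate `alpha` and discount `gamma` are all assumptions chosen for the example.

```python
from collections import defaultdict

def q_learning_update(q, state, action, reward, next_state, actions,
                      alpha=0.1, gamma=0.9):
    """One tabular Q-learning step (illustrative sketch).

    The update needs the observed next_state: the target bootstraps
    from the best Q-value available in that neighboring state.
    """
    best_next = max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]

# Hypothetical two-state example.
q = defaultdict(float)
actions = ["left", "right"]
q[("s1", "right")] = 1.0  # assume s1 is already known to be valuable
new_value = q_learning_update(q, "s0", "right", 0.0, "s1", actions)
```

The key point is the `next_state` argument: without an observation of the state that follows the action, this update cannot be computed.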

This paper focuses on teams of agents learning to collaborate towards a common goal in adversarial environments. While agents can still affect their reward through their actions, they can no longer necessarily track the team's state transitions: teammates and opponents affect and experience state transitions that are completely opaque to the agent. For example, an information agent may broadcast a message without any knowledge of who receives and reacts to it. In such opaque-transition settings, Q-values cannot relate to neighboring states (which are unknown), but must still reflect real-world long-term reward resulting from chosen actions.

While opaque-transition settings eliminate the possibility of using dynamic programming, they permit parallel learning among teammates. Since agents do not track state transitions, they can each explore a separate partition of the state space without any knowledge of state values in other partitions. This team-partitioned characteristic speeds up learning by reducing the learning task of each agent. Nevertheless, the challenge of learning in a non-stationary (non-Markovian) environment remains.

This general setup builds upon the robotic soccer framework, which has been the substrate of our work. In more detail, agents' actions in this setup are chained: a single agent's set of actions allows it to select which other agent will act next in pursuit of a goal. A single agent cannot directly control the full achievement of a goal, but a chain of agents can. In robotic soccer, the chaining of actions corresponds to passing the ball between agents. There are a variety of other examples, such as information agents that communicate through message passing. (These domains contrast, for example, with grid-world domains in which a single agent moves from some initial location to some final goal location; with domains where agents take actions in parallel, possibly in coordination, such as two robots executing tasks in parallel; and with game domains where the rules force an agent and its opponent to alternate actions.) Because of this chaining of agents and the corresponding inability of single agents to fully achieve goals, we call these domains team-partitioned.

In addition, we assume that agents do not know the state that the world will be in after an action is selected, since another agent will continue the path to the goal. Adversarial agents can also intercept the chain and take control of the game. The domain therefore becomes opaque-transition. In short, we identify a way to do RL when the learning agent cannot even observe what state the team enters next, but can use a reward function that captures a medium- to long-term result of the whole ensemble's learning.
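As a rough illustration of learning without an observed successor state, the sketch below moves a value estimate toward an observed reward alone, with no bootstrapping term. It is a simplified stand-in under stated assumptions, not the TPOT-RL algorithm itself, which is presented formally in Section 2. The function name, the feature and action labels, the reward values, and the learning rate are all hypothetical.

```python
from collections import defaultdict

def opaque_transition_update(q, feature, action, observed_reward, alpha=0.1):
    """Move the estimate for (feature, action) toward an observed reward.

    No next state appears anywhere: the agent only sees a reward that
    summarizes what happened after it acted (illustrative sketch).
    """
    key = (feature, action)
    q[key] += alpha * (observed_reward - q[key])
    return q[key]

# Hypothetical usage: a coarse feature value indexes the table.
q = defaultdict(float)
v1 = opaque_transition_update(q, "pass_likely", "pass_to_3", 1.0, alpha=0.5)
v2 = opaque_transition_update(q, "pass_likely", "pass_to_3", -1.0, alpha=0.5)
```

Indexing the table by a coarse feature rather than by the full state is one way to keep the number of entries small, which matters when few training examples are available.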

In this paper we present team-partitioned, opaque-transition reinforcement learning (TPOT-RL). TPOT-RL can learn a set of effective policies with very few training examples. It relies on action-dependent dynamic features which coarsely generalize the state space. While feature selection is often a crucial issue in learning systems, our work uses a previously learned action-dependent feature. We empirically demonstrate the effectiveness of TPOT-RL in a multi-agent, adversarial environment, and show that the previously learned action-dependent feature can improve the performance of TPOT-RL.

The remainder of the paper is organized as follows. Section 2 formally presents the TPOT-RL algorithm. Section 3 details an implementation of TPOT-RL in the simulated robotic soccer domain, with extensive empirical results presented in Section 4. Section 5 relates TPOT-RL to previous work, and Section 6 concludes.


Peter Stone
Fri Feb 27 18:45:43 EST 1998