TPOT-RL brings together several techniques that have been proposed theoretically or tested in simple domains. This section highlights the components of TPOT-RL and relates them to previous work.
Typical RL paradigms update the value of a state-action pair based upon the value of the subsequent state (or state distribution). As presented in , the typical update function in Q-learning is
\[
Q(s,a) \leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right],
\]
where $s'$ is the state next reached after executing action $a$ in state $s$ and $\gamma$ is the discount factor. While characterized as ``model-free'' in the sense that the agent need not know the transition function, these paradigms assume that the agent can observe the subsequent state that it enters.
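For concreteness, the standard Q-learning backup can be sketched as follows; the states, actions, and learning parameters are illustrative placeholders, not values from TPOT-RL:

```python
from collections import defaultdict

# Minimal tabular Q-learning sketch of the standard backup
#   Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
ALPHA, GAMMA = 0.1, 0.9
ACTIONS = ("left", "right")
Q = defaultdict(float)  # maps (state, action) -> value estimate

def q_update(s, a, r, s_next):
    """One-step Q-learning backup; note the dependence on the observed s_next."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])

q_update("s0", "left", 1.0, "s1")  # Q[("s0", "left")] becomes 0.1
```

The point of the sketch is that the backup cannot even be computed unless the agent observes the successor state `s_next`.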
However, the vast amount of hidden state, coupled with the multi-agent nature of this domain, makes such a paradigm impossible for the following reasons. Having only local world views, agents cannot reliably discern when a teammate is able to take an action. Furthermore, even when an agent can notice that a teammate is within kicking distance of the ball, it certainly cannot determine the feature values for the teammate's possible actions. Worse than being model-free, multi-agent RL must cope with the inability even to track the team's state trajectory. Thus we use Equation 1, which does not rely on knowing the subsequent state.
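By contrast with the standard Q-learning backup, an update suited to opaque transitions depends only on an observed reward signal, never on a successor state. The feature encoding, action name, and learning rate below are hypothetical stand-ins, not the paper's actual definitions:

```python
from collections import defaultdict

ALPHA = 0.1  # illustrative learning rate
Q = defaultdict(float)  # maps (features, action) -> value estimate

def opaque_update(features, action, long_term_reward):
    # No successor state appears in this backup: the agent moves its estimate
    # toward an observed long-term reward alone, since in an opaque-transition
    # setting it cannot track the team's state trajectory.
    key = (features, action)
    Q[key] += ALPHA * (long_term_reward - Q[key])

opaque_update(("ball_near", "lane_open"), "pass", 1.0)
```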
Notice that the opaque-transition characteristic also does not fit into the partially observable Markov decision process (POMDP) framework . While POMDPs deal with hidden state, they do assume that the agent at least knows when it has transitioned to a new state and may act again.
The construction of feature space V can have a huge effect on the nature of Q. For example, in , a grid-like discretization is used for V. Since the resulting state space is too large for a lookup table, a neural network is used as the value function approximator. This approach is shown not to work very well, and the authors conclude that a more complex function approximator might work better. In contrast, we take the approach of using a smaller feature space and the simplest possible evaluation function: a lookup table. We can do so by relying on our layered learning approach to learn the state generalization function.
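As an illustration of this design choice, a learned generalization function (stubbed out below) can compress the raw state so far that a plain dictionary suffices as the Q-table. The feature names and threshold here are hypothetical, not TPOT-RL's actual features:

```python
def generalize(raw_state):
    # Stand-in for a learned state generalization function; in TPOT-RL this
    # mapping is learned in a lower layer of the layered learning hierarchy.
    return "open" if raw_state["nearest_opponent_dist"] > 10.0 else "blocked"

q_table = {}  # the coarse feature space makes a lookup table feasible

def q_value(raw_state, action):
    return q_table.get((generalize(raw_state), action), 0.0)

q_table[("open", "pass")] = 0.7
print(q_value({"nearest_opponent_dist": 15.0}, "pass"))  # -> 0.7
```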
The use of machine learning in multi-agent systems has recently been receiving a good deal of attention. For an extensive survey, see . This section highlights the work most related to TPOT-RL.
The internal reinforcement in the reward function R is similar to Mataric's progress estimators . There, the short-term real-world effects of actions are used as an intermediate reward to help robots reach the ultimate goal location. Mataric's conditions also play a role similar to that of the features used here, reducing the size of the evaluation function's domain. However, that work was conducted in a completely collaborative setting, as opposed to our adversarial one.
Previous multi-agent reinforcement learning systems have typically dealt with much simpler tasks than the one presented here. Littman uses Markov games to learn stochastic policies in a very abstract version of 1-on-1 robotic soccer . There have also been a number of studies of multi-agent reinforcement learning in the pursuit domain [1, 17]. In this domain, four predators chase a single prey in a small grid-like world.
Also in a predator-like task,  uses a single run to cope with the opponents' shifting policies, ignoring those policies just as we do; the effects of opponent actions are instead captured in the reward function.
Another team-partitioned, opaque-transition domain is network routing, as considered in . Each network node is treated as a separate agent that cannot see a packet's route beyond its own action. A major difference between that work and our own is that neighboring nodes send back their own value estimates, whereas we assume that agents do not even know their neighboring states. Thus, unlike TPOT-RL agents, the nodes are able to use dynamic programming.
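A minimal sketch of that style of routing update, assuming a Q-routing-like rule in which each node keeps per-neighbor delivery-time estimates (names and constants are hypothetical):

```python
from collections import defaultdict

ALPHA = 0.5  # illustrative learning rate
Q = defaultdict(float)  # maps (destination, neighbor) -> estimated delivery time

def routing_update(dest, neighbor, transit_time, neighbor_best_estimate):
    # The neighbor reports back its own best estimate for reaching dest,
    # enabling a dynamic-programming backup -- precisely the information
    # that TPOT-RL agents do not receive from their teammates.
    target = transit_time + neighbor_best_estimate
    Q[(dest, neighbor)] += ALPHA * (target - Q[(dest, neighbor)])

routing_update("nodeZ", "nodeB", 2.0, 5.0)  # estimate moves toward 7.0
```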
In other soccer systems, a number of learning techniques have been explored. However, most have addressed low-level, individual skills as opposed to team-based policies [2, 15]. Interestingly,  uses genetic programming to evolve team behaviors from scratch, as opposed to our layered learning approach.