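As we have seen, TPOT-RL uses action-dependent features. Therefore, we can assume that the expected long-term reward for taking action $a_i$ depends only on the feature value related to action $a_i$. That is,
$$Q(f(s), a_i) = Q(f(s'), a_i)$$
whenever $e(s, a_i) = e(s', a_i)$ and $P(s) = P(s')$. In other words, if $f(s) = v$, then $Q(v, a_i)$ depends entirely upon $e(s, a_i)$ and is independent of $e(s, a_j)$ for all $j \neq i$.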
Without this assumption, since there are $|A|$ actions possible for each feature vector, the value function Q has $|U|^{|A|} \cdot |M| \cdot |A|$ independent values. Under this assumption, however, the Q-table has at most $|U| \cdot |M| \cdot |A|$ entries: for each action possible from each position, there is only one relevant feature value. Therefore, even with only a small number of training examples available, we can treat the value function Q as a lookup table without the need for any complex function approximation. To be precise, Q stores one value for every possible combination of action $a$, $e(s, a)$, and $P(s)$.
For example, Table 1 shows the entire feature space for one agent's partition of the state space when $|U| = 3$ and $|A| = 2$. There are $|U|^{|A|} = 3^2 = 9$ different entries in feature space with 2 Q-values for each entry: one for each possible action. $|U|^{|A|} \cdot |M|$ is much smaller than the original state space for any realistic problem, but it can grow large quickly, particularly as $|A|$ increases. However, notice in Table 1 that, under the assumption described above, there are only $|U| \cdot |A| = 3 \cdot 2 = 6$ independent Q-values to learn rather than 18, reducing the number of free variables in the learning problem by 67% in this case.
Table 1: A sample Q-table for a single agent when $|U| = 3$ and $|A| = 2$: $U = \{u_0, u_1, u_2\}$, $A = \{a_0, a_1\}$. $q_{i,j}$ is the estimated value of taking action $a_i$ when $e(s, a_i) = u_j$. Since this table is for a single agent, $P(s)$ remains constant.
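The counting argument behind Table 1 can be checked with a short enumeration. The following is only an illustrative sketch: the string labels (`u0`, `a0`, `q00`, and so on) and the helper `q_name` are hypothetical names chosen here, and the partition is held fixed because the table covers a single agent.

```python
from itertools import product

U = ["u0", "u1", "u2"]   # feature values, |U| = 3
A = ["a0", "a1"]         # actions, |A| = 2


def q_name(action_index, feature_value_index):
    """Label q_{i,j}: value of action a_i when its own feature equals u_j."""
    return f"q{action_index}{feature_value_index}"


# One row per feature vector; the partition is fixed for a single agent.
rows = []
for combo in product(range(len(U)), repeat=len(A)):
    features = tuple(U[j] for j in combo)
    # Under the assumption, the value of a_i looks only at the i-th feature.
    q_values = tuple(q_name(i, j) for i, j in enumerate(combo))
    rows.append((features, q_values))

table_cells = len(rows) * len(A)                        # cells in the full table
independent = len({q for _, qs in rows for q in qs})    # distinct q_{i,j} labels
reduction = 1 - independent / table_cells               # fraction of values saved

for features, q_values in rows:
    print(features, q_values)
print(table_cells, independent, round(100 * reduction))
```

Each printed row pairs a feature vector with the two Q-value labels it uses; the same label recurs across rows whenever the acting action's own feature value is the same.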
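The Q-values learned depend on the agent's past experiences in the domain. In particular, after taking an action $a$ while in state $s$ with $f(s) = v$, an agent receives reward $r$ and uses it to update $Q(v, a)$ as follows:
$$Q(v, a) = Q(v, a) + \alpha \, (r - Q(v, a))$$
where $\alpha$ is the learning rate.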
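A lookup-table update of this kind might be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes an incremental update that moves the stored value toward the received reward with a step size written here as `alpha`, it initializes unseen entries to 0.0, and the class and method names are hypothetical.

```python
from collections import defaultdict


class QTable:
    """Lookup table keyed by (partition, action, that action's feature value)."""

    def __init__(self, alpha=0.1):
        self.alpha = alpha           # step size (assumed value)
        self.q = defaultdict(float)  # unseen entries start at 0.0 (assumption)

    def key(self, features, partition, action):
        # Only the acting action's own feature value is relevant.
        return (partition, action, features[action])

    def value(self, features, partition, action):
        return self.q[self.key(features, partition, action)]

    def update(self, features, partition, action, reward):
        k = self.key(features, partition, action)
        # Move the estimate a fraction alpha toward the observed reward;
        # no successor-state value is used.
        self.q[k] += self.alpha * (reward - self.q[k])
        return self.q[k]


table = QTable(alpha=0.5)
table.update(("u0", "u2"), partition=0, action=1, reward=1.0)
# A different feature vector with the same feature value for action 1
# reads the same entry.
print(table.value(("u1", "u2"), partition=0, action=1))
```

Keying the table on the acting action's own feature value is what lets experience gathered under one feature vector carry over to every other vector that agrees on that component.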
Since the agent is not able to access its teammates' internal states, future team transitions are completely opaque from the agent's perspective. Thus it cannot use dynamic programming to update its Q-table. Instead, the reward $r$ comes directly from the observable environmental characteristics (those that are captured in $S$) over a maximum number of time steps $t_{lim}$ after the action is taken. The reward function $R: S^{t_{lim}} \rightarrow \mathbb{R}$ returns a value at some time no further than $t_{lim}$ in the future. During that time, other teammates or opponents can act in the environment and affect the action's outcome, but the agent may not be able to observe these actions. For practical purposes, it is crucial that the reward function depend only on the world as observed from the acting agent's perspective. In practice, the range of $R$ is $[-Q_{max}, Q_{max}]$, where $Q_{max}$ is the reward for immediate goal achievement.
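To make the shape of such a reward function concrete, here is a hedged sketch of a reward computed from a bounded window of the agent's own observations. The parameter names `t_lim` and `q_max` are this sketch's labels for the step limit and the reward bound, the `score` callback is a hypothetical domain-specific function, and returning 0.0 when nothing decisive is observed is an assumption made only for illustration.

```python
from typing import Callable, Optional, Sequence


def windowed_reward(
    observations: Sequence[object],
    t_lim: int,
    q_max: float,
    score: Callable[[object], Optional[float]],
) -> float:
    """Return a reward from at most t_lim observed states after an action.

    `score` returns a number as soon as an observed state settles the
    outcome, or None to keep waiting.
    """
    window = observations[:t_lim]        # never look further than t_lim steps
    raw = 0.0                            # neutral default (assumption)
    for state in window:
        result = score(state)
        if result is not None:
            raw = result
            break
    return max(-q_max, min(q_max, raw))  # keep the reward inside [-q_max, q_max]


def example_score(state):
    # Hypothetical outcomes; a real domain would define its own.
    return {"goal_for": 100.0, "goal_against": -100.0}.get(state)


print(windowed_reward(["none", "goal_for"], t_lim=5, q_max=10.0, score=example_score))
```

The function reads only the acting agent's own observation sequence, so it needs no access to teammates' internal states or to the transitions they cause.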
The reward function, including $t_{lim}$ and $Q_{max}$, is domain-dependent. One possible type of reward function is based entirely upon reaching the ultimate goal. In this case, an agent charts the actual (long-term) results of its policy in the environment. However, goal achievement is often very infrequent. To increase the feedback from actions taken, it is useful to use an internal reinforcement function, which provides feedback based on intermediate states on the way to the goal. We use this internal reinforcement approach in our work.