Next: Action Selection Up: Team-PartitionedOpaque-Transition RL Previous: State Generalization

Value Function Learning

As we have seen, TPOT-RL uses action-dependent features. Therefore, we can assume that the expected long-term reward for taking action $a_i$ depends only on the feature value related to action $a_i$. That is,

$$Q(f(s), a_i) = Q(f(s'), a_i)$$

whenever $e(s, a_i) = e(s', a_i)$ and $P(s) = P(s')$. In other words, if $f(s) = v$, $Q(v, a_i)$ depends entirely upon $e(s, a_i)$ and is independent of $e(s, a_j)$ for all $j \neq i$.

Without this assumption, since there are $|A|$ actions possible for each feature vector, the value function $Q$ has $|V| \cdot |A| = |U|^{|A|} \cdot |M| \cdot |A|$ independent values. Under this assumption, however, the Q-table has at most $|U| \cdot |M| \cdot |A|$ entries: for each action possible from each position, there is only one relevant feature value. Therefore, even with only a small number of training examples available, we can treat the value function $Q$ as a lookup table, with no need for complex function approximation. To be precise, $Q$ stores one value for every possible combination of action $a$, $e(s, a)$, and $P(s)$.

For example, Table 1 shows the entire feature space for one agent's partition of the state space when $|U| = 3$ and $|A| = 2$. There are $|U|^{|A|} = 3^2 = 9$ different entries in feature space with 2 Q-values for each entry: one for each possible action. $|U|^{|A|}$ is much smaller than the original state space for any realistic problem, but it can grow large quickly, particularly as $|A|$ increases. However, notice in Table 1 that, under the assumption described above, there are only $3 \cdot 2 = 6$ independent Q-values to learn rather than $9 \cdot 2 = 18$, reducing the number of free variables in the learning problem by 67% in this case.

Table 1: A sample Q-table for a single agent when $|U| = 3$ and $|A| = 2$: $U = \{u_0, u_1, u_2\}$, $A = \{a_0, a_1\}$. $q_{i,j}$ is the estimated value of taking action $a_i$ when $e(s, a_i) = u_j$. Since this table is for a single agent, $P(s)$ remains constant.
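The counting argument behind Table 1 can be checked with a short enumeration. The following is only an illustrative sketch: the string labels (`u0`, `a0`, `q0,1`, and so on) and the helper `build_table` are hypothetical names chosen here, not part of TPOT-RL itself. It lists every feature vector for a single agent, attaches to each action the one Q-value that the independence assumption says is relevant, and counts how many distinct values result.

```python
from itertools import product

# Hypothetical labels for the example sizes |U| = 3 and |A| = 2.
U = ["u0", "u1", "u2"]   # possible feature values
A = ["a0", "a1"]         # possible actions


def build_table(feature_values, actions):
    """Return one row per feature vector: (vector, Q-value label per action).

    Under the independence assumption, the value of action i depends only
    on the i-th component of the feature vector, so its label is "q<i>,<j>"
    where j indexes that component's feature value.
    """
    rows = []
    for vec in product(range(len(feature_values)), repeat=len(actions)):
        labels = [f"q{i},{j}" for i, j in enumerate(vec)]
        rows.append((tuple(feature_values[j] for j in vec), labels))
    return rows


rows = build_table(U, A)
total_cells = sum(len(labels) for _, labels in rows)          # table cells
independent = {label for _, labels in rows for label in labels}

for vec, labels in rows:
    print(vec, labels)
print("cells:", total_cells, "independent values:", len(independent))
```

With these sizes the enumeration yields 9 rows and 18 cells but only 6 distinct labels, which is the two-thirds reduction described in the text.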

The Q-values learned depend on the agent's past experiences in the domain. In particular, after taking an action $a$ while in state $s$ with $f(s) = v$, an agent receives reward $r$ and uses it to update $Q(v, a)$ with learning rate $\alpha$ as follows:

$$Q(v, a) \leftarrow Q(v, a) + \alpha \, (r - Q(v, a))$$

Since the agent is not able to access its teammates' internal states, future team transitions are completely opaque from the agent's perspective. Thus it cannot use dynamic programming to update its Q-table. Instead, the reward $r$ comes directly from the observable environmental characteristics (those that are captured in $S$) over a maximum number of time steps $t_{lim}$ after the action is taken. The reward function $R: S^{t_{lim}} \to \mathbb{R}$ returns a value at some time no further than $t_{lim}$ in the future. During that time, other teammates or opponents can act in the environment and affect the action's outcome, but the agent may not be able to observe these actions. For practical purposes, it is crucial that the reward function is only a function of the observable world from the acting agent's perspective. In practice, the range of $R$ is $[-Q_{max}, Q_{max}]$, where $Q_{max}$ is the reward for immediate goal achievement.
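A lookup-table learner of this kind might be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: the class name `QTable`, the zero initial values, and the explicit clipping of the reward to the stated range are all choices made here. The update is assumed to take the usual incremental form, in which the stored value moves a fraction `alpha` of the way toward the observed reward, with no bootstrapping from a successor state since transitions are opaque.

```python
class QTable:
    """Lookup table keyed by (action, feature value for that action, partition)."""

    def __init__(self, alpha=0.1, q_max=100.0):
        self.alpha = alpha    # learning rate (assumed parameter name)
        self.q_max = q_max    # bound on the reward range (assumed parameter name)
        self.values = {}      # unseen entries are treated as 0.0 (assumption)

    def get(self, action, feature_value, partition):
        """Return the stored estimate, or 0.0 if this entry was never updated."""
        return self.values.get((action, feature_value, partition), 0.0)

    def update(self, action, feature_value, partition, reward):
        """Move the estimate toward the observed reward and return the new value.

        No successor-state term appears: the reward is taken directly from
        what the agent observed after acting.
        """
        # Keep the reward inside [-q_max, q_max] (clipping is an assumption).
        r = max(-self.q_max, min(self.q_max, reward))
        old = self.get(action, feature_value, partition)
        new = old + self.alpha * (r - old)
        self.values[(action, feature_value, partition)] = new
        return new


# Usage: two updates with the same reward, then one with an out-of-range reward.
q = QTable(alpha=0.5, q_max=100.0)
first = q.update("a0", "u1", 0, 40.0)
second = q.update("a0", "u1", 0, 40.0)
third = q.update("a0", "u1", 0, 500.0)
```

Only the entry for the action actually taken changes; entries for other actions, feature values, or partitions are left untouched.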

The reward function, including $t_{lim}$ and $Q_{max}$, is domain-dependent. One possible type of reward function is based entirely upon reaching the ultimate goal. In this case, an agent charts the actual (long-term) results of its policy in the environment. However, goal achievement is often very infrequent. To increase the feedback from actions taken, it is useful to use an internal reinforcement function, which provides feedback based on intermediate states towards the goal. We use this internal reinforcement approach in our work.
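One way such an internal reinforcement function could look is sketched below. Everything specific in it is hypothetical: the `Observation` fields, the `progress` measure in [-1, 1], and the choice to scale a goal reward by the inverse of the step at which the goal occurs are assumptions made for illustration, since the actual function is domain-dependent. The sketch looks only at what the agent observed during the bounded window after acting, returns the full reward for an immediate goal, and otherwise falls back on intermediate progress.

```python
from typing import NamedTuple, Sequence


class Observation(NamedTuple):
    goal_reached: bool   # hypothetical flag: goal observed at this step
    progress: float      # hypothetical intermediate measure in [-1, 1]


def internal_reward(observations: Sequence[Observation],
                    t_lim: int, q_max: float) -> float:
    """Compute a reward from at most t_lim observations after an action."""
    window = list(observations)[:t_lim]   # ignore anything past the limit
    for t, obs in enumerate(window, start=1):
        if obs.goal_reached:
            # Full reward for an immediate goal, less for a later one
            # (the 1/t scaling is an assumption of this sketch).
            return q_max / t
    if not window:
        return 0.0
    # No goal observed: use average intermediate progress, kept in range.
    mean_progress = sum(o.progress for o in window) / len(window)
    return max(-q_max, min(q_max, q_max * mean_progress))


# Usage: a goal on the second observed step, and a window with no goal.
r_goal = internal_reward(
    [Observation(False, 0.5), Observation(True, 0.0)], t_lim=5, q_max=100.0)
r_progress = internal_reward(
    [Observation(False, 0.5), Observation(False, -0.25)], t_lim=5, q_max=100.0)
```

Because a value is produced even when no goal occurs within the window, every action taken yields feedback for the learner.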


Peter Stone
Fri Feb 27 18:45:43 EST 1998