Appendix B: Efficient Action Evaluation

In non-generalizing reinforcement learning, the cost of executing a single learning step can be neglected. However, algorithms that generalize over the spaces of sensors and/or actuators are not so simple, and the execution time of each iteration can increase substantially. In an extreme case, this increase can limit the reactivity of the learner, which is very dangerous when working with an autonomous robot.

The most expensive procedure of our algorithm is that of computing the value of all actions (i.e., all valid combinations of elementary actions). The cost of this procedure is especially critical since it is used twice in each step: once to get the guess of each action (in the Action Evaluation procedure detailed in Figure 13) and again to get the goodness of the newly achieved situation after the action execution (when computing the value in the Statistics Update procedure detailed in Figure 14). A trivial reordering of the algorithm avoids the double use of this expensive procedure at each learning step: we can select the action to be executed next at the same time that we evaluate the goodness of the newly achieved situation. The drawback of this reordering is that the action is selected without taking into account the information provided by the last reward value (the goodness of the situation is assessed before the value adjustment). However, this is not a problem in tasks that require many learning steps.
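The reordering described above can be sketched in Python as follows. This is only an illustration under assumed interfaces: `run_steps` and the three callbacks `evaluate`, `execute` and `update` are hypothetical names, not procedures defined in the figures. The point is that a single evaluation per step yields both the goodness of the new situation and the next action, and that this evaluation happens before the statistics update.

```python
def run_steps(evaluate, execute, update, first_action, steps):
    """Hypothetical learning loop with one action evaluation per step.

    evaluate() -> (situation_value, next_action)
    execute(action) -> reward
    update(action, reward, situation_value) adjusts the rule statistics
    """
    action = first_action
    for _ in range(steps):
        reward = execute(action)
        # Single evaluation: assess the new situation and pick the next action.
        # The choice cannot use the adjustment triggered by this reward,
        # because update() has not been called yet.
        situation_value, next_action = evaluate()
        update(action, reward, situation_value)
        action = next_action
    return action
```

With this arrangement the expensive evaluation is called exactly once per learning step instead of twice.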

Even if we use the action-evaluation procedure only once per learning step, we have to optimize it as much as possible since the brute-force approach described before, which evaluates each action sequentially, is only feasible for simple problems.

The action-evaluation method presented next is based on the observation that many actions have the same value, since the most relevant partial rule at a given moment provides the value for all actions that are in accordance with the partial command of that rule. Computing separately the value of two actions that end up being evaluated with the same rule is a waste of time. This can be avoided by driving the action evaluation with the set of active rules, rather than with the set of possible actions as the brute-force approach does.

Figure 16 shows a general form of the algorithm we propose. In this algorithm, partial rules are considered one at a time, ordered from the most relevant rule to the least relevant one. The partial command of the rule under consideration is used to process all the actions that are in accordance with that partial command. This already processed subset of actions need not be considered any more in the action-evaluation procedure. While the rules are processed, we update the current situation assessment and the action to be executed next attending, respectively, to the value prediction and the guess of the rules.
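A minimal Python sketch of this rule-ordered evaluation is given below. It is not a transcription of Figure 16: the function names, the dictionary-based representation of rules and actions, and the assumption that the rule list contains only active rules already sorted by decreasing relevance are all choices made for illustration.

```python
from itertools import product

def in_accordance(action, partial_command):
    """True if the action gives every motor in the partial command the required value."""
    return all(action[m] == v for m, v in partial_command.items())

def evaluate_actions(rules, motors, values):
    """rules: active rules sorted by decreasing relevance, each a dict with
    hypothetical keys 'command', 'value' and 'guess'."""
    # Explicit set of all actions (one value per motor).
    pending = [dict(zip(motors, combo)) for combo in product(values, repeat=len(motors))]
    best_value = float("-inf")
    best_guess = float("-inf")
    best_actions = []
    for rule in rules:
        if not pending:
            break  # every action has already been evaluated
        covered = [a for a in pending if in_accordance(a, rule["command"])]
        if not covered:
            continue  # this rule evaluates no remaining action
        pending = [a for a in pending if not in_accordance(a, rule["command"])]
        best_value = max(best_value, rule["value"])
        if rule["guess"] > best_guess:
            best_guess = rule["guess"]
            best_actions = covered
    return best_value, best_actions

rules = [
    {"command": {"m1": 0}, "value": 5.0, "guess": 5.1},
    {"command": {"m2": 1}, "value": 7.0, "guess": 6.5},
    {"command": {}, "value": 2.0, "guess": 9.0},
]
situation_value, candidates = evaluate_actions(rules, ["m1", "m2"], [0, 1])
```

In this small example each rule is matched only against the actions that no more relevant rule has claimed, so no action is evaluated twice.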

Observe that partial rules can be kept sorted by relevance by the statistics update procedure, since it is in this procedure that rule relevance is modified. When the relevance of a rule changes, its position in the list can be updated accordingly. In this way, we do not have to re-sort the list of rules every time we want to apply the procedure just described.
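One possible way to keep the rule list sorted incrementally is sketched below with Python's standard `bisect` module. The class `RuleList` and its methods are hypothetical; the sketch only shows that changing the relevance of one rule can be handled by a local removal and re-insertion rather than by a full sort.

```python
import bisect

class RuleList:
    """Keeps (negated relevance, rule id) pairs sorted, so the most relevant rule comes first."""

    def __init__(self):
        self._entries = []     # sorted list of (-relevance, rule_id)
        self._relevance = {}   # rule_id -> current relevance

    def set_relevance(self, rule_id, relevance):
        """Insert a rule or move it to the position given by its new relevance."""
        if rule_id in self._relevance:
            old_entry = (-self._relevance[rule_id], rule_id)
            index = bisect.bisect_left(self._entries, old_entry)
            del self._entries[index]
        self._relevance[rule_id] = relevance
        bisect.insort(self._entries, (-relevance, rule_id))

    def by_relevance(self):
        """Rule ids from the most to the least relevant."""
        return [rule_id for _, rule_id in self._entries]

rule_list = RuleList()
rule_list.set_relevance("a", 0.2)
rule_list.set_relevance("b", 0.8)
rule_list.set_relevance("c", 0.5)
```

Locating the old and new positions takes logarithmic time; the list deletion and insertion are still linear in the number of rules, but the full re-sort is avoided.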

When elementary actions are of the form $(m \leftarrow k)$, with $m$ a motor and $k$ a value in the range of possible values for that motor, the above algorithm can be implemented in an especially efficient way, since there is no need to explicitly compute the set of actions. In this case (see Figures 17 and 18), we construct a decision tree that uses motors as decision attributes and that groups in the same leaf all the actions evaluated by the same partial rule (all the actions removed from the set of pending actions in each iteration of the algorithm in Figure 16).

Each internal node of the tree classifies the action according to one of the motor commands included in the action. These internal nodes store the following information:

• Partial command: A partial command that is in accordance with all the actions classified under the node. This partial command can be constructed by collecting all the motors whose values are fixed in the nodes from the root of the tree to the node under consideration.

• Motor: The motor used in this node to classify actions. When a node is open (i.e., we have not yet decided which motor to attend to), no motor is assigned to it. A node can be closed by deciding which motor to pay attention to (and adding the corresponding subtrees) or by converting the node into a leaf.

• Subtrees: A list of the subtrees that start at the node. Each subtree has an associated value that corresponds to one of the possible values of the motor of the node. All the actions included in a given subtree contain an elementary action of the form $(m \leftarrow k)$, where $m$ is the motor of the node and $k$ is the value corresponding to this subtree.

The leaves of the tree have information about the value of the actions classified in that leaf. This information is represented with the following set of attributes for each leaf:

• Value: The expected value for all the actions classified in this leaf. The maximum of this value over all leaves is used to assess the goodness of a newly achieved situation.

• Guess: The value altered with noise for exploratory reasons. The leaf with the maximal guess contains the set of actions from which the action to be executed next is selected.

• Relevance: The relevance of the value predictions (of both the value and the guess).

• Partial command: A partial command that is in accordance with all the actions classified in that leaf. As in the case of internal nodes, this partial command can be constructed by collecting all the motors whose values are fixed from the root of the tree to the leaf under consideration.
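The node and leaf records listed above could be represented, for instance, with the following Python data classes. The class and field names are hypothetical, and using `None` to mark an open node is an assumption made for this sketch.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Union

# A partial command maps motor names to the values fixed for them.
PartialCommand = Dict[str, int]

@dataclass
class Leaf:
    value: float                     # expected value of the actions in this leaf
    guess: float                     # value altered with exploration noise
    relevance: float                 # relevance of the value and guess predictions
    partial_command: PartialCommand  # motors fixed from the root to this leaf

@dataclass
class Node:
    partial_command: PartialCommand = field(default_factory=dict)
    motor: Optional[str] = None      # None marks an open node (assumption of this sketch)
    subtrees: Dict[int, Union["Node", Leaf]] = field(default_factory=dict)

    @property
    def is_open(self) -> bool:
        return self.motor is None

# An open root that is then closed by attending to a motor named "m1".
root = Node()
was_open = root.is_open
root.motor = "m1"
root.subtrees[0] = Leaf(value=5.0, guess=5.1, relevance=0.83, partial_command={"m1": 0})
root.subtrees[1] = Node(partial_command={"m1": 1})
```

Indexing the subtrees by motor value makes the association between each subtree and its value explicit.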

At a given moment, the inclusion of a new partial rule in the tree produces the specialization of all open nodes compatible with the rule (see Figure 18). We say that an open node is compatible with a given rule if the partial command of the node and the partial command of the rule do not assign different values to the same motor. The specialization of an open node can result in the extension of the node (i.e., new branches are added to the tree under that node) or in the transformation of this node into a leaf. A node is extended when the partial command of the rule affects some motors not included in the partial command of the node. This means that there are some motor values not yet taken into account in the tree that have to be used in the action evaluation according to the rule under consideration. When a node is extended, one of the motors not present in the upper layers of the tree is used to generate a layer of open nodes under the current node. After that, the node is considered closed and the rule-inclusion procedure is repeated for this node (with different effects because the node is now closed). When all the motors affected by the partial command of the rule are also affected by the partial command of the node, the node is transformed into a leaf storing the value, guess, and relevance attributes extracted from the information associated with the rule.

The process is stopped as soon as we detect that all nodes have been closed (i.e., all the external nodes of the tree are leaves). In this case, the rules still to be processed can have no effect on the form of the tree and, consequently, are not useful for action evaluation. If a rule is consistently not used for action evaluation, it can be removed from the controller.
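The rule-inclusion and stopping procedure can be sketched in Python as follows. This is an illustration under stated assumptions, not the algorithm of Figures 17 and 18 verbatim: all names are hypothetical, every motor is assumed to accept the same list of values, a node is extended with the first motor of the rule that it does not yet fix, and the rules and their numbers are invented for the example.

```python
def compatible(node_command, rule_command):
    """Compatible if the two partial commands never give one motor two different values."""
    return all(rule_command.get(m, v) == v for m, v in node_command.items())

class Node:
    def __init__(self, command):
        self.command = command  # motor values fixed on the path from the root
        self.motor = None       # None while the node is open
        self.children = []      # one sub-node per value of self.motor
        self.leaf = None        # (value, guess, relevance) once the node is a leaf

def is_closed(node):
    """A subtree is closed when all its external nodes are leaves."""
    if node.leaf is not None:
        return True
    return node.motor is not None and all(is_closed(c) for c in node.children)

def include_rule(node, rule, values):
    """Specialize every open node below `node` that is compatible with `rule`."""
    if node.leaf is not None or not compatible(node.command, rule["command"]):
        return
    if node.motor is None:  # open node: turn it into a leaf or extend it
        missing = [m for m in rule["command"] if m not in node.command]
        if not missing:
            node.leaf = (rule["value"], rule["guess"], rule["relevance"])
            return
        node.motor = missing[0]
        node.children = [Node({**node.command, node.motor: v}) for v in values]
    for child in node.children:  # the node is now closed: repeat below it
        include_rule(child, rule, values)

def leaves(node):
    if node.leaf is not None:
        yield node
    for child in node.children:
        yield from leaves(child)

def build_tree(rules, values):
    """rules: active rules sorted by decreasing relevance."""
    root, used = Node({}), 0
    for rule in rules:
        if is_closed(root):
            break  # remaining rules cannot change the tree
        include_rule(root, rule, values)
        used += 1
    return root, used

# Invented rules over three motors with two values each.
rules = [
    {"command": {"m1": 0, "m2": 0}, "value": 5, "guess": 5.1, "relevance": 0.83},
    {"command": {"m1": 0, "m2": 1}, "value": 7, "guess": 6.5, "relevance": 0.52},
    {"command": {"m1": 1, "m3": 0}, "value": 8, "guess": 6.0, "relevance": 0.33},
    {"command": {"m1": 1, "m2": 0, "m3": 1}, "value": 3, "guess": 6.2, "relevance": 0.24},
    {"command": {"m2": 1}, "value": 2, "guess": 5.3, "relevance": 0.22},
    {"command": {}, "value": 10, "guess": 4.1, "relevance": 0.21},
]
root, used = build_tree(rules, [0, 1])
all_leaves = list(leaves(root))
situation_value = max(leaf.leaf[0] for leaf in all_leaves)
next_command = max(all_leaves, key=lambda leaf: leaf.leaf[1]).command
```

In this invented example the tree closes before the last rule is reached, the situation value is the largest value stored in a leaf, and `next_command` is the partial command of the leaf with the largest guess, from which the next action would be selected.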

Table 2: Set of rules of the controller. The value and error of each rule are stored, and the relevance and guess are computed from them. All partial views are assumed to be active in the current time step.

 Partial View   Partial Command   Value   Error   Relevance   Guess
                                  5       0.1     0.83        5.1
                                  7       0.9     0.52        6.5
                                  8       2.0     0.33        6.0
                                  3       3.1     0.24        6.2
                                  2       3.5     0.22        5.3
                                  10      3.6     0.21        4.1
                                  1       4.0     0.20        5.2
                                  6       4.5     0.18        12.7

A toy-size example can illustrate this tree-based action-evaluation algorithm. Suppose that we have a robot with three motors, each of which accepts two different values. This produces a set of 8 different actions. Suppose that, at a given moment, the robot controller includes the set of rules shown in Table 2. In the Action Evaluation algorithm (Figure 17), rules are processed from the most to the least relevant one, expanding an initially empty tree using the algorithm in Figure 18. The inclusion of a rule in the tree results either in an extension of the tree (see stages B, D and E in Figure 19) or in the closing of branches by converting open nodes into leaves (stages C and F). In this particular case, the tree becomes completely closed after processing 5 of the 8 active rules in the controller. At the end of the process, we have a tree with five leaves. Three of them include two actions each and the other two represent a single action each. Using the tree, we can say that the value of the situation in which the tree is constructed is 8 (this is given by the leaf circled with a solid line in the figure). Additionally, the next action to be executed is selected from the leaf circled with a dashed line, which is the leaf with the largest guess value; any value can be used for the motors that this leaf leaves unspecified.

The cost of our algorithm largely depends on the specific set of partial rules to be processed. In the worst case, the cost of our algorithm is

$$O(n_r \, l^{n_m}),$$

with $n_r$ the number of rules, $n_m$ the number of motors, and $l$ the maximal range of values accepted by the motors. This is because, in the worst case, to insert a given rule, we have to visit all the nodes of a maximally expanded tree (i.e., a tree where each node has $l$ subtrees and where all the final nodes of the branches are still open). The number of nodes of such a tree is

$$\sum_{i=0}^{n_m} l^{i} = \frac{l^{n_m+1}-1}{l-1} = O(l^{n_m}).$$

We can transform the cost expression taking into account that $l^{n_m}$ is the total number of possible combinations of elementary actions ($n_a$) or, in other words, the total number of actions. Therefore, the cost of the presented algorithm is

$$O(n_r \, n_a).$$

On the other hand, the cost of the brute-force approach is always

$$\Omega(n_r \, n_a).$$

So, in the worst case, the cost of the presented algorithm is of the same order as the cost of the brute-force approach. However, since at most $l$ rules would be enough to close a maximally expanded tree (one rule for each of the different values of the motor used in the last still-open layer of the tree), the cost of the tree-based algorithm would be, on average, much smaller than that of the brute-force approach.
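The size of a maximally expanded tree can be checked numerically with a few lines of Python. The function names are hypothetical; the sketch assumes that every motor accepts the same number of values and that the tree has one layer per motor below the root.

```python
def max_tree_nodes(num_motors, num_values):
    """Nodes of a maximally expanded tree: one layer per motor below the root."""
    return sum(num_values ** depth for depth in range(num_motors + 1))

def num_actions(num_motors, num_values):
    """Total number of combinations of elementary actions."""
    return num_values ** num_motors

# Three motors with two values each, as in the toy example.
nodes = max_tree_nodes(3, 2)
actions = num_actions(3, 2)
```

For three motors with two values each this gives 15 nodes for 8 actions. With at least two values per motor, the node count stays below twice the number of actions, which is why visiting the whole tree for each rule has the same order of cost as comparing each rule against every action.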

Figure 20 compares the performance of the brute-force action-evaluation procedure with that of the tree-based one. The figure shows the time taken in the execution of the toy example of Section 6.1. For this experiment, we defined some void motors, i.e., motors whose actions have no effect on the environment. As can be seen, as the number of void motors increases, the cost of the tree-based evaluation becomes significantly lower than that of the brute-force approach.

Josep M Porta 2005-02-17