Reinforcement Learning in Categorizable Environments: the Partial Rule Approach

To implement an algorithm able to exploit the potential categorizability of the environment, we need a representation system that can transfer information between similar situations as well as between similar actions.

Clustering techniques or successive subdivisions of the state space, such as that presented by [McCallum, 1995], focus on the perception side of the problem and aim at determining the reward that can be expected in a given state $s$ considering only some of the feature detectors perceived in that state. This subset of relevant feature detectors is used to compute the expected reward in this state for any possible action $a$ (the $Q(s,a)$ function). However, posed this way, the curse of dimensionality is not completely avoided, since some features can be relevant for one action but not for another, and this produces an unnecessary (from the point of view of each action) differentiation between equivalent situations, decreasing the learning speed. This problem can be avoided by finding the specific set of relevant feature detectors for each action. In this case, the $Q$ function is computed as $Q(f_s(a),a)$, with a state definition that is a function of the action under consideration. This technique is used, for instance, by [Mahadevan and Connell, 1992]. Unfortunately, in the problem we are confronting, this is not enough since, in our case, actions are composed of combinations of elementary actions and we also want to transfer reward information between similar combinations of actions. Therefore, we have to estimate $Q(f_s(a),a)$ taking into account only some of the elementary actions that compose $a$. However, in principle, the relevance of elementary actions is a function of the situation (or, equivalently, of the state): a given elementary action can be relevant in some situations but not in others. For this reason, the function to approximate becomes $Q(f_s(a),f_a(s))$, where there is a cross-dependency between the state defined as a function of the action, $f_s(a)$, and the action defined as a function of the state, $f_a(s)$.
The proposal we detail next solves this cross-dependency by working in the Cartesian product of the spaces of feature detectors and elementary actions combinations.

To formalize our proposal, we introduce some definitions.

We say that the agent perceives (or observes) a partial view of order $k$, $v(\mathrm{fd}_{i_1}, \ldots, \mathrm{fd}_{i_k})$, $k\leq n_f$, whenever the predicate $\mathrm{fd}_{i_1}\wedge \ldots \wedge \mathrm{fd}_{i_k}$ holds.3 Obviously, many partial views can be perceived at the same time.

At a given moment, the agent executes an action $ a$ that issues a different command for each one of the agent's motors $ a=\{ea_{1},\ldots,ea_{n_m}\}$, with $ n_m$ the number of motors.

A partial command of order $k$, denoted $c(ea_{i_1}, \ldots, ea_{i_k})$, $k\leq n_m$, is executed whenever the elementary actions $\{ea_{i_1}, \ldots, ea_{i_k}\}$ are executed simultaneously. We say that a partial command $c$ and an action $a$ are in accordance if $c$ is a subset of $a$. Note that the execution of a given action $a$ implies the execution of all the partial commands in accordance with it.

A partial rule $ w$ is defined as a pair $ w=(v,c)$, where $ v$ is a partial view and $ c$ is a partial command. We say that a partial rule $ w=(v,c)$ is active if $ v$ is observed, and that $ w$ is used whenever the partial view $ v$ is perceived and the partial command $ c$ is executed. A partial rule covers a sub-area of the Cartesian product of feature detectors and elementary actions and, thus, it defines a situation-action rule that can be used to partially determine the actions of the robot in many situations (all those where the partial view of the rule is active). The order of a partial rule is defined as the sum of the order of the partial view and the order of the partial command that compose the rule.
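Putting the definitions together, a minimal sketch (the names and encodings are ours, not the paper's) of a partial rule with its order, activation, accordance, and use tests might look like:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PartialRule:
    view: frozenset     # literals: i for fd_i, -i for its negation
    command: frozenset  # elementary actions, e.g. (motor_index, command)

    @property
    def order(self):
        # Order of the rule = order of the view + order of the command.
        return len(self.view) + len(self.command)

    def is_active(self, observation):
        # Active whenever the partial view is perceived.
        return all(observation[abs(lit) - 1] == (lit > 0) for lit in self.view)

    def in_accordance(self, action):
        # A partial command is in accordance with action a iff it is a subset of a.
        return self.command <= action

    def is_used(self, observation, action):
        # Used when the view is perceived and the command is executed.
        return self.is_active(observation) and self.in_accordance(action)

rule = PartialRule(frozenset({1, -2}), frozenset({(0, "fwd")}))
print(rule.order)  # 3
```

The subset test makes explicit that executing an action implies executing every partial command in accordance with it.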

We associate a quantity $ q_w$ to each partial rule. $ q_w$ is an estimation of the value (i.e., the discounted cumulative reward) that can be obtained after executing $ c$ when $ v$ is observed at time $ t$:

$\displaystyle q_w=\sum_{i=0}^{\infty}\gamma^{i} \: r_{t+i},$

with $ r_{t+i}$ the reward received by the learner at time step $ t+i$ after rule $ w$ is used at time $ t$. So, a partial rule can be interpreted as: if partial view $ v$ is observed then the execution of partial command $ c$ results in value $ q_w$.
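A small numeric check of the value definition above (the discounted cumulative reward collected from the step at which the rule is used):

```python
# Discounted cumulative reward: sum of gamma^i * r_{t+i} over the rewards
# observed after the rule is used at time t.

def discounted_return(rewards, gamma):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

# A single unit of reward two steps ahead is worth gamma^2 = 0.81
# (up to floating-point rounding):
print(discounted_return([0, 0, 1], 0.9))
```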

The objective of the learning process is that of deriving a set of partial rules and adjusting the corresponding $ q_w$ values so that the desired task can be properly achieved.

The apparent drawback of the partial-rule representation is that the number of possible partial rules is much larger than the number of state-action pairs: the number of partial rules that can be defined on a set of $n_f$ binary feature detectors and $n_m$ binary motors is $3^{n_f+n_m}$, while the number of different state-action pairs is ``only'' $2^{n_f+n_m}$. If arbitrary problems had to be confronted (as is the case in synthetic learning situations), the partial-rule approach would not be useful. However, problems confronted by robots are not arbitrary since, as mentioned, environments present regularities or properties (such as categorizability) that can be exploited to reduce the complexity of the controller necessary to achieve a given task.
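The counting argument can be checked by brute force for small sizes: in a partial rule each binary feature detector (or motor) can appear positively, appear negated, or be omitted (three options), while a complete state-action pair fixes each one to 0 or 1 (two options).

```python
from itertools import product

n_f, n_m = 3, 2
n = n_f + n_m

# Each of the n binary elements has 3 choices in a partial rule
# (positive, negated, omitted) and 2 in a state-action pair (0 or 1).
partial_rules = sum(1 for _ in product((0, 1, None), repeat=n))
state_action_pairs = sum(1 for _ in product((0, 1), repeat=n))

print(partial_rules)       # 243 = 3**5
print(state_action_pairs)  # 32 = 2**5
```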

Using the partial-rule framework, the categorizability assumption can be formally defined as:

Definition 1   We say that an environment/task is highly categorizable if there exists a set of low-order partial rules that allows us to predict the reward with the same accuracy as if statistics for each possible state-action combination were considered. The lower the order of the rules in the controller the higher the categorizability of the environment/task.

To the extent that the categorizability assumption is fulfilled, the number of partial rules necessary to control the robot becomes much smaller than the number of state-action pairs that can be defined using the same sets of feature detectors and elementary actions on which the partial views and partial commands are based. Additionally, categorizability implies that the rules necessary in the controller are mostly those of lower order, and this can be easily exploited to bias the search in the space of partial rules. So, if the environment is categorizable, the use of the partial-rule approach can yield a significant increase in learning speed and a reduction in memory use with respect to traditional non-generalizing reinforcement-learning algorithms.

In the following sections, we describe how it is possible to estimate the effect of an action given a fixed set of partial rules. This evaluation, repeated for all actions, is used to determine the best action to be executed at a given moment. Next, we detail how it is possible to adjust the value predictions of a fixed set of partial rules. Finally, we describe how the categorizability assumption allows us to use an incremental strategy in the generation of new partial rules. This strategy results in faster learning than existing generalizing and non-generalizing reinforcement-learning algorithms. All procedures are described in high-level form to make the explanation clearer. Details of their implementation can be found in Appendix A.

Value Prediction using Partial Rules

In a given situation, many partial views are simultaneously active, triggering a subset of the partial rules of the controller $C$. We call this subset the active partial rules and we denote it as $C'$. To evaluate a given action $a$, we only take into account the rules in $C'$ with a partial command in accordance with $a$. We denote this subset as $C'(a)$. Note that, in our approach, when we refer to an action, we mean the corresponding set of elementary actions (one per motor) and not a single element, as is generally the case in reinforcement learning.

Every rule $w=(v,c)$ in $C'(a)$ provides a value prediction for $a$: the $q_w$ associated with the partial rule. This is an averaged value that provides no information about the accuracy of the prediction. As also pointed out by [Wilson, 1995], we should favor the use of partial rules with high accuracy in value prediction or, as we put it, rules with high relevance.

It seems clear that the relevance of a rule ($\rho_w$) depends on the distribution of values around $q_w$. Distributions with low dispersion are indicative of coherent value predictions and, thus, of a highly relevant rule. To measure this dispersion, we maintain an error estimation $e_w$ on the approximation of $q_w$. Another factor (not used by [Wilson, 1995]) to be taken into account in determining relevance is the confidence in the $q_w$ and $e_w$ statistics: low-confidence (i.e., insufficiently sampled) measures of $q_w$ and $e_w$ should reduce the relevance of the rule. The confidence in the value prediction for a given rule ($c_w$) is a number in the interval $[0,1]$, initialized to 0 and increasing as the partial rule is used (i.e., the rule is active and its partial command is executed). The confidence decreases only if the value model for a given partial rule is consistently wrong.

Using the confidence, we approximate the real error in the value prediction for a partial rule $ w$ as

$\displaystyle \epsilon_w = e_w \: c_w + \overline{e} \: (1-c_w),$    

where value $ \overline{e}$ is the average error on the value prediction. Observe that the importance of $ \overline{e}$ is reduced as the confidence increases and, consequently, $ \epsilon_w$ converges to $ e_w$.

With the above definitions, the relevance of a partial rule can be defined as

$\displaystyle \rho_w = \frac{1}{1+\epsilon_w}.$    

Note that the exact formula for the relevance is not that important as long as $\epsilon_{w_1} \leq \epsilon_{w_2} \rightarrow \rho_{w_1} \geq \rho_{w_2}$. The above formula provides a value in the range $(0,1]$ that could be directly used as a scale factor, if necessary.
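The two formulas above can be stated directly in code (a sketch; the numeric values are made up for illustration):

```python
def rule_error(e_w, c_w, e_bar):
    """epsilon_w = e_w * c_w + e_bar * (1 - c_w): the rule's own error
    estimate blended with the average error, weighted by confidence."""
    return e_w * c_w + e_bar * (1 - c_w)

def relevance(epsilon_w):
    """rho_w = 1 / (1 + epsilon_w), a value in (0, 1]."""
    return 1.0 / (1.0 + epsilon_w)

# A barely-sampled rule (c_w ~ 0) falls back to the average error:
print(rule_error(e_w=0.1, c_w=0.0, e_bar=2.0))  # 2.0
# A well-sampled rule (c_w ~ 1) uses its own error estimate:
print(rule_error(e_w=0.1, c_w=1.0, e_bar=2.0))  # 0.1
```

As the comparison shows, a rule must earn a low $\epsilon_w$ (and hence a high relevance) through repeated use, not merely by happening to have a low $e_w$ estimate.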

The problem is then: how can we derive a single value prediction using the $q_w$ statistics of all the rules in $C'(a)$ and their corresponding relevance values, $\rho_w$? Two possible solutions come to mind: using a weighted sum of the values predicted by all these partial rules, with the relevance as a weighting factor, or using a competitive approach, in which only the most relevant partial rule is used to determine the predicted value. The weighted sum assumes a linear relation between the inputs (the value prediction provided by each individual rule) and the output (the value prediction for $a$). This assumption has proved powerful in many systems but, in general, it is not compatible with the categorizability assumption since, although each of the partial rules involved in the sum can be of low order, taking all of them into account means using a large set of different feature detectors and elementary actions to predict the effect of a given action. For this reason, our learning system uses a winner-take-all solution in which only the value prediction of the most relevant partial rule is taken into account to predict the value of an action. So, for each action we determine the winner rule

$\displaystyle w=\mathrm{winner}(C',a)=\arg \underset{w' \in C'(a)}{\max}\{\rho_{w'}\},$
and we use the range of likely values for this rule, $I_w = [ q_w - 2 \epsilon_w, q_w + 2 \epsilon_w ]$, to randomly determine the value prediction for action $a$. The probability distribution inside this interval depends on the distribution we assume for the value.

The procedure just outlined can be used at each time step to obtain a value prediction for each action. The action with the maximal value is the one we want the robot to execute next.

Observe that we obtain a probabilistic value prediction: in the same situation with the same statistics, we can get different value predictions for the same action. In this way, the action that obtains the maximal evaluation is not always the one with maximal $q_w$ and, consequently, we favor the exploration of promising actions. This probabilistic action selection provides an exploratory mechanism that uses more information than typical reinforcement-learning exploration mechanisms (the error and confidence of value predictions are not available in most reinforcement-learning algorithms), and the result is a more sophisticated exploration scheme (see [Wilson, 1996] for a survey of different exploration mechanisms in reinforcement learning).
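The evaluation and selection steps can be sketched as follows. This is our own illustration: we assume a uniform distribution inside $I_w$ (the paper leaves the choice of distribution open) and represent each rule by a $(q_w, \epsilon_w, \rho_w)$ tuple for brevity.

```python
import random

def predict_value(rules_in_accordance, rng=random):
    """Winner-take-all: sample a value from I_w of the most relevant rule."""
    q_w, eps_w, _ = max(rules_in_accordance, key=lambda r: r[2])
    return rng.uniform(q_w - 2 * eps_w, q_w + 2 * eps_w)

def best_action(actions, rules_for_action, rng=random):
    """Evaluate each action and return the one with the maximal sampled value.
    rules_for_action(a) must return the (q_w, eps_w, rho_w) tuples of C'(a)."""
    return max(actions, key=lambda a: predict_value(rules_for_action(a), rng))
```

Because each evaluation draws from an interval whose width grows with $\epsilon_w$, inaccurately predicted actions occasionally win the comparison, which is exactly the exploratory behavior described above.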

Partial Rules Value Adjustment

We adjust the value predictions for all the rules in $ C'(a)$ where $ a$ is the last executed action. For each rule to be adjusted, we have to update its $ q_w$, $ e_w$, and $ c_w$ statistics.

The effect of any action $a$ in accordance with the partial command $c$ of a partial rule $w=(v,c)$ can be defined (using a Bellman-like equation) as

$\displaystyle q^*_w=\overline{r}_w+\gamma \sum_{\forall C'}{p(w,C')\: v^*(C')},$    

where $ \overline{r}_w$ is the average reward obtained immediately after executing $ c$ when $ v$ is observed, $ \gamma$ is the discount factor used to balance the importance of immediate with respect to delayed reward, $ v^*(C')$ represents the goodness (or value) of the situation where rules $ C'$ are active, and $ p(w,C')$ is the probability of reaching that situation after the execution of $ c$ when $ v$ is observed. The value of a situation is assessed using the best action executable in that situation

$\displaystyle v^*(C')=\underset{\forall a'}{\max}\{q^*_w\vert w=winner(C',a')\},$    

since this gives us information about how well the robot can perform (at most) from that situation.

As in many existing reinforcement-learning approaches, the values of $q_w$ and $e_w$ for the rules to be adjusted are modified using a temporal-difference rule so that they progressively approach $q^*_w$ and the error on this measure. Rules that have a direct relation with the received reward provide a value prediction ($q_w$) coherent with the one actually obtained and, consequently, after the statistics adjustment, their prediction error decreases. Conversely, rules not related to the observed reward predict a value different from the obtained one, and their error statistics increase. In this way, if a rule is really important for the generation of the received reward, its relevance is increased and, if not, it is decreased. Rules with low relevance have few chances of being used to drive the robot and, in extreme cases, they can be removed from the controller.

The confidence $ c_w$ should also be adjusted. This adjustment depends on how the confidence is measured. If it is only related to the number of samples used in the $ q_w$ and $ e_w$ statistics, then $ c_w$ should be simply slightly incremented every time the statistics of rule $ w$ are updated. However, we also decrease the confidence if the value model for a given partial rule is consistently wrong (i.e., the value observed is systematically out of the interval $ I_w$).
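The adjustment step can be sketched as follows. This is a hedged illustration, not the paper's implementation: the learning rates `alpha` and `beta` are our own parameters, and we read "consistently wrong" as the observation falling outside $I_w$ on the current step (a fuller implementation would track this over several updates).

```python
def adjust_rule(q_w, e_w, c_w, observed_value, alpha=0.1, beta=0.05):
    """Temporal-difference updates for one rule's statistics."""
    # Move q_w toward the observed value.
    q_w += alpha * (observed_value - q_w)
    # Move e_w toward the (absolute) prediction error.
    e_w += alpha * (abs(observed_value - q_w) - e_w)
    # Confidence grows with use, shrinks when the observation
    # falls outside the likely-value interval I_w.
    inside = abs(observed_value - q_w) <= 2 * e_w
    c_w = min(1.0, c_w + beta) if inside else max(0.0, c_w - beta)
    return q_w, e_w, c_w
```

Applied to every rule in $C'(a)$ after each action, this drives the error of reward-related rules down (raising their relevance) and the error of unrelated rules up.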

Observe that our learning rule is equivalent to those used in state-based reinforcement-learning methods. For instance, in Q-learning [Watkins and Dayan, 1992], $ Q^*(s,a)$, with $ s$ a state and $ a$ an action, is defined as

$\displaystyle Q^*(s,a)=\overline{r}_{s,a}+\gamma \sum_{\forall s'}{p(s,a,s')\: V^*(s')},$

with $ p(s,a,s')$ the probability of a transition from $ s$ to $ s'$ when $ a$ is executed and

$\displaystyle V^*(s')=\underset{\forall a'}{\max}\{Q^*(s',a')\}.$

In our approach, the set of rules active in a given situation, $C'$, plays the role of a state and, thus, $v^*(C')$ and $V^*(s')$ are equivalent. On the other hand, we estimate $q_w^*$ instead of $Q^*(s,a)$, but the rule $w$ includes information about both the (partial) state and the (partial) action, making $q_w^*$ and $Q^*(s,a)$ play similar roles. The value prediction for a given rule, $q_w$, corresponds to the average of the value predictions for the cells of the Cartesian product of feature detectors and elementary actions covered by that rule. In the case of complete rules (i.e., rules involving all the feature detectors and actions for all motors), the sub-area covered by the rule includes only one cell of the Cartesian product and, therefore, if the controller only includes complete rules, the learning rule just described is exactly the same as that used in Q-learning. In this particular case, $C'(a)$ contains just one rule that, consequently, is the winner rule. The statistics for this rule are the same (and are updated in the same way) as those for the $Q(s,a)$ entry of the table used in Q-learning. Thus, our learning rule is a generalization of the learning rule normally used in reinforcement learning.

Controller Initialization and Partial Rule Creation/Elimination

Since we assume we are working in a categorizable environment, we can use an incremental strategy to learn an adequate set of partial rules: we initialize the controller with rules of the lowest order and we generate new partial rules only when necessary (i.e., for cases not correctly categorized using the available set of rules). So, the initial controller can contain, for instance, all the rules of order two that include one feature detector and one elementary action ($(v(\mathrm{fd}_i),c(ea_j))$ and $(v(\neg\,\mathrm{fd}_i),c(ea_j))$, $\forall i,j$). In any case, it is sensible to include the empty rule (the rule of order 0, $w_{\emptyset}$) in the initial controller. This rule is always active and it provides the average value and the average error in the value prediction. Additionally, any knowledge the user has about the task to be achieved can be easily introduced in the initial controller in the form of partial rules. If available, an estimation of the value predictions for the user-defined rules can also be included. If the hand-crafted rules (and their value predictions) are correct, the learning process will be accelerated. If they are not, the learning algorithm will take care of correcting them.
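The initialization just described can be sketched as follows (our own encoding: rules are `(view, command)` tuples of frozensets, views hold signed detector indices, and each binary motor `j` accepts commands 0 and 1):

```python
from itertools import product

def initial_controller(n_f, n_m):
    """Empty rule plus all order-two rules pairing one feature-detector
    literal (fd_i or its negation) with one elementary action."""
    rules = [(frozenset(), frozenset())]  # w_empty: always active
    for i, j in product(range(1, n_f + 1), range(n_m)):
        for literal in (i, -i):   # fd_i and not-fd_i
            for ea in (0, 1):     # the two commands of binary motor j
                rules.append((frozenset({literal}), frozenset({(j, ea)})))
    return rules

rules = initial_controller(n_f=3, n_m=2)
print(len(rules))  # 1 + 3*2 * 2*2 = 25
```

User-supplied task knowledge would simply be appended to this list as additional `(view, command)` pairs, optionally with initial value estimates.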

We create a new rule when a large error in the value prediction is detected. The new rule is defined as a combination of two of the rules in $C'(a)$, i.e., the rules that forecast the effects of the last executed action, $a$, in the current situation. When selecting a pair of rules to combine, we favor those with a value prediction close to the actually observed one, since they are likely to involve features and elementary actions (partially) relevant to the value prediction we are trying to refine.

The problem is that it is not possible to determine a priori whether an incorrectly predicted value would be correctly predicted after some rule adjustments or whether it is really necessary to create a new partial rule to account for the received reward. So, if we create new rules whenever there is a large error in the value prediction, it is possible to create unnecessary rules. The existence of (almost) redundant rules is not necessarily negative, since they provide robustness to the controller: the so-called degeneracy effect introduced by [Edelman, 1989]. What must be avoided is generating the same rule twice, since this is not useful at all. Two rules can be identical with respect to lexicographic criteria (they contain the same feature detectors and elementary actions) but also with respect to semantic ones (they become active in the same situations and propose equivalent actions). If identical rules are created, they have to be detected and removed as soon as possible. Preserving only the rules that prove to be useful keeps the number of rules in the controller from growing above a reasonable limit.
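A sketch of the creation step under our encoding (rules as `(view, command)` tuples of frozensets). For simplicity, only lexicographic duplicates are detected here; semantic equivalence would need activation statistics gathered over time.

```python
def combine(rule_a, rule_b):
    """Union of the parents' views and commands."""
    (v1, c1), (v2, c2) = rule_a, rule_b
    return (v1 | v2, c1 | c2)

def create_rule(active_rules, observed_value, controller, predict):
    """Combine two rules from C'(a), preferring parents whose prediction
    was close to the observed value; predict(rule) returns the rule's q_w."""
    ranked = sorted(active_rules,
                    key=lambda w: abs(predict(w) - observed_value))
    for a in ranked:
        for b in ranked:
            new = combine(a, b)
            # Skip trivial combinations and lexicographic duplicates.
            if new != a and new != b and new not in controller:
                return new
    return None  # every combination already exists
```

Combining parents rather than specializing a single rule keeps the growth in rule order gradual, which matches the bias toward low-order rules that categorizability justifies.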

Since we create new rules while there is a significant error in the value prediction, we could, if necessary, end up generating complete rules (provided we do not limit the number of rules in our controller). In this case, and assuming that the more specific the rule the more accurate the value prediction, our system would behave as a normal table-based reinforcement-learning algorithm: only the most specific rules (i.e., the most relevant ones) would be used to evaluate actions and, as explained before, the statistics for these rules would be exactly the same as those in table-based reinforcement-learning algorithms. Thus, in the limit, our system can deal with the same type of problems as non-generalizing reinforcement-learning algorithms. However, we regard this limit situation as very improbable, and we impose limits on the number of rules in our controllers. Observe that this asymptotic convergence to table-based reinforcement learning is only possible because we use a winner-take-all strategy in the action evaluation. With a weighted-sum strategy, the value estimation of the non-complete rules possibly present in the controller would be added to that of the complete rules, leading to an action evaluation different from that of table-based reinforcement-learning algorithms.


Footnote 3: A partial view can also include negations of feature detectors since the non-detection of a feature can be as relevant as its detection.
Josep M Porta 2005-02-17