We show the results of applying our learning algorithm to two robotics-like simulated problems: robot landmark-based navigation and legged robot walking. The first problem is simpler (although it includes more delayed reward) and we use it to clearly describe the workings of the algorithm. The second problem approaches a realistic robotic application, our objective in the long term. We use the two examples to compare the performance of our learning system with that of generalizing and non-generalizing reinforcement-learning algorithms. The confronted problems are different enough to show the generality of the proposed learning system.
We confront a simple simulated landmark-based navigation task in the forest-like environment shown in Figure 1. The objective for the learner is to go from the start position (marked with a cross at the bottom of the figure) to the goal position where there is the food (marked with a cross at the top right corner of the environment). The agent can neither walk into the lake nor escape from the depicted terrain.
The agent can make use of some binary landmark (i.e., feature) detectors to identify its position in the environment and to decide which action to execute next. In the example, the landmark detectors of the agent are:
Of these detectors, only the first 5 are relevant for the task. The water detector is always active, and the rest of landmark detectors become active at random. These 10 landmark detectors can differentiate between situations.
We simplify the problem by clustering the possible positions of the learner in the environment in 7 areas (shown in Figure 1): each area includes the positions from which the same set of relevant landmarks can be seen.
As far as actions is concerned, we use three actions for the West-East movement of the robot: move to the West (denoted as ), stay in the same place (), move to the East (). The other three indicate movement along the North-South dimension (move to the North , stay in the same latitude , move to the South ). These two independent groups of three actions can be combined giving rise to 9 different actions (move North-West, North, North-East, etc.). We assume that when the agent executes one of these actions, it does not stop until the nearest area of the terrain in the direction of the movement is reached. When the agent tries to move into the lake or out of the terrain, it remains in the same position it was. Figure 1 shows all the possible transitions between contiguous areas in the environment.
With the just described landmark detectors and elementary actions the maximum possible order of a given rule is 12, and we can define up to ( ) syntactically different partial rules. Only taking into account all the rules with one feature detector and one elementary action (that are the ones initially included in the controller) we have different partial rules.
The agent only receives reward (with value ) when it reaches the goal. Consequently, this is a problem with delayed reward since the agent must transmit the information provided by the reward signal to those actions and situations not directly related with the observation of reward.
The parameters of the partial-rule learning algorithm we used for this task were , , , , , and, (see Appendix A for a detailed description of the parameters). Observe that, with a maximum number of partial rules and an initial controller containing rules, little room is left for the generation of rules with order higher that 2.
The learning is organized in a sequence of trials. Each trial consists in placing the learner in the starting position and letting it move until the goal is reached, allowing the execution of at most 150 actions to reach the goal. When performing optimally, only three actions are required to reach the objective from the starting position.
Figure 2 shows that, after 40 learning trials, the agent approaches the optimal behavior (represented by the flat dashed line at ).
The dashed line in Figure 2 is the performance of a XCS in this problem. To perform this test, we used the implementation of Wilson's XCS developed by [Butz, 1999]. To make XCS work in the same search space as the partial-rule algorithm, we modified the XCS implementation to be able to deal with non-binary actions. No other modification, but parameter adjustment, were introduced in the original code. The results presented here corresponds to the average of 10 runs using the set of parameters that gave a better result. Nominally, these parameters were: learning rate , decay rate , maximum number of classifiers (however, the initial set is empty), the genetic algorithm is applied in average every 5 time steps, the deletion experience is 5, the subsume experience is 15, the fall off rate is 0.1, the minimum error 0.01, a prediction threshold of 0.5, the crossover probability is 0.8, the mutation probability 0.04 and the initial don't care probability . The prediction and the fitness of new classifiers are initialized to 10 and the error to 0. A detailed explanation of the meaning of those parameters is provided by [Wilson, 1995] and also by the comments in the code of [Butz, 1999].
We can see that the XCS reaches the same performance of the partial-rule approach, but using about four times more trials. This difference in performance is partially explained by XCS's lack of generalization in the action space. However this factor is not that relevant in this case since the action space has only two dimensions. The main factor that explains the better performance of the partial-rule approach is the bias introduced by the categorizability assumption that is not present in the XCS system and that, in this case, allows a more efficient learning process. XCS in more powerful than our partial-rule approach in the sense that XCS makes no assumption about the categorizability of the environment, while we assume it as high. The result of the XCS learning process includes the identification of the degree of categorizability of the environment and in our case this is, in some sense, pre-defined. The generality of the XCS, however, produces a slower learning process.
If we initialize the classifiers in a XCS with a high don't care probability and we initialize the rules in the partial-rule algorithm so that no generalization is used in the action space (i.e., if all rules include a command for each motor), then the two systems become closer. In this case, the main (but not the only) difference between the two approaches is the assumption on the relation between the inputs and the value: While XCS assumes a linear relation, we assume the environment to be categorizable, or, what is the same, we assume the value to depend on only few of the inputs. Due to this difference, when confronted to the same problem, the two systems would learn the same policy and the same values for each action, but the values would be computed using different rules with different associated values, and this is so independently of the parameter/rule initialization used in each case. The system with a smaller learning time would be that with an assumption closer to the reality. The results obtained in the particular example presented above show that the categorizability assumption is more valid and our hypothesis is that this would be the case in most robotics-like applications.
Table 1 shows the evaluation of some actions in the different situations the agent encounters on its path from the start to the goal after 50 learning trials. Analyzing this trace, we can extract some insight about how the partial-rule learning algorithm works.
For instance, at time step 1, we see that rule is used to determine the value of action . Since the landmark detector Water is always active, this rule is equivalent to , which is one of the rules used to generate . If we examine the statistics for we find that and . Obviously, the value distributions of and look different (74.70 vs. 79.73 and 15.02 vs. 2.19). This is because has been generated on later stages of the learning and, thus, its statistics have been updated using a subsample of the values used to adjusts the statistics of . In this particular case, has been updated 250 times while has been updated only 27 times. As learning continues, both distributions will become more similar and rule will be eventually eliminated.
In Table 1, we can see that sometimes there are some non-optimal actions that get an evaluation close to the optimal ones. For this reason, the agent executes, some times, non-optimal actions and this increases the number of steps necessary to reach the goal. In general, the adjustment of the statistics of the rules can solve this problem but, in this particular case, we need to create new rules to fix the situation. For instance, at time step 2, when the value for rule is increased towards 90, the value of rules active at this time step and proposing actions in accordance with the action of rule also converge toward 90. So, in the long term, a rule proposing just action can get a value close to 90. In the absence of more specific rules, this rule can be used to estimate the value of an action such as and, due to the probabilistic nature of the action selection procedure, this action can, eventually, be executed delaying the agent from reaching the goal by 1 time step. However, the execution of results in an error in the value prediction and, thus, in the creation of new rules to better characterize the situation. As soon as a specific rule for action is generated, the error is no longer repeated.
At time step 3, we see that rule has a value of 100 with error 0 but the guess for this rule is . This is because the maximum confidence () is lower than ( in this case) and this makes the agent to keep always a certain degree of exploration.
If the agent only receives reward when the task is totally achieved, the function value for each situation can be computed as with the distance (in actions) from situation to the target one and the reward finally obtained. In Table 1, we can see that situations get the correct evaluation: for , for , and for .
Observe that this problem can be solved using only 200 partial rules out of the possible situation-action combinations in this domain. So, we can say that the problem is certainly categorizable. The main conclusion we can extract from this toy example is that, in a particular case in which the confronted problem was categorizable, the presented algorithm has been able to determine the relevant rules and to adjust their values (including the effect of the delayed reward) so that the optimal action can be determined for each situation.
We also applied our algorithm to the task of learning to generate an appropriate gait (i.e., the sequence of steps) for a six-legged robot (Figure 3). To apply the learning algorithm to the real robot would be possible, but dangerous: in the initial phases of the learning the robot would fall down many times damaging the motors. For this reason we used a simulator during the learning and, afterward, we applied the learned policy to the real robot.
The problem of learning to walk with a six legged robot has been chosen by many authors before as a paradigmatic robotic-learning problem. For instance, [Maes and Brooks, 1990] implemented a specific method based on immediate reward to derive the preconditions for each leg to perform the step. [Pendrith and Ryan, 1996] used a simplified version of the six-legged walking problem to test an algorithm able to deal with Non-Markovian spaces of states and [Kirchner, 1998] presented a hierarchical version of Q-learning to learn the low-level movements of each leg, as well as a coordination scheme between the low-level learned behaviors. [Ilg et al., 1997] introduced a learning architecture based on self-organizing neural networks, and [Kodjabachia and Meyer, 1998] proposed an evolutionary strategy to develop a neural network to control the gait of the robot. [Vallejo and Ramos, 2000] used a parallel genetic algorithm architecture and [Parker, 2000] described an evolutionary computation where the robot executes the best controller found up to a given moment while a new optimal controller is computed in an off-line simulation. All these algorithms are usually tested on flat terrain with the aim of generating periodic gaits (i.e., gaits where the sequence of steps is repeated cyclically). However, for general locomotion (turns, irregular terrain, etc) the problem of free gait generation needs to be considered.
Our simulator (see Figure 3) allows the controller to command each leg of the robot in two independent degrees of freedom (horizontal and vertical) and it is able to detect when the robot is in an unstable position (in our robot this happens when two neighboring legs are in the air simultaneously). Using this simulator, we implemented the behaviors described by [Celaya and Porta, 1996] except those in charge of the gait generation. Therefore, the task to be learned consists in deciding at every moment which legs must step (that is, leave the ground and move to an advanced position), and which must descend or stay on the ground to support and propel the body.
We defined a set of 12 feature detectors that, due to our experience on legged robots, we knew could be useful in different situations for the gait-generation task:
Attending to the activation and non-activation of these 12 feature detectors, we can differentiate between 4096 different situations.
On the action side, we work with two different elementary actions per leg: one that issues the step of the leg and another that descends the leg until it touches the ground. Thus, the cardinality of the set of elementary actions is 12 and, at each time step, the robot issues an action containing 6 elementary elements (one per leg). Thus, we can think of each leg as a virtual motor that accepts two possible values, 0 to remain in contact with the ground and 1 to perform the step.
The reward signal includes two aspects:
The most efficient stable gait is the tripod gait in which two sets of three non-adjacent legs step alternately. Using this gait, the robot would obtain a reward of 0 (when one group of three legs are lifted and advanced) followed by a reward of 50 (when the legs in contact with the ground move backward as a reaction to the advance of legs moved in the previous time step). Thus, the optimal average reward is 25.
In the experiments, the robot is set in an initial posture with all the legs in contact with the ground but in a random advance position.
Figure 4 shows results of applying the partial-rule algorithm compared with those obtained using standard Q-learning with 4096 distinct states and 64 different actions.
For the partial-rule algorithm, we used the following set of parameters: , , , , , and, (see Appendix A for a description of these parameters). For Q-learning, the learning rate is set to and we use an action selection rule that performs exploratory actions with probability .
In Figure 4, we can see that the stability subproblem (i.e., not falling down, which corresponds to getting a reward greater than zero) is learned very quickly. This is because, in the stability subproblem, we can take advantage of the generalization provided by using separate elementary actions and, with a single rule, we can avoid executing several dangerous actions. However, the advance subproblem (i.e., getting a reward close to 25) is learned slowly. This is because little generalization is possible and the learning system must generate very specific rules. In other words, this sub-problem is less categorizable than the stability one.
As in the landmark-based navigation example discussed in the previous section, we observe that the controller contains some (slightly) overly general rules that are responsible for the non optimal performance of the robot. However, we don't regard this as a problem since we are more interested in efficiently learning a correct enough policy for the most frequent situations than in finding optimal behaviors for all particular cases.
Figure 5 shows the performance of Q-learning over a longer run using different exploration rates. This shows that Q-learning can eventually converge to an optimal policy but with many more iterations than our approach (about a factor of 10). Observe that a lower exploration rate allows the algorithm to achieve higher performance (around 19 with a learning rate of 0.1 and around 24 with learning rate 0.01) but using a longer period. With a careful adjustment of the exploration rate we can combine an initial faster learning with a better convergence in the long term. Experiments with Q-learning using learning rates other than 0.5 showed insignificant differences compared to the results shown here.
The advantage of our algorithm over non-generalizing ones is increased in problems in which some of the sensors provide information not related to the task. To test this point, we set up an experiment in which 6 feature detectors that become active randomly were added to the 12 initial ones. With these new features, the number of possible combinations of feature activations increases, and so does the number of states considered by Q-learning. Figure 6 shows the comparison between our algorithm and Q-learning for this problem. Q-learning is not able to learn a reasonable gait strategy in the 5000 time steps shown in the figure, while the performance of the partial-rule algorithm is almost the same as before. This means that the partial-rule algorithm is able to detect which sets of features are relevant and use them effectively to determine the robot's behavior. It is remarkable that, in this case, the ratio of memory used by our algorithm with respect to that used by non-generalizing algorithms is below 0.2%. This exemplifies how the performance of the non-generalizing algorithms degrades as the number of features increases, while this is not necessarily the case using the partial-rule approach.
The importance of the generation of partial rules in the improvement of the categorization can be seen comparing the results obtained for the same problem with and without this mechanism (Figure 7). The results show that the task cannot be learned using only partial rules of order 2. The only aspect of the gait-generation problem that can be learned with rules of order 2 is to avoid lifting a leg if one of its neighboring legs is already in the air. For instance, a rule such as
Rules with order higher than 2 (i.e., not provided to the robot in the initial controller) are necessary, for instance, to avoid raising two neighboring legs simultaneously. A rule like
In Figure 8, we can see the performance of the algorithm when we start the learning process from a correct rule set (i.e., a rule set learned in a previous experiment), but with all the statistics initialized to 0. In this experiment, we can compare the complexity of learning only the values for the rules compared with the complexity of learning the rules and their value at the same time. We can see that when only the values for the rules need to be learned the learning process is about two times faster than in the normal application of the algorithm.
In a final experiment, we issue frequent changes in the heading direction of the robot (generated randomly every 10 time steps). In this way, periodic gaits become suboptimal and the controller should produce a free gait, i.e., a gait that includes a sequence of steps without any periodic repetition.
In this case, we focus on the advance subproblem and, thus, we introduced some hand-crafted rules to the initial controller to prevent the robot from falling down. These rules are of the form:
The set of parameters we used in this case was: , , , , , and, .
Figure 9 shows the average results obtained using the partial-rule learning algorithm compared with those obtained with our best hand-coded gait-generation strategy. In the figure, the horizontal dashed line shows the average performance using the best gait-generation strategy we have implemented [Celaya and Porta, 1998]. It can be seen that the learned gait-generation strategy (the increasing continuous line) produces a performance similar to that of our best hand-coded strategy and that, in some cases, it even outperforms it. Figure 10 shows a situation where a learned controller produces a better behavior than our hand coded one. Using the hand-coded strategy, the robot starts to walk raising two legs (3 and 6) and, in few time steps it reaches a state from which the tripod gait is generated. Initially, leg 2 is more advanced than legs 1 and 4 and, in general, it is suboptimal to execute a step with a leg when its neighboring legs are less advances that itself. In this particular case however, this general rule does not hold. The learned strategy detects this exception and generates the tripod gait from the very beginning resulting in a larger advance of the robot in the initial stages of the movement.
Josep M Porta 2005-02-17