Problem Formalization

For simplicity, we assume that the robot perceives its environment through a set of binary feature detectors[1], $FD=\{fd_i \: \vert \: i=1..n_f\}$. A feature detector can be seen as a process that identifies specific combinations of present (and possibly past) sensor readings. The use of feature detectors is very common in robotics. In this field, feature detectors are defined by the programmer, taking into account the characteristics of the environment, the robot sensors, and the task to be executed, in order to extract potentially useful information (presence of landmarks, obstacles, etc.) from raw sensor readings.
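As an illustration, a feature-detector set of this kind can be sketched as simple predicates over the raw sensor readings. The detector definitions and thresholds below are illustrative assumptions, not part of the formalization:

```python
# Sketch: binary feature detectors over raw sensor readings.
# The detector semantics and thresholds are invented for illustration.

def make_threshold_detector(sensor_index, threshold):
    """Feature detector that fires when one sensor reading exceeds a threshold."""
    def detector(readings):
        return readings[sensor_index] > threshold
    return detector

# A small feature-detector set FD = {fd_1, ..., fd_nf}.
FD = [
    make_threshold_detector(0, 0.5),     # e.g., "obstacle close on the left"
    make_threshold_detector(1, 0.5),     # e.g., "obstacle close on the right"
    lambda readings: readings[2] > 0.9,  # e.g., "landmark in view"
]

readings = [0.7, 0.2, 0.95]              # one snapshot of raw sensor values
active = [fd(readings) for fd in FD]     # the observed feature combination
print(active)  # [True, False, True]
```

The observed combination of active and inactive detectors is what plays the role of the state in the rest of the section.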

In a similar way, instead of working directly with the space of actions provided by the robot motors (which defines too low-level a way of controlling the robot), it is common practice to define a set of elementary actions $EA=\{ea_i \: \vert \: i=1..n_e\}$. An elementary action is a specific sequence or combination of motor commands defined by the programmer, taking into account the characteristics of the robot and the task to be achieved. To simplify, we can assume that elementary actions are of the form $(m_i \leftarrow k)$, with $i \in [1..n_m]$, where $m_i$ is a motor and $k$ a value in the range of valid inputs for motor $m_i$. This framework is quite flexible, since a motor $m_i$ can be either one of the physical motors of the robot or a high-level, abstract motor that combines movements of the actual motors. With this formalization, at any given moment the robot can execute in parallel as many elementary actions as it has available motors.
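The elementary actions $(m_i \leftarrow k)$ and their parallel execution, one per motor, can be sketched as follows; the motor names and input ranges are illustrative assumptions:

```python
# Sketch: elementary actions of the form (m_i <- k), executed in parallel,
# at most one per motor. Motor names and valid ranges are invented.

motors = {"left_wheel": (-1.0, 1.0), "right_wheel": (-1.0, 1.0)}

def elementary_action(motor, value):
    """Build the elementary action (motor <- value), checking the valid range."""
    lo, hi = motors[motor]
    assert lo <= value <= hi, "value outside the motor's valid input range"
    return (motor, value)

def execute(actions):
    """Apply one elementary action per motor simultaneously."""
    command = dict(actions)
    assert len(command) == len(actions), "at most one action per motor"
    return command

step = execute([elementary_action("left_wheel", 0.5),
                elementary_action("right_wheel", -0.5)])  # turn in place
print(step)  # {'left_wheel': 0.5, 'right_wheel': -0.5}
```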

A robot controller can be seen as a procedure that executes (combinations of elementary) actions in response to specific situations (i.e., activation of specific feature detectors) with the objective of achieving a given task. Reinforcement-learning approaches automatically define such a controller using the information provided by the reward signal. In the context of reinforcement learning, the controller is called the policy of the learner.
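A minimal sketch of such a controller, mapping situations (feature-detector activations) to combinations of elementary actions; the situations and motor commands here are invented for illustration, not learned:

```python
# Sketch: a reactive controller/policy as a table from situations
# (feature-detector activations) to parallel motor commands.

policy = {
    # (obstacle_left, obstacle_right) -> motor commands
    (False, False): {"left_wheel": 1.0, "right_wheel": 1.0},   # go straight
    (True,  False): {"left_wheel": 1.0, "right_wheel": 0.2},   # veer right
    (False, True):  {"left_wheel": 0.2, "right_wheel": 1.0},   # veer left
    (True,  True):  {"left_wheel": -0.5, "right_wheel": 0.5},  # turn away
}

situation = (True, False)          # obstacle detected on the left
print(policy[situation])           # {'left_wheel': 1.0, 'right_wheel': 0.2}
```

Reinforcement learning fills in a table like this one automatically, using only the reward signal.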

The objective of value-function-based reinforcement-learning algorithms (the most common kind of reinforcement-learning algorithm) is to predict the reward that can be obtained, directly or indirectly, from the execution of each action (i.e., each combination of elementary actions) in each possible situation, described as a combination of active and inactive feature detectors. If this prediction is available, the action to execute in each situation is the one from which maximum reward is expected.
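The greedy action choice described above can be sketched as follows; the Q-values, situations, and action names are made-up illustrations, not learned values:

```python
# Sketch: greedy action selection from a value estimate Q(s, a).
# States are feature-detector combinations; Q-values are invented.

Q = {
    (("fd1", True), ("fd2", False)): {"forward": 1.0, "turn": -0.2},
    (("fd1", False), ("fd2", True)): {"forward": -0.5, "turn": 0.8},
}

def greedy_action(state):
    """Pick the action with the maximum predicted reward in this state."""
    values = Q[state]
    return max(values, key=values.get)

print(greedy_action((("fd1", True), ("fd2", False))))  # forward
print(greedy_action((("fd1", False), ("fd2", True))))  # turn
```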

To predict the reward, classic reinforcement-learning algorithms rely on the Markov assumption, which requires the state signal to carry enough information to determine the effects of all actions in a given situation.[2] Additionally, non-generalizing reinforcement-learning algorithms assume that the states of the system must be learned about independently: the information gathered about the effects of an action $a$ in a given state $s$, denoted $Q(s,a)$, cannot be safely transferred to similar states or actions. Under this assumption, the cost of a reinforcement-learning algorithm in a general problem is

$\displaystyle \Omega(n_s \: n_a),$    

where $n_s$ is the number of states and $n_a$ is the number of actions. This is because each action has to be tried at least once in each state. Since the state is defined as the observed combination of feature detectors, the potential number of states is

$\displaystyle n_s= 2^{n_f},$    

with $ n_f$ the number of feature detectors. Consequently, we have that

$\displaystyle \Omega(n_s \: n_a) = \Omega(2^{n_f} \: n_a),$    

which is exponential in the number of feature detectors. Since the number of feature detectors used in robotic applications tends to be high, non-generalizing reinforcement learning becomes impractical for realistic problems. This is the well-known curse of dimensionality introduced by [Bellman, 1957], whose research presaged some of the work in reinforcement learning.
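A quick computation makes the blow-up concrete (the value of $n_a$ is illustrative):

```python
# The table-size blow-up from the equations above: n_s = 2**n_f states,
# so a non-generalizing learner must fill 2**n_f * n_a independent entries,
# each visited at least once.
n_a = 10                           # illustrative number of actions
for n_f in (5, 10, 20, 30):
    n_s = 2 ** n_f                 # potential number of states
    print(n_f, n_s * n_a)          # (state, action) entries to learn
```

Already with 30 feature detectors and 10 actions, more than $10^{10}$ independent entries would have to be visited.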

Although the size of the action set ($n_a$) is as important as the size of the state set ($n_s$) in the curse of dimensionality, less attention is paid to actions in the reinforcement-learning literature. However, a robot with many degrees of freedom can execute many elementary actions simultaneously, and this makes the cost of the learning algorithms increase exponentially with the number of motors of the robot ($n_m$) as well.

Suppose we address the same task with two different sets of feature detectors, $FD_1$ and $FD_2$, such that $FD_1 \subset FD_2$. Using a plain reinforcement-learning algorithm, the cost of finding a proper policy would be larger with the larger set of features ($FD_2$), even if one of the features in $FD_2-FD_1$ is more strongly correlated with the reward than any of the features in $FD_1$. Non-generalizing reinforcement-learning algorithms cannot take advantage of this situation: even with better input information, their performance decreases. A similar argument can be made for actions in addition to feature detectors.

Generalizing reinforcement-learning algorithms, such as those using gradient-descent techniques [Widrow and Hoff, 1960], coarse coding [Hinton et al., 1986], radial-basis functions [Poggio and Girosi, 1990], tile coding [Sutton, 1996], or decision trees [Chapman and Kaelbling, 1991; McCallum, 1995], can partially palliate this problem, since they can deal with large state spaces. However, as we approach complex realistic problems, the number of dimensions of the state space grows to the point of making some of these generalization techniques impractical, and other function-approximation techniques must be used [Sutton and Barto, 1998, page 209].
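As an example of the cited generalization techniques, here is a minimal one-dimensional tile-coding sketch. The parameters are illustrative, and real tile-coding implementations handle multiple dimensions and use hashing:

```python
# Sketch: 1-D tile coding. Several overlapping tilings map a continuous
# input to one active tile per tiling, so nearby inputs share most of
# their active tiles and value estimates generalize between them.

def active_tiles(x, n_tilings=4, tile_width=1.0):
    """Indices (tiling, tile) of the one active tile per tiling for input x."""
    tiles = []
    for t in range(n_tilings):
        offset = t * tile_width / n_tilings   # each tiling is shifted slightly
        tiles.append((t, int((x + offset) // tile_width)))
    return tiles

a = active_tiles(2.3)
b = active_tiles(2.6)
print(a)                    # [(0, 2), (1, 2), (2, 2), (3, 3)]
print(len(set(a) & set(b))) # 3 -> the two inputs share 3 of their 4 tiles
```

A value estimate stored as a sum of per-tile weights is thus updated for all inputs falling in the shared tiles at once, instead of independently per state.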

Adding relevant inputs or actions to a task should make the task easier, or at least not more difficult. Only methods whose complexity depends on the relevance of the available inputs and actions, and not on their number, can scale well to real-world problems. Examples of systems fulfilling this property are the Kanerva coding system presented by [Kanerva, 1988] and the random-representation method of [Sutton and Whitehead, 1993]. While those systems rely on large collections of fixed prototypes (i.e., combinations of feature detectors) selected at random, our proposal is to search for the appropriate prototypes, using a strong bias so that the search can be performed in a reasonable time. This bias is based on the categorizability assumption, a plausible assumption in the case of autonomous robots, which allows a large speed-up in the learning process. Additionally, existing systems do not address the problem of determining the relevance of actions, since they assume the learning agent has a single actuator (which is, obviously, the only relevant one). This simple setup is not adequate for robotics. In our approach (presented below), combinations of both feature detectors and elementary actions are considered within a unified framework.
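The fixed-random-prototype idea underlying Kanerva-style coding can be sketched as follows; the set sizes and the Hamming-distance activation radius are illustrative assumptions:

```python
# Sketch of the fixed-random-prototype idea (Kanerva-style): value estimates
# are attached to randomly chosen feature combinations (prototypes), and an
# observed state activates the prototypes close to it in Hamming distance.
# The number of prototypes, not 2**n_f, bounds the learner's complexity.
import random

random.seed(0)
n_f, n_prototypes, radius = 16, 50, 6    # illustrative sizes

prototypes = [tuple(random.randint(0, 1) for _ in range(n_f))
              for _ in range(n_prototypes)]

def hamming(u, v):
    return sum(a != b for a, b in zip(u, v))

def active_prototypes(state):
    """Prototypes within Hamming distance `radius` of the observed state."""
    return [p for p in prototypes if hamming(state, p) <= radius]

state = tuple(random.randint(0, 1) for _ in range(n_f))
print(len(active_prototypes(state)))     # a handful of active prototypes
```

In such systems the prototypes are fixed at random; the proposal above instead searches for appropriate prototypes under the categorizability assumption.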


Footnotes

[1] Non-binary feature detectors providing a discrete range of values can be readily binarized.

[2] Non-Markovian problems, when confronted, should first be converted into Markovian ones. How to do that is beyond the scope of this paper, although it is one of the most relevant points in achieving a successful real-world reinforcement-learning application.
Josep M Porta 2005-02-17