Domain Redundancy

Next: Integrative Windowing Up: A Closer Look at Previous: Rule Learning versus Decision

Domain Redundancy

It is clear that windowing requires a redundant training set in order to work effectively. Intuitively, we say a training set is redundant, if it contains more examples that are actually needed for inducing the correct domain theory. If this is not the case, i.e., if every example is needed for inferring a correct theory, windowing is unlikely to work well, because eventually all examples have to be added to the window and there is nothing to be gained.

A computable measure for the redundancy of a given domain would enable us to evaluate the potential effectiveness of windowing in this domain. Unfortunately, we do not know of many approaches that deal with this problem. A notable exception is the work by [45] where the use of the conditional population entropy (CPE) is suggested for measuring the redundancy of a domain. The CPE is defined as

$\begin{displaymath}CPE = - \sum_{i=1}^{n_c} p(c_i) \sum_{a=1}^{n_a}\sum_{v=1}^{n_{v_a}} p(x_{a,v}\vert c_i) \log p(x_{a,v}\vert c_i) \end{displaymath}$

(1)

where n_c is the number of classes, n_a is the number of attributes and n_{v_a} is the number of values for attribute a. c_i stands for the i-th class and x_a,v represents the v-th value of attribute a. One can interpret the formula as computing the sum of the respective average entropies of the class variable in one-level decision trees for predicting each attribute. Møller normalizes the CPE as follows:

$\begin{displaymath}Red = 1 - \frac{CPE}{\sum_{a=1}^{n_a} \log n_{v_a}} \end{displaymath}$

(2)

The normalization factor is the maximum value that the CPE can assume, which is when $p(x_{a,v}\vert c_i) = \frac{1}{n_{v_a}}$ and $p(c_i) = \frac{1}{n_c}$ . Using this normalization, redundancy can be measured with a value between 0 and 1, 1 meaning high redundancy, 0 meaning no redundancy.

However, it is unclear how this measure corresponds to our intuitive notion of redundancy. For example, two example sets of a domain that differ only in the fact that the second contains each example of the first set twice, would have exactly the same redundancy estimate. Intuitively, however, the second example set should be more redundant than the first.

A radically different approach to defining redundancy (in a somewhat different context) was undertaken by [33]: They call a training set n-saturated, iff there is no subset of size n whose removal would cause a different theory to be learned. They propose to use the notion of n-saturation for approximating the concept of saturation, which intuitively means that the training set contains enough examples for inducing the correct target hypothesis [33]. This notion corresponds quite closely to our intuitive concept of redundancy. However, the computational complexity of testing for n-saturation can be quite high.

More importantly, the notion of saturation as defined above does not take into account that different regions of the example space may have different degrees of redundancy. We have already seen in Table 1 that often the rules in a target theory have different degrees of coverage. Consequently, a randomly chosen training set will typically contain more examples that are covered by a high-coverage rule of the target theory than examples that are covered by a low-coverage rule. In other words, the subset of the examples from which the high-coverage rules are learned can already be redundant, while the subset from which the low-coverage rules should be learned does not yet contain enough examples for inducing the correct rules. In Gamberger and Lavrac's notion of saturation, such a training set would be considered non-saturated.

Windowing tries to exploit these different degrees of redundancy in a training set. If some parts of the example space are already covered by correct rules, no more examples of these regions will be added to the window. We have already discussed that this skewing of the example distribution is more appropriate for separate-and-conquer rule learning algorithms than for divide-and-conquer decision tree learning algorithms because the former learn each rule independently. In the next section, we will discuss a new windowing algorithm for rule learning systems that aims at further exploiting this property.

Next: Integrative Windowing Up: A Closer Look at Previous: Rule Learning versus Decision