Domain Redundancy

It is clear that windowing requires a *redundant* training set
in order to work effectively. Intuitively, we say a training set is
redundant if it contains more examples than are actually needed for
inducing the correct domain theory. If this is not the case, i.e., if
every example is needed for inferring a correct theory, windowing is
unlikely to work well, because eventually all examples have to be
added to the window and there is nothing to be gained.

A computable measure for the redundancy of a given domain would enable
us to evaluate the potential effectiveness of windowing in this
domain. Unfortunately, we do not know of many approaches that deal
with this problem. A notable exception is the work by
[45] where the use of the *conditional
population entropy (CPE)* is suggested for measuring the redundancy
of a domain. The CPE is defined as

$$\mathit{CPE} = \sum_{i=1}^{n_c} P(c_i) \sum_{a=1}^{n_a} \sum_{v=1}^{n_{v_a}} -P(x_{a,v} \mid c_i) \log_2 P(x_{a,v} \mid c_i)$$

where *n*_{c} is the number of classes, *n*_{a} is the number of
attributes, and *n*_{v_a} is the number of values for attribute
*a*. *c*_{i} stands for the *i*-th class and *x*_{a,v} represents the
*v*-th value of attribute *a*. One can interpret the formula as
computing the sum of the respective average entropies of the class
variable in one-level decision trees for predicting each attribute.
Møller normalizes the CPE as follows:

$$\mathit{Red} = 1 - \frac{\mathit{CPE}}{\sum_{a=1}^{n_a} \log_2 n_{v_a}}$$

The normalization factor is the maximum value that the CPE can assume, which is attained when the examples are distributed uniformly over the classes and over the attribute values within each class, i.e., when *P*(*c*_{i}) = 1/*n*_{c} and *P*(*x*_{a,v}|*c*_{i}) = 1/*n*_{v_a}. Using this normalization, redundancy is measured with a value between 0 and 1, where 1 means high redundancy and 0 means no redundancy.
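To make the measure concrete, the following sketch estimates the CPE and the normalized redundancy from relative frequencies. The function names and the toy data interface (a list of attribute-value tuples plus a parallel list of class labels) are our own illustration, not Møller's implementation:

```python
from collections import Counter
from math import log2

def cpe(examples, classes):
    """Estimate the conditional population entropy from relative frequencies.

    `examples` is a list of attribute-value tuples; `classes` is the
    parallel list of class labels.
    """
    n = len(examples)
    n_attrs = len(examples[0])
    total = 0.0
    for c, n_c in Counter(classes).items():
        p_c = n_c / n
        in_c = [x for x, y in zip(examples, classes) if y == c]
        for a in range(n_attrs):
            counts = Counter(x[a] for x in in_c)
            # entropy of attribute a's value distribution within class c
            total += p_c * sum(-(m / n_c) * log2(m / n_c)
                               for m in counts.values())
    return total

def redundancy(examples, classes):
    """Møller's normalized redundancy: 1 - CPE / (maximum CPE)."""
    n_attrs = len(examples[0])
    # maximum CPE: uniform value distribution for every attribute
    max_cpe = sum(log2(len({x[a] for x in examples}))
                  for a in range(n_attrs))
    return 1.0 - cpe(examples, classes) / max_cpe
```

For a set in which each attribute value is fully determined by the class, the CPE is 0 and the redundancy estimate is 1; for a single class with uniformly distributed attribute values, the estimate is 0.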

However, it is unclear how well this measure corresponds to our intuitive notion of redundancy. For example, two example sets of a domain that differ only in that the second contains each example of the first set twice would receive exactly the same redundancy estimate, because all relative frequencies from which the CPE is estimated remain unchanged. Intuitively, however, the second example set should be more redundant than the first.
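This criticism is easy to verify: every probability in the CPE is estimated by a relative frequency, and duplicating each example leaves all such frequencies unchanged. The compact estimator and toy dataset below are our own illustration:

```python
from collections import Counter
from math import log2

def cpe_estimate(examples, classes):
    # compact frequency-based estimate of the conditional population entropy
    n = len(examples)
    total = 0.0
    for c in set(classes):
        members = [x for x, y in zip(examples, classes) if y == c]
        p_c = len(members) / n
        for a in range(len(examples[0])):
            counts = Counter(x[a] for x in members)
            total += p_c * sum(-(m / len(members)) * log2(m / len(members))
                               for m in counts.values())
    return total

examples = [(0, 1), (1, 0), (1, 1), (0, 0)]
classes  = [0, 0, 1, 1]
# doubling every example changes no relative frequency, hence no estimate
assert abs(cpe_estimate(examples, classes)
           - cpe_estimate(examples * 2, classes * 2)) < 1e-12
```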

A radically different approach to defining redundancy (in a somewhat
different context) was undertaken by
[33]: They call
a training set *n*-saturated iff there is no subset of size
*n* whose removal would cause a different theory to be learned. They
propose to use the notion of *n*-saturation for approximating the
concept of *saturation*, which intuitively means that the
training set contains enough examples for inducing the correct target
hypothesis [33]. This notion
corresponds quite closely to our intuitive concept of
redundancy. However, the computational complexity of testing for
*n*-saturation can be quite high.
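The source of this cost is apparent from a direct implementation of the definition: testing *n*-saturation requires relearning after removing every size-*n* subset, i.e., C(|E|, *n*) calls to the learner. The sketch below, including the toy "learner", is our own illustration:

```python
from itertools import combinations

def is_n_saturated(examples, learn, n, same_theory):
    """Brute-force n-saturation test: no size-n subset may be removable
    without changing the learned theory.  Examines C(len(examples), n)
    subsets, each requiring a full learning run."""
    reference = learn(examples)
    for removed in combinations(range(len(examples)), n):
        removed = set(removed)
        rest = [x for i, x in enumerate(examples) if i not in removed]
        if not same_theory(learn(rest), reference):
            return False
    return True

# toy "learner": the induced theory is simply the set of observed classes
learn = lambda data: {c for _, c in data}
same = lambda a, b: a == b
# removing ("x3", 1) would lose class 1, so this set is not 1-saturated
is_n_saturated([("x1", 0), ("x2", 0), ("x3", 1)], learn, 1, same)
```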

More importantly, the notion of saturation as defined above does not take into account that different regions of the example space may have different degrees of redundancy. We have already seen in Table 1 that often the rules in a target theory have different degrees of coverage. Consequently, a randomly chosen training set will typically contain more examples that are covered by a high-coverage rule of the target theory than examples that are covered by a low-coverage rule. In other words, the subset of the examples from which the high-coverage rules are learned can already be redundant, while the subset from which the low-coverage rules should be learned does not yet contain enough examples for inducing the correct rules. In Gamberger and Lavrač's notion of saturation, such a training set would be considered non-saturated.

Windowing tries to exploit these different degrees of redundancy in a training set. If some parts of the example space are already covered by correct rules, no more examples of these regions will be added to the window. We have already discussed that this skewing of the example distribution is more appropriate for separate-and-conquer rule learning algorithms than for divide-and-conquer decision tree learning algorithms because the former learn each rule independently. In the next section, we will discuss a new windowing algorithm for rule learning systems that aims at further exploiting this property.
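For reference, the generic windowing loop discussed so far can be sketched as follows. This is a schematic illustration, not the new algorithm of the next section; the parameter names and the plug-in `learn`/`covers` interface are our own:

```python
import random

def window_learn(examples, learn, covers, init_size=100, max_add=50):
    """Generic windowing loop: learn on a subwindow, then add (up to
    max_add) examples the current theory misclassifies, until the
    theory is consistent with the complete training set.

    `learn` induces a theory from a window; `covers(theory, x)` tests
    whether the theory classifies example x correctly.
    """
    window = random.sample(examples, min(init_size, len(examples)))
    while True:
        theory = learn(window)
        # uncovered examples outside the window drive the next iteration;
        # regions already covered by correct rules contribute nothing more
        mistakes = [x for x in examples
                    if x not in window and not covers(theory, x)]
        if not mistakes:
            return theory
        window.extend(mistakes[:max_add])
```

In a redundant training set, the loop typically terminates with a window much smaller than the full example set; in the worst case, the window grows to contain all examples and nothing is gained, as noted above.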