Of course, the intuitive arguments for the constant-time learning behavior of windowing could equally well apply to the decision tree learning method. However, the experimental evidence seems to show that in that case the learning time is at least linear in the number of examples. Why does windowing work better with rule learning algorithms than with decision tree learning algorithms?

To understand this behavior, let us take a closer look at the characteristics of windowing. Unlike random subsampling, which retains the example distribution of the original training set in the subsample, windowing purposefully skews this distribution by adding only examples that are misclassified by the current theory. As an example, consider the KRK illegality domain, where the task is to discriminate legal from illegal positions in a king-rook-king chess endgame, given features that compare the coordinates of two pieces and determine whether they are equal or adjacent, or whether one lies between the other two.
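The feature set described above can be illustrated with a minimal sketch. This is not the original implementation; the function and feature names are our own, and only a few of the pairwise tests are spelled out:

```python
# Sketch of the KRK feature representation: a position is six coordinates
# (file and rank of the white king, white rook, and black king), and the
# features are equality, adjacency, and betweenness tests on coordinate
# pairs. Names and the selection of tests shown are illustrative.

def adjacent(x, y):
    """True if two coordinates differ by exactly one."""
    return abs(x - y) == 1

def between(x, y, z):
    """True if coordinate x lies strictly between coordinates y and z."""
    return y < x < z or z < x < y

def features(a, b, c, d, e, f):
    """Some example features for a position with white king at (a, b),
    white rook at (c, d), and black king at (e, f)."""
    return {
        "wking_rook_same_file": a == c,
        "rook_file_between_kings": between(c, a, e),
        "kings_adjacent_file": adjacent(a, e),
        # ... analogous equality / adjacency / between tests
        # for the remaining coordinate pairs
    }
```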

Table 1 shows the bodies of the 7 rules that form a
correct theory for illegal KRK positions in a first-order logic
representation. The head of each rule is the literal `illegal(A,B,C,D,E,F)`. Next to the rules we give the number of
different positions covered by each rule, along with their
percentage out of the total of 262,144 different examples. About a
third of the positions are illegal.^{4} One can
see that rules 3, 4, and 5 each cover more than 20,000 examples. They
are usually easily found from small random samples of the dataset. On
the other hand, rules 1 and 2, and in particular rules 6 and 7, cover a
significantly smaller number of examples. To obtain enough
examples for learning these rules, one has to draw a much larger
random sample of the data.^{5} This problem is closely related to the
*small disjuncts problem* discussed by
[34].
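A back-of-the-envelope calculation makes the sampling argument concrete. The total of 262,144 examples and the figure of 20,000 covered examples are from the text; the small coverage value used for a rare rule below is an illustrative placeholder, not a number from Table 1:

```python
# Expected random-sample size needed to see a given number of examples
# of a rule, as a function of how many of the 262,144 positions it covers.
import math

TOTAL = 262_144  # total number of KRK positions (from the text)

def sample_size_for(coverage, wanted):
    """Sample size whose expected number of examples covered by a rule
    with the given coverage is at least `wanted` (uniform sampling)."""
    return math.ceil(wanted * TOTAL / coverage)

# A frequent rule covering 20,000 positions: ~10 covered examples are
# expected already in a sample of a couple of hundred.
print(sample_size_for(20_000, 10))  # 132

# A hypothetical rare rule covering only 100 positions needs a sample
# two orders of magnitude larger for the same expectation.
print(sample_size_for(100, 10))     # 26215
```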

How does windowing deal with this situation? Recall that it starts
with a small random subsample of the available training data, and
successively adds examples that are misclassified by the previously
learned theory to this window. By doing so it *skews* the
distribution of examples in a way that increases the proportion of
examples covered by rules that are hard to learn, thereby decreasing
the proportion of examples covered by rules that are easy to
learn. Thus for each individual rule in the target theory, windowing
tries to identify a minimum number of examples from which the rule can
be learned.
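The loop just described can be sketched as follows. This is a minimal sketch under our own assumptions (a generic learner passed in as a function, a fixed cap on examples added per iteration), not the paper's exact procedure:

```python
# Windowing: learn from a small window, then repeatedly add training
# examples that the current theory misclassifies, skewing the window
# toward the hard-to-learn rules.
import random

def windowing(examples, learn, classify, init_size=100, max_add=50):
    """examples: list of (instance, label) pairs;
    learn: window -> theory; classify: (theory, instance) -> label.
    Returns a theory consistent with all training examples."""
    window = random.sample(examples, min(init_size, len(examples)))
    while True:
        theory = learn(window)
        misclassified = [e for e in examples
                         if classify(theory, e[0]) != e[1]]
        if not misclassified:
            return theory  # consistent with the full training set
        in_window = {id(e) for e in window}
        new = [e for e in misclassified if id(e) not in in_window]
        if not new:
            # the learner cannot even fit its own window
            raise RuntimeError("learner inconsistent with window")
        # skew the distribution: add only misclassified examples
        window.extend(new[:max_add])
```

With a trivial learner that memorizes the positive instances in its window, the loop keeps adding the misclassified positives until the theory covers them all, which mirrors how windowing accumulates just enough examples of each rare rule.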