Windowing versus Random Sub-sampling

Of course, the intuitive arguments for the constant-time learning behavior of windowing could equally well apply to the decision tree learning method. However, the experimental evidence seems to show that in that case the learning time is at least linear in the number of examples. Why does windowing work better with rule learning algorithms than with decision tree learning algorithms?

To understand this behavior, let us take a closer look at the characteristics of windowing. Contrary to random sub-sampling, which will retain the distribution of examples of the original training set in the subsample, windowing purposefully tries to skew this example distribution by adding only examples that are misclassified by the current theory. As an example, consider the KRK illegality domain, where the task is to discriminate legal from illegal positions in a king-rook-king chess endgame, given features that compare the coordinates of two pieces and determine whether they are equal or adjacent, or whether one is between the others.

rule	examples	perc.
`A = C, B = D.`	4,096	1.563%
`C = E, D = F.`	4,032	1.538%
`adjacent(A,E), adjacent(B,F).`	30,072	11.472%
`C = E, A \= C.`	22,932	8.748%
`D = F, B \= D.`	22,932	8.748%
`C = E, not between(B,D,F).`	1,456	0.555%
`D = F, not between(A,C,E).`	1,456	0.555%

Table 1 shows the body of 7 rules that form a correct theory for illegal KRK positions in a first-order logic representation. The head of each rule is the literal illegal(A,B,C,D,E,F). Next to the rules we give the number of different positions that are covered by the rule, along with their percentage out of the total of 262,144 different examples. About a third of the positions are illegal.⁴ One can see that rules 3, 4, and 5 each cover more than 20,000 examples. They are usually easily found from small random samples of the dataset. On the other hand, rules 1 and 2, and in particular 6 and 7 cover a significantly smaller number of examples. In order to obtain enough examples for learning these rules, one has to take a much larger random sample of the data.⁵ This problem is closely related to the small disjuncts problem discussed by [34].

How does windowing deal with this situation? Recall that it starts with a small random subsample of the available training data, and successively adds examples that are misclassified by the previously learned theory to this window. By doing so it skews the distribution of examples in a way that increases the proportion of examples covered by rules that are hard to learn, thereby decreasing the proportion of examples covered by rules that are easy to learn. Thus for each individual rule in the target theory, windowing tries to identify a minimum number of examples from which the rule can be learned.