The goal of the subgroup discovery algorithm SD, outlined in
Figure 1, is to search for rules that maximize
*q*_{g} = *TP*/(*FP* + *g*),
where *TP* are true positives, *FP* are false
positives, and *g* is a *generalization parameter*. High quality
rules cover many target class examples and a low number of non-target
examples. The number of tolerated non-target class cases, relative to
the number of covered target class cases, is determined by parameter
*g*. For low *g* (*g* ≤ 1), induced rules will have high
specificity (a low false alarm rate), since covering every single
non-target class example is made relatively 'expensive'. On the
other hand, by selecting a high *g* value (*g* > 10 for small
domains), more general rules will be generated, covering also
non-target class instances.
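The trade-off controlled by *g* can be sketched in a few lines of Python; the function name is illustrative, and the quality measure is the *q*_{g} = *TP*/(*FP* + *g*) heuristic discussed throughout this section:

```python
def q_g(tp, fp, g):
    """Rule quality q_g = TP / (FP + g): the generalization
    parameter g prices each covered non-target (false positive)
    example relative to the covered target examples."""
    return tp / (fp + g)

# The same rule (TP = 40, FP = 5) under increasingly tolerant g:
for g in (0.5, 1, 10, 100):
    print(f"g = {g:5}: q_g = {q_g(40, 5, g):.3f}")
```

With *g* = 0.5 a rule gains more by avoiding a single false positive than by covering several extra positives, while with *g* = 100 the *FP* term barely matters and more general rules score best.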

Varying the value of *g* enables the expert to guide subgroup
discovery in the *TP*/*FP* space, in which *FP* (plotted on the
*X*-axis) needs to be minimized and *TP* (plotted on the *Y*-axis)
needs to be maximized. The *TP*/*FP* space is similar to the ROC
(Receiver Operating Characteristic) space
(Provost & Fawcett, 2001). The comparison of the ROC and *TP*/FP
space and the *q*_{g} heuristic are analyzed in detail in
Sections 2.4 and 4, respectively.

Algorithm SD takes as its input the complete training set *E* and the
feature set *L*, where features
are logical conditions
constructed from attribute values describing the examples in *E*. For
discrete (categorical) attributes, features have the form
*Attribute* = *value* or *Attribute* ≠ *value*;
for numerical attributes they have the form
*Attribute* > *value* or *Attribute* < *value*. To formalize
feature construction, let values *v*_{ix} (*x* = 1..*k*_{ip}) denote the
*k*_{ip} different values of attribute *A*_{i} that appear in the
positive examples and *w*_{iy} (*y* = 1..*k*_{in}) the *k*_{in}
different values of *A*_{i} appearing in the negative examples. A set
of features *L* is constructed as follows:

- For discrete attributes *A*_{i}, features of the form *A*_{i} = *v*_{ix} and *A*_{i} ≠ *w*_{iy} are generated.
- For continuous attributes *A*_{i}, similar to Fayyad & Irani (1992), features of the form *A*_{i} < (*v*_{ix} + *w*_{iy})/2 are created for all neighboring value pairs (*v*_{ix}, *w*_{iy}), and features *A*_{i} > (*v*_{ix} + *w*_{iy})/2 for all neighboring pairs (*w*_{iy}, *v*_{ix}).
- For integer valued attributes *A*_{i}, features are generated as if *A*_{i} were both discrete and continuous, resulting in features of four different forms: *A*_{i} < (*v*_{ix} + *w*_{iy})/2, *A*_{i} > (*v*_{ix} + *w*_{iy})/2, *A*_{i} = *v*_{ix}, and *A*_{i} ≠ *w*_{iy}.
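The construction of the feature set *L* for a single attribute can be sketched in Python. The helper name and the string encoding of features are illustrative; midpoints follow the neighboring-pair rule above:

```python
def make_features(attr, pos_values, neg_values, kind):
    """Generate candidate features (as strings) for one attribute.
    kind is 'discrete', 'continuous', or 'integer'."""
    feats = []
    if kind in ("discrete", "integer"):
        feats += [f"{attr} = {v}" for v in sorted(set(pos_values))]
        feats += [f"{attr} != {w}" for w in sorted(set(neg_values))]
    if kind in ("continuous", "integer"):
        # Label each distinct value by the class it appears in, then
        # place a split point between neighboring pos/neg values.
        pts = sorted({(v, "p") for v in pos_values} |
                     {(w, "n") for w in neg_values})
        for (a, la), (b, lb) in zip(pts, pts[1:]):
            mid = (a + b) / 2
            if la == "p" and lb == "n":    # pair (v_ix, w_iy)
                feats.append(f"{attr} < {mid}")
            elif la == "n" and lb == "p":  # pair (w_iy, v_ix)
                feats.append(f"{attr} > {mid}")
    return feats
```

For example, `make_features("Age", [30, 50], [40], "continuous")` yields `["Age < 35.0", "Age > 45.0"]`: only class boundaries produce split points, which keeps *L* small compared to splitting between every pair of values.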

There is no theoretical upper value for the user-defined *g*
parameter, but in practice it should not exceed
the number of training examples. For instance, suggested *g* values in
the Data Mining Server are in the range between 0.1 and
100, for analysing data sets of up to 250 examples. The choice of *g*
should be adjusted both to the size of the data set and to the
proportion of positive examples in the set.

Algorithm SD has two additional parameters which are typically not
adjusted by the user. The first is *min_support* (its default value is
defined in terms of *P*, the number of target class examples in *E*),
which indirectly defines the minimal number of target class examples
that must be covered by every subgroup. The second is *beam_width*
(default value is 20), which defines the number of solutions kept in
each iteration. The output of the algorithm is a set *S* of
*beam_width* different rules with highest *q*_{g} values. The rules
have the form of conjunctions of features from *L*.

The algorithm initializes all the rules in *Beam* and *New_Beam* with
empty rule conditions. Their quality values *q*_{g}(*i*) are set to zero
(step 1). Rule initialization is followed by an infinite loop (steps
2-12) that stops when, for all rules in the beam, it is no longer
possible to further improve their quality. Rules can be improved only
by conjunctively adding features from *L*. After the first iteration,
a rule condition consists of a single feature, after the second
iteration up to two features, and so forth. The search is systematic
in the sense that for all rules in the beam (step 3) all features from
*L* (step 4) are tested in each iteration. For every new rule,
constructed by conjunctively adding a feature to the rule body (step 5),
quality *q*_{g} is computed (step 6). If the support of the new rule
is greater than *min_support*, and if its quality *q*_{g} is greater
than the quality of any rule in *New_Beam*, the worst rule in
*New_Beam* is replaced by the new rule. The rules are reordered in
*New_Beam* according to their quality *q*_{g}. At the end of each
iteration, *New_Beam* is copied into *Beam* (step 11). When the
algorithm terminates, the first rule in *Beam* is the rule with
maximum *q*_{g}.
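The loop described above can be sketched in Python. The representation (rules as tuples of feature names, features as coverage sets) and all names are illustrative assumptions, and the relevance and duplicate filtering of step 7 is omitted for brevity:

```python
def sd_beam_search(examples, features, g, min_support, beam_width=20):
    """Sketch of Algorithm SD's beam search.
    examples: list of (example_id, is_positive) pairs.
    features: dict mapping feature name -> set of ids it covers."""
    pos = {e for e, is_pos in examples if is_pos}
    neg = {e for e, is_pos in examples if not is_pos}
    universe = pos | neg

    def quality(covered):                  # q_g = TP / (FP + g)
        return len(covered & pos) / (len(covered & neg) + g)

    # Step 1: both beams start as beam_width copies of the empty rule.
    beam = [((), universe, 0.0)] * beam_width
    improved = True
    while improved:                        # steps 2-12
        improved = False
        new_beam = list(beam)
        for conds, covered, _ in beam:             # step 3
            for f, f_cov in features.items():      # step 4
                if f in conds:
                    continue
                new_cov = covered & f_cov          # step 5
                q = quality(new_cov)               # step 6
                new_beam.sort(key=lambda r: r[2])  # worst rule first
                if (len(new_cov & pos) / len(examples) > min_support
                        and q > new_beam[0][2]):   # step 7 (relevance omitted)
                    new_beam[0] = (conds + (f,), new_cov, q)
                    improved = True
        new_beam.sort(key=lambda r: r[2], reverse=True)  # step 8: reorder
        beam = new_beam                    # step 11: copy back into Beam
    return beam                            # best rule first
```

The loop terminates because a rule is only ever replaced by a strictly better one and there are finitely many candidate rules, so beam quality cannot improve forever.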

A necessary condition (in step 7) for a rule to be included in
*New_Beam* is that it must be *relevant*. The new rule is
irrelevant if there exists a rule *R* in *New_Beam* such that the true
positives of the new rule are a subset of the true positives of *R* and
the false positives of the new rule are a superset of the false positives of
*R*. A detailed analysis of relevance, presented by
Lavrač et al. (1998), is outside the main scope of this paper. After
the new rule is included in *New_Beam*, it may happen that some of the
existing rules in *New_Beam* become irrelevant with respect to this
new rule. Such rules are eliminated from *New_Beam* during its
reordering (in step 8). The testing of relevance ensures that *New_Beam*
contains only different and relevant rules.
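The relevance test follows directly from this definition. In the sketch below, rules are represented simply by their true-positive and false-positive example sets (an illustrative encoding):

```python
def is_irrelevant(new_tp, new_fp, beam_rules):
    """A new rule is irrelevant if some existing rule R covers at
    least its true positives with at most its false positives.
    Each rule in beam_rules is a (tp_set, fp_set) pair."""
    return any(new_tp <= r_tp and new_fp >= r_fp
               for r_tp, r_fp in beam_rules)
```

For instance, a candidate with *TP* = {1} and *FP* = {5, 6} is irrelevant next to a beam rule with *TP* = {1, 2} and *FP* = {5}, since the beam rule dominates it on both counts.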

In Algorithm SD, the rule quality measure *q*_{g} serves two purposes:
first, rule evaluation, and second, the evaluation of features and their
conjunctions that have high potential for the construction of high quality
rules in subsequent iterations. The analysis of this quality measure
in Section 4 shows that for the first purpose, a
measure assigning different costs to false positives and false
negatives could perform equally well, but for the purpose of guiding
the search the *q*_{g} measure is advantageous.