The task of expert-guided subgroup discovery addressed in this work differs slightly from the subgroup discovery task defined in Section 1 and proposed by Klösgen (1996) and Wrobel (1997). Instead of defining an optimal measure for automated subgroup search and selection, the goal here is to support the expert in performing a flexible and effective search over a broad range of optimal solutions. Consequently, the decision of which subgroups to select for the final solution is left to the expert. The task of the subgroup discovery algorithm is to enable the detection of rules describing potentially optimal subgroups, which are characterized by the property that they cover many target class cases (patients with coronary heart disease, in the example domain used in this work) and exclude all, or most, non-target class cases (healthy subjects). Target class cases included in a subgroup are called true positives, while non-target class cases incorrectly included in a subgroup are called false positives.
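To make the two counts concrete, the following minimal sketch tallies true positives and false positives for a single rule. The attribute names (`age`, `cholesterol`), the thresholds, the class labels, and the six records are hypothetical and chosen only for illustration; they are not taken from the data set used in this work.

```python
from typing import Callable, Dict, List, Tuple

# One example is a mapping from attribute name to value (hypothetical schema).
Example = Dict[str, float]
Rule = Callable[[Example], bool]


def count_tp_fp(rule: Rule, examples: List[Example],
                labels: List[str], target: str) -> Tuple[int, int]:
    """Return (true positives, false positives) for the subgroup a rule describes."""
    tp = fp = 0
    for example, label in zip(examples, labels):
        if rule(example):          # the example falls into the subgroup
            if label == target:
                tp += 1            # target class case covered by the rule
            else:
                fp += 1            # non-target class case covered by the rule
    return tp, fp


# Hypothetical records; values are invented for illustration only.
examples = [
    {"age": 62, "cholesterol": 7.1},
    {"age": 55, "cholesterol": 6.5},
    {"age": 48, "cholesterol": 6.8},
    {"age": 58, "cholesterol": 6.2},
    {"age": 40, "cholesterol": 5.0},
    {"age": 65, "cholesterol": 5.5},
]
labels = ["CHD", "CHD", "CHD", "healthy", "healthy", "healthy"]


def example_rule(e: Example) -> bool:
    # A hypothetical conjunctive rule, not one induced in this work.
    return e["age"] > 50 and e["cholesterol"] > 6.0


tp, fp = count_tp_fp(example_rule, examples, labels, target="CHD")
print(tp, fp)
```

A rule is a better subgroup candidate the larger its first count and the smaller its second; how the two are traded off is exactly the choice left to the expert.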
The particular expert-guided subgroup discovery task addressed in this work assumes that the expert and the data analyst collaborate in repeatedly running a subgroup discovery algorithm, with the goal of finding rules describing population subgroups which:
In each iteration, the task of the subgroup discovery algorithm is to suggest one or more potentially optimal solutions. Section 2.2 describes the heuristic search algorithm SD, which can be used to construct many rules that are optimal with respect to an expert-selected generalization parameter. Since many of the induced rules can be very similar, both in terms of their coverage and the selected features, the RSS algorithm described in Section 2.3 can be used to select a small number of distinct rules that are offered to the expert as potentially optimal solutions. Alternatively, subgroup discovery can be implemented within a 'weighted' covering algorithm DMS, as is the case in the publicly available Data Mining Server (Gamberger & Smuc, 2001), which generates up to three best subgroups in every iteration.
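The idea of reducing many similar rules to a few distinct ones can be illustrated with a simple greedy filter. The sketch below is not the RSS algorithm of Section 2.3 and not the DMS covering scheme; it is a stand-in that assumes rules are already ranked by quality and measures similarity only by the Jaccard overlap of the example sets they cover. The function names, rule names, coverage sets, and threshold are all hypothetical.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Overlap of two covered-example sets: |a & b| / |a | b|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_distinct(coverage: Dict[str, FrozenSet[int]], ranked: List[str],
                    max_rules: int, max_overlap: float) -> List[str]:
    """Greedily keep rules, best first, whose coverage differs from all kept rules."""
    selected: List[str] = []
    for name in ranked:
        if len(selected) >= max_rules:
            break
        # Keep the rule only if it is sufficiently different from every kept rule.
        if all(jaccard(coverage[name], coverage[kept]) <= max_overlap
               for kept in selected):
            selected.append(name)
    return selected


# Hypothetical rules and the indices of the examples each one covers.
coverage = {
    "r1": frozenset({1, 2, 3, 4}),
    "r2": frozenset({1, 2, 3, 5}),   # nearly the same subgroup as r1
    "r3": frozenset({6, 7, 8}),
    "r4": frozenset({6, 7, 8, 9}),   # nearly the same subgroup as r3
    "r5": frozenset({10, 11}),
}
ranked = ["r1", "r2", "r3", "r4", "r5"]  # assumed ordering by rule quality

print(select_distinct(coverage, ranked, max_rules=3, max_overlap=0.5))
```

With these inputs the filter skips `r2` and `r4` because each overlaps heavily with an already kept rule, so the expert would be shown three rules describing clearly different subgroups.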