
## 3.3 Statistical Characterization of Subgroups

The next step in the proposed descriptive induction process starts from the discovered subgroups. In this step, statistical differences in distributions are computed for two populations, the target and the reference population. The target population consists of the true positive cases (CHD patients included in the analyzed subgroup), whereas the reference population consists of all available non-target class examples (all the healthy subjects).

Statistical differences in the distributions of all descriptors (attributes) between these two populations are tested using the χ² test at the 95% confidence level (p = 0.05). For this purpose, numerical attributes have been partitioned into up to 30 intervals so that every interval contains at least 5 instances. The attributes with significantly different distributions always include those that form the features describing the subgroups (the principal factors), but there are usually also other attributes with significantly different value distributions. These attributes are called supporting attributes, and the features formed from their values that are characteristic of the discovered subgroups are called supporting factors.
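The test described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: it assumes a Pearson chi-square test of homogeneity without continuity correction, it assumes that the "at least 5 instances" condition applies to the pooled target and reference values, and it uses a simple equal-frequency scheme to build the intervals. The function names `bin_edges`, `chi2_sf` and `compare_distributions` are hypothetical.

```python
import bisect
import math
from collections import Counter


def bin_edges(values, max_bins=30, min_count=5):
    """Cut points giving at most max_bins intervals, each holding
    at least min_count of the pooled values (assumed binning scheme)."""
    ordered = sorted(values)
    n = len(ordered)
    size = max(min_count, math.ceil(n / max_bins))
    edges = []
    i = size
    while i < n:
        # Keep equal values in the same interval.
        while i < n and ordered[i] == ordered[i - 1]:
            i += 1
        if n - i < min_count:  # remainder too small, keep it in the last interval
            break
        edges.append(ordered[i])
        i += size
    return edges


def chi2_sf(x, df):
    """Upper-tail probability of the chi-square distribution, computed from
    the series expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    n = 1
    while term > total * 1e-15:
        term *= z / (a + n)
        total += term
        n += 1
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def compare_distributions(target, reference, numeric=True, alpha=0.05):
    """Return (statistic, p_value, significant) for one attribute."""
    if numeric:
        edges = bin_edges(list(target) + list(reference))
        cell = lambda v: bisect.bisect_right(edges, v)  # interval index
    else:
        cell = lambda v: v  # categorical value is its own cell
    t_counts = Counter(cell(v) for v in target)
    r_counts = Counter(cell(v) for v in reference)
    cells = set(t_counts) | set(r_counts)
    n_t, n_r = len(target), len(reference)
    n = n_t + n_r
    stat = 0.0
    for c in cells:
        col_total = t_counts[c] + r_counts[c]
        for observed, row_total in ((t_counts[c], n_t), (r_counts[c], n_r)):
            expected = row_total * col_total / n
            stat += (observed - expected) ** 2 / expected
    df = len(cells) - 1
    if df == 0:  # a single cell carries no information
        return stat, 1.0, False
    p_value = chi2_sf(stat, df)
    return stat, p_value, p_value < alpha
```

In such a sketch, `target` would hold the attribute values of the CHD patients covered by one subgroup and `reference` the values of all healthy subjects. An attribute for which the third returned value is true would be a candidate supporting attribute, to be proposed to the expert.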

Supporting factors are important for obtaining pattern descriptions that are reasonably complete and acceptable for medical practice, because medical experts dislike short rules and prefer rules that include as much supporting evidence as possible (Kononenko, 1993).

In this work, the role of statistical analysis is to detect meaningful supporting factors, whereas the decision whether they will be used to support the user's confidence in the subgroup description is left to the expert. In the CHD application, the expert decided whether the proposed factors are indeed interesting, how reliable they are, and how easily they can be measured in practice. In Table 3, the expert-selected supporting factors are listed next to the individual CHD risk groups, each described by a list of principal factors.

| Subgroup | Principal Factors | Supporting Factors |
|---|---|---|
| A1 | positive family history; age over 46 years | psychosocial stress; cigarette smoking; hypertension; overweight |
| A2 | body mass index over 25 kg m⁻²; age over 63 years | positive family history; hypertension; slightly increased LDL cholesterol; normal but decreased HDL cholesterol |
| B1 | total cholesterol over 6.1 mmol L⁻¹; age over 53 years; body mass index below 30 kg m⁻² | increased triglycerides value |
| B2 | total cholesterol over 5.6 mmol L⁻¹; fibrinogen over 3.7 mmol L⁻¹; body mass index below 30 kg m⁻² | positive family history |
| C1 | left ventricular hypertrophy | positive family history; hypertension; diabetes mellitus |

Table 3: Induced subgroup descriptions (principal factors) and their statistical characterizations (supporting factors).
