Figure 1: A DC obtained on the XD6 domain with WIDC(p). The first
three rules exactly encode the target concept, and the irrelevant variable is absent from the DC.
In the XD6 domain, each example has 10 binary
variables. The tenth is irrelevant in the strongest sense [John et al.1994]. The
target concept is a 3-DNF (a DNF with each monomial containing at most
three literals) over the first nine variables.
Such a formula is typically hard to encode using a small decision tree. In
our experiments with WIDC(o) and WIDC(p), we observed that the target formula
itself is almost always an element of the classifier built, and that the irrelevant
attribute is always absent. Figure 1 shows an example of a DC
obtained on a run of WIDC. Note that the concept returned is a 3-DC.
Figure 2 depicts part of a tree obtained on this domain with C4.5.
The tree is quite large for the domain, and it also contains the irrelevant variable,
which enlarges the tree and makes it harder to mine.
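As a minimal sketch of the domain, the snippet below enumerates the noise-free XD6 instance space and checks which variables are irrelevant. The target formula itself did not survive in the text above, so the sketch assumes the form usually given for XD6 in the literature, (x1 ∧ x2 ∧ x3) ∨ (x4 ∧ x5 ∧ x6) ∨ (x7 ∧ x8 ∧ x9); the function names are illustrative and not taken from WIDC.

```python
from itertools import product

def xd6_target(x):
    """Assumed XD6 concept: a 3-DNF over the first nine of ten binary variables."""
    return bool((x[0] and x[1] and x[2])
                or (x[3] and x[4] and x[5])
                or (x[6] and x[7] and x[8]))

# Enumerate the full noise-free instance space: 2**10 = 1024 observations.
observations = list(product((0, 1), repeat=10))
labels = {x: xd6_target(x) for x in observations}

def is_irrelevant(index):
    """True if flipping variable `index` never changes the label."""
    for x in observations:
        flipped = x[:index] + (1 - x[index],) + x[index + 1:]
        if labels[x] != labels[flipped]:
            return False
    return True

# Indices (0-based) of the variables the assumed concept does not depend on.
irrelevant = [i for i in range(10) if is_irrelevant(i)]
positives = sum(labels.values())
print(irrelevant, positives)
```

Under this assumption only the tenth variable (index 9) is irrelevant, which is the property a learner would need to discover in order to leave it out of the classifier.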
On many other domains, we observed rules or subconcepts that persisted across
the 10 cross-validation runs. As with XD6, whenever we could mine
the results with sufficiently accurate knowledge of the domain, these
patterns proved the most interesting. For example, the DCs obtained on the LEDeven domain
contained most of the time a combination of two rules with one literal each, which represented
a very accurate way to classify 9 out of the 10 possible classes. On the
Vote0 and Vote1 domains, we also observed constant patterns, some
of which are well known [Blake et al.1998] to provide a very accurate classification
at a very small size. Even for Vote1 where classical studies
often report errors over , and almost never around [Holte1993], we
observed on most of the runs a DC containing an accurate rule with two literals only,
with which WIDC(p) provided on average an error under .
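One simple way to spot such persistent rules is to count how many cross-validation runs each rule appears in. The sketch below is only an illustration: the rule encoding (a set of literals paired with a class) and the toy committees are hypothetical and do not reproduce the DCs actually induced.

```python
from collections import Counter

def persistent_rules(committees, min_runs):
    """Return the rules that occur in at least `min_runs` of the given runs."""
    counts = Counter()
    for committee in committees:
        # set() ensures a rule is counted at most once per run.
        counts.update(set(committee))
    return {rule: n for rule, n in counts.items() if n >= min_runs}

# Hypothetical committees from three runs; a rule is (literals, class).
runs = [
    [(frozenset({"a"}), "even"), (frozenset({"b", "c"}), "odd")],
    [(frozenset({"a"}), "even")],
    [(frozenset({"a"}), "even"), (frozenset({"c", "b"}), "odd")],
]

in_every_run = persistent_rules(runs, min_runs=3)
in_two_runs = persistent_rules(runs, min_runs=2)
print(in_every_run)
```

Storing literals in a `frozenset` makes the comparison independent of the order in which the literals of a monomial are written.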
Figure 2: Part of a DT obtained on the XD6 domain with C4.5. Positive literals label the internal nodes.
To classify an observation, the left edge of a node is followed when the observation contains (``Yes'') the positive literal,
and the right edge is followed otherwise (i.e. the literal is negative in the observation). The bold square
marks the presence of the irrelevant variable in the tree. A naive conversion of this tree into rules for both classes generates 30 rules, for a total of 179 literals.
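The naive conversion mentioned in the caption can be sketched as follows: every root-to-leaf path becomes one rule, so the number of rules equals the number of leaves and the number of literals equals the sum of the leaf depths. The tree encoding and the toy tree are assumptions made for illustration; they are not the C4.5 tree of Figure 2.

```python
def tree_to_rules(tree, path=()):
    """Turn every root-to-leaf path into one rule (literals, class)."""
    if isinstance(tree, str):          # a leaf holds a class label
        return [(path, tree)]
    variable, yes_branch, no_branch = tree
    # Left ("Yes") edge: positive literal; right edge: negated literal.
    return (tree_to_rules(yes_branch, path + ((variable, True),))
            + tree_to_rules(no_branch, path + ((variable, False),)))

# Hypothetical tree: (variable, subtree if present, subtree if absent).
toy_tree = ("x1", ("x2", "+", "-"), ("x10", "+", "-"))

rules = tree_to_rules(toy_tree)
n_rules = len(rules)
n_literals = sum(len(literals) for literals, _ in rules)
print(n_rules, n_literals)
```

Because every internal node is repeated in each rule that passes through it, the literal count grows much faster than the node count, which is why even a moderately sized tree yields a large rule set.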
WIDC was also compared to C4.5 on a real-world domain in which mining issues are as crucial
as classification strength: agriculture. An experiment is being carried out in Martinique by the
DDAF (Departmental Direction of Agriculture and Forest) to better understand the
behavior of farmers, in particular their willingness to sign a CTE (Farming Territorial Contract).
Usual farming contracts with either the state (France) or Europe did not contain commitments for the
farmer to satisfy. In a CTE, each farmer commits to adapting and/or changing his agricultural techniques
or productions, to ensure sustainable development for local agriculture. In exchange,
he is guaranteed financial help for this contract and training in new
agricultural techniques. Such a domain is a good test bed for evaluating a method on the basis of
predictability and interpretability, because of the place of uncertainty in agriculture and the
fact that obtaining data can be a long and hard task: the DDAF has to be as accurate as
possible in its predictions and interpretations, to manage its relationships
with farmers as well as possible and, in the case of CTEs, to run the best promotion campaign for these new
contracts. Agriculture is also very sensitive to a ``showcase effect'': once even a few
representative farmers have subscribed to the contracts, comparatively many others are likely to follow.
In this study, from the description of 52 variables for about 60 representative farmers satisfying the criteria to adhere to a CTE, the aim is to develop models for those who are actually willing to adhere, those not willing to adhere, and those currently uncertain. The variables describe each farm (size, terrain nature, financial data, type of production, etc.), as well as more personal data on the farmers (education, family status, objectives, personal answers to a questionnaire, etc.). This is a small dataset to mine but, interestingly, C4.5 and WIDC(p) produced different results on it.
We ran both algorithms in a 10-fold stratified cross-validation experiment. WIDC(p) obtained a average error. In 6 out of 10 runs, the same DC was induced; it is presented in Figure 3. Basically, this DC shows that predicting the ``not adhere'' class is the easiest task, followed by the prediction of the ``adhere'' class. The ``?'' class (uncertain farmers) is predicted only by the default vector. This seems rather natural: whereas the extreme behaviors tend to be easy to determine, uncertainty is the hardest to predict.
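For readers unfamiliar with the protocol, a stratified split keeps the class proportions roughly equal in every fold. The sketch below is a deterministic illustration only: the class counts are hypothetical (the text gives just "about 60" farmers), and an actual experiment would normally shuffle the examples within each class first.

```python
from collections import defaultdict

def stratified_folds(labels, k=10):
    """Assign each example index to one of k folds, class by class."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        # Round-robin within a class: per-fold class counts differ by at most 1.
        for index in by_class[label]:
            folds[position % k].append(index)
            position += 1
    return folds

# Hypothetical class distribution for 60 farmers.
labels = ["adhere"] * 25 + ["not adhere"] * 20 + ["?"] * 15
folds = stratified_folds(labels, k=10)

# Each fold serves once as the test set; the other nine form the training set.
fold_sizes = [len(fold) for fold in folds]
print(fold_sizes)
```

With a dataset this small each test fold holds only a handful of farmers, so stratification matters: without it, some folds could contain no uncertain farmer at all.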
C4.5 (default parameters) induced a DT which was almost the exact transcription of rule 1, a rule stating that farmers with no education (without any agricultural diploma or traineeship) and no ongoing project are not willing to adhere. This rule is interesting mostly because it shows that education is a strong factor determining the ``adhere'' answer. The DTs induced also contained one or two more literals separating the ``adhere'' and ``?'' classes (average error: ), but little else could be mined from the trees of C4.5 in the light of the problem addressed.
Rule 2 in Figure 3 had no equivalent in the DTs induced. What it says is interesting for the DDAF, because it leads to the following conclusion: farmers without ongoing projects who do not sell their products only to a wholesaler are on the knife edge for their membership (either in ``adhere'' or in ``not adhere''). Without going further into local agricultural considerations, this rule, for the DDAF engineers, gives an accurate picture of the farmers who actually control their operating costs: they are either for or against CTEs, and education pushes them towards membership (combination of rules 1 and 2), probably because it lets them see the future potential benefits of the contract better than its current constraints.
Figure 3: The DC obtained on the agricultural data (see text for the interpretation of the variables).
©2002 AI Access Foundation and Morgan Kaufmann Publishers. All rights reserved.