Next: Our Contribution Up: Inducing Interpretable Voting Classifiers Previous: Inducing Interpretable Voting Classifiers

Introduction

Recent advances in the study of voting classification algorithms have brought empirical and theoretical results clearly showing the discrimination power of ensemble classifiers [Bauer Kohavi1999,Breiman1996,Dietterich2000,Opitz Maclin1999,Schapire Singer1998]. These methods basically rely on voting the decision of individual classifiers inside an ensemble. It is widely accepted, and formally proven in certain cases [Schapire et al.1998,Schapire Singer1998], that their power actually relies on the ability to build potentially very large classifiers. It has even been observed experimentally that such an ensemble can sometimes be as large as (or larger than) the data used to build the ensemble [Margineantu Dietterich1997] ! Then, a simple question arises, namely what is the interest a customer can have in using such a classifier, instead of simple lookups in the data, and using algorithms such as nearest neighbor classifiers [Margineantu Dietterich1997] ?
After some of the most remarkable recent studies in voting classification algorithms, some authors have pointed out the interest to bring this classification power to data mining, and more precisely to make interpretability a clear issue in voting classification algorithms [Bauer Kohavi1999,Ridgeway et al.1998]. Some authors go even further, and argue that the importance of interpretability has been marginalized in the design of these algorithms, and put behind the need to devise classifiers with strong classification power [Ridgeway et al.1998]. But interpretability also governs the quality of a model by providing answers to how it is working, and, most importantly, why. According to Bauer $\&$ Kohavi (1999), striving for comprehensibility in voting models is one of the principal problems requiring future investigations. They also remark that ``voting techniques usually result in incomprehensible classifiers that cannot easily be shown to users''.
Comprehensibility is, on the other hand, a hard mining issue [Buja Lee2001] : it depends on parameters such as the type of classifiers used, the algorithm inducing the classifiers, the user mining the outputs, etc. . Though the quantification of interpretability is still opened in the general case [Buja Lee2001], there are some clues coming from theory and practice of machine learning/data mining indicating some potentially interesting requirements and compromises to devise an efficient learning/mining algorithm.
A first requirement for the algorithm is obviously its generalization abilities: without classification strength, it is pointless to search for interesting models of the data. A second requirement, more related to mining, is the size of the classifiers [Nock Gascuel1995,Nock Jappy1998]. If accurate, a classifier with restricted size can lead to faster and deeper understanding. This is obviously not an absolute rule, rather an approximate proxy for interpretability : pathologic cases exist in which, for example, a large and unbalanced tree can be very simple to understand [Buja Lee2001]. Note that in this example, the authors explain that the tree is simple because all its nodes can be described using few clauses. Therefore, simplicity is also associated to a short description, but using a particular class of concept representation.
A third parameter influencing comprehensibility is the nature of the algorithm's output. Inside the broad scope of symbolic classifiers, some classes of concept representations appear to offer a greater comfort for interpretation. Decision trees belong to this set [Breiman et al.1984], though they also raise some interpretability problems : Kohavi $\&$ Sommerfield (1998) quote that

``the clients [business users] found some interesting patterns in the decision trees, but they did not feel the structure was natural for them. They were looking for those two or three attributes and values (e.g. a combination of geographic and industries) where something ``interesting'' was happening. In addition, they felt it was too limiting that the nodes in a decision tree represent rules that all start with the same attributes.''

Although not limiting from a classification viewpoint, the ordering of nodes prior to classification can therefore make it uncomfortable to mine a decision tree. Notice that this problem might hold for any class of concept representation integrating an ordering prior to classification: decision lists [Rivest1987], alternating decision trees [Freund Mason1999], branching programs [Mansour McAllester2000], etc. . There exists, however, a type of classifiers on which related papers appear to be generally unanimous on their mining abilities : disjunctive normal form formulas (DNFs, and their numerous extensions), that is, disjunctions of conjunctions. Interestingly enough, this is the class which motivated early works (and a great amount of works afterwards) on the well-known PAC theory of learning [Valiant1984,Valiant1985], partly because of the tendency humans seem to have to represent knowledge using similarly shaped rules [Valiant1985]. This class is also the dual of the one implicitly used by Buja $\&$ Lee (2001) to cast their size measure for decision trees (to state whether the concept represented is simple or not).
It is our aim in this paper to propose theoretical results and approximation algorithms related to the induction of very particular voting classifiers, drawing their roots on simple rule sets (like DNF), with the objective to keep a tradeoff between simplicity and accuracy. Our aim is also to prove that, in the numerous induction algorithms already proposed throughout the machine learning and data mining communities, some of them, previously used in decision trees and decision lists induction, can be easily adapted to cope with this objective, thereby leading to easy-to-implement (and compare) algorithms. The next section presents a synthesis of our contribution, which is detailed in the rest of the paper.

Next: Our Contribution Up: Inducing Interpretable Voting Classifiers Previous: Inducing Interpretable Voting Classifiers