This paper is principally concerned with the theoretical and experimental study of a set of voting classifiers which we think is likely to provide an accurate answer to the simplicity-accuracy tradeoff: decision committees (DC) [Nock Gascuel1995]. DC is informally the Boolean multiclass extension of polynomial discriminant functions. A decision committee contains rules, each of these being a pair (monomial, vector). Each monomial is a condition that, when fired, returns its vector. After each monomial has been tested, the sum of the returned vectors is used to take the decision. This additive fashion for combining rules is absent from classical Boolean classifiers such as Decision Trees (DT) or Decision Lists (DL). Furthermore, unlike these two latter classes, the classifier contains absolutely no ordering, neither on variables (unlike DT), nor on monomials (unlike DL). When sufficiently small DCs are built and adequate restrictions are taken, a new dimension in interpreting the classifier is obtained, which does not exist for DT or DL. Namely, any example can satisfy more than one rule, and a DC can therefore be interpreted by means of various rule subsets (in a naive conversion of a DT or a DL into rule sets, any example satisfies exactly one rule). Decision committees resemble or generalize other rule sets [Cohen Singer1999]. In this paper, the authors consider DNF-shaped formulas, in which the output of a monomial is not a class (called ``positive''), but a (non-negative) confidence in the classification as positive. A default class predicts the other class, called ``negative'' (this is a setting with two classes). Computing the class of an observation boils down to summing the confidences of the rules it satisfies, and then deciding the positive class if the sum is greater than zero, and the negative class otherwise. Decision committees are a generalization of these formulas, in which we remove the setting's constraint (two classes) and authorize the membership prediction to arbitrary classes, thereby leading to a true voting classifier. This voting fashion is a feature that decision committees share with decision tables [Kohavi Sommerfield1998]. However, decision tables classifiers are based on majority voting of the examples (and not of rules), over a restricted ``window'' of the description variables. They necessitate the storing of many examples, and the interpretations of the data can only be made through this window, according to this potentially large set of examples. Decision committees rather represent an efficient way to encode a voting method into a small number of rules, and the way a class is given can be brought back to early works in machine learning [Clark Boswell1991]. More formal details are provided in the next section.
Among our theoretical results, that are presented in the following section, we provide formal proofs that the simplicity-accuracy tradeoff is also hard to achieve for DC, as well as for the construction of complex votes involving DT. This last result shows that, while mixing C4.5 with boosting provides one of the most powerful classification algorithms [Friedman et al.2000], pruning boosting is essentially heuristic [Margineantu Dietterich1997].
The algorithm we propose for the induction of DC, WIDC (for Weak Induction of Decision Committees), has the following key features. It uses recent results on partition boosting, ranking loss boosting [Schapire Singer1998] and some about pruning Boolean formulas [Kearns Mansour1998]. WIDC follows a scheme close to C4.5's for decision trees [Quinlan1994], or ICDL's for decision lists [Nock Jappy1998] ; as such, it differs from previous studies in voting classifiers (boosting, bagging [Breiman1996]) by features such as the fact that no modification is made on the example's distribution during induction. It is also one if its differences with the SLIPPER rule induction approach [Cohen Singer1999].
On multiclass and multilabel problems, WIDC proposes a very fast and simple solution to ranking loss boosting, optimal in fairly general cases, and asymptotically optimal in most of the remaining ones. The general problem of ranking loss boosting was previously conjectured -Hard [Schapire Singer1998]. Though our ranking loss boosting algorithm is not always optimal, we also show that the general ranking loss boosting problem related to Schapire Singer (1998) is actually not -Hard, and can be solved in polynomial time, though it seems to require the use of complex and time-expensive algorithms, related to the minimization of (symmetric) submodular functions. This also partially justifies the use of our simple and fast approximation algorithm.
The last section of this paper presents experimental results obtained with WIDC on thirty-one domains, most of which are readily available and can be found on the UCI repository of machine learning database [Blake et al.1998].
In order to keep the paper self-contained and as concise and readable as possible, we have chosen to put an appendix at the end of the paper containing all proofs of our results.