2.3 The Bias plus Variance Decomposition

Recently, several authors [Breiman 1996b, Friedman 1996, Kohavi and Wolpert 1996, Kong and Dietterich 1995] have proposed theories for the effectiveness of Bagging and Boosting based on Geman et al.'s [1992] bias plus variance decomposition of classification error. In this decomposition, the expected error of a learning algorithm on a particular target function and training set size has three components:

1. A bias term measuring how close the average classifier produced by the learning algorithm will be to the target function;
2. A variance term measuring how much the learning algorithm's guesses vary with respect to each other (how often they disagree); and
3. A term measuring the minimum classification error associated with the Bayes optimal classifier for the target function (this term is sometimes referred to as the intrinsic target noise).
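
One way to make the three terms concrete is the zero-one-loss formulation usually attributed to Kohavi and Wolpert [1996]. The notation below is an assumption introduced here and does not appear in the text: Y_F is the (possibly noisy) target label at a point x, Y_H is the label predicted by a classifier trained on a randomly drawn training set of the given size, and the noise term sigma_x^2 plays the role of the third component in the list above.

```latex
\begin{align*}
E[\text{error}] &= \sum_{x} P(x)\,\bigl(\sigma_x^2 + \text{bias}_x^2 + \text{variance}_x\bigr),\\
\text{bias}_x^2 &= \tfrac{1}{2}\sum_{y}\bigl[P(Y_F = y \mid x) - P(Y_H = y \mid x)\bigr]^2,\\
\text{variance}_x &= \tfrac{1}{2}\Bigl(1 - \sum_{y} P(Y_H = y \mid x)^2\Bigr),\\
\sigma_x^2 &= \tfrac{1}{2}\Bigl(1 - \sum_{y} P(Y_F = y \mid x)^2\Bigr).
\end{align*}
```

Under this form, the variance term is zero at x exactly when every training set leads to the same prediction there, which matches the "how often they disagree" reading above.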

Using this framework it has been suggested [Breiman 1996b] that both Bagging and Boosting reduce error by reducing the variance term. Freund and Schapire [1996] argue that Boosting also attempts to reduce the error in the bias term, since it focuses on misclassified examples. Such a focus may cause the learner to produce an ensemble function that differs significantly from what the single learning algorithm would produce. In fact, Boosting may construct a function that is not even producible by its component learning algorithm (e.g., changing linear predictions into a classifier that contains non-linear predictions). It is this capability that makes Boosting an appropriate algorithm for combining the predictions of "weak" learning algorithms (i.e., algorithms that have a simple learning bias). In their recent paper, Bauer and Kohavi [1999] demonstrated that Boosting does indeed seem to reduce bias for certain real-world problems. More surprisingly, they showed that Bagging can also reduce the bias portion of the error, often for the same data sets for which Boosting reduces the bias.
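
The claim that a weighted vote can represent a function outside its component learner's hypothesis space can be illustrated with a small sketch. The components here are one-dimensional threshold classifiers (decision stumps). The thresholds, polarities, and equal weights are hand-picked assumptions for illustration; they are not learned by Boosting, and the helper names are hypothetical.

```python
import math

def stump(threshold, polarity):
    """Return a 1-D decision stump: `polarity` if x > threshold, else -polarity."""
    return lambda x: polarity if x > threshold else -polarity

# Hand-picked components and weights (illustrative assumptions, not learned).
# The third stump has its threshold at -infinity, so it always outputs -1.
components = [stump(1.0, +1), stump(3.0, -1), stump(-math.inf, -1)]
weights = [1.0, 1.0, 1.0]

def ensemble(x):
    """Weighted vote: the sign of the weighted sum of component outputs."""
    total = sum(w * h(x) for w, h in zip(weights, components))
    return 1 if total > 0 else -1

def single_stump_can_match(points, labels, thresholds):
    """Brute-force check: does any one stump reproduce `labels` on `points`?"""
    for t in thresholds:
        for polarity in (+1, -1):
            h = stump(t, polarity)
            if all(h(x) == y for x, y in zip(points, labels)):
                return True
    return False

points = [0.0, 2.0, 4.0]
labels = [ensemble(x) for x in points]  # negative, positive, negative
```

The vote is positive only on an interval, a labelling that no single stump can produce because a stump changes its output at most once along the axis.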

Though the bias-variance decomposition is interesting, there are certain limitations to applying it to real-world data sets. To estimate the bias, variance, and target noise for a particular problem, we need to know the actual function being learned, which is unavailable for most real-world problems. To deal with this problem, Kohavi and Wolpert [1996] suggest holding out some of the data, the approach used by Bauer and Kohavi [1999] in their study. The main problem with this technique is that the training set size must be greatly reduced in order to get good estimates of the bias and variance terms. We have chosen to focus strictly on generalization accuracy in our study, in part because Bauer and Kohavi's work has answered the question of whether Boosting and Bagging reduce the bias for real-world problems (they both do), and in part because their experiments demonstrate that while this decomposition gives some insight into ensemble methods, it is only a small part of the equation. For different data sets they observe cases where Boosting and Bagging both decrease mostly the variance portion of the error, and other cases where Boosting and Bagging both reduce the bias and the variance portions of the error. Their tests also seem to indicate that Boosting's generalization error increases on the domains where Boosting increases the variance portion of the error; however, it is difficult to determine what aspects of the data sets led to these results.
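
The hold-out estimate described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used by Bauer and Kohavi: it assumes a classifier has already been trained on each of several training samples and evaluated on a common held-out set, it treats the held-out labels as noise-free (so the intrinsic noise term is taken to be zero), and it uses the zero-one-loss definitions usually attributed to Kohavi and Wolpert [1996]. The function name and argument layout are hypothetical.

```python
from collections import Counter

def kw_decomposition(predictions, targets):
    """Estimate (bias^2, variance, error) on a held-out set.

    predictions[m][i]: label predicted for held-out point i by the
        classifier trained on the m-th training sample.
    targets[i]: held-out label, assumed noise-free (an assumption).
    """
    n_models = len(predictions)
    n_points = len(targets)
    bias2 = variance = error = 0.0
    for i, t in enumerate(targets):
        # Empirical distribution of predicted labels at this point.
        votes = Counter(run[i] for run in predictions)
        probs = {y: c / n_models for y, c in votes.items()}
        labels = set(probs) | {t}
        # Squared bias: distance between target and prediction distributions.
        bias2 += 0.5 * sum(
            ((1.0 if y == t else 0.0) - probs.get(y, 0.0)) ** 2 for y in labels
        )
        # Variance: how spread out the predictions are across training samples.
        variance += 0.5 * (1.0 - sum(p * p for p in probs.values()))
        # Average zero-one error of the individual classifiers at this point.
        error += 1.0 - probs.get(t, 0.0)
    return bias2 / n_points, variance / n_points, error / n_points

# Four classifiers, two held-out points with true labels 0 and 1.
example_predictions = [[0, 1], [0, 1], [0, 0], [0, 0]]
example_targets = [0, 1]
example_result = kw_decomposition(example_predictions, example_targets)
```

With noise-free targets, the estimated squared bias and variance sum to the average error at every point, which gives a quick sanity check on any implementation. Each classifier in such a scheme sees only part of the non-held-out data, which is the reduction in training set size noted above.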

David Opitz
1999-08-24