Although Bayesian model averaging is theoretically the optimal method for combining learned models, it has seen very little use in machine learning. In this paper we study its application to combining rule sets, and compare it with bagging and partitioning, two popular but more ad-hoc alternatives. Our experiments show that, surprisingly, Bayesian model averaging's error rates are consistently higher than the other methods'. Further investigation shows this to be due to a marked tendency to overfit on the part of Bayesian model averaging, contradicting previous beliefs that it solves (or avoids) the overfitting problem.
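As a rough illustration of how the two combination rules differ, the sketch below contrasts posterior-weighted voting, as in Bayesian model averaging, with the uniform voting used by bagging. It is a minimal sketch under stated assumptions, not the procedure used in the experiments: the function names, the hard (single-label) votes, the uniform prior, and the toy log-likelihoods are all hypothetical, and only the combination step is shown (bagging's bootstrap resampling and the rule learner itself are omitted).

```python
import math
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the class label with the largest total weight."""
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


def posterior_weights(log_likelihoods, log_priors=None):
    """Normalized weights proportional to P(S | h) * P(h).

    Assumption: a uniform prior over models when log_priors is None.
    """
    if log_priors is None:
        log_priors = [0.0] * len(log_likelihoods)
    scores = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    top = max(scores)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(s - top) for s in scores]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical toy data: three models vote on one test example.
predictions = ["a", "b", "b"]
log_likelihoods = [-10.0, -14.0, -15.0]  # log P(training set | model)

bma_weights = posterior_weights(log_likelihoods)
bma_choice = weighted_vote(predictions, bma_weights)
uniform_choice = weighted_vote(predictions, [1.0] * len(predictions))
```

In this toy case the model with the highest training-set likelihood receives nearly all of the posterior weight, so the weighted vote follows that single model, while the uniform vote follows the majority.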
This paper presents a unified bias-variance decomposition that is applicable to squared loss, zero-one loss, variable misclassification costs, and other loss functions. The unified decomposition sheds light on a number of significant issues: the relation between some of the previously proposed decompositions for zero-one loss and the original one for squared loss; the relation between bias, variance, and Schapire et al.'s (1997) notion of margin; and the nature of the trade-off between bias and variance in classification. While the bias-variance behavior of zero-one loss and variable misclassification costs is quite different from that of squared loss, this difference derives directly from the different definitions of loss. We have applied the proposed decomposition to decision tree learning, instance-based learning, and boosting on a large suite of benchmark datasets, and made several significant observations.
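To make the zero-one case concrete, the following minimal sketch estimates bias, variance, and average loss for a single test example from the predictions of models trained on different training sets. It assumes a noise-free target and takes the main prediction to be the most frequent predicted label; the function name and the toy predictions are hypothetical and are not drawn from the reported experiments.

```python
from collections import Counter


def zero_one_decomposition(predictions, target):
    """Estimate zero-one bias, variance, and average loss for one example.

    predictions: labels predicted for the example by models trained on
                 different training sets.
    target:      the true label, assumed noise-free here.
    """
    n = len(predictions)
    # Main prediction: the most frequent predicted label.
    main = Counter(predictions).most_common(1)[0][0]
    # Bias: loss incurred by the main prediction relative to the target.
    bias = int(main != target)
    # Variance: average loss of the predictions relative to the main one.
    variance = sum(p != main for p in predictions) / n
    # Average loss of the predictions relative to the target.
    loss = sum(p != target for p in predictions) / n
    return main, bias, variance, loss


# Hypothetical toy data for a two-class problem.
unbiased = zero_one_decomposition(["a", "a", "b", "a"], target="a")
biased = zero_one_decomposition(["b", "b", "a", "b"], target="a")
```

With two classes and no noise, the average loss in this sketch equals bias plus variance when the main prediction is correct (0 + 0.25 = 0.25) and bias minus variance when it is wrong (1 - 0.25 = 0.75). This is one way to see how variance can act differently under zero-one loss than under squared loss.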