Boosting [Freund Schapire1996,Schapire1990] encompasses a family of methods.
The focus of these methods is to produce a *series* of classifiers.
The training set used for each member of the series is chosen based on
the performance of the earlier classifier(s) in the series.
In Boosting, examples that are incorrectly predicted by previous classifiers
in the series are chosen more often than examples that were correctly predicted.
Thus Boosting attempts to produce new classifiers that are better able to
predict examples for which the current ensemble's performance is poor.
(Note that in Bagging, the resampling of the training set is not dependent
on the performance of the earlier classifiers.)

In this work we examine two new and powerful forms of Boosting: Arcing
[Breiman1996b] and Ada-Boosting [Freund Schapire1996].
Like Bagging, Arcing chooses a training set of size *N* for classifier
*K*+1 by probabilistically selecting (with replacement)
examples from the original *N* training examples.
Unlike Bagging, however, the probability of selecting an example
is not equal across the training set.
This probability depends on how often that example was misclassified by the
previous *K* classifiers.
Ada-Boosting can either (a) select a set of examples based on
the probabilities of the examples, or (b) simply use all of the examples
and weight the error of each example by the probability for that example
(i.e., examples with higher probabilities have more effect on the error).
This latter approach has the clear advantage that each example is
incorporated (at least in part) in the training set.
Furthermore, Friedman et al. [1998] have demonstrated that
this form of Ada-Boosting can be viewed as a form of additive modeling for
optimizing a logistic loss function.
In this work, however, we have chosen to use the approach of subsampling
the data to ensure a fair empirical comparison
(in part due to the restarting reason discussed below).
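The subsampling approach can be sketched as follows (a minimal illustration in Python with NumPy; the function name and signature are our own, not from the original paper):

```python
import numpy as np

def resample(X, y, probs, rng):
    """Draw a training set of size N by sampling (with replacement)
    from the N original examples according to the probabilities `probs`."""
    n = len(y)
    idx = rng.choice(n, size=n, replace=True, p=probs)
    return X[idx], y[idx]
```

With uniform probabilities this reduces to Bagging's bootstrap sample; Boosting simply supplies non-uniform probabilities.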

Both Arcing and Ada-Boosting initially set the probability of picking each
example to 1/*N*.
These methods then recalculate these probabilities after each trained classifier
is added to the ensemble. For Ada-Boosting,
let *E*_{k} be the sum of the probabilities of the misclassified instances
for the currently trained classifier *C*_{k}.
The probabilities for the next trial are generated by
multiplying the probabilities of *C*_{k}'s incorrectly classified instances
by the factor *B*_{k} = (1 - *E*_{k})/*E*_{k}
and then renormalizing all probabilities so that their sum equals 1.
Ada-Boosting combines the classifiers
*C*_{1}, ..., *C*_{k}
using weighted voting,
where *C*_{k} has weight log(*B*_{k}).
These weights allow Ada-Boosting to discount the predictions of classifiers
that are not very accurate on the overall problem.
Friedman et al. [1998] have also suggested an alternative
mechanism that fits together the predictions of the classifiers as an additive
model using a maximum likelihood criterion.
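The update and voting rules described above can be sketched as follows (our own illustration; `adaboost_update` and `weighted_vote` are hypothetical names, and we assume NumPy arrays throughout):

```python
import numpy as np

def adaboost_update(probs, misclassified):
    """One Ada-Boosting reweighting step.  `probs` holds the current example
    probabilities; `misclassified` is a boolean mask marking the examples
    that classifier C_k got wrong.  Returns the new probabilities and the
    classifier's voting weight log(B_k)."""
    E_k = probs[misclassified].sum()        # weighted error of C_k
    B_k = (1.0 - E_k) / E_k                 # > 1 whenever E_k < 0.5
    new = probs.copy()
    new[misclassified] *= B_k               # boost the misclassified examples
    new /= new.sum()                        # renormalize so the sum equals 1
    return new, np.log(B_k)

def weighted_vote(predictions, weights):
    """Combine predictions (one row per classifier) by weighted voting,
    where row k carries weight log(B_k)."""
    classes = np.unique(predictions)
    scores = [(weights[:, None] * (predictions == c)).sum(axis=0)
              for c in classes]
    return classes[np.argmax(np.vstack(scores), axis=0)]
```

Note that with E_k < 0.5 the factor B_k exceeds 1, so misclassified examples gain probability mass while correctly classified ones lose it after renormalization.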

In this work, we use the revision described by Breiman [1996b],
where we reset all the weights to be equal and restart if
*E*_{k} is not less than 0.5 or
*E*_{k} becomes 0.^{1}
By resetting the weights we do not disadvantage the Ada-Boosting learner
in those cases where it reaches these values of *E*_{k};
the Ada-Boosting learner always incorporates the same number of classifiers
as the other methods we tested.
To make this feasible, we are forced to use the approach of selecting a
data set probabilistically rather than weighting the examples, otherwise
a deterministic method such as C4.5 would cycle and generate duplicate
members of the ensemble.
That is, resetting the weights to 1/*N* would cause the learner
to repeat the decision tree learned as the first member of the ensemble,
and this would lead to reweighting the data set the same as for the second
member of the ensemble, and so on.
Randomly selecting examples for the data set based on the example
probabilities alleviates this problem.
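Breiman's reset-and-restart revision amounts to a simple check after each trial (a sketch under our own naming; the paper gives no code):

```python
import numpy as np

def reset_if_degenerate(probs, error):
    """If the weighted error E_k reaches 0.5 or above, or drops to 0,
    reset every example probability to 1/N and restart; otherwise keep
    the current probabilities."""
    if error >= 0.5 or error == 0.0:
        n = len(probs)
        return np.full(n, 1.0 / n)
    return probs
```

Because the next training set is then drawn probabilistically from these uniform weights, the restarted ensemble member is trained on a fresh sample rather than a duplicate of the first tree.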

Arcing-x4 [Breiman1996b] (which we will refer to simply as Arcing)
started out as a simple mechanism for evaluating the effect of Boosting
methods where the resulting classifiers were combined without weighting
the votes.
Arcing uses a simple mechanism for determining the probabilities
of including examples in the training set.
For the *i* th example in the training set, the value *m*_{i} refers
to the number of times that example was misclassified by the previous *K* classifiers.
The probability *p*_{i} for selecting example *i* to be part of
classifier *K*+1's training set is defined as

*p*_{i} = (1 + *m*_{i}^{4}) / Σ_{*j*=1}^{*N*} (1 + *m*_{j}^{4}).

Breiman chose the value of the power (4) empirically after trying several
different values [Breiman1996b].
Although this mechanism does not have the weighted voting of Ada-Boosting,
it still produces accurate ensembles and is simple to implement;
thus we include this method (along with Ada-Boosting) in our empirical evaluation.
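Arc-x4's selection rule, with *p*_{i} proportional to 1 + *m*_{i}^{4}, can be sketched as follows (our own illustration; the function name is ours):

```python
import numpy as np

def arc_x4_probs(miscounts):
    """Arc-x4 selection probabilities: miscounts[i] is m_i, the number of
    times example i was misclassified by the previous K classifiers."""
    w = 1.0 + np.asarray(miscounts, dtype=float) ** 4
    return w / w.sum()         # normalize so the probabilities sum to 1
```

When no example has yet been misclassified, every *m*_{i} is 0 and the probabilities are uniform, matching the 1/*N* initialization above.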

Figure 2 shows a hypothetical run of Boosting. Note that the first training set would be the same as in Bagging; however, later training sets accentuate examples that were misclassified by the earlier members of the ensemble. In this figure, example 1 is a ``hard'' example that previous classifiers tend to misclassify. In the second training set, example 1 occurs multiple times, as do examples 4 and 5, since they were left out of the first training set and, in this case, misclassified by the first learner. In the final training set, example 1 becomes the predominant example chosen (whereas no single example is accentuated with Bagging); thus, the overall test-set error of this classifier might become very high. Despite this, Boosting will probably obtain a lower error rate when it combines the output of these four classifiers, since it focuses on correctly predicting previously misclassified examples and weights the predictions of the different classifiers based on their accuracy on the training set. Boosting can, however, also overfit in the presence of noise (as we empirically show in Section 3).