Schapire and Singer (1998) have investigated classification problems in which the aim of the procedure is not to provide a single accurate class for an observation. Rather, the algorithm outputs a set of values (one for each class), and we expect the class of the observation to receive the largest value of all, thus being ranked higher than all others. This approach is particularly useful when a given example can belong to more than one class (multilabel problems), in which case we expect each of these classes to receive greater values than the classes to which the example does not belong.

The *ranking loss* informally counts the number of times the hypothesis fails to rank a class of an example higher than a class to which the example does not belong. Before going further, we first generalize our classification setting, and replace the common notation $(o, c)$ for an example by the more general one $(o, \vec{c})$. Here, $\vec{c} \in \{0, 1\}^k$ is a vector giving, for each of the $k$ classes, the membership to the class (``0'' is no and ``1'' is yes) of the corresponding observation $o$. It is important to note that this setting is more general than the usual Bayesian setting, in which there can exist examples $(o, c)$ and $(o', c')$ (using the non-vector notation) for which $o = o'$ but $c \neq c'$. Ranking loss generalizes Bayes to multilabel problems, and postulates that there can be some examples for which we cannot provide a single class at a time, even if *e.g.* any of the classes to which the example belongs is susceptible to appear independently later with the same observation.
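To make this definition concrete, here is a small sketch (the function name and the toy numbers are our own, purely illustrative choices) that counts the misordered pairs for one multilabel example, given the vector of values output by a hypothesis:

```python
from itertools import product

def ranking_loss(scores, c):
    """Count the misordered pairs (l0, l1): the hypothesis fails whenever
    a class the example does not belong to (c[l0] == 0) receives a value
    at least as large as a class it does belong to (c[l1] == 1)."""
    pairs = [(l0, l1) for l0, l1 in product(range(len(c)), repeat=2)
             if c[l0] == 0 and c[l1] == 1]
    return sum(1 for l0, l1 in pairs if scores[l0] >= scores[l1])

# A 4-class multilabel example belonging to classes 1 and 3:
c = [0, 1, 0, 1]
scores = [0.2, 0.9, 0.5, 0.4]     # class 2 is wrongly ranked above class 3
print(ranking_loss(scores, c))    # -> 1
```

Only the pair $(l_0, l_1) = (2, 3)$ is misordered here, since class 2 receives the value 0.5 while class 3, to which the example belongs, only receives 0.4.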

Ranking loss Boosting replaces each example $(o, \vec{c})$ by a set of $n(\vec{c}) \times (k - n(\vec{c}))$ examples, where $n(\vec{c})$ denotes the Hamming weight of $\vec{c}$ (*i.e.* the number of classes to which the example belongs). Each of these new examples is denoted $(o, \vec{c}, l_0, l_1)$, where $l_0$ and $l_1$ span all values in $\{1, 2, \ldots, k\}$. The distribution $D$ of the new examples is renormalized, so that $D((o, \vec{c}, l_0, l_1)) = D((o, \vec{c})) / (n(\vec{c})(k - n(\vec{c})))$ whenever $c_{l_0} = 0$ and $c_{l_1} = 1$, and $D((o, \vec{c}, l_0, l_1)) = 0$ otherwise.
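This expansion can be sketched as follows (a minimal illustration, with hypothetical function and variable names, assuming 0-based class indices):

```python
from itertools import product

def expand_example(o, c, weight):
    """Replace one multilabel example (o, c) carrying probability
    `weight` by its n(c) * (k - n(c)) expanded examples (o, c, l0, l1).
    Pairs with c[l0] == 0 and c[l1] == 1 share the original weight
    uniformly; every other pair gets weight 0."""
    k = len(c)
    n = sum(c)                      # Hamming weight n(c)
    dist = {}
    for l0, l1 in product(range(k), repeat=2):
        if c[l0] == 0 and c[l1] == 1:
            dist[(l0, l1)] = weight / (n * (k - n))
        else:
            dist[(l0, l1)] = 0.0
    return dist

d = expand_example("o1", [0, 1, 0, 1], weight=1.0)
print(sum(d.values()))              # -> 1.0 (the total weight is preserved)
```

With $n(\vec{c}) = 2$ and $k = 4$, the four admissible pairs each receive weight $1/4$, so the renormalization indeed preserves the total probability of the original example.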

Take some monomial obtained from the large DC, and all examples satisfying it. We now work with this restricted subset of examples, while calculating the corresponding vector $\vec{v}$ of the monomial. Schapire and Singer (1998) propose a cost function $Z$ which we should minimize in order to minimize the ranking loss. This function is

$$Z = \sum_{(o, \vec{c}, l_0, l_1)} D((o, \vec{c}, l_0, l_1)) \exp\left(\frac{1}{2}\,\alpha\,(v_{l_0} - v_{l_1})\right) \qquad (1)$$
Here, $\alpha$ is a tunable parameter which, intuitively, represents the confidence in the choice of $\vec{v}$, and leverages its quality: the better $\vec{v}$ is at classifying examples, the larger $\alpha$ is. In our case however, authorizing $\alpha \neq 1$ is equivalent to authorizing components for $\vec{v}$ in sets $\{-\alpha, 0, +\alpha\}$ for arbitrary $\alpha$. To really constrain the components of $\vec{v}$ in $\{-1, 0, +1\}$, we have chosen to optimize the criterion

$$\tilde{Z} = \sum_{(o, \vec{c}, l_0, l_1)} D((o, \vec{c}, l_0, l_1)) \exp\left(\frac{1}{2}\,(v_{l_0} - v_{l_1})\right) \qquad (2)$$
(therefore forcing $\alpha = 1$). Schapire and Singer (1998) conjecture that finding the optimal vector $\vec{v}$ minimizing $\tilde{Z}$ in eq. (2) is a hard problem, admitting no simple analytical solution.
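For small numbers of classes, the criterion of eq. (2) with components of $\vec{v}$ restricted to $\{-1, 0, +1\}$ can be minimized by plain exhaustive search over the $3^k$ candidate vectors. The sketch below (hypothetical names; the expanded examples are given as `(weight, l0, l1)` triples with nonzero weight) is only meant to illustrate the criterion, not to stand in for the approximation algorithm presented next:

```python
from itertools import product
from math import exp

def z_tilde(v, triples):
    """Criterion (2), i.e. eq. (1) with alpha = 1: `triples` lists the
    expanded examples (weight, l0, l1) having c[l0] == 0 and c[l1] == 1."""
    return sum(w * exp(0.5 * (v[l0] - v[l1])) for w, l0, l1 in triples)

def best_v(k, triples):
    """Exhaustive search over all vectors with components in {-1, 0, +1}
    (only feasible for small k, since there are 3**k candidates)."""
    return min(product((-1, 0, 1), repeat=k),
               key=lambda v: z_tilde(v, triples))

# Expanded examples of one observation belonging to class 1 only (k = 3):
triples = [(0.5, 0, 1), (0.5, 2, 1)]
print(best_v(3, triples))           # -> (-1, 1, -1)
```

As expected, the minimizer votes $+1$ for the class the example belongs to and $-1$ for the others, which makes every exponent in $\tilde{Z}$ as negative as possible.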

Then, we present the approximation algorithm we have built and implemented to optimize the computation of $\vec{v}$ in our setting (components of $\vec{v}$ in the set $\{-1, 0, +1\}$), along with its properties. While we feel that the ideas used to minimize $Z$ in the setting of Schapire and Singer (1998) can be adapted to our setting to provide an algorithm that is always optimal, our algorithm has the advantage of being simple, fast, and also optimal in numerous cases. In many other cases, we show that it is still asymptotically optimal.

- Optimizing $\tilde{Z}$ in the Setting of Schapire and Singer (1998)
- Optimizing $\tilde{Z}$ in our Setting
- Explicit Solution in the Two-Class Case