
Calculating Rule Vectors using Ranking Loss Boosting

Schapire $\&$ Singer (1998) have investigated classification problems where the aim of the procedure is not to provide a single accurate class for some observation. Rather, the algorithm outputs a set of values (one for each class), and we expect the class of the observation to receive the largest value of all, thus being ranked higher than all others. This approach is particularly useful when a given example can belong to more than one class (multilabel problems), a case in which we expect each of these classes to receive a greater value than the classes to which the example does not belong.
The ranking loss informally represents the number of times the hypothesis fails to rank a class of an example higher than a class to which it does not belong. Before going further, we first generalize our classification setting, and replace the common notation $(o,c_o)$ for an example by the more general one $(o,\vec{c}_o)$. Here, $\vec{c}_o\in\{0,1\}^c$ is a vector giving, for each class, the membership of the corresponding observation $o$ to that class (``0'' for no, ``1'' for yes). It is important to note that this setting is more general than the usual Bayesian setting, in which there can exist examples $(o,c_o)$ and $(o',c_{o'})$ (using the non-vector notation) for which $o=o'$ but $c_o\neq c_{o'}$. Ranking loss generalizes the Bayesian setting to multilabel problems, and postulates that there can be examples for which we cannot provide a single class at a time, even if, e.g., any of the classes to which the example belongs may later appear independently with the same observation.
Ranking loss boosting replaces each example $(o,\vec{c}_o)$ by a set of $1_{\vec{c}_o} \times (c-1_{\vec{c}_o})$ examples, where $1_{\vec{c}_o}$ denotes the Hamming weight of $\vec{c}_o$ (i.e., the number of classes to which the example belongs). Each of these new examples is denoted $(o,k,j)$, where $(k,j)$ spans all pairs in $\{ 0, 1, ..., c - 1 \}^{2}$. The distribution over the new examples is renormalized, so that $w((o,k,j))=\frac{w((o,\vec{c}_o))}{1_{\vec{c}_o} \times (c-1_{\vec{c}_o})}$ whenever $\vec{c}_o[j]=1$ and $\vec{c}_o[k]=0$, and $0$ otherwise.
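As an illustration only (the names below are ours, not the paper's), the following sketch expands a multilabel example into the pairwise examples $(o,k,j)$ with the renormalized weights defined above:

```python
# Hypothetical sketch: expanding one multilabel example (o, c_vec) of
# weight w into the pairwise examples (o, k, j) used by ranking loss
# boosting, with the renormalized weights w((o, k, j)).

def expand_example(c_vec, w):
    """c_vec[i] is 1 if the example belongs to class i, else 0.
    Returns a dict mapping (k, j) -> weight, nonzero only when
    c_vec[j] = 1 and c_vec[k] = 0."""
    c = len(c_vec)
    n_pos = sum(c_vec)                     # Hamming weight 1_{c_vec}
    n_pairs = n_pos * (c - n_pos)          # number of (k, j) pairs kept
    if n_pairs == 0:                       # belongs to no class or to all classes
        return {}
    return {(k, j): w / n_pairs
            for j in range(c) if c_vec[j] == 1
            for k in range(c) if c_vec[k] == 0}

# Example: 4 classes, the observation belongs to classes 0 and 2:
# expand_example([1, 0, 1, 0], w=1.0) yields 4 pairs, each of weight 0.25.
```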
Take some monomial $t$ obtained from the large DC, and all examples satisfying it. We now work with this restricted subset of examples, while calculating the corresponding vector $\vec{v}$ of $t$. Schapire $\&$ Singer (1998) propose a cost function which we should minimize in order to minimize the ranking loss. This function is

$\displaystyle Z = \sum_{o,j,k} w((o,k,j)) \times e^{-\frac{1}{2}\alpha \left(\vec{v}[j] - \vec{v}[k]\right)} \:\:.$ (2)

Here, $\alpha$ is a tunable parameter which, intuitively, represents the confidence in the choice of $\vec{v}$ and leverages its quality: the better $\vec{v}$ is at classifying examples, the larger $\vert\alpha\vert$ is. In our case, however, allowing $\alpha\neq 1$ is equivalent to allowing the components of $\vec{v}$ to lie in sets $\{-x,0,x\}$ for arbitrary $x$. To truly constrain the components of $\vec{v}$ to $\{-1,0,+1\}$, we have chosen to optimize the criterion
$\displaystyle Z = \sum_{o,j,k} w((o,k,j)) \times e^{-\frac{1}{2}\left(\vec{v}[j] - \vec{v}[k]\right)}$ (3)

(therefore forcing $\alpha=1$). Schapire $\&$ Singer (1998) conjecture that finding the optimal vector minimizing $Z$ in eq. (2) (which is similar to an oblivious hypothesis according to their definitions), or minimizing $Z$ for a particular value of $\alpha$, is $NP$-Hard when $c$ is not fixed and the components of $\vec{v}$ are restricted to the set $\{-1,+1\}$. The following section addresses directly the setting of Schapire $\&$ Singer (1998), and presents complexity-theoretic results showing that the minimization of $Z$ is actually polynomial, although quite involved to achieve, all the more so given what it would bring to the minimization of $Z$ in our setting. A striking result we also give, though not related to the purpose of this paper, is that it is actually the maximization of $Z$ which is $NP$-Hard.
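For concreteness, here is a minimal sketch (our notation, not the paper's code) of the criterion $Z$ of eq. (3) for a candidate vector $\vec{v}$ with components in $\{-1,0,+1\}$, summed over the pairwise examples of the observations satisfying the monomial $t$; it reuses the hypothetical `expand_example` output above:

```python
import math

def ranking_loss_Z(examples, v):
    """examples: list of dicts mapping (k, j) -> w((o, k, j)),
    one dict per observation satisfying the monomial t.
    v: candidate rule vector, v[j] in {-1, 0, +1}.
    Returns Z of eq. (3), i.e. with alpha fixed to 1."""
    return sum(w * math.exp(-0.5 * (v[j] - v[k]))
               for pairs in examples
               for (k, j), w in pairs.items())
```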
Then, we present the approximation algorithm we have built and implemented to optimize the computation of $\vec{v}$ in our setting (components of $\vec{v}$ in the set $\{-1,0,+1\}$), along with its properties. While we feel that the ideas used to minimize $Z$ in the setting of Schapire $\&$ Singer (1998) could be adapted to our setting to provide an algorithm that is always optimal, our algorithm has the advantage of being simple, fast, and also optimal in numerous cases. In many other cases, we show that it remains asymptotically optimal as $c$ increases.
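As a point of comparison only (this is a naive exhaustive baseline of ours, not the approximation algorithm presented later), one can check small instances by enumerating all $3^c$ candidate vectors and keeping the one minimizing eq. (3):

```python
from itertools import product

def brute_force_vector(examples, c):
    """Exhaustively searches {-1, 0, +1}^c for the vector minimizing
    ranking_loss_Z; only tractable for small numbers of classes c."""
    best_v, best_z = None, float("inf")
    for v in product((-1, 0, 1), repeat=c):
        z = ranking_loss_Z(examples, v)
        if z < best_z:
            best_v, best_z = v, z
    return best_v, best_z

# Example use with the sketches above:
# examples = [expand_example([1, 0, 1, 0], w=1.0)]
# print(brute_force_vector(examples, c=4))
```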


