Calculation of the Dual of the SVM Problem.

Here is a simple page to explain the detailed steps of the calculation of the dual of the SVM Problem.
Prerequisite: SVM, Dual Problem, Lagrangian

Mathematical formulation of the SVM

Primal problem

The SVM problem is as follow: $$ \min_{w,b} ( C \sum_i l(w^T X_i,y_i) ) + {{1}\over{2}} \| w \|^2 ) $$ where $l$ is the hinge loss: $l(w^T X,y)=max(0,1 - y (w^T X + b))$ $$ \min_{w,b} ( C \sum_i max(0,1 - y_i (w^T X_i + b)) + {{1}\over{2}} \| w \|^2 ) $$

Reformulation of the dual

The hinge loss can be rewritten:
$$ \begin{eqnarray} l(w^T X,y) &=& max(0,1 - y (w^T X + b)) \\ &=& \min_{\xi} \xi \text{ s.t. } \left\{ \begin{array}{ll} \xi \geq 0 \\ \xi \geq 1-y ( w^T X+b) \end{array} \right. \end{eqnarray}$$
Thanks to that, we can rewrite the SVM problem as Linearly Constrained Quadratic Programming (LCQP). $$ \min_{w,b,\{\xi_i\}_i} ( C \sum_i \xi_i + {{1}\over{2}} \| w \|^2 ) \text{ s.t. } \left\{ \begin{array}{ll} \forall i, \xi_i \geq 0 \\ \forall i, \xi_i \geq 1-y (w^T X+b) \end{array} \right. $$

Dual Problem

Let's compute the dual of this problem. To begin, let's compute the lagrangian.
Let's associate the variables $\{\alpha_i\}$ to the constraints $\forall i, \xi_i \geq 1-y (w^T X+b)$, which can be rewritten $\forall i, 1-y_i (w^T X_i +b)- \xi_i \leq 0$. So the Lagrangian will have the associated term: $$\sum_i \alpha_i ( 1-y_i (w^T X_i +b)- \xi_i ) $$ Then let's associate the variables $\{\beta_i\}$ to the constraints $\forall i, \xi_i \geq 0$, which can be rewritten $\forall i, -\xi_i \leq 0$. So the associated Lagrangian term will be: $$ \sum_i -\beta_i \xi_i $$ We sum up everything and the Lagrangian is: $$ \begin{eqnarray} \mathcal{L}(w,b,\{\xi_i\},\{\alpha_i\},\{\beta_i\}) = & & C \sum_i \xi_i + {{1}\over{2}} \| w \|^2 \\ & + & \sum_i \alpha_i ( 1-y_i (w^T X_i +b)- \xi_i ) \\ & - & \sum_i \beta_i \xi_i \end{eqnarray} $$ and we have the constraints: $$ \forall{i}, \alpha_i \geq 0 , \beta_i \geq 0 $$ In order to eliminate the primal variables, let's differentiate the lagrangian with respect to the primal variables, and use the fact that at the optimum those derivates should be null. $$ \frac{\partial L }{\partial b} = \sum_i \alpha_i y_i \Rightarrow \sum_i \alpha_i y_i = 0 $$ $$ \frac{\partial L }{\partial \xi_i} = C - \alpha_i - \beta_i \Rightarrow \alpha_i=C-\beta_i \Rightarrow \alpha_i \leq C$$ $$ \frac{\partial L}{\partial w} = w - \sum_i \alpha_i y_i X_i \Rightarrow w = \sum_i \alpha_i y_i X_i$$ So we can reinject that into the lagrangian and get: $$ \begin{eqnarray} \mathcal{L} &=& \sum_i \xi_i + {{1}\over{2}} \| w \|^2 \\ &+& \sum_i \alpha_i ( 1-y_i (w^T X_i +b)- \xi_i ) \\ &-& \sum_i \beta_i \xi_i \\ &=& \sum_i \xi_i + {{1}\over{2}} \| \sum_i \alpha_i y_i X_i \|^2 \text{ (using $\frac{\partial L}{\partial w}=0$.)} \\ &+& \sum_i \alpha_i ( 1-y_i ((\sum_j \alpha_j y_j X_j)^T X_i)- \xi_i ) \text{ (using $\frac{\partial L}{\partial w}=0$.)} \\ &-& (\sum_i \alpha_i y_i) b \text{ (the sum is null, according to $\frac{\partial L}{\partial b}=0$.)} \\ &-& \sum_i ( C - \alpha_i ) \xi_i \text{ (using $\frac{\partial L}{\partial \xi_i}=0$.)} \\ &=& - {{1}\over{2}} \sum_{i,j} \alpha_i \alpha_j y_i y_j X_i^T X_j + \sum_i \alpha_i \end{eqnarray} $$ So the dual problem is:
$$ \begin{eqnarray} \min_{\{\alpha_i\}} {{1}\over{2}} \sum_{i,j} \alpha_i \alpha_j y_i y_j X_i^T X_j - \sum_i \alpha_i\\ s.t. \left\{ \begin{array}{ll} \forall_i, 0 \leq \alpha_i \leq C \\ \sum_i \alpha_i y_i = 0 \end{array} \right. \end{eqnarray} $$

Why working in the dual rather than in the primal?

1) Easy to kernelize, by replacing $X_i^T X_j$ by $K(X_i,X_j)$
2) The boxed constraint ( $ 0 \leq \alpha_i \leq C $ ) is easy to handle (only one variable, independant constraints).
3) The constraint in the primal ( $ \xi \geq 1-y ( w^T X+b) $ ) is typically hard to handle (size(w)+2 variables, constraint with highly dependent variables).
4) Thanks to the box constraint, $\alpha$ is sparse. Only a few $\alpha$'s will be different than both $0$ and $C$. So we one use an active set method.
5) Sometimes, when $w$ is much bigger than $\alpha$ (i.e. the feature dimension is smaller than the number of examplars), the size of the problem is reduced.