Formulate the primal

- Start with a linear method: $h(x) = w^Tx+b$, and the classifier is $h(x) = \text{sign}(w^Tx+b)$.
- We try to solve the following optimization problem:
$\min_{w,b} \frac{1}{2} ||w||^2$

s.t. $y^{(i)}(w^Tx^{(i)} + b) \ge 1, i = 1,...,m$

- Intuition: maximize the *margin* (the smallest distance between the decision boundary and the samples).
- If we don't use kernels, the problem is solved at this point. But in order to use kernels, we need to convert it to its dual form.
- Feel free to read more on VC dimension and statistical learning theory.
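The margin intuition can be checked numerically. A minimal sketch on toy data, where $w$ and $b$ are hand-picked illustrative values rather than the true max-margin solution:

```python
import numpy as np

# Toy linearly separable data; w and b are hand-picked for illustration,
# not the optimal max-margin parameters.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0])
b = 0.0

# Geometric margin of each sample: y * (w^T x + b) / ||w||
margins = y * (X @ w + b) / np.linalg.norm(w)
print(margins.min())  # the classifier's margin is the smallest of these
```

Maximizing this smallest distance over $w, b$ (with the functional margin normalized to 1) is exactly the constrained problem above.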

Lagrange multipliers

- If we have the following primal optimization problem:
$min_w f(w)$

s.t. $g_i(w) \le 0, i = 1,...,k$

$h_i(w) = 0, i = 1,...,l.$

- Construct the generalized Lagrangian $L(w,\alpha,\beta) = f(w) + \sum_{i=1}^k\alpha_ig_i(w) + \sum_{i=1}^l\beta_ih_i(w)$ and define:
$\theta_p(w) = \max_{\alpha,\beta:\alpha_i\ge0} L(w,\alpha,\beta)$
- $\theta_p(w)$ goes to infinity if any constraint is violated; therefore we want:

$\min_w\theta_p(w) = \min_w \max_{\alpha,\beta:\alpha_i\ge0}L(w,\alpha,\beta)$
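To see why $\theta_p$ enforces the constraints, consider a hypothetical 1-D problem: $\min_w w^2$ s.t. $1 - w \le 0$. The sketch below maximizes over a coarse grid of $\alpha$ values rather than solving the inner maximization exactly:

```python
import numpy as np

# Hypothetical problem: f(w) = w^2 with a single constraint g(w) = 1 - w <= 0.
def theta_p(w, alphas=np.linspace(0.0, 100.0, 1001)):
    # max over alpha >= 0 of f(w) + alpha * g(w), approximated on a finite grid
    return np.max(w**2 + alphas * (1.0 - w))

print(theta_p(2.0))  # feasible point: best alpha is 0, so theta_p = f(w) = 4
print(theta_p(0.5))  # infeasible point: theta_p grows with alpha (unbounded in the limit)
```

At feasible points $\theta_p(w) = f(w)$; at infeasible points it blows up, so minimizing $\theta_p$ reproduces the constrained problem.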

Dual of Max-margin

- Construct the Lagrangian for the primal:
$L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^m\alpha_i[y^{(i)}(w^Tx^{(i)} + b)-1]$
- Take the derivatives with respect to $w$ and $b$, set them to zero, and plug the results back in:
$L(w,b,\alpha) = \sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i,j=1}^my^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle$
- Adding the constraints, we finally get the dual form:
$\max_\alpha W(\alpha) = \sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i,j=1}^my^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle$

s.t. $\alpha_i\ge0, i = 1,...,m$

$\sum_{i=1}^m \alpha_iy^{(i)} = 0$

- Prediction on a new $x$: since $w = \sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}$, we compute $w^Tx + b = \sum_{i=1}^m\alpha_iy^{(i)}\langle x^{(i)}, x\rangle + b$, which depends on the data only through inner products; this is what makes kernels possible.
- The two problems are equivalent: if we can solve the dual, we can also solve the primal. For more on duality, see https://en.wikipedia.org/wiki/Duality_(optimization)
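As one way to see the dual quantities in practice, scikit-learn's `SVC` exposes them directly: `dual_coef_` holds $\alpha_i y^{(i)}$ for the support vectors, so the decision value can be rebuilt from inner products alone. The toy data and the large `C` (which approximates a hard margin) are illustrative choices:

```python
import numpy as np
from sklearn.svm import SVC

# Toy separable data; a very large C approximates the hard-margin SVM.
X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1, 1, -1, -1])
clf = SVC(kernel="linear", C=1e6).fit(X, y)

# Rebuild the decision value for a new x from support vectors only:
# sum_i alpha_i * y_i * <x_i, x> + b
x_new = np.array([1.0, 1.0])
score = clf.dual_coef_ @ (clf.support_vectors_ @ x_new) + clf.intercept_
print(score, clf.predict([x_new]))  # the sign of score gives the predicted class
```

Note that only the support vectors (the points with $\alpha_i > 0$, i.e. those on the margin) enter the sum.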

From a loss perspective:

- The SVM can also be regarded as hinge loss + L2 regularization: $L(X;Y) = \sum_i l(y^{(i)}, \hat{y}^{(i)};x^{(i)}) + \lambda||w||^2$
- $l$ is the hinge loss: $l(y,\hat{y}) = \max(0, 1 - y \cdot \hat{y})$, where $y$ is the label and $\hat{y} = w^Tx + b$ is the (unthresholded) prediction.
- Hinge loss is convex but not smooth; it's a convex surrogate for the 0-1 loss, which is not convex. We can optimize it with sub-gradient methods.
- How is this different from the max-margin formulation? Instead of hard constraints, the margin requirements become part of the optimization objective, acting as soft constraints (the soft-margin SVM).
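The sub-gradient approach can be sketched in a few lines (a Pegasos-style sketch; `lam`, the step-size schedule, and the toy data are illustrative choices, and the bias term is omitted for simplicity):

```python
import numpy as np

def train_hinge_l2(X, y, lam=0.1, epochs=300):
    # Minimize lam * ||w||^2 + (1/n) * sum_i max(0, 1 - y_i * w^T x_i)
    n, d = X.shape
    w = np.zeros(d)
    for t in range(1, epochs + 1):
        eta = 1.0 / (lam * t)          # decaying step size
        active = y * (X @ w) < 1       # samples with nonzero hinge loss
        # sub-gradient: 2*lam*w minus the average of y_i x_i over active samples
        grad = 2 * lam * w - (y[active, None] * X[active]).sum(axis=0) / n
        w -= eta * grad
    return w

X = np.array([[2.0, 2.0], [3.0, 1.0], [-1.0, -1.0], [-2.0, 0.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = train_hinge_l2(X, y)
print(np.sign(X @ w))  # recovers the labels on this separable toy set
```

At a non-differentiable point of the hinge ($y \hat{y} = 1$), any value in the sub-differential works; the code simply uses the gradient of the active branch.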

Kernels

Reading

- http://www1.inf.tu-dresden.de/~ds24/lehre/ml_ws_2013/ml_11_hinge.pdf