### Support Vector Machine, Kernel, Etc.

Formulate the primal

• Start with a linear method: $h(x) = w^Tx+b$, and classify with $\text{sign}(w^Tx+b)$.
• We try to solve the following optimization problem:
$\min_{w,b} \frac{1}{2} ||w||^2$
s.t.   $y^{(i)}(w^Tx^{(i)} + b) \ge 1, i = 1,...,m$
• Intuition: maximize the margin, i.e. the smallest distance between the decision boundary and the samples. If we don't use kernels, the problem is solved at this point. But in order to use kernels, we need to convert it to its dual form.
• Feel free to read more on VC Dimension and statistical learning theory.
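The hard-margin primal above is just a small constrained optimization problem, so as a sanity check it can be solved with a generic solver. A minimal sketch using `scipy.optimize.minimize` on a toy separable dataset (the data and the SLSQP solver choice are illustrative assumptions, not part of these notes; real SVM libraries use specialized solvers):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Primal objective (1/2)||w||^2, with parameter vector p = [w1, w2, b].
def objective(p):
    w = p[:2]
    return 0.5 * w @ w

# One margin constraint per sample: y_i (w^T x_i + b) - 1 >= 0.
constraints = [
    {"type": "ineq", "fun": lambda p, i=i: y[i] * (X[i] @ p[:2] + p[2]) - 1.0}
    for i in range(len(y))
]

res = minimize(objective, x0=np.zeros(3), method="SLSQP", constraints=constraints)
w, b = res.x[:2], res.x[2]
print("w =", w, "b =", b)
print("margins:", y * (X @ w + b))  # all >= 1 at the optimum
```

At the optimum every margin constraint holds with $y^{(i)}(w^Tx^{(i)}+b) \ge 1$, and the points whose constraints are tight (equal to 1) are the support vectors.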

Lagrangian multipliers

• if we have the following primal optimization problem:
$\min_w f(w)$
s.t.   $g_i(w) \le 0, i = 1,...,k$
$h_i(w) = 0, i = 1,...,l.$
• construct the generalized Lagrangian $L(w,\alpha,\beta) = f(w) + \sum_{i=1}^k\alpha_ig_i(w) + \sum_{i=1}^l\beta_ih_i(w)$ and define:
$\theta_p(w) = \max_{\alpha,\beta:\alpha_i\ge0} L(w,\alpha,\beta)$
$\theta_p(w)$ goes to infinity if any constraint is violated. Therefore the constrained problem is equivalent to:
$\min_w\theta_p(w) = \min_w \max_{\alpha,\beta:\alpha_i\ge0}L(w,\alpha,\beta)$
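A one-dimensional example (chosen here purely for illustration) shows why the max penalizes violations:

```latex
% Minimize f(w) = w^2 subject to g(w) = 1 - w \le 0, i.e. w \ge 1.
% The Lagrangian is L(w, \alpha) = w^2 + \alpha (1 - w), with \alpha \ge 0.
\theta_p(w) = \max_{\alpha \ge 0} \left[ w^2 + \alpha(1 - w) \right]
= \begin{cases}
\infty & \text{if } w < 1 \quad (\text{take } \alpha \to \infty) \\
w^2    & \text{if } w \ge 1 \quad (\text{optimal } \alpha = 0)
\end{cases}
```

so $\min_w \theta_p(w) = 1$, attained at $w = 1$, which is exactly the constrained minimum.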

Dual of Max-margin

• constructing the Lagrangian for the primal we get:
$L(w,b,\alpha) = \frac{1}{2}||w||^2 - \sum_{i=1}^m\alpha_i[y^{(i)}(w^Tx^{(i)} + b)-1]$
taking the derivatives with respect to $w$ and $b$, setting them to zero, and plugging back, we get:
$L(w,b,\alpha) = \sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i,j=1}^my^{(i)}y^{(j)}\alpha_i\alpha_j(x^{(i)})^Tx^{(j)}$
with the constraints we finally get the dual form:
$\max_\alpha W(\alpha) = \sum_{i=1}^m\alpha_i - \frac{1}{2}\sum_{i,j=1}^my^{(i)}y^{(j)}\alpha_i\alpha_j\langle x^{(i)}, x^{(j)}\rangle$
s.t.      $\alpha_i\ge0, i = 1,...,m$
$\sum_{i=1}^m \alpha_iy^{(i)} = 0$
• Prediction on new $x$: since $w = \sum_{i=1}^m\alpha_iy^{(i)}x^{(i)}$, we predict $h(x) = \text{sign}\left(\sum_{i=1}^m\alpha_iy^{(i)}\langle x^{(i)}, x\rangle + b\right)$, which depends on the data only through inner products.
• The two problems are equivalent: if we can solve the dual, we can also solve the primal. For more on duality, read https://en.wikipedia.org/wiki/Duality_(optimization)
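The dual is itself a small QP in $\alpha$, so it can also be solved with a generic solver and the primal $(w, b)$ recovered from it. A rough numpy/scipy sketch (the toy data and generic solver are assumptions for illustration; production SVMs use specialized solvers such as SMO):

```python
import numpy as np
from scipy.optimize import minimize

# Toy linearly separable data: two points per class.
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = len(y)
K = X @ X.T  # Gram matrix of inner products <x_i, x_j>

# Maximize W(alpha) = sum_i alpha_i - (1/2) sum_ij y_i y_j alpha_i alpha_j K_ij
# by minimizing its negation.
def neg_dual(a):
    v = a * y
    return -(a.sum() - 0.5 * v @ K @ v)

res = minimize(
    neg_dual,
    x0=np.ones(m) / m,
    bounds=[(0.0, None)] * m,                              # alpha_i >= 0
    constraints=[{"type": "eq", "fun": lambda a: a @ y}],  # sum_i alpha_i y_i = 0
)
alpha = res.x

# Recover the primal solution: w = sum_i alpha_i y_i x_i, and b from any
# support vector (alpha_i > 0), where y_i (w^T x_i + b) = 1.
w = (alpha * y) @ X
sv = alpha > 1e-6
b = np.mean(y[sv] - X[sv] @ w)
print("alpha =", alpha.round(4), "w =", w, "b =", b)
```

Note that only the Gram matrix `K` enters the objective, which is exactly the property that lets a kernel replace the raw inner products.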

From a loss perspective:

• It can also be regarded as Hinge loss + L2 regularization: $L(X;Y) = \sum_i l(y^{(i)}, \hat{y}^{(i)};x^{(i)}) + \lambda||w||^2$
• $l$ is defined as the hinge loss: $l(y,\hat{y}) = \max(0, 1 - y \cdot \hat{y})$, where $y$ is the label and $\hat{y}$ is the prediction (the raw score $w^Tx+b$).
• Hinge loss is convex but not smooth; it is a surrogate for the 0-1 loss, since the 0-1 loss isn't convex. We can optimize it using sub-gradients.
• How is this different from the max-margin formulation? Instead of hard constraints, the margin requirements become part of the optimization objective, acting more like soft constraints that can be violated at a cost.

Kernels