Logistic Regression Demystified

Modeling labels as 1/0:

  1. We model $p(y=1|x) = \frac{1}{1+\exp(-\theta^Tx)}$ and $p(y=0|x) = 1 - p(y=1|x)$. Hence LR is a discriminative classifier, since we directly model $p(y|x)$.
  2. For a single data point we have $p(y=y^{(i)}|x^{(i)};\theta) = {p(y=1|x^{(i)})}^{y^{(i)}}\, {p(y=0|x^{(i)})}^{1-y^{(i)}}$. This applies to any Bernoulli variable.
  3. For the whole dataset, the conditional likelihood is $L(X,Y;\theta) = \prod_i p(y^{(i)}|x^{(i)};\theta) = \prod_i {p(y=1|x^{(i)})}^{y^{(i)}}\, {p(y=0|x^{(i)})}^{1-y^{(i)}}$
  4. Log-likelihood: $LL(X,Y;\theta) = \sum_i \{y^{(i)}\log p(y=1|x^{(i)}) + (1-y^{(i)})\log p(y=0|x^{(i)})\} = \sum_i \{y^{(i)}\log p(y=1|x^{(i)}) + (1-y^{(i)})\log(1-p(y=1|x^{(i)}))\}$
  5. To maximize LL: unlike linear regression, there is no closed-form solution, due to the nonlinearity of the sigmoid function. The good news is that the negative log-likelihood is convex, so iterative methods converge to the global optimum.
  6. Prediction: predict $y=1$ when $\frac{1}{1+\exp(-\theta^Tx)} > 0.5$, i.e. $\exp(-\theta^Tx)<1$, i.e. $\theta^Tx>0$. This indicates that the decision boundary of LR is linear.
  7. Newton's method: since LL is concave, we can iterate $\theta \leftarrow \theta - H^{-1}\nabla_\theta LL$, where $\nabla_\theta LL = \sum_i (y^{(i)} - p(y=1|x^{(i)}))\,x^{(i)}$ and $H = -\sum_i p(y=1|x^{(i)})(1-p(y=1|x^{(i)}))\,x^{(i)}{x^{(i)}}^T$ is the Hessian. A minimal sketch follows this list.
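
Below is a minimal NumPy sketch of the Newton/IRLS update for the 0/1 formulation above. The data `X`, labels `y`, and iteration count are made-up placeholders; the gradient and Hessian are the ones given in item 7.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_lr_newton(X, y, n_iter=10):
    """Newton's method (IRLS) for logistic regression with y in {0, 1}.

    Gradient of LL:  X^T (y - p)
    Hessian of LL:  -X^T diag(p * (1 - p)) X
    Update:          theta <- theta - H^{-1} grad
    """
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iter):
        p = sigmoid(X @ theta)              # p(y=1|x) for each row
        grad = X.T @ (y - p)                # gradient of the log-likelihood
        W = p * (1.0 - p)                   # diagonal weights
        H = -(X.T * W) @ X                  # Hessian of the log-likelihood
        theta -= np.linalg.solve(H, grad)   # Newton step
    return theta

# Toy usage on made-up data: 100 points, 3 features, noisy linear labels
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(float)
print(fit_lr_newton(X, y))
```
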

Modeling labels as +1/-1:

  1. What if we model $y=+1/-1$? Interestingly, we can then write $p(y|x)=\frac{1}{1+e^{-y\cdot\theta^Tx}}$ for $y\in\{+1,-1\}$.
  2. Log-likelihood: $LL(X,Y;\theta) = \sum_i \log\frac{1}{1+e^{-y^{(i)}\cdot\theta^Tx^{(i)}}} =\sum_i -\log(1+e^{-y^{(i)}\cdot\theta^Tx^{(i)}})$. Therefore maximizing LL is equivalent to minimizing $\sum_i \log(1+e^{-y^{(i)}\cdot\theta^Tx^{(i)}})$, which is the familiar logistic loss.
  3. Minimize loss. L2-LR: $\min\limits_\theta\ \frac{1}{2} \theta^T\theta+C\sum_i\log(1+e^{-y_i\cdot\theta^Tx_i})$ (a numerical sketch of this problem follows this list).
  4. Minimize loss. L1-LR: $\min\limits_\theta\ ||\theta||_1+C\sum_i\log(1+e^{-y_i\cdot\theta^Tx_i})$
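
Here is a minimal sketch of minimizing the L2-regularized logistic loss from item 3 with `scipy.optimize`. The data `X`, the $\pm1$ labels `y`, and the value of `C` are placeholders chosen for illustration.

```python
import numpy as np
from scipy.optimize import minimize

def l2_lr_objective(theta, X, y, C):
    """0.5 * theta^T theta + C * sum_i log(1 + exp(-y_i * theta^T x_i))."""
    margins = y * (X @ theta)
    # np.logaddexp(0, -m) computes log(1 + exp(-m)) without overflow
    return 0.5 * theta @ theta + C * np.logaddexp(0.0, -margins).sum()

def l2_lr_grad(theta, X, y, C):
    margins = y * (X @ theta)
    # d/dtheta log(1 + exp(-m_i)) = -sigma(-m_i) * y_i * x_i
    coef = -1.0 / (1.0 + np.exp(margins))
    return theta + C * X.T @ (coef * y)

# Toy usage on made-up data with labels in {+1, -1}
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5) + 0.5 * rng.normal(size=200))
res = minimize(l2_lr_objective, np.zeros(5), args=(X, y, 1.0),
               jac=l2_lr_grad, method="L-BFGS-B")
print(res.x)
```
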


Multinomial Logistic Regression:

  1. What about multinomial logistic regression? With $K$ classes, we model $p(y=j|x;\theta)=\frac{e^{\theta_j^Tx}}{\sum_{k=1}^Ke^{\theta_k^Tx}}$.
  2. Equivalently, since the probabilities sum to one, we can take class $K$ as the reference: $p(y=K|x)=\frac{1}{1+\sum_{k=1}^{K-1}e^{\theta_k^Tx}}$ and $p(y=j|x)=\frac{e^{\theta_j^Tx}}{1+\sum_{k=1}^{K-1}e^{\theta_k^Tx}}$ for $j=1,\dots,K-1$, so only $(K-1)n$ parameters are needed, where $n$ is the number of features. A sketch of computing the softmax probabilities follows this list.
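
A minimal sketch of the softmax probabilities from item 1, with the usual max-subtraction trick to avoid overflow. The parameter matrix `Theta` (one row per class) and the toy shapes are assumptions for illustration.

```python
import numpy as np

def softmax_probs(X, Theta):
    """p(y=j|x) = exp(theta_j^T x) / sum_k exp(theta_k^T x), computed row-wise.

    X:     (n, d) data matrix
    Theta: (K, d) one parameter vector per class
    """
    scores = X @ Theta.T                           # (n, K) matrix of theta_k^T x
    scores -= scores.max(axis=1, keepdims=True)    # same probabilities, no overflow
    exp_scores = np.exp(scores)
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Toy usage with made-up shapes: 4 points, 3 features, K = 5 classes
rng = np.random.default_rng(0)
probs = softmax_probs(rng.normal(size=(4, 3)), rng.normal(size=(5, 3)))
print(probs.sum(axis=1))   # each row sums to 1
```
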

Computation concerns:

Reading