Logistic Regression Demystified
Modeling labels as 1/0:
- 
We model $p(y=1|x) = \frac{1}{1+exp(-\theta^TX)}$ and $p(y=0) = 1 - p(y=1)$. Hence LR is a discriminative classifier since we are directlying modeling $p(y|x)$.
- 
For each single data point we have $p(y=y_i|x;\theta) = {p(y=1|x)}^{y_i} {p(y=0|x)}^{1-y_i}$. This applies to all Bernoulli variables.
- 
For whole dataset we have conditional likelihood: $L(X,Y,\theta) = \prod_i p(y^{(i)}|x^{(i)};\theta) = \prod_i {p(y=1|x)}^{y_i} {p(y=0|x)}^{1-y_i} $
- 
Log-likelihood: $LL(X,Y,\theta) = \sum \{y^{(i)}\log\ {p(y=1|x)} + (1-y^{(i)}){p(y=0|x)}\} = \sum \{y^{(i)}\log\ {p(y=1|x)} + (1-y^{(i)})\log\ (1-p(y=1|x))\}$
-  To miminize LL: unlike linear regression there's no close-form solution due to nonlinearity of the sigmoid function. Good news is its MLE is convex.
   
    -  updating $\theta$ by gradient descent: $\theta_j = \theta_j + \alpha \sum_i (y^{(i)} - p(x^{(i)}))x_j^{(i)}$. This is not hard to understand: update $\theta$ by expected error model made on all examples.
    
-  Or updating $\theta$ by stochastic gradient descent: $\theta_j = \theta_j + \alpha(y^{(i)} - p(x^{(i)}))x_j^{(i)}$. Loop through $i$ sequentially from examples
  
 
-  Prediction: $y=1$ when $\frac{1}{1+exp(-\theta^TX)} > 0.5$, i.e. $exp(-\theta^TX)<1$, i.e. $\theta^TX>0$ This indicates that the decision boundary for LR is linear.
-  Newton's method:
Modeling labels as +1/-1:
-  What if we model Y=+1/-1 ? Interestingly we can model $p(y=\pm1)=\frac{1}{1+e^{-y\cdot\theta^Tx}}$
-  Log-likelihood: $LL(X,Y,\theta) = \sum_i \log\frac{1}{1+e^{-y^{(i)}\cdot\theta^Tx^i}} =\sum_i -\log(1+e^{-y^{(i)}\cdot\theta^Tx^i})$. Therefore maximizing LL is equivalent to minimizing $\sum_i \log(1+e^{-y^{(i)}\cdot\theta^Tx^i})$, which is the logistic loss we are familiar with.
-  Minimize loss. L2-LR: $\min\limits_\theta\  \frac{1}{2} \theta^T\theta+C\sum_i\log(1+e^{-y_i\cdot\theta^Tx_i})$
-  Minimize loss. L1-LR: $\min\limits_\theta\  \frac{1}{2} ||\theta||+C\sum_i\log(1+e^{-y_i\cdot\theta^Tx_i})$
-  Tip: for sigmoid function we have: $g'(z) = g(z)(1-g(z)) g(-z) = 1 - g(z)$ and $g(z)+g(-z)=1$
Multinomial Logistic Regression:
-  What about multinomial logistic regression? We model $p(y=j|x;\theta)=\frac{e^{\theta_j^Tx}}{\sum_{k=1}^Ke^{\theta_k^Tx}}$ if there's $k$ classes. 
-  Equivalently, we can also model class $K$ as $p(c=K)=\frac{1}{1+\sum_{k=1}^{K-1}e^{\theta_k^Tx}}$ and other classes $p(c=1...K-1)=\frac{e^{\theta_k^Tx}}{1+\sum_{k=1}^{K-1}e^{\theta_k^Tx}}$ thus only $(K-1)n$ parameters needed. $n$ is number of features.
Computation concerns:
-  Apparently computing exponentials is very prone to overflow. And you should divide both top/bottom by some large number to prevent this. I use the max number of top/bottom.
-  For the polynomial function $\sum w_ix_i$ some people also include an intercept term making it $w_0+\sum_{i=1}^{n} w_ix_i$. Would this help? I tried a few experiments and didn't see any difference. The learned $w$ would be different of course, but classification result is same. It might make a difference when features are few.
-  Tip: for sigmoid function we have: $g'(z) = g(z)(1-g(z))$, $g(-z) = 1 - g(z)$ so $g(z)+g(-z)=1$
Reading