Machine Learning Notes

Must read: Andrew Ng's notes. http://cs229.stanford.edu/materials.html
Good stats read: http://vassarstats.net/textbook/index.html
Generative model vs. Discriminative model

A generative model learns $p(x|y)$ (together with the class prior $p(y)$); a discriminative model learns $p(y|x)$ directly. For generative learning, Bayes' rule is applied at classification time.
For generative learning, each class is modeled separately, agnostic of the others. For discriminative learning, a single model is learned to distinguish between all classes.
http://robotics.stanford.edu/~ang/papers/nips01-discriminativegenerative.pdf
http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=06.1-NaiveBayes-GenerativeLearningAlgorithms&speed=100
http://melodi.ee.washington.edu/~halloj3/classification.pdf
Naive Bayes reaches its asymptotic error very quickly with respect to the number of training examples. As the number of training examples grows, logistic regression overtakes naive Bayes and achieves a lower asymptotic error rate.
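Written out, the Bayes-rule step for a generative classifier looks roughly like this (standard notation, with $p(y)$ the class prior and $y'$ ranging over all classes; the symbols are introduced here, not taken from the links above):

$$
\hat{y} = \arg\max_y p(y|x) = \arg\max_y \frac{p(x|y)\,p(y)}{\sum_{y'} p(x|y')\,p(y')} = \arg\max_y p(x|y)\,p(y)
$$

The denominator does not depend on $y$, which is why each class can be modeled on its own and compared only at prediction time.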

How to evaluate a model? If one classifier gives a worse result, is it a bad classifier?
Which classifier to choose?

For a difference (between A/B tests, two models, etc.), how do you know if it's significant? If this week's CTR is 5% better than last week's, can we conclude we've done better?

A paired difference test is used to assess whether two population means differ. For normally distributed differences we can use a t-test/z-test; otherwise we can use the Wilcoxon test.
Student's t-test: with unequal variances (Welch's variant), we construct $t=\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{s_1^2}{n_1}+\frac{s_2^2}{n_2}}}$, where $s_1^2$ and $s_2^2$ are the sample variances. For large samples $t$ is approximately $N(0,1)$, so we can look up a standard Gaussian table to get its $p$-value (usually it should be below 0.05 or 0.01 to be significant); for small samples use the t distribution instead.
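A minimal sketch of the unequal-variance statistic in plain Python, assuming the samples are large enough that the normal approximation is acceptable. Note that it uses the sample variances and a plus sign under the square root. The function name `welch_t` is made up for illustration.

```python
import math

def welch_t(sample1, sample2):
    """Return (t, two-sided p-value under a normal approximation)."""
    n1, n2 = len(sample1), len(sample2)
    mean1 = sum(sample1) / n1
    mean2 = sum(sample2) / n2
    # Unbiased sample variances s^2 (divide by n - 1).
    var1 = sum((x - mean1) ** 2 for x in sample1) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in sample2) / (n2 - 1)
    t = (mean1 - mean2) / math.sqrt(var1 / n1 + var2 / n2)
    # Two-sided tail of N(0, 1): P(|Z| > |t|) = erfc(|t| / sqrt(2)).
    p = math.erfc(abs(t) / math.sqrt(2))
    return t, p
```

For the CTR question above, `sample1` and `sample2` could be the daily CTR values of the two weeks; with only seven points per week the normal approximation is rough, so a t table would be the safer lookup.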
Wilcoxon test:
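Likelihood ratio: $\Lambda(X) = \frac{\sup\{L(\theta_0|X)\}}{\sup\{L(\theta|X)\}}$, where the numerator is maximized over the parameters $\theta_0$ allowed by the null hypothesis and the denominator over all $\theta$. Hypothesis $H_0$ can be rejected if $\Lambda(X) \lt c$.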
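As a concrete instance, here is a sketch of a likelihood ratio test for a coin (or click/no-click) with $H_0: p = p_0$. The function name `binomial_lrt` is hypothetical. Turning $\Lambda$ into a $p$-value relies on the large-sample result (Wilks' theorem) that $-2\log\Lambda$ is approximately chi-square with 1 degree of freedom under $H_0$, which is an assumption added here.

```python
import math

def binomial_lrt(heads, flips, p0=0.5):
    """Likelihood ratio test of H0: p = p0 (0 < p0 < 1) against a free p."""
    p_hat = heads / flips  # maximum-likelihood estimate under the free model

    def log_lik(p):
        # Binomial log-likelihood up to a constant that cancels in the ratio.
        ll = 0.0
        if heads > 0:
            ll += heads * math.log(p)
        if flips - heads > 0:
            ll += (flips - heads) * math.log(1 - p)
        return ll

    log_lambda = log_lik(p0) - log_lik(p_hat)  # log of the ratio, never positive
    stat = -2.0 * log_lambda
    # Upper tail of a chi-square with 1 degree of freedom: erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return math.exp(log_lambda), p_value
```

Rejecting when $\Lambda \lt c$ is the same as rejecting when the returned $p$-value falls below the chosen significance level.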

MAE vs. RMSE: http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/msg/ver_cont_var/uos3/uos3_ko1.htm

Nicely accommodates multiple regression models such as linear regression and logistic regression (binomial response using sigmoid).
Use a sigmoid function to map $\mathbb{R} \rightarrow (0,1)$; the regression then becomes a probabilistic model.
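A small sketch of that mapping in plain Python. The helper names `sigmoid` and `predict_proba` are illustrative, and the two-branch form is just one common way to avoid overflow in `exp`.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + exp(-z)), written to avoid overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def predict_proba(weights, bias, features):
    """P(y = 1 | x) for a logistic regression with the given parameters."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)
```

The linear score `z` can be any real number; passing it through `sigmoid` is what lets the output be read as a probability of the positive class.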

SGD updates the parameters after every single training example, so when the training set is large each update is much cheaper than a batch update. The price is noise: each step minimizes the error on one example rather than the total error, so the path oscillates around the optimum. Picture the path SGD takes to the optimum: instead of following the overall gradient, it decides based on single examples. The expected direction is the full gradient, so individual steps can be off course, but with a suitably decreasing step size it eventually gets there.
More practically, if the training data is really big, with batch updating you might not even be able to complete a single iteration.
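To make the contrast concrete, here is a sketch of both update rules on a toy one-feature linear regression. The synthetic data (true slope 3, intercept 1), the learning rates and the function names are all assumptions chosen for illustration.

```python
import random

def make_data(n, seed=0):
    """Synthetic data y = 3x + 1 + small Gaussian noise (illustrative only)."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3.0 * x + 1.0 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

def batch_gd(xs, ys, lr=0.5, epochs=200):
    """Batch gradient descent: one update per full pass over the data."""
    w = b = 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum((w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def sgd(xs, ys, lr=0.05, epochs=20, seed=0):
    """Stochastic gradient descent: one cheap, noisy update per example."""
    rng = random.Random(seed)
    w = b = 0.0
    order = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            err = w * xs[i] + b - ys[i]
            w -= lr * err * xs[i]
            b -= lr * err
    return w, b
```

With a constant learning rate as used here, the SGD estimate keeps jittering in a small neighbourhood of the least-squares solution rather than settling exactly on it; shrinking `lr` over time reduces that jitter.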
reading:
- Andrew Ng's video: https://www.youtube.com/watch?v=gdZxqnTKndE#t=102 and his notes http://cs229.stanford.edu/notes/cs229-notes1.pdf
- "This means that we have a trade-off of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent" http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/238.pdf

A loss function can be almost anything as long as it measures how wrong the model is. You can define a loss function directly (hinge, logistic, etc.), or use the negative log-likelihood for probabilistic models.
Non-probabilistic models don't have a likelihood. E.g. SVM has no likelihood defined.
It turns out that for linear regression, minimizing the least-squares loss is equivalent to maximizing the likelihood, under the assumption that the error term is zero-mean Gaussian.
BTW, if we assume the parameters ($\theta$) in linear regression have a zero-mean isotropic Gaussian prior, then the MAP estimate for $\theta$ yields exactly L2 regularization on $\theta$.
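Both claims follow in a couple of lines. The sketch below assumes $y_i = \theta^T x_i + \epsilon_i$ with i.i.d. $\epsilon_i \sim N(0, \sigma^2)$ and, for MAP, a prior $\theta \sim N(0, \tau^2 I)$; $\sigma$ and $\tau$ are notation introduced here.

$$
\begin{aligned}
\hat{\theta}_{MLE} &= \arg\max_\theta \sum_i \log N(y_i \mid \theta^T x_i, \sigma^2) = \arg\min_\theta \sum_i (y_i - \theta^T x_i)^2 \\
\hat{\theta}_{MAP} &= \arg\max_\theta \Big[ \sum_i \log N(y_i \mid \theta^T x_i, \sigma^2) + \log N(\theta \mid 0, \tau^2 I) \Big] = \arg\min_\theta \Big[ \sum_i (y_i - \theta^T x_i)^2 + \frac{\sigma^2}{\tau^2} \|\theta\|_2^2 \Big]
\end{aligned}
$$

The regularization strength is the ratio $\sigma^2/\tau^2$, so a tighter prior (smaller $\tau$) means stronger shrinkage.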

parameters of the parameters of your model
How important are hyper-parameters? I don't know. Maybe not very (Dirichlet priors for LDA)? Is K in a clustering algorithm a hyper-parameter? Probably not, since K is a parameter of your model (clustering).
Why do people always use cross-validation for hyper-parameter optimization? Is it a reasonable estimate? http://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf

You come up with a model with some parameters $\theta$ as well as a latent variable $z$, e.g. membership in a mixture model.
E: construct a lower bound. Estimate the latent variables: $p(z|x,\theta)$ = ...
M: maximize the lower bound. Given $p(z|x,\theta)$, maximize the expected complete-data log-likelihood over $\theta$ = ...
For Gaussian mixture: each $X_i$ comes from one of the Gaussian components; the problem is you don't know which one. Randomly assign membership to each point. Then we can estimate $p(z|X_i,\mu,\Sigma)=\frac{p(X_i|z,\mu,\Sigma)p(z)}{\sum_{z'}p(X_i|z',\mu,\Sigma)p(z')}$, followed by $\mu$ = .., $\Sigma$ = ...
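A minimal sketch of these E and M steps for a one-dimensional Gaussian mixture, in plain Python. The function name `em_gmm_1d`, the `init_mus` argument and the variance floor are assumptions made for illustration. It also starts from caller-supplied means rather than the random membership assignment mentioned above, which is a swapped-in initialization.

```python
import math

def normal_pdf(x, mu, var):
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)

def em_gmm_1d(xs, init_mus, iters=50):
    """EM for a 1-D Gaussian mixture; returns (weights, means, variances)."""
    n = len(xs)
    k = len(init_mus)
    mus = list(init_mus)
    # Start every component with the overall variance and equal weight.
    overall_mean = sum(xs) / n
    overall_var = sum((x - overall_mean) ** 2 for x in xs) / n
    variances = [overall_var] * k
    pis = [1.0 / k] * k
    for _ in range(iters):
        # E-step: resp[i][j] = p(z_i = j | x_i, current parameters).
        resp = []
        for x in xs:
            joint = [pis[j] * normal_pdf(x, mus[j], variances[j]) for j in range(k)]
            total = sum(joint)
            resp.append([v / total for v in joint])
        # M-step: responsibility-weighted maximum-likelihood updates.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            mus[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - mus[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, 1e-6)  # floor to avoid a collapsing component
            pis[j] = nj / n
    return pis, mus, variances
```

A call such as `em_gmm_1d(xs, [min(xs), max(xs)])` is one simple way to seed two components; the result depends on the initialization, since EM only finds a local optimum.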

Ratio between the likelihoods under the null hypothesis and the alternative hypothesis: $\frac{p(X|\theta_0)}{p(X|\theta_1)}$. The null hypothesis with parameter $\theta_0$ is rejected if the ratio $\lt c$, where $c$ is determined by the chosen significance level $\alpha$.

Bayesian model design: fewer parameters is better! Design everything as random variables, so that the only free parameters are the hyper-parameters.

Variance: $Var(X) = E({(X-\mu)}^2) = E(X^2) - E^2(X)$
Covariance: $\sigma(X,Y) = E((X-E(X))(Y-E(Y))) = E(XY)-E(X)E(Y)$
$E(X^2) = E((X-\mu)^2 -\mu^2 + 2\mu X) = \sigma^2 - \mu^2 + 2\mu^2 = \mu^2 + \sigma^2$
Pearson Correlation: $\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y}$ or $r=\frac{\sum_i(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_i(X_i-\bar{X})^2}\sqrt{\sum_i(Y_i-\bar{Y})^2}}$
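The identities above are easy to check numerically. This sketch uses population (divide-by-$n$) moments so that $Var(X) = E(X^2) - E^2(X)$ holds exactly; the helper names are illustrative.

```python
import math

def mean(xs):
    return sum(xs) / len(xs)

def variance(xs):
    """Population variance E[(X - mu)^2]."""
    mu = mean(xs)
    return mean([(x - mu) ** 2 for x in xs])

def covariance(xs, ys):
    """Population covariance E[(X - E[X])(Y - E[Y])]."""
    mx, my = mean(xs), mean(ys)
    return mean([(x - mx) * (y - my) for x, y in zip(xs, ys)])

def pearson(xs, ys):
    """Pearson correlation: covariance divided by both standard deviations."""
    return covariance(xs, ys) / math.sqrt(variance(xs) * variance(ys))
```

The $1/n$ factors cancel in `pearson`, so it agrees with the sum form of $r$ whichever normalization is used for the variances.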

Random: