Machine Learning FAQ:

Must read: Andrew Ng's notes. http://cs229.stanford.edu/materials.html
Good stats read: http://vassarstats.net/textbook/index.html
 Generative model vs. Discriminative model
 one models $p(x \mid y)$ (together with the class prior $p(y)$); the other models $p(y \mid x)$. For generative learning, Bayes' rule is applied for classification.
 For generative learning, each class is modeled separately, agnostic of the others; for discriminative learning, a single model is learned to distinguish between all classes.
 http://robotics.stanford.edu/~ang/papers/nips01discriminativegenerative.pdf
 http://openclassroom.stanford.edu/MainFolder/VideoPage.php?course=MachineLearning&video=06.1NaiveBayesGenerativeLearningAlgorithms&speed=100
 http://melodi.ee.washington.edu/~halloj3/classification.pdf
 Naive Bayes reaches its asymptotic error very quickly with respect to the number of training examples. As the number of training examples grows, logistic regression will outperform naive Bayes and achieve a lower asymptotic error rate.
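 As a toy illustration (mine, not from the readings above; assumes scikit-learn is installed), the sketch below fits a generative and a discriminative classifier on the same synthetic data:

```python
# Generative vs. discriminative on synthetic data (illustrative sketch).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB            # generative: models p(x|y) per class
from sklearn.linear_model import LogisticRegression   # discriminative: models p(y|x) directly

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for clf in (GaussianNB(), LogisticRegression(max_iter=1000)):
    clf.fit(X_tr, y_tr)
    print(type(clf).__name__, clf.score(X_te, y_te))
```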
 How to evaluate a model? If one classifier gives you a worse result, is it a bad classifier?
 Which classifier to choose?
 For a difference (between A/B tests, two models, etc.), how do you know if it's significant? If this week's CTR is 5% better than last week's, can we conclude we've done better?
 A paired difference test is used to assess whether two population means are different. For normally distributed differences we can use a t-test/z-test; otherwise we can use the Wilcoxon test.
 Student's t-test: with unequal variances, we construct $t=\frac{\bar{X}_1-\bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1}+\frac{\sigma_2^2}{n_2}}}$, which is approximately $N(0,1)$ for large samples. Then we can look up the standard Gaussian table and get its $p$-value (usually should be $< 0.05$ or $0.01$ to be significant).
 Wilcoxon test: a non-parametric signed-rank test on the paired differences; no normality assumption needed.
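 A minimal sketch of both tests (assumes scipy; the CTR numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

# Made-up per-day CTRs for the same seven days, paired by day.
ctr_last_week = np.array([0.051, 0.048, 0.052, 0.047, 0.050, 0.049, 0.053])
ctr_this_week = np.array([0.054, 0.050, 0.055, 0.049, 0.052, 0.051, 0.056])

# Paired t-test: assumes the paired differences are roughly normal.
t_stat, p_t = stats.ttest_rel(ctr_this_week, ctr_last_week)

# Wilcoxon signed-rank test: non-parametric alternative.
w_stat, p_w = stats.wilcoxon(ctr_this_week, ctr_last_week)

print(p_t, p_w)  # significant at level alpha if p < alpha (e.g. 0.05)
```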
 Likelihood ratio: $\Lambda(X) = \frac{\sup\{L(\theta_0 \mid X)\}}{\sup\{L(\theta \mid X)\}}$. Hypothesis $H_0$ can be rejected if $\Lambda(X) < c$.
 Regularization. L1 vs L2
 L2: Assuming Gaussian prior on the parameters.
 L1: Assuming Laplacian prior on the parameters.
 http://cs.brown.edu/courses/archive/20062007/cs1955/lectures/lecture13.pdf
 L1-regularized regression: Lasso. L2-regularized regression: Ridge.
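 A quick sketch of the practical difference (assumes scikit-learn; data is synthetic): L1 drives many coefficients to exactly zero, L2 only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge

X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

lasso = Lasso(alpha=1.0).fit(X, y)   # L1 / Laplacian prior
ridge = Ridge(alpha=1.0).fit(X, y)   # L2 / Gaussian prior

print("L1 zero coefficients:", np.sum(lasso.coef_ == 0))  # many exact zeros (sparse)
print("L2 zero coefficients:", np.sum(ridge.coef_ == 0))  # typically none, just small
```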
 Loss
 MAE vs. RMSE: http://www.eumetcal.org/resources/ukmeteocal/verification/www/english/msg/ver_cont_var/uos3/uos3_ko1.htm
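 One way to see the difference (my toy numbers): RMSE squares the errors first, so a single large error dominates it, while MAE weights all errors linearly.

```python
import numpy as np

errors = np.array([1.0, 1.0, 1.0, 10.0])   # one outlier
mae = np.mean(np.abs(errors))              # 3.25
rmse = np.sqrt(np.mean(errors ** 2))       # ~5.07, dragged up by the outlier
print(mae, rmse)
```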
 Generalized Linear Model
 Nicely accommodates multiple regression models such as Linear Regression and Logistic Regression (binomial response using a sigmoid).
 Use a sigmoid function to map $\mathbb{R} \rightarrow (0,1)$; then the regression becomes a probabilistic model.
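 A tiny sketch of that mapping (parameter and feature values are made up):

```python
import numpy as np

def sigmoid(z):
    # maps any real score into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.5, -1.2])   # made-up parameters
x = np.array([2.0, 1.0])        # made-up feature vector
p = sigmoid(theta @ x)          # interpreted as p(y = 1 | x)
print(p)
```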
 Stochastic Gradient Descent vs. Gradient Descent
 SGD updates parameters on every single training example, so if the training set is large, SGD takes much less time per parameter update; it then starts oscillating, since it minimizes the error on a single example instead of the total error. You can picture the path SGD takes to the optimum: instead of following the overall gradient, it makes its decision on single examples whose expected direction is the gradient, so in reality it can drift off the optimum course, but eventually it gets there.
 More practically, if the training data is really big, with batch updating you might not even be able to iterate once. (A toy batch-GD vs. SGD comparison follows the readings below.)
 reading:
 Andrew Ng's video: https://www.youtube.com/watch?v=gdZxqnTKndE#t=102 and his notes http://cs229.stanford.edu/notes/cs229notes1.pdf
 "This means that we have a tradeoff of fast computation per iteration and slow convergence for SGD versus slow computation per iteration and fast convergence for gradient descent" http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/238.pdf
 Gradient Descent vs. Gradient Ascent
 Maximize a quantity (likelihood) vs. Minimize a quantity (loss)
 Maximizing Likelihood vs. Minimizing Loss
 A loss function can really be anything as long as it measures the correctness of the model. You can directly define a loss function (hinge, logistic, etc.), or use the negative likelihood for probabilistic models.
 Non-probabilistic models don't have a likelihood, e.g. the SVM has no likelihood defined.
 It turns out that, for linear regression, minimizing the least-square loss is equivalent to maximizing the likelihood, under the assumption that the error term is zero-mean Gaussian.
 BTW, if we assume the parameters ($\theta$) in linear regression are multivariate Gaussian, then by conducting a MAP estimate for $\theta$ we'll get exactly the form of L2 regularization on $\theta$.
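 For the record, the standard derivation behind that equivalence (not spelled out in these notes): assume $y_i = \theta^Tx_i + \epsilon_i$ with $\epsilon_i \sim N(0,\sigma^2)$ i.i.d. Then

 $$\log L(\theta) = \sum_i \log\frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y_i-\theta^Tx_i)^2}{2\sigma^2}\right) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_i(y_i-\theta^Tx_i)^2,$$

 so maximizing $\log L(\theta)$ over $\theta$ is exactly minimizing the least-square loss $\sum_i(y_i-\theta^Tx_i)^2$.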
 Logistic Regression vs SVM. Dual form
http://www.csie.ntu.edu.tw/~cjlin/papers/maxent_dual.pdf
 Hyperparameter
 the parameters of the parameters of your model
 How important are hyperparameters? I don't know. Maybe not very (e.g. the Dirichlet priors for LDA)? Is $K$ in a clustering algorithm a hyperparameter? Probably not, since $K$ is a parameter of your model (the clustering).
 Why do people always use cross-validation for hyperparameter optimization? Is it a reasonable estimate? http://jmlr.org/papers/volume13/bergstra12a/bergstra12a.pdf
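 A minimal sketch of the usual recipe (assumes scikit-learn; the C grid is arbitrary). Note the linked paper argues random search often beats grid search; sklearn's RandomizedSearchCV is the analogous tool for that.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # candidate hyperparameter values
    cv=5,                                      # 5-fold cross-validation per candidate
)
search.fit(X, y)
print(search.best_params_, search.best_score_)
```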
 Markov Chain Monte Carlo
 EM Algorithm & Gaussian Mixture
 You come up with a model with some parameters $\theta$ as well as a latent variable $z$, e.g. membership in a mixture model.
 E: construct a lower bound; estimate the latent variables $p(z \mid \theta)$ = ...
 M: maximize the lower bound; given $p(z \mid \theta)$, maximize $p(\theta \mid z)$ = ...
 For Gaussian Mixture: each $X_i$ is from one of the Gaussian components; the problem is you don't know which one. Randomly assign a membership to each point. Then we can estimate $p(z \mid \mu, \Sigma)=\frac{p(\mu,\Sigma \mid z)p(z)}{\sum_z p(\mu,\Sigma,z)}$, followed by $\mu$ = ..., $\Sigma$ = ...
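 A compact EM loop for a 1-D two-component mixture, following the E/M steps above (my sketch; data and initialization are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 200)])

pi = np.array([0.5, 0.5])     # mixing weights p(z)
mu = np.array([-1.0, 1.0])    # component means
sigma = np.array([1.0, 1.0])  # component std devs

for _ in range(50):
    # E step: soft memberships p(z | x) via Bayes' rule
    lik = np.stack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = lik / lik.sum(axis=0)
    # M step: re-estimate parameters from the soft memberships
    n_k = resp.sum(axis=1)
    pi = n_k / len(x)
    mu = (resp @ x) / n_k
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / n_k)

print(pi, mu, sigma)  # should recover roughly (0.6, 0.4), (-2, 3), (1, 1)
```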
 Likelihood ratio
 Ratio between the likelihoods under the null and alternative hypotheses: $\frac{p(X \mid \theta_0)}{p(X \mid \theta_1)}$. The null hypothesis with parameter $\theta_0$ will be rejected if the ratio $\lt c$, where $c$ is determined by the chosen significance level $\alpha$.
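 A numeric sketch with two simple hypotheses about a Gaussian mean (assumes scipy; data and hypotheses are made up):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
X = rng.normal(0.5, 1.0, 100)   # data actually drawn with mean 0.5

# log-likelihoods under H0: mean 0 vs. H1: mean 0.5 (unit variance for both)
ll0 = norm.logpdf(X, loc=0.0, scale=1.0).sum()
ll1 = norm.logpdf(X, loc=0.5, scale=1.0).sum()

ratio = np.exp(ll0 - ll1)       # p(X | theta_0) / p(X | theta_1)
print(ratio)                    # a small ratio (< c) argues for rejecting H0
```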
 Design concerns (speculation)
 Bayesian model design: the fewer parameters the better! Design everything to be a random variable so that the only free parameters are the hyperparameters.
 Cheatsheet
 Variance: $Var(X) = E\left((X-\mu)^2\right) = E(X^2) - E^2(X)$
 Covariance: $\sigma(X,Y) = E\left((X-E(X))(Y-E(Y))\right) = E(XY)-E(X)E(Y)$
 $E(X^2) = E\left((X-\mu)^2 - \mu^2 + 2\mu X\right) = \sigma^2 - \mu^2 + 2\mu^2 = \mu^2 + \sigma^2$
 Pearson Correlation: $\rho_{X,Y} = \frac{cov(X,Y)}{\sigma_X\sigma_Y}$ or $r=\frac{\sum_i(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_i(X_i-\bar{X})^2}\sqrt{\sum_i(Y_i-\bar{Y})^2}}$
 http://cs229.stanford.edu/section/gaussians.pdf
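 The identities above are easy to sanity-check numerically (my sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(2.0, 3.0, 100_000)
y = x + rng.normal(0.0, 1.0, 100_000)

print(np.mean(x**2) - np.mean(x)**2, np.var(x))   # Var(X) = E(X^2) - E^2(X)
print(np.mean(x*y) - np.mean(x)*np.mean(y))       # Cov(X,Y) = E(XY) - E(X)E(Y)
print(np.corrcoef(x, y)[0, 1])                    # Pearson correlation
```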
Random:
sparse coding