Optimization for machine learning 
 Estimator choice: 
First we need to write out the function to be optimized. 
-  Point estimate: Maximum Likelihood (MLE). Find the single point (model parameters $\theta$) that maximizes the likelihood of the observed data under the model, $p(Data; \theta)$.
-  Bayesian point estimate: Maximum A Posteriori (MAP). Find the single point (model parameters $\theta$) that maximizes the posterior probability $p(\theta | Data) \propto p(Data | \theta)\,p(\theta)$.
-  Fully Bayesian approach: instead of a single point, estimate the posterior distribution over $\theta$, summarized for example by its mean and variance.
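The three estimator choices can be compared on a toy problem. The sketch below assumes a coin with unknown heads probability $\theta$ and a Beta(2, 2) prior; the prior, the data (3 heads in 3 flips), and all function names are illustrative assumptions, not part of these notes.

```python
def bernoulli_mle(heads, flips):
    """Maximum likelihood estimate of the heads probability."""
    return heads / flips


def bernoulli_map(heads, flips, alpha=2.0, beta=2.0):
    """MAP estimate: mode of the Beta(heads + alpha, tails + beta) posterior.

    The mode formula holds when both posterior parameters exceed 1.
    """
    return (heads + alpha - 1) / (flips + alpha + beta - 2)


def bernoulli_posterior(heads, flips, alpha=2.0, beta=2.0):
    """Fully Bayesian summary: mean and variance of the Beta posterior."""
    a = heads + alpha
    b = flips - heads + beta
    mean = a / (a + b)
    variance = a * b / ((a + b) ** 2 * (a + b + 1))
    return mean, variance


# Assumed data: 3 heads out of 3 flips.
theta_mle = bernoulli_mle(3, 3)
theta_map = bernoulli_map(3, 3)
post_mean, post_var = bernoulli_posterior(3, 3)
print(theta_mle, theta_map, post_mean, post_var)
```

With so little data the MLE commits to $\theta = 1$, while the MAP estimate is pulled toward the prior and the posterior variance records how uncertain the estimate still is.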
To minimize/maximize a function $F$, there are a few choices:
 Gradient descent: 
-  only needs the first derivative (gradient) of $F$; simple to implement
-  each iteration is cheap
-  has an extra hyperparameter (the learning rate) that must be tuned
-  features should be rescaled to similar ranges, otherwise convergence is slow
-  for linear regression, solving the normal equation instead requires no rescaling
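A minimal sketch of batch gradient descent for one-feature linear regression, with the feature standardized first. The data, the learning rate of 0.1, the iteration count, and the function names are assumptions chosen for illustration.

```python
def standardize(values):
    """Rescale values to zero mean and unit standard deviation."""
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values], mean, std


def gradient_descent(xs, ys, learning_rate=0.1, iterations=500):
    """Minimize the mean squared error of y = w * x + b (with a 1/2 factor)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(iterations):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Assumed data generated from y = 0.002 * x + 1, with a large feature scale.
raw_x = [1000.0, 2000.0, 3000.0, 4000.0]
ys = [3.0, 5.0, 7.0, 9.0]

scaled_x, x_mean, x_std = standardize(raw_x)
w, b = gradient_descent(scaled_x, ys)

# Map the fitted parameters back to the original feature scale.
slope = w / x_std
intercept = b - w * x_mean / x_std
print(slope, intercept)
```

Running the same loop on `raw_x` with this learning rate would diverge, because the gradient with respect to `w` scales with the square of the feature magnitude; that is the practical reason for rescaling.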
 Subgradient descent: 
-  used when the objective is convex but not differentiable everywhere (e.g. hinge loss)
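As a rough illustration, the sketch below runs subgradient descent on the average hinge loss of a linear classifier without a bias term. The toy data, the $0.1/\sqrt{t}$ step size, and the function names are assumptions; at the kink (margin exactly 1) it picks zero, which is one valid subgradient.

```python
def hinge_loss(w, data):
    """Average hinge loss max(0, 1 - y * <w, x>) over the data set."""
    total = 0.0
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        total += max(0.0, 1.0 - margin)
    return total / len(data)


def hinge_subgradient(w, data):
    """One subgradient of the average hinge loss at w."""
    grad = [0.0] * len(w)
    for x, y in data:
        margin = y * sum(wi * xi for wi, xi in zip(w, x))
        if margin < 1.0:  # only points inside the margin contribute
            for j, xj in enumerate(x):
                grad[j] -= y * xj
    return [g / len(data) for g in grad]


def subgradient_descent(data, dim, iterations=100):
    w = [0.0] * dim
    for t in range(1, iterations + 1):
        step = 0.1 / t ** 0.5  # diminishing step size
        g = hinge_subgradient(w, data)
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w


# Assumed linearly separable toy data: (features, label in {-1, +1}).
data = [((2.0, 1.0), 1), ((1.0, 2.0), 1), ((-1.0, -2.0), -1), ((-2.0, -1.0), -1)]
w = subgradient_descent(data, dim=2)
final_loss = hinge_loss(w, data)
print(w, final_loss)
```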
 Conjugate gradient: 
 Newton's method: 
-  needs first and second derivatives of $F$ (gradient and Hessian matrix)
-  each iteration is expensive (the Hessian must be computed and inverted)
-  but converges in far fewer iterations
-  $\theta := \theta - H^{-1}\nabla_{\theta}l(\theta)$, where $\nabla_{\theta}l(\theta)$ is the gradient (the vector of partial derivatives) of the objective $l$ with respect to $\theta$ and $H$ is the Hessian matrix of $l$.
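The trade-off above can be seen on a one-dimensional example, where the Hessian is just the second derivative. The sketch assumes the objective $l(\theta) = e^{\theta} - 2\theta$, whose minimum is at $\theta = \ln 2$; the objective, tolerances, and learning rate are illustrative choices only.

```python
import math


def gradient(theta):
    """First derivative of l(theta) = exp(theta) - 2 * theta."""
    return math.exp(theta) - 2.0


def hessian(theta):
    """Second derivative of l (the 1-D Hessian)."""
    return math.exp(theta)


def newton(theta, tolerance=1e-10, max_iterations=50):
    """Newton update: theta := theta - H^{-1} * gradient."""
    for iteration in range(1, max_iterations + 1):
        step = gradient(theta) / hessian(theta)
        theta -= step
        if abs(step) < tolerance:
            return theta, iteration
    return theta, max_iterations


def gradient_descent(theta, learning_rate=0.1, tolerance=1e-10,
                     max_iterations=10000):
    """Plain gradient descent on the same objective, for comparison."""
    for iteration in range(1, max_iterations + 1):
        step = learning_rate * gradient(theta)
        theta -= step
        if abs(step) < tolerance:
            return theta, iteration
    return theta, max_iterations


theta_newton, newton_iters = newton(0.0)
theta_gd, gd_iters = gradient_descent(0.0)
print(theta_newton, newton_iters, theta_gd, gd_iters)
```

Both methods reach the same minimizer, but Newton's method needs far fewer iterations. In higher dimensions each Newton step requires solving a linear system with $H$, which is where the per-iteration cost comes from.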
 Limited-memory BFGS 
Reading
-  Stanford videos: https://www.youtube.com/watch?v=McLq1hEq3UY