### Optimization for machine learning

#### Estimator choice:

First we need to write out the function to be optimized.
• Point estimate: Maximum Likelihood (MLE). Find the single point (model parameters $\theta$) that maximizes the likelihood of the data under the model, $p(Data; \theta)$.
• Bayesian point estimate: Maximum A Posteriori (MAP). Find the single point (model parameters $\theta$) that maximizes the posterior probability $p(\theta | Data) \propto p(Data | \theta)\,p(\theta)$.
• Fully Bayesian approach: keep the whole posterior distribution $p(\theta | Data)$ rather than a single point, typically summarized by its mean and variance.
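
As a small illustration of the three estimators, consider estimating the heads probability of a coin. This is only a sketch under stated assumptions: the Bernoulli model, the Beta(a, b) prior, and the function name `bernoulli_estimates` are choices made for this example, not part of the notes above. The closed forms used are the standard ones for the Beta-Bernoulli conjugate pair.

```python
def bernoulli_estimates(heads, n, a=2.0, b=2.0):
    """Return (MLE, MAP, posterior mean, posterior variance) for a coin.

    Assumes a Bernoulli likelihood and a Beta(a, b) prior on theta.
    The MAP formula below is the mode of the Beta posterior, which is
    valid when both posterior parameters are greater than 1.
    """
    tails = n - heads

    # Maximum likelihood: the fraction of heads.
    mle = heads / n

    # Posterior is Beta(a + heads, b + tails).
    post_a, post_b = a + heads, b + tails

    # MAP: mode of the posterior.
    map_est = (post_a - 1.0) / (post_a + post_b - 2.0)

    # Fully Bayesian summary: mean and variance of the posterior.
    total = post_a + post_b
    post_mean = post_a / total
    post_var = post_a * post_b / (total ** 2 * (total + 1.0))

    return mle, map_est, post_mean, post_var


mle, map_est, post_mean, post_var = bernoulli_estimates(heads=7, n=10)
print(mle, map_est, post_mean, post_var)
```

With little data the prior pulls the MAP estimate and the posterior mean toward 0.5; as `n` grows, all three estimates approach each other.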

To minimize/maximize a function $F$, there are a few choices:

#### Gradient descent:

• needs only the first derivative (gradient) of $F$, so it is simple to implement
• each iteration is cheap
• has an extra parameter to tune (the learning rate)
• features should be rescaled to similar ranges; otherwise convergence can be slow
• for linear regression, solving the normal equation instead requires no rescaling
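
The trade-offs above can be sketched on a least-squares problem. The code below is a minimal example, assuming a mean-squared-error objective and synthetic data; the helper names (`standardize`, `add_intercept`, `gradient_descent`, `normal_equation`) and the default learning rate are hypothetical choices for illustration.

```python
import numpy as np


def standardize(X):
    """Rescale each feature to zero mean and unit variance."""
    return (X - X.mean(axis=0)) / X.std(axis=0)


def add_intercept(X):
    """Prepend a column of ones so theta[0] acts as the intercept."""
    return np.hstack([np.ones((X.shape[0], 1)), X])


def gradient_descent(X, y, lr=0.1, n_iters=2000):
    """Minimize (1 / 2n) * ||X theta - y||^2 using only the gradient."""
    n, d = X.shape
    theta = np.zeros(d)
    for _ in range(n_iters):
        grad = X.T @ (X @ theta - y) / n   # first derivative of the objective
        theta = theta - lr * grad          # step scaled by the learning rate
    return theta


def normal_equation(X, y):
    """Closed-form least squares: solve (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Synthetic data with two features on very different scales.
rng = np.random.default_rng(0)
X_raw = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])
y = 3.0 + 2.0 * X_raw[:, 0] - 0.05 * X_raw[:, 1] + 0.1 * rng.normal(size=200)

X = add_intercept(standardize(X_raw))
theta_gd = gradient_descent(X, y)
theta_ne = normal_equation(X, y)
print(theta_gd)
print(theta_ne)
```

On the standardized features both approaches should agree closely. On the raw features, gradient descent would need a much smaller learning rate to stay stable, while the normal equation is unaffected by the scaling.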

#### Subgradient method:

• used when the objective is convex but not smooth (e.g. hinge loss)
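
For a convex but non-smooth objective such as the hinge loss, one common option is the subgradient method: at points where the gradient does not exist, any valid subgradient is used in its place. The sketch below assumes an L2-regularized average hinge loss, a step size of `eta0 / sqrt(t)`, and synthetic two-class data; the function names and default values are hypothetical.

```python
import numpy as np


def hinge_objective(w, X, y, lam):
    """lam/2 * ||w||^2 + mean(max(0, 1 - y * Xw)), with labels y in {-1, +1}."""
    return 0.5 * lam * (w @ w) + np.mean(np.maximum(0.0, 1.0 - y * (X @ w)))


def subgradient_method(X, y, lam=0.01, eta0=0.5, n_iters=500):
    """Minimize the regularized hinge loss with a decaying step size."""
    n, d = X.shape
    w = np.zeros(d)
    best_w, best_obj = w.copy(), hinge_objective(w, X, y, lam)
    for t in range(1, n_iters + 1):
        # The hinge term contributes -y_i * x_i where the margin is below 1.
        # At margin == 1 the loss has a kink; we pick the subgradient 0 there.
        active = y * (X @ w) < 1.0
        g = lam * w - X[active].T @ y[active] / n
        w = w - (eta0 / np.sqrt(t)) * g
        # A subgradient step is not guaranteed to decrease the objective,
        # so keep track of the best iterate seen so far.
        obj = hinge_objective(w, X, y, lam)
        if obj < best_obj:
            best_w, best_obj = w.copy(), obj
    return best_w, best_obj


# Two Gaussian blobs centered at (+2, +2) and (-2, -2).
rng = np.random.default_rng(0)
y = rng.choice([-1.0, 1.0], size=200)
X = rng.normal(size=(200, 2)) + 2.0 * y[:, None]

w, obj = subgradient_method(X, y)
accuracy = np.mean(np.sign(X @ w) == y)
print(w, obj, accuracy)
```

Returning the best iterate rather than the last one is a design choice that follows from the method not being a descent method.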

#### Newton's method:

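• needs first and second derivatives of $F$ (the gradient and the Hessian matrix)
• each iteration is expensive, since the Hessian has to be computed and a linear system solved
• but it typically converges in far fewer iterations when started close to the optimum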
• update rule: $\theta \leftarrow \theta - H^{-1}\nabla_{\theta}l(\theta)$, where $\nabla_{\theta}l(\theta)$ is the gradient of the objective $l$ with respect to $\theta$ and $H$ is its Hessian matrix.
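
The update rule can be sketched for logistic regression, taking $l(\theta)$ to be the log-likelihood. That choice of objective, the synthetic data, and the function names are assumptions for this example. The code solves the linear system $H\,\Delta = \nabla_{\theta}l(\theta)$ instead of forming $H^{-1}$ explicitly, which gives the same step.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def newton_logistic(X, y, n_iters=10):
    """Maximize the logistic log-likelihood l(theta) with Newton updates.

    X has shape (n, d); y holds labels in {0, 1}.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        p = sigmoid(X @ theta)
        grad = X.T @ (y - p)                    # gradient of l(theta)
        H = -(X.T * (p * (1.0 - p))) @ X        # Hessian of l(theta): -X^T W X
        theta = theta - np.linalg.solve(H, grad)  # theta <- theta - H^{-1} grad
    return theta


# Synthetic, non-separable data so that the maximizer is finite.
rng = np.random.default_rng(0)
X = np.hstack([np.ones((500, 1)), rng.normal(size=(500, 2))])
true_theta = np.array([0.5, 1.0, -1.0])
y = (rng.random(500) < sigmoid(X @ true_theta)).astype(float)

theta_hat = newton_logistic(X, y)
grad_norm = np.linalg.norm(X.T @ (y - sigmoid(X @ theta_hat)))
print(theta_hat, grad_norm)
```

Each iteration costs a $d \times d$ linear solve, which is the expense noted above, but on a problem like this a handful of iterations is usually enough for the gradient to become negligible.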