### Optimization for machine learning

To minimize or maximize a function $F$, there are a few choices:

#### Gradient descent:

- only needs the first derivative of $F$; simple to implement
- each iteration is cheap
- has an extra parameter (learning rate)
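A minimal sketch of the update rule $x \leftarrow x - \eta \nabla F(x)$, where the learning rate $\eta$ is the extra parameter mentioned above (function names and the toy objective are illustrative):

```python
import numpy as np

def gradient_descent(grad_f, x0, lr=0.1, n_iters=100):
    """Minimize F by repeatedly stepping against its gradient."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        x = x - lr * grad_f(x)  # one cheap first-order step
    return x

# Example: F(x) = x^2 with gradient 2x; minimum at x = 0.
x_min = gradient_descent(lambda x: 2 * x, x0=[5.0])
```

Note that a poorly chosen `lr` can make this diverge or crawl, which is why the extra parameter is listed as a drawback.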

#### Conjugate gradient:

- only needs the first derivative of $F$, like gradient descent
- no learning-rate parameter: the step size comes from a line search
- successive search directions are conjugate, so it converges faster than plain gradient descent (for a quadratic in $n$ variables, in at most $n$ steps)

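A sketch of conjugate gradient for the quadratic case $F(x) = \tfrac12 x^\top A x - b^\top x$ with $A$ symmetric positive definite, where minimizing $F$ is equivalent to solving $Ax = b$ (a textbook formulation, not code from these notes):

```python
import numpy as np

def conjugate_gradient(A, b, x0, n_iters=None):
    """Minimize F(x) = 1/2 x^T A x - b^T x, i.e. solve A x = b."""
    x = np.asarray(x0, dtype=float)
    r = b - A @ x            # residual = negative gradient of F
    p = r.copy()             # first search direction
    n_iters = n_iters or len(b)
    for _ in range(n_iters):
        Ap = A @ p
        alpha = (r @ r) / (p @ Ap)    # exact line search along p
        x = x + alpha * p
        r_new = r - alpha * Ap
        beta = (r_new @ r_new) / (r @ r)
        p = r_new + beta * p          # next direction, A-conjugate to p
        r = r_new
    return x

# Example: 2x2 SPD system, solved exactly in 2 iterations.
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_sol = conjugate_gradient(A, b, x0=np.zeros(2))
```

For general nonlinear $F$, variants like Fletcher-Reeves replace the exact line search with an approximate one but keep the same structure.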
#### Newton's method:

- needs first and second derivatives of $F$ (Hessian matrix)
- each iteration is expensive
- but converges much faster
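The trade-off above can be seen in a sketch: each step solves a linear system with the Hessian $H$ (expensive), but there is no learning rate and convergence near the optimum is quadratic (function names and the toy objective are illustrative):

```python
import numpy as np

def newton_method(grad_f, hess_f, x0, n_iters=10):
    """Each step solves H dx = -g: costly per iteration,
    but converges much faster than gradient descent."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        g = grad_f(x)
        H = hess_f(x)
        x = x - np.linalg.solve(H, g)  # Newton step; no learning rate
    return x

# Example: F(x, y) = (x - 1)^2 + 2*(y + 2)^2, minimum at (1, -2).
grad = lambda v: np.array([2 * (v[0] - 1), 4 * (v[1] + 2)])
hess = lambda v: np.array([[2.0, 0.0], [0.0, 4.0]])
x_min = newton_method(grad, hess, x0=[10.0, 10.0])
```

On a quadratic like this, a single Newton step lands exactly on the minimum; the per-iteration cost comes from forming and solving with the Hessian, which is $O(n^2)$ storage and $O(n^3)$ solve time in $n$ dimensions.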

#### Limited-memory BFGS:

- quasi-Newton: approximates the Hessian from recent gradient changes instead of computing it
- stores only the last few update pairs, so memory stays low even in high dimensions
- a common default for batch (full-gradient) optimization

#### Reading

- Stanford lecture videos: https://www.youtube.com/watch?v=McLq1hEq3UY