
􀅄􀅄􀊘􀊚􀅄􀅄􀊘􀊚􀅄􀅄􀊚􀊚􀊚􀊚􀊘􀊚􀊚􀋂􀄽􀄽􀊚􀊚􀋂􀅏
􀊘􀅏􀊚􀋂􀅏􀋂􀅏􀋂􀊘􀅏􀊘􀊚􀋂􀊚􀅏􀊙􀋂􀊘􀅏􀋂􀊘􀊚􀅏􀊘􀋂􀊚􀅏􀊙􀋂􀊘􀅏􀋂􀊘􀊚􀅏􀊙􀋂􀊘􀅏􀋂􀊘􀊘􀋂􀊘􀊘􀋂􀊘􀋂􀊘􀊘􀋂􀊘􀋂􀊘􀊘􀊘􀊘􀊘􀊚􀅏􀊙􀋂􀊘􀅏􀋂􀊘􀊚􀊘􀊚􀊚􀊚􀅏􀊘􀊚􀅏􀅏􀊘􀊘􀊘􀊚􀅄􀅄􀊚􀊚􀊚􀊚􀊚􀊚􀊚􀋂􀊚􀊚􀋂􀊚􀊚􀊚􀊚􀄽􀄽􀊚􀊚􀊚􀊚􀊚􀊚􀊚􀊚􀊚
MIT 6.7220/15.084 — Nonlinear Optimization Thu, Apr 11th 2024
Lecture 13
Adaptive preconditioning: AdaGrad and ADAM
Instructor: Prof. Gabriele Farina ( gfarina@mit.edu)
These notes are class material that has not undergone formal peer review. The TAs and I are grateful for any
reports of typos.
In Lecture 12, we have seen how preconditioning the gradient with the inverse of the Hessian at
the current point can significantly improve convergence, at least in the vicinity of a minimum with
strong curvature. One of the effects of preconditioning with the Hessian was to make the descent
direction invariant to linear transformations of the variables.
A second look at preconditioning. The affine-invariance property of Hessian preconditioning is
extremely desirable. For example, consider the case of an optimization problem in which we have
variables with physical meaning: maybe we are trying to design a bridge, and we have variables that
denote lengths, wind strengths, weights, et cetera. In our modeling of the problem, we might have
decided all length variables are in meters. If we now were to change our mind and decide to use
centimeters instead, we would like the optimization algorithm to be invariant to this change, that is,
produce the same sequence of iterates regardless of the units we use. While this would be guaranteed
by the Hessian preconditioner, the same cannot be said for gradient descent.
Indeed, let us consider what would happen if we were to move all of our length variables from
meters to centimeters. Since a centimeter is a tenth of a meter, the objective is now 10 times less
sensitive to a change in length. This means that the gradient of the objective with respect to the
new parameterization of lengths would now be 10 times smaller than before. And yet, all values of
the length variables would be 10 times larger than before, producing a net effect of slowing down the
gradient descent update on each of the variables by a factor of 100 if using the same learning rate!
Adjusting the learning rate might compensate for this, at the expense of now affecting the speed
of change of all other variables, even if they had been left untouched by the “meter centimeter”
reparameterization.
Example. As a small numerical example, consider the objective function (say, parameterized in meters) $f(x, y) = \frac{1}{2}x^2 + \frac{1}{2}y^2$. After a change of units of $x$ into a new variable $x' = \sqrt{2}\,x$, consider now the reparameterized objective
\[
g(x', y) = f\!\left(\frac{x'}{\sqrt{2}}, y\right) = \frac{1}{4}x'^2 + \frac{1}{2}y^2,
\]
plotted below.
[Figure: contour plots of $f(x, y)$ (left) and $g(x', y)$ (right), with both axes ranging from $-2$ to $2$.]
The two plots show contour lines for the respective objectives, together with an identical initial
point (up to reparameterization). The red arrows show the gradient descent direction at the
initial point. It is clear that after one step of gradient descent, the two points will no longer be
equivalent.
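To make the loss of invariance concrete, here is a minimal Python sketch (not part of the notes; the step size and initial point are arbitrary illustrative choices) that takes one gradient descent step on $f$ and on the reparameterized $g$, then maps the second iterate back to the original coordinates: the two results no longer coincide.

```python
import numpy as np

# Original objective (in "meters"): f(x, y) = 0.5 x^2 + 0.5 y^2
def grad_f(p):
    x, y = p
    return np.array([x, y])

# Reparameterized objective: g(x', y) = f(x'/sqrt(2), y) = 0.25 x'^2 + 0.5 y^2
def grad_g(p):
    x_prime, y = p
    return np.array([0.5 * x_prime, y])

eta = 0.1                                  # same learning rate for both runs
p_f = np.array([1.0, 1.0])                 # initial point in the original coordinates
p_g = np.array([np.sqrt(2), 1.0])          # the same point expressed as (x', y) = (sqrt(2) x, y)

p_f_next = p_f - eta * grad_f(p_f)         # one gradient descent step on f
p_g_next = p_g - eta * grad_g(p_g)         # one gradient descent step on g

# Map the g-iterate back to the original coordinates and compare.
print(p_f_next)                                   # [0.9  0.9 ]
print(p_g_next / np.array([np.sqrt(2), 1.0]))     # [0.95 0.9 ]  -> no longer the same point
```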
Today’s lecture. In this lecture, we look at preconditioning algorithms that are not based on using
the Hessian. Rather, they are based on the idea of adapting the learning rate for each parameter
based on the historical gradients. Compared to the Hessian preconditioner, the approach in this lec-
ture is significantly more scalable, and it is particularly useful for large-scale optimization problems.
In fact, the algorithms we will discuss today currently include the most-used first-order optimization
algorithm in machine learning, ADAM.
1 The AdaGrad algorithm
The AdaGrad algorithm—introduced by Duchi, J., Hazan, E., & Singer, Y. [DHS11]—is a gradient-
based optimization algorithm that adapts the learning rate for each variable based on the historical
gradients.¹
¹Today we will consider the version of AdaGrad that only uses diagonal preconditioners, which is the version that
is typically preferred in practice for large problems, including in machine learning applications (we will still refer to
the algorithm with the generic name “AdaGrad”).
The main idea behind AdaGrad is to scale the learning rate of each variable based on the sum of
the squared gradients accumulated over time. This allows AdaGrad to give smaller learning rates to
frequently updated variables and larger learning rates to variables with infrequent updates. Going
back to the example of the bridge design, this means that if we were to change the units of the length
variables, AdaGrad would automatically adjust the learning rates to compensate for the change
in scale.
In particular, at each iteration 𝑡, AdaGrad keeps a tally of the sum of the squared gradients up to
time 𝑡 for each variable. This is done by maintaining a vector 𝑠𝑡 of components
\[
[s_t]_i \coloneqq \sqrt{\sum_{\tau=0}^{t} [\nabla f(x_\tau)]_i^2},
\]
where $[\nabla f(x_t)]_i$ is the $i$-th component of the gradient at time $t$. The update rule for AdaGrad is then
\[
x_{t+1} = x_t - \eta M_t^{-1} \nabla f(x_t), \qquad \text{where } M_t \coloneqq \operatorname{diag}(s_t) = \operatorname{diag}\big([s_t]_1, \dots, [s_t]_n\big). \tag{AdaGrad}
\]
We assume that $[\nabla f(x_0)]_i \neq 0$ for all $i$, so that $M_t$ is invertible at all times $t = 0, 1, 2, \dots$
Remark 1.1. The same algorithm can be used in the stochastic setting, where as usual the gra-
dient is replaced by a stochastic gradient,
\[
x_{t+1} = x_t - \eta M_t^{-1} \tilde{\nabla} f(x_t), \qquad \text{where } M_t \coloneqq \operatorname{diag}(s_t), \quad [s_t]_i \coloneqq \sqrt{\sum_{\tau=0}^{t} [\tilde{\nabla} f(x_\tau)]_i^2}.
\]
It can also be used in the projected setting, where the update is projected onto a feasible set.
For simplicity, in this lecture, we will focus on the deterministic, unconstrained setting for our
analysis.
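For concreteness, the following is a minimal NumPy sketch of the deterministic, unconstrained (AdaGrad) update above. The toy objective, step size, and iteration count are illustrative choices, and the optional `eps` safeguard is a common practical addition rather than part of the algorithm as stated.

```python
import numpy as np

def adagrad(grad, x0, eta, T, eps=0.0):
    """Diagonal AdaGrad: x_{t+1} = x_t - eta * M_t^{-1} grad(x_t), where M_t = diag(s_t)
    and [s_t]_i is the square root of the sum of squared past gradient components."""
    x = np.asarray(x0, dtype=float).copy()
    sq_sum = np.zeros_like(x)              # running sum of squared gradient components
    for _ in range(T):
        g = grad(x)
        sq_sum += g ** 2
        s = np.sqrt(sq_sum)                # the vector s_t (diagonal of M_t)
        x = x - eta * g / (s + eps)        # eps = 0 matches the assumption [grad f(x_0)]_i != 0;
                                           # a small eps > 0 is a common practical safeguard
    return x

# Toy run on f(x, y) = 0.5 x^2 + 0.5 y^2 (whose gradient is the identity map)
print(adagrad(grad=lambda p: p, x0=[1.0, -2.0], eta=0.5, T=200))
```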
2 ADAM: AdaGrad with momentum
In practice, people often use a variant of AdaGrad called ADAM, introduced by Kingma, D. P., &
Ba, J. [KB15]. ADAM combines the adaptive learning rate of AdaGrad with the idea of momentum
we already saw in Lecture 8. In particular, at each iteration 𝑡, ADAM keeps track of the momentum
(discounted sum of past gradients)
\[
g_t = \gamma g_{t-1} + (1 - \gamma)\nabla f(x_t); \qquad g_{-1} \coloneqq 0.
\]
The scaling factors $s_t$ are also accumulated with a discount rate $\beta$ as
\[
[s_t]_i^2 = \beta [s_{t-1}]_i^2 + (1 - \beta)[\nabla f(x_t)]_i^2, \quad i = 1, \dots, n; \qquad s_{-1} \coloneqq 0.
\]
Finally, ADAM updates the iterate as follows:
\[
x_{t+1} = x_t - \eta M_t^{-1} g_t, \qquad \text{where } M_t \coloneqq \operatorname{diag}(s_t). \tag{ADAM}
\]
The hyperparameters $\eta$, $\gamma$, and $\beta$ in ADAM are typically set to $0.001$, $0.9$, and $0.999$, respectively (this is PyTorch's default behavior).
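As a reference sketch, the following NumPy code applies the (ADAM) update exactly as written above with the default hyperparameters. It deliberately omits the bias-correction factors that the original paper and PyTorch's `torch.optim.Adam` apply to $g_t$ and $s_t$, and the small `eps` term is a practical guard against division by zero rather than part of the update as stated.

```python
import numpy as np

def adam(grad, x0, eta=0.001, gamma=0.9, beta=0.999, T=1000, eps=1e-8):
    """ADAM as written above: momentum g_t and discounted squared scalings s_t^2.
    Bias correction (used by the original paper and torch.optim.Adam) is omitted."""
    x = np.asarray(x0, dtype=float).copy()
    g = np.zeros_like(x)                   # g_{-1} = 0
    s_sq = np.zeros_like(x)                # s_{-1} = 0 (we track s_t^2 componentwise)
    for _ in range(T):
        grad_t = grad(x)
        g = gamma * g + (1 - gamma) * grad_t
        s_sq = beta * s_sq + (1 - beta) * grad_t ** 2
        x = x - eta * g / (np.sqrt(s_sq) + eps)   # x_{t+1} = x_t - eta * M_t^{-1} g_t
    return x

# Toy run on the same quadratic as before
print(adam(grad=lambda p: p, x0=[1.0, -2.0]))
```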
Remark 2.1. The ADAM algorithm is widely used in practice and is known to work well for a
wide range of optimization problems. It is particularly useful for training deep neural networks.
However, ADAM does not have theoretical guarantees like AdaGrad. It is even known to diverge
in some cases, even with convex objectives [RKK18].
3 Analysis of AdaGrad
In this section, we will analyze the AdaGrad algorithm. For simplicity, we will focus on the non-
stochastic version, though the analysis in the presence of stochastic gradients is analogous.
The main result we will prove is that AdaGrad is competitive with the best possible preconditioner
in hindsight, as we make precise in the next theorem.
Theorem 3.1. Let $f : \mathbb{R}^n \to \mathbb{R}$ be a convex and differentiable function. AdaGrad is competitive with the best preconditioner in hindsight. More precisely, for any choice of coefficients $\lambda_i \ge 0$, $i = 1, \dots, n$, the AdaGrad algorithm with stepsize $\eta = D/\sqrt{2}$ satisfies
\[
\frac{1}{T} \sum_{t=0}^{T-1} f(x_t) \le f(x^\star) + \frac{\sqrt{2n}\,D}{T} \min_{\substack{\lambda \in \mathbb{R}^n_{\ge 0} \\ \|\lambda\|_1 = n}} \sqrt{\sum_{t=0}^{T-1} \nabla f(x_t)^\top \operatorname{diag}(\lambda)^{-1} \nabla f(x_t)},
\]
where $D \coloneqq \max_{t=0,\dots,T} \|x_t - x^\star\|_\infty$ is the maximum $\ell_\infty$ distance of the iterates from the optimal solution up to time $T$.
In the rest of this section, we prove the above result.
3.1 AdaGrad as an instance of mirror descent
The main idea behind the proof is to show that AdaGrad is a form of mirror descent algorithm
(Lecture 9), with the twist that the distance-generating function is time-dependent. In particular,
we will use as our distance-generating function the (strongly convex) function
\[
\varphi_t(x) \coloneqq \frac{1}{2} x^\top M_t x.
\]
The induced Bregman divergence is
\[
\mathrm{D}_{\varphi_t}(x \,\|\, y) = \varphi_t(x) - \varphi_t(y) - \langle \nabla \varphi_t(y), x - y \rangle = \frac{1}{2}(x - y)^\top M_t (x - y).
\]
We make the connection formal in the following theorem.
Theorem 3.2. The AdaGrad update rule is equivalent to the mirror descent update rule with the distance-generating function $\varphi_t(x) = \frac{1}{2} x^\top M_t x$, where $M_t \coloneqq \operatorname{diag}(s_t)$.
Proof. Remember that the mirror descent update rule is
\[
x_{t+1} = \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \left\{ \eta \langle \nabla f(x_t), x \rangle + \mathrm{D}_{\varphi_t}(x \,\|\, x_t) \right\} = \operatorname*{arg\,min}_{x \in \mathbb{R}^n} \left\{ \eta \langle \nabla f(x_t), x \rangle + \frac{1}{2}(x - x_t)^\top M_t (x - x_t) \right\}.
\]
Setting the gradient of the above objective to zero and solving for $x$ yields
\[
\eta \nabla f(x_t) + M_t(x - x_t) = 0 \quad\Longleftrightarrow\quad x = x_t - \eta M_t^{-1} \nabla f(x_t),
\]
which is exactly the AdaGrad update rule.
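As a quick numerical sanity check of this equivalence (purely illustrative; the gradient and scaling vectors below are made-up values), one can compare the closed-form AdaGrad step with the minimizer of the quadratic surrogate obtained by solving the stationarity condition directly:

```python
import numpy as np

eta = 0.1
x_t = np.array([1.0, -2.0, 0.5])
grad_t = np.array([0.3, -1.2, 0.8])        # stand-in for the gradient at x_t
s_t = np.array([2.0, 1.5, 0.7])            # stand-in for the accumulated scalings s_t
M_t = np.diag(s_t)

# Closed-form AdaGrad step: x_{t+1} = x_t - eta * M_t^{-1} grad_t
x_adagrad = x_t - eta * grad_t / s_t

# Mirror descent step: minimize eta*<grad_t, x> + 0.5*(x - x_t)^T M_t (x - x_t),
# whose stationarity condition eta*grad_t + M_t*(x - x_t) = 0 we solve directly.
x_mirror = x_t - eta * np.linalg.solve(M_t, grad_t)

print(np.allclose(x_adagrad, x_mirror))    # True
```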
From the mirror descent lemma we saw in Lecture 9, we can write
\[
f(x_t) \le f(x) + \frac{1}{\eta}\Big( \mathrm{D}_{\varphi_t}(x \,\|\, x_t) - \mathrm{D}_{\varphi_t}(x \,\|\, x_{t+1}) + \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) \Big).
\]
Summing the above inequalities over $t = 0, 1, \dots, T - 1$ and using the fact that $M_0 = 0$ yields
\[
\sum_{t=0}^{T-1} f(x_t) \le T f(x) + \frac{1}{\eta}\Bigg( \underbrace{\sum_{t=0}^{T-2} \Big( \mathrm{D}_{\varphi_{t+1}}(x \,\|\, x_{t+1}) - \mathrm{D}_{\varphi_t}(x \,\|\, x_{t+1}) \Big)}_{\text{A}} + \underbrace{\sum_{t=0}^{T-1} \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1})}_{\text{B}} \Bigg).
\]
We will now proceed, in the next two subsections, to bound the two summations A and B separately.
3.2 Bounding the “almost-telescopic” terms A
Theorem 3.3. Let $T$ be arbitrary and assume that $D \coloneqq \max_{t=0,\dots,T} \|x_t - x^\star\|_\infty$ is finite. Then, at all times $T$, the sum A satisfies the inequality
\[
\sum_{t=0}^{T-2} \Big( \mathrm{D}_{\varphi_{t+1}}(x^\star \,\|\, x_{t+1}) - \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_{t+1}) \Big) \le \frac{D^2}{2} \|s_{T-1}\|_1.
\]
Proof. It is easy to see that
\[
\mathrm{D}_{\varphi_{t+1}}(x^\star \,\|\, x_{t+1}) - \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_{t+1}) = \frac{1}{2}(x_{t+1} - x^\star)^\top (M_{t+1} - M_t)(x_{t+1} - x^\star).
\]
(The above is a generalization of the result we saw in Lecture 9, which established that the Bregman divergence induced by the squared Euclidean norm is the squared Euclidean distance.)
Using now the definition of $D$ together with Hölder's inequality, we can write
\[
\frac{1}{2}(x_{t+1} - x^\star)^\top (M_{t+1} - M_t)(x_{t+1} - x^\star) \le \frac{1}{2}\|x_{t+1} - x^\star\|_\infty^2 \, \|s_{t+1} - s_t\|_1 \le \frac{D^2}{2}\|s_{t+1} - s_t\|_1.
\]
On the other hand, since $s_{t+1} \ge s_t \ge 0$ componentwise at all times $t$, then $\|s_{t+1} - s_t\|_1 = \|s_{t+1}\|_1 - \|s_t\|_1$ and we can write
\[
\sum_{t=0}^{T-2}\Big( \mathrm{D}_{\varphi_{t+1}}(x^\star \,\|\, x_{t+1}) - \mathrm{D}_{\varphi_t}(x^\star \,\|\, x_{t+1}) \Big) \le \frac{D^2}{2}\sum_{t=0}^{T-2}\big( \|s_{t+1}\|_1 - \|s_t\|_1 \big) = \frac{D^2}{2}\big( \|s_{T-1}\|_1 - \|s_0\|_1 \big) \le \frac{D^2}{2}\|s_{T-1}\|_1,
\]
which is the statement.
3.3 Bounding the summation B
We now shift our attention to the summation in B , where we will establish the following bound.
Theorem 3.4. At all times 𝑇 , the sum B satisfies the inequality
\[
\sum_{t=0}^{T-1} \mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) \le \eta^2 \|s_{T-1}\|_1.
\]
Proof. Expanding the expression for the Bregman divergence $\mathrm{D}_{\varphi_t}(x \,\|\, y) = \frac{1}{2}(x - y)^\top M_t (x - y)$, we have
\[
\mathrm{D}_{\varphi_t}(x_t \,\|\, x_{t+1}) = \frac{1}{2}(x_{t+1} - x_t)^\top M_t (x_{t+1} - x_t) = \frac{\eta^2}{2} \nabla f(x_t)^\top M_t^{-1} \nabla f(x_t).
\]
Using the definition of $M_t \coloneqq \operatorname{diag}(s_t)$ where $[s_t]_i = \sqrt{\sum_{\tau=0}^{t} [\nabla f(x_\tau)]_i^2}$ (with the convention $s_{-1} \coloneqq 0$), we can then write
\[
\nabla f(x_t)^\top M_t^{-1} \nabla f(x_t) = \sum_{i=1}^{n} \frac{[\nabla f(x_t)]_i^2}{[s_t]_i} = \sum_{i=1}^{n} \frac{[s_t]_i^2 - [s_{t-1}]_i^2}{[s_t]_i} \le 2 \sum_{i=1}^{n} \frac{[s_t]_i^2 - [s_{t-1}]_i^2}{[s_t]_i + [s_{t-1}]_i} = 2 \sum_{i=1}^{n} \big( [s_t]_i - [s_{t-1}]_i \big) = 2\big( \|s_t\|_1 - \|s_{t-1}\|_1 \big).
\]
Summing over 𝑡 = 0, 1, ..., 𝑇 − 1 yields the result.
3.4 Finale: Bounding the norm of the scaling factors
The two bounds above show that
\[
\frac{1}{T}\sum_{t=0}^{T-1} f(x_t) \le f(x) + \frac{1}{T}\left( \frac{D^2}{2\eta} + \eta \right) \|s_{T-1}\|_1.
\]
Hence, setting $\eta = D/\sqrt{2}$ yields
\[
\frac{1}{T}\sum_{t=0}^{T-1} f(x_t) \le f(x) + \frac{D\sqrt{2}}{T} \|s_{T-1}\|_1. \tag{3}
\]
So, to complete the proof, we only need to provide a bound on the norm of the scaling factors 𝑠𝑇 −1.
This is the content of the following theorem.
Theorem 3.5. The norm of the scaling factors 𝑠𝑇 −1 satisfies the inequality
\[
\|s_{T-1}\|_1 \le \sqrt{n} \cdot \sqrt{ \min_{\substack{\lambda \in \mathbb{R}^n_{\ge 0} \\ \|\lambda\|_1 = n}} \sum_{t=0}^{T-1} \nabla f(x_t)^\top \operatorname{diag}(\lambda)^{-1} \nabla f(x_t) }.
\]
Proof. Pick any $\lambda \in \mathbb{R}^n_{\ge 0}$ with $\|\lambda\|_1 = n$; we will use the Cauchy-Schwarz inequality to bound $\|s_{T-1}\|_1^2$ as follows:
\[
\begin{aligned}
\|s_{T-1}\|_1^2 &= \left( \sum_{i=1}^{n} \sqrt{\sum_{t=0}^{T-1} [\nabla f(x_t)]_i^2} \right)^{\!2} = \left( \sum_{i=1}^{n} \left[ \sqrt{\lambda_i} \cdot \frac{1}{\sqrt{\lambda_i}} \sqrt{\sum_{t=0}^{T-1} [\nabla f(x_t)]_i^2} \right] \right)^{\!2} \\
&\le \left( \sum_{i=1}^{n} \lambda_i \right) \cdot \left( \sum_{i=1}^{n} \frac{1}{\lambda_i} \sum_{t=0}^{T-1} [\nabla f(x_t)]_i^2 \right) && \text{(from the Cauchy-Schwarz inequality)} \\
&= n \cdot \left( \sum_{t=0}^{T-1} \nabla f(x_t)^\top \operatorname{diag}(\lambda)^{-1} \nabla f(x_t) \right).
\end{aligned}
\]
Taking square roots and using the fact that 𝜆 was arbitrary yields the statement.
Plugging the bound of Theorem 3.5 into (3) proves Theorem 3.1.
Bibliography
[DHS11] J. Duchi, E. Hazan, and Y. Singer, "Adaptive subgradient methods for online learning and stochastic optimization," Journal of Machine Learning Research, vol. 12, no. 7, 2011.
[KB15] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Interna-
tional Conference on Learning Representations (ICLR), 2015. [Online]. Available: http://
arxiv.org/abs/1412.6980
[RKK18] S. J. Reddi, S. Kale, and S. Kumar, “On the convergence of Adam and beyond,” in In-
ternational Conference on Learning Representations (ICLR), 2018.