


15-888 Computational Game Solving (Fall 2021) Tue, September 28th 2021
Lecture 7
Optimistic/predictive regret minimization via online optimization
Instructor: Gabriele Farina
1 Predictive regret minimization
So far, we have talked extensively about regret minimization algorithms, which we modeled as objects with the following interface:
• NextStrategy() returns a strategy $x^t \in \mathcal{X} \subseteq \mathbb{R}^n$;
• ObserveLoss($\ell^t$) receives a utility vector $\ell^t \in \mathbb{R}^n$ that is meant to evaluate the strategy $x^t$ that was last output.
However, a recent trend in online learning has been concerned with constructing devices that can incorporate predictions of the next utility vector $\ell^t$ into the decision making [Chiang et al., 2012, Rakhlin and Sridharan, 2013a,b]. We call these devices predictive regret minimizers.
We incorporate predictivity by modifying the interface above. In particular, for a predictive regret minimizer
we modify NextStrategy to now be as in the next definition.
Definition 1.1 (Predictive regret minimizer). Let $\mathcal{X}$ be a set. A predictive regret minimizer for $\mathcal{X}$ is an algorithm that interacts with the environment through two operations:
• NextStrategy($m^t$) returns the next strategy $x^t \in \mathcal{X}$, given a prediction $m^t \in \mathbb{R}^n$ of the next utility vector $\ell^t$;
• ObserveLoss($\ell^t$) receives a utility vector $\ell^t$ that is meant to evaluate the strategy $x^t$ that was last output. Specifically, the device incurs a utility equal to $(\ell^t)^\top x^t$. As we mentioned, the utility vector $\ell^t$ can depend on all past strategies that were output by the regret minimizer (even including $x^t$, assuming the latter is deterministic).
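To fix ideas, the following is a minimal Python sketch of this interface; the class and method names are illustrative only and do not refer to any existing library.

import numpy as np

class PredictiveRegretMinimizer:
    """Interface of a predictive regret minimizer for a set X ⊆ R^n (Definition 1.1)."""

    def next_strategy(self, m_t: np.ndarray) -> np.ndarray:
        """Return the next strategy x^t in X, given a prediction m_t of the next utility vector."""
        raise NotImplementedError

    def observe_utility(self, ell_t: np.ndarray) -> None:
        """Receive the utility vector ell^t evaluating the last output x^t (utility = (ell^t)^T x^t)."""
        raise NotImplementedError

Concrete implementations, such as predictive FTRL and predictive OMD in Section 2, fill in the two methods.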
Just like for regular (i.e., non-predictive) regret minimizers, the quality metric for a predictive regret minimizer is its cumulative regret, defined as the quantity
\[
  R^T := \max_{\hat{x} \in \mathcal{X}} \left\{ \sum_{t=1}^T (\ell^t)^\top \hat{x} - (\ell^t)^\top x^t \right\}. \tag{1}
\]
A “good” predictive regret minimizer should guarantee a better regret bound than a non-predictive regret minimizer, especially when the predictions are accurate, that is, $m^t \approx \ell^t$ at all times $t$. Algorithms exist that can guarantee this. For instance, it is always possible to guarantee that
\[
  R^T = O\left( 1 + \sum_{t=1}^T \|\ell^t - m^t\|^2 \right)
\]
when $\mathcal{X}$ is convex and compact, which implies that the regret stays constant when the predictions $m^t$ are clairvoyant (that is, $m^t = \ell^t$ at all $t$). In fact, even stronger regret bounds can be attained, as we discuss next.
1.1 Regret bounded by Variation in Utilities (RVU)
A popular definition for a desirable regret goal for a predictive regret minimizer was given by Syrgkanis et al.
[2015]. We restate it here with minor modifications.
Definition 1.2 (RVU regret bound). A predictive regret minimizer satisfies the RVU regret bound with parameters $\alpha, \beta, \gamma \ge 0$ with respect to a norm $\|\cdot\|$ if at all times $T$,
\[
  R^T \le \alpha + \beta \sum_{t=1}^T \|\ell^t - m^t\|_*^2 - \gamma \sum_{t=2}^T \|x^t - x^{t-1}\|^2, \tag{2}
\]
where $\|\cdot\|_*$ denotes the dual of the norm $\|\cdot\|$.
In Section 1.2 we show that predictive regret minimizers that satisfy the RVU regret bound can be used
in self play to converge to a bilinear saddle point at the accelerated convergence (saddle point gap) rate of
O(1/T ). We will then introduce predictive regret minimizers for generic convex and compact sets in Section 2
and show that they satisfy the RVU regret bound.
1.2 Accelerated convergence to saddle points via predictive regret minimization
We have seen in Lecture 3 that the idea behind using regret minimization to converge to a bilinear saddle point
\[
  \max_{x \in \mathcal{X}} \min_{y \in \mathcal{Y}} \; x^\top A y \tag{3}
\]
is to use self play. Back then, we instantiated two regret minimization algorithms, $\mathcal{R}_{\mathcal{X}}$ and $\mathcal{R}_{\mathcal{Y}}$, for the domains of the maximization and minimization problem, respectively. At each time $t$ the two regret minimizers output strategies $x^t$ and $y^t$, respectively. Then, they receive as feedback the utility vectors $\ell^t_{\mathcal{X}}, \ell^t_{\mathcal{Y}}$ defined as
\[
  \ell^t_{\mathcal{X}} := A y^t, \qquad \ell^t_{\mathcal{Y}} := -A^\top x^t, \tag{4}
\]
where $A$ is Player 1's payoff matrix (see also Figure 1).
Figure 1: The flow of strategies and utilities in regret minimization for games.
We then proved that the average strategies
\[
  \bar{x}^T := \frac{1}{T} \sum_{t=1}^T x^t, \qquad \bar{y}^T := \frac{1}{T} \sum_{t=1}^T y^t
\]
converge to a saddle point of (3), in the sense that the saddle-point gap of $(\bar{x}^T, \bar{y}^T)$ vanishes according to
\[
  \gamma(\bar{x}^T, \bar{y}^T) \le \frac{R^T_{\mathcal{X}} + R^T_{\mathcal{Y}}}{T}.
\]
As most non-predictive regret minimizers guarantee $O(\sqrt{T})$ regret in the worst case, the self-play scheme defined above guarantees convergence to Nash equilibrium at the rate $O(1/\sqrt{T})$.
We now show that by using predictive regret minimizers that satisfy the RVU bound (2), we can accelerate convergence from $O(1/\sqrt{T})$ to $O(1/T)$. The key is to use as predictions the previous utility vectors, that is,
\[
  m^t_{\mathcal{X}} := \ell^{t-1}_{\mathcal{X}}, \qquad m^t_{\mathcal{Y}} := \ell^{t-1}_{\mathcal{Y}}.
\]
Then, the gap $\gamma(\bar{x}^T, \bar{y}^T)$ satisfies
\[
\begin{aligned}
  \gamma(\bar{x}^T, \bar{y}^T)
  &\le \frac{R^T_{\mathcal{X}} + R^T_{\mathcal{Y}}}{T} \\
  &\le \frac{1}{T}\left( \alpha + \beta \sum_{t=1}^T \big\|A y^t - A y^{t-1}\big\|_*^2 - \gamma \sum_{t=2}^T \big\|x^t - x^{t-1}\big\|^2 \right)
     + \frac{1}{T}\left( \alpha + \beta \sum_{t=1}^T \big\|A^\top x^t - A^\top x^{t-1}\big\|_*^2 - \gamma \sum_{t=2}^T \big\|y^t - y^{t-1}\big\|^2 \right) \\
  &\le \frac{2\alpha}{T} + \frac{\beta \|A\|_{\mathrm{op}}^2 - \gamma}{T} \sum_{t=1}^T \big\|y^t - y^{t-1}\big\|^2 + \frac{\beta \|A\|_{\mathrm{op}}^2 - \gamma}{T} \sum_{t=1}^T \big\|x^t - x^{t-1}\big\|^2, \tag{5}
\end{aligned}
\]
where in the second inequality we plugged in the RVU regret bounds for $\mathcal{R}_{\mathcal{X}}$ and $\mathcal{R}_{\mathcal{Y}}$ (Definition 1.2), and the third inequality follows by noting that the operator norm $\|\cdot\|_{\mathrm{op}}$ of a linear function is equal to the operator norm of its transpose.
Inequality (5) immediately implies that if $\|A\|_{\mathrm{op}}^2 \le \gamma/\beta$, then the saddle-point gap $\gamma(\bar{x}^T, \bar{y}^T)$ vanishes at the rate $O(1/T)$. Since rescaling $A$ in (3) does not change the set of saddle points, we can always rescale $A$ before applying the self-play algorithm above to guarantee accelerated $O(1/T)$ convergence.
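As an illustration, here is a minimal Python sketch of this predictive self-play loop. It assumes two objects rm_x and rm_y exposing the interface of Definition 1.1 (for instance the predictive FTRL/OMD implementations of Section 2); the function name and the zero initial predictions are choices of this sketch, not prescribed by the lecture.

import numpy as np

def predictive_self_play(rm_x, rm_y, A: np.ndarray, T: int):
    """Run T rounds of self play on max_x min_y x^T A y with m^t set to the previous utility vector.
    Returns the average strategies (x_bar, y_bar)."""
    m_x = np.zeros(A.shape[0])            # initial predictions (no past utility observed yet)
    m_y = np.zeros(A.shape[1])
    sum_x = np.zeros(A.shape[0])
    sum_y = np.zeros(A.shape[1])
    for _ in range(T):
        x_t = rm_x.next_strategy(m_x)     # strategies output by the two regret minimizers
        y_t = rm_y.next_strategy(m_y)
        ell_x = A @ y_t                   # feedback utility vectors, as in (4)
        ell_y = -A.T @ x_t
        rm_x.observe_utility(ell_x)
        rm_y.observe_utility(ell_y)
        m_x, m_y = ell_x, ell_y           # predict that the next utility equals the last one
        sum_x += x_t
        sum_y += y_t
    return sum_x / T, sum_y / T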
2 Predictive FTRL and predictive OMD
Follow-the-regularized-leader (FTRL) [Shalev-Shwartz and Singer, 2007] and online mirror descent (OMD) are
the two best known oracles for the online linear optimization problem. Their predictive variants are relatively
new and can be traced back to the works by Rakhlin and Sridharan [2013a] and Syrgkanis et al. [2015].
Algorithms 1 and 2 give pseudocode for predictive FTRL and predictive OMD. The non-predictive variants of FTRL and OMD correspond to predictive FTRL and predictive OMD when the prediction $m^t$ is set to the $0$ vector at all $t$. In both algorithms, $\eta > 0$ is an arbitrary step-size parameter, $\mathcal{X} \subseteq \mathbb{R}^n$ is a convex and compact set, and $\varphi : \mathcal{X} \to \mathbb{R}_{\ge 0}$ is a differentiable and 1-strongly convex regularizer (with respect to some norm $\|\cdot\|$). The symbol $D_\varphi(\cdot \,\|\, \cdot)$ used in OMD denotes the Bregman divergence associated with $\varphi$, a standard surrogate notion of distance in convex optimization, defined as
\[
  D_\varphi(x \,\|\, c) := \varphi(x) - \varphi(c) - \nabla\varphi(c)^\top (x - c) \qquad \forall\, x, c \in \mathcal{X}.
\]
Algorithm 1: (Predictive) FTRL
Data: $\mathcal{X} \subseteq \mathbb{R}^n$ convex and compact set
      $\varphi : \mathcal{X} \to \mathbb{R}_{\ge 0}$ strongly convex regularizer
      $\eta > 0$ step-size parameter
1  $L^0 \gets 0 \in \mathbb{R}^n$
2  function NextStrategy($m^t$)
       [▷ Set $m^t = 0$ for non-predictive version]
3      return $\displaystyle\arg\max_{\hat{x} \in \mathcal{X}} \left\{ (L^{t-1} + m^t)^\top \hat{x} - \frac{1}{\eta} \varphi(\hat{x}) \right\}$
4  function ObserveUtility($\ell^t$)
5      $L^t \gets L^{t-1} + \ell^t$
Algorithm 2: (Predictive) OMD
Data: $\mathcal{X} \subseteq \mathbb{R}^n$ convex and compact set
      $\varphi : \mathcal{X} \to \mathbb{R}_{\ge 0}$ strongly convex regularizer
      $\eta > 0$ step-size parameter
1  $z^0 \in \mathcal{X}$ such that $\varphi(z^0) = 0$
2  function NextStrategy($m^t$)
       [▷ Set $m^t = 0$ for non-predictive version]
3      return $\displaystyle\arg\max_{\hat{x} \in \mathcal{X}} \left\{ (m^t)^\top \hat{x} - \frac{1}{\eta} D_\varphi(\hat{x} \,\|\, z^{t-1}) \right\}$
4  function ObserveUtility($\ell^t$)
5      $z^t \gets \displaystyle\arg\max_{\hat{z} \in \mathcal{X}} \left\{ (\ell^t)^\top \hat{z} - \frac{1}{\eta} D_\varphi(\hat{z} \,\|\, z^{t-1}) \right\}$
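As a concrete instance, the following Python sketch implements predictive FTRL (Algorithm 1) on the probability simplex with the negative entropy regularizer $\varphi(x) = \sum_i x_i \log x_i$ (discussed again in Section 3), for which the argmax on Line 3 has a closed-form softmax solution. The class name and the numerical-stability shift are choices of this sketch.

import numpy as np

class PredictiveFTRLSimplex:
    """Predictive FTRL (Algorithm 1) on the probability simplex, with the negative
    entropy regularizer phi(x) = sum_i x_i log x_i. The argmax on Line 3 then has
    the closed form x_i proportional to exp(eta * (L^{t-1} + m^t)_i)."""

    def __init__(self, n: int, eta: float):
        self.eta = eta
        self.L = np.zeros(n)              # L^{t-1}: running sum of observed utility vectors

    def next_strategy(self, m_t: np.ndarray) -> np.ndarray:
        g = self.eta * (self.L + m_t)
        g -= g.max()                      # shift before exponentiating, for numerical stability
        x = np.exp(g)
        return x / x.sum()

    def observe_utility(self, ell_t: np.ndarray) -> None:
        self.L += ell_t

Pairing two such objects with the self-play loop of Section 1.2 essentially yields an optimistic variant of multiplicative weights for matrix games.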
Remark 2.1. The divergence $D_\varphi(x \,\|\, c)$ is a generalization of the typical notion of distance between $x$ and $c$. When the regularizer $\varphi$ is set to $\varphi(\cdot) = \frac{1}{2}\|\cdot\|_2^2$ (which is 1-strongly convex with respect to the Euclidean norm), then
\[
\begin{aligned}
  D_\varphi(x \,\|\, c) &= \varphi(x) - \varphi(c) - \nabla\varphi(c)^\top (x - c) \\
  &= \frac{1}{2}\|x\|_2^2 - \frac{1}{2}\|c\|_2^2 - c^\top (x - c) \\
  &= \frac{1}{2}\|x\|_2^2 + \frac{1}{2}\|c\|_2^2 - c^\top x \\
  &= \frac{1}{2}\|x - c\|_2^2,
\end{aligned}
\]
and so we recover that $D_\varphi(x \,\|\, c)$ is half of the squared Euclidean distance between $x$ and $c$.
Remark 2.2. In light of the above remark, when OMD is set up with $\varphi(\cdot) = \frac{1}{2}\|\cdot\|_2^2$, the computation of $z^t$ (Line 5 of Algorithm 2) amounts to a step of projected gradient ascent. Indeed,
\[
\begin{aligned}
  \arg\max_{\hat{z} \in \mathcal{X}} \left\{ (\ell^t)^\top \hat{z} - \frac{1}{\eta} D_\varphi(\hat{z} \,\|\, z^{t-1}) \right\}
  &= \arg\max_{\hat{z} \in \mathcal{X}} \left\{ (\ell^t)^\top \hat{z} - \frac{1}{2\eta} \|\hat{z} - z^{t-1}\|_2^2 \right\} \\
  &= \arg\max_{\hat{z} \in \mathcal{X}} \left\{ (\eta \ell^t)^\top \hat{z} - \frac{1}{2} \|\hat{z} - z^{t-1}\|_2^2 \right\} \\
  &= \arg\max_{\hat{z} \in \mathcal{X}} \left\{ (z^{t-1} + \eta \ell^t)^\top \hat{z} - \frac{1}{2} \|\hat{z}\|_2^2 \right\} \\
  &= \arg\max_{\hat{z} \in \mathcal{X}} \left\{ -\frac{1}{2} \big\|\hat{z} - (z^{t-1} + \eta \ell^t)\big\|_2^2 \right\} \\
  &= \mathrm{Proj}_{\mathcal{X}}(z^{t-1} + \eta \ell^t).
\end{aligned}
\]
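In code, predictive OMD with $\varphi = \frac{1}{2}\|\cdot\|_2^2$ therefore only needs a Euclidean projection onto $\mathcal{X}$. The Python sketch below uses the probability simplex as an example domain with a standard sorting-based projection; the helper names and the choice of initial point $z^0$ at the centroid are assumptions of this sketch.

import numpy as np

def project_onto_simplex(v: np.ndarray) -> np.ndarray:
    """Euclidean projection of v onto the probability simplex (standard O(n log n) method)."""
    u = np.sort(v)[::-1]
    cssv = np.cumsum(u) - 1.0
    rho = np.nonzero(u - cssv / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = cssv[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

class PredictiveOMDEuclidean:
    """Predictive OMD (Algorithm 2) with phi = (1/2)||.||_2^2: both Line 3 and Line 5
    become Euclidean projections, here onto the probability simplex."""

    def __init__(self, n: int, eta: float):
        self.eta = eta
        self.z = np.full(n, 1.0 / n)      # z^0: minimizer of phi over the simplex (its centroid)

    def next_strategy(self, m_t: np.ndarray) -> np.ndarray:
        return project_onto_simplex(self.z + self.eta * m_t)      # Line 3

    def observe_utility(self, ell_t: np.ndarray) -> None:
        self.z = project_onto_simplex(self.z + self.eta * ell_t)  # Line 5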
Remark 2.3. In the non-predictive version of OMD, $m^t = 0$ at all times $t$. In that case, the proximal step on Line 3 of Algorithm 2 reduces to
\[
  x^t = \arg\max_{\hat{x} \in \mathcal{X}} \left\{ -\frac{1}{\eta} D_\varphi(\hat{x} \,\|\, z^{t-1}) \right\} = \arg\min_{\hat{x} \in \mathcal{X}} D_\varphi(\hat{x} \,\|\, z^{t-1}) = z^{t-1}.
\]
2.1 Regret guarantee
Predictive FTRL and predictive OMD satisfy the following regret bound.
Proposition 2.1. Let $\Omega$ denote the range of $\varphi$ over $\mathcal{X}$, that is, $\Omega := \max_{x, x' \in \mathcal{X}} \{\varphi(x) - \varphi(x')\}$. At all times $T$, the regret cumulated by predictive FTRL (Algorithm 1) and predictive OMD (Algorithm 2) compared to any strategy $\hat{x} \in \mathcal{X}$ is bounded as
\[
  R^T \le \frac{\Omega}{\eta} + \eta \sum_{t=1}^T \|\ell^t - m^t\|_*^2 - \frac{1}{c\,\eta} \sum_{t=2}^T \|x^t - x^{t-1}\|^2,
\]
where $c = 4$ for FTRL and $c = 8$ for OMD, and where $\|\cdot\|_*$ denotes the dual of the norm $\|\cdot\|$ with respect to which $\varphi$ is 1-strongly convex. In other words, predictive FTRL satisfies the RVU regret bound with parameters $(\Omega/\eta,\ \eta,\ 1/(4\eta))$ and predictive OMD satisfies the RVU regret bound with parameters $(\Omega/\eta,\ \eta,\ 1/(8\eta))$.
Remark 2.4. Proposition 2.1 immediately implies that, by appropriately setting the step-size parameter (for example, $\eta = T^{-1/2}$), predictive FTRL and predictive OMD guarantee $R^T = O(T^{1/2})$ at all times $T$. Hence, predictive FTRL and predictive OMD are regret minimizers.
3 Distance-generating functions for tree-form decision problems
Predictive FTRL and predictive OMD, as described above, can be applied to any convex and compact set $\mathcal{X}$, provided a 1-strongly convex regularizer $\varphi$ for $\mathcal{X}$ has been chosen. A safe choice is always $\varphi(\cdot) = \frac{1}{2}\|\cdot\|_2^2$, which is 1-strongly convex with respect to the Euclidean norm $\|\cdot\|_2$; in that case, we have seen earlier (Remark 2.2) that the proximal steps of OMD amount to Euclidean projections onto $\mathcal{X}$. However, the choice of $\varphi$ has practical implications, because it can make solving the different optimization subproblems involved in the algorithms (Line 3 in Algorithm 1, and Lines 3 and 5 in Algorithm 2) easy or hard in practice depending on the choice.
Ideally, the choice of regularizer $\varphi$ makes the computation of the following two quantities as easy as possible. In particular, we define the following.
Definition 3.1 (“Nice” regularizer). Let $\mathcal{X} \subseteq \mathbb{R}^n$ be convex and compact. We say that a 1-strongly convex regularizer $\varphi : \mathcal{X} \to \mathbb{R}$ is “nice” if the following quantities can both be computed in linear time in the dimension $n$:
• the gradient $\nabla\varphi(x)$ of $\varphi$ at any point $x \in \mathcal{X}$; and
• the gradient of the convex conjugate $\varphi^*$ of $\varphi$ at any point $g \in \mathbb{R}^n$:
\[
  \nabla\varphi^*(g) = \arg\max_{x \in \mathcal{X}} \{ g^\top x - \varphi(x) \}.
\]
When the regularizer is nice, the optimization subproblem on Line 3 of predictive FTRL can be computed in linear time by noticing that
\[
  \arg\max_{\hat{x} \in \mathcal{X}} \left\{ (\eta L^{t-1} + \eta m^t)^\top \hat{x} - \varphi(\hat{x}) \right\} = \nabla\varphi^*(\eta L^{t-1} + \eta m^t).
\]
For Line 3 in predictive OMD, we have
\[
\begin{aligned}
  \arg\max_{\hat{x} \in \mathcal{X}} \left\{ (m^t)^\top \hat{x} - \frac{1}{\eta} D_\varphi(\hat{x} \,\|\, z^{t-1}) \right\}
  &= \arg\max_{\hat{x} \in \mathcal{X}} \left\{ (\eta m^t)^\top \hat{x} - D_\varphi(\hat{x} \,\|\, z^{t-1}) \right\} \\
  &= \arg\max_{\hat{x} \in \mathcal{X}} \left\{ (\eta m^t)^\top \hat{x} - \varphi(\hat{x}) + \nabla\varphi(z^{t-1})^\top \hat{x} \right\} \\
  &= \arg\max_{\hat{x} \in \mathcal{X}} \left\{ (\eta m^t + \nabla\varphi(z^{t-1}))^\top \hat{x} - \varphi(\hat{x}) \right\} \\
  &= \nabla\varphi^*(\eta m^t + \nabla\varphi(z^{t-1})),
\end{aligned}
\]
which can therefore be evaluated in linear time when $\varphi$ is nice. Similarly, for Line 5 we have, using the same derivation,
\[
  \arg\max_{\hat{z} \in \mathcal{X}} \left\{ (\ell^t)^\top \hat{z} - \frac{1}{\eta} D_\varphi(\hat{z} \,\|\, z^{t-1}) \right\} = \nabla\varphi^*(\eta \ell^t + \nabla\varphi(z^{t-1})).
\]
For many sets of interest in game theory, “nice” regularizers are known. For example, for a simplex domain $\Delta^n$ a very appealing regularizer is the negative entropy distance-generating function
\[
  \Delta^n \ni (x_1, \ldots, x_n) \mapsto \sum_{i=1}^n x_i \log x_i.
\]
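For this regularizer both quantities of Definition 3.1 have simple closed forms: the gradient is computed entrywise, and the gradient of the conjugate is a softmax. A short Python sketch follows; the final lines apply the predictive-OMD proximal step via the identity derived above, and the numerical-stability shift and example numbers are illustrative.

import numpy as np

def grad_phi(x: np.ndarray) -> np.ndarray:
    """Gradient of the negative entropy phi(x) = sum_i x_i log x_i (entrywise log x_i + 1)."""
    return np.log(x) + 1.0

def grad_phi_conjugate(g: np.ndarray) -> np.ndarray:
    """Gradient of the conjugate: argmax_{x in simplex} { g^T x - phi(x) }, i.e. a softmax of g."""
    z = np.exp(g - g.max())               # shift by max(g) for numerical stability
    return z / z.sum()

# One predictive-OMD step (Line 3) on the simplex via the identity derived above:
# x^t = grad_phi_conjugate(eta * m^t + grad_phi(z^{t-1})), a multiplicative-weights-style update.
eta, z_prev, m_t = 0.1, np.array([0.5, 0.3, 0.2]), np.array([1.0, 0.0, -1.0])
x_t = grad_phi_conjugate(eta * m_t + grad_phi(z_prev))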
The construction of a “nice” regularizer for sequence-form strategy polytopes is significantly more involved,
but a modification/generalization of negative entropy with specific weights can be shown to work [Farina et al.,
2021].
Definition 3.2 (Dilatable global entropy). Consider a tree-form sequential decision process, with the usual notation.ᵃ The dilatable global entropy distance-generating function $\tilde{\varphi}$ is the function $\tilde{\varphi} : Q \to \mathbb{R}_{\ge 0}$ defined as
\[
  \tilde{\varphi} : Q \ni x \mapsto \sum_{j \in \mathcal{J}} \sum_{a \in A_j} w_{ja}\, x[ja] \log x[ja],
\]
where each $\gamma_j \ge 1$ ($j \in \mathcal{J}$) is defined recursively as
\[
  \gamma_j := 1 + \max_{a \in A_j} \left\{ \sum_{j' \in \mathcal{J} \,:\, p_{j'} = ja} \gamma_{j'} \right\} \qquad \forall\, j \in \mathcal{J}, \tag{6}
\]
and each weight $w_{ja} \ge 1$ ($ja \in \Sigma$) is defined as
\[
  w_{ja} := \gamma_j - \sum_{j' \in \mathcal{J} \,:\, p_{j'} = ja} \gamma_{j'} \qquad \forall\, ja \in \Sigma.
\]
ᵃ In particular, remember that $\mathcal{J}$ denotes the set of decision points, and given any $j \in \mathcal{J}$, $A_j$ is the set of actions available at $j$. The parent sequence of $j$, denoted $p_j$, is the last sequence (decision point-action pair) on the path from the root of the tree-form decision process to $j$.
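The recursion (6) can be evaluated bottom-up over the decision tree. The following Python sketch assumes the tree-form decision process is given through two dictionaries (actions and children) and a list of root decision points; these data-structure and function names are illustrative, not part of the definition.

def dilatable_entropy_weights(actions, children, roots):
    """Compute gamma_j for every decision point j and w_{ja} for every sequence (j, a).

    actions:  dict mapping decision point j -> list of actions A_j
    children: dict mapping sequence (j, a) -> list of decision points j' with p_{j'} = ja
    roots:    decision points whose parent sequence is the empty sequence
    """
    gamma, w = {}, {}

    def visit(j):
        # Bottom-up: gamma of every child decision point is needed before gamma_j, as in (6).
        child_sums = {}
        for a in actions[j]:
            child_sums[a] = 0
            for jp in children.get((j, a), []):
                visit(jp)
                child_sums[a] += gamma[jp]
        gamma[j] = 1 + max(child_sums.values())
        for a in actions[j]:
            w[(j, a)] = gamma[j] - child_sums[a]   # each w_{ja} >= 1 by construction

    for j in roots:
        visit(j)
    return gamma, w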
Theorem 3.1. The dilatable global entropy function $\tilde{\varphi} : Q \to \mathbb{R}_{\ge 0}$ is a regularizer for the sequence-form polytope $Q$. It is 1-strongly convex with respect to the $\ell_2$ (Euclidean) norm, and $(1/\|Q\|_1)$-strongly convex with respect to the $\ell_1$ norm, where $\|Q\|_1$ is the $\ell_1$-diameter of $Q$, that is, $\|Q\|_1 := \max_{q \in Q} \|q\|_1$.
References
Chao-Kai Chiang, Tianbao Yang, Chia-Jung Lee, Mehrdad Mahdavi, Chi-Jen Lu, Rong Jin, and Shenghuo
Zhu. Online optimization with gradual variations. In Proceedings of the Conference on Learning Theory
(COLT), 2012.
Alexander Rakhlin and Karthik Sridharan. Online learning with predictable sequences. In Proceedings of the
Conference on Learning Theory (COLT), 2013a.
Sasha Rakhlin and Karthik Sridharan. Optimization, learning, and games with predictable sequences. In
Proceedings of the Annual Conference on Neural Information Processing Systems (NIPS), 2013b.
Vasilis Syrgkanis, Alekh Agarwal, Haipeng Luo, and Robert E Schapire. Fast convergence of regularized
learning in games. In Advances in Neural Information Processing Systems, 2015.
Shai Shalev-Shwartz and Yoram Singer. A primal-dual perspective of online learning algorithms. Machine
Learning, 69(2-3):115–142, 2007.
Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Better regularization for sequential decision spaces:
Fast convergence rates for Nash, correlated, and team equilibria. In Proceedings of the ACM Conference on
Economics and Computation (EC), 2021.
