



15-888 Computational Game Solving (Fall 2021) Tue, Oct 7th 2021
Lecture 10
Monte-Carlo CFR and offline optimization techniques
Instructor: Gabriele Farina
1 Monte-Carlo CFR
In extremely large games, even performing a single iteration of CFR can be challenging. In that case, a
popular technique to increase scalability is Monte-Carlo CFR (MCCFR) [Lanctot et al., 2009].
Remember that CFR works by orchestrating a tree of local regret minimizers (one per decision point). At each time $t$, each local regret minimizer outputs a behavioral strategy at its information set. Then, when a utility vector $\ell^t$ is received as feedback by CFR, $\ell^t$ is used to construct counterfactual utilities by considering the expected utility in each of the subtrees.

At its core, MCCFR uses the observation that if the utility vector is very sparse, then the expected utilities in each subtree will almost always be $0$, and therefore no update of the strategy is necessary for those subtrees, as no regret is accumulated.
So, the idea of MCCFR is to replace any incoming utility vector $\ell^t$ with a sparse unbiased estimator $\tilde\ell^t$, that is, a sparse vector $\tilde\ell^t$ whose expectation is $\ell^t$.
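To make the idea of a sparse unbiased estimator concrete before we specialize it to games, here is a small Python sketch (purely illustrative; the vector and sampling distribution are made up) of the simplest such construction: sample a single coordinate and reweight it by the inverse of its sampling probability. The estimate has a single nonzero entry, at the price of higher variance; the game-theoretic estimators below make the same trade-off.

import numpy as np

def sparse_unbiased_estimate(ell, probs, rng):
    # Sample one coordinate i with probability probs[i] > 0 and return a vector
    # that is zero except at i, where it holds ell[i] / probs[i].  Taking the
    # expectation over i recovers ell exactly (importance weighting).
    i = rng.choice(len(ell), p=probs)
    est = np.zeros_like(ell, dtype=float)
    est[i] = ell[i] / probs[i]
    return est

rng = np.random.default_rng(0)
ell = np.array([1.0, -2.0, 0.5, 3.0])
probs = np.full(4, 0.25)
avg = np.mean([sparse_unbiased_estimate(ell, probs, rng) for _ in range(100_000)], axis=0)
print(avg)  # close to ell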
1.1 Setup and notation
Before we continue, let’s recall a bit of notation about extensive-form games that will be useful for this lecture.
In this lecture, it will help to think about the chance player as a real player in the game, with the caveat that
its strategy is fixed and known.
We denote by X and Y the set of all sequence-form strategies for Player 1 and Player 2, respectively. We
denote by c the fixed sequence-form strategy of the chance player.
For any leaf $z \in Z$, the probability that the game ends in $z$ is the product of the probabilities of all the actions on the path from the root to $z$. Because of the definition of sequence-form strategies, when Player 1 and 2 play according to strategies $x \in \mathcal{X}$ and $y \in \mathcal{Y}$, respectively, this probability is equal to $x[\sigma_1(z)] \cdot y[\sigma_2(z)] \cdot c[\sigma_c(z)]$, where $\sigma_i(z) \in \Sigma_i$ denotes the sequence of Player $i$'s actions on the path from the root to $z$. So, Player 1's expected utility is computed via the trilinear map
$$u_1(x, y, c) := \sum_{z \in Z} u_1(z) \cdot x[\sigma_1(z)] \cdot y[\sigma_2(z)] \cdot c[\sigma_c(z)]. \qquad (1)$$
Since the strategy of the chance player is fixed, the above expression is bilinear in $x$ and $y$ and therefore can be expressed more concisely as $u_1(x, y) = x^\top A y$, where $A$ is the usual sequence-form payoff matrix of Player 1.
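As a quick illustration, the trilinear map (1) can be evaluated directly from a list of leaves. The Python sketch below assumes a hypothetical leaf representation (leaf utility together with the indices of the three relevant sequences); it is not tied to any particular game implementation, and once the payoff matrix $A$ is materialized the same quantity is just $x^\top A y$.

def expected_utility(leaves, x, y, c):
    # Evaluate the trilinear map (1) for Player 1.  `leaves` is a list of tuples
    # (u1_z, s1, s2, sc), one per leaf z, where s1, s2, sc index the sequences
    # sigma_1(z), sigma_2(z), sigma_c(z), and x, y, c are the sequence-form
    # strategy vectors (the indexing scheme is illustrative).
    return sum(u1 * x[s1] * y[s2] * c[sc] for (u1, s1, s2, sc) in leaves)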
1.2 Game-theoretic sparse unbiased utility estimator
We will illustrate how each technique can be used to construct an estimate $\tilde\ell^t_{\mathcal{X}}$ for the utility $\ell^t_{\mathcal{X}} = A y^t$ for Player 1 that is used at all times $t$ when the players play against each other in self-play, as we have seen many times already (see also Figure 1).
[Figure 1: The flow of strategies and utilities in regret minimization for games: at each time $t$, the regret minimizers $\mathcal{R}_{\mathcal{X}}$ and $\mathcal{R}_{\mathcal{Y}}$ output strategies $x^t$ and $y^t$, and receive as feedback the utility vectors $\ell^t_{\mathcal{X}} := A y^t$ and $\ell^t_{\mathcal{Y}} := -A^\top x^t$.]
The computation of an estimate for Player 2 is analogous.
External Sampling An unbiased estimator of the utility vector $\ell^t_{\mathcal{X}} = A y^t$ can be easily constructed by independently sampling pure strategies $\tilde y^t$ for Player 2 and $\tilde c^t$ for the chance player. Indeed, as long as $\mathbb{E}_t[\tilde y^t] = y^t$ and $\mathbb{E}_t[\tilde c^t] = c$, from (1) we have that for all $x \in \mathcal{X}$, $u_1(x, y^t, c) = \mathbb{E}_t[u_1(x, \tilde y^t, \tilde c^t)]$. Hence, the vector corresponding to the (random) linear function $x \mapsto u_1(x, \tilde y^t, \tilde c^t)$ is an unbiased utility estimator, called the external sampling utility estimator.
The external sampling utility estimator, that is, the vector corresponding to the linear function $x \mapsto u_1(x, \tilde y^t, \tilde c^t)$, can be computed via a simple traversal of the game tree. The algorithm starts at the root of the game tree and visits the tree. Every time a node that belongs to the chance player or to Player 2 is encountered, an action is sampled according to the strategy $c$ or $y^t$, respectively. Every time a node for Player 1 is encountered, the algorithm branches on all possible actions and recurses. Since the algorithm branches on only one action at every node of Player 2 or the chance player, an external sampling utility estimate is significantly cheaper to compute than the exact utility $\ell^t_{\mathcal{X}}$.
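The following Python sketch illustrates one such traversal on a hypothetical nested-dictionary game tree (the representation and field names are made up for illustration and are not the course's reference implementation).

import numpy as np

def external_sampling_estimate(node, est, rng):
    # One external-sampling traversal for Player 1.  Leaves are dicts
    # {"kind": "leaf", "u1": ..., "seq1": ...} with seq1 = sigma_1(z); internal
    # nodes are {"kind": "p1" | "p2" | "chance", "children": {action: child},
    # "probs": {action: behavioral probability}} (probs only for p2/chance).
    # Player 2 and chance nodes sample one action; Player 1 nodes branch on all
    # actions.  The sparse estimate (a dict over Player 1's sequences) gets
    # u_1(z) added at entry sigma_1(z) for every leaf z reached.
    if node["kind"] == "leaf":
        est[node["seq1"]] = est.get(node["seq1"], 0.0) + node["u1"]
    elif node["kind"] in ("p2", "chance"):
        actions = list(node["children"])
        probs = np.array([node["probs"][a] for a in actions])
        a = actions[rng.choice(len(actions), p=probs / probs.sum())]
        external_sampling_estimate(node["children"][a], est, rng)
    else:  # Player 1 decision node: branch on every action
        for a in node["children"]:
            external_sampling_estimate(node["children"][a], est, rng)

# Tiny made-up example: chance flips a coin, then Player 1 picks "l" or "r".
tree = {"kind": "chance", "probs": {"H": 0.5, "T": 0.5}, "children": {
    "H": {"kind": "p1", "children": {
        "l": {"kind": "leaf", "u1": 1.0, "seq1": "H.l"},
        "r": {"kind": "leaf", "u1": -1.0, "seq1": "H.r"}}},
    "T": {"kind": "p1", "children": {
        "l": {"kind": "leaf", "u1": 0.0, "seq1": "T.l"},
        "r": {"kind": "leaf", "u1": 2.0, "seq1": "T.r"}}}}}
est = {}
external_sampling_estimate(tree, est, np.random.default_rng(0))
print(est)  # sparse estimate of Player 1's sequence-form utility vector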
Remark 1.1. Analogous estimators where only the chance player's strategy $c$ or only Player 2's strategy $y^t$ is sampled are referred to as the chance sampling estimator and the opponent sampling estimator, respectively.
Outcome Sampling Let $w^t \in \mathcal{X}$ be an arbitrary strategy for Player 1. Furthermore, let $\tilde z^t \in Z$ be a random variable such that for all $z \in Z$,
$$\mathbb{P}_t[\tilde z^t = z] = w^t[\sigma_1(z)] \cdot y^t[\sigma_2(z)] \cdot c[\sigma_c(z)],$$
and let $e_z \in \mathbb{R}^{|\Sigma_1|}$ be defined as the vector such that $e_z[\sigma_1(z)] = 1$ and $e_z[\sigma] = 0$ for all other $\sigma \in \Sigma_1$, $\sigma \neq \sigma_1(z)$. It is a simple exercise to prove that the random vector
$$\tilde\ell^t_{\mathcal{X}} := \frac{u_1(\tilde z^t)}{w^t[\sigma_1(\tilde z^t)]}\, e_{\tilde z^t}$$
is such that $\mathbb{E}_t[\tilde\ell^t_{\mathcal{X}}] = \ell^t_{\mathcal{X}}$. This particular definition of $\tilde\ell^t_{\mathcal{X}}$ is called the outcome sampling utility estimator.
Computationally, the outcome sampling utility estimator is cheaper than the external sampling utility estimator. Indeed, since $w^t \in \mathcal{X}$, one can sample $\tilde z^t$ by following a random path from the root of the game tree, sampling (from the appropriate player's strategy) one action at each node encountered along the way. The walk terminates as soon as it reaches a leaf, which corresponds to $\tilde z^t$.
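A matching Python sketch of the outcome sampling estimator is below, on the same hypothetical nested-dictionary tree as the external-sampling sketch; for concreteness the sampling strategy $w^t$ is taken to be uniform at every Player 1 decision node (any strictly positive strategy works).

import numpy as np

def outcome_sampling_estimate(tree, rng):
    # Sample a single root-to-leaf path: Player 1 plays uniformly (the sampling
    # strategy w^t), Player 2 and chance play the behavioral probabilities
    # stored in the tree.  The returned sparse vector is zero everywhere except
    # at sigma_1(z~), where it equals u_1(z~) / w^t[sigma_1(z~)].
    node, w_prob = tree, 1.0           # w_prob accumulates w^t[sigma_1(.)]
    while node["kind"] != "leaf":
        actions = list(node["children"])
        if node["kind"] == "p1":
            probs = np.full(len(actions), 1.0 / len(actions))  # uniform w^t
            i = rng.choice(len(actions), p=probs)
            w_prob *= probs[i]
        else:
            probs = np.array([node["probs"][a] for a in actions])
            i = rng.choice(len(actions), p=probs / probs.sum())
        node = node["children"][actions[i]]
    return {node["seq1"]: node["u1"] / w_prob}

# Reusing the toy `tree` from the external-sampling sketch above:
# print(outcome_sampling_estimate(tree, np.random.default_rng(1)))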
1.3 Regret analysis
Let’s analyze how much the guarantee on the regret degrades when utility estimators are used instead of exact
utility vectors. To model regret minimization with unbiased utility estimators, consider the following model,
which abstracts our situation.
Let $\tilde{\mathcal{R}}$ be a deterministic regret minimizer over a convex and compact set $\mathcal{X}$, and consider a second regret minimizer $\mathcal{R}$ over the same set $\mathcal{X}$ that is implemented starting from $\tilde{\mathcal{R}}$ as in Figure 2. In particular, at all times $t$:

• $\mathcal{R}$ queries the next decision $x^t$ of $\tilde{\mathcal{R}}$, and outputs it;

• each utility vector $\ell^t$ received by $\mathcal{R}$ is used by $\mathcal{R}$ to compute a utility estimate $\tilde\ell^t$ such that
$$\mathbb{E}_t[\tilde\ell^t] := \mathbb{E}\big[\tilde\ell^t \,\big|\, \tilde\ell^1, \ldots, \tilde\ell^{t-1}\big] = \ell^t$$
(that is, the estimate is unbiased). The internal regret minimizer $\tilde{\mathcal{R}}$ is then shown $\tilde\ell^t$ instead of $\ell^t$.

[Figure 2: Abstract regret minimizer $\mathcal{R}$ implemented by using an unbiased estimator of the utility vector at each time $t$: the incoming utility $\ell^t$ is converted into an unbiased estimate $\tilde\ell^t$ and fed to the internal regret minimizer $\tilde{\mathcal{R}}$, whose decisions $x^t$ are output directly.]
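A minimal Python sketch of this abstraction is below; the interface names (next_strategy / observe_utility) are illustrative rather than a prescribed API, and the estimator is supplied by the user (e.g., external or outcome sampling).

class EstimatedUtilityRegretMinimizer:
    # Wraps a deterministic regret minimizer so that every incoming utility
    # vector is replaced by an unbiased estimate before the internal minimizer
    # sees it, exactly as in Figure 2.
    def __init__(self, inner, estimator):
        self.inner = inner          # the internal deterministic regret minimizer
        self.estimator = estimator  # maps ell^t to an unbiased estimate

    def next_strategy(self):
        # The wrapper simply outputs the decision x^t of the internal minimizer.
        return self.inner.next_strategy()

    def observe_utility(self, ell):
        # The internal minimizer is shown the estimate, not the exact utility.
        self.inner.observe_utility(self.estimator(ell))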
When the estimate of the utility is very accurate, it is reasonable to expect that the regret $R^T$ incurred by $\mathcal{R}$ up to any time $T$ is roughly equal to the regret $\tilde R^T$ incurred by $\tilde{\mathcal{R}}$, plus some degradation term that depends on the error of the estimates. Indeed, the following can be shown [Farina et al., 2020].
Theorem 1.1. Let $M$ and $\tilde M$ be constants such that
$$\big|(\ell^t)^\top (x - x')\big| \le M, \qquad \big|(\tilde\ell^t)^\top (x - x')\big| \le \tilde M, \qquad \forall\, t,\ \text{and}\ x, x' \in \mathcal{X}.$$
Then, for all $\delta \in (0, 1)$, with probability at least $1 - \delta$,
$$R^T \le \tilde R^T + (M + \tilde M)\sqrt{2T \log\frac{1}{\delta}}.$$
Proof. Fix any $u \in \mathcal{X}$ and introduce the discrete-time stochastic process
$$d^t := (\ell^t)^\top(x^t - u) - (\tilde\ell^t)^\top(x^t - u), \qquad \forall\, t \ge 1.$$
Since by hypothesis $\mathbb{E}_t[\tilde\ell^t] = \ell^t$ and $\tilde{\mathcal{R}}$ is a deterministic regret minimizer, $\mathbb{E}_t[d^t] = 0$ and so $\{d^t\}$ is a martingale difference sequence. This martingale difference sequence is well known, especially in the context of bandit regret minimization [Abernethy and Rakhlin, 2009, Bartlett et al., 2008].

At all times $t$, the maximum magnitude of $d^t$ is bounded as
$$|d^t| = \big|(\ell^t)^\top(x^t - u) - (\tilde\ell^t)^\top(x^t - u)\big| \le \big|(\ell^t)^\top(x^t - u)\big| + \big|(\tilde\ell^t)^\top(x^t - u)\big| \le M + \tilde M.$$
Furthermore,
$$\sum_{t=1}^T d^t = \left(\sum_{t=1}^T (\ell^t)^\top(x^t - u)\right) - \left(\sum_{t=1}^T (\tilde\ell^t)^\top(x^t - u)\right) = R^T(u) - \tilde R^T(u).$$
So, using the Azuma-Hoeffding concentration inequality [Hoeffding, 1963, Azuma, 1967], for all $\tau \ge 0$,
$$\mathbb{P}\big[R^T(u) \le \tilde R^T(u) + \tau\big] = 1 - \mathbb{P}\left[\sum_{t=1}^T d^t > \tau\right] \ge 1 - \exp\left\{-\frac{2\tau^2}{4T(M + \tilde M)^2}\right\}.$$
Substituting $\tau = (M + \tilde M)\sqrt{2T \log(1/\delta)}$ and taking a maximum over $u$ yields the statement. $\square$
A straightforward consequence of Theorem 1.1 is that if $\tilde{\mathcal{R}}$ guarantees sublinear regret, then the regret of $\mathcal{R}$ also grows sublinearly with high probability.
2 Solving two-player zero-sum games via offline optimization methods
We've spent a good amount of time looking into online learning methods for learning strong strategies in games. In particular, we have paid close attention to self-play in two-player zero-sum games, where it is known that a Nash equilibrium is the solution to the bilinear saddle point problem
$$\max_{x \in \mathcal{X}}\ \min_{y \in \mathcal{Y}}\ x^\top A y, \qquad (2)$$
where $\mathcal{X}$ and $\mathcal{Y}$ are the sequence-form polytopes of the players, and $A$ is the payoff matrix of Player 1.
In this second part of the class, we will briefly talk about what other offline methods are available in the
literature for solving problems like (2).
2.1 Accelerated first-order methods for saddle-point optimization
The literature on convex optimization methods for bilinear saddle point problems is very rich. Accelerated
first-order methods for such problems have been developed well before optimistic/predictive regret minimization
methods (see Lecture 7) were discovered. For a long time, before RVU regret bounds and optimistic regret
minimization were established, they were the only first-order methods capable of guaranteeing O(1/T )
convergence to a solution of (2). We briefly present two popular accelerated methods from that literature.
Excessive gap technique (EGT) The excessive gap technique (EGT) is a first-order method introduced by Nesterov [2005b], and one of its primary applications is solving bilinear saddle-point problems (BSPPs) such as (2). EGT assumes access to a proximal setup for $\mathcal{X}$ and $\mathcal{Y}$, with 1-strongly convex distance-generating functions (DGFs) $d_x, d_y$, and constructs smoothed approximations of the optimization problems faced by the $x$ and $y$ players. Based on this setup, we formally state the EGT algorithm of Nesterov [2005b] in Algorithm 1. EGT alternatingly takes steps focused on decreasing one or the other smoothing parameter. These steps are called ShrinkX and ShrinkY in Algorithm 1, and make use of some of the concepts related to distance-generating functions that we have seen in Lecture 7.
Algorithm 1: Excessive Gap Technique (EGT) algorithm.

function Initialize()
    $t \leftarrow 0$
    $\mu^0_x \leftarrow \|A\|$,  $\mu^0_y \leftarrow \|A\|$
    $\tilde x \leftarrow \arg\min_{\hat x \in \mathcal{X}} d_x(\hat x)$
    $y^0 \leftarrow \nabla d_y^*\big(A^\top \tilde x / \mu^0_y\big)$
    $x^0 \leftarrow \mathrm{prox}_{\tilde x}\big(A y^0 / \mu^0_x\big)$

function Iterate()
    $t \leftarrow t + 1$,  $\tau \leftarrow 2/(t+2)$
    if $t$ is even then ShrinkX() else ShrinkY()

function ShrinkX()
    $\bar x \leftarrow -\nabla d_x^*\big(A y^{t-1} / \mu^{t-1}_x\big)$
    $\hat x \leftarrow (1-\tau)\, x^{t-1} + \tau\, \bar x$
    $\bar y \leftarrow \nabla d_y^*\big(A^\top \hat x / \mu^{t-1}_y\big)$
    $\tilde x \leftarrow \mathrm{prox}_{\bar x}\!\Big(\frac{\tau}{(1-\tau)\,\mu^{t-1}_x}\, A \bar y\Big)$
    $x^t \leftarrow (1-\tau)\, x^{t-1} + \tau\, \tilde x$
    $y^t \leftarrow (1-\tau)\, y^{t-1} + \tau\, \bar y$
    $\mu^t_x \leftarrow (1-\tau)\, \mu^{t-1}_x$

function ShrinkY()
    $\bar y \leftarrow \nabla d_y^*\big(A^\top x^{t-1} / \mu^{t-1}_y\big)$
    $\hat y \leftarrow (1-\tau)\, y^{t-1} + \tau\, \bar y$
    $\bar x \leftarrow -\nabla d_x^*\big(A \hat y / \mu^{t-1}_x\big)$
    $\tilde y \leftarrow \mathrm{prox}_{\bar y}\!\Big(\frac{\tau}{(1-\tau)\,\mu^{t-1}_y}\, A^\top \bar x\Big)$
    $y^t \leftarrow (1-\tau)\, y^{t-1} + \tau\, \tilde y$
    $x^t \leftarrow (1-\tau)\, x^{t-1} + \tau\, \bar x$
    $\mu^t_y \leftarrow (1-\tau)\, \mu^{t-1}_y$
Algorithm 1 shows how initial points are selected and the alternating steps and stepsizes are computed.
Nesterov [2005b] proves that the EGT algorithm converges at a rate of O(1/T ).
Theorem 2.1 (Nesterov [2005b], Theorem 6.3). At every iteration $t \ge 1$ of the EGT algorithm, the iterate $(x^t, y^t)$ produced by EGT (Algorithm 1) satisfies $x^t \in \mathcal{X}$, $y^t \in \mathcal{Y}$, and
$$\gamma(x^t, y^t) := \max_{y \in \mathcal{Y}} (x^t)^\top A y - \min_{x \in \mathcal{X}} x^\top A y^t \le \frac{4\|A\| \sqrt{\Omega_{d_x,\mathcal{X}}\, \Omega_{d_y,\mathcal{Y}}}}{t+1}.$$
Mirror prox (MP) Rather than constructing smoothed approximations, the Mirror Prox (MP) algorithm of Nemirovski [2004] directly uses the DGFs to take first-order steps. Hence, the MP algorithm is best understood
as an algorithm that operates on the product space X × Y directly. As such, in most analyses of the MP
algorithm, a single 1-strongly convex DGF for the product space X × Y is required. To better align with
the setup used for EGT, we will define the DGF for the product space X × Y starting from proximal setups
for both X and Y, with 1-strongly convex DGFs dx, dy with respect to norms ‖ · ‖x and ‖ · ‖y , respectively.
Algorithm 2 shows the sequence of steps taken in every iteration of the MP algorithm. Compared to EGT,
mirror prox has a somewhat simpler structure: it simply takes repeated extrapolated proximal steps. First,
a proximal step in the descent direction is taken for both x and y. Then, the utility at those new points is
used to take a proximal step starting from the previous iterate (this is the extrapolation part: a step is taken
starting from the previous iterate, but with the extrapolated utility). Finally, the average strategy is output.
As we recall in the next theorem, like EGT the mirror prox algorithm converges at rate O(1/T ).
Theorem 2.2 (Ben-Tal and Nemirovski [2001], Theorem 5.5.1). Suppose the stepsize in Algorithm 2 is set as $\eta_t = 1/\|A\|$. Then we have
$$\gamma(x^t, y^t) := \max_{y \in \mathcal{Y}} (x^t)^\top A y - \min_{x \in \mathcal{X}} x^\top A y^t \le \frac{\|A\|\,\big(\Omega_{d_x,\mathcal{X}} + \Omega_{d_y,\mathcal{Y}}\big)}{2t}.$$
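To illustrate the structure of the updates, here is a minimal Python sketch of mirror prox specialized to the simplest possible setting: a matrix game where $\mathcal{X}$ and $\mathcal{Y}$ are probability simplexes and both DGFs are the negative entropy, so that each proximal step is a multiplicative-weights update. It is only meant to show the extrapolated step pattern described above, not the general sequence-form algorithm.

import numpy as np

def mirror_prox_matrix_game(A, T, eta=None):
    # Approximately solve max_x min_y x^T A y over probability simplexes.
    m, n = A.shape
    if eta is None:
        eta = 1.0 / np.abs(A).max()   # stand-in for 1/||A|| in the entropic setup
    x, y = np.full(m, 1.0 / m), np.full(n, 1.0 / n)
    x_avg, y_avg = np.zeros(m), np.zeros(n)

    def prox(center, grad):
        # Entropic proximal step: argmin_z <grad, z> + KL(z || center).
        z = center * np.exp(-grad)
        return z / z.sum()

    for _ in range(T):
        # (1) proximal step from the current iterate (x ascends, y descends)
        x_hat = prox(x, -eta * (A @ y))
        y_hat = prox(y, eta * (A.T @ x))
        # (2) extrapolation: step again from (x, y), using utilities at (x_hat, y_hat)
        x = prox(x, -eta * (A @ y_hat))
        y = prox(y, eta * (A.T @ x_hat))
        # (3) the ergodic average of the intermediate points is the output
        x_avg += x_hat
        y_avg += y_hat
    return x_avg / T, y_avg / T

A = np.array([[3.0, -1.0], [-1.0, 1.0]])
print(mirror_prox_matrix_game(A, 2000))  # approaches x = y = (1/3, 2/3), value 1/3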
2.2 Linear programming formulation
Every bilinear saddle-point problem where X and Y are polytopes can always be converted into a linear
program. The technique is completely general and based on linear programming duality.
Define the function
$$g(x) := \min_{y \in \mathcal{Y}} x^\top A y,$$
which appears as the internal minimization problem in (2), so that
$$\max_{x \in \mathcal{X}}\ \min_{y \in \mathcal{Y}}\ x^\top A y = \max_{x \in \mathcal{X}} g(x). \qquad (3)$$
Remember from Lecture 2 that the sequence-form polytope $\mathcal{Y}$ of Player 2 that appears in the domain of the minimization is defined as the intersection of the linear equality and inequality constraints
$$\mathcal{Y} := \left\{\, y \in \mathbb{R}^{|\Sigma_2|}_{\ge 0} \ :\ \sum_{a \in A_j} y[ja] = \begin{cases} 1 & \text{if } p_j = \varnothing \\ y[p_j] & \text{otherwise} \end{cases} \quad \forall\, j \in \mathcal{J}_2 \,\right\}.$$
We can represent the linear equality and inequality constraints that define $\mathcal{Y}$ more compactly in vector form as $\mathcal{Y} = \{y : F_2 y = f_2,\ y \ge 0\}$ for an appropriate matrix $F_2$ and vector $f_2$. (Similarly, for $\mathcal{X}$ we have $\mathcal{X} = \{x : F_1 x = f_1,\ x \ge 0\}$.) With that, we can rewrite $g(x)$ as the linear program
$$g(x) = \left\{\begin{array}{ll}\min & (A^\top x)^\top y\\ \text{s.t.} & ①\ \ F_2 y = f_2\\ & ②\ \ y \ge 0.\end{array}\right. \qquad \text{(P)}$$
From the duality theory of linear programs, we can rewrite the linear program (P) as
$$g(x) = \left\{\begin{array}{ll}\max & f_2^\top v\\ \text{s.t.} & ①\ \ F_2^\top v \le A^\top x\\ & ②\ \ v \text{ free.}\end{array}\right. \qquad \text{(D)}$$
By plugging the above expression for $g(x)$ back into (3), and expanding the definition of $\mathcal{X} = \{x : F_1 x = f_1,\ x \ge 0\}$, we obtain the following characterization of Nash equilibrium strategies in two-player zero-sum games based on linear programming.

Proposition 2.1. Let $\mathcal{X} = \{x \in \mathbb{R}^{|\Sigma_1|} : F_1 x = f_1,\ x \ge 0\}$ and $\mathcal{Y} = \{y \in \mathbb{R}^{|\Sigma_2|} : F_2 y = f_2,\ y \ge 0\}$ be the sequence-form polytopes of the two players, and let $A$ be the payoff matrix for Player 1. Then, a Nash equilibrium strategy for Player 1 can be computed as the solution $x$ to the linear program
$$\left\{\begin{array}{ll}\max & f_2^\top v\\ \text{s.t.} & ①\ \ A^\top x - F_2^\top v \ge 0\\ & ②\ \ F_1 x = f_1\\ & ③\ \ x \ge 0,\ v \text{ free.}\end{array}\right. \qquad (4)$$
The linear program for a Nash equilibrium strategy of Player 2 can be obtained by swapping the roles of the two players, and taking as payoff matrix the payoff matrix of Player 2, which in a zero-sum game is simply the opposite of the transpose of the payoff matrix of Player 1, that is, $-A^\top$. Specifically, with the same notation as Proposition 2.1, a Nash equilibrium strategy for Player 2 is the solution $y$ to the linear program
$$\left\{\begin{array}{ll}\max & f_1^\top v\\ \text{s.t.} & ①\ \ -A y - F_1^\top v \ge 0\\ & ②\ \ F_2 y = f_2\\ & ③\ \ y \ge 0,\ v \text{ free.}\end{array}\right.$$
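For concreteness, here is a sketch of how one might assemble and solve the linear program (4) with an off-the-shelf LP solver (scipy's linprog). It assumes dense numpy inputs and is meant only to show the block structure of the constraints; a serious implementation would use sparse matrices and a dedicated LP solver.

import numpy as np
from scipy.optimize import linprog

def solve_p1_equilibrium_lp(A, F1, f1, F2, f2):
    # Variables are stacked as z = [x; v], with x of dimension |Sigma_1| and
    # v of dimension equal to the number of rows of F2.
    n1, n2 = A.shape                                     # |Sigma_1|, |Sigma_2|
    k2 = F2.shape[0]
    c = np.concatenate([np.zeros(n1), -f2])              # maximize f2^T v
    A_ub = np.hstack([-A.T, F2.T]); b_ub = np.zeros(n2)  # (1): A^T x - F2^T v >= 0
    A_eq = np.hstack([F1, np.zeros((F1.shape[0], k2))])  # (2): F1 x = f1
    b_eq = f1
    bounds = [(0, None)] * n1 + [(None, None)] * k2      # (3): x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    assert res.success
    return res.x[:n1], -res.fun      # equilibrium strategy x and game value

# Sanity check on a matrix game (X and Y are simplexes: F = all-ones row, f = [1]).
A = np.array([[3.0, -1.0], [-1.0, 1.0]])
F1 = np.ones((1, 2)); f1 = np.array([1.0])
F2 = np.ones((1, 2)); f2 = np.array([1.0])
print(solve_p1_equilibrium_lp(A, F1, f1, F2, f2))  # approx ([1/3, 2/3], 1/3)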
Payoff matrix sparsification Generally speaking, the more nonzeros in a linear program, the longer it takes solvers to solve it. In the particular case of the Nash equilibrium strategy linear program (4), the number of nonzeros in the constraint matrix is mostly driven by the number of nonzeros in the payoff matrix $A$ of the game.
To further push the limits of the game sizes that linear programming solvers can handle, Zhang and Sandholm [2020] made the following observation. Suppose that the payoff matrix $A$ of the game can be expressed in the form
$$A = \hat A + U M^{-1} V^\top, \qquad (5)$$
for some matrices $\hat A$, $U$, $V$, and an invertible matrix $M$. We call expression (5) a sparsification of $A$. Then, given the sparsification, we can rewrite the term $A^\top x$ that appears in constraint ① of (4) as
$$A^\top x = \hat A^\top x + V M^{-\top} U^\top x.$$
So, by introducing the vector $w := M^{-\top} U^\top x$, we can write
$$A^\top x = \hat A^\top x + V w.$$
Given that $w = M^{-\top} U^\top x \iff M^\top w = U^\top x$, we can therefore rewrite the linear program (4) into its sparsified form as (6).
$$\left\{\begin{array}{ll}\max & f_2^\top v\\ \text{s.t.} & ①\ \ A^\top x - F_2^\top v \ge 0\\ & ②\ \ F_1 x = f_1\\ & ③\ \ x \ge 0,\ v \text{ free}\end{array}\right. \quad\xrightarrow{\ \text{sparsified}\ }\quad \left\{\begin{array}{ll}\max & f_2^\top v\\ \text{s.t.} & ①\ \ \hat A^\top x - F_2^\top v + V w \ge 0\\ & ②\ \ M^\top w - U^\top x = 0\\ & ③\ \ F_1 x = f_1\\ & ④\ \ x \ge 0,\ v \text{ free},\ w \text{ free.}\end{array}\right. \qquad (6)$$
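The following toy numpy snippet (the matrices are made up) verifies the identity behind the reformulation: once $w$ satisfies $M^\top w = U^\top x$, the product $A^\top x$ coincides with $\hat A^\top x + V w$, so the linear program never needs to form the dense matrix $A$.

import numpy as np

rng = np.random.default_rng(0)
n1, n2 = 6, 5
A_hat = np.zeros((n1, n2)); A_hat[0, 0] = 1.0; A_hat[3, 2] = -2.0   # sparse part
U = rng.standard_normal((n1, 1))                                    # thin factors
V = rng.standard_normal((n2, 1))
M = np.array([[2.0]])                                               # invertible
A = A_hat + U @ np.linalg.inv(M) @ V.T                              # dense payoff matrix

x = rng.random(n1)
w = np.linalg.solve(M.T, U.T @ x)                 # auxiliary variable: M^T w = U^T x
assert np.allclose(A.T @ x, A_hat.T @ x + V @ w)  # A^T x = A_hat^T x + V w
print("sparsified identity verified")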
References
Marc Lanctot, Kevin Waugh, Martin Zinkevich, and Michael Bowling. Monte Carlo sampling for regret
minimization in extensive games. In Proceedings of the Annual Conference on Neural Information Processing
Systems (NIPS), 2009.
Gabriele Farina, Christian Kroer, and Tuomas Sandholm. Stochastic regret minimization in extensive-form
games. In Proceedings of the International Conference on Machine Learning (ICML), 2020.
Jacob Abernethy and Alexander Rakhlin. Beating the adaptive bandit with high probability. Technical
Report UCB/EECS-2009-10, EECS Department, University of California, Berkeley, 2009. URL http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-10.html.
Peter L. Bartlett, Varsha Dani, Thomas Hayes, Sham Kakade, Alexander Rakhlin, and Ambuj Tewari.
High-probability regret bounds for bandit online linear optimization. In Proceedings of the Conference on
Learning Theory (COLT), 2008.
Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American
Statistical Association, 58(301):13–30, 1963.
Kazuoki Azuma. Weighted sums of certain dependent random variables. Tohoku Mathematical Journal, 19(3),
1967.
Yurii Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103, 2005a.
Yurii Nesterov. Excessive gap technique in nonsmooth convex minimization. SIAM Journal on Optimization, 16(1), 2005b.
Arkadi Nemirovski. Prox-method with rate of convergence O(1/t) for variational inequalities with Lipschitz
continuous monotone operators and smooth convex-concave saddle point problems. SIAM Journal on
Optimization, 15(1), 2004.
Ahron Ben-Tal and Arkadi Nemirovski. Lectures on modern convex optimization: analysis, algorithms, and engineering applications, volume 2. SIAM, 2001.
Brian Hu Zhang and Tuomas Sandholm. Sparsified linear programming for zero-sum equilibrium finding. In
Proceedings of the International Conference on Machine Learning (ICML), 2020.
