Gabriele Farina - Representation of strategies in tree-form decision spaces

15-888 Computational Game Solving (Fall 2021) Thu, September 9th 2021
Lecture 2
Representation of strategies in tree-form decision spaces
Instructor: Gabriele Farina
Consider a one-shot game like playing rock-paper-scissor. A strategy for each player in that game simply
corresponds to a probability distribution over the available actions. So, the set of strategies for a player in
a one-shot decision making problem corresponds to the probability simplex ∆n, where n is the number of
actions available to that player.
In this lecture, we are going to see how strategies are represented in the more complicated case of tree-form
decision making. Abstractly, we are looking for a representation that would guarantee the following properties:
of the representation of strategies is multilinear. The two properties we just listed enable the application of
efficient optimization methods in the context of extensive-form games; for one, they are enough to guarantee
that a Nash equilibrium in a two-player zero-sum game be written as a bilinear saddle point problem.
1 Tree-form decision making
In a tree-form sequential decision process (TFSDP) problem the agent interacts with the environment in two
ways: at decision points, the agent must act by picking an action from a set of legal actions; at observation
points, the agent observes a signal drawn from a set of possible signals. Different decision points can have
different sets of legal actions, and different observation points can have different sets of possible signals.
Decision and observation points are structured as a tree: under the standard assumption that the agent is not
forgetful, so, it is not possible for the agent to cycle back to a previously encountered decision or observation
point by following the structure of the decision problem. TFSDPs provide a general formalism which captures
the decision problem faced by a player of an extensive-form game with perfect recall (for example, poker or
bridge), as well as Markov decision processes and partially-observable Markov decision processes for which the
agent conditions its policy on the entire history of observations and actions.
As an example, consider the simplified game of Kuhn poker [Kuhn, 1950], depicted in Figure 1. Kuhn
poker is a standard benchmark in the EFG-solving community. In Kuhn poker, each player puts an ante
worth 1 into the pot. Each player is then privately dealt one card from a deck that contains 3 unique cards
1
decides to either check or bet 1. Then,
• If Player 1 checks Player 2 can check or raise 1.
– If Player 2 checks a showdown occurs; if Player 2 raises Player 1 can fold or call.
∗ If Player 1 folds Player 2 takes the pot; if Player 1 calls a showdown occurs.
• If Player 1 raises Player 2 can fold or call.
– If Player 2 folds Player 1 takes the pot; if Player 2 calls a showdown occurs.
When a showdown occurs, the player with the higher card wins the pot and the game immediately ends.
k1
j1 j2 j3
k2 k3 k4
j4 j5 j6
fold call fold call fold call
check raise check raise check raise
jack queen king
check raise check raise check raise
Figure 1: Tree-form sequential decision making process of the first acting player in the game of Kuhn poker.
As soon as the game starts, the agent observes a private card that has been dealt to them; this is observation
point k1, whose set of possible signals is Sk1 := {jack, queen, king}. Should the agent observe the ‘jack’ signal,
the decision problem transitions to the decision point j1, where the agent must pick one action from the set
Aj1 := {check, raise}. If the agent picks ‘raise’, the decision process terminates; otherwise, if ‘check’ is chosen,
the process transitions to observation point k2, where the agent will observe whether the opponent checks (at
which point the interaction terminates) or raises. In the latter case, the process transitions to decision point
j4, where the agent picks one action from the set Aj4 := {fold, call}. In either case, after the action has been
selected, the interaction terminates.
When the interaction terminates, the agent will receive a form of feedback, typically a payoff. However, in
this first part of the course, we will only focus on the set of strategies in the decision process, and not its
payoffs. We will reconnect the geometry of strategies with the details of the feedback in the second part of the
document.
The example above already reveals some of the notation we will use throughout the document. We now
introduce additional symbols, which are also summarized in Table 1.
Symbol Description
J Set of decision points
Aj Set of legal actions at decision point j ∈ J
K Set of observation points
Sk Set of possible signals at observation point k ∈ K
Σ Set of sequences, defined as Σ := {(j, a) : j ∈ J , a ∈ Aj }
pj Parent sequence of decision point j ∈ J , defined as the last sequence (decision point-action
pair) on the path from the root of the TFSDP to decision point j; if the agent does not act
before j, pj = ∅
Table 1: Summary of notation in TFSDPs.
• We denote the set of decision points in the TFSDP as J , and the set of observation points as K. At
each decision point j ∈ J , the agent selects an action from the set Aj of available actions. At each
observation point k ∈ K, the agent observes a signal sk from the environment out of a set of possible
signals Sk.
• A pair (j, a) where j ∈ J and a ∈ Aj is called a sequence. The set of all sequences is denoted as
Σ := {(j, a) : j ∈ J , a ∈ Aj }. For notational convenience, we will often denote an element (j, a) in Σ as
ja without using parentheses, especially when used as a subscript.
• Given a decision point j ∈ J , we denote by pj its parent sequence, defined as the last sequence (that
is, decision point-action pair) encountered on the path from the root of the decision process to j. If
the agent does not act before j (that is, j is the root of the process or only observation points are
encountered on the path from the root to j), we let pj = ∅.
From extensive-form games to TFSDPs The primary difference between a TFSDP and an extensive-form
game is that the former encodes the decision problem faced by a single agent, while the latter directly encodes
the dynamics of the interaction for all agents involved. In extensive-form games, each decision node of the
underlying game tree belongs to a player in the game. Observation nodes correspond to stochastic actions, and
are traditionally though of as being decision points for a fictitious player called the nature (or chance) player.
There is a one-to-one correspondence between decision points in the TFSDP formalism and information sets
in the extensive-form game formalism. Similarly, there exists a one-to-one correspondence between sequences
in the TFSDP formalism and (information set, action)-pairs in the extensive-form game formalism. In other
words, given an EFG extracting the TFSDP faced by any of the players is a mechanical task. Figure 2 provides
a small example.
1 2
3 4 5 6 7 8 9 7 8 9
a
b c d
−→
1 2
3 4 5 6
7 8 9
k
a
b c
d
Figure 2: (Left) Example of a small extensive-form game. Black nodes belong to Player 1, White nodes belong
to Player 2. The gray partitions represent the information sets of the game. (Right) Sequential decision
making problem faced by Player 1 in the small extensive-form game on the left.
2 Strategies in tree-form decision making
Conceptually, a strategy for an agent in an TFSDP specifies a distribution over the set of actions Aj at each
decision point j ∈ J .
2.1 Behavioral strategies
Perhaps, the most intuitive representation of a strategy, called a behavioral strategy, is as a vector x ∈ ℝ|Σ|
≥0
indexed over sequences that assigns at each action a at decision point j the probability of picking that action
at that decision point. The set of all possible behavioral strategies is clearly convex, as it is the Cartesian
product of probability simplexes—one per each decision point. That representation has a major drawback:
the probability of reaching a particular terminal state in the decision process is the product of all actions on
the path from the root to the terminal state. This makes many expressions of interest that depend on the
probability of reaching terminal states (including expected utilities in extensive-form games) non-convex.
2.2 Sequence-form strategies
To avoid the issue of non-convexity, throughout this document we will almost exclusively use a different
representation, called the sequence-form representation [Romanovskii, 1962, Koller et al., 1996, von Stengel,
1996]. In the sequence-form representation, a strategy is a vector x ∈ ℝ|Σ|
≥0 whose entries are indexed by Σ.
However, the entry x[ja] contains the product of the probabilities of all actions at all decision points on the
path from the root of the process to action a at decision point j. In order to be a valid sequence-form strategy,
the entries in x must satisfy the probability-mass-conservation constraints:
∑
a∈Aj
x[ja] = x[pj ]
at every “non-root” decision point j (i.e., pj 6 = ∅), and the constraint ∑
a∈Aj x[ja] = 1 at every “root” decision
point j.
So, we have the following definition.
Definition 2.1. The polytope of sequence-form strategies of a TFSDP is the convex polytope
Q :=


x ∈ ℝ|Σ|
≥0 : ∑
a∈Aj
x[ja] =
{
1 if pj = ∅
x[pj ] otherwise ∀j ∈ J


.
Because the constraints in Definition 2.1 are linear, the set of all valid sequence-form strategies is a convex
polytope.
Remark 2.1. To avoid breaking into cases depending on whether pj = ∅ or not, a popular alternative
definition of sequence-form strategies includes the special element ∅ as an index of the strategy vector,
with the fixed value 1. In that case, the set of sequence-form strategies could be rewritten as


x ∈ ℝ|Σ∪{∅}|
≥0 : x[∅] = 1, ∑
a∈Aj
x[ja] = x[pj ] ∀j ∈ J


.
We do not follow that approach in this lecture, as it makes the inductive characterization of Q (Section 2.4)
less clean.
2.3 Deterministic and randomized sequence-form strategies
Out of the polytope of all possible strategies in the decision process, deterministic strategies are extremely
important. Deterministic strategies are strategies that select exactly one action at each decision point, without
ever randomizing the choice. In other words, deterministic strategies assign probability either 0 or 1 to each
action at each decision point. Hence, when using the sequence-form representation, the set of all deterministic
strategies—denoted Π—corresponds to the subset of Q whose components are 0 or 1.
Definition 2.2. The set of deterministic sequence-form strategies is the set
Π := Q ∩ {0, 1}|Σ|.
The set of deterministic sequence-form strategies corresponds one-to-one to the concept of reduced normal-
form strategies in game theory. Because of this connection, a direct consequence of the classical Kuhn’s
theorem [Kuhn, 1953] is the following.
Theorem 2.1. The set of deterministic sequence-form strategies generates the polytope of sequence-form
strategies, that is, Q = co Π.
2.4 Bottom-up decomposition of the polytope of sequence-form strategies
The polytope of sequence-form strategies (Definition 2.1) has a strong combinatorial structure that enables
speeding up several common optimization procedures. In this subsection we will explore that combinatorial
structure by giving a bottom-up decomposition based on two standard convexity-preserving operations: convex
hulls and Cartesian products.
We illustrate this intuitively by means of a small example.
Example 2.1. Consider the small sequential decision making problem of Figure 3. As our construction is
bottom-up, we will denote with the symbol Qv , with v ∈ J ∪ K, the partial sequence-form strategy space
corresponding to the subtree rooted at v.
1 2
3 4 5 6
7 8 9
k
a
b c
d
Figure 3: Sequential decision making problem used in the example.
We can characterize the set of sequence-form strategies as follows:
• At the terminal decision points j = b, c, d, Qj is a probability simplex. Specifically, Qb = Qc = ∆2
and Qd = ∆3.
• At the (only) observation point k, a strategy for the subtree rooted at k must provide independent
strategies for the subtrees rooted at b and c. So,
Qk = Qb × Qc
is the Cartesian product of the strategy spaces for the subtrees rooted b and c.
• At the decision point a, we need to first pick a probability distribution of play for the two actions
(sequences 1 and 2). Let’s call the probabilities assigned to those actions as λ1 and λ2, respectively.
Clearly, (λ1, λ2) ∈ ∆2. Once the probabilities of sequences 1 and 2 are chosen, strategies for the
subtrees rooted at the observation point k and d. Since sequence-form strategies associate to each
sequence σ the product of the probabilities of all actions on the path form the root of the TFSDP
to σ, the set of all valid sequence-form strategies for the subtree rooted in a (that is, the whole
TFSDP) is
Qa =
{
(λ1, λ2, λ1xk, λ2xd) : (λ1, λ2) ∈ ∆2, xk ∈ Qk, xd ∈ Qd
}
= co








1
0
Qk
0




,





0
1
0
Qd








.
The above approach can be used in any TFSDP. In particular, we always have that the sequence-form
strategy space of a subtree rooted at an observation point is the Cartesian product of the sequence-form
strategy spaces of the children subtrees. Furthermore, the sequence-form strategy space of a subtree rooted at
a decision point is the convex hull of the sequence-form strategy spaces of the children subtrees, augmented
with the indicators of the actions. These inductive rules are summarized in Table 2.
Decision point Observation point
QnQ1
· · · co











e1
Q1
0
...
0








,








e2
0
Q2
...
0








, . . . ,








en
0
0
...
Qn










 QnQ1
· · · Q1 × Q2 × · · · × Qn
Table 2: Bottom-up construction rules for sequence-form strategy spaces. ei denotes the i-th indicator vector,
that is, the vector whose entries are all 0 except for the entry in position i, which is set to 1.
References
H. W. Kuhn. A simplified two-person poker. In H. W. Kuhn and A. W. Tucker, editors, Contributions to the
Theory of Games, volume 1 of Annals of Mathematics Studies, 24, pages 97–103. Princeton University Press,
Princeton, New Jersey, 1950.
I. Romanovskii. Reduction of a game with complete memory to a matrix game. Soviet Mathematics, 3, 1962.
Daphne Koller, Nimrod Megiddo, and Bernhard von Stengel. Efficient computation of equilibria for extensive
two-person games. Games and Economic Behavior, 14(2), 1996.
Bernhard von Stengel. Efficient computation of behavior strategies. Games and Economic Behavior, 14(2):
220–246, 1996.
H. W. Kuhn. Extensive games and the problem of information. In H. W. Kuhn and A. W. Tucker, editors,
Contributions to the Theory of Games, volume 2 of Annals of Mathematics Studies, 28, pages 193–216.
Princeton University Press, Princeton, NJ, 1953.

Content

Typo or question?

Metadata