Approachability in Two-Player Games

1. Introduction

In the following, we will be considering a repeated two-player game, called Blackwell’s Approachability Game. The game is formally represented by a 4-tuple $(\mathcal{X}, \mathcal{Y}, f, S)$, where:
  • $\mathcal{X}$ denotes the convex compact action space of the first player, whom we also denote by $\mathcal{X}$ with a slight abuse of notation;
  • $\mathcal{Y}$ similarly denotes the convex compact action space of the second player, likewise identified with $\mathcal{Y}$;
  • $f : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}^d$ is the biaffine payoff function of the game. We assume $f$ is bounded in norm by $\gamma > 0$, that is
    \[\left\|f(x,y)\right\| < \gamma\qquad\forall (x, y) \in \mathcal{X}\times\mathcal{Y}.\]
    Notice that we can always rescale $f$, so we can safely assume $\gamma = 1$; then $f(x,y)\in B(0,1)$, the unit ball centered at the origin;
  • $S \,\subseteq\, B(0, 1)$ is a closed convex target set.
A strategy for player $\mathcal{X}$ is a family of functions
\[\sigma_{\mathcal{X}} \doteq \{\sigma_{\mathcal{X}}^t : (\mathcal{X}\times\mathcal{Y})^{t-1} \to \mathcal{X}\}_{t = 1}^{\infty}\]
mapping at each instant of time $t$ the game history $\mathcal{H}_{t-1} \doteq ((x_i, y_i))_{i = 1}^{t - 1}$ into the next action $x_t \in \mathcal{X}$. Of course, a similar yet symmetric definition holds for player $\mathcal{Y}$.
The objective of player $\mathcal{X}$ is to keep the average payoff
\[\phi_t \doteq \frac{1}{t} \sum_{i = 1}^t f(x_i, y_i)\]
as close as possible to $S$, while $\mathcal{Y}$ tries to prevent this.
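Throughout, it may help to keep the mechanics of the repeated game in mind. Below is a minimal sketch in Python; the function names and signatures (`play`, `sigma_x`, `sigma_y`, `f`) are illustrative stand-ins for the objects defined above, not part of the formal model.

```python
import numpy as np

def play(sigma_x, sigma_y, f, T):
    """Run T rounds of the repeated game; return the averages (phi_t)_t."""
    history, averages, phi = [], [], None
    for t in range(1, T + 1):
        x_t = sigma_x(history)   # sigma_x^t maps H_{t-1} to the next action
        y_t = sigma_y(history)
        history.append((x_t, y_t))
        payoff = np.asarray(f(x_t, y_t))   # a vector in R^d
        # running average: phi_t = ((t - 1) * phi_{t-1} + f(x_t, y_t)) / t
        phi = payoff if phi is None else ((t - 1) * phi + payoff) / t
        averages.append(phi)
    return averages
```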

2. Approachable and forceable sets

A set $H$ is said to be forceable if there exists an action $\bar{x}$ for player $\mathcal{X}$ such that, whatever the $y$ used by the opponent, $f(\bar{x}, y) \in H$. The set $S$ is said to be approachable if there exists a strategy $\sigma_\mathcal{X}$ for player $\mathcal{X}$ such that, whatever the strategy $\sigma_\mathcal{Y}$ used by the opponent, the distance from $\phi_t$ to $S$ converges to 0 uniformly in $\sigma_\mathcal{Y}$. In symbols: \[ \exists\, \sigma_\mathcal{X}:\ \forall\, \epsilon > 0\ \exists\, T \ \forall\, \sigma_\mathcal{Y}\ \forall t \ge T\ \rho(\phi_t, S) < \epsilon, \] where $\rho(\phi_t, S) = \min_{s\in S} \left\|\phi_t - s\right\|$ as usual (the minimum is attained at a unique point, since $S$ is closed and convex and we are working with the $\ell^2$ norm).
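Concretely, once a Euclidean projection oracle for $S$ is available, the distance $\rho$ is immediate to compute. A minimal sketch, assuming such an oracle (the non-positive orthant in the example anticipates the target set of Section 4):

```python
import numpy as np

def rho(u, project):
    """Distance from u to a closed convex set S, via its projection oracle."""
    return np.linalg.norm(u - project(u))

# Example: S = non-positive orthant, whose projection clips coordinates at 0.
u = np.array([0.5, -0.3, 1.2])
print(rho(u, lambda v: np.minimum(v, 0.0)))   # sqrt(0.5**2 + 1.2**2) = 1.3
```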

2.1. The scalar case

We start by recalling the classic theorem by Sion, an extension of von Neumann’s minimax theorem.
Theorem 1 (Sion’s theorem).
Given compact convex sets $\mathcal{X}, \mathcal{Y}$, and a biaffine function $f: \mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, we have \[ \nu = \max_{x\in\mathcal{X}}\min_{y\in\mathcal{Y}} f(x,y) = \min_{y\in\mathcal{Y}}\max_{x\in\mathcal{X}} f(x,y). \]
An immediate consequence of Sion’s theorem is that there exists an action $\bar{x}$ of player $\mathcal{X}$ such that
\[f(\bar{x}, y) \ge \nu\qquad \forall y\in\mathcal{Y}.\]
Indeed, assume the opposite: for all $x$ there exists a $\hat{y}(x)\in\mathcal{Y}$ such that $f(x, \hat{y}(x)) < \nu$. This would mean that for all $x\in\mathcal{X}$, $\min_{y\in\mathcal{Y}} f(x, y) < \nu$, so that $\nu = \max_x \min_y f(x,y) < \nu$, a contradiction!
Now, suppose player $\mathcal{X}$ plays $\bar{x}$ repeatedly. Then every payoff satisfies $f(\bar{x}, y_i) \ge \nu$, and so does their average $\phi_t$. This proves that the interval $[\lambda, +\infty)$ is forceable and approachable for every $\lambda \le \nu$. Symmetrically, there exists an action $\bar{y}$ for player $\mathcal{Y}$ such that
\[f(x, \bar{y}) \le \nu\qquad \forall x\in\mathcal{X}.\]
Again, by contradiction: if the condition were false, then for all $y\in\mathcal{Y}$ there would exist an action $\hat{x}(y)\in\mathcal{X}$ such that $f(\hat{x}(y), y) > \nu$. This in turn implies that $\max_x f(x, y) > \nu$ for all actions $y\in\mathcal{Y}$, and $\nu = \min_y \max_x f(x, y) > \nu$, a contradiction.
As a consequence, if player $\mathcal{Y}$ plays $\bar{y}$ repeatedly, then $\phi_t \le \nu < \lambda$ for all $t$: the interval $[\lambda, +\infty)$ is neither approachable nor forceable for any value of $\lambda > \nu$. We can then state the following lemma; a small numerical illustration follows it.
Lemma 1 (Scalar approachability condition).
In a scalar game, the following conditions are equivalent:
  • $[\lambda, +\infty)$ is approachable;
  • $[\lambda, +\infty)$ is forceable;
  • $\lambda \le \nu$.
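As that illustration, the sketch below computes $\nu$ and a maximin action $\bar{x}$ by linear programming, assuming the standard bilinear payoff $f(x, y) = x^\top A y$ over probability simplices; the matrix (matching pennies) is chosen purely for illustration.

```python
import numpy as np
from scipy.optimize import linprog

def maximin(A):
    """nu and x_bar for max_x min_y x^T A y over probability simplices."""
    m, n = A.shape
    c = np.concatenate([np.zeros(m), [-1.0]])    # variables (x, v); maximize v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])    # v <= (A^T x)_j for every j
    A_eq = np.concatenate([np.ones(m), [0.0]]).reshape(1, -1)   # sum(x) = 1
    res = linprog(c, A_ub=A_ub, b_ub=np.zeros(n), A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1], res.x[:m]

nu, x_bar = maximin(np.array([[1.0, -1.0], [-1.0, 1.0]]))   # matching pennies
print(nu, x_bar)   # nu ~ 0, x_bar ~ (0.5, 0.5): playing x_bar forces [nu, +inf)
```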

2.2. Approachable halfspaces

Before turning our attention to general closed convex sets (Section 3), we focus on an important class of target sets: halfspaces. The following lemma is central in our discussion.
Lemma 2.
A halfspace $H(\lambda, q) = \{h \in\, \mathbb{R}^d\, :\, \lambda^\top h \ge q\}$ with $\left\|\lambda\right\| = 1$ is approachable if and only if it is forceable.
Proof. Consider a new two-player game $\hat{\Gamma}$ whose payoff function $\hat{f}: \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ is defined as \[ \hat{f}(x, y) \doteq \lambda^\top f(x,y). \] This is a scalar game ($\hat{f}$ maps to $\mathbb{R}$) with a biaffine payoff function. The key observation is that $H(\lambda, q)$ is approachable if and only if $\hat{H}(q) = [q, +\infty)$ is approachable. Indeed, notice that for any $u\in\mathbb{R}^d$:
\[\rho(u, H(\lambda, q)) = \rho(\lambda^\top u, \hat{H}(q)).\] (1)
Applying Equation 1 to $\phi_t$, we obtain
\[\begin{array}{rl} \rho\left(\phi_t, H(\lambda, q)\right) &= \rho\left(\lambda^\top\left(\frac{1}{t}\sum_{i=1}^t f(x_i, y_i)\right), \hat{H}(q)\right)\\\\ &= \rho\left(\frac{1}{t}\sum_{i=1}^t \hat{f}(x_i, y_i), \hat{H}(q)\right)\\\\ &= \rho(\hat{\phi_t}, \hat{H}(q)). \end{array}\]
Therefore, approachability with respect to $H(\lambda, q)$ is equivalent to approachability with respect to $\hat{H}(q)$. At the same time we have that
\[\exists \bar{x}\in\mathcal{X} : \forall y\ f(\bar{x}, y) \in H(\lambda, q) \Longleftrightarrow \exists \bar{x}\in\mathcal{X} : \forall y\ \hat{f}(\bar{x}, y) \ge q,\]
so that we conclude that forceability with respect to $H(\lambda, q)$ is equivalent to forceability with respect to $\hat{H}(q)$. Invoking Lemma 1 completes the proof. $\square$
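The scalarization identity of Equation 1 is also easy to check numerically: for a unit normal $\lambda$, both sides equal $\max\{0,\, q - \lambda^\top u\}$. A small sanity-check sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = rng.normal(size=4)
lam /= np.linalg.norm(lam)                 # unit normal, ||lam|| = 1
q, u = 0.3, rng.normal(size=4)

# Projection of u onto H(lam, q) = {h : lam^T h >= q}.
proj = u if lam @ u >= q else u + (q - lam @ u) * lam
lhs = np.linalg.norm(u - proj)             # rho(u, H(lam, q))
rhs = max(0.0, q - lam @ u)                # rho(lam^T u, [q, +inf))
assert np.isclose(lhs, rhs)
```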

3. Blackwell’s Theorem

Theorem 2.
A closed convex set $S$ is approachable if and only if every halfspace $H \supseteq S$ is forceable.
Proof. ($\Longrightarrow$) This direction is trivial: if $S$ is approachable, then any set $Q \supseteq S$ is approachable, including any halfspace $H \supseteq S$. On the other hand, we know from Lemma 2 that approachability and forceability are equivalent when dealing with halfspaces. ($\Longleftarrow$) Suppose we played a strategy $\sigma_\mathcal{X}$ up until time $t$, and it is now our turn to play. We have collected an average reward of $\phi_{t}$, and would like to “move closer” to $S$. See Figure 1.
Figure 1: the average payoff $\phi_t$, its projection $\psi_t$ onto $S$, and the halfspace $H(\hat{\lambda}, \hat{q})$. (Figure omitted.)
Let $\psi_t$ be the projection of $\phi_t$ onto $S$. Remember that such a projection exists and is unique since $S$ is a closed convex set. If $\phi_t\in S$, play any move. Otherwise, consider the halfspace $H(\hat{\lambda}, \hat{q})$ perpendicular to $\psi_t - \phi_t$, containing $S$ and such that $\hat{\lambda}^\top \psi_t = \hat{q}$. We clearly have
\[\hat{\lambda} = \frac{\psi_t - \phi_t}{\left\|\psi_t - \phi_t\right\|},\qquad \hat{q} = \hat{\lambda}^\top \psi_t.\]
Since $H(\hat{\lambda}, \hat{q})$ is forceable, there exists a choice of $\bar{x} \in \mathcal{X}$ such that, for all $y\in\mathcal{Y}$,
\[\hat{\lambda}^\top f(\bar{x}, y) \ge \hat{q}.\]
Let’s see how the average payoff $\phi_t$ changes when we play $\bar{x}$:
\[\phi_{t + 1} = \frac{1}{t + 1}\sum_{i = 1}^{t + 1} f(x_i, y_i) = \frac{t}{t + 1} \phi_t + \frac{1}{t + 1} f(\bar{x}, y_{t + 1}).\]
The distance from $\phi_{t + 1}$ to $S$ is bounded as
\[\begin{array}{rl} \rho(\phi_{t + 1}, S)^2 &= \min_{s\in S} \left\|\phi_{t + 1} - s\right\|^2 \\ \\ &\le \left\|\phi_{t + 1} - \psi_t\right\|^2 \\ \\ &= \left\|\frac{t}{t + 1} \phi_t + \frac{1}{t + 1} f(\bar{x}, y_{t + 1}) - \psi_t\right\|^2 \\ \\ &= \frac{1}{(t + 1)^2}\left\|t \left(\phi_t - \psi_t\right) + \left(f(\bar{x}, y_{t + 1}) - \psi_t\right)\right\|^2 \end{array}\]
By expanding the norm, we find
\[\rho(\phi_{t + 1}, S)^2 \le \frac{1}{(t + 1)^2} \left( t^2 \left\| \phi_t - \psi_t\right\|^2 + \left\|f_{t + 1} - \psi_t\right\|^2 + 2t (\phi_t - \psi_t)^\top (f_{t+1} - \psi_t ) \right),\] (2)
where we let $f_{t+1}\doteq f(\bar{x}, y_{t+1})$ be the payoff received by player $\mathcal{X}$ at time $t+1$.
Notice that $f(\bar{x}, y_{t+1}) \in H(\hat{\lambda}, \hat{q})$ by forceability, and that $\psi_t$ is also the projection of $\phi_t$ onto $H(\hat{\lambda}, \hat{q})$; the projection inequality for convex sets then gives
\[(\phi_t - \psi_t)^\top (f(\bar{x}, y_{t + 1}) - \psi_t) \le 0,\]
so that we can write
\[\rho(\phi_{t + 1}, S)^2 \le \frac{1}{(t+1)^2} (t^2 \|\phi_t - \psi_t\|^2 + \|f_{t+1}-\psi_t\|^2).\]
Since $\|\phi_t - \psi_t\| = \rho(\phi_t, S)$, multiplying both sides by $(t+1)^2$ and unrolling the recursion, we find
\[\rho(\phi_{t + 1}, S)^2 \le \frac{1}{(t+1)^2}\left(\rho(\phi_1, S)^2 + \sum_{i=1}^t \|f_{i+1}-\psi_i\|^2\right).\] (3)
Now, since every payoff lies in $B(0,1)$ and $S \subseteq B(0,1)$, we have $\rho(\phi_1, S) \le 2$ and $\|f_{i + 1} - \psi_i\| \le 2$ for all $i$, so that
\[\rho(\phi_{t + 1}, S)^2 \le \frac{4(t + 1)}{(t + 1)^2} = \frac{4}{t + 1}.\] (4)
Taking the square root of Equation 4 shows that $\rho(\phi_t, S)$ converges uniformly to zero as $t \to \infty$, at rate $O(1/\sqrt{t})$. Therefore $S$ is approachable, and we conclude the proof. $\square$
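The proof is constructive, and translates directly into a strategy once two game-specific oracles are available: a Euclidean projection onto $S$, and a forcing oracle mapping a halfspace $H(\lambda, q) \supseteq S$ to a forcing action. A minimal sketch, with both oracles (and the opponent) passed in as assumptions:

```python
import numpy as np

def blackwell(f, project, force, opponent, x1, T):
    """Play T rounds of Blackwell's strategy; return rho(phi_t, S) per round."""
    x, phi, dists = x1, None, []
    for t in range(1, T + 1):
        y = opponent(t)
        payoff = np.asarray(f(x, y))
        phi = payoff if phi is None else ((t - 1) * phi + payoff) / t
        psi = project(phi)                 # psi_t, projection of phi_t onto S
        dist = np.linalg.norm(phi - psi)
        dists.append(dist)
        if dist > 0:                       # phi_t outside S: force the
            lam = (psi - phi) / dist       # halfspace H(lam, lam^T psi_t)
            x = force(lam, lam @ psi)      # forcing action for the next round
        # else: phi_t is already in S, and any action will do (x is reused)
    return dists
```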

4. Approximate Nash equilibria

Here we make the further assumption that $\mathcal{X}$ and $\mathcal{Y}$ are (finite-dimensional) polytopes, and that the game is zero-sum. We let $u : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ represent the bilinear payoff function of the game. We start by recalling two classical definitions in game theory.
Definition 1 (Regret).
The (external) regret of player $\mathcal{X}$ against action $\hat{x}\in\mathcal{X}$ after history $\mathcal{H}_{t} = ((x_i, y_i))_{i=1}^{t}$ is defined as
\[R_\mathcal{X}(\mathcal{H}_{t}, \hat{x}) \doteq \sum_{i = 1}^{t} u(\hat{x}, y_i) - u(x_i, y_i),\]
where $u$ is the (bilinear) utility function of the game.
Definition 2 (Regret-minimizing strategy).
A strategy $\sigma_\mathcal{X}$ for player $\mathcal{X}$ is regret-minimizing, or Hannan consistent, if
\[\limsup_{t\to+\infty} \frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, \hat{x}) \le 0\] (5)
for all histories $\mathcal{H}_{t}$ and all $\hat{x}\in\mathcal{X}$.
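As a sketch, assuming the bilinear payoff $u(x, y) = x^\top A y$ over simplices, the regret of Definition 1 can be computed directly from a history:

```python
import numpy as np

def regret_x(A, history, x_hat):
    """External regret of player X against x_hat, with u(x, y) = x^T A y."""
    return sum(x_hat @ A @ y - x @ A @ y for x, y in history)

# e.g., the regret of not having always played the first pure action:
# regret_x(A, history, np.eye(A.shape[0])[0])
```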
There exists a well-known relationship between regret and approximate Nash equilibria (Definition 3), as summarized in Theorem 3.
Definition 3 (Approximate Nash equilibrium).
A strategy pair $(x, y) \in \mathcal{X}\times\mathcal{Y}$ is an $\epsilon$-Nash equilibrium if
\[u(x,y) + \epsilon \ge u(\hat{x}, y)\]
for all $\hat{x}\in\mathcal{X}$, and
\[-u(x, y) + \epsilon \ge -u(x, \hat{y})\]
for all $\hat{y}\in\mathcal{Y}$.
Theorem 3.
In a zero-sum game, if at time $t$ the players’ average regrets are such that
\[\frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, \hat{x}) \le \epsilon_\mathcal{X}, \qquad \frac{1}{t}R_\mathcal{Y}(\mathcal{H}_{t}, \hat{y}) \le \epsilon_\mathcal{Y}\]
for all histories $\mathcal{H}_{t} = ((x_i, y_i))_{i=1}^t$ and actions $\hat{x}\in\mathcal{X}, \hat{y}\in\mathcal{Y}$, then the strategy pair
\[(\bar{x}_t, \bar{y}_t) \doteq \left(\frac{1}{t}\sum_{i=1}^t x_i, \frac{1}{t}\sum_{i=1}^t y_i\right)\in\mathcal{X}\times\mathcal{Y}\]
is a $(\epsilon_\mathcal{X} + \epsilon_\mathcal{Y})$-Nash equilibrium.
Proof. We have by hypothesis that for all $\hat{x},\hat{y}$,
\[\begin{array}{c} \frac{1}{t} \sum_{i=1}^t u(\hat{x}, y_i) - u(x_i, y_i) \le \epsilon_\mathcal{X}, \\ \\ \frac{1}{t} \sum_{i=1}^t -u(x_i, \hat{y}) + u(x_i, y_i) \le \epsilon_\mathcal{Y}. \end{array}\]
Summing the inequalities, we find
\[\frac{1}{t} \sum_{i=1}^t u(\hat{x}, y_i) - u(x_i, \hat{y}) = u(\hat{x},\bar{y}_t) - u(\bar{x}_t, \hat{y}) \le \epsilon_\mathcal{X} + \epsilon_\mathcal{Y}.\]
Choosing $\hat{y} = \bar{y}_t$ yields
\[u(\bar{x}_t, \bar{y}_t) + (\epsilon_\mathcal{X} + \epsilon_\mathcal{Y}) \ge u(\hat{x}, \bar{y}_t),\quad \forall\ \hat{x}\in\mathcal{X}.\]
Analogously, choosing $\hat{x} = \bar{x}_t$ yields
\[-u(\bar{x}_t,\bar{y}_t) + (\epsilon_\mathcal{X} + \epsilon_\mathcal{Y}) \ge -u(\bar{x}_t,\hat{y}),\quad \forall\ \hat{y}\in\mathcal{Y}.\]
Therefore, we conclude that $(\bar{x}_t,\bar{y}_t)$ is a $(\epsilon_\mathcal{X} + \epsilon_\mathcal{Y})$-Nash equilibrium. $\square$
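Theorem 3 suggests a simple numerical certificate: compute the average strategies and their best-response gaps. A sketch, again assuming $u(x, y) = x^\top A y$ over simplices, where the best deviations can be taken to be pure actions:

```python
import numpy as np

def nash_gap(A, xs, ys):
    """Best-response gaps (eps_x, eps_y) of the averages of the plays xs, ys."""
    x_bar, y_bar = np.mean(xs, axis=0), np.mean(ys, axis=0)
    v = x_bar @ A @ y_bar
    eps_x = np.max(A @ y_bar) - v      # best pure deviation of player X
    eps_y = v - np.min(x_bar @ A)      # best pure deviation of Y (utility -u)
    return eps_x, eps_y                # (x_bar, y_bar) is (eps_x+eps_y)-Nash
```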

4.1. From regret-minimization to approachability

For our purposes, it is convenient to introduce the concept of instantaneous regret, $r_{\hat{x}}(x_t,y_t)$, representing the regret accumulated by player $\mathcal{X}$ against action $\hat{x}$ when the two players play $x_t\in\mathcal{X}$ and $y_t\in\mathcal{Y}$, respectively:
\[r_{\hat{x}}(x, y) \doteq u(\hat{x}, y) - u(x, y).\]
It is then evident that
\[\frac{1}{t} R_\mathcal{X}(\mathcal{H}_{t}, \hat{x}) = \frac{1}{t} \sum_{i=1}^{t} r_{\hat{x}}(x_i, y_i)\]
is the average of the instantaneous regrets of player $\mathcal{X}$. Note that under this characterization, the average regret $\frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, \hat{x})$ looks similar to the definition of the average payoff $\phi_t$.
It is tempting to introduce the biaffine multi-dimensional payoff function
\[r: (x, y) \mapsto (r_{\hat{x}}(x, y))_{\hat{x}\in\mathcal{X}},\] (6)
so that any strategy approaching the non-positive orthant would be regret-minimizing. However, this approach is flawed: since $\hat{x}$ ranges over the whole (generally infinite) set $\mathcal{X}$, the resulting payoff space would have an infinite number of dimensions.
Luckily, we can prove that the condition of Equation 5, which has to hold for all $\hat{x}\in\mathcal{X}$, is equivalent to the same condition over a restricted, finite number of actions when $\mathcal{X}$ is a finitely generated polytope. Indeed, we claim the following:
Lemma 3.
A strategy $\sigma_\mathcal{X}$ for player $\mathcal{X}$ is regret-minimizing if and only if
\[\limsup_{t\to+\infty} \frac{1}{t} R_\mathcal{X}(\mathcal{H}_{t}, \hat{x})\le 0\]
for all histories $\mathcal{H}_{t}$ and all $\hat{x} \in \{b_1, \ldots, b_n\}$, where $\{b_1,\ldots,b_n\}$ is a convex basis for $\mathcal{X}$.
Proof. ($\Longrightarrow$) This direction is trivial, as clearly $b_1, \ldots, b_n \in \mathcal{X}$. ($\Longleftarrow$) By definition of convex basis, given any $x\in\mathcal{X}$ there exist $n$ reals $\lambda_1(x), \ldots, \lambda_n(x) \ge 0$ with $\lambda_1(x) + \cdots + \lambda_n(x) = 1$, such that $x = \lambda_1(x) b_1 + \cdots + \lambda_n(x) b_n$. Therefore, using the bilinearity of $u$:
\[\begin{array}{rl} \frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, \hat{x}) &= \frac{1}{t}\sum_{i=1}^{t} u(\hat{x}, y_i) - u(x_i, y_i) \\ \\ &= \frac{1}{t} \sum_{i=1}^{t} \left(\sum_{j=1}^n \lambda_j(\hat{x}) u(b_j, y_i)\right) - \left(\sum_{j=1}^n \lambda_j(\hat{x}) u(x_i, y_i)\right)\\ \\ &= \sum_{j=1}^n \lambda_j(\hat{x})\left(\frac{1}{t}\sum_{i=1}^{t} u(b_j, y_i) - u(x_i,y_i)\right) \\ \\ &= \sum_{j=1}^n \lambda_j(\hat{x})\, \frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, b_j). \end{array}\]
We know by hypothesis that $\limsup_{t\to+\infty} \frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, b_j) \le 0$ for all $j = 1,\ldots, n$. Using the subadditivity of $\limsup$, we find
\[\limsup_{t\to+\infty} \frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, \hat{x}) \le \sum_{j=1}^n \lambda_j(\hat{x}) \limsup_{t\to+\infty} \frac{1}{t}R_\mathcal{X}(\mathcal{H}_{t}, b_j) \le 0.\]
This completes the proof. $\square$
We conclude this section by noting that, when considering a standard normal-form game, the set of pure strategies forms a basis for the strategy simplex $\mathcal{X}$. Notice, however, that our presentation is more general, and accounts for non-standard action spaces.
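The convex-combination identity at the heart of Lemma 3 is easy to sanity-check numerically. In the sketch below the action space is the simplex, the basis is the set of pure strategies, and the (random) game and history are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(3, 3))                         # random 3x3 game
history = [(np.eye(3)[rng.integers(3)], np.eye(3)[rng.integers(3)])
           for _ in range(50)]                      # random pure plays
R = lambda z: sum(z @ A @ y - x @ A @ y for x, y in history)

lam = rng.dirichlet(np.ones(3))                     # coefficients lambda_j
x_hat = lam @ np.eye(3)                             # mixed action in the simplex
assert np.isclose(R(x_hat), sum(l * R(b) for l, b in zip(lam, np.eye(3))))
```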

4.2. Building an explicit strategy

Lemma 3 fixes the problem we had with Equation 6. Specifically, we can now consider an approachability game over $\mathcal{X}$ and $\mathcal{Y}$, with a biaffine payoff function
\[f : (x, y) \mapsto (r_{b_1}(x, y), \ldots, r_{b_n}(x,y)),\]
and having the non-positive orthant as target set.
From the construction in the proof of Theorem 2 we know that, in order to find the action $x_{t+1}$ given $\mathcal{H}_{t}$, we need to first project $\phi_t$ onto $S$, finding $\psi_t$, and then force the halfspace containing $S$, perpendicular to $\phi_t-\psi_t$, and passing through $\psi_t$.
  • We start by projecting $\phi_t$ onto $S$. The particular structure of $S$ (non-positive orthant) makes it very easy:
    \[\psi_t = [\phi_t]^-,\]
    where $[x]^-$ denotes the vector whose $i$-th component is equal to $\min\{x_i, 0\}$.
  • All halfspaces tangent to $S$ and containing $S$ are of the form
    \[H(\lambda) = \{x : \lambda^\top x \ge 0\}, \quad \|\lambda\| \neq 0,\ \ \lambda_j \le 0 \ \ \forall j.\]
    By definition, $H(\lambda)$ is forceable if there exists an action $\bar{x}\in\mathcal{X}$ such that $\lambda^\top f(\bar{x}, y) \ge 0$ for all $y\in\mathcal{Y}$. Expanding the definition of $f$,
    \[\lambda^\top f(\bar{x}, y) = \sum_{j=1}^n \lambda_j (u(b_j, y) - u(\bar{x}, y)),\]
    so that the forceability condition becomes:
    \[u(\bar{x},y) \sum_{j=1}^n \lambda_j \le \sum_{j=1}^n \lambda_j u(b_j,y).\]
    Exploiting the bilinearity of $u$, we finally find that a forcing action $\bar{x}$ has to satisfy
    \[u\left(\left(\sum_{j=1}^n \lambda_j\right) \bar{x}, y\right) \le u\left(\sum_{j=1}^n \lambda_j b_j, y\right).\]
    Luckily, we can simply choose $\bar{x}$ so that the first argument on the left hand side matches exactly the first argument on the right hand side:
    \[\bar{x} = \sum_{j=1}^n \frac{\lambda_j}{\Lambda} b_j,\] (7)
    where $\Lambda\doteq\sum_{j=1}^n \lambda_j < 0$ conveniently acts as a normalization constant, so that Equation 7 not only gives us a candidate $\bar{x}$, it also guarantees that such $\bar{x}$ belongs to the polytope $\mathcal{X}$ by expressing it as a convex combination of the basis vectors $b_1,\ldots,b_n$.
We now have all the pieces needed to write down the complete algorithm:
  1. We start by playing an arbitrary $x_1\in\mathcal{X}$, and observe the opponent’s move $y_1$.
  2. At each subsequent time $t + 1$: if $\phi_t \in S$, we play an arbitrary action; otherwise, we play the action $x_{t+1}$ defined by Equation 7. In both cases, we then observe the opponent’s move $y_{t + 1}$. Notice that since $\lambda$ is parallel to $\psi_t - \phi_t = -[\phi_t]^+$, we can rewrite $x_{t+1}$ as
    \[x_{t+1} = \sum_{j=1}^n \frac{[\phi_t]_j^+}{\Lambda} b_j, \qquad \Lambda = \sum_{j=1}^n\,[\phi_t]_j^+.\]
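Below is a minimal self-play sketch of this algorithm for a zero-sum matrix game over simplices, with the pure strategies as basis. The payoffs are not rescaled into $B(0,1)$, so the constants differ from the text; the rock-paper-scissors matrix is chosen purely for illustration.

```python
import numpy as np

def next_action(phi, rng):
    """Equation 7: mix the basis actions proportionally to [phi_t]^+."""
    pos = np.maximum(phi, 0.0)
    return pos / pos.sum() if pos.sum() > 0 else rng.dirichlet(np.ones(len(phi)))

def self_play(A, T, seed=0):
    rng = np.random.default_rng(seed)
    phi_x, phi_y = np.zeros(A.shape[0]), np.zeros(A.shape[1])   # avg. regrets
    xs, ys = [], []
    for t in range(1, T + 1):
        x, y = next_action(phi_x, rng), next_action(phi_y, rng)
        xs.append(x); ys.append(y)
        v = x @ A @ y
        phi_x += ((A @ y - v) - phi_x) / t   # regrets vs. pure actions of X
        phi_y += ((v - x @ A) - phi_y) / t   # regrets vs. pure actions of Y
    return np.mean(xs, axis=0), np.mean(ys, axis=0)

A = np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]])   # rock-paper-scissors
print(self_play(A, 10_000))   # both averages approach (1/3, 1/3, 1/3)
```

Readers familiar with regret matching will recognize it here: playing proportionally to the positive part of the average regret vector is exactly Hart and Mas-Colell’s procedure, recovered as a special case of Blackwell’s strategy.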
Blackwell’s theorem guarantees that if both $\mathcal{X}$ and $\mathcal{Y}$ play according to the strategy above, then for all $\hat{x}\in\mathcal{X},\hat{y}\in\mathcal{Y}$,
\[\frac{1}{t} R_\mathcal{X}(\mathcal{H}_t, \hat{x}) \le \frac{2}{\sqrt{t}},\qquad \frac{1}{t} R_\mathcal{Y}(\mathcal{H}_t, \hat{y}) \le \frac{2}{\sqrt{t}}.\]
Using Theorem 3, we conclude that the strategy above guarantees that at time $t$ a $O(\sqrt{\max\{|\mathcal{X}|,|\mathcal{Y}|\}}/\sqrt{t})$-Nash equilibrium is found, where $|\mathcal{X}|$ and $|\mathcal{Y}|$ denote the number of basis actions of the two players (the square-root factor accounts for rescaling the regret vectors into $B(0,1)$).

5. References

This post is based on the work of
  • Matus Telgarsky, Blackwell Approachability and Minimax Theory.
  • Sergiu Hart and Andreu Mas-Colell, A Simple Adaptive Procedure Leading to Correlated Equilibrium.