1 Introduction

Games with infinitely many atomless players have long been used both in engineering and in economics to model strategic interaction among large numbers of players when the influence of any individual on the outcome of the game becomes negligible. Since the pioneering papers of Schmeidler [31] and Wardrop [38], they have become an important tool in modelling competitive markets, stock exchanges and the exploitation of common resources on one side, and network congestion or power control on the other.

Dynamic games of this type were introduced in a paper by Jovanovic and Rosenthal [19]. In their framework each player controls an individual discrete-time Markov chain, while the global state of the game, defined as the probability distribution of the individual states of all the players, evolves deterministically. The reward of an individual is then computed as the expectation of the discounted sum of the utilities he obtains over the infinitely many stages of the game. Some generalizations of their model were provided in [1, 5–7]. The extension of their model to other utility criteria, such as the expected average utility and the expected total utility, was provided in [39].

Another important class of dynamic games with a continuum of players has been introduced independently by Lasry and Lions [23–26] and by Huang et al. [16–18]. In their model time is continuous, and the evolution of both the individual and the global state of the game is described by differential equations. One can view their model as a generalization of differential games to games with a continuum of players, while that of [19] can be seen as an extension of Markov (stochastic) games to games with infinitely many players. The papers of Lasry and Lions have made an important impact on the entire game-theoretic community, additionally providing the name which is now commonly used to describe games of both types—“mean-field games”. An overview of the state of the art in mean-field game theory, including a review of applications of mean-field games in economics, can be found in [1, 11], while [35] takes a look at applications in engineering.

In this paper we concentrate on an intermediate concept, linking some features of mean-field games à la Lasry and Lions with the anonymous games of Jovanovic and Rosenthal. In our model decisions are made at discrete moments, namely the jump times of individual continuous-time Markov chains, each controlled by a different player. As a result, these moments differ across players—the process of individual states of each player is a continuous-time Markov chain, while the global state is, as in other mean-field game models, deterministic, following an ordinary differential equation. A model of this type first appeared in the literature in a seminal paper of Gomes et al. [10], where a characterization in terms of differential equations and the main properties of this model were provided, together with a result on the convergence of the n-person counterparts of this game to the mean-field limit. Further results of this type were provided in [9]. Some particular cases and applications of games of this type were also studied in [13, 14, 20, 21, 40]. In this paper we introduce a novel model of games of this type in which the players, instead of maximizing a payoff accumulated over the entire game, maximize the reward obtained during their lifetime, which may differ across players. We assume that a dead player can after some time be replaced by a newborn one, so that eventually the system may exhibit stationary behavior, which is then used to define a mean-field-type equilibrium. In the first part of the paper we give sufficient conditions for these games to possess equilibria. These conditions are of strategic-complementarity type and are inspired by a paper on Markov-type discounted mean-field games [1]. Further, we show that the payoffs of players using any given stationary strategy from a certain class in a semi-Markov mean-field game are close to those obtained in its n-person counterparts for n large enough. This implies that equilibrium strategies in the anonymous model can well approximate equilibria in related games with a large finite number of players.

The organization of the paper is as follows: In Sect. 2 we present the general framework we are going to work with and define the kind of solutions we will be looking for. In Sect. 3 we present our main results on the existence of equilibrium in games with strategic complementarities and on convergence to equilibrium of a simple learning procedure, followed by some examples of applications of our model. Section 4 contains results linking the mean-field game model presented earlier with games with a large finite number of players. It is followed by conclusions in Sect. 5.

2 The Model

In this section, we formally describe the game model and the solution we will analyze in the remainder of the paper.

The semi-Markov mean-field game with total reward is described by the following objects:

  • The game is played by an infinite number (a continuum) of players. Each player has his own private state \(s\in S\), changing over time. We assume that S is a finite set and that there exists an element \(s_0\) standing for the “death” of a player. Any player in state \(s_0\) has no choice of action to play and receives no rewards. Moreover, his reward is computed over his “lifetime”, that is, from one visit to state \(s_0\) until his next visit there.

  • The global state of the system at time t, \(X_t\), is a probability measure over S. It describes the mass of the population that is at time t in each of the individual states. The set of global states of the game is thus \(\Delta (S)\). We assume that every player \(\alpha \) is able to observe the global state of the game, so from his point of view the state of the game at time t is \((s^\alpha _t,X_t)\in S\times \Delta (S)\).

  • We assume that time is continuous, but the individual state of player \(\alpha \) can only change at specific times \(T_0^\alpha ,T_1^\alpha ,\ldots \), where \(T_0^\alpha =0\). The time between successive transitions, \(\tau _k^\alpha =T_{k+1}^\alpha -T_{k}^\alpha \), is a random variable, exponentially distributed with intensity \(\lambda (s_{T_{k-1}^\alpha },X_{T_{k}^\alpha })\). The variables \(\tau _k^\alpha \) are independent across different k and \(\alpha \). \(\lambda \) is a positive, Lipschitz-continuous function of the global state of the game.

  • The set of actions available to a player in state \((s,X)\) is a nonempty set \(A(s,X)\), with \(A:=\bigcup _{(s,X)\in S\times \Delta (S)}A(s,X)\) a finite set. We assume that the mapping A is upper semicontinuous. We also assume that any player in state \(s_0\) plays some default action \(a_0\), not available in any other individual state.

    Let D denote the set of feasible state-action vectors, that is

    $$\begin{aligned} D:=\{ (s,X,a)\in S\times \Delta (S)\times A: a\in A(s,X)\}. \end{aligned}$$
  • The transition of player \(\alpha \) at time \(T_{k}^\alpha \) is governed by the transition function \(q:D\rightarrow \Delta (S)\), which is a Lipschitz-continuous function of the global state. \(q(\cdot |s_{T_{k-1}^\alpha },X_{T_{k}^\alpha },a_{T_{k}^\alpha })\) denotes the distribution of the individual state of player \(\alpha \) after the jump he makes at time \(T_k^\alpha \), given his previous state \(s_{T_{k-1}^\alpha }\), his action \(a_{T_{k}^\alpha }\) and the global state of the game at time \(T_k^\alpha \). In particular, a player in state \(s_0\) can join the game (be reborn) at time T in state s with probability \(q(s|s_{0},X_{T},a_{0})\).

  • We assume that all the players use stationary strategies, that is, they choose their actions depending only on their current individual state and current global state. Thus any strategy f is a Borel-measurable function from \(S\times \Delta (S)\) to A such that for any \(s\in S\) and \(X\in \Delta (S)\), \(f(s,X)\in A(s,X)\). The set of all stationary strategies will be denoted by F.

  • The changes in individual states are aggregated according to the Kurtz dynamics (see Theorem 5.3 in [32]):

    $$\begin{aligned} \overset{.}{X^s_t}=\sum _{s'\in S}\sum _{a\in A}X_t^{s'}\lambda (s',X_t)q(s|s',X_t,a)\overline{f}_a(s',X_t)-X^s_t\lambda (s,X_t),\quad s\in S \end{aligned}$$
    (1)

    with \({X}_0\equiv x_0\), the initial global state, where \(\overline{f}\) denotes the average stationary policy used by the players. This average is well defined whenever the function \(f^\alpha (s,X)\) is jointly measurable in \((\alpha ,X)\), via the following equality

    $$\begin{aligned} \overline{f}_a(s,X):=\int _0^1 \mathbbm {1}\{f^\alpha (s,X)=a\}\, \mathrm{d}\alpha , \end{aligned}$$

    where \(f^\alpha \) is the stationary strategy of player \(\alpha \). As we will see, in all our considerations this will be a.e. a constant function of \(\alpha \), so joint measurability will be immediately implied by measurability w.r.t. X. In the sequel, we will write \(X_t(\overline{f})\) for the global state satisfying (1) when the average stationary strategy is \(\overline{f}\). (A numerical sketch of the dynamics (1) is given after this list.)

  • Given the evolution of the global state, which depends on the strategies of the players in a deterministic manner, we can define the individual history of player \(\alpha \) as the sequence of his consecutive individual states, actions and sojourn times \(h=\left( s^\alpha _{T^\alpha _0},\tau ^\alpha _0,a^\alpha _{T^\alpha _1},s^\alpha _{T^\alpha _1},\ldots \right) \). By the Ionescu-Tulcea theorem (see Chap. 7 in [4]), for any stationary strategy f of player \(\alpha \) and any initial individual state distribution \(\mu _0\), there exists a unique probability measure \(P_{f,\mu _0}\) on the set of all infinite histories of the game \(H=(S\times \mathbb {R}^+\times A)^\infty \), endowed with its Borel \(\sigma \)-algebra, consistent with f, q and \(\mu _0\). Player \(\alpha \)’s expected total reward is then defined as the integral of his immediate (per unit time) reward function \(r:D\rightarrow \mathbb {R}\) over his lifetime, plus the sum of rewards received upon changes of state, awarded according to the function \(\widetilde{r}:D\rightarrow \mathbb {R}\); this can be written as

    $$\begin{aligned} J(f,\overline{g},\mu _0)={{\mathbb E}}^{P_{f,\mu _0}}\left[ {\sum _{i=0}^{i_e-1}\left( \widetilde{r}(s^\alpha _{T_{i}^\alpha },X_{T_{i}^\alpha }(\overline{g}),a^\alpha _{T_{i}^\alpha })+ \int _{T_{i}^\alpha }^{T_{i+1}^\alpha }r(s^\alpha _{T_{i}^\alpha },X_t(\overline{g}),a^\alpha _{T_{i}^\alpha })\, \hbox {d}t\right) }\right] , \end{aligned}$$
    (2)

    where \(T_{i_e}^\alpha \) is the moment of his first return to \(s_0\) and \(\mu _0\) is the initial distribution of all the new-born players. We assume that both r and \(\widetilde{r}\) are continuous in the global state of the game.
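Before defining the equilibrium, we give a minimal numerical sketch of the dynamics (1), as announced above. The state space, intensities, transition kernel and average strategy below are made-up placeholders (they are not part of the model data); the sketch only shows how the global state can be propagated by forward Euler once these ingredients are specified.

```python
import numpy as np

# Hypothetical ingredients of the model (placeholders, not taken from the paper)
S = [0, 1, 2]                 # individual states; 0 plays the role of s_0 ("death")
A = [0, 1]                    # actions; 0 plays the role of the default action a_0

def lam(s, X):                # jump intensity lambda(s, X); constant, cf. (A5)
    return 1.0

# transition kernel q(s_next | s, X, a), given only for feasible (s, a) pairs
Q_TABLE = {
    (0, 0): [0.2, 0.5, 0.3],  # rebirth distribution from s_0
    (1, 0): [0.3, 0.5, 0.2],
    (1, 1): [0.1, 0.5, 0.4],
    (2, 0): [0.3, 0.2, 0.5],
    (2, 1): [0.1, 0.2, 0.7],
}

def q(s_next, s, X, a):
    return Q_TABLE[(s, a)][s_next]

def f_bar(a, s, X):           # average stationary strategy: a_0 in s_0, action 1 elsewhere
    chosen = 0 if s == 0 else 1
    return 1.0 if a == chosen else 0.0

def kurtz_rhs(X):
    """Right-hand side of (1)."""
    dX = np.zeros(len(S))
    for s in S:
        inflow = 0.0
        for sp in S:
            for a in A:
                w = f_bar(a, sp, X)
                if w > 0.0:   # skip unused (state, action) pairs
                    inflow += X[sp] * lam(sp, X) * q(s, sp, X, a) * w
        dX[s] = inflow - X[s] * lam(s, X)
    return dX

X = np.array([1.0, 0.0, 0.0])           # initial global state x_0
dt = 0.01
for _ in range(5000):                    # crude forward-Euler integration of (1)
    X = X + dt * kurtz_rhs(X)
print(X, X.sum())                        # total mass stays equal to 1 along the flow
```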

Since the game is symmetric, an equilibrium can be defined in the following manner. A stationary strategy f and a measure \(\mu \in \Delta (S)\) are in equilibrium in the semi-Markov mean-field game with total reward if \(X_0=\mu \) implies \(X_t(\overline{f})\equiv \mu \) for every \(t\ge 0\) and, for every other stationary strategy \(g\in F\),

$$\begin{aligned} J(f,\overline{f},\rho )\ge {J}(g,\overline{f},\rho ), \end{aligned}$$

where \(\rho =q(\cdot |s_{0},\mu ,a_{0})\) is the distribution of individual states of new-born players when the global state is \(\mu \).

3 Game with Strategic Complementarities

In this section we present results on the existence of and convergence to equilibrium in our model under some lattice-theoretic assumptions. Since the reader may be unfamiliar with lattice theory, we give below a brief introduction containing all the notions used in the remainder of the paper. Those interested in deepening their knowledge of this subject are referred to [36], where the concepts of lattices and supermodularity, together with their applications to decision and game theory, are discussed in detail.

3.1 Lattice-Theoretic Preliminaries

Let B be a partially ordered set with order \(\preceq \). An element \(b\in B\) is called an upper bound of \(C\subset B\) if \(b\succeq c\) for every \(c\in C\). Similarly, b is a lower bound of C if \(b\preceq c\) for all \(c\in C\). We say that b is a supremum, or least upper bound, of C in B if it is an upper bound of C and \(b\preceq b'\) for any other upper bound \(b'\) of C. An infimum, or greatest lower bound, is defined analogously. We say that B is a lattice if for every \(b,b'\in B\), \(\sup \{ b,b'\}\) and \(\inf \{ b,b'\}\) exist in B. We say that it is a complete lattice if for every nonempty \(C\subset B\), \(\sup C\) and \(\inf C\) exist in B.

Many commonly used partially ordered sets are lattices. For example, \(\mathbb {R}\) with the usual ordering is a lattice, as is any \(\mathbb {R}^n\) with the componentwise ordering. Neither of them is a complete lattice though. Compact intervals of \(\mathbb {R}^n\) are simple examples of complete lattices. A lattice which will be of particular interest to us is that of Borel probability measures on \(\mathbb {R}\), \(\Delta (\mathbb {R})\), with the (first-order) stochastic dominance ordering \(\preceq _\mathrm{SD}\) defined as follows:

$$\begin{aligned} P\preceq _\mathrm{SD}Q\quad \Longleftrightarrow \quad \int _{\mathbb {R}}g(x)P(\hbox {d}x)\le \int _{\mathbb {R}}g(x)Q(\hbox {d}x) \end{aligned}$$

for any nondecreasing bounded measurable function \(g:\mathbb {R}\rightarrow \mathbb {R}\). It is well known that \(P\preceq _\mathrm{SD}Q\) is equivalent to \(F_P(x)\ge F_Q(x)\) for every \(x\in \mathbb {R}\), where \(F_P\) and \(F_Q\) are the cumulative distribution functions corresponding to P and Q, respectively. Again, \(\Delta (\mathbb {R})\) is not a complete lattice, but for any compact subset B of \(\mathbb {R}\), \(\Delta (B)\) is complete. It has been shown in [29] that the same is no longer true for \(\mathbb {R}^2\): there, even the set of probability measures on \(\{ (0,0),(0,1),(1,0),(1,1)\}\) with the stochastic dominance ordering is not a lattice, so results relying on the lattice structure of \(\Delta (\mathbb {R})\) cannot be directly carried over to \(\Delta (\mathbb {R}^n)\), \(n\ge 2\).
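The CDF characterization above also gives a simple computational test of stochastic dominance for measures on a finite subset of \(\mathbb {R}\). A minimal sketch, with made-up probability vectors:

```python
import numpy as np

def sd_leq(p, q):
    """True iff p <=_SD q for probability vectors on a common increasing grid,
    i.e. F_p(x) >= F_q(x) at every grid point."""
    return bool(np.all(np.cumsum(p) >= np.cumsum(q) - 1e-12))

p = np.array([0.5, 0.3, 0.2])   # illustrative distributions on the grid {0, 1, 2}
q = np.array([0.2, 0.3, 0.5])
print(sd_leq(p, q))             # True: q shifts mass towards higher points
print(sd_leq(q, p))             # False
```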

Let B be a lattice. A function \(f:B\rightarrow \mathbb {R}\) is nondecreasing if \(b\preceq b'\) implies \(f(b)\le f(b')\). f is supermodular if \(f(\sup \{ b,b'\})+f(\inf \{ b,b'\})\ge f(b)+f(b')\) for all \(b,b'\in B\). If C is also a lattice, we say that a function \(f:B\times C\rightarrow \mathbb {R}\) has increasing differences in b and c if \(b\succeq b'\), \(c\succeq c'\) implies \(f(b,c)-f(b',c)\ge f(b,c')-f(b',c')\). Finally, a correspondence \(T:B\rightarrow C\) is nondecreasing if for any \(b\preceq b'\) and \(c\in T(b)\), \(c'\in T(b')\), we have \(\inf \{ c,c'\}\in T(b)\) and \(\sup \{ c,c'\}\in T(b')\). If, instead of real-valued functions f, we consider functions whose values are probability measures on \(\mathbb {R}\) (parametrized measures) with the stochastic dominance ordering, we use the terms stochastically nondecreasing and stochastically supermodular for the counterparts of the above properties. We say that a parametrized measure \(f(\cdot |b,c)\) has stochastically increasing differences if \(\int _{\mathbb {R}}g(a)f(\mathrm{d}a|b,c)\) has increasing differences for any nondecreasing bounded measurable function g.

3.2 Assumptions

Below we present the set of assumptions for the model considered in our paper. These assumptions (except the first one) are not necessary for the model to make sense, but will be used either to prove the existence of equilibria or in some further results.

  1. (A1)

    There exists a \(p_0>0\) such that for any fixed global state \(\mu \) and under any stationary policy f the probability of getting from any state \(s\in S\setminus \{ s_0\}\) to \(s_0\) in \(|S|-1\) steps is not smaller than \(p_0\).

  2. (A2)

    S and A are sublattices of \(\mathbb {R}\) with \(s_0=\min \{ S\}\) and \(a_0=\min \{ A\}\), and for any \(s\in S\) and \(X\in \Delta (S)\), \(A(s,X)\) is a sublattice of A. Moreover, \(A(s,X)\) is nondecreasing in \((s,X)\).

  3. (A3)

    \(r(s,X,a)\) and \(\widetilde{r}(s,X,a)\) are nonnegative, nondecreasing in s and supermodular in \((s,a)\). Moreover, they have increasing differences in \((s,a)\) and X.

  4. (A4)

    \(q(\cdot |s,X,a)\) is stochastically supermodular in \((s,a)\) and stochastically nondecreasing in s, a and X. Moreover, it has stochastically increasing differences in \((s,a)\) and X.

  5. (A5)

    \(\lambda (s,X)\) does not depend on s and is nonincreasing in X.

  6. (A6)

    The value of \(A(s,X)\) does not depend on X.

Remark 1

Note that some of the above assumptions can be slightly relaxed if, instead of characterizing each of the functions defining the game separately, we characterize certain combinations of them. In particular, assumptions (A3) and (A5) could be relaxed if we did not assume the positivity, monotonicity and supermodularity of each of r, \(\widetilde{r}\) and \(\lambda ^{-1}\), but rather assumed that \(\widetilde{r}(s,X,a)+\frac{r(s,X,a)}{\lambda (s,X)}\) is nonnegative, nondecreasing in s, supermodular in \((s,a)\) and has increasing differences in \((s,a)\) and X.

Remark 2

The assumption (A3) can be slightly generalized by letting the reward functions r and \(\widetilde{r}\) depend not only on the individual state s and the action a of a given player and on the global state X, but also on the global distribution of actions, which we denote by Z. Then we could assume the following:

  1. (A3’)

    \(r(s,X,a,Z)\) and \(\widetilde{r}(s,X,a,Z)\) are nonnegative, nondecreasing in s and supermodular in \((s,a)\). Moreover, they have increasing differences in \((s,a)\) and \((X,Z)\).

The proofs of Theorems 1 and 2 can be repeated when (A3) is replaced with (A3’) in the assumptions, although they become notationally more involved.

Remark 3

Our supermodularity/increasing-differences assumptions are closely related to the monotonicity assumptions used by Lasry and Lions [26] to establish the uniqueness of the equilibrium solution in a mean-field game. Assumptions of this type have been used extensively in the mean-field game literature, also for games with a finite state space [10]. Their formulations may differ slightly depending on the other assumptions made, but they can all be viewed as very close to requiring strictly increasing differences, in the individual and the global state, of some function related to the Hamiltonian corresponding to the immediate reward (cost) function (or of the immediate reward itself, see e.g. [12]), as well as of the terminal reward (cost). In our assumptions we require weak supermodularity and weakly increasing (nondecreasing) differences of the functions defining our model. It is easy to see that in the degenerate case when each of the functions r, \(\widetilde{r}\) and q is constant on \(S\times A\times \Delta (S)\), our assumptions are not violated, while any of the monotonicity assumptions used in the literature is. This is natural, as we do not expect uniqueness of equilibrium in our model, but rather a special structure of the equilibrium strategy set. Similarly, monotone mean-field game models typically require some convexity assumptions to hold; in our case no convexity in any variable is assumed. On the other hand, apart from assuming increasing differences in the individual and global states, we make additional (weak) monotonicity assumptions on the functions defining our model.

Remark 4

There is no discounting in our model, as (A1) guarantees that the expected rewards of the players are bounded. Note, however, that adding discounting does not change our results, so if some real-life application requires adding it to the model (which is often the case in economics), one is free to do so.

Assumptions of the strategic-complementarity type have been used in the game-theoretic literature for a long time. A review of results for one-step games can be found in [36]. Some results on dynamic games with strategic complementarities can be found in [2, 3, 8, 15, 30, 33, 37]. A model of discounted dynamic games with a continuum of players satisfying similar assumptions can be found in [1]. The general intuition behind this type of conditions is the following: strategic complementarity between two quantities describes a situation in which they mutually reinforce one another, that is, an increase in one of them makes it profitable to increase the other, and vice versa. In dynamic games with complementarities we usually assume that strategic complementarity holds between the individual states of the players, so an increase in one player’s state makes an increase in the others’ states profitable. In addition, we usually assume (as we do here) that there is a complementarity between a player’s actions and his states, so that an increase in the state makes higher actions more profitable. Finally, we also need some monotonicity assumptions on the immediate rewards and the transition law, which are crucial for the aggregate reward of a player to preserve the strategic complementarity of the immediate reward functions. It turns out that many games possess properties of this kind, as illustrated by the example below. It should also be noted that many real-life applications can be modelled as total-reward semi-Markov mean-field games with complementarities; some of them are presented in Sect. 3.5.

Example 1

While some of the assumptions (A1–A6) are rather clear, it may be difficult for those not familiar with the theory of supermodular functions to see what kind of functions satisfy assumptions (A3) and (A4). Below we present some examples. Functions r and \(\widetilde{r}\) satisfying (A3) can have any of the following forms:

$$\begin{aligned}&\alpha (s)\beta (a){{\mathbb E}}\left[ {\gamma (X)}\right] , \end{aligned}$$
(3)
$$\begin{aligned}&\min \{\alpha (s),\beta (a),{{\mathbb E}}\left[ {\gamma (X)}\right] \}, \end{aligned}$$
(4)

where \(\alpha :S\rightarrow \mathbb {R}\), \(\beta :A\rightarrow \mathbb {R}\), \(\gamma :S\rightarrow \mathbb {R}\) are any nonnegative nondecreasing functions. They can also be of the form

$$\begin{aligned} c_1\alpha (s)+c_2\beta (a)+c_3\gamma (X) \end{aligned}$$
(5)

where \(\alpha \) is a nonnegative nondecreasing function, \(\beta \) and \(\gamma \) are any nonnegative functions of the respective variables, and the constants \(c_1, c_2, c_3\ge 0\). Finally, they can be any conic combination of functions of the forms (3)–(5), as well as of a quadratic function of the form

$$\begin{aligned} -{{\mathbb E}}\left[ {(\beta (a)-\gamma (X))^2}\right] , \end{aligned}$$

where \(\beta \) and \(\gamma \) are nondecreasing, provided it is nonnegative.

An example of the transition law satisfying (A4) was given by Nowak [30]:

$$\begin{aligned} q(\cdot |s,X,a)=f(s,X,a)q_1(\cdot |s,X,a)+(1-f(s,X,a))q_2(\cdot |s,X,a), \end{aligned}$$

where \(q_1\succeq _\mathrm{SD}q_2\), while \(f:S\times \Delta (S)\times A\rightarrow [0,1]\) is supermodular in \((s,a)\), nondecreasing in s, a and X, and has increasing differences in \((s,a)\) and X. Such a function can be constructed as a conic combination of functions of the forms (3)–(5), under the additional condition that all the functions \(\alpha \), \(\beta \) and \(\gamma \) are nondecreasing.
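As a numerical sanity check (not a proof), the following sketch verifies supermodularity in \((s,a)\) and increasing differences in \((s,a)\) and X for a reward of form (3), with arbitrarily chosen nonnegative nondecreasing \(\alpha \), \(\beta \), \(\gamma \); all numerical choices are hypothetical.

```python
import itertools
import numpy as np

S, A = [0, 1, 2], [0, 1, 2]
alpha = lambda s: float(s)        # nonnegative, nondecreasing
beta  = lambda a: 1.0 + a         # nonnegative, nondecreasing
gamma = lambda s: float(s ** 2)   # nonnegative, nondecreasing

def r(s, X, a):                   # form (3): alpha(s) * beta(a) * E_X[gamma]
    return alpha(s) * beta(a) * sum(X[i] * gamma(i) for i in S)

def sd_leq(X, Y):                 # X <=_SD Y via CDF comparison
    return all(sum(X[:k + 1]) >= sum(Y[:k + 1]) - 1e-12 for k in range(len(S)))

Xs = [np.array([0.6, 0.3, 0.1]), np.array([0.3, 0.4, 0.3]), np.array([0.1, 0.3, 0.6])]

# supermodularity in (s, a): r(s v s', X, a v a') + r(s ^ s', X, a ^ a') >= r(s, X, a) + r(s', X, a')
supermodular = all(
    r(max(s, t), X, max(a, b)) + r(min(s, t), X, min(a, b)) >= r(s, X, a) + r(t, X, b) - 1e-12
    for X in Xs for s, t, a, b in itertools.product(S, S, A, A))

# increasing differences in (s, a) and X: the difference grows when X increases stochastically
incr_diff = all(
    r(s, Y, a) - r(t, Y, b) >= r(s, X, a) - r(t, X, b) - 1e-12
    for X in Xs for Y in Xs if sd_leq(X, Y)
    for s, t, a, b in itertools.product(S, S, A, A) if s >= t and a >= b)

print(supermodular, incr_diff)    # both True for this choice of alpha, beta, gamma
```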

3.3 Existence of Equilibrium

Now we can formulate the main result of this section.

Theorem 1

A semi-Markov mean-field game with total reward satisfying assumptions (A1–A5) has an equilibrium \((f^*,\mu ^*)\) such that \(f^*\) is nondecreasing in the individual state s and in the global state X.

Many of the arguments used in the proof are taken from [1], where discrete-time discounted mean-field games with strategic complementarities were considered. Whenever results appearing there can be used here in an unchanged form, we refer the reader to the specific results in that paper. To start with, we need to introduce, for any fixed global state X, an auxiliary dynamic optimization model \(\mathcal {M}(X)\). Suppose an individual controls a discrete-time Markov decision process with total reward, with

  1. (a)

    the state space S and the action space A;

  2. (b)

    the initial distribution of states \(\mu _0\);

  3. (c)

    the transition probabilities

    $$\begin{aligned} Q_X(\cdot |s_t,a_t)=\left\{ \begin{array}{ll}q(\cdot |s_t,X,a_t)&{} \text{ for } \text{ any } s_t\ne s_0\\ \delta [s_0],&{} \text{ for } s_t=s_0\end{array}\right. , \end{aligned}$$

    so \(s_0\) becomes now absorbing;

  4. (d)

    the reward per stage given by the equality

    $$\begin{aligned}R_X(s_t,a_t)= \widetilde{r}(s_t,X,a_t)+ \frac{r(s_t,X,a_t)}{\lambda (s_t,X)} .\end{aligned}$$

Note that for any stationary strategy f, the reward received by the controller using f in this model equals the total reward (2) in the case when the global state induced by \(\overline{g}\) is fixed and equal to X. Note also that this is a classical Markov decision process with total reward, as considered in the literature, and so standard dynamic programming arguments imply the following:

  1. (a)

    Since assumptions (A1) and (A2) hold, the optimal value in this model is finite.

  2. (b)

    The optimal value in this model \(V_X^*\) has to satisfy for any \(s\in S\) the following Bellman equation:

    $$\begin{aligned} V_X^*(s)=\max _{a\in A(s,X)}\left[ R_X(s,a)+\sum _{s'\in S}V_X^*(s')Q_X(s'|s,a)\right] . \end{aligned}$$
    (6)
  3. (c)

    A is finite, and thus compact, which implies that the ‘sup’ in (6) can be replaced by ‘max’; moreover, optimal stationary strategies in \(\mathcal {M}(X)\) exist and can be identified as any strategies maximizing the RHS of (6). (A value-iteration sketch for computing \(V_X^*\) and the smallest maximizer in (6) is given after this list.)
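Since S and A are finite, \(V_X^*\) and the smallest maximizer of the RHS of (6) can be computed by ordinary value iteration for the auxiliary MDP \(\mathcal {M}(X)\). The sketch below is generic: R and Q are user-supplied placeholder functions implementing \(R_X\) and \(Q_X\) for a fixed X (it is a sketch of the standard procedure, not code from the paper).

```python
def value_iteration(S, A_of, R, Q, tol=1e-10, max_iter=100_000):
    """Solve the Bellman equation (6) for the auxiliary MDP M(X), X fixed.

    S       : list of individual states, S[0] playing the role of s_0
    A_of(s) : feasible actions in state s (by (A6) they do not depend on X)
    R(s, a) : one-stage reward R_X(s, a) = r~(s, X, a) + r(s, X, a) / lambda(s, X)
    Q(s, a) : dict {s_next: prob} implementing Q_X(.|s, a), with s_0 absorbing
    Returns the optimal value V_X^* and the smallest maximizer in each state."""
    V = {s: 0.0 for s in S}
    for _ in range(max_iter):
        V_new = {S[0]: 0.0}                      # s_0: absorbing, no further reward
        for s in S[1:]:
            V_new[s] = max(R(s, a) + sum(p * V[sp] for sp, p in Q(s, a).items())
                           for a in A_of(s))
        if max(abs(V_new[s] - V[s]) for s in S) < tol:
            V = V_new
            break
        V = V_new
    B_low = {S[0]: min(A_of(S[0]))}              # smallest best response in each state
    for s in S[1:]:
        vals = {a: R(s, a) + sum(p * V[sp] for sp, p in Q(s, a).items())
                for a in A_of(s)}
        best = max(vals.values())
        B_low[s] = min(a for a in vals if vals[a] >= best - 1e-9)
    return V, B_low
```

Under (A1) the iteration converges, since, as shown below, \(T^{|S|}\) is a contraction.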

In the first lemma we establish the main properties of \(V_X^*\).

Lemma 1

\(V_X^*(s)\) is nondecreasing in s and has increasing differences in s and X.

Proof

The proof for the most part repeats the arguments used in [1]. It will be broken into three claims. Before we formulate the first one, we need to note two facts. First, \(R(s,X,a)=\frac{r(s,X,a)}{\lambda (s,X)}\) is, by assumptions (A3) and (A5), a product of two functions that are nonnegative, nondecreasing in s and supermodular in \((s,a)\). As such, R preserves all these properties. Next, since \((\lambda (s,X))^{-1}\) is nonnegative, constant in s and nondecreasing in X, while r has increasing differences in \((s,a)\) and X,

$$\begin{aligned} R(s,X,a)-R(s',X,a')= & {} \frac{r(s,X,a)}{\lambda (s,X)}-\frac{r(s',X,a')}{\lambda (s',X)}\\= & {} \frac{1}{\lambda (s,X)}(r(s,X,a)-r(s',X,a')), \end{aligned}$$

so for \((s,a)\succeq (s',a')\) it is a product of two nonnegative nondecreasing functions of X, and thus a nondecreasing function of X itself. This means that R has increasing differences in \((s,a)\) and X. Monotonicity, supermodularity and increasing differences are preserved under summation, so (by (A3)) \(R_X(s,a)=\widetilde{r}(s,X,a)+\frac{r(s,X,a)}{\lambda (s,X)}\) also has all these properties.

Second, note that \(Q_X(\cdot |s,a)=\left\{ \begin{array}{ll}q(\cdot |s,X,a)&{} \text{ for } \text{ any } s\ne s_0\\ \delta [s_0],&{} \text{ for } s=s_0\end{array}\right. \) preserves all the properties of q, as:

  1. (a)

    \(\delta [s_0]\) is stochastically smaller than any other probability distribution over S, and so \(Q_X\) trivially stays stochastically nondecreasing in (sa).

  2. (b)

    Stochastically increasing differences in (sa) and X are preserved because for \((s,a)\succ (s_0,a_0)\), \(Q_X(\cdot |s,a)=q(\cdot |s,X,a)\) is stochastically nondecreasing in X, while \(Q_X(\cdot |s_0,a_0)\) is constant.

  3. (c)

    Supermodularity in (sa) in D is trivial, as \((s_0,a_0)\prec (s,a)\) for any \((s,a)\ne (s_0,a_0)\), and so always \(\sup \{(s_0,a_0),(s,a)\}=(s,a)\) and \(\inf \{(s_0,a_0),(s,a)\}=(s_0,a_0)\).

Now we can pass to the main part of the proof.

Claim 1

Let v be a bounded function of s and X, nondecreasing in s and having increasing differences in s and X. Then

$$\begin{aligned} w(s,X,a)=\sum _{s'\in S}v(s',X)Q_X(s'|s,a) \end{aligned}$$

is nondecreasing in s and a, supermodular in (sa), and has increasing differences in (sa) and X.

This claim has been shown in [1] as Lemma 3.

Claim 2

Let v be a bounded function of s and X, nondecreasing in s and having increasing differences in s and X. Then

$$\begin{aligned} T(s,X)(v)=\max _{a\in A(s,X)}\left[ R_X(s,a)+\sum _{s'\in S}v(s',X)Q_X(s'|s,a)\right] \end{aligned}$$

is nondecreasing in s and has increasing differences in s and X.

This claim has been shown in [1] as Lemma 4.

Claim 3

\(V_X^*(s)\) is nondecreasing in s and has increasing differences in s and X.

By assumption (A1) we can write that for any two bounded functions v, w of \((s,X)\):

$$\begin{aligned}&\max _{s\in S,X\in \Delta (S)}\left| T^{|S|}(s,X)(v)-T^{|S|}(s,X)(w)\right| \\&\quad \le (1-\min _{s\in S,a\in A,X\in \Delta (S)}q^{|S|}(s_0|s,X,a))\max _{s\in S,X\in \Delta (S)}|v(s,X)-w(s,X)|\\&\quad \le (1-p_0)\max _{s\in S,X\in \Delta (S)}|v(s,X)-w(s,X)| \end{aligned}$$

and so \(T^{|S|}\) is a contraction. Since the set of bounded functions of (sX) which are nondecreasing in s and have increasing differences in s and X is a closed subset of a complete metric space of bounded functions from \(S\times \Delta (S)\) to \(\mathbb {R}\), it is also a complete metric space, and consequently \(T^{|S|}\) has a unique fixed point in this set.

Now take \(V_X^0\equiv 0\) and define for \(k>0\)

$$\begin{aligned} V_X^k(s)=\max _{a\in A(s,X)}\left[ R_X(s,a)+\sum _{s'\in S}V_X^{k-1}(s')Q_X(s'|s,a)\right] . \end{aligned}$$

It is clear that \(V_X^k(s)=T^k(s,X)(V_X^0)\). Consequently, \(V_X^*(s)=\lim _{k\rightarrow \infty }V_X^k(s)=\lim _{k\rightarrow \infty }T^{k|S|}(s,X)(V_X^0)\) which equals the fixed point of \(T^{|S|}\). This proves that \(V_X^*(s)\) has all the desired properties.\(\square \)

Next, let us define a correspondence that can be viewed as a best response operator:

$$\begin{aligned} \mathcal {B}(X)(s)=\arg \max _{a\in A(s,X)}\left[ R_X(s,a)+\sum _{s'\in S}V_X^*(s')Q_X(s'|s,a)\right] . \end{aligned}$$

Next, let \(\underline{B}(X)\) and \(\overline{B}(X)\) denote the smallest and the biggest best responses, that is

$$\begin{aligned} \underline{B}(X)(s)=\min \mathcal {B}(X)(s),\quad \overline{B}(X)(s)=\max \mathcal {B}(X)(s). \end{aligned}$$

The fact that both are well defined, as well as their crucial properties, is established in the following lemma.

Lemma 2

\(\mathcal {B}(X)\) is nondecreasing in (sX). Moreover, \(\underline{B}(X)(s)\) and \(\overline{B}(X)(s)\) are well defined, nondecreasing in X and, for a fixed X, also nondecreasing in s.

Proof

The proof is based on two results by Topkis. First, define

$$\begin{aligned}f(a,s,X)=R_X(s,a)+\sum _{s'\in S}V_X^*(s')Q_X(s'|s,a).\end{aligned}$$

By Lemma 1, \(V_X^*(s)\) is nondecreasing in s and has increasing differences in s and X. Next, we can use Claim 1 of that lemma to show that this implies that \(\sum _{s'\in S}V_X^*(s')Q_X(s'|s,a)\) is nondecreasing in s, supermodular in \((s,a)\), and has increasing differences in \((s,a)\) and X. Since \(R_X(s,a)\) also has these properties (as shown at the beginning of the proof of Lemma 1), and since they are preserved under summation, \(f(a,s,X)\) is also nondecreasing in s, supermodular in \((s,a)\), and has increasing differences in \((s,a)\) and X. Note also that by assumption (A2), \(A(s,X)\) is nondecreasing in \((s,X)\). Now we can apply Theorem 2.8.1 in [36] to obtain the first part of the lemma. The second statement follows from Theorem 2.8.3 (a) in [36]. \(\square \)

In the next lemma we come back to the original game model and analyze the properties of the stationary state distributions that arise when the players apply a given stationary strategy.

Lemma 3

Suppose that the global state of the game is constant and equal to X. Then the smallest stationary state distribution corresponding to a stationary strategy

$$\begin{aligned} f\in F_0:=\{ g\in F: g(s,X) \text{ is nondecreasing in } X \text{ and, for any fixed } X\text{, in } s\}, \end{aligned}$$

\(\underline{X}(f,X)\) and the greatest stationary state distribution corresponding to f, \(\overline{X}(f,X)\), are nondecreasing functions of f and X on \(F_0\times \Delta (S)\).

Proof

First, note that a stationary global state Y corresponding to the stationary strategy f used by all the players and the fixed global state of the game X must satisfy for every \(s\in S\) the following equation:

$$\begin{aligned} \sum _{s'\in S}\sum _{a\in A}Y^{s'}\lambda (s',X)q(s|s',X,a)f_a(s',X)-Y^s\lambda (s,X)=0. \end{aligned}$$

Note however that by (A5) \(\lambda (s,X)\) does not depend on s. As it is always nonzero, we can cancel out all the \(\lambda \) terms from the above equation, obtaining

$$\begin{aligned} Y^s=\sum _{s'\in S}\sum _{a\in A}Y^{s'}q(s|s',X,a)f_a(s',X). \end{aligned}$$
(7)

Clearly, by (A4) and the fact that f is nondecreasing, \(q(\cdot |s',X,f(s',X))\) is stochastically nondecreasing in \(s'\) and X, as well as in f, as long as strategies from \(F_0\) are applied.

Now define \(\phi :\Delta (S)\times \Delta (S)\times F_0\rightarrow \Delta (S)\) with equality

$$\begin{aligned} \phi ^s(Y,X,f)=\sum _{s'\in S}Y^{s'}q(s|s',X,f(s',X)). \end{aligned}$$

We will show that this is a nondecreasing function. Let \(Y\preceq _\mathrm{SD}\widetilde{Y}\), \(f\preceq \widetilde{f}\) and \(X\preceq _\mathrm{SD}\widetilde{X}\). As \(q(\cdot |s',X,f(s',X))\) is stochastically nondecreasing in X and f, clearly

$$\begin{aligned} \sum _{s\in S}w(s)q(s|s',X,f(s',X))\le \sum _{s\in S}w(s)q(s|s',\widetilde{X},\widetilde{f}(s',\widetilde{X})) \end{aligned}$$
(8)

for any \(s'\in S\) and any bounded nondecreasing function \(w:S\rightarrow \mathbb {R}\). This implies that

$$\begin{aligned}&\sum _{s\in S}w(s)\phi ^s(\widetilde{Y},\widetilde{X},\widetilde{f})=\sum _{s\in S}w(s)\sum _{s'\in S}\widetilde{Y}^{s'}q(s|s',\widetilde{X},\widetilde{f}(s',\widetilde{X}))\\= & {} \sum _{s'\in S}\widetilde{Y}^{s'}\left[ \sum _{s\in S}w(s)q(s|s',\widetilde{X},\widetilde{f}(s',\widetilde{X}))\right] \ge \sum _{s'\in S}\widetilde{Y}^{s'}\left[ \sum _{s\in S}w(s)q(s|s',X,f(s',X))\right] . \end{aligned}$$

Now note that since \(q(\cdot |s',X,f(s',X))\) is stochastically nondecreasing in \(s'\), the expression in brackets is a nondecreasing function of \(s'\), and so, since \(Y\preceq _\mathrm{SD} \widetilde{Y}\), the RHS of the last inequality is not smaller than

$$\begin{aligned}\sum _{s'\in S}Y^{s'}\left[ \sum _{s\in S}w(s)q(s|s',X,f(s',X))\right] =\sum _{s\in S}w(s)\phi ^s(Y,X,f),\end{aligned}$$

proving that \(\phi \) is nondecreasing. Now we can apply Theorem 3 in [28] to show that for any \(X\in \Delta (S)\) and \(f\in F_0\), there exists a \(Y\in \Delta (S)\) such that

$$\begin{aligned} Y=\phi (Y,X,f). \end{aligned}$$
(9)

Moreover, the greatest and the smallest Y satisfying (9), that is, the greatest and the smallest stationary distributions corresponding to X and f, \(\overline{X}(f,X)\) and \(\underline{X}(f,X)\), are nondecreasing in X and f. \(\square \)
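Numerically, the smallest and the greatest solutions of (9) can be approximated by iterating \(\phi (\cdot ,X,f)\) from the two extreme points of \(\Delta (S)\); since \(\phi \) is nondecreasing and continuous, the iterates started at \(\delta [\min S]\) increase to the smallest fixed point and those started at \(\delta [\max S]\) decrease to the greatest one. A minimal sketch, with states encoded as \(0,\ldots ,|S|-1\) and with the transition kernel q and the strategy f supplied as hypothetical placeholder functions:

```python
import numpy as np

def phi(Y, X, f, S, q):
    """phi^s(Y, X, f) = sum_{s'} Y^{s'} q(s | s', X, f(s', X)), cf. the proof of Lemma 3."""
    return np.array([sum(Y[sp] * q(s, sp, X, f(sp, X)) for sp in S) for s in S])

def extreme_stationary_distributions(X, f, S, q, n_iter=100_000, tol=1e-12):
    """Iterate phi from delta[min S] and delta[max S]; the limits approximate
    the smallest and the greatest stationary distributions of Lemma 3."""
    Y_lo = np.eye(len(S))[0]      # delta[min S], the SD-smallest element of Delta(S)
    Y_hi = np.eye(len(S))[-1]     # delta[max S], the SD-largest element of Delta(S)
    for _ in range(n_iter):
        new_lo, new_hi = phi(Y_lo, X, f, S, q), phi(Y_hi, X, f, S, q)
        if max(np.abs(new_lo - Y_lo).max(), np.abs(new_hi - Y_hi).max()) < tol:
            Y_lo, Y_hi = new_lo, new_hi
            break
        Y_lo, Y_hi = new_lo, new_hi
    return Y_lo, Y_hi
```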

Proof of Theorem 1

Define

$$\begin{aligned} \overline{\Psi }(X)=\overline{X}(\overline{B},X)\quad \text{ and }\quad \underline{\Psi }(X)=\underline{X}(\underline{B},X). \end{aligned}$$

Both functions are nondecreasing in X (as superpositions of functions that are nondecreasing by Lemmas 2 and 3, respectively) and are defined on the nonempty complete lattice \(\Delta (S)\). Thus, by Tarski’s theorem [34], each of them has a fixed point, which clearly defines an equilibrium in the game. Note also that by Lemma 2 the equilibrium stationary strategies (\(\overline{B}\) and \(\underline{B}\), respectively) are nondecreasing in s and X. \(\square \)

3.4 Distributed Learning

In the next part of this section we present a distributed iterative algorithm allowing the players to learn to play the game. Algorithms of this kind are known to exist for several classes of games, and games with strategic complementarities are one of them. The very simple and intuitive algorithm presented below is an adaptation to our game of an algorithm presented in [1].

Algorithm 1 (Lower Myopic Learning) For each time moment \(t\ge 0\) repeat the following steps:

  1. 1.

    Every player making his move at time t observes current population state \(X_t\).

  2. 2.

    A player in the individual state s chooses action \(a_t=\underline{B}(X_t)(s)\).
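A discretized sketch of this procedure, reusing the hypothetical interfaces of the earlier sketches (value_iteration for the smallest best response, forward Euler for (1)): it is meant only to illustrate the monotone behavior asserted in Theorem 2 below, for placeholder model data supplied by the user.

```python
import numpy as np

def lower_myopic_trajectory(X0, S, A_of, lam, q, R_of, dt=0.01, n_steps=2000):
    """X0      : initial global state, assumed to satisfy (10)
    R_of(X)    : returns the one-stage reward function (s, a) -> R_X(s, a)
    lam, q     : intensity and transition kernel with the interfaces used earlier."""
    X = np.array(X0, dtype=float)
    traj = [X.copy()]
    for _ in range(n_steps):
        # steps 1-2 of the algorithm: players moving now use the smallest best response
        Q_of = lambda s, a: {sp: q(sp, s, X, a) for sp in S}
        _, B_low = value_iteration(S, A_of, R_of(X), Q_of)
        # population state pushed along (1) with the average strategy B_low
        dX = np.zeros(len(S))
        for s in S:
            inflow = sum(X[sp] * lam(sp, X) * q(s, sp, X, B_low[sp]) for sp in S)
            dX[s] = inflow - X[s] * lam(s, X)
        X = X + dt * dX
        traj.append(X.copy())
    return traj    # along the trajectory X_t should be nondecreasing w.r.t. <=_SD
```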

The following theorem summarizes the main properties of the Lower Myopic Learning Algorithm.

Theorem 2

Suppose assumptions (A1–A6) are satisfied. Additionally assume that the initial state of the game \(X_0\) satisfies the inequality

$$\begin{aligned} X_0\preceq _{SD} \phi (X_0,X_0,\underline{B}(X_0)) \end{aligned}$$
(10)

and that all the players adjust their strategies according to the Lower Myopic Learning Algorithm. Then:

  1. (a)

    For every \(\alpha \) \(a^\alpha _{T^\alpha _{i+1}}\ge a^\alpha _{T^\alpha _{i}}\), \(i=0,1,\ldots ,i_e^\alpha -1\).

  2. (b)

    \(X_t\) is an increasing function of t converging to some \(\mathcal {X}\) as \(t\rightarrow \infty \), such that \((\underline{B},\mathcal {X})\) is an equilibrium in the game.

One lemma will be used in the proof of the above theorem.

Lemma 4

Suppose that assumptions (A1–A6) are satisfied and that \(\widehat{X},X_t\in \Delta (S)\), \(t\in \mathbb {R}^+\), are such that \(X_t\nearrow \widehat{X}\). If f is a stationary strategy such that

$$\begin{aligned} f(X_t,s)\rightarrow _{t\rightarrow \infty }f(\widehat{X},s) \text{ for } \text{ any } s\in S, \end{aligned}$$
(11)

then for any \(s\in S\) the reward from using policy f in model \(\mathcal {M}(X_t)\), \(J_{X_t}(f,s)\), converges to the reward from using f in \(\mathcal {M}(\widehat{X})\), \(J_{\widehat{X}}(f,s)\), as t goes to infinity.

Proof

For any bounded function \(v:S\times \Delta (S)\rightarrow \mathbb {R}\), nondecreasing in s, such that

$$\begin{aligned} v(s,X_t)\rightarrow v(s,\widehat{X}) \text{ for } \text{ any } s\in S \end{aligned}$$
(12)

let us define the operator

$$\begin{aligned} K_f(s,X)(v)=R_X(s,f(X,s))+\sum _{s'\in S}v(s',X)Q_X(s'|s,f(X,s)). \end{aligned}$$

It is clear that for any \(s\in S\), (11) together with the continuity of r, \(\widetilde{r}\) and \(\lambda \) implies the following:

$$\begin{aligned}&\lim _{t\rightarrow \infty }R_{X_t}(s,f(X_t,s))=\lim _{t\rightarrow \infty }\left[ \widetilde{r}(s,X_t,f(X_t,s))+\frac{r(s,X_t,f(X_t,s))}{\lambda (s,X_t)}\right] \\&\quad =\widetilde{r}(s,\widehat{X},f(\widehat{X},s))+\frac{r(s,\widehat{X},f(\widehat{X},s))}{\lambda (s,\widehat{X})}=R_{\widehat{X}}(s,f(\widehat{X},s)). \end{aligned}$$

Then, also

$$\begin{aligned} \lim _{t\rightarrow \infty }\sum _{s'\in S}v(s',X_t)Q_{X_t}(s'|s,f(X_t,s))= \sum _{s'\in S}v(s',\widehat{X})Q_{\widehat{X}}(s'|s,f(\widehat{X},s)), \end{aligned}$$

by (11), (12) and the continuity of Q. This obviously implies that

$$\begin{aligned} \lim _{t\rightarrow \infty }K_f(s,X_t)(v)=K_f(s,\widehat{X})(v). \end{aligned}$$

Consequently, by induction the same is true for \(K^k_f(s,X)(v)\) with \(k\in \mathbb {N}\).

Next note that r, \(\widetilde{r}\) and \(\lambda \) are continuous on a compact domain, hence bounded. Let L be such that \(|r(s,X,a)|\le L\), \(|\widetilde{r}(s,X,a)|\le L\) and \(\lambda (s,X)\le L\) for any \((s,X,a)\in D\). In addition \(\lambda \) is by assumption positive, so there also exists a \(\underline{\lambda }>0\) such that \(\lambda (s,X)\ge \underline{\lambda }\) for any \((s,X,a)\in D\). Consequently, \(|R_X(s,a)|<L+\frac{L}{\underline{\lambda }}\) for any \((s,X,a)\in D\). Further note that by (A1) for any X, s, f and v,

$$\begin{aligned}&\left| \lim _{m\rightarrow \infty }K^m_f(s,X)(v)-K^k_f(s,X)(v)\right| \\&\quad \le \left( L+\frac{L}{\underline{\lambda }}\right) (1-p_0)^{\left\lfloor \frac{k}{|S|-1}\right\rfloor }\sum _{i=0}^\infty (1-p_0)^i=\frac{L(\underline{\lambda }+1)}{\underline{\lambda }p_0}(1-p_0)^{\left\lfloor \frac{k}{|S|-1}\right\rfloor }. \end{aligned}$$

Thus, for any \(\varepsilon >0\), \(\left| \lim _{m\rightarrow \infty }K^m_f(s,X)(v)-K^k_f(s,X)(v)\right| <\frac{\varepsilon }{2}\) for k big enough, say \(k\ge k_0\). Consequently,

$$\begin{aligned}&\left| \lim _{t\rightarrow \infty }\lim _{m\rightarrow \infty }K^m_f(s,X_t)(v)-\lim _{m\rightarrow \infty }K^m_f(s,\widehat{X})(v)\right| \\&\quad <\left| \lim _{t\rightarrow \infty } K^{k_0}_f(s,X_t)(v)-K^{k_0}_f(s,\widehat{X})(v)\right| +\varepsilon =\varepsilon . \end{aligned}$$

Note, however, that this, in view of the arbitrariness of \(\varepsilon \) and because both limits on the LHS of the above chain of inequalities exist, implies

$$\begin{aligned} \lim _{t\rightarrow \infty }\lim _{m\rightarrow \infty }K^m_f(s,X_t)(v)=\lim _{m\rightarrow \infty }K^m_f(s,\widehat{X})(v). \end{aligned}$$

By standard dynamic programming arguments, however, \(\lim _{m\rightarrow \infty }K^m_f(s,X)(v)\) equals \(J_X(f,s)\). Thus we have proved that \(J_{X_t}(f,s)\rightarrow _{t\rightarrow \infty }J_{\widehat{X}}(f,s)\) for every \(s\in S\). \(\square \)

Proof of Theorem 2

First, note that

$$\begin{aligned} X_t\preceq _\mathrm{SD} \phi (X_t,X_t,\underline{B}(X_t)) \end{aligned}$$
(13)

is by definition equivalent to

$$\begin{aligned} \sum _{s\in S}\left[ (\phi (X_t,X_t,\underline{B}))^s-X_t^s\right] h(s)\ge 0 \end{aligned}$$

for any nondecreasing function \(h:S\rightarrow \mathbb {R}\), and further by the definition of \(\phi \) and (A5) to

$$\begin{aligned} \sum _{s\in S}\left[ \sum _{s'\in S}X_t^{s'}\lambda (s',X_t)q(s|s',X_t,\underline{B}(X_t)(s'))-X^s_t\lambda (s,X_t)\right] h(s)\ge 0. \end{aligned}$$
(14)

Next, define

$$\begin{aligned} H(h):=\sum _{s\in S}X_t^s(\underline{B})h(s), \end{aligned}$$

where (as before) h is a nondecreasing function from S to \(\mathbb {R}\). Then by (1) and (14)

$$\begin{aligned} \frac{\mathrm{d}H(h)}{\mathrm{d}t}= & {} \sum _{s\in S}\overset{.}{X^s_t}h(s)\\= & {} \sum _{s\in S}\left[ \sum _{s'\in S}X_t^{s'}\lambda (s',X_t)q(s|s',X_t,\underline{B}(X_t)(s'))-X^s_t\lambda (s,X_t)\right] h(s)\\\ge & {} 0 \end{aligned}$$

This however means that as long as (13) holds, the global state of the game is increasing as time increases. Of course, it also implies that \(a^\alpha _{T^\alpha _{i+1}}\ge a^\alpha _{T^\alpha _{i}}\), \(i=0,1,\ldots ,i_e^\alpha -1\) for any player \(\alpha \), as \(\underline{B}\) is nondecreasing.

Next assume that at some time t (13) is violated. Then, since at the beginning of the game it was by assumption true, and because of the continuity of the trajectory of \(X_t\), there must exist a function \(h_0\), such that

$$\begin{aligned} \sum _{s\in S}\left[ (\phi (X_t,X_t,\underline{B}))^s-X_t^s\right] h_0(s)=0. \end{aligned}$$

Then, it is easy to see that

$$\begin{aligned} \frac{\mathrm{d}H(h_0)}{\mathrm{d}t}= & {} \sum _{s\in S}\overset{.}{X^s_t}h_0(s)\\= & {} \sum _{s\in S}\left[ \sum _{s'\in S}X_t^{s'}\lambda (s',X_t)q(s|s',X_t,\underline{B}(X_t)(s'))-X^s_t\lambda (s,X_t)\right] h_0(s)\\= & {} 0, \end{aligned}$$

which implies that when the boundary of the set where (13) is satisfied is reached, the trajectory cannot leave the set.

Next note that since at any time t \(X_t\preceq _\mathrm{SD} \delta [\max \{ S\}]\), the fact that \(X_t\) is increasing implies that it converges to some \(\mathcal {X}\) (recall that the stochastic domination ordering is equivalent to ordering of CDFs, which, as S is finite, is in turn equivalent to componentwise ordering in \(\mathbb {R}^{|S|}\)).

Next, define

$$\begin{aligned} \widehat{\underline{B}}(X)(s):=\left\{ \begin{array}{ll}\underline{B}(X)(s)&{} \text{ for } X\ne \mathcal {X}\\ \lim _{t\rightarrow \infty }\underline{B}(X_t)(s)&{} \text{ for } X=\mathcal {X}\end{array}\right. \end{aligned}$$

We will now show that \(\mathcal {X}\) is a stationary distribution corresponding to \(\widehat{\underline{B}}\). From the definition of \(\widehat{\underline{B}}\) and the continuity of \(\lambda \) and q we can infer that

$$\begin{aligned}&\lim _{t\rightarrow \infty }\left[ \sum _{s'\in S}X_t^{s'}\lambda (s',X_t)q(s|s',X_t,\widehat{\underline{B}}(X_t)(s'))-X^s_t\lambda (s,X_t)\right] \nonumber \\&\quad =\sum _{s'\in S}\mathcal {X}^{s'}\lambda (s',\mathcal {X})q(s|s',\mathcal {X},\widehat{\underline{B}}(\mathcal {X})(s'))-\mathcal {X}^s\lambda (s,\mathcal {X}). \end{aligned}$$
(15)

On the other hand, since \(X_t\rightarrow _{t\rightarrow \infty }\mathcal {X}\) monotonically, for any s \(\overset{.}{X^s_t}\rightarrow 0\), which is equivalent to

$$\begin{aligned} \lim _{t\rightarrow \infty }\left[ \sum _{s'\in S}X_t^{s'}\lambda (s',X_t)q(s|s',X_t,\widehat{\underline{B}}(X_t)(s'))-X^s_t\lambda (s,X_t)\right] =0. \end{aligned}$$

Combining this with (15) we obtain

$$\begin{aligned} \sum _{s'\in S}\mathcal {X}^{s'}\lambda (s',\mathcal {X})q(s|s',\mathcal {X},\widehat{\underline{B}}(\mathcal {X})(s'))-\mathcal {X}^s\lambda (s,\mathcal {X})=0. \end{aligned}$$

But this, by the definition of \(\phi \) and (A5), means that \(\mathcal {X}\) is a fixed point of \(\phi (\cdot ,\mathcal {X},\widehat{\underline{B}})\), and consequently \(\mathcal {X}\) is a stationary distribution corresponding to \(\widehat{\underline{B}}\).

Next, note that under (A6) any vector of actions \(\overline{a}=(a_s)_{s\in S}\) from sets \(A(s,\mathcal {X})\) can be obtained as a value of a global state-independent policy defined by

$$\begin{aligned} f_{\overline{a}}(X,s)=a_s,\quad s\in S. \end{aligned}$$

Clearly, each of the policies \(f_{\overline{a}}\) satisfies (11). So does \(\widehat{\underline{B}}\) by its construction. Thus we can use Lemma 4 to show the following

$$\begin{aligned} J_{\mathcal {X}}(\widehat{\underline{B}},s)=\lim _{t\rightarrow \infty }J_{X_t}(\widehat{\underline{B}},s)\ge \lim _{t\rightarrow \infty }J_{X_t}(f_{\overline{a}},s)=J_{\mathcal {X}}(f_{\overline{a}},s), \end{aligned}$$

where the inequality follows from the fact that \(\widehat{\underline{B}}(X)=\underline{B}(X)\) for \(X\ne \mathcal {X}\) and the fact that for each t, \(\underline{B}(X_t)\) is a best response to \(X_t\). But this proves that \(\widehat{\underline{B}}(\mathcal {X})\) is a best response to \(\mathcal {X}\), as strategies \(f_{\overline{a}}\) cover all the possible actions that a player can use at the global state \(\mathcal {X}\). To end the proof, note that by the monotonicity of \(\underline{B}\),

$$\begin{aligned} \widehat{\underline{B}}(\mathcal {X})=\lim _{t\rightarrow \infty }\underline{B}(X_t)\le \underline{B}(\mathcal {X}).\end{aligned}$$

This however implies that \(\widehat{\underline{B}}(\mathcal {X})=\underline{B}(\mathcal {X})\), as the latter is by definition the smallest best response to \(\mathcal {X}\). Since the two strategies could only differ at \(\mathcal {X}\), this means that they are equal and that \((\underline{B},\mathcal {X})\) is an equilibrium in the game. \(\square \)

Remark 5

The assumption (10) is both difficult to check and rather restrictive. In [1], to avoid this kind of problem, the authors start the algorithm by setting the initial state of each player to \(\min \{ S\}\). This kind of solution seems doubtful. Note that the notion of the state of the game, or that of a player, captures the properties of his environment and, as such, depends only partially (and in a nondeterministic way) on his decisions. It cannot thus be set by a player at the beginning of the game. Note, however, that in our setting this kind of assumption makes more sense than in [1]. In our framework \(\min \{ S\}=s_0\), so assuming that the algorithm is initialized by setting the individual states of all the players to \(s_0\) means that the game is started before any players join it, which could make sense in many practical applications. On the other hand, for \(X_0=\delta [s_0]\), (10) is trivially satisfied, as it reduces to \(\delta [s_0]\preceq _\mathrm{SD} q(\cdot |s_0,\delta [s_0],a_0)\), which is true for any transition probability defined on S.

Remark 6

In our setting the players join the game at different times. This naturally implies that those joining at later stages of the game hardly need any adjustment to their initial strategies, because the global state of the game is already very close to \(\mathcal {X}\) when they appear. Consequently, the expected rewards they receive over their lifetime are very close to equilibrium payoffs corresponding to the smallest equilibrium in the game.

3.5 Examples of Application of the Model

In the remainder of this section we briefly present some natural applications of our framework. Further applications would be possible if the sets of states and actions were multi-dimensional or if the rewards could be negative. Generalizing the model to these situations is, however, left for further research.

Research and development race In this game the players are firms choosing their technological profile. Let s be the level of technological development of a firm’s products and a its investment in research. The transition times of a player are the moments of technological breakthroughs of his firm. Obviously, these moments do not come at the same time for all the firms, so this corresponds well to our framework. Next, a ‘death’ of a player is naturally interpreted as his firm’s bankruptcy. Finally, let r describe the firm’s profit minus its investment; we assume that there is no \(\widetilde{r}\). It is natural to assume strategic complementarities between the rewards of different firms—a higher level of technological development of the entire industry results in a higher demand for high-tech products. Also, a higher investment in research is required if the industry is at a higher level of development. Finally, one can argue that a firm with a higher technological profile is less likely to go bankrupt.

Corruption game This is a variant of the game presented in [20]. The players here are civil servants who can be in three states: corrupt, honest, or excluded from society. The last state can naturally be seen as a (civil) death of a player—in this state he is not able to receive any rewards. A player’s transitions happen when he has to decide on some project; these moments are naturally different for different players. His actions describe his willingness to change his state. Obviously, a player who wants to be bribed is much more likely to become corrupt. Also, the possibility of becoming corrupt increases as the society becomes more depraved. In the corrupt state a player’s rewards are the highest, and they naturally increase as the society becomes more corrupt. Finally, the probability of death of a player decreases as the society becomes more corrupt, because the control is less stringent. Thus, we can argue that this is a game with strategic complementarities as well.

Interdependent security A similar model appeared in [22]. Let us consider a large number of computers in a cluster. Each of them is trying to avoid system failure due to viruses. Let s describe an individual computer’s security level and a its investment in security. The transition times of an individual are the moments of malicious attacks against him. A ‘death’ of a player is the time of his system failure. We can assume that \(r\equiv 1\) if the system is OK and zero otherwise (a number of different ‘health’ levels with different rewards is also possible). Further, let \(\widetilde{r}\) be the individual’s investment in security. As one can immediately see, this model fails to satisfy our assumptions, because \(\widetilde{r}\) is negative. We can, however, argue that the weaker version of our assumptions presented in Remark 1 can be satisfied without making the model unrealistic. Note that this game is a natural example of a game with strategic complementarities, as a higher level of security of the other computers results in a lower probability of infecting any one of them. It is also natural to assume that attacks on different machines are not coordinated, so the moves of different players are asynchronous, as in our framework.

Charging control for plug-in electric vehicles This model is inspired by the one presented in [27]. Let us consider a large population of plug-in electric vehicles. Each of them needs to recharge its battery regularly, but tries to do it as cheaply as possible. The problem is that the cost of energy may depend on the hour of the day—from the electricity producers’ point of view it is best if all the vehicles charge their batteries at the same time during the night, when the overall energy consumption is relatively low, so they can impose some additional cost on the car owners for doing otherwise. On the other hand, a vehicle whose battery is empty needs to be recharged immediately, otherwise its owner’s profits from using it decrease. In our model each player tries to maximize his profits from the use of the car, minus the charging costs, over the lifetime of the vehicle. There are two possible actions: \(a=1\) (not to charge) and \(a=2\) (to charge), and a number of states denoting the battery charge levels (plus an artificial state \(s_0<0\) and action \(a_0=0\) denoting the breakdown of the car). The transition times can be viewed as the moments when the battery of a given vehicle can be charged, so we can assume \(\lambda \) is constant. At each transition time the battery level decreases by one with some positive probability, drops to \(s_0\) with some smaller positive probability, and remains constant with the remaining probability, unless the user decides to charge the battery—then it increases to the maximum battery level \(s_\mathrm{max}\). The immediate reward is of the form \(r(s,X,a,Z)=R\mathbbm {1}\{ s>0\}\), where R is the reward from the exploitation of the vehicle, while \(\widetilde{r}\) is defined as

$$\begin{aligned} \widetilde{r}(s,X,a,Z)=\mathbbm {1}\{ a=2\}[p(s-s_\mathrm{max})-c{{\mathbb E}}\left[ {(a-Z)^2}\right] ], \end{aligned}$$

where p is the nominal energy price and c is the additional cost for deviating from the average policy of the population. Again, this model fails to satisfy assumptions (A1–A5) (\(\widetilde{r}\) is nonpositive and depends on Z), but it can be directly checked that for R big enough it satisfies all the assumptions of the model combining its two generalizations described in Remarks 1 and 2.
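A possible encoding of these ingredients, with hypothetical numerical values for the number of battery levels, the breakdown probabilities and the constants R, p and c (only the functional forms of r and \(\widetilde{r}\) follow the description above; the rebirth rule and all numbers are illustrative assumptions):

```python
S_MAX = 5                                 # maximum battery level (hypothetical)
S = [-1] + list(range(S_MAX + 1))         # -1 plays the role of s_0 < 0 (breakdown)
A_DEAD, A_IDLE, A_CHARGE = 0, 1, 2        # a_0 = 0, "do not charge" = 1, "charge" = 2

R_REWARD, PRICE, COST = 10.0, 1.0, 0.5    # R, p, c -- hypothetical constants

def r(s, X, a, Z):
    """Per-unit-time reward: R while the car works, nothing otherwise."""
    return R_REWARD if s > 0 else 0.0

def r_tilde(s, X, a, Z):
    """Lump-sum reward at a transition: charging cost plus the penalty
    c * E[(a - Z)^2] for deviating from the average action profile Z."""
    if a != A_CHARGE:
        return 0.0
    mean_sq_dev = sum(Z[b] * (a - b) ** 2 for b in Z)
    return PRICE * (s - S_MAX) - COST * mean_sq_dev

def q_row(s, X, a, p_down=0.6, p_break=0.05):
    """Distribution {s_next: prob} of the next battery state.
    Breakdown is possible at every transition, so (A1)-type conditions can hold."""
    if s == -1:                                   # dead car, default action a_0: rebirth (hypothetical rule)
        return {S_MAX: 1.0}
    if a == A_CHARGE:                             # charging refills the battery
        return {S_MAX: 1.0 - p_break, -1: p_break}
    out = {-1: p_break}                           # breakdown to s_0
    lower = max(s - 1, 0)
    out[lower] = out.get(lower, 0.0) + p_down     # battery level drops by one
    out[s] = out.get(s, 0.0) + 1.0 - p_down - p_break   # level unchanged
    return out
```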

Remark 7

It is worth noting that the last model is one of many models considered in the engineering literature in which so-called crowd-seeking behavior is beneficial for the players. Strategic complementarity between the states or actions of the players seems a perfect mathematical description of this kind of situation. It turns out, however, that the engineering applications of our model are limited, for several reasons. The first is that engineering models typically consider costs, not rewards, so the positivity assumption appearing in (A3) (very important, since we consider a total-reward model) fails. The second is that we also assume that r and \(\widetilde{r}\) are nondecreasing in s, which often is not satisfied. One should note, however, that this monotonicity assumption is crucial in proving that the aggregate utility of each player preserves the strategic-complementarity structure, so we cannot easily dispense with it. Finally, problems can be caused by the fact that we assume that the state space is a sublattice of \(\mathbb {R}\) (which is important because for \(S\subset \mathbb {R}^n\), \(n\ge 2\), the set \(\Delta (S)\) does not preserve the lattice structure).

4 Relation to Games with Finitely Many Players

In this section we provide a result which links the model with a continuum of players studied above with related models with finitely many players. In turn, this result provides a justification for the use of the Kurtz dynamics (1) for the global state of the game. To begin with, we need to introduce the finite models we will discuss below. Let \(\Gamma \) denote the game with a continuum of players defined in Sect. 2. Then \(\Gamma _n\) will denote its counterpart with n players, played in exactly the same way as the game \(\Gamma \) and such that:

  1. (a)

    The global state of the game at time t is denoted by \(X_t[n]\) and defined by the formula

    $$\begin{aligned} X_t^s[n]=\# \{ \alpha \in \{ 1,\ldots ,n\}:s_t^\alpha =s\}. \end{aligned}$$

    Next, the normalized global state of the game at time t is denoted by \(\overline{X}_t[n]\) and defined as

    $$\begin{aligned} \overline{X}_t^s[n]=\frac{1}{n}X_t^s[n]. \end{aligned}$$
  2. (b)

    All the functions defining the model are defined with respect to the normalized state, and so:

    $$\begin{aligned} r[n](s_t,X_t[n],a_t):=r(s_t,\overline{X}_t[n],a_t),&\quad \widetilde{r}[n](s_t,X_t[n],a_t):=\widetilde{r}(s_t,\overline{X}_t[n],a_t),\\ q[n](\cdot |s_t,X_t[n],a_t):=q(\cdot |s_t,\overline{X}_t[n],a_t),&\quad \lambda [n](s_t,X_t[n]):=\lambda (s_t,\overline{X}_t[n]). \end{aligned}$$

Next define the subset of strategies we shall concentrate on in this section.

$$\begin{aligned} F_{c}=\{ f\in F: f(s,X) \text{ does } \text{ not } \text{ depend } \text{ on } X\}. \end{aligned}$$
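For concreteness, a short sketch of these two objects (with purely illustrative names) is given below: the normalized global state is just the empirical distribution of the n individual states, and a strategy in \(F_c\) is simply a map from S to A.

```python
import numpy as np

def normalized_global_state(individual_states, S):
    """Return \bar X_t[n]: the empirical distribution of the n individual states over S."""
    states = np.asarray(individual_states)
    counts = np.array([(states == s).sum() for s in S], dtype=float)
    return counts / len(states)          # coordinate s equals X_t^s[n] / n

def in_F_c(f, S, sample_global_states):
    """Check, on a finite sample of global states, that f(s, X) does not depend on X
    (a necessary condition for f to belong to F_c)."""
    X0 = sample_global_states[0]
    return all(f(s, X) == f(s, X0) for s in S for X in sample_global_states)

# Example: three players in states 0, 2, 2 over S = [0, 1, 2].
print(normalized_global_state([0, 2, 2], S=[0, 1, 2]))      # -> [1/3, 0, 2/3]
```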

The following result links the game \(\Gamma \) with the games \(\Gamma _n\) for n sufficiently large.

Theorem 3

Suppose assumption (A1) holds and take some \(\Theta ,\varepsilon >0\). Then there exists an \(N\in \mathbb {N}\) such that for any \(n\ge N\) the expected reward of player \(\alpha \) from playing policy \(g\in F_{c}\) against \(f\in F_{c}\) played by all the other players in the game \(\Gamma _n\) differs from his expected reward when he plays g against f in game \(\Gamma \) by at most \(\varepsilon \).

Proof

First recall that r, \(\widetilde{r}\) and \(\lambda \) are continuous on a compact domain, hence bounded. Let L be such that \(|r(s,X,a)|\le L\), \(|\widetilde{r}(s,X,a)|\le L\) and \(\lambda (s,X)\le L\) for any \((s,X,a)\in D\). In addition, note that \(\lambda \) is by assumption positive, so there also exists a \(\underline{\lambda }>0\) such that \(\lambda (s,X)\ge \underline{\lambda }\) for any \((s,X,a)\in D\).

Next, note that under assumption (A1) the absolute value of the expected sum of rewards received by (any given) player \(\alpha \) from his kth change of state on,

$$\begin{aligned} \left| {{\mathbb E}}\left[ {\sum _{i=k}^{i_e-1}\left( \widetilde{r}(s^\alpha _{T_{i}^\alpha },X_{T_{i}^\alpha },a^\alpha _{T_{i}^\alpha })+ \int _{T_{i}^\alpha }^{T_{i+1}^\alpha }r(s^\alpha _{T_{i}^\alpha },X_t,a^\alpha _{T_{i}^\alpha })\, \mathrm{d}t\right) }\right] \right| \end{aligned}$$

can be bounded by

$$\begin{aligned}&\left( L+\frac{L}{\underline{\lambda }}\right) \mathbb {P}[i_e>l]\sum _{i=0}^\infty (|S|-1)\mathbb {P}[i_e> i(|S|-1)+k|i_e>k]\nonumber \\&\quad \le \left( L+\frac{L}{\underline{\lambda }}\right) (1-p_0)^{\left\lfloor \frac{k}{|S|-1}\right\rfloor }\sum _{i=0}^\infty (1-p_0)^i=\frac{L(\underline{\lambda }+1)}{\underline{\lambda }p_0}(1-p_0)^{\left\lfloor \frac{k}{|S|-1}\right\rfloor }. \end{aligned}$$
(16)

It is then immediate that there exists a \(k_{\varepsilon }\) such that

$$\begin{aligned} \left| {{\mathbb E}}\left[ {\sum _{i=k_{\varepsilon }}^{i_e-1}\left( \widetilde{r}(s^\alpha _{T_{i}^\alpha },X_{T_{i}^\alpha },a^\alpha _{T_{i}^\alpha })+ \int _{T_{i}^\alpha }^{T_{i+1}^\alpha }r(s^\alpha _{T_{i}^\alpha },X_t,a^\alpha _{T_{i}^\alpha })\, \mathrm{d}t\right) }\right] \right| <\frac{\varepsilon }{6}. \end{aligned}$$

The same bound will apply to every \(\Gamma _n\).
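Numerically, such a \(k_\varepsilon \) can be read off directly from (16); the sketch below does this by brute force, with all parameter values being hypothetical and only meant to illustrate the computation.

```python
def k_epsilon(L, lam_low, p0, S_size, eps):
    """Smallest k for which the right-hand side of (16),
    L*(lam_low + 1)/(lam_low*p0) * (1 - p0)**floor(k/(S_size - 1)),
    drops below eps/6."""
    const = L * (lam_low + 1.0) / (lam_low * p0)
    k = 0
    while const * (1.0 - p0) ** (k // (S_size - 1)) >= eps / 6.0:
        k += 1
    return k

# Hypothetical numbers, only meant to show the order of magnitude involved.
print(k_epsilon(L=10.0, lam_low=0.5, p0=0.05, S_size=6, eps=0.1))   # -> 1025
```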

Then, since for any \(\alpha \) each \(\tau _i^\alpha \) is stochastically dominated by an exponentially distributed random variable \(\underline{\tau }_i^\alpha \) with intensity \(\underline{\lambda }\), for any \(T>0\) we can conclude as follows:

$$\begin{aligned} \mathbb {P}\left[ \sum _{i=0}^{k_\varepsilon }\tau _i^\alpha>T\right] \le \mathbb {P}\left[ \sum _{i=0}^{k_\varepsilon }\underline{\tau }_i^\alpha >T\right] . \end{aligned}$$

Since the \(\tau _i^\alpha \) are independent for different i, we can assume the same about the \(\underline{\tau }_i^\alpha \). Then \(\sum _{i=0}^{k_\varepsilon }\underline{\tau }_i^\alpha \) is Gamma-distributed with shape parameter \(k_\varepsilon +1\) and rate \(\underline{\lambda }\), hence the probability that it exceeds T converges to 0 as T goes to infinity. Thus, there exists a \(T_\varepsilon >0\) such that

$$\begin{aligned} \mathbb {P}\left[ \sum _{i=0}^{k_\varepsilon }\tau _i^\alpha >T_\varepsilon \right] <\frac{\underline{\lambda }p_0}{6L(\underline{\lambda }+1)}\varepsilon . \end{aligned}$$
(17)
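The choice of \(T_\varepsilon \) in (17) can likewise be illustrated numerically: the sketch below searches a crude grid for the first T at which the Gamma tail drops below the required threshold (we use SciPy, which parametrizes the Gamma distribution by a scale equal to the reciprocal of the rate); all numbers are hypothetical.

```python
from scipy.stats import gamma

def T_epsilon(k_eps, lam_low, L, p0, eps, step=0.5):
    """First point of a crude grid at which
    P[ Gamma(shape=k_eps + 1, rate=lam_low) > T ] < lam_low*p0*eps / (6*L*(lam_low + 1)),
    i.e. a valid choice of T_eps in (17)."""
    threshold = lam_low * p0 * eps / (6.0 * L * (lam_low + 1.0))
    T = 0.0
    while gamma.sf(T, a=k_eps + 1, scale=1.0 / lam_low) >= threshold:
        T += step                       # the grid step is arbitrary
    return T

# Hypothetical parameters, not derived from the examples in the paper.
print(T_epsilon(k_eps=50, lam_low=0.5, L=10.0, p0=0.05, eps=0.1))
```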

Consequently, by (16) and (17) the expected reward received by any player, either in model \(\Gamma \) or in any of the models \(\Gamma _n\), between time \(T_\varepsilon \) and the \(k_\varepsilon \)th jump of his individual state (whenever the latter occurs after \(T_\varepsilon \)) can be no more than

$$\begin{aligned} \frac{L(\underline{\lambda }+1)}{\underline{\lambda }p_0}\frac{\underline{\lambda }p_0}{6L(\underline{\lambda }+1)}\varepsilon =\frac{\varepsilon }{6}, \end{aligned}$$

which implies that the expected reward received by player \(\alpha \) until time \(\min \{ T_\varepsilon ,T^{\alpha }_{k_{\varepsilon }}\}\) in any of these models differs from the expected reward over his lifetime by at most \(\frac{\varepsilon }{3}\).

Now note that since \(f\in F_{c}\) and by Lipschitz continuity and boundedness of q and \(\lambda \), all of the intensities \(\sum _{s'\in S}X^{s'}\lambda (s',X)q(s|s',X,f(s',X))\), \(-X^s\lambda (s,X)\) are Lipschitz-continuous and bounded functions of X, and so by the Kurtz theorem (see Theorem 5.3 in [32]) if all the players except \(\alpha \) are using policy f,

$$\begin{aligned} {\mathbb P}[\sup _{0\le t\le T_\varepsilon }|\overline{X}_t[n]-X_t|\ge \delta ]\le De^{-nF(\delta )} \end{aligned}$$

for some positive constant D and a function F satisfying \(\lim _{\eta \searrow 0}\frac{F(\eta )}{\eta ^2}\in (0,\infty )\). By this last property, the probability bounded above converges to zero exponentially fast as n goes to infinity, so for n large enough, say \(n>N_\delta \),

$$\begin{aligned} {\mathbb P}[\sup _{0\le t\le T_\varepsilon }|\overline{X}_t[n]-X_t|\ge \delta ]\le \frac{\underline{\lambda }p_0}{3L(\underline{\lambda }+1)}\varepsilon \end{aligned}$$
(18)

for any given \(\delta >0\).
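The role played by the Kurtz theorem here can be illustrated by a small simulation: the sketch below lets all n players (for simplicity, including player \(\alpha \)) use a fixed strategy from \(F_c\), runs a Gillespie-type simulation of the individual chains with a hypothetical kernel and constant intensity, integrates the ODE for \(X_t\) in parallel, and reports \(\sup _{t\le T}|\overline{X}_t[n]-X_t|\), which should shrink as n grows. Everything in the snippet (state space, kernel, intensity, strategy) is an illustrative assumption, not part of the model above.

```python
import numpy as np

S = [0, 1, 2]                           # illustrative individual state space
LAM = 1.0                               # constant jump intensity (an assumption)
f = {0: 1, 1: 1, 2: 0}                  # a strategy in F_c: action depends on s only

def q_row(s, X, a):
    """Hypothetical kernel q(.|s, X, a): upward drift, dampened when much of the
    population already sits in the top state."""
    up = 0.6 * (1.0 - 0.5 * X[2]) if a == 1 else 0.2
    down = 0.3
    row = np.zeros(len(S))
    row[min(s + 1, 2)] += up
    row[max(s - 1, 0)] += down
    row[s] += 1.0 - up - down
    return row

def ode_step(X, dt):
    """One explicit Euler step of dX^s/dt = sum_s' X^s' LAM q(s|s',X,f(s')) - X^s LAM."""
    dX = -X * LAM
    for sp in S:
        dX += X[sp] * LAM * q_row(sp, X, f[sp])
    return X + dt * dX

def sup_distance(n, T, rng):
    """Gillespie-type simulation of n players using f; returns sup_t |Xbar_t[n] - X_t|."""
    states = rng.integers(0, len(S), size=n)
    counts = np.bincount(states, minlength=len(S)).astype(float)
    X = counts / n                      # start the ODE from the empirical distribution
    t, sup_dist = 0.0, 0.0
    while t < T:
        dt = rng.exponential(1.0 / (n * LAM))    # time until some player jumps
        X = ode_step(X, dt)
        t += dt
        i = rng.integers(n)                      # the player who jumps
        s_new = rng.choice(len(S), p=q_row(states[i], counts / n, f[states[i]]))
        counts[states[i]] -= 1.0
        counts[s_new] += 1.0
        states[i] = s_new
        sup_dist = max(sup_dist, np.abs(counts / n - X).max())
    return sup_dist

rng = np.random.default_rng(1)
for n in (50, 500, 5000):
    print(n, sup_distance(n, T=5.0, rng=rng))    # the gap should shrink with n
```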

Further, notice that \(r(s,X,g(s,X))\) and \(\widetilde{r}(s,X,g(s,X))\) are continuous on a compact domain, which by Heine’s theorem implies that they are uniformly continuous, so we can find a \(\delta >0\) such that for any \(X,X'\in \Delta (S)\),

$$\begin{aligned} |X-X'|<\delta \Longrightarrow \sup _{s\in S}|r(s,X,g(s,X))-r(s,X',g(s,X'))|<\frac{\varepsilon }{4T_\varepsilon } \end{aligned}$$
(19)

and

$$\begin{aligned} |X-X'|<\delta \Longrightarrow \sup _{s\in S}|\widetilde{r}(s,X,g(s,X))-\widetilde{r}(s,X',g(s,X'))|<\frac{\varepsilon }{4}. \end{aligned}$$
(20)
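The choice of \(\delta \) in (19)–(20) can be checked numerically on a finite grid of global states; the sketch below does this for placeholder reward functions. This is only a sanity check under stated assumptions, not a substitute for the uniform-continuity argument.

```python
import itertools
import numpy as np

def simplex_grid(dim, steps):
    """All points of Delta(S) whose coordinates are multiples of 1/steps."""
    for c in itertools.product(range(steps + 1), repeat=dim - 1):
        if sum(c) <= steps:
            yield np.array(list(c) + [steps - sum(c)]) / steps

def largest_safe_delta(r_of, rt_of, S, dim, eps, T_eps, steps=10):
    """Smallest distance at which some pair of grid points violates (19) or (20);
    any delta strictly below this value is consistent with both implications (on the grid)."""
    delta = np.inf
    for X, Xp in itertools.combinations(list(simplex_grid(dim, steps)), 2):
        dist = np.abs(X - Xp).max()
        if any(abs(r_of(s, X) - r_of(s, Xp)) >= eps / (4 * T_eps) for s in S) or \
           any(abs(rt_of(s, X) - rt_of(s, Xp)) >= eps / 4 for s in S):
            delta = min(delta, dist)
    return delta

# Placeholder reward maps s, X -> r(s, X, g(s, X)); purely illustrative.
print(largest_safe_delta(lambda s, X: s * X[0], lambda s, X: X[1] - X[0],
                         S=[0, 1, 2], dim=3, eps=0.1, T_eps=5.0))
```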

Then, let us fix a trajectory of \(\overline{X}_t[n]\) and define

where B is any Borel set on \(\mathbb {R}^+\), and analogously \(R[n](s,t)\) and \(Q[n](s',B|s,t)\) by replacing \(X_t\) in the above formulas with \(\overline{X}_t[n]\) whenever \(\sup _{t\in [0,T_\varepsilon ]}|\overline{X}_t[n]-X_t|<\delta \) (and doing nothing otherwise). Note that the measurability of the functions integrated in the above formulas is guaranteed by their continuity. In the case of \(R[n](s,t)\) and \(Q[n](s',B|s,t)\), even though the functions integrated there are not continuous, their domain \(S\times \mathbb {R}^+\) can be divided into a countable number of subsets of the form \(S\times [ \underline{t},\overline{t})\) on which they are continuous. These sets are obviously Borel, which guarantees the measurability of the functions.

If we combine (19) and (20) with the definitions of R[n] and Q[n], we obtain that for n large enough

$$\begin{aligned} |R(s,t)-R[n](s,t)|<2\left( \frac{\varepsilon }{4}+\frac{\varepsilon }{4T_\varepsilon }T_\varepsilon \right) =\varepsilon , \end{aligned}$$
(21)

which means that R[n] converges uniformly to R as n goes to infinity, uniformly both in the state \((s,t)\) and in the trajectory \(\overline{X}_t[n]\). Similarly, we can show that the density appearing in the definition of Q[n] converges uniformly to the one appearing in the definition of Q. Note that uniform convergence of the densities together with the uniform convergence and boundedness of the rewards implies that

$$\begin{aligned}&R[n](s_0,0)+\sum _{k=1}^{k_\varepsilon }\sum _{s\in S}\int _{\mathbb {R}^+}R[n](s,u)Q[n]^k(s,\mathrm{d}u|s_0,0)\nonumber \\&\quad \rightrightarrows R(s_0,0)+\sum _{k=1}^{k_\varepsilon }\sum _{s\in S}\int _{\mathbb {R}^+}R(s,u)Q^k(s,\mathrm{d}u|s_0,0), \end{aligned}$$
(22)

where the convergence is uniform with respect to both \(s_0\) and \(\overline{X}_t[n]\). Clearly, R and Q were constructed in such a way that the RHS of the above equation equals the expected reward received by player \(\alpha \) until time \(\min \{T_\varepsilon ,T^\alpha _{k_\varepsilon }\}\) in model \(\Gamma \). On the other hand, if we take the expected value of the LHS over all trajectories of \(\overline{X}_t[n]\), (16) and (18) imply that for \(n>N_\delta \) it will differ by at most

$$\begin{aligned} \frac{L(\underline{\lambda }+1)}{\underline{\lambda }p_0}\frac{\underline{\lambda }p_0}{3L(\underline{\lambda }+1)}\varepsilon =\frac{\varepsilon }{3} \end{aligned}$$
(23)

from the expected reward received by player \(\alpha \) until time \(\min \{ T_\varepsilon ,T^\alpha _{k_\varepsilon }\}\) in model \(\Gamma _n\).

Now, if we take n big enough, say bigger than \(N_1\), the two sides of (22) will differ by at most \(\frac{\varepsilon }{3}\), uniformly over all \(s_0\) and all trajectories \(\overline{X}_t[n]\). This, together with (23) and the fact that the expected reward until time \(\min \{T_\varepsilon ,T^\alpha _{k_\varepsilon }\}\) differs from that over the lifetime of a player by at most \(\frac{\varepsilon }{3}\), implies that the reward received by player \(\alpha \) in model \(\Gamma \) differs from that received in the models \(\Gamma _n\) for \(n>\max \{ N_\delta ,N_1\}\) by at most

$$\begin{aligned} \frac{\varepsilon }{3}+\frac{\varepsilon }{3}+\frac{\varepsilon }{3}=\varepsilon , \end{aligned}$$

which ends the proof.\(\square \)

Remark 8

The restriction of strategies to the set \(F_c\) may seem quite strong but, since at any fixed global state X a stationary strategy reduces to a mapping from S to A, it is enough to show the existence of approximate equilibria defined in a way analogous to how equilibria are defined for the mean-field game, which is a much weaker requirement than the definition of a Nash equilibrium. This is precisely what is done in the corollary below. On the other hand, note that the result presented in Theorem 3 can easily be generalized (in the sense that the proof will follow along the same lines as here) to Lipschitz-continuous randomized stationary strategies. However, as we have limited our considerations to pure strategies in most of the paper, we have decided to present this result in this weaker form.

To formulate the next result, which will link equilibria of mean-field game \(\Gamma \) with approximate equilibria of games \(\Gamma _n\), we need to introduce the following concept.

Definition 1

A stationary strategy f and a measure \(\mu \in \Delta (S)\) are in \(\varepsilon \)-weak equilibrium in \(\Gamma _n\), the semi-Markov n-person counterpart of the mean-field game with total reward, if \(\mu \) is a stationary global state corresponding to f and for every other stationary strategy \(g\in F\),

$$\begin{aligned} J(f,\overline{f},\rho )\ge \overline{J}(g,\overline{f},\rho )-\varepsilon , \end{aligned}$$

where \(\rho =q(\cdot |s_{0},\mu ,a_{0})\) is the distribution of individual states of newborn players when the global state is \(\mu \).
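Operationally, once the payoffs \(J(g,\overline{f},\rho )\) of unilateral deviations have been estimated (for instance by simulating \(\Gamma _n\); the estimator itself is not shown here), the \(\varepsilon \)-weak equilibrium condition reduces to the simple check sketched below with hypothetical numbers.

```python
def is_eps_weak_equilibrium(J_f, J_deviations, eps):
    """Check J(f, f_bar, rho) >= J(g, f_bar, rho) - eps for every candidate deviation g,
    given externally estimated values of the payoffs appearing in Definition 1."""
    return all(J_f >= J_g - eps for J_g in J_deviations)

# Hypothetical payoff estimates for f and for three candidate deviations.
print(is_eps_weak_equilibrium(J_f=3.10, J_deviations=[3.05, 3.12, 2.80], eps=0.05))  # True
```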

The following result is an immediate consequence of Theorems 1 and 3.

Corollary 1

Suppose that the total reward mean-field game \(\Gamma \) satisfies assumptions (A1–A6). Then for every \(\varepsilon >0\) and all n big enough, \((\underline{B}(\underline{X}),\underline{X})\) and \((\overline{B}(\overline{X}),\overline{X})\) are \(\varepsilon \)-weak equilibria in the n-person counterparts \(\Gamma _n\) of \(\Gamma \).

5 Conclusions

In this paper we have presented a model of a mean-field game in which each of infinitely many players controls his own continuous-time Markov chain of private states, while the global state follows an ordinary differential equation. We have made two main contributions. The first is the extension of this type of games to a novel model in which players, instead of maximizing a payoff accumulated over the entire game, maximize the reward obtained during their lifetimes, which may differ between players. Since any dead player can be replaced after some time by a newborn one, the system eventually exhibits stationary behavior, which is then used to define a mean-field-type equilibrium. We have provided an approximation result linking this new model with its n-player counterparts as n approaches infinity, under some very mild assumptions.

The second main contribution of the paper is an equilibrium-existence result for the mean-field game model discussed here under some strategic complementarity conditions. These assumptions differ significantly from those discussed in the mean-field game literature, as no conditions based on convexity or strict monotonicity of the functions defining our model are required. Instead, we assume properties implying that an increase in the states of most of the players makes an increase in any individual’s state profitable, and that an increase in one’s own state makes higher actions more profitable. This allows us to obtain the existence of equilibria in strategies with some monotonicity properties as well as the convergence of a myopic learning procedure. Importantly, it turns out that many real-life applications of mean-field models satisfy our strategic complementarity assumptions. However, the applicability of our results is limited mainly by two of our assumptions: the positivity of the reward functions and the one-dimensionality of the state space. It would be very interesting to see generalizations of our results that dispense with these two assumptions.