1 Introduction

We consider a finite number N of statistically equal agents, each moving on a finite set of states according to a continuous-time Markov Decision Process. Transition intensities of the agents and the generated rewards can be controlled and depend not only on the state and action of the agent itself, but also on the states of the other agents. Interactions of this kind are typical for a wide range of models in, e.g., biology, epidemics, finance, social science and queueing systems. The aim is to maximize the expected discounted reward of the system, i.e. the agents have to cooperate as a team. This can be implemented by a central controller who is able to observe the whole system and assigns actions to the agents. Though this system can itself be formulated as a continuous-time Markov Decision Process, the established solution procedures are not really practical since the state space of the system is complicated and of high cardinality. Thus, we consider the limit \(N\rightarrow \infty \) of the number of agents and analyze the connection between the limiting optimization problem, which is a deterministic control problem, and the N agents problem.

Investigations like this are well-known under the name mean-field approximation, because the mean dynamics of the agents can be approximated by differential equations for a measure-valued state process. This is inspired by statistical mechanics and can be done for different classes of stochastic processes for the agents. In our paper we restrict our investigation to continuous-time Markov chains (CTMC). Earlier, more practically oriented studies in this spirit with CTMC, but without control, are e.g. [1, 2], which use illustrative examples to discuss how the mean-field method is applied in different application areas. The convergence proof there is based on the law of large numbers for centred Poisson processes, see also [3]. The authors of [4] look at so-called reaction networks, which are chemical systems involving multiple reactions and chemical species. They take approximations of multiscale nature into account and show that ’slow’ components can be approximated by a deterministic equation. Reference [5] formulates some simple conditions under which a CTMC may be approximated by the solution to a differential equation, with quantifiable error probabilities, and presents several applications. Reference [6] explores models proposed for the analysis of BitTorrent P2P systems and provides arguments to justify the passage from the stochastic process, under adequate scaling, to a fluid approximation driven by a differential equation. A more recent application is given in [7] where a multi-type analogue of Kingman’s coalescent as a death chain is considered. The aim is to characterize the behaviour of the replicator coalescent started from an arbitrarily large initial population. This leads to a differential equation called the replicator equation. A control model similar to ours is considered in [8]. However, there the author uses a finite time horizon and solves the problem with HJB equations. This requires considerable technical overhead, like viscosity solutions, and more assumptions on the model data, like Lipschitz properties, which we do not need here.

A related topic is fluid models. Fluid models have been introduced in queueing network theory since there is a close connection between the stability of the stochastic network and the corresponding fluid model, [9]. They appear under ’fluid scaling’, where time in the CTMC for the stochastic queueing network is accelerated by a factor N and the state is compressed by a factor 1/N. Fluid models have also been used to approximate the optimal control in these networks, see e.g. [10,11,12,13,14]. In [15] different time scales are treated for the approximation and some components may be replaced by differential equations. However, none of these fluid models features a mean-field interaction.

There are also investigations about controlled mean-field Markov decision processes and their limits in discrete time. An early paper is [16] where the mean-field limit for an increasing number of agents is considered in a model where only the central controller is allowed to choose one action. However, in order to get a continuous limit the authors have to interpolate and rescale the original discrete-time processes. This makes assumptions on the transition probabilities necessary. The authors show the convergence of the scaled value functions and derive asymptotically optimal strategies. The recent papers [17,18,19,20] discuss the convergence of value functions and asymptotically optimal policies in discrete time. In contrast to our paper they allow for a common noise. The limit problem is then a controlled stochastic process in discrete time.

Another strand of literature considers continuous-time mean-field games on a finite number of states [21,22,23,24,25]. These papers, among other things, consider the construction of asymptotically optimal Nash equilibria from a limiting equation. The exception is [24] where it is shown that any solution of the limiting game can be approximated by \(\epsilon _N\)-Nash equilibria in the N player game. However, all these papers deal with the convergence of the HJB equations which appear in the N player game to a limiting equation, called the Master equation ([26]), which is a deterministic PDE for the value function. This approach needs sufficient regularity of the value functions and many assumptions. Reference [25] considers the problem with common noise and reduces the mean-field equilibrium to a forward-backward system of (random) ordinary differential equations.

The contribution of our paper is first to establish and investigate the limit of the controlled continuous-time Markov decision processes. In contrast to previous literature, which works with the HJB equation, this point of view requires fewer assumptions; e.g. we do not need Lipschitz conditions on the model data. Second, we are also able to construct an asymptotically optimal strategy for the N agents model. Our model is general, has only a few, easy-to-check assumptions and allows for various applications. The advantage of our limiting optimization problem is that we can apply Pontryagin’s maximum principle easily, which is often more practical than deterministic dynamic programming. Further, we show that an optimal feedback policy in the deterministic problem does not necessarily imply an asymptotically optimal policy for the N agents problems. Third, we obtain a convergence rate in a straightforward way. Fourth, we can treat finite and infinite time horizons at the same time; there is essentially no difference. We restrict the presentation mainly to the infinite time horizon.

Our paper is organized as follows: In the next section we introduce our N agents continuous-time Markov decision process. The aim is to maximize the expected discounted reward of the system. In Sect. 3 we introduce a measure-valued simplification which is due to the symmetry properties of the problem and which reduces the cardinality of the state space. The convergence theorem for the number of agents tending to infinity can be found in Sect. 4. It is essentially based on martingale convergence arguments. In Sect. 5 we construct a sequence of asymptotically optimal strategies for the N agents model from the limiting model. We also show that different implementations may be possible and that a convergence rate of \(1/\sqrt{N}\) can be obtained. Finally, in Sect. 6 we discuss three applications. The first one is a machine replacement problem with many machines, see e.g. [27]. The second one is the spreading of malware, which is based on the classical SIR model for spreading infections, [16, 28]. The last example shows that one has to be careful with feedback policies.

2 The N Agents Continuous-Time Markov Decision Process

We consider a finite number N of statistically equal agents, each moving on a finite set of states S according to a continuous-time Markov Decision Process. The vector \({\textbf{x}}_t = (x_t^1,...,x_t^N)\in S^N\) describes the state of the system at time \(t\in [0,\infty )\), where \(x_t^k\) is the state of agent \(k=1,\dots ,N\). The action space of one agent is a compact Borel set A. The action space of the system is accordingly \(A^N\). We denote an action of the system by \({\textbf{a}} = (a^1,...,a^N)\in A^N\) where \(a^k\) is the action chosen by agent \(k=1,\ldots ,N\).

Let \(D(i)\subset A\) be the set of actions available for an agent in state \(i\in S\), which we again assume to be compact. Then the set of admissible actions for the system in state \({\textbf{x}} \in S^N\) is given by \({\textbf{D}}({\textbf{x}}):= D(x^1)\times \dots \times D(x^N)\subset A^N\). The set of admissible state-action combinations for one agent is denoted by \(D:= \{(i, a) \in S\times A \mid a \in D(i)\}\).

For the construction of the system state process we follow the notation of [29]. The state process of the system is defined on the measurable space \((\Omega ,\mathcal {F}):= \big ((S^N\times \mathbb {R}_+)^\infty ,\mathcal {B}((S^N\times \mathbb {R}_+)^\infty )\big ).\) We denote an element of \(\Omega \) by \(\omega =({\textbf {x}}_0,t_1,{\textbf {x}}_1,t_2,...)\). Now define

$$\begin{aligned} \tilde{{\textbf{X}}}_n&:\Omega \rightarrow S^N,\quad \tilde{{\textbf{X}}}_n(\omega ) = {\textbf {x}}_n,\quad n\in \mathbb {N}_0,\\ \tau _n&:\Omega \rightarrow \mathbb {R}_+,\quad \ \tau _n(\omega ) = t_n,\quad \ n\in \mathbb {N},\\ T_n&:= \sum _{k=1}^n \tau _k, \quad T_0 := 0. \end{aligned}$$

The controlled state process of the system is then given by

$$\begin{aligned} {\textbf {X}}_t:= \sum _{n\in \mathbb {N}_0} \mathbbm {1}_{\{T_n\le t< T_{n+1}\}} \tilde{{\textbf{X}}}_n,\qquad t\in [0,\infty ). \end{aligned}$$

The construction of the process can be interpreted as follows: The random variables \(\tau _n\) describe the sojourn times of the system in states \(\tilde{{\textbf{X}}}_{n-1}\). Based on the sojourn times, \(T_n\) describes the time of the n-th jump of the process and \(\tilde{{\textbf{X}}}_n\) the state of the process on the interval \([T_n,T_{n+1})\). By construction the continuous-time state process \(({\textbf{X}}_t)\) has piecewise constant càdlàg-paths and the embedded discrete-time process is \((\tilde{{\textbf{X}}}_n)\).

The system is controlled by policies. W.l.o.g. we restrict here to Markovian stationary policies. Further, we allow for randomized decisions, i.e. each agent can choose a probability distribution on A as its action. Hence a policy for the system is given by a collection of N stochastic kernels \(\pi (d{\textbf{a}}\mid {\textbf{x}}) = (\pi ^k(da\mid {\textbf{x}}))_{k=1,...,N}\), where

$$\begin{aligned} \pi ^k:S^N\times \mathcal {B}(A) \rightarrow [0,1], \quad ({\textbf{x}},\mathcal {A}) \mapsto \pi ^k(\mathcal {A}\mid {\textbf{x}})\qquad \text {(kernel for agent { k})}. \end{aligned}$$

\(\pi ^k(\mathcal {A}\mid {\textbf{x}})\) is the stochastic kernel (it is here considered as a relaxed control) with which agent k chooses an action, given the state \({\textbf{x}}\) of the system. Naturally, it should hold that the kernel is concentrated on admissible actions, i.e. \(\pi ^k(D(x^k) \mid {\textbf{x}})= 1\) for all agents \(k=1,...,N\).

The action process is thus defined by

$$\begin{aligned} \pi _t:= \sum _{n\in \mathbb {N}_0} \mathbbm {1}_{\{T_n< t\le T_{n+1}\}} \pi (\cdot \mid \tilde{{\textbf{X}}}_n),\qquad t\in [0,\infty ). \end{aligned}$$

In contrast to the state process, the action process has piecewise constant càglàd-paths. This means that a new decision can only be taken after a change of state has occurred. The general theory on continuous-time Markov decision processes states that the optimal policy can be found among the piecewise constant, deterministic, stationary policies. In particular, varying the action continuously on the interval \([T_n,T_{n+1})\) does not increase the value of the problem. Also randomization does not increase the value, but in view of the sections to come, we already allow for randomization (relaxation) here.

To prepare the description of the transition mechanism in our model, we define the empirical distribution of the agents over the states, i.e.

$$\begin{aligned} \mu [{\textbf{x}}]:= \frac{1}{N} \sum _{k=1}^N \delta _{x^k}. \end{aligned}$$

where \(\delta _{x^k}\) is the Dirac measure in point \(x^k\). The transition intensities for one agent are given by a signed kernel

$$\begin{aligned} q:S\times A\times \mathbb {P}(S)\times \mathcal {P}(S) \rightarrow \mathbb {R},\quad (i,a,\mu ,\Gamma )\mapsto q(\Gamma \mid i,a,\mu ) = \sum _{j\in \Gamma } q(\{j\}\mid i,a,\mu ). \end{aligned}$$

Here \( \mathbb {P}(S) \) is the set of all probability distributions on S and \( \mathcal {P}(S)\) is the power set of S. Note that the transition of an agent depends not only on its own state and action, but also on the empirical distribution of all agents over the states.

We make the following assumptions on q:

  1. (Q1)

    \(q(\{j\}|i,a,\mu )\ge 0\) for all \(i,j\in S,\ j\ne i, \ a \in D(i), \mu \in \mathbb {P}(S).\)

  2. (Q2)

    \(\sum _{j\in S} q(\{j\}|i,a,\mu )=0\) for all \((i,a)\in D, \ \mu \in \mathbb {P}(S).\)

  3. (Q3)

    \(\sup _{i,a,j,\mu } |q(\{j\}|i,a,\mu )|=: q_{max}<\infty .\)

  4. (Q4)

    \(\mu \mapsto q(\{j\}|i,a,\mu )\) is continuous w.r.t. weak convergence for all \(i,j\in S,\ a\in D(i).\)

  5. (Q5)

    \(a \mapsto q(\{j\}|i,a,\mu )\) is continuous for all \(i,j\in S,\ \mu \in \mathbb {P}(S).\)

Note that (Q3) follows from (Q4) and (Q5), but since it is important we list it here. Based on the transition intensities for one agent, the transition intensities of the system are given by

$$\begin{aligned} q(\{(x^1,\ldots ,x^{k-1},j,x^{k+1},\ldots ,x^N)\} |{\textbf{x}}, {\textbf{a}} ):= q(\{j\}|x^k,a^k,\mu [{\textbf{x}}]) \end{aligned}$$
(2.1)

for all \({\textbf{x}}\in S^N\), \({\textbf{a}} \in {\textbf{D}}({\textbf{x}})\), \(j\in S\), \(j\ne x^k\) and

$$\begin{aligned} q(\{{{\textbf{x}}}\}| {\textbf{x}}, {\textbf{a}}):= \sum _{k=1}^N q(\{ x^k\}|x^k,a^k,\mu [{\textbf{x}}]). \end{aligned}$$

All other intensities are zero. The intensity in Eq. (2.1) describes the transition of agent k from state \(x^k\in S\) to state \(j\in S\), while all other agents stay in their current state. Since only one agent can change its state at a time, this definition is sufficient to describe the transition mechanism of the system.

Further we set (in a relaxed sense) for a decision rule \(\pi ^k(da|{\textbf{x}})\)

$$\begin{aligned} q(\{(x^1,\ldots ,x^{k-1},j,x^{k+1},\ldots , x^N)\} |{\textbf{x}}, \pi ) = \int _A q(\{j\}|x^k,a,\mu [{\textbf{x}}])\pi ^k(da|{\textbf{x}}). \end{aligned}$$

Note that there is a slight abuse of notation here since we use the letter q both for the agent transition intensity and for the system transition intensity. It should always be clear from the context which one is meant.

The probability measure of the N agent process is now given by the following transition kernels

$$\begin{aligned} \mathbb {P}^\pi (\tau _n\le t, \tilde{{\textbf{X}}}_n \in B| \tilde{{\textbf{X}}}_{n-1}) = \int _0^t q(B| \tilde{{\textbf{X}}}_{n-1},\pi ) e^{s \cdot q (\{\tilde{{\textbf{X}}}_{n-1}\}| \tilde{{\textbf{X}}}_{n-1},\pi )}ds \end{aligned}$$

for all \(t\ge 0\) and \(B\in \mathcal {P}(S^N).\) In particular, the sojourn times \(\tau _n\) are exponentially distributed with parameter \(-q (\{\tilde{{\textbf{X}}}_{n-1}\}| \tilde{{\textbf{X}}}_{n-1},\pi )\) respectively. Note that by using this construction, the probability measure depends on the chosen policy. This construction is more convenient when the transition intensities are given. In case the system is described by transition functions and external noise it is easier to use a common probability space which does not depend on the policy. Of course these two points of view are equivalent.
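For intuition, this construction can be simulated directly: given the current configuration, the sojourn time is exponential with the total exit rate, and the jumping agent together with its target state is drawn proportionally to the individual intensities. The following minimal Python sketch illustrates this for a hypothetical two-state model; the intensity function q_agent and the policy used in the example are illustrative assumptions and not part of the model specification.

```python
import numpy as np

rng = np.random.default_rng(0)

def q_agent(j, i, a, mu):
    """Hypothetical single-agent intensity q({j}|i,a,mu) on S={0,1}:
    rate a from 0 to 1, rate 1+mu[0] from 1 to 0 (mean-field dependence)."""
    if i == 0 and j == 1:
        return a
    if i == 1 and j == 0:
        return 1.0 + mu[0]
    return 0.0

def simulate(x, policy, T, S=(0, 1)):
    """Simulate the N-agent process (X_t) up to time T under a stationary
    policy: policy(k, x, mu) returns the action of agent k."""
    x, t, path = list(x), 0.0, []
    while t < T:
        N = len(x)
        mu = np.array([np.mean([xi == s for xi in x]) for s in S])
        # all admissible single-agent transitions and their intensities
        rates = [(k, j, q_agent(j, x[k], policy(k, x, mu), mu))
                 for k in range(N) for j in S if j != x[k]]
        total = sum(r for _, _, r in rates)
        if total == 0:                      # absorbing configuration
            break
        t += rng.exponential(1.0 / total)   # sojourn time ~ Exp(total rate)
        k, j, _ = rates[rng.choice(len(rates), p=[r / total for _, _, r in rates])]
        x[k] = j                            # only one agent jumps at a time
        path.append((t, x.copy()))
    return path

# example: 10 agents, every broken/working agent always uses action a = 0.5
print(simulate([0] * 10, lambda k, x, mu: 0.5, T=2.0)[:3])
```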

Returning to the model’s control mechanism, keep in mind that the policy of an agent \(\pi ^k(da\mid {\textbf{x}})\) is allowed to depend on the state of the whole system, i.e. we assume that each agent has information about the position of all other agents. Therefore, we can interpret our model as a centralized control problem, where all information is collected and shared by a central controller.

The goal of the central controller is to maximize the social reward of the system. In order to implement this, we introduce the (stationary) reward function for one agent as

$$\begin{aligned} r:D\times \mathbb {P}(S)\rightarrow \mathbb {R},\qquad (i,a,\mu ) \mapsto r(i,a,\mu ), \end{aligned}$$

which does not only depend on the state and action of the agent, but also on the empirical distribution of the system. We make the following assumptions on the reward function:

  1. (R1)

    For all \((i,a)\in D\) the function \(\mu \mapsto \mu (i) r(i,a,\mu )\) is continuous w.r.t. weak convergence.

  2. (R2)

    For all \(i\in S\) and \(\mu \in \mathbb {P}(S)\) the function \(a \mapsto r(i,a,\mu )\) is continuous.

Since the set of admissible actions D(i) is compact, (R1) and (R2) imply that the following expression is bounded:

$$\begin{aligned} \sup _{(i,a)\in D, \ \mu \in \mathbb {P}(S)} |\mu (i) r(i,a,\mu )|<\infty . \end{aligned}$$
(2.2)

The (social) reward of the system is the average of the agents’ rewards

$$\begin{aligned} r({\textbf{x}},{\textbf{a}}):= \frac{1}{N} \sum _{k=1}^N r(x^k,a^k, \mu [{\textbf{x}}]), \end{aligned}$$
(2.3)

or, in a relaxed sense for a decision rule \(\pi ^k(da\mid {\textbf{x}})\)

$$\begin{aligned} r({\textbf{x}},\pi ):= \frac{1}{N} \sum _{k=1}^N \int _A r(x^k,a, \mu [{\textbf{x}}]) \pi ^k(da\mid {\textbf{x}}). \end{aligned}$$

The aim is now to find the social optimum, i.e. to maximize the joint expected discounted reward of the system over an infinite time horizon. For a policy \(\pi \), a discount rate \(\beta >0\) and an initial configuration \({\textbf{x}}\in S^N\) define the value function

$$\begin{aligned} V_\pi ({\textbf{x}})&= \mathbb {E}_{\textbf{x}}^\pi \Big [ \int _0^\infty e^{-\beta t} r({\textbf{X}}_t,\pi _t)dt\Big ] \nonumber \\ V({\textbf{x}})&= \sup _\pi V_\pi ({\textbf{x}}). \end{aligned}$$
(2.4)

We are not discussing solution procedures for this optimization problem here since we simplify it in the next section and present asymptotically optimal solution methods in Sect. 5.

3 The Measure-Valued Continuous-Time Markov Decision Process

As N gets larger, so does the state space \(S^N\), which makes the model increasingly complex and impractical to solve. Therefore, we seek some simplification. An obvious approach, which is common for this kind of model, is to exploit the symmetry of the system by capturing not the state of every single agent, but the relative or empirical distribution of the agents across the \(\vert S\vert \) states.

Thus, let \(\mu _t^N:= \mu [{\textbf{X}}_t]\) and define as new state space the set of all distributions which are empirical measures of N atoms

$$\begin{aligned} \mathbb {P}_N(S):= \{\mu \in \mathbb {P}(S)\mid \mu = \mu [{\textbf{x}}], \text { for } {\textbf{x}} \in S^N\}. \end{aligned}$$

The new state process \(\mu _t^N \) can be written as

$$\begin{aligned} \mu _t^N = \sum _{n\in \mathbb {N}_0} \mathbbm {1}_{\{T_n \le t < T_{n+1}\}}\mu [ \tilde{{\textbf{X}}}_n],\qquad t\in [0,\infty ). \end{aligned}$$

As action space take the \(\vert S\vert \)-fold Cartesian product \(\mathbb {P}(A)^{\vert S\vert }\) of \(\mathbb {P}(A)\). Hence, an action is given by \(\vert S\vert \) probability measures \(\alpha (d{\textbf{a}}) = (\alpha ^i(da))_{i\in S}\) with \(\alpha ^i(D(i)) = 1\). Here the i-th component indicates the distribution of the agents’ actions in state \(i\in S\). The set of admissible state-action combinations of the new model is given by \({\hat{D}}:= \mathbb {P}_N(S) \times \mathbb {P}(A)^{\vert S \vert }\).

For the policies we restrict again to Markovian, stationary policies given by a collection of \(\vert S \vert \) stochastic kernels \({\hat{\pi }}(d {\textbf{a}}|\mu )= ({\hat{\pi }}^i(da|\mu ))_{i\in S}\), where

$$\begin{aligned} {\hat{\pi }}^i:\mathbb {P}_N(S)\times \mathcal {B}(A) \rightarrow [0,1], \quad (\mu ,\mathcal {A}) \mapsto {\hat{\pi }}^i(\mathcal {A}\mid \mu )\qquad (\text {kernel for state}\,\, i). \end{aligned}$$

where \({\hat{\pi }}^i(D(i)\mid \mu )=1.\) In what follows we denote \({\tilde{\mu }}_n^N:= \mu [\tilde{{\textbf{X}}}_n]\). Then we can express the action process by setting

$$\begin{aligned} {\hat{\pi _t}}:= \sum _{n\in \mathbb {N}_0} \mathbbm {1}_{\{T_n < t \le T_{n+1}\}} {\hat{\pi }}(\cdot | {\tilde{\mu }}_n^N),\qquad t\in [0,\infty ). \end{aligned}$$
(3.1)

The transition intensities of the process \((\mu _t^N)_{t\ge 0}\) are given by

$$\begin{aligned} q(\{\mu ^{i\rightarrow j}\}| \mu ,\alpha )= N\mu (i) \int _A q(\{j\}| i,a,\mu ) \alpha ^i(da), \quad \mu \in \mathbb {P}_N(S), \alpha \in \mathbb {P}(A)^{\vert S \vert }, \end{aligned}$$
(3.2)

with \(\mu ^{i\rightarrow j}:= \mu -\frac{1}{N} \delta _{i}+\frac{1}{N} \delta _{j}\) for all \(i,j\in S, i\ne j\) if \(\mu (i)>0.\) This intensity describes the transition of one arbitrary agent in state \(i\in S\) to state \(j\in S\), while all other agents stay in their current state. Note that the intensity follows from the usual calculations for continuous-time Markov chains, in particular from the fact that if X, Y are independent random variables with \(X\sim Exp(\lambda ), Y\sim Exp(\nu ),\) then \(X\wedge Y \sim Exp(\lambda +\nu ).\) In the situation in Eq. (3.2) we have \(N\mu (i)\) agents in state i. Further we set for all \(\mu \in \mathbb {P}_N(S)\) and \(\alpha \in \mathbb {P}(A)^{\vert S\vert }\)

$$\begin{aligned} q(\{\mu \}| \mu ,\alpha ):= -\sum _{i, \mu (i)>0}\sum _{j\ne i}q(\{\mu ^{i\rightarrow j}\}| \mu ,\alpha ). \end{aligned}$$

All other intensities are zero, since again only one agent can change its state at a time.

The probability distribution of the measure-valued process under a fixed policy \({\hat{\pi }}\) is now given by the following transition kernels

$$\begin{aligned} \mathbb {P}^{{\hat{\pi }}}(\tau _n\le t, {\tilde{\mu }}_n^N \in B| {\tilde{\mu }}_{n-1}^N) = \int _0^t q(B| {\tilde{\mu }}_{n-1}^N,{\hat{\pi }}) e^{s\cdot q (\{{\tilde{\mu }}_{n-1}^N\}| {\tilde{\mu }}_{n-1}^N,{\hat{\pi }})}ds \end{aligned}$$

for all \(t\ge 0\) and \(B\subset \mathbb {P}_N(S)\) measurable, where the random variables \((\tau _n)\) are the same as before.

The reward function of the system is derived from the reward for one agent:

$$\begin{aligned} r(\mu ,\alpha ):= \sum _{i\in S} \int r(i,a,\mu ) \alpha ^i(da) \mu (i). \end{aligned}$$

In view of Eq. (2.2) \(r(\mu ,\alpha )\) is bounded. The aim in this model is again to maximize the joint expected discounted reward of the system over an infinite time horizon. For a policy \(\hat{\pi }\), a discount rate \(\beta >0\) and an initial configuration \(\mu \in \mathbb {P}_N(S)\) define the value function

$$\begin{aligned} V_{{\hat{\pi }}}^N(\mu )&= \mathbb {E}_{\mu }^{{\hat{\pi }}}\Big [ \int _0^\infty e^{-\beta t} r(\mu _t^N,{\hat{\pi _t}})dt\Big ] \nonumber \\ V^N(\mu )&= \sup _{{\hat{\pi }}} V_{{\hat{\pi }}}^N(\mu ). \end{aligned}$$
(3.3)

We can now show that both formulations in Eqs. (2.4) and (3.3) are equivalent in the sense that the optimal values are the same. Of course, an optimal policy in the measure-valued setting can directly be implemented in the original problem. The advantage of the measure-valued formulation is the reduction of the cardinality of the state space. Suppose for example that \(S=\{0,1\}\), i.e. all agents are either in state 0 or state 1. Then \(|S^N|=2^N\) in the original formulation whereas \(|\mathbb {P}_N(S)|=N+1\) in the second formulation. A proof of the next theorem can be found in the appendix.

Theorem 3.1

It holds that \(V({\textbf{x}})=V^N(\mu )\) for \(\mu =\mu [{\textbf{x}}]\) for all \({\textbf{x}}\in S^N.\)

Remark 3.2

It is possible to extend the previous result to a situation where reward and transition intensity both also depend on the empirical distribution of actions, see e.g. [18]. However, due to the definition of the Young topology which we use later it is not possible to transfer the convergence results to this setting.

The problem we have introduced is a classical continuous-time Markov Decision Process and can be solved with the established theory accordingly. Thus, we obtain:

Theorem 3.3

There exists a continuous function \(v:\mathbb {P}_N(S)\rightarrow \mathbb {R}\) satisfying

$$\begin{aligned} \beta v(\mu ) = \sup _{\alpha \in \mathbb {P}(A)^{|S|}} \left\{ r(\mu ,\alpha ) + \int v(\nu ) q(d\nu |\mu ,\alpha ) \right\} \end{aligned}$$

for all \(\mu \in \mathbb {P}_N(S)\) and there exists a maximizer \({\hat{\pi }}(\cdot |\mu )\) of the r.h.s. such that \(v=V^N\) and \({\hat{\pi }}\) determines the optimal policy by Eq. (3.1).

The theorem follows from Theorem 4.6, Lemma 4.4 in [30] or Theorem 3.1.2 in [29].

Theorem 3.3 implies a solution method for problem (3.3). It can e.g. be solved by value or policy iteration. However, as already discussed, even in this simplified setting, the computation may be inefficient if N is large, since this leads to a large state space.
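To make this concrete, the fixed point equation of Theorem 3.3 can be solved by value iteration after uniformization: adding \(\Lambda v(\mu )\) on both sides for a constant \(\Lambda \) dominating all exit rates turns the equation into a contraction with modulus \(\Lambda /(\beta +\Lambda )\). The following Python sketch does this for a hypothetical two-state model with Dirac (non-relaxed) actions on a grid; the functions q01, q10 and reward are illustrative assumptions and not the data of a particular application.

```python
import numpy as np
from itertools import product

# hypothetical model data on S={0,1}, A=[0,1] (illustrative only)
def q01(a, mu):          # intensity 0 -> 1
    return 0.5 + 0.5 * mu[1]          # mean-field effect: more agents in 1, higher rate
def q10(a, mu):          # intensity 1 -> 0 (controlled)
    return 2.0 * a
def reward(a0, a1, mu):  # system reward r(mu, alpha) for Dirac actions (a0, a1)
    return mu[0] - 0.5 * a1 * mu[1]

def value_iteration(N, beta=1.0, grid=np.linspace(0, 1, 6), tol=1e-6):
    mus = [np.array([(N - m) / N, m / N]) for m in range(N + 1)]  # P_N(S), m agents in state 1
    Lam = 4.0 * N                       # uniformization constant >= total jump rate
    v = np.zeros(N + 1)
    while True:
        v_new = np.empty_like(v)
        for m, mu in enumerate(mus):
            best = -np.inf
            for a0, a1 in product(grid, grid):          # Dirac (non-relaxed) actions
                up = N * mu[0] * q01(a0, mu) if m < N else 0.0    # m -> m+1
                down = N * mu[1] * q10(a1, mu) if m > 0 else 0.0  # m -> m-1
                # uniformized one-step expectation of v
                ev = (up * v[min(m + 1, N)] + down * v[max(m - 1, 0)]
                      + (Lam - up - down) * v[m]) / Lam
                best = max(best, (reward(a0, a1, mu) + Lam * ev) / (beta + Lam))
            v_new[m] = best
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new

print(value_iteration(N=20)[:5])
```

For \(S=\{0,1\}\) the state space has only \(N+1\) elements, but the uniformization constant, and with it the number of iterations needed, grows linearly in N, which illustrates the computational burden for large N.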

4 Convergence of the State Process

In this section we discuss the behaviour of the system when the number of agents tends to infinity. In this case we obtain a deterministic limit control model which serves as an asymptotic upper bound for our optimization problem with N agents. Moreover, an optimal control of the limit model can be used to establish a sequence of asymptotically optimal policies for the N agents model.

In what follows we consider \((\mu _t^N)\) as a stochastic element of \(D_{\mathbb {P}_N(S)}[0,\infty )\), the space of càdlàg paths with values in \(\mathbb {P}_N(S)\) equipped with the Skorokhod \(J_1\)-topology and metric \(d_{J_1}.\) On \(\mathbb {P}_N(S)\) we choose the total variation metric \(\Vert \cdot \Vert _{TV}\).

Further, we consider \({\hat{\pi }}^i\) as a stochastic element in \(\mathcal {R}:= \{\rho :\mathbb {R}_+\rightarrow \mathbb {P}(A)\ |\ \rho \text{ measurable }\}\) endowed with the Young topology (cf. [31]). It is possible to show that \(\mathcal {R}\) is compact and metrizable. Measurability and convergence in \(\mathcal {R}\) can be characterized as in Lemma 4.1. These statements follow directly from the fact that the Young topology is the coarsest topology such that the mappings

$$\begin{aligned} \rho \mapsto \int _0^\infty \int _A \psi (t,a) \rho _t(da)dt \end{aligned}$$

are continuous for all real functions \(\psi \) on \(\mathbb {R}_+\times A\) which are Carathéodory functions, i.e. \(\psi \) is continuous in a and measurable in t, and which are integrable in the sense that \(\int _0^\infty \sup _a |\psi (t,a)|dt <\infty .\)

Lemma 4.1

  1. (a)

    \(\rho :\mathbb {R}_+\rightarrow \mathbb {P}(A)\) is measurable if and only if \(\rho \) is a transition probability from \(\mathbb {R}_+\) into A.

  2. (b)

    Let \(\rho ^n,\rho \in \mathcal {R}.\) \(\rho ^n\rightarrow \rho \) for \(n\rightarrow \infty \) if and only if

    $$\begin{aligned} \int _0^\infty \int _A \psi (t,a) \rho _t^n(da)dt \rightarrow \int _0^\infty \int _A \psi (t,a) \rho _t(da)dt \end{aligned}$$

    for all measurable functions \(\psi :\mathbb {R}_+\times A\rightarrow \mathbb {R}\) such that \(a\mapsto \psi (t,a)\) is continuous for all \(t\ge 0\) and \(\int _0^\infty \sup _a |\psi (t,a)|dt <\infty .\)

In a first step we define for \(N\in \mathbb {N}\), a fixed policy \({\hat{\pi }}^{N}\) and arbitrary \(j\in S\), the one-dimensional process

$$\begin{aligned} M_t^N(j)&:=\mu _t^N(j)-\mu _0^N(j)-\int _0^t \sum _{\nu \in \mathbb {P}_N(S)} (\nu (j)-\mu _s^N(j)) q (\{\nu \}|\mu _s^N,{\hat{\pi _s}})ds. \end{aligned}$$

Then \((M_t^N(j))\) are martingales w.r.t. the filtration \(\mathcal {F}_t^N = \sigma (\mu _s^N,s\le t).\) This follows from the Dynkin formula, see e.g. [31], Proposition 14.13. Next we can express the process \((M_t^N(j))\) a bit more explicitly. Note that the difference \(\nu (j)-\mu _s^N(j)\) can either be \(-1/N\) if an agent changes from state j to a state \(k\ne j\) or it could be 1/N if an agent changes from state \(i\ne j\) to state j. Since by (Q2)

$$\begin{aligned} \sum _{k\ne j} \int q (\{k\}|j,a,\mu _s^N) {\hat{\pi }}^{N,j}_s(da) = - \int q (\{j\}|j,a,\mu _s^N) {\hat{\pi }}_s^{N,j}(da) \end{aligned}$$
(4.1)

we obtain by inserting the intensity in Eq. (3.2) and by using Eq. (4.1)

$$\begin{aligned} M_t^N(j)=&\mu _t^N(j)-\mu _0^N(j)-\int _0^t \sum _{k\ne j} -\frac{1}{N} N \mu _s^N(j) \int q (\{k\}| j,a,\mu _s^N){\hat{\pi }}^{N,j}_s(da)ds\nonumber \\&-\int _0^t \sum _{i\ne j} \frac{1}{N} N \mu _s^N(i)\int q (\{j\}|i,a,\mu _s^N) {\hat{\pi }}_s^{N,i}(da)ds\nonumber \\ =&\mu _t^N(j)-\mu _0^N(j)-\int _0^t \sum _{i\in S} \mu _s^N(i)\int q (\{j\}|i,a,\mu _s^N) {\hat{\pi }}_s^{N,i}(da)ds. \end{aligned}$$
(4.2)

With this representation we can prove that the sequence of stochastic processes \((M^N(j))\) converges weakly (denoted by \(\Rightarrow \)) in the Skorokhod \(J_1\)-topology to the zero process. The proof of this lemma together with the proof of the next theorem can be found in the appendix.

Lemma 4.2

We have for all \(j\in S \) that

$$\begin{aligned} (M_t^N(j))_{t\ge 0} \Rightarrow 0, \quad N\rightarrow \infty . \end{aligned}$$
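As a numerical sanity check (not needed for the proofs), one can simulate the measure-valued chain for an uncontrolled two-state model with hypothetical intensities and evaluate the representation (4.2) along the path; the observed supremum of \(|M_t^N(1)|\) shrinks as N grows, in line with Lemma 4.2. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(1)

# hypothetical uncontrolled two-state intensities (illustrative only)
def q(j, i, mu):                          # single-agent intensity q({j}|i,mu)
    if i == 0 and j == 1: return 1.0
    if i == 1 and j == 0: return 1.0 + mu[0]
    return 0.0

def sup_martingale_deviation(N, T=5.0):
    """Simulate (mu_t^N) and return the sup over jump times of |M_t^N(1)|, cf. (4.2)."""
    mu = np.array([1.0, 0.0])             # all agents start in state 0
    t, comp, M_max = 0.0, 0.0, 0.0        # comp = integral term in (4.2) for j=1
    while t < T:
        up = N * mu[0] * q(1, 0, mu)      # system rate: one agent jumps 0 -> 1
        down = N * mu[1] * q(0, 1, mu)    # system rate: one agent jumps 1 -> 0
        drift = mu[0] * q(1, 0, mu) - mu[1] * q(0, 1, mu)   # uses (Q2) for the diagonal
        total = up + down
        tau = min(rng.exponential(1.0 / total), T - t)
        comp += drift * tau               # mu is constant on the sojourn interval
        t += tau
        if t >= T:
            break
        step = np.array([-1.0, 1.0]) if rng.random() < up / total else np.array([1.0, -1.0])
        mu = mu + step / N                # exactly one agent changes its state
        M_max = max(M_max, abs(mu[1] - comp))   # mu_0^N(1) = 0 here
    return M_max

for N in (10, 100, 1000, 10000):
    print(N, round(sup_martingale_deviation(N), 4))
```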

Next we show that an arbitrary state-action process sequence is relatively compact which implies the existence of converging subsequences.

Theorem 4.3

A sequence of arbitrary state-action processes \((\mu ^N, {\hat{\pi }}^{N})_N\) is relatively compact. Thus, there exists a subsequence \((N_k)\) which converges weakly

$$\begin{aligned} (\mu ^{N_k},{\hat{\pi }}^{N_k}) \Rightarrow (\mu ,{\hat{\pi }}), \text{ for } k\rightarrow \infty . \end{aligned}$$

Moreover, the limit \((\mu ,{\hat{\pi }})\) satisfies

  1. (a)

    \((\mu _t) \) has a.s. continuous paths,

  2. (b)

    and for each component j we have

    $$\begin{aligned} \mu _t(j) = \mu _0(j) + \int _0^t \sum _{i\in S} \mu _s(i) \int q(\{j\}|i,a,\mu _s) {\hat{\pi }}^{i}_s(da)ds. \end{aligned}$$

5 The Deterministic Limit Model

Consider the following deterministic optimization problem:

$$\begin{aligned} (F)\qquad&\sup _{{\hat{\pi }}} \int _0^\infty e^{-\beta t} r(\mu _t,{\hat{\pi _t}}) dt,\\ &s.t.\ \mu _0\in \mathbb {P}(S),\; {\hat{\pi }}^{i}_{t} \in \mathbb {P}(A),\; {\hat{\pi }}^{i}_{t} (D(i))=1, \\ &\mu _t(j) = \mu _0(j) + \int _0^t \sum _{i\in S} \mu _s(i) \int q(\{j\}|i,a,\mu _s) {\hat{\pi }}_s^{i}(da)ds, \quad \forall t\ge 0,\ j=1,\ldots ,\vert S\vert . \end{aligned}$$

Note that the theory of continuous-time Markov processes implies that \(\mu _t\) is automatically a distribution. Hence one of the \(\vert S\vert \) differential equations in (F) may be skipped. Also note that when the transition intensity and the reward are linear in the action, relaxation of the control is unnecessary. We denote the maximal value of this problem by \(V^F(\mu _0).\) We show next that this value provides an asymptotic upper bound for the value of problem (3.3).
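For a fixed open-loop control, the constraint in (F) is simply an ODE for the measure flow and can be integrated by a standard Euler scheme. A minimal sketch, again with hypothetical two-state intensities (an illustrative assumption, not the data of a particular application):

```python
import numpy as np

# hypothetical two-state intensities q({j}|i,a,mu) on S={0,1}, A=[0,1]
def q(j, i, a, mu):
    rates = {0: 0.5 + 0.5 * mu[1],   # intensity for leaving state 0 (to state 1)
             1: 2.0 * a}             # intensity for leaving state 1 (to state 0)
    if j != i:
        return rates[i]
    return -rates[i]                 # diagonal entry, so that (Q2) holds

def flow(mu0, control, T=5.0, dt=1e-3, S=(0, 1)):
    """Forward Euler for the ODE constraint in (F) under a Dirac open-loop
    control: control(t) returns the action used in each state at time t."""
    mu, path = np.array(mu0, float), []
    for k in range(int(T / dt)):
        t = k * dt
        a = control(t)                              # a[i] = action applied in state i
        rhs = np.array([sum(mu[i] * q(j, i, a[i], mu) for i in S) for j in S])
        mu = mu + dt * rhs                          # mass is conserved by (Q2)
        path.append((t + dt, mu.copy()))
    return path

# example: act with a=1 in state 1 after time 2, do nothing before
mu_path = flow([1.0, 0.0], lambda t: (0.0, 0.0) if t < 2 else (0.0, 1.0))
print(mu_path[-1])
```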

Theorem 5.1

For all \((\mu ^N_0) \subset \mathbb {P}_N(S), \mu _0\in \mathbb {P}(S)\) with \(\mu _0^N \Rightarrow \mu _0\) and for all sequences of policies \((\hat{\pi }^N_t)\) we have

$$\begin{aligned} \limsup _{N\rightarrow \infty } V^N_{{\hat{\pi }^N_t}} (\mu _0^N) \le V^F(\mu _0). \end{aligned}$$

Proof

According to Theorem 4.3 we can choose a subsequence \((N_k)\) of corresponding state and action processes such that

$$\begin{aligned} (\mu ^{N_k},{\hat{\pi }}^{N_k}) \Rightarrow (\mu ,{\hat{\pi }}), \text{ for } k\rightarrow \infty . \end{aligned}$$

For convenience we still denote this sequence by (N). We show that

$$\begin{aligned} \lim _{N\rightarrow \infty } V^N_{{{\hat{\pi }}^{N}}}(\mu _0^N)&= \lim _{N\rightarrow \infty } \mathbb {E}\Big [ \int _0^\infty e^{-\beta t} r(\mu _t^N, {\hat{\pi }}^{N}_{t})dt\Big ]\\&= \mathbb {E}\Big [ \int _0^\infty e^{-\beta t} r(\mu _t,{\hat{\pi _t}})dt\Big ] \le V^F(\mu _0). \end{aligned}$$

The last inequality is true due to the fact that by Theorem 4.3 the limit process \( (\mu ,{\hat{\pi }})\) satisfies the constraints of problem (F).

Let us show the second equality. We obtain by bounded convergence (r is bounded)

$$\begin{aligned}&\lim _{N\rightarrow \infty } \mathbb {E}\Big [ \int _0^\infty e^{-\beta t} r(\mu _t^N, {\hat{\pi }}^{N}_{t})dt\Big ]= \mathbb {E}\Big [ \int _0^\infty e^{-\beta t} \lim _{N\rightarrow \infty }r(\mu _t^N,{\hat{\pi }}^{N}_{t})dt\Big ]. \end{aligned}$$

Further we have

$$\begin{aligned}&\left| \int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A r(i,a,\mu _t^N) {\hat{\pi }}_t^{N,i}(da) \mu _t^N(i)dt\right. \\&\quad \left. - \int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A r(i,a,\mu _t) {\hat{\pi }}_t^{i}(da) \mu _t(i)dt\right| \\ \le&\left| \int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A r(i,a,\mu _t^N) {\hat{\pi }}_t^{N,i}(da) \mu _t^N(i)dt \right. \\&\quad \left. - \int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A r(i,a,\mu _t^{}) {\hat{\pi }}_t^{N,i}(da) \mu _t(i)dt\right| \\ +&\left| \int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A r(i,a,\mu _t^{}) {\hat{\pi }}_t^{N,i}(da) \mu _t(i)dt \right. \\&\quad \left. -\int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A r(i,a,\mu _t) {\hat{\pi }}_t^{i}(da) \mu _t(i)dt\right| . \end{aligned}$$

The second expression tends to zero for \(N\rightarrow \infty \) due to the definition of the Young topology and the fact that \(a\mapsto r(i,a,\mu )\) is continuous by (R2). The first expression can be bounded from above by

$$\begin{aligned}&\int _0^\infty e^{-\beta t} \sum _{i\in S}\int _A \left| r(i,a,\mu _t^N) \mu _t^N(i) - r(i,a,\mu _t^{})\mu _t(i) \right| {\hat{\pi }}_t^{N,i}(da) dt \\&\quad \le \int _0^\infty e^{-\beta t} \sum _{i\in S} \sup _{a\in D(i)} \left| r(i,a,\mu _t^N) \mu _t^N(i) - r(i,a,\mu _t)\mu _t(i) \right| dt \end{aligned}$$

which also tends to zero for \(N\rightarrow \infty \) due to (R1), (R2), Lemma 7.1 and dominated convergence. Thus, the statement follows. \(\square \)

On the other hand we are now able to construct a strategy which is asymptotically optimal in the sense that the upper bound in the previous theorem is attained in the limit. Suppose that \((\mu ^*,{\hat{\pi }}^*)\) is an optimal state-action trajectory for problem (F). Then we can consider for the N agents problem the strategy

$$\begin{aligned} {\hat{\pi }}_t^{N,i}:= {\hat{\pi }}^{*,i}_t \end{aligned}$$

which applies at time t the kernel \({\hat{\pi }}^{*,i}_t\) irrespective of the state \(\mu _t^N\) the process is in. In other words, the considered strategy is deterministic in time (open-loop) and not a feedback policy.

Theorem 5.2

Suppose \({\hat{\pi }}^*\) is an optimal strategy for (F) where the corresponding differential equation in (F) has a unique solution and let \((\mu ^N_0) \subset \mathbb {P}_N(S)\) be such that \(\mu _0^N \Rightarrow \mu _0\in \mathbb {P}(S)\). Then if we use strategy \({\hat{\pi }}^*\) for problem (3.3) for any N we obtain

$$\begin{aligned} {\lim _{N\rightarrow \infty }} V_{{\hat{\pi }}^*}^N(\mu _0^N) =V^F(\mu _0). \end{aligned}$$

Thus, we call \({\hat{\pi }}^*\) asymptotically optimal.

Proof

First note that \({\hat{\pi }}^*\) is an admissible policy for any N. Further let \((\mu _t^N)\) be the corresponding state process when N agents are present. Since the corresponding differential equation in (F) has a unique solution, every subsequence \((N_k)\) is such that

$$\begin{aligned} \mu ^{N_k} \Rightarrow \mu ^*, \text{ for } k\rightarrow \infty \end{aligned}$$

holds (Theorem 4.3). Using the same arguments as in the last proof we obtain

$$\begin{aligned}&{\lim _{N\rightarrow \infty } \mathbb {E}\Big [ \int _0^\infty e^{-\beta t} r(\mu _t^{N},{\hat{\pi }}_t^*)dt\Big ]}= \mathbb {E}\Big [ \int _0^\infty e^{-\beta t} r(\mu _t^*,{\hat{\pi }}_t^*)dt\Big ] = V^F(\mu _0). \end{aligned}$$

Together with the previous theorem, the statement is shown. \(\square \)

Remark 5.3

  1. (a)

    In order to guarantee the unique solvability, it is sufficient to assume Lipschitz continuity for \(\mu \mapsto q(\{j\}|i,a,\mu )\). More precisely, instead of (Q4) we have to assume (Q4’) which is given below. The proof follows from the Theorem of Picard-Lindelöf. Example 5.4 shows what may happen if the differential equation for \((\mu _t)\) in (F) has multiple solutions.

  2. (b)

Note that the construction of asymptotically optimal policies which we present here works in the same way when we consider control problems with finite time horizon, i.e. instead of Eq. (3.3) we consider

    $$\begin{aligned}&\sup _{{\hat{\pi }}} \mathbb {E}_{\mu }^{{\hat{\pi }}}\Big [ \int _0^T e^{-\beta t} r(\mu ^{N}_t,{\hat{\pi _t}})dt+g(\mu ^{N}_T)\Big ] \end{aligned}$$
    (5.1)

    with possibly a terminal reward \(g(\cdot )\) for the final state. In this case (F) is given with a finite time horizon

    $$\begin{aligned}&\quad \quad \sup _{{\hat{\pi }}} \int _0^T e^{-\beta t} r(\mu _t,{\hat{\pi _t}}) dt + g(\mu _T)\nonumber \\&\quad \quad s.t.\ \mu _0\in \mathbb {P}(S),\; {\hat{\pi }}^{i}_{t} \in \mathbb {P}(A),\; {\hat{\pi }}^{i}_{t}(D(i))=1, \nonumber \\&\quad \quad \hspace{0.7cm} \mu _t(j) = \mu _0(j) + \int _0^t \sum _{i\in S} \mu _s(i) \int q(\{j\}|i,a,\mu _s) {\hat{\pi }}_s^{i}(da)ds, \nonumber \\&\quad \forall t\in [0,T], j=1,\ldots ,|S|. \end{aligned}$$
    (5.2)

    Theorem 5.2 holds accordingly.

  3. (c)

General statements about the existence of optimal controls in (F) can only be made under additional assumptions. A classical result is the Theorem of Filippov-Cesari (see [32] Theorem 8 in Chapter II.8 for the finite time horizon problem and Theorem 15 in Chapter III.7 for the infinite horizon problem). It states the existence of an optimal control (for the finite horizon problem) under the following assumptions:

    1. (i)

There exist admissible pairs \(({\hat{\pi }},\mu )\) (for example by assuming Lipschitz continuity as in (a)),

    2. (ii)

      A is closed and bounded (which we assume here)

    3. (iii)

      \(\mu \) is bounded for all controls (which we have here)

    4. (iv)

      For fixed \(\mu \) the set \(\{(r(\mu ,\alpha )+\gamma , f_1(\mu ,\alpha )), \gamma \le 0, \alpha \in A\}\) is convex where \(f_1\) is the r.h.s. of the differential equation in (F).

  4. (d)

    Suppose we obtain for problem (F) an optimal feedback rule \({\hat{\pi _t}} (\cdot )= {\hat{\pi }}(\cdot |\mu _t).\) If \(\mu \mapsto {\hat{\pi }}(\cdot |\mu )\) is continuous, this feedback rule is also asymptotically optimal for problem Eq. (3.3). The proof can be done in the same way as before. If the mapping is not continuous, the convergence may not hold (see application 6.3).

  5. (e)

    Natural extensions of our model that we have not included in the presentation are resource constraints. For example the total sum of fractions of a certain action may be limited, i.e. we restrict the set \(\mathbb {P}(A)^{|S|}\) by requiring that \(\sum _{i\in S} {\hat{\pi _t^i}}(\{a^0\}|\mu )\le c < |S|\) for a certain action \(a^0\in A.\) As long as the constraint yields a compact subset of \(\mathbb {P}(A)^{|S|}\) our analysis also covers this case.

Example 5.4

In this example we discuss what may happen if the differential equation for \((\mu _t)\) in (F) has multiple solutions. Suppose the state space is \(S=\{1,2\}\) and the system is uncontrolled. State 1 is absorbing, i.e. \(q(\{1\}|1,\mu )=q(\{2\}|1,\mu )=0\) (since the system is uncontrolled we skip the action from the notation). So agents can only change from state 2 to 1. The intensity of such a change is

$$\begin{aligned} q(\{1\}|2,\mu )= \left\{ \begin{array}{cc} \frac{(\mu (1))^\frac{1}{3}}{1-\mu (1)}, &{} \text{ if } \mu (1) \le 0.99 \\ \frac{0.99^\frac{1}{3}}{0.01} &{} \text{ if } \mu (1) \ge 0.99. \end{array}\right. \end{aligned}$$

Intensities are bounded and continuous. Since the two probabilities satisfy \(\mu _t(1)+\mu _t(2)=1\) we can concentrate on \(\mu _t(1).\) The differential equation for \(\mu _t(1)\) in (F) is

$$\begin{aligned} \mu '_t(1)= \mu _t(1)q(\{1\}|1,\mu _t)+(1-\mu _t(1)) q(\{1\}|2,\mu _t) = (\mu _t(1))^\frac{1}{3} \end{aligned}$$

as long as \(\mu _t(1) \le 0.99.\) If \(\mu _0(1)=0\), there are two solutions of this initial value problem: \(\mu _t(1)\equiv 0\) and \(\mu _t(1)=(\frac{2}{3} t)^\frac{3}{2}\) for \(\mu _t(1) \le 0.99.\) Now consider the following sequence \((\mu _0^N):\) For N even we set \(\mu _0^N=(0,1)\) (all N agents start in state 2), for N odd we set \(\mu _0^N=(1/N,(N-1)/N)\) (exactly one agent starts in state 1). Obviously \((\mu _0^N)\Rightarrow (0,1).\) However, when we consider the even subsequence we obtain \(\mu _t^N(1)\equiv 0\) since the intensity to change from 2 to 1 remains 0. The odd subsequence converges to the second solution \(\mu _t(1)=(\frac{2}{3} t)^\frac{3}{2}\) as long as \(\mu _t(1)\) is below 0.99. Thus, when we skip the assumption of a unique solution in Theorem 5.2 we only obtain \(\limsup _{N\rightarrow \infty } V_{{\hat{\pi }}^*}^N(\mu _0^N) \le V^F(\mu _0)\), see Theorem 5.1.
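The dichotomy can be reproduced by simulating the uncontrolled chain with the above intensity for the two initial configurations. A sketch (the horizon T=1 used below is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)

def rate_2_to_1(mu1):     # q({1}|2, mu) from Example 5.4
    return mu1 ** (1 / 3) / (1 - mu1) if mu1 <= 0.99 else 0.99 ** (1 / 3) / 0.01

def simulate(N, one_agent_in_1, T=1.0):
    """Return the fraction mu_T^N(1) for the uncontrolled chain of Example 5.4."""
    m = 1 if one_agent_in_1 else 0          # number of agents in state 1
    t = 0.0
    while t < T and m < N:
        mu1 = m / N
        total = N * (1 - mu1) * rate_2_to_1(mu1)   # rate of one 2 -> 1 transition
        if total == 0:                              # mu_t^N(1) stays 0 forever
            return 0.0
        t += rng.exponential(1.0 / total)
        if t < T:
            m += 1
    return m / N

for N in (100, 10000):
    print(N, simulate(N, True), simulate(N, False), "ODE solution:", (2 / 3) ** 1.5)
```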

Fig. 1

Colourful lines: State trajectories \(\mu _t^N(1)\) for \(N = 100\) (red) and \(N=10000\) (green) agents in Example 5.4 when one agent starts in state 1. Black line: Deterministic limit process \(\mu _t(1) = (\frac{2}{3}t)^\frac{3}{2}\) (Color figure online)

Under stricter assumptions it is possible to prove that the rate of convergence in the finite horizon problem Eq. (5.1) is \(1/\sqrt{N}\). In order to obtain this rate we need Lipschitz conditions on the reward function and the intensity functions. More precisely assume

  1. (R1’)

    For all \((i,a)\in D\) there exists a uniform constant \(L_1>0\) s.t.

    $$\begin{aligned} |r(i,a,\mu ) -r(i,a,\nu )| \le L_1 \Vert \mu -\nu \Vert _{TV}, \quad |g(\mu )-g(\nu )| \le L_1 \Vert \mu -\nu \Vert _{TV} \end{aligned}$$

    for all \(\mu ,\nu \in \mathbb {P}(S).\)

  2. (Q4’)

    For all \((i,a)\in D, j\in S\) there exists a uniform constant \(L_2>0\) s.t.

    $$\begin{aligned} |q(\{j\} |i,a,\mu ) -q(\{j\}| i,a,\nu )| \le L_2 \Vert \mu -\nu \Vert _{TV} \end{aligned}$$

    for all \(\mu ,\nu \in \mathbb {P}(S).\)

Denote by \({\hat{\pi }}^*\) the optimal control of the limiting problem (5.2), \(V^{F,T}(\mu _0)\) the corresponding value and let

$$\begin{aligned} V_{{\hat{\pi }}^*}^{N,T}(\mu _0^N):= \mathbb {E}_{{\mu _0^N}}^{{\hat{\pi }}^*}\Big [ \int _0^T e^{-\beta t} r(\mu ^{N}_t,{\hat{\pi }}_t^*)dt+g(\mu ^{N}_T)\Big ]. \end{aligned}$$

Then we can state the following convergence rate.

Theorem 5.5

In the finite horizon setting under assumption (Q1)–(Q5) with (Q4) replaced by (Q4’) and (R1’), (R2), suppose that \(\mathbb {E}\left[ \Vert \mu ^{N}_0-\mu _0\Vert _{TV}\right] \le \frac{L_0}{\sqrt{N} }\) for a constant \(L_0>0\). Then

$$\begin{aligned} \Big | V_{{\hat{\pi }}^*}^{N,T}(\mu _0^N) - V^{F,T}(\mu _0)\Big |\le \frac{{\tilde{L}}}{\sqrt{N} } \end{aligned}$$

for a constant \({\tilde{L}}>0\) which is independent of N, but depends on T.

The statement about the convergence rate can be extended to the infinite horizon problem when the discount factor is large enough. Also note that \(\mathbb {E}\left[ \Vert \mu ^{N}_0-\mu _0\Vert _{TV}\right] \le \frac{L_0}{\sqrt{N} }\) is satisfied if e.g. the states of the N agents are sampled i.i.d. from \(\mu _0.\)
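The assumption on the initial configurations is easy to check by simulation. The following sketch estimates \(\mathbb {E}\left[ \Vert \mu ^{N}_0-\mu _0\Vert _{TV}\right] \) for i.i.d. sampling from a hypothetical \(\mu _0\) (an illustrative choice) and shows that the scaled error \(\sqrt{N}\,\mathbb {E}\left[ \Vert \mu ^{N}_0-\mu _0\Vert _{TV}\right] \) stays roughly constant in N.

```python
import numpy as np

rng = np.random.default_rng(3)
mu0 = np.array([0.5, 0.3, 0.2])           # hypothetical initial distribution on S

def mean_tv_error(N, reps=2000):
    """Monte Carlo estimate of E||mu_0^N - mu_0||_TV for i.i.d. sampling from mu_0."""
    err = 0.0
    for _ in range(reps):
        counts = rng.multinomial(N, mu0)                  # N agents sampled i.i.d.
        err += 0.5 * np.abs(counts / N - mu0).sum()       # total variation distance
    return err / reps

for N in (10, 100, 1000, 10000):
    print(N, round(mean_tv_error(N) * np.sqrt(N), 3))     # roughly constant in N
```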

A direct implementation of policy \(\hat{\pi }^*\) in the problem Eq. (3.3) might make it necessary to update the policy continuously. This can be avoided by using the following policy instead. We assume here that \(t\mapsto {\hat{\pi }}^*_t\) is piecewise continuous. Thus, let \((t_n)_{n\in \mathbb {N}}\) be the discontinuity points in time of \(\hat{\pi }^*\) and define the set

$$\begin{aligned} \{ T_n^N, n\in \mathbb {N}\} \cup \{ t_n, n\in \mathbb {N}\} =: \{{\tilde{T}}_1^N< {\tilde{T}}_2^N <\ldots \} \end{aligned}$$

where \(T_n^N\) describes the time of the n-th jump of the N agents process. Then \(({\tilde{T}}_n^N)\) is the ordered sequence of the time points in this set. Define

$$\begin{aligned} \pi _t^{N,*}:= \sum _{n=0}^\infty {\hat{\pi }}_{{\tilde{T}}_n^N}^*\mathbbm {1}_{[{\tilde{T}}_n^N, {\tilde{T}}_{n+1}^N)}(t), \qquad {\tilde{T}}_0^N:= 0. \end{aligned}$$
(5.3)

The idea of the action process \(\pi _t^{N,*}\) is to adapt it to \(\hat{\pi }^*\) only when an agent changes its state or when \(\hat{\pi }^*\) has a jump, and to keep it constant otherwise. It can be shown that this sequence of policies is also asymptotically optimal.
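In a simulation, Eq. (5.3) can be implemented by keeping track of the update times and freezing the applied control in between. The following sketch is one possible implementation (the bang-bang control in the usage example is hypothetical):

```python
import bisect

def make_piecewise_policy(pi_star, discontinuities):
    """Implement Eq. (5.3): the applied action is pi_star frozen at the most recent
    update time, where update times are the agents' jump times and the discontinuity
    points of pi_star (a sketch; pi_star(t) is the limit-optimal control, any callable)."""
    jump_times = [0.0]                   # tilde T_0^N := 0
    disc = sorted(discontinuities)

    def register_jump(t):                # call whenever an agent changes its state
        bisect.insort(jump_times, t)

    def action(t):                       # the frozen action pi_t^{N,*}
        last_jump = jump_times[bisect.bisect_right(jump_times, t) - 1]
        last_disc = max([s for s in disc if s <= t], default=0.0)
        return pi_star(max(last_jump, last_disc))

    return register_jump, action

# usage sketch: a hypothetical bang-bang control switching at t = 2
register_jump, action = make_piecewise_policy(lambda t: 0.0 if t < 2 else 1.0, [2.0])
register_jump(1.3); register_jump(2.7)
print(action(1.5), action(2.2), action(3.0))   # 0.0 1.0 1.0
```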

Theorem 5.6

Suppose \({\hat{\pi }}^*\) is a piecewise continuous optimal strategy for (F) where the corresponding differential equation in (F) has a unique solution and let \((\mu ^N_0) \subset \mathbb {P}_N(S)\) be such that \(\mu _0^N \Rightarrow \mu _0\in \mathbb {P}(S)\). Then if we use the strategy \(( \pi _t^{N,*})\) of Eq. (5.3) for problem (3.3) for any N we obtain

$$\begin{aligned} {\lim _{N\rightarrow \infty }} V_{{\hat{\pi }}^{N,*}}^N(\mu _0^N) =V^F(\mu _0). \end{aligned}$$

Proof

In light of the proof of Theorem 5.2 it is enough to show that \(\pi ^{N,*} \Rightarrow \pi ^*.\) Indeed, the convergence can be shown \(\mathbb {P}\)-a.s. Now \((\pi ^{N,*})\) converges in the \(J_1\)-topology to \(\pi ^*\) on \([0,\infty )\) if and only if \((\pi ^{N,*})|_{[0,T]}\), the restriction to [0, T], converges in the finite \(J_1\)-topology to the restriction \(\pi ^*|_{[0,T]}\) for all T which are continuity points of the limit function (see [33] Sect. 16, Lemma 1). Since \({\hat{\pi }}^*\) is piecewise continuous we can consider the convergence on each compact interval of the partition separately. Indeed we have if \(t\in [{\tilde{T}}_n^N, {\tilde{T}}_{n+1}^N] \)

$$\begin{aligned} \vert \vert \pi _t^{N,*} - {\hat{\pi }}_t^{*}\vert \vert _{TV} \le \sup _{s\in [{\tilde{T}}_n^N, {\tilde{T}}_{n+1}^N]}\vert \vert \hat{\pi }_s^{*} - {\hat{\pi }}_t^{*}\vert \vert _{TV}. \end{aligned}$$

Since \(t\mapsto {\hat{\pi }}^*_t\) is continuous on this interval and since all \(|{\tilde{T}}_{n+1}^N-{\tilde{T}}_{n}^N|\) converge to zero uniformly as \(N\rightarrow \infty \) (the jump intensity increases with N), the right hand side converges to zero uniformly in t as \(N\rightarrow \infty \), which implies the statement. \(\square \)

Remark 5.7

Let us briefly discuss the main differences to [8] where a similar model is considered. In [8] the author considers a finite horizon problem where the model data is not necessarily stationary, i.e. reward and transition intensities may depend on time. Moreover, he solves the corresponding optimization problems (N-agents and limit problem) via HJB equations. This requires the notion of viscosity solutions and more regularity assumptions in terms of Lipschitz continuity of reward and transition intensities. Using the MDP perspective, we can state our solution theorem for the N agents problem (in form of a Bellman equation) and the convergence result under weaker continuity conditions. For the convergence to hold we use randomized policies whereas in [8] the author sticks to deterministic policies throughout. The obtained convergence rates under Lipschitz assumptions are the same, whereas our proof is simpler and more direct. In [8] the problem is further discussed under stronger assumptions. In contrast we present some applications next in order to show how to use the results of the previous sections.

6 Applications

In this section we discuss two applications of the previously derived theorems and one example which shows that state processes under feedback policies do not necessarily have to converge. More precisely we construct in two applications asymptotically optimal strategies for stochastic N agents systems from the deterministic limit problem (F). The advantage of our problem (F) in contrast to the master equation is that it can be solved with the help of Pontryagin’s maximum principle which gives necessary conditions for an optimal control and is in many cases easier to apply than dynamic programming. For examples see [10, 11, 13, 34] and for the theory see e.g. [32, 35].

6.1 Machine Replacement

The following application is a simplified version of the deterministic control problem in [27]. A mean-field application can be found in [36]. Suppose a company has N statistically equal machines. Each machine can either be in state 0=’working’ or in state 1=’broken’, thus \(S=\{0,1\}.\) Two actions are available: 0=’do nothing’ or 1=’repair’, thus \(A=\{0,1\}\). A working machine does not need repair, so \(D(0)=\{0\}.\) The transition rates are as follows: A working machine breaks down with fixed rate \(\lambda _{wb}>0\). A broken machine which gets repaired changes to the state ’working’ with rate \(\lambda _{bw}>0\). Thus, we can summarize the transition rates of one machine by

$$\begin{aligned} q(\{1\} | 0, 0, \mu _t^N) = \lambda _{wb},&q(\{0\} | 1, a_t, \mu _t^N) = \lambda _{bw} \delta _{\{a_t=1\}}. \end{aligned}$$

The diagonal elements of the intensity matrix are given by

$$\begin{aligned} q(\{0\} | 0, 0, \mu _t^N) = -\lambda _{wb},&q(\{1\} | 1, a_t, \mu _t^N) = -\lambda _{bw} \delta _{\{a_t=1\}}, \end{aligned}$$

and all other intensities are zero. Obviously (Q1)–(Q5) are satisfied. The initial state of the system is \(\mu _0^N=(1,0)\), i.e. all machines are working in the beginning. Each working machine produces a reward rate \(g>0\) whereas we have to pay a fixed cost of \(C>0\) when we have to call the service for repair, i.e.

$$\begin{aligned} r(i,a,\mu _t^N)= g \delta _{\{i=0\}}-C \delta _{\{a=1\}}\delta _{\{i=1\}} \frac{1}{1-\mu _t^N(0)}. \end{aligned}$$

Hence we obtain an interaction of the agents in the reward. Note that (R1), (R2) are satisfied. This yields the reward rate for the system

$$\begin{aligned} r(\mu _t^N,{\hat{\pi _t}}) = g \mu _t^N(0)-C(1-{\hat{\pi }}^1_t(\{0\}|\mu _t^N)). \end{aligned}$$

Thus, problem (F) in this setting is given by (we denote the limit by \((\mu _t(0),\mu _t(1))=:(\mu _t^0,1-\mu _t^0)\) and let \(\alpha _t^0:= {\hat{\pi }}_t^1(\{0\}|\mu _t)\)):

$$\begin{aligned} (F)&\quad \quad \sup _{(\alpha _t)} \int _0^T g\cdot \mu _t^0-C\cdot (1-\alpha _t^0)dt,\\&\quad \quad s.t.\ \text{ for } \text{ all } t\in [0,T]\\&\quad \quad \hspace{0.7cm} \mu _t^0 = 1 + \int _0^t \lambda _{bw}(1-\mu ^0_s)(1-\alpha _s^0) -\lambda _{wb}\mu _s^0ds. \end{aligned}$$

We briefly explain how to solve this problem using Pontryagin’s maximum principle. The Hamiltonian function to (F) is given by

$$\begin{aligned} H(\mu _t^0,\alpha _t^0,p_t,t)&= g\mu _t^0 -C(1-\alpha _t^0)+p_t (\lambda _{bw}(1-\mu _t^0)(1-\alpha _t^0) -\lambda _{wb}\mu _t^0)\\&= (1-\alpha _t^0)(\lambda _{bw} p_t(1-\mu _t^0)-C) +g\mu _t^0-\lambda _{wb} p_t\mu _t^0 \end{aligned}$$

where \((p_t)\) is the adjoint function. Pontryagin’s maximum principle yields the following sufficient conditions for optimality ( [32, 35]):

Lemma 6.1

The control \((\alpha _t^{0,*})\) with the associated trajectory \((\mu _t^{0,*})\) is optimal for (F) if there exists a continuous and piecewise continuously differentiable function \((p_t)\) such that for all \(t>0\):

  1. (i)

    \(\alpha _t^{0,*}\) maximizes \(\alpha \mapsto H(\mu _t^0,\alpha ,p_t,t)\) for \(\alpha \in [0,1],\)

  2. (ii)

    \(\dot{p}_t = -g+p_t(\lambda _{wb}+\lambda _{bw}(1-\alpha _t^0))\) at those points where \(p_t\) is differentiable,

  3. (iii)

    \(p(T)=0.\)

Inspecting the Hamiltonian it is immediately clear from (i) that the optimal control is essentially ’bang-bang’. For a numerical illustration we solved (F) for the parameters \(C=1, g=2, \lambda _{wb}=1, \lambda _{bw}=2\) and \(T=4.\) Here it is optimal to do nothing until time point \(t^*=\ln {2}.\) Then it is optimal to repair the fraction \(1-\alpha ^{0,*}=1/2\) of the broken machines, which keeps the fraction of working machines at 1/2. Finally, \(\ln {2}\) time units before the end, we again do nothing and wait until the end of the time horizon. A numerical illustration of the optimal trajectory \(\mu _t^{0,*}\) of the deterministic problem together with simulated paths under this policy for different numbers N can be found in Fig. 2, left. A number of different simulations for \(N=1000\) are shown in Fig. 2, right. The simulated paths are quite close to the deterministic trajectory.
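A simulation of this kind can be sketched in a few lines: each working machine breaks with rate \(\lambda _{wb}\), each broken machine is repaired with rate \(\lambda _{bw}\) times the current repair probability of the open-loop control, and, in the spirit of Eq. (5.3), the rates are refreshed at jump times and at the discontinuity points of the control. This is a minimal Python sketch, not the code used for the figures:

```python
import numpy as np

rng = np.random.default_rng(4)
lam_wb, lam_bw, T, t_star = 1.0, 2.0, 4.0, np.log(2)
switches = [t_star, T - t_star]       # discontinuity points of the optimal control

def repair_prob(t):
    """Open-loop policy from (F): probability of repairing a broken machine
    at time t, i.e. 1 - alpha_t^0 (do nothing / singular arc / do nothing)."""
    return 0.5 if t_star <= t < T - t_star else 0.0

def simulate_fraction_working(N):
    """Simulate the N machines under the open-loop policy and return mu_T^N(0).
    Rates are re-drawn at jump times and at policy discontinuities, cf. (5.3)."""
    working, t = N, 0.0
    while t < T:
        rate_break = lam_wb * working
        rate_repair = lam_bw * repair_prob(t) * (N - working)
        total = rate_break + rate_repair
        tau = rng.exponential(1.0 / total) if total > 0 else T - t
        next_switch = min([s for s in switches if s > t], default=T)
        if t + tau >= min(next_switch, T):        # no jump before switch / horizon
            t = min(next_switch, T)               # memorylessness: restart the clock
            continue
        t += tau
        working += -1 if rng.random() < rate_break / total else 1
    return working / N

for N in (100, 1000, 10000):
    print(N, simulate_fraction_working(N), "deterministic limit at T:", 0.25)
```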

Fig. 2

Left: State trajectories for different numbers N of machines executing the optimal control for (F). Right: Ten state trajectories for \(N = 1000\) machines executing the asymptotically optimal control (Color figure online)

The optimal value in the deterministic model is \(V^F(1,0) = \frac{9}{2}-\frac{3}{2} \ln (2) \approx 3.4603\). If we simulate the trajectory of the state process ten times for \(N = 1000\) machines while following the asymptotically optimal policy and take the average of the respective values, we obtain a mean of 3.43612, which is slightly less than the value for (F), cp. Theorem 5.1.

6.2 Spreading Malware

This example is based on the deterministic control model considered in [28], see resp. [16], and treats the propagation of a virus in a mobile wireless network. It is based on the classical SIR model by Kermack-McKendrick, [37]. Suppose there are N devices in the network. A device can be in one of the following states: Susceptible (S), Infective (I), Dead (D) or Recovered (R). A device is in the susceptible state if it is not contaminated yet, but prone to infection. A device is infective if it is contaminated by the virus. It is dead if the virus has destroyed the software and recovered if the device already has a security patch which makes it immune to the virus. The states D and R are absorbing. The joint process \(\mu _t^N=(S_t^N,I_t^N,D_t^N,R_t^N)\) is a controlled continuous-time Markov chain where \(X_t^N\) represents the fraction of devices in state \(X\in \{S,I,D,R\}\). The control is a strategy of the virus which chooses the rate \(a(t)\in [0,{\bar{a}}]\) at which infected devices are destroyed. In this model we have \(S_t^N+I_t^N+D_t^N+R_t^N=1\) and \(S_t^N,I_t^N,D_t^N,R_t^N\ge 0\). The transition rates of one device are as follows: A susceptible device gets infected with rate \(\lambda _{SI} I_t^N\) with \(\lambda _{SI} >0.\) The rate is proportional to the fraction of infected devices and we thus have an interaction of one agent with the empirical distribution of the others. It recovers with rate \(\lambda _{SR}>0\), which is the rate at which the security patch is distributed. An infected device gets killed by the virus with rate \(a(t)\in [0,{\bar{a}}]\) chosen by the attacker and recovers at rate \(\lambda _{IR}>0.\) The rates are shown in Fig. 3.

Fig. 3

Transition intensities of one device between the possible states

The intensities of one device at time t are summarized by

$$\begin{aligned} q(\{I\} | S, \cdot , \mu _t^N) = \lambda _{SI} I_t^N,&q(\{R\} | S, \cdot , \mu _t^N) = \lambda _{SR},\\ q(\{D\} | I, a_t, \mu _t^N) = a_t,&q(\{R\} | I, \cdot , \mu _t^N) = \lambda _{IR}. \end{aligned}$$

Thus, the diagonal elements of the intensity matrix are given by

$$\begin{aligned} q(\{S\} | S, \cdot , \mu _t^N)&= -\lambda _{SI} I_t^N-\lambda _{SR},\qquad q(\{I\} | I, a_t, \mu _t^N) = -a_t-\lambda _{IR}, \\ q(\{D\} | D, \cdot , \mu _t^N)&= q(\{R\} | R, \cdot , \mu _t^N) = 0 \end{aligned}$$

and all other intensities are zero. Note that (Q1)–(Q5) are satisfied and that since the intensities are linear in a, there is no need for a relaxed control. The initial state of the network is \(\mu _0^N=(S_0^N,I_0^N,D_0^N,R_0^N)=(1-I_0,I_0,0,0)\) with \(0<I_0<1.\) The aim of the virus is to produce as much damage as possible over the time interval [0, T], evaluated by

$$\begin{aligned} \mathbb {E}\left[ D_T^N + \frac{1}{T} \int _0^T (I^N_t)^2 dt\right] \end{aligned}$$

which is given when we choose \(r(i,a,\mu )=\frac{1}{T}(\mu (2))^2\) (the second component of \(\mu \) squared) and an appropriate terminal reward. (R1) and (R2) are satisfied. Thus, problem (F) in this setting is given by (we denote the limit by \(\mu _t=(S_t,I_t,D_t,R_t)\))

$$\begin{aligned} (F)\qquad&\sup _{(a_t)} D_T + \frac{1}{T} \int _0^T I^2_t dt,\\ &s.t.\; a_t \in [0,{\bar{a}}], \text{ and } \text{ for } \text{ all } t\in [0,T]\\ &S_t = 1-I_0 + \int _0^t -\lambda _{SI} I_sS_s-\lambda _{SR} S_s\,ds, \\ &I_t = I_0 + \int _0^t \lambda _{SI} I_sS_s-\lambda _{IR} I_s -a_s I_s\,ds, \\ &D_t = \int _0^t a_s I_s\,ds. \end{aligned}$$

A solution of this deterministic control problem can be found in [28]. It is shown there that a critical time point \(t_1\in [0,T]\) exists such that \(a_t=0\) on \(t\in [0,t_1]\) and \(a_t={\bar{a}}\) on \(t\in (t_1,T].\) Thus, the attacker is not destroying devices from the beginning because this lowers the number of devices which can get infected. Instead, she first waits to get more infected devices before setting the kill rate to a maximum.
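The switching structure can be checked numerically: for a bang-bang control with switching time \(t_1\) the ODE constraints of (F) are integrated by an Euler scheme and the objective is maximized over \(t_1\) on a grid. The sketch below assumes an initial infected fraction \(I_0=0.1\) for illustration (this value is not taken from the paper), so the resulting switching time need not coincide with the one reported for Fig. 4.

```python
import numpy as np

lam_SI, lam_SR, lam_IR, a_bar, T, I0 = 0.6, 0.2, 0.2, 1.0, 10.0, 0.1

def objective(t1, dt=1e-2):
    """Integrate the ODE constraints of (F) with the bang-bang control a_t = 0
    for t < t1 and a_t = a_bar afterwards; return D_T + (1/T) * int_0^T I_t^2 dt."""
    S, I, D, running = 1.0 - I0, I0, 0.0, 0.0
    for k in range(int(T / dt)):
        t = k * dt
        a = 0.0 if t < t1 else a_bar
        dS = -lam_SI * I * S - lam_SR * S
        dI = lam_SI * I * S - lam_IR * I - a * I
        dD = a * I
        running += I ** 2 * dt
        S, I, D = S + dS * dt, I + dI * dt, D + dD * dt
    return D + running / T

t_grid = np.linspace(0.0, T, 201)
best = max(t_grid, key=objective)
print("best switching time:", round(float(best), 2), "objective:", round(objective(best), 4))
```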

Fig. 4

State trajectories for \(N=1000\) devices under optimal control for \(\lambda _{SI}=0.6, \lambda _{SR}=\lambda _{IR}=0.2, \bar{a}=1, T=10.\) (Color figure online)

A numerical illustration can be found in Fig. 4. There we can see the trajectories of the optimal state distribution in (F) and simulated paths for \(N=1000\) devices for \(\lambda _{SI}=0.6, \lambda _{SR}=\lambda _{IR}=0.2, \bar{a}=1, T=10.\) The optimal time point for setting \(a_t\) to the maximum is here 4.9. The simulated paths are almost indistinguishable from the deterministic trajectories.

6.3 Resource Competition

This example shows that feedback policies in the deterministic problem are not necessarily asymptotically optimal when implemented in the N agents problem. The infinite horizon problem (F) could also be solved using an HJB equation which would provide (under sufficient regularity) a feedback control \({\hat{\pi }}(\cdot |\mu )\). I.e. we obtain the optimal control by \({\hat{\pi _t}} = {\hat{\pi }}(\cdot |\mu _t)\). This feedback function could also be used in the N agents model. However, in this case convergence of the N agents model to the deterministic model like in Theorem 5.2 is not guaranteed. Convergence may fail when discontinuities in the feedback function are present. The example is an adaptation of the queueing network considered in [38, 39] to our setting. Suppose the state space is given by \(S=\{1,2,3,4,5,6,7,8\}.\) Agents starting in state 1 change to state 2, then 3 and are finally absorbed in state 4. Agents starting in state 5 change to state 6, then 7 and are finally absorbed in state 8. The aim is to get the agents into the absorbing states as quickly as possible by activating the intensities in states 2, 3, 6 and 7. The intensity for leaving states 1 and 5 is \(\lambda _1=\lambda _5=1\), the full intensity for leaving states 2 and 6 is \(\lambda _2=\lambda _6=6\) and finally the full intensity for leaving states 3 and 7 is \(\lambda _3=\lambda _7=1.5.\) The action space is \(A=\{0,1\}\) where actions have to be taken in states 2, 3, 6 and 7 and determine the activation of the transition intensity. Action \(a=0\) means that the intensity is deactivated and \(a=1\) that it is fully activated. There is a resource constraint such that the sum of the activation probabilities in states 2 and 7 as well as the sum of the activation probabilities in states 3 and 6 are each constrained by 1 (see Remark 5.3(e)). When we denote the randomized control by \({\hat{\pi }}_t^2= a_t, {\hat{\pi }}_t^7= 1-a_t, {\hat{\pi }}_t^6= b_t, {\hat{\pi }}_t^3= 1-b_t\), \(a_t,b_t\in [0,1]\), then the intensities are given by

$$\begin{aligned} q(\{3\} | 2,a_t, \mu _t^N) = a_t \lambda _2,&\qquad q(\{4\} | 3, 1-b_t, \mu _t^N) = (1-b_t) \lambda _3,\\ q(\{7\} | 6,b_t, \mu _t^N) = b_t\lambda _6,&\qquad q(\{8\} | 7, 1-a_t, \mu _t^N) = (1-a_t)\lambda _7. \end{aligned}$$

An illustration of this model can be seen in Fig. 5.

Fig. 5

Transition intensities of one agent for the resource constraint problem

The initial state distribution is given by \(\mu _0=(\frac{5}{14},\frac{1}{14},\frac{1}{14},0, \frac{5}{14},\frac{1}{14},\frac{1}{14},0)\) where we assume for the simulation that we have \(N=1400\) agents. Now suppose further that agents in the absorbing states 4 and 8 produce no cost whereas agents in states 3 and 7 are the most expensive as soon as at least \(0.01\%\) of the population is present there. This optimization criterion leads to a priority rule where agents in state 3 receive priority (and thus full capacity) over those in state 6 (as long as at least \(0.01\%\) are present) and agents in state 7 receive priority (and thus full capacity) over those in state 2 (as long as at least \(0.01\%\) are present). In the deterministic problem the priority rule can be implemented such that once the fraction of agents in states 3 and 7 falls to the threshold of \(0.01\%\) of the population it is possible to keep this level. This is not possible in the N agents problem. There, the priority switching leads to blocking of the agents in the other line, see Fig. 6. The blue line shows the state trajectories in the deterministic model. The red line is a realization of the system for \(N=1400\) agents where we use the deterministic open-loop control of Theorem 5.2. We see that the state processes converge. Finally the green line is a realization of the \(N=1400\) agents model under the priority rule. We can see that here the state processes do not converge.

Fig. 6

State trajectories for \(N=1400\) agents. Deterministic trajectory (blue), realization under deterministic open loop (red), realization under feedback priority rule (green) (Color figure online)