CONTINUOUS-TIME MEAN FIELD MARKOV DECISION MODELS

We consider a finite number of N statistically equal individuals, each moving on a finite set of states according to a continuous-time Markov Decision Process. Transition intensities of the individuals and generated rewards depend not only on the state and action of the individual itself, but also on the states of the other individuals as well as the chosen action. Interactions like this are typical for a wide range of models in e.g. biology, epidemics, finance, social science and queueing systems among others. The aim is to maximize the expected discounted reward of the system, i.e. the individuals have to cooperate as a team. Computationally this is a difficult task when N is large. Thus, we consider the limit for N → ∞. In contrast to other papers we do not consider the so-called Master equation. Instead we define a 'limiting' (deterministic) optimization problem from the limiting differential equation for the path trajectories. This has the advantage that we need fewer assumptions and can apply Pontryagin's maximum principle in order to construct asymptotically optimal strategies. We show how to apply our results using two examples: a machine replacement problem and a problem from epidemics. We also show that optimal feedback policies are not necessarily asymptotically optimal.


Introduction
We consider a finite number of N statistically equal agents, each moving on a finite set of states according to a continuous-time Markov Decision Process. Transition intensities of the agents and generated rewards can be controlled and depend not only on the state and action of the agent itself, but also on the states of the other agents. Interactions like this are typical for a wide range of models in e.g. biology, epidemics, finance, social science and queueing systems among others. The aim is to maximize the expected discounted reward of the system, i.e. the agents have to cooperate as a team. This can be implemented by a central controller who is able to observe the whole system and assigns actions to the agents. Though this system itself can be formulated as a continuous-time Markov Decision Process, the established solution procedures are not really practical since the state space of the system is complicated and of high cardinality. Thus, we consider the limit N → ∞ when the number of agents tends to infinity and analyze the connection between the limiting optimization problem, which is a deterministic control problem, and the N agents problem.
Investigations like this are well-known under the name mean-field approximation, because the mean dynamics of the agents can be approximated by differential equations for a measure-valued state process. This is inspired by statistical mechanics and can be done for different classes of stochastic processes for the agents. In our paper we restrict our investigation to continuous-time Markov chains (CTMC). Earlier, more practical studies in this spirit with CTMC, but without control, are e.g. Bortolussi et al. (2013); Kolesnichenko et al. (2014), which consider illustrating examples to discuss how the mean-field method is used in different application areas. The convergence proof there is based on the law of large numbers for centred Poisson processes, see also Kurtz (1970). Ball et al. (2006) look at so-called reaction networks which are chemical systems involving multiple reactions and chemical species. They take approximations of multiscale nature into account and show that 'slow' components can be approximated by a deterministic equation. Darling and Norris (2008) formulate some simple conditions under which a CTMC may be approximated by the solution to a differential equation, with quantifiable error probabilities. They give different applications. Aspirot et al. (2011) explore models proposed for the analysis of BitTorrent P2P systems and provide the arguments to justify the passage from the stochastic process, under adequate scaling, to a fluid approximation driven by a differential equation. A more recent application is given in Kyprianou et al.
(2022) where a multi-type analogue of Kingman's coalescent as a death chain is considered. The aim is to characterize the behaviour of the replicator coalescent as it is started from an initial population that is arbitrarily large. This leads to a differential equation called the replicator equation. A similar control model to ours is considered in Cecchin (2021). However, there the author uses a finite time horizon and solves the problem with HJB equations. This requires considerable technical overhead like viscosity solutions and more assumptions on the model data like Lipschitz properties, which we do not need here.
A related topic are fluid models. Fluid models have been introduced in queueing network theory since there is a close connection between the stability of the stochastic network and the corresponding fluid model, see Meyn (1997). They appear under 'fluid scaling' where time in the CTMC for the stochastic queueing network is accelerated by a factor N and the state is compressed by the factor 1/N. Fluid models have also been used to approximate the optimal control in these networks, see e.g. Avram et al. (1995); Weiss (1996); Bäuerle (2000, 2002); Čudina and Ramanan (2011). In Yin and Zhang (2012) different scales of time are treated for the approximation and some components may be replaced by differential equations. But there is no mean-field interaction in any of these fluid models.
There are also investigations about controlled mean-field Markov decision processes and their limits in discrete time. An early paper is Gast et al. (2012) where the mean-field limit for an increasing number of agents is considered in a model where only the central controller is allowed to choose one action. However, in order to get a continuous limit the authors have to interpolate and rescale the original discrete-time processes. This implies the necessity for assumptions on the transition probabilities. The authors show the convergence of the scaled value functions and derive asymptotically optimal strategies. The recent papers Carmona et al. (2019); Motte and Pham (2022, 2023); Bäuerle (2023) discuss the convergence of value functions and asymptotically optimal policies in discrete time. In contrast to our paper they allow a common noise. The limit problem is then a controlled stochastic process in discrete time.
Another strand of literature considers continuous-time mean-field games on a finite number of states, see Gomes et al. (2013); Basna et al. (2014); Bayraktar and Cohen (2018); Cecchin and Fischer (2020); Belak et al. (2021). These papers among others consider the construction of asymptotically optimal Nash equilibria from a limiting equation. The exception is Cecchin and Fischer (2020) where it is shown that any solution of the limiting game can be approximated by ϵ_N-Nash equilibria in the N player game. However, all these papers deal with the convergence of the HJB equations which appear in the N player game to a limiting equation, called the Master equation (Cardaliaguet et al. (2019)), which is a deterministic PDE for the value function. This approach needs sufficient regularity of the value functions and many assumptions. Belak et al. (2021) consider the problem with common noise and reduce the mean field equilibrium to a system of forward-backward (random) ordinary differential equations.
The contribution of our paper is first to establish and investigate the limit of the controlled continuous-time Markov decision processes. In contrast to previous literature, which works with the HJB equation, this point of view requires fewer assumptions, e.g. we do not need Lipschitz conditions on the model data. Second, we are also able to construct an asymptotically optimal strategy for the N agents model. Our model is general, has only a few, easy to check assumptions and allows for various applications. The advantage of our limiting optimization problem is that we can apply Pontryagin's maximum principle easily, which is often more practical than deterministic dynamic programming. Further, we show that an optimal feedback policy in the deterministic problem does not necessarily imply an asymptotically optimal policy for the N agents problems. Third, we obtain a convergence rate in a straightforward way. Fourth, we can consider finite and infinite time horizons at the same time. There is essentially no difference. We restrict the presentation mainly to the infinite time horizon.
Our paper is organized as follows: In the next section we introduce our N agents continuous-time Markov decision process. The aim is to maximize the expected discounted reward of the system. In Section 3 we introduce a measure-valued simplification which is due to the symmetry properties of the problem and which reduces the cardinality of the state space. The convergence theorem when the number of agents tends to infinity can be found in Section 4. It is essentially based on martingale convergence arguments. In Section 5 we construct a sequence of asymptotically optimal strategies from the limiting model for the N agents model. We also show that different implementations may be possible and that the rate of convergence is at most 1/√N. Finally, in Section 6 we discuss three applications. The first one is a machine replacement problem when we have many machines, see e.g. Thompson (1968). The second one is the spreading of malware which is based on the classical SIR model for spreading infections, see Khouzani et al. (2012); Gast et al. (2012). The last example shows that one has to be careful with feedback policies.

The N agents continuous-time Markov Decision Process
We consider a finite number of N statistically equal agents, each moving on a finite set of states S according to a continuous-time Markov Decision Process. The vector x_t = (x^1_t, ..., x^N_t) ∈ S^N describes the state of the system at time t ∈ [0, ∞), where x^k_t is the state of agent k = 1, ..., N. The action space of one agent is a compact Borel set A. The action space of the system is accordingly A^N. We denote an action of the system by a = (a^1, ..., a^N) ∈ A^N where a^k is the action chosen by agent k = 1, ..., N. Let D(i) ⊂ A be the set of actions available for an agent in state i ∈ S, which we again assume to be compact. Then the set of admissible actions for the system in state x ∈ S^N is given by D(x) := D(x^1) × ... × D(x^N). The set of admissible state-action combinations for one agent is denoted by D := {(i, a) : i ∈ S, a ∈ D(i)}. For the construction of the system state process we follow the notation of Piunovskiy and Zhang (2020). The state process of the system is defined on the measurable space (Ω, F). We denote an element of Ω by ω = (x_0, t_1, x_1, t_2, ...). Now define T_0 := 0, T_n := t_1 + ... + t_n, X̄_n := x_n and the sojourn times τ_n := T_n − T_{n−1}. The controlled state process of the system is then given by X_t := X̄_n for t ∈ [T_n, T_{n+1}). The construction of the process can be interpreted as follows: The random variables τ_n describe the sojourn times of the system in the states X̄_{n−1}. Based on the sojourn times, T_n describes the time of the n-th jump of the process and X̄_n the state of the process on the interval [T_n, T_{n+1}). By construction the continuous-time state process (X_t) has piecewise constant càdlàg paths and the embedded discrete-time process is (X̄_n).
The system is controlled by policies. W.l.o.g. we restrict here to Markovian stationary policies. Further, we allow for randomized decisions, i.e. each agent can choose a probability distribution on A as its action. Hence a policy for the system is given by a collection of N stochastic kernels π(da | x) = (π^k(da | x))_{k=1,...,N}, where π^k(da | x) is the stochastic kernel (it is here considered as a relaxed control) with which agent k chooses an action, given the state x of the system. Naturally, it should hold that the kernel is concentrated on admissible actions, i.e. π^k(D(x^k) | x) = 1 for all agents k = 1, ..., N. The action process is thus defined by applying these kernels to the current state. In contrast to the state process, the action process has piecewise constant càglàd paths. This means that a new decision can only be taken after a change of state has already occurred. The general theory on continuous-time Markov decision processes states that the optimal policy can be found among the piecewise constant, deterministic, stationary policies. In particular, varying the action continuously on the interval [T_n, T_{n+1}) does not increase the value of the problem. Also randomization does not increase the value, but in view of the sections to come, we have already allowed for randomization (relaxation) here.
To prepare the description of the transition mechanism in our model, we define the empirical distribution of the agents over the states, i.e.

µ[x] := (1/N) Σ_{k=1}^N δ_{x^k},

where δ_{x^k} is the Dirac measure in the point x^k. The transition intensities for one agent are given by a signed kernel q(· | i, a, µ) for (i, a) ∈ D and µ ∈ P(S). Here P(S) is the set of all probability distributions on S and 𝒫(S) is the power set of S. Note that the transition of an agent depends not only on its own state and action, but also on the empirical distribution of all agents over the states.
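For a finite state set the empirical measure µ[x] is just a normalized count. The following minimal sketch (plain Python; the function name is a hypothetical choice for this illustration) spells out the definition:

```python
from collections import Counter

def empirical_measure(states, S):
    """Empirical distribution mu[x] = (1/N) * sum_k delta_{x^k} over the state set S."""
    N = len(states)
    counts = Counter(states)
    return {i: counts[i] / N for i in S}

# five agents on S = {0, 1}: two in state 0, three in state 1
mu = empirical_measure([0, 1, 1, 0, 1], S=[0, 1])
# mu == {0: 0.4, 1: 0.6}
```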
Note that (Q3) follows from (Q4) and (Q5), but since it is important we list it here. Based on the transition intensities for one agent, the transition intensities of the system are given by

q({(x^1, ..., x^{k−1}, j, x^{k+1}, ..., x^N)} | x, a) := q({j} | x^k, a^k, µ[x])    (2.1)

for all (x, a) with a ∈ D(x), j ∈ S, j ≠ x^k, and q({x} | x, a) := Σ_{k=1}^N q({x^k} | x^k, a^k, µ[x]). All other intensities are equal to 0. The intensity in (2.1) describes the transition of agent k from state x^k ∈ S to state j ∈ S, while all other agents stay in their current state. Since only one agent can change its state at a time, this definition is sufficient to describe the transition mechanism of the system. Further we set (in a relaxed sense) for a decision rule π^k(da | x)

q({j} | x^k, π^k, µ[x]) := ∫ q({j} | x^k, a, µ[x]) π^k(da | x).

Note that in a certain sense there is abuse of notation here since we use the letter q both for the agent transition intensity and for the system transition intensity. It should always be clear from the context which one is meant.
The probability measure of the N agents process is now given by the following transition kernels:

P^π(τ_n > t | X̄_{n−1}) = exp( ∫_0^t q({X̄_{n−1}} | X̄_{n−1}, π) ds )

for all t ≥ 0, and, given a jump, the next state lies in B ∈ 𝒫(S^N) with probability q(B \ {X̄_{n−1}} | X̄_{n−1}, π) / (−q({X̄_{n−1}} | X̄_{n−1}, π)). In particular, the sojourn times τ_n are exponentially distributed with parameter −q({X̄_{n−1}} | X̄_{n−1}, π) respectively. Note that by using this construction, the probability measure depends on the chosen policy. This construction is more convenient when the transition intensities are given. In case the system is described by transition functions and external noise it is easier to use a common probability space which does not depend on the policy. Of course these two points of view are equivalent.
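The construction above — an exponential sojourn time with parameter given by the total intensity, followed by a jump chosen proportionally to the individual intensities — can be sketched as a short simulation. The dict-based interface of the rate function is an assumption of this illustration, not part of the paper:

```python
import random

def simulate_ctmc(x0, rates, T, rng=random.Random(0)):
    """Simulate a finite-state CTMC up to time T.
    rates(x) returns a dict {y: q({y}|x)} of intensities to states y != x."""
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        out = rates(x)
        total = sum(out.values())       # equals -q({x}|x): the total jump intensity
        if total <= 0:                  # absorbing state
            return path
        t += rng.expovariate(total)     # sojourn time ~ Exp(total)
        if t >= T:
            return path
        # next state chosen proportionally to the individual intensities
        u, acc = rng.random() * total, 0.0
        for y, rate in out.items():
            acc += rate
            if u <= acc:
                x = y
                break
        path.append((t, x))

# toy two-state chain: rate 1 from state 0 to 1, rate 2 back from 1 to 0
path = simulate_ctmc(0, lambda x: {1 - x: 1.0 if x == 0 else 2.0}, T=5.0)
```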
Returning to the model's control mechanism, keep in mind that the policy of an agent π^k(da | x) is allowed to depend on the state of the whole system, i.e. we assume that each agent has information about the position of all other agents. Therefore, we can interpret our model as a centralized control problem, where all information is collected and shared by a central controller.
The goal of the central controller is to maximize the social reward of the system. In order to implement this, we introduce the (stationary) reward function r(i, a, µ) for one agent, defined for (i, a) ∈ D and µ ∈ P(S), which does not only depend on the state and action of the agent, but also on the empirical distribution of the system. We make the following assumptions on the reward function:

(R1) For all (i, a) ∈ D the function µ → r(i, a, µ) is continuous.
(R2) For all i ∈ S and µ ∈ P(S) the function a → r(i, a, µ) is continuous.
Since the set of admissible actions D(i) is compact, (R1) and (R2) imply that the following expression is bounded:

sup_{(i,a)∈D, µ∈P(S)} |r(i, a, µ)| < ∞.    (2.2)

The (social) reward of the system is the average of the agents' rewards, or, in a relaxed sense for a decision rule π,

r(x, π) := (1/N) Σ_{k=1}^N ∫ r(x^k, a, µ[x]) π^k(da | x).    (2.3)

The aim is now to find the social optimum, i.e. to maximize the joint expected discounted reward of the system over an infinite time horizon. For a policy π, a discount rate β > 0 and an initial configuration x ∈ S^N define the value function

V^N_π(x) := E^π_x [ ∫_0^∞ e^{−βt} r(X_t, π) dt ],   V^N(x) := sup_π V^N_π(x).    (2.4)

We are not discussing solution procedures for this optimization problem here since we simplify it in the next section and present asymptotically optimal solution methods in Section 5.

The measure-valued continuous-time Markov Decision Process
As N gets larger, so does the state space S^N, which could make the model increasingly complex and impractical to solve. Therefore, we seek some simplifications. An obvious approach, which is common for this kind of model, is to exploit the symmetry of the system by capturing not the state of every single agent, but the relative or empirical distribution of the agents across the |S| states. Thus, let µ^N_t := µ[X_t] and define as new state space the set of all distributions which are empirical measures of N atoms,

P_N(S) := { µ ∈ P(S) : Nµ(i) ∈ N_0 for all i ∈ S }.

As action space take the |S|-fold Cartesian product P(A)^{|S|} of P(A). Hence, an action is given by |S| probability measures α(da) = (α^i(da))_{i∈S} with α^i(D(i)) = 1. Hereby the i-th component indicates the distribution of the agents' actions in state i ∈ S. The set of admissible state-action combinations of the new model is given by D := P_N(S) × P(A)^{|S|}.
For the policies we restrict again to Markovian, stationary policies given by a collection of |S| stochastic kernels

π(da | µ) = (π^i(da | µ))_{i∈S},    (3.1)

where π^i is a stochastic kernel from P_N(S) to P(A) with π^i(D(i) | µ) = 1. The transition intensities of the process (µ^N_t)_{t≥0} are given by

q({µ + (δ_j − δ_i)/N} | µ, α) := N µ(i) ∫ q({j} | i, a, µ) α^i(da),   i ≠ j, µ(i) > 0.    (3.2)

This intensity describes the transition of one arbitrary agent in state i ∈ S to state j ∈ S, while all other agents stay in their current state. Note that the intensity follows from the usual calculations for continuous-time Markov chains, in particular from the fact that if X, Y are independent exponentially distributed random variables with rates λ and ν, then min(X, Y) is again exponentially distributed with rate λ + ν. In the situation in (3.2) we have Nµ(i) agents in state i. Further we set for all µ ∈ P_N(S) and α ∈ P(A)^{|S|}

q({µ} | µ, α) := Σ_{i∈S} N µ(i) ∫ q({i} | i, a, µ) α^i(da).

All other intensities are zero, since again only one agent can change its state at a time.
The probability distribution of the measure-valued process under a fixed policy π is now given by transition kernels of the same form as in Section 2: the sojourn times (τ_n), the same random variables as before, are exponentially distributed with parameter −q({µ^N_{T_{n−1}}} | µ^N_{T_{n−1}}, π(· | µ^N_{T_{n−1}})), and given a jump, the next state lies in a measurable set B ⊂ P_N(S) with probability proportional to the corresponding transition intensities.
The reward function of the system is derived from the reward for one agent:

r(µ, α) := Σ_{i∈S} µ(i) ∫ r(i, a, µ) α^i(da).

In view of (2.2), r(µ, α) is bounded. The aim in this model is again to maximize the joint expected discounted reward of the system over an infinite time horizon. For a policy π, a discount rate β > 0 and an initial configuration µ ∈ P_N(S) define the value function

V^N_π(µ) := E^π_µ [ ∫_0^∞ e^{−βt} r(µ^N_t, π(· | µ^N_t)) dt ],   V^N(µ) := sup_π V^N_π(µ).    (3.3)

We can now show that both formulations (2.4) and (3.3) are equivalent in the sense that the optimal values are the same. Of course, an optimal policy in the measure-valued setting can directly be implemented in the original problem. The advantage of the measure-valued formulation is the reduction of the cardinality of the state space. Suppose for example that S = {0, 1}, i.e. all agents are either in state 0 or in state 1. Then |S^N| = 2^N in the original formulation whereas |P_N(S)| = N + 1 in the second formulation. A proof of the next theorem can be found in the appendix.
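The cardinality reduction can be quantified in general: the number of empirical measures of N atoms on |S| states is the number of compositions of N into |S| nonnegative parts, C(N + |S| − 1, |S| − 1), which for |S| = 2 gives N + 1. A quick sketch (function names are hypothetical):

```python
from math import comb

def product_states(N, S):
    """Cardinality |S^N| of the state space in the original formulation."""
    return len(S) ** N

def measure_states(N, S):
    """Cardinality |P_N(S)|: empirical measures of N atoms on |S| states."""
    return comb(N + len(S) - 1, len(S) - 1)

S = [0, 1]
print(product_states(20, S), measure_states(20, S))  # 1048576 versus 21
```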
Remark 3.2. It is possible to extend the previous result to a situation where reward and transition intensity both also depend on the empirical distribution of actions, see e.g. Motte and Pham (2022). However, due to the definition of the Young topology which we use later, it is not possible to transfer the convergence results to this setting.
The problem we have introduced is a classical continuous-time Markov Decision Process and can be solved with the established theory accordingly. Thus, we obtain:

Theorem 3.3. There exists a continuous function v : P_N(S) → R with

βv(µ) = max_α { r(µ, α) + Σ_{ν ∈ P_N(S)} q({ν} | µ, α) v(ν) }

for all µ ∈ P_N(S), and there exists a maximizer π(· | µ) of the r.h.s. such that v = V^N and π determines the optimal policy by (3.1).
Proof. Follows from Theorem 4.6 and Lemma 4.4 in Guo and Hernández-Lerma (2009) or Theorem 3.1.2 in Piunovskiy and Zhang (2020). □

Theorem 3.3 implies a solution method for problem (3.3). It can e.g. be solved by value or policy iteration. However, as already discussed, even in this simplified setting the computation may be inefficient if N is large, since this leads to a large state space.
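Value iteration for the Bellman equation of Theorem 3.3 can be carried out after uniformization: with a constant Λ dominating the total jump intensity, the fixed point of v(µ) = max_α [r(µ, α) + Σ q v] / (β + Λ)-type updates is the value function. The sketch below does this for a two-state instance in the spirit of the machine replacement application of Section 6; the reward rate and all parameter values are illustrative assumptions of this sketch, not the paper's exact model, and the search over α ∈ {0, 1} suffices because rates and reward are linear in the action:

```python
# State k = number of agents in state 0 ("working"), k = 0, ..., N, i.e.
# mu = (k/N, 1 - k/N).  Action alpha = fraction of broken agents repaired.
beta, g, C = 0.1, 2.0, 1.0            # discount rate and reward parameters (illustrative)
lam_wb, lam_bw, N = 1.0, 2.0, 50

def rates_reward(k, alpha):
    up = lam_bw * (N - k) * alpha      # one agent repaired:   k -> k + 1
    down = lam_wb * k                  # one breakdown:        k -> k - 1
    r = g * k / N - C * (N - k) / N * alpha   # illustrative average reward rate
    return up, down, r

Lam = (lam_wb + lam_bw) * N            # uniformization constant >= total rate
v = [0.0] * (N + 1)
for _ in range(2000):
    w = v[:]
    for k in range(N + 1):
        best = float("-inf")
        for alpha in (0.0, 1.0):       # linearity in alpha => extreme points suffice
            up, down, r = rates_reward(k, alpha)
            stay = Lam - up - down     # fictitious self-transition
            val = (r + up * v[min(k + 1, N)] + down * v[max(k - 1, 0)]
                   + stay * v[k]) / (beta + Lam)
            best = max(best, val)
        w[k] = best
    v = w
# v[k] approximates the value at mu = (k/N, 1 - k/N); |v| <= sup|r| / beta
```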

Convergence of the state process
In this section we discuss the behaviour of the system when the number of agents tends to infinity.In this case we obtain a deterministic limit control model which serves as an asymptotic upper bound for our optimization problem with N agents.Moreover, an optimal control of the limit model can be used to establish a sequence of asymptotically optimal policies for the N agents model.
In what follows we consider (µ^N_t) as a stochastic element of D_{P_N(S)}[0, ∞), the space of càdlàg paths with values in P_N(S), equipped with the Skorokhod J_1-topology and metric d_{J_1}. On P_N(S) we choose the total variation metric.
Further, we consider π̄^i as a stochastic element in R := {ρ : R_+ → P(A) | ρ measurable} endowed with the Young topology (cf. Davis (2018)). It is possible to show that R is compact and metrizable. Measurability and convergence in R can be characterized as in Lemma 4.1. These statements follow directly from the fact that the Young topology is the coarsest topology such that mappings of the form ρ ↦ ∫_0^∞ ∫_A c(t, a) ρ_t(da) dt are continuous for a suitable class of test functions c. In a first step we define for N ∈ N, a fixed policy π̄^N and arbitrary j ∈ S the one-dimensional process

M^N_t(j) := µ^N_t(j) − µ^N_0(j) − ∫_0^t Σ_{ν ∈ P_N(S)} (ν(j) − µ^N_s(j)) q({ν} | µ^N_s, π̄_s) ds.
Then (M^N_t(j)) are martingales w.r.t. the filtration F^N_t = σ(µ^N_s, s ≤ t). This follows from the Dynkin formula, see e.g. Davis (2018), Proposition 14.13. Next we can express the process (M^N_t(j)) a bit more explicitly. Note that the difference ν(j) − µ^N_s(j) can either be −1/N if an agent changes from state j to a state k ≠ j, or it can be 1/N if an agent changes from a state i ≠ j to state j. Since the intensities sum up to zero by (Q2), we obtain by inserting the intensity (3.2) and by using (4.1)

M^N_t(j) = µ^N_t(j) − µ^N_0(j) − ∫_0^t Σ_{i∈S} µ^N_s(i) ∫_A q({j} | i, a, µ^N_s) π̄^i_s(da) ds.

With this representation we can prove that the sequence of stochastic processes (M^N(j)) converges weakly (denoted by ⇒) in the Skorokhod J_1-topology to the zero process. The proof of this lemma together with the proof of the next theorem can be found in the appendix.
Lemma 4.2. We have for all j ∈ S that M^N(j) ⇒ 0 for N → ∞.

Next we show that an arbitrary state-action process sequence is relatively compact, which implies the existence of converging subsequences.
Theorem 4.3. A sequence of arbitrary state-action processes (µ^N, π̄^N)_N is relatively compact. Thus, there exists a subsequence (N_k) which converges weakly, (µ^{N_k}, π̄^{N_k}) ⇒ (µ, π̄). Moreover, the limit (µ, π̄) satisfies a) (µ_t) has a.s. continuous paths, b) and for each component j we have

µ_t(j) = µ_0(j) + ∫_0^t Σ_{i∈S} µ_s(i) ∫_A q({j} | i, a, µ_s) π̄^i_s(da) ds.

The deterministic limit model
Consider the following deterministic optimization problem:

(F)  sup ∫_0^∞ e^{−βt} r(µ_t, π̄_t) dt  s.t.  µ̇_t(j) = Σ_{i∈S} µ_t(i) ∫_A q({j} | i, a, µ_t) π̄^i_t(da), j ∈ S, µ_0 given,

where the supremum is taken over all measurable π̄ : R_+ → P(A)^{|S|} with π̄^i_t(D(i)) = 1. Note that the theory of continuous-time Markov processes implies that µ_t is automatically a distribution. Hence one of the |S| differential equations in (F) may be skipped. Also note that when the transition intensity and the reward are linear in the action, relaxation of the control is unnecessary. We denote the maximal value of this problem by V^F(µ_0). We show next that this value provides an asymptotic upper bound to the value of problem (3.3).
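The constraint in (F) is a Kolmogorov forward equation and can be integrated numerically, e.g. with an explicit Euler step. The dict-based interfaces for the intensity q and the relaxed control are assumptions of this sketch (a finite action set is assumed so that each π̄^i is a finite probability vector):

```python
def limit_ode_step(mu, q, policy, dt):
    """One Euler step of mu_dot(j) = sum_i mu(i) * sum_a q({j}|i,a,mu) pi^i(a).
    policy(i, mu) returns the relaxed control pi^i as a dict {a: prob}."""
    S = list(mu)
    new = {}
    for j in S:
        drift = sum(mu[i] * sum(p * q(j, i, a, mu) for a, p in policy(i, mu).items())
                    for i in S)
        new[j] = mu[j] + dt * drift
    return new

# uncontrolled two-state example: rate 1 from state 0 to 1, rate 2 back
def q(j, i, a, mu):
    Q = {(0, 1): 1.0, (1, 0): 2.0}          # off-diagonal intensities
    if i == j:                               # diagonal: negative total rate
        return -sum(v for (src, _), v in Q.items() if src == i)
    return Q.get((i, j), 0.0)

mu = {0: 1.0, 1: 0.0}
for _ in range(10000):
    mu = limit_ode_step(mu, q, lambda i, m: {0: 1.0}, dt=1e-3)
# mass is conserved and mu approaches the stationary distribution (2/3, 1/3)
```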
Theorem 5.1. For all (µ^N_0) ⊂ P_N(S), µ_0 ∈ P(S) with µ^N_0 ⇒ µ_0 and for all sequences of policies (π^N_t) we have

lim sup_{N→∞} V^N_{π^N}(µ^N_0) ≤ V^F(µ_0).

Proof. According to Theorem 4.3 we can choose a subsequence (N_k) of corresponding state and action processes such that (µ^{N_k}, π̄^{N_k}) ⇒ (µ, π̄). For convenience we still denote this sequence by (N). We show that

lim_{N→∞} V^N_{π^N}(µ^N_0) = ∫_0^∞ e^{−βt} r(µ_t, π̄_t) dt ≤ V^F(µ_0).

The last inequality is true due to the fact that by Theorem 4.3 the limit process (µ, π̄) satisfies the constraints of problem (F).
Let us show the equality. We obtain by bounded convergence (r is bounded) that it suffices to consider the convergence of the integrands. Further, we split the difference |r(µ^N_t, π̄^N_t) − r(µ_t, π̄_t)| into two expressions by inserting the intermediate term r(µ_t, π̄^N_t). The second expression tends to zero for N → ∞ due to the definition of the Young topology and the fact that a → r(i, a, µ) is continuous by (R2). The first expression can be bounded from above by a term which also tends to zero for N → ∞ due to (R1), (R2), Lemma 7.1 and dominated convergence. Thus, the statement follows. □

On the other hand, we are now able to construct a strategy which is asymptotically optimal in the sense that the upper bound in the previous theorem is attained in the limit. Suppose that (µ*, π*) is an optimal state-action trajectory for problem (F). Then we can consider for the N agents problem the strategy π̄^{N,i}_t := π^{*,i}_t which applies at time t the kernel π^{*,i}_t irrespective of the state µ^N_t the process is in. More precisely, the considered strategy is deterministic and not a feedback policy.
Theorem 5.2. Suppose π* is an optimal strategy for (F) where the corresponding differential equation in (F) has a unique solution, and let (µ^N_0) ⊂ P_N(S) be such that µ^N_0 ⇒ µ_0 ∈ P(S). Then if we use the strategy π* for problem (3.3) for any N, we obtain

lim_{N→∞} V^N_{π*}(µ^N_0) = V^F(µ_0).

Thus, we call π* asymptotically optimal.
Proof. First note that π* is an admissible policy for any N. Further, let (µ^N_t) be the corresponding state process when N agents are present. Since the corresponding differential equation in (F) has a unique solution, every subsequence (N_k) is such that (µ^{N_k}, π*) ⇒ (µ, π*) holds (Theorem 4.3). Using the same arguments as in the last proof we obtain

lim_{N→∞} V^N_{π*}(µ^N_0) = ∫_0^∞ e^{−βt} r(µ_t, π*_t) dt = V^F(µ_0).

Together with the previous theorem, the statement is shown. □

Remark 5.3. a) In order to guarantee the unique solvability, it is sufficient to assume Lipschitz continuity for µ → q({j}|i, a, µ). More precisely, instead of (Q4) we have to assume (Q4') which is given below. The proof follows from the Theorem of Picard-Lindelöf. Example 5.4 shows what may happen if the differential equation for (µ_t) in (F) has multiple solutions.
b) Note that the construction of asymptotically optimal policies which we present here works in the same way when we consider control problems with finite time horizon, i.e. instead of (3.3) we consider

sup_π E^π_µ [ ∫_0^T e^{−βt} r(µ^N_t, π(· | µ^N_t)) dt + g(µ^N_T) ]    (5.1)

with possibly a terminal reward g(·) for the final state. In this case (F) is given with a finite time horizon,

sup_π ∫_0^T e^{−βt} r(µ_t, π̄_t) dt + g(µ_T),    (5.2)

and Theorem 5.2 holds accordingly.
c) General statements about the existence of optimal controls in (F) can only be made under additional assumptions. A classical result is the Theorem of Filippov-Cesari (see Seierstad (1987), Theorem 8 in Chapter II.8 for the finite time horizon problem and Theorem 15 in Chapter III.7 for the infinite horizon problem). It states the existence of an optimal control (for the finite horizon problem) under the following assumptions: i) there exist admissible pairs (π, µ) (for example by assuming Lipschitz continuity like in a)); ii) A is closed and bounded (which we assume here); iii) µ is bounded for all controls (which we have here); iv) for fixed µ the set {(r(µ, α) + γ, f_1(µ, α)), γ ≤ 0, α ∈ A} is convex, where f_1 is the r.h.s. of the differential equation in (F).
d) Suppose we obtain for problem (F) an optimal feedback rule π̄_t(· | µ). If µ → π̄(· | µ) is continuous, this feedback rule is also asymptotically optimal for problem (3.3). The proof can be done in the same way as before. If the mapping is not continuous, the convergence may not hold (see application 6.3).
e) Natural extensions of our model that we have not included in the presentation are resource constraints. For example, the total sum of fractions of a certain action may be limited, i.e. we restrict the set P(A)^{|S|} by requiring that Σ_{i∈S} π̄^i_t({a_0} | µ) ≤ c < |S| for a certain action a_0 ∈ A. As long as the constraint yields a compact subset of P(A)^{|S|} our analysis also covers this case.
Example 5.4. In this example we discuss what may happen if the differential equation for (µ_t) in (F) has multiple solutions. Suppose the state space is S = {1, 2} and the system is uncontrolled. State 1 is absorbing, i.e. q({1}|1, µ) = q({2}|1, µ) = 0 (since the system is uncontrolled we skip the action from the notation). So agents can only change from state 2 to 1. The intensity of such a change is chosen such that the intensities are bounded and continuous. Since µ_t(1) + µ_t(2) = 1 we can concentrate on µ_t(1). The differential equation for µ_t(1) in (F) is

µ̇_t(1) = (µ_t(1))^{1/3}

as long as µ_t(1) ≤ 0.99. If µ_0(1) = 0, there are two solutions of this initial value problem: µ_t(1) ≡ 0 and µ_t(1) = ((2/3)t)^{3/2} for µ_t(1) ≤ 0.99. Now consider the following sequence (µ^N_0): for N even we set µ^N_0 = (0, 1) (all N agents start in state 2), for N odd we set µ^N_0 = (1/N, (N − 1)/N) (exactly one agent starts in state 1). Obviously (µ^N_0) ⇒ (0, 1). However, when we consider the even subsequence we obtain µ^N_t(1) ≡ 0 since the intensity to change from 2 to 1 remains 0. The odd subsequence converges to the second solution µ_t(1) = ((2/3)t)^{3/2} as long as µ_t(1) is below 0.99. Thus, when we skip the assumption of a unique solution in Theorem 5.2 we only obtain lim sup_{N→∞} V^N_{π*}(µ^N_0) ≤ V^F(µ_0), see Theorem 5.1.

Under stricter assumptions it is possible to prove that the rate of convergence in the finite horizon problem (5.1) is 1/√N. In order to obtain this rate we need Lipschitz conditions on the reward function and the intensity functions. More precisely, assume

(R1') For all (i, a) ∈ D there exists a uniform constant L_1 > 0 s.t. |r(i, a, µ) − r(i, a, ν)| ≤ L_1 ∥µ − ν∥_TV for all µ, ν ∈ P(S).

Denote by π* the optimal control of the limiting problem (5.2), by V^{F,T}(µ_0) the corresponding value, and let V^{N,T}_{π*}(µ^N_0) be the value of the strategy π* in the N agents problem (5.1). Then we can state the following convergence rate.

Theorem 5.5. In the finite horizon setting under assumptions (Q1)-(Q5) with (Q4) replaced by (Q4') and (R1'), (R2), suppose that E∥µ^N_0 − µ_0∥_TV ≤ L_0/√N. Then

|V^{N,T}_{π*}(µ^N_0) − V^{F,T}(µ_0)| ≤ L/√N

for a constant L > 0 which is independent of N, but depends on T.
The statement about the convergence rate can be extended to the infinite horizon problem when the discount factor is large enough. Also note that E∥µ^N_0 − µ_0∥_TV ≤ L_0/√N is satisfied if e.g. the states of the N agents are sampled i.i.d. from µ_0.
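The i.i.d. sampling remark can be illustrated numerically: for two states the total variation distance between the empirical measure and µ_0 is just |p̂ − p|, and its mean decays like 1/√N, so quadrupling N roughly halves the error. (Monte Carlo sketch; all names are hypothetical.)

```python
import random

def tv_error(N, mu0, reps, rng):
    """Monte Carlo estimate of E||mu^N_0 - mu_0||_TV when the N initial
    states are sampled i.i.d. from mu0 (two-state case)."""
    p = mu0[0]
    total = 0.0
    for _ in range(reps):
        hits = sum(rng.random() < p for _ in range(N))
        total += abs(hits / N - p)       # TV distance on two states
    return total / reps

rng = random.Random(1)
e100 = tv_error(100, (0.3, 0.7), 2000, rng)
e400 = tv_error(400, (0.3, 0.7), 2000, rng)
# e100 / e400 is close to 2, consistent with the 1/sqrt(N) rate
```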
A direct implementation of the policy π* in problem (3.3) might make it necessary to update the policy continuously. This can be avoided by using the following policy instead. We assume here that t → π*_t is piecewise continuous. Thus, let (t_n)_{n∈N} be the discontinuity points in time of π* and define the set {t_n, n ∈ N} ∪ {T^N_n, n ∈ N}, where T^N_n describes the time of the n-th jump of the N agents process. Then (T̃^N_n) is the ordered sequence of the time points in this set. Define

π^{N,*}_t := π*_{T̃^N_n} for t ∈ (T̃^N_n, T̃^N_{n+1}].    (5.3)

The idea of the action process π^{N,*}_t is to adapt it to π* only when an agent changes its state or when π* has a jump, and to keep it constant otherwise. It can be shown that this sequence of policies is also asymptotically optimal.
Theorem 5.6. Suppose π* is a piecewise continuous optimal strategy for (F) where the corresponding differential equation in (F) has a unique solution, and let (µ^N_0) ⊂ P_N(S) be such that µ^N_0 ⇒ µ_0 ∈ P(S). Then if we use the strategy (π^{N,*}_t) of (5.3) for problem (3.3) for any N, we obtain

lim_{N→∞} V^N_{π^{N,*}}(µ^N_0) = V^F(µ_0).

Proof. In light of the proof of Theorem 5.2 it is enough to show that π^{N,*} ⇒ π*. Indeed, the convergence can be shown P-a.s. Now (π^{N,*}) converges in the J_1-topology to π* on [0, ∞) if and only if the restriction (π^{N,*})|_{[0,T]} to [0, T] converges in the finite J_1-topology to the restriction π*|_{[0,T]} for all T which are continuity points of the limit function (see Billingsley (2013), Sec. 16, Lemma 1). Since π* is piecewise continuous we can consider the convergence on each compact interval of the partition separately. Indeed, for t ∈ (T̃^N_n, T̃^N_{n+1}] we have π^{N,*}_t − π*_t = π*_{T̃^N_n} − π*_t. Since t → π*_t is continuous on this interval and since all |T̃^N_{n+1} − T̃^N_n| converge to zero for N → ∞ uniformly (the jump intensity increases with N), the right hand side converges to zero for N → ∞ uniformly in t, which implies the statement. □

Remark 5.7. Let us briefly discuss the main differences to Cecchin (2021) where a similar model is considered. In Cecchin (2021) the author considers a finite horizon problem where the model data is not necessarily stationary, i.e.
reward and transition intensities may depend on time. Moreover, he solves the corresponding optimization problems (N agents and limit problem) via HJB equations. This requires the notion of viscosity solutions and more regularity assumptions in terms of Lipschitz continuity of reward and transition intensities. Using the MDP perspective, we can state our solution theorem for the N agents problem (in the form of a Bellman equation) and the convergence result under weaker continuity conditions. For the convergence to hold we use randomized policies, whereas in Cecchin (2021) the author sticks to deterministic policies throughout. The obtained convergence rates under Lipschitz assumptions are the same, whereas our proof is simpler and more direct. In Cecchin (2021) the problem is further discussed under stronger assumptions. In contrast, we present some applications next in order to show how to use the results of the previous sections.

6. Applications
In this section we discuss two applications of the previously derived theorems and one example which shows that state processes under feedback policies do not necessarily converge. More precisely, in two applications we construct asymptotically optimal strategies for stochastic N-agents systems from the deterministic limit problem (F). The advantage of our problem (F), in contrast to the master equation, is that it can be solved with the help of Pontryagin's maximum principle, which gives necessary conditions for an optimal control and is in many cases easier to apply than dynamic programming. For examples see Avram et al. (1995); Weiss (1996); Bäuerle and Rieder (2000); Bäuerle (2002), and for the theory see e.g. Seierstad (1987); Zabczyk (2020).
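The bang-bang structure that the maximum principle typically yields when the Hamiltonian is linear in the control can be illustrated on a toy problem of our own (not one from the text): maximize ∫_0^T x_t dt subject to x'_t = u_t, u_t ∈ [−1, 1], x_0 = 0. The adjoint equation p' = −∂H/∂x = −1 with p(T) = 0 gives p(t) = T − t > 0, so the Hamiltonian x + p·u is maximized by u ≡ 1 and the optimal value is T²/2.

```python
# Toy check of Pontryagin's maximum principle (illustrative, not from the
# paper): maximize int_0^T x_t dt with x' = u, u in [-1, 1], x_0 = 0.
# Adjoint: p' = -dH/dx = -1, p(T) = 0  =>  p(t) = T - t, so the
# Hamiltonian H = x + p*u is maximized by the bang-bang control u = sign(p).
T, dt = 2.0, 1e-4
p = lambda t: T - t                     # closed-form adjoint

x, value, t = 0.0, 0.0, 0.0
while t < T:
    u = 1.0 if p(t) >= 0 else -1.0      # bang-bang maximizer of H
    value += x * dt                     # accumulate the running reward
    x += u * dt                         # Euler step of the state equation
    t += dt

print(value)  # ≈ T**2 / 2 = 2.0
```

The same linear-in-control structure is what makes the optimal repair policy in the machine replacement example below essentially 'bang-bang'.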
6.1. Machine replacement. The following application is a simplified version of the deterministic control problem in Thompson (1968). A mean-field application can be found in Huang and Ma (2016). Suppose a company has N statistically equal machines. Each machine can either be in state 0 = 'working' or in state 1 = 'broken', thus S = {0, 1}. Two actions are available: 0 = 'do nothing' and 1 = 'repair', thus A = {0, 1}. A working machine does not need repair, so D(0) = {0}. The transition rates are as follows: a working machine breaks down with fixed rate λ_wb > 0, and a broken machine which gets repaired changes to the state 'working' with rate λ_bw > 0. Thus, we can summarize the transition rates of one machine by q({1}|0, 0, µ^N_t) = λ_wb and q({0}|1, a_t, µ^N_t) = λ_bw δ_{{a_t=1}}. The diagonal elements of the intensity matrix are given by q({0}|0, 0, µ^N_t) = −λ_wb and q({1}|1, a_t, µ^N_t) = −λ_bw δ_{{a_t=1}}, and all other intensities are zero. Obviously (Q1)-(Q5) are satisfied. The initial state of the system is µ^N_0 = (1, 0), i.e. all machines are working in the beginning. Each working machine produces a reward rate g > 0, whereas we have to pay a fixed cost of C > 0 when we have to call the service for repair, i.e.
Hence we obtain an interaction of the agents in the reward. Note that (R1) and (R2) are satisfied. This yields the reward rate for the system. Thus, problem (F) in this setting is given by (we denote the limit by (µ_t(0), µ_t(1)) =: (µ^0_t, 1 − µ^0_t) and let α^0_t := π^1_t({0}|µ_t)): (F) sup

We briefly explain how to solve this problem using Pontryagin's maximum principle. The Hamiltonian function of (F) is given in terms of the adjoint function (p_t). Pontryagin's maximum principle yields the following sufficient conditions for optimality (Seierstad (1987); Zabczyk (2020)):

Lemma 6.1. The control (α^{0,*}_t) with the associated trajectory (µ^{0,*}_t) is optimal for (F) if there exists a continuous and piecewise continuously differentiable function (p_t) such that for all t > 0:

Inspecting the Hamiltonian, it is immediately clear from (i) that the optimal control is essentially 'bang-bang'. For a numerical illustration we solved (F) for the parameters C = 1, g = 2, λ_wb = 1, λ_bw = 2 and T = 4. Here it is optimal to do nothing until the time point t^* = ln 2. Then it is optimal to repair the fraction α^{0,*} = 1/2 of the broken machines, which keeps the number of working machines at 1/2. Finally, ln 2 time units before the end, we again do nothing and wait until the end of the time horizon. A numerical illustration of the optimal trajectory µ^{0,*}_t of the deterministic problem together with simulated paths under this policy for different numbers N can be found in Figure 2, left. A number of different simulations for N = 1000 are shown in Figure 2, right; the simulated paths are quite close to the deterministic trajectory. The optimal value in the deterministic model is V_F(1, 0) = 9/2 − (3/2) ln 2 ≈ 3.4603. If we simulate the trajectory of the state process ten times for N = 1000 machines while following the asymptotically optimal policy and take the average of the respective values, we obtain a mean of 3.43612, which is slightly less than the value for (F), cp. Theorem 5.1.

6.2. Spreading malware. This example is based on the deterministic control model considered in Khouzani et al. (2012), see also Gast et al.
(2012), and treats the propagation of a virus in a mobile wireless network. It is based on the classical SIR model of Kermack and McKendrick, see Daley and Gani (2001). Suppose there are N devices in the network. A device can be in one of the following states: Susceptible (S), Infective (I), Dead (D) or Recovered (R). A device is in the susceptible state if it is not contaminated yet, but prone to infection. A device is infective if it is contaminated by the virus. It is dead if the virus has destroyed the software, and recovered if the device already has a security patch which makes it immune to the virus. The states D and R are absorbing. The joint process µ^N_t = (S^N_t, I^N_t, D^N_t, R^N_t) is a controlled continuous-time Markov chain, where X^N_t represents the fraction of devices in state X ∈ {S, I, D, R}. The control is a strategy of the virus, which chooses the rate a(t) ∈ [0, ā] at which infected devices are destroyed. The transition rates of one device are as follows: a susceptible device gets infected with rate λ_SI I_t, where λ_SI > 0. This rate is proportional to the number of infected devices, and we thus have an interaction of one agent with the empirical distribution of the others. A susceptible device gets recovered with rate λ_SR > 0, which is the rate at which the security patch is distributed. An infected device gets killed by the virus with rate a(t) ∈ [0, ā], chosen by the attacker, and gets recovered at rate λ_IR > 0.
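A minimal sketch of the limit dynamics implied by these rates (our own illustration): the limiting fractions evolve according to S' = −λ_SI S I − λ_SR S, I' = λ_SI S I − λ_IR I − a_t I, D' = a_t I, R' = λ_SR S + λ_IR I. The snippet integrates these equations under a bang-bang attack with the parameters of the numerical example below; the initial infection level I_0 = 0.1 is our own assumption.

```python
# Sketch: Euler integration of the limiting SIR-type dynamics implied by the
# device rates above, under the bang-bang attack a_t = 0 on [0, t1] and
# a_t = a_bar on (t1, T]. Parameters follow the numerical example in the
# text; the initial infection level I0 = 0.1 is our own assumption.
lam_SI, lam_SR, lam_IR = 0.6, 0.2, 0.2
a_bar, T, t1, I0 = 1.0, 10.0, 4.9, 0.1

S, I, D, R = 1.0 - I0, I0, 0.0, 0.0
dt, t = 1e-4, 0.0
while t < T:
    a = 0.0 if t <= t1 else a_bar          # attacker's kill rate
    dS = -lam_SI * S * I - lam_SR * S      # infection + patching
    dI = lam_SI * S * I - lam_IR * I - a * I
    dD = a * I                             # destroyed devices
    dR = lam_SR * S + lam_IR * I           # patched devices
    S, I, D, R = S + dS * dt, I + dI * dt, D + dD * dt, R + dR * dt
    t += dt

print(round(S + I + D + R, 6))  # mass is conserved: 1.0
```

Since the four rates sum to zero, the total mass stays at 1 along the whole trajectory, which is a convenient sanity check for the integration.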
The rates are shown in Figure 3. The intensities of one device at time t are summarized accordingly; the diagonal elements of the intensity matrix are the negative row sums, and all other intensities are zero. Note that (Q1)-(Q5) are satisfied and that, since the intensities are linear in a, there is no need for a relaxed control. The initial state of the network is µ^N_0 = (S^N_0, I^N_0, D^N_0, R^N_0) = (1 − I_0, I_0, 0, 0) with 0 ≤ I_0 < 1. The aim of the virus is to produce as much damage as possible over the time interval [0, T], evaluated by a damage functional which is obtained when we choose r(i, a, µ) = (1/T)(µ(2))^2 (the second component of µ squared) and an appropriate terminal reward. (R1) and (R2) are satisfied. Thus, problem (F) in this setting is given by (we denote the limit by µ_t = (S_t, I_t, D_t, R_t)): (F) sup, subject to the state dynamics for all t ∈ [0, T].

A solution of this deterministic control problem can be found in Khouzani et al. (2012). It is shown there that a critical time point t_1 ∈ [0, T] exists such that a_t = 0 for t ∈ [0, t_1] and a_t = ā for t ∈ (t_1, T]. Thus, the attacker does not destroy devices from the beginning, because this would lower the number of devices which can get infected. Instead, she first waits to get more infected devices before setting the kill rate to the maximum. A numerical illustration can be found in Figure 4. There we see the trajectories of the optimal state distribution in (F) and simulated paths for N = 1000 devices for λ_SI = 0.6, λ_SR = λ_IR = 0.2, ā = 1, T = 10. The optimal time point for setting a_t to the maximum is here 4.9. The simulated paths are almost indistinguishable from the deterministic trajectories.

6.3. Resource competition. This example shows that feedback policies in the deterministic problem are not necessarily asymptotically optimal when implemented in the N-agents problem. The infinite horizon problem (F) could also be solved using an HJB equation, which would (under sufficient regularity) provide a feedback control π(·|µ), i.e. we obtain
the optimal control by π_t = π(·|µ_t). This feedback function could also be used in the N-agents model. However, in this case convergence of the N-agents model to the deterministic model as in Theorem 5.2 is not guaranteed. Convergence may fail when discontinuities in the feedback function are present. The example is an adaptation of the queueing network considered in Kumar and Seidman (1989); Rybko and Stolyar (1992) to our setting. Suppose the state space is given by S = {1, 2, 3, 4, 5, 6, 7, 8}. Agents starting in state 1 change to state 2, then 3, and are finally absorbed in state 4. Agents starting in state 5 change to state 6, then 7, and are finally absorbed in state 8. The aim is to get the agents into the absorbing states as quickly as possible by activating the intensities in states 2, 3, 6 and 7. The intensity for leaving states 1 and 5 is λ_1 = λ_5 = 1, the full intensity for leaving states 2 and 6 is λ_2 = λ_6 = 6, and the full intensity for leaving states 3 and 7 is λ_3 = λ_7 = 1.5. The action space is A = {0, 1}, where actions have to be taken in states 2, 3, 6 and 7 and determine the activation of the transition intensity. Action a = 0 means that the intensity is deactivated and a = 1 that it is fully activated. There is a resource constraint such that the sum of the activation probabilities in states 2 and 7, as well as the sum of the activation probabilities in states 3 and 6, are constrained by 1 (see the remark on p. 13). When we denote the randomized control in state i by π^i_t, the constraints read π^2_t({1}) + π^7_t({1}) ≤ 1 and π^3_t({1}) + π^6_t({1}) ≤ 1. An illustration of this model can be seen in Figure 5. The initial state distribution is given by µ_0 = (5/14, 1/14, 1/14, 0, 5/14, 1/14, 1/14, 0), where we assume for the simulation that we have N = 1400 agents. Now suppose further that agents in the absorbing states 4 and 8 produce no cost, whereas agents in states 3 and 7 are the most expensive as soon as at least 0.01% of the population is present there. This optimization criterion leads to a priority rule where agents in state 3 receive priority
(and thus full capacity) over those in state 6 (as long as at least 0.01% are present), and agents in state 7 receive priority (and thus full capacity) over those in state 2 (as long as at least 0.01% are present). In the deterministic problem the priority rule can be implemented such that, once the number of agents in states 3 and 7 falls to the threshold of 0.01% of the population, it is possible to keep this level. This is not possible in the N-agents problem. The priority switch leads to blocking the agents in the other line, see Figure 6. The blue line shows the state trajectories in the deterministic model. The red line is a realization of the system for N = 1400 agents where we use the deterministic open-loop control of Theorem 5.2. We see that the state processes converge. Finally, the green line is a realization of the N = 1400 agents model under the priority rule. We can see that here the state processes do not converge.

Lemma 7.1. Let X be a separable metric space, Y a compact metric space and f : X × Y → R. For a proof see e.g. Lemma B.12 in Lange (2017).

7.2. Proof of Theorem 3.1. First of all, observe that the reward function r in (2.3) in the N-agents problem is symmetric, i.e. r(x, a) = r(s(x), s(a)) for any permutation s(·) of the vectors. Moreover, the agent transition intensities q(·|i, a, µ[X_t]) depend only on the agent's own state and on µ[X_t]. Thus, the optimal policy in the N-agents problem at time t depends only on µ[X_t]. Now, for a decision rule π of the N-agents problem, define for all states i ∈ S the aggregated decision rule π̄^i, where µ = µ[x]: on the right-hand side we consider all agents in state i and take the convex combination of their action distributions as the action distribution in state i. If π depends only on µ[x], then this is also true for π̄.
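For a finite action set, the aggregation step described above can be sketched as follows (a schematic of our own; names are illustrative):

```python
from collections import Counter

def aggregate_rule(states, action_dists):
    """Sketch of the symmetrization step above (finite action sets assumed):
    for each state i, average the action distributions of all agents
    currently in state i, i.e. bar_pi^i = (1/(N*mu(i))) * sum_{k: x_k=i} pi^k.
    states[k] is agent k's state, action_dists[k] a dict action -> prob."""
    counts = Counter(states)           # N * mu(i) = number of agents in state i
    bar_pi = {}
    for i, n_i in counts.items():
        mix = {}
        for s, dist in zip(states, action_dists):
            if s == i:
                for a, p in dist.items():
                    mix[a] = mix.get(a, 0.0) + p / n_i
        bar_pi[i] = mix
    return bar_pi

# Two agents in state 0 with different action distributions are pooled:
pi = aggregate_rule([0, 0, 1], [{0: 1.0}, {0: 0.5, 1: 0.5}, {1: 1.0}])
print(pi[0])  # {0: 0.75, 1: 0.25}
```

Each π̄^i is again a probability distribution, since it is a convex combination of probability distributions.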
Thus, the reward in both formulations is the same. Finally, the transition intensity in the N-agents model with which one agent changes its state from i to j is given by (again µ = µ[x])

\sum_{k=1}^N 1_{\{x_k=i\}} \int_A q(\{j\}|i,a,\mu)\, \pi^k(da|x) = N\mu(i) \int_A q(\{j\}|i,a,\mu)\, \bar\pi^i(da|\mu) = q(\{\mu^{i\to j}\}|\mu,\bar\pi).

Thus, the empirical measure process of the N-agents problem is statistically equal to the measure-valued MDP process, and both produce the same expected reward under measure-dependent policies, which implies the result. A formal proof can be carried out by induction as in Bäuerle (2023), Thm. 3.3.

7.3. Proof of Lemma 4.2. First we show that M^N_t(j) is bounded for fixed t; the relevant integral term satisfies

\Big| \int_0^t \sum_{i\in S} \mu^N_s(i) \int_A q(\{j\}|i,a,\mu^N_s)\, \pi^{N,i}_s(da)\, ds \Big| \le \int_0^t \sum_{i\in S} \mu^N_s(i) \int_A |q(\{j\}|i,a,\mu^N_s)|\, \pi^{N,i}_s(da)\, ds \le t\, q_{\max}.

We exploit the fact that the jumps of the state processes with N agents have height at most 1/N. Theorem 3.10.2a) in Ethier and Kurtz (1986) then implies the a.s. continuity of the limit state process (µ^*_t)_{t≥0}.
In particular, due to the Skorokhod representation theorem we find a probability space on which the convergence µ^N ⇒ µ^* holds almost surely in J_1; since µ^* is a.s. continuous, the convergence is uniform on compact sets such as [0, t] (see p. 383 in Whitt (2002)). Thus, component-wise, for almost all ω in the probability space above, we obtain uniform convergence on [0, t]. Finally, we have to take the limit N → ∞ in (4.2). By the previous Lemma 4.2 we know that the martingale on the left-hand side converges to zero and that µ^N_t(ω) → µ^*_t. Now consider the integral on the right-hand side, which we split into two expressions by adding and subtracting a cross term:

\int_0^t \sum_{i\in S} \Big( \mu^N_s(i) \int_A q(\{j\}|i,a,\mu^N_s)\, \pi^{N,i}_s(da) - \mu^*_s(i) \int_A q(\{j\}|i,a,\mu^*_s)\, \pi^{N,i}_s(da) \Big)\, ds + \int_0^t \sum_{i\in S} \mu^*_s(i) \int_A q(\{j\}|i,a,\mu^*_s)\, \big(\pi^{N,i}_s - \pi^{*,i}_s\big)(da)\, ds.
The second expression tends to 0 as N → ∞ due to the definition of the Young topology and the fact that a ↦ q({j}|i, a, µ^*_s) is continuous by assumption. The first expression can be bounded by

\int_0^t \sum_{i\in S} \int_A \big| \mu^N_s(i) q(\{j\}|i,a,\mu^N_s) - \mu^*_s(i) q(\{j\}|i,a,\mu^*_s) \big|\, \pi^{N,i}_s(da)\, ds \le \int_0^t \sum_{i\in S} \sup_{a\in A} \big| \mu^N_s(i) q(\{j\}|i,a,\mu^N_s) - \mu^*_s(i) q(\{j\}|i,a,\mu^*_s) \big|\, ds,

which also tends to zero due to dominated convergence, (Q4), (Q5) and Lemma 7.1. Putting things together, equation (4.2) implies that the limit satisfies the stated differential equation.
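The law-of-large-numbers effect behind this convergence can be illustrated in the simplest uncontrolled special case (our own sketch, not from the paper): machines that break independently at rate 1 and are never repaired, for which the limit ODE gives µ^*_t('working') = e^{−t}.

```python
import math, random

# Monte Carlo illustration of the convergence mu^N => mu* in the simplest
# uncontrolled case: N machines break independently at rate 1 and are never
# repaired, so the limit ODE gives mu*_t(working) = exp(-t).
random.seed(0)

def empirical_working(N, t):
    # lifetime of each machine is Exp(1); count survivors at time t
    return sum(1 for _ in range(N) if random.expovariate(1.0) > t) / N

t = 1.0
for N in (10, 100, 10_000):
    err = abs(empirical_working(N, t) - math.exp(-t))
    print(N, round(err, 4))  # deviation from exp(-1) typically shrinks with N
```

The deviation of the empirical fraction from the deterministic limit is of order N^{-1/2}, in line with the convergence rates discussed above.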

Now we obtain
V^{N,T}_{π^*}(µ^N_0) − V^{F,T}(µ_0) ≤ E[·]. For the last term we have already shown the corresponding convergence in Lemma 4.2. Thus, from the two previous inequalities we get, with L_4 := |S| √(q_max) T/2, the corresponding estimate. Finally, Gronwall's inequality implies a bound for all t ∈ [0, T], which in turn implies the statement.
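Gronwall's inequality used in the last step states that if f(t) ≤ c + L ∫_0^t f(s) ds for all t ∈ [0, T], then f(t) ≤ c e^{Lt}. A small numerical sanity check of our own:

```python
# Numerical sanity check (our own) of Gronwall's inequality:
# if f(t) <= c + L * int_0^t f(s) ds on [0, T], then f(t) <= c * exp(L*t).
import math

c, L, T, dt = 0.5, 2.0, 1.0, 1e-4
f, integral, t = [], 0.0, 0.0
while t <= T:
    val = 0.9 * (c + L * integral)   # any f satisfying the integral inequality
    f.append((t, val))
    integral += val * dt             # left Riemann sum of int_0^t f(s) ds
    t += dt

assert all(val <= c * math.exp(L * t) for t, val in f)
print("Gronwall bound holds")
```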

Figure 2. Left: State trajectories for different numbers N of machines executing the optimal control for (F). Right: Ten state trajectories for N = 1000 machines executing the asymptotically optimal control.

Figure 3. Transition intensities of one device between the possible states.

Figure 5. Transition intensities of one agent for the resource constraint problem.
The maps ρ ↦ ∫_0^∞ ∫_A ψ(t, a) ρ_t(da) dt are continuous for all real functions ψ on R_+ × A where ψ is a Carathéodory function, i.e. ψ is continuous in a and measurable in t, and ψ is integrable in the sense that ∫_0^∞ sup_{a∈A} |ψ(t, a)| dt < ∞.
a) ρ : R_+ → P(A) is measurable if and only if ρ is a transition probability