Mean Field Markov Decision Processes

We consider mean-field control problems in discrete time with discounted reward, infinite time horizon and compact state and action spaces. The existence of optimal policies is shown and the limiting mean-field problem is derived when the number of individuals tends to infinity. Moreover, we consider the average reward problem and show that the optimal policy in this mean-field limit is ε-optimal for the discounted problem if the number of individuals is large and the discount factor is close to one. This result is very helpful because, in the special case where the reward depends only on the distribution of the individuals, we obtain an interesting subclass of problems in which an average reward optimal policy can be obtained by first computing an optimal measure from a static optimization problem and then achieving it with Markov Chain Monte Carlo methods. We give two applications, which we solve explicitly: avoiding congestion on a graph and optimal positioning on a market place.


Introduction
Mean-field control problems have been developed from McKean-Vlasov processes (see [26]), where the dynamics depend on the distribution of the current state itself. In the corresponding control problem the relevant data like reward and transition function depend not only on the current state and action but also on the distribution of the state. Whereas the original motivation comes from physics, this kind of problem is able to model the interaction of a large population. Thus, other popular applications include finance, queueing, energy and security problems, among others. In this paper we consider mean-field control problems in discrete time, in contrast to the majority of the literature, which concentrates on continuous-time models. Moreover, our optimization criterion is to maximize the social benefit of the system, i.e. the overall expected reward. In particular, in our paper individuals cooperate, in contrast to the game situation where one usually tries to find the Nash equilibrium of the system. Here we rather aim at obtaining the Pareto optimal solution. A comprehensive overview of continuous-time mean-field games can be found in [7]. These games were introduced in economics and have been studied in mathematics for at least 15 years (see e.g. [24] for one of the first mathematical papers on this topic).
We briefly review the latest results on discrete-time mean-field problems. First note that there have been some early studies of interactive games in [23] under the name anonymous sequential games and in [35] of so-called oblivious games, which are in nature very similar to mean-field games. For a recent paper on discrete-time mean-field games and a literature survey, see for example [32]. In that paper Markov Nash equilibria are considered in a model without common noise. For an early game paper with finite state space see [16]. Since our paper is not about a game and is more in the spirit of Markov Decision Processes (MDPs), we concentrate our literature survey on control papers. Some of the first papers in this area were [13,14]. In both papers the authors' goal is to investigate the convergence of a large interacting population process to the simpler mean-field model. More precisely, the authors show convergence of value functions and convergence of optimal policies, which implies the construction of asymptotically optimal policies. In both papers the state space is finite and the action space compact. Whereas in [13] the convergence rate is studied, in [14] the authors also scale the time steps to obtain a continuous-time deterministic limit. Finite- as well as infinite-horizon discounted reward problems are considered. In [20] the authors also investigate convergence in a discounted reward problem, however they consider the situation where the random disturbance density is unknown. A consumption-investment example is discussed there. In [21] the same authors treat the unknown disturbance as a game against nature. The paper [29] already starts from a discrete-time mean-field control problem. The authors derive the value iteration and solve an LQ McKean-Vlasov control problem. In contrast to our paper there is no common noise, the authors restrict themselves to a finite time horizon and do not use MDP theory to solve their problem. However, their model data like cost and transition function may also depend on the distribution of actions. LQ problems are popular as applications of mean-field control since it is often possible to obtain optimal policies in these cases; e.g. [11] is entirely devoted to this kind of problem.
The two papers which are closest to ours, at least as far as the model is concerned, are [8,27]. In both papers the model data may also depend on the distribution of actions, but there is no restriction on admissible actions. Both consider a discounted problem with infinite time horizon. In [8] the authors work with lower semicontinuous value functions, whereas we show continuity under the same assumptions. The main issues in [8] are an extensive discussion of different types of policies and the development of Q-learning algorithms. We, however, start directly with Markovian deterministic policies, since in MDP theory it is well-known that history-dependent policies or randomized policies do not increase the value. Moreover, we consider the convergence of the N-individuals problem as well as average reward optimization. In [27] the authors deal with so-called open-loop controls and restrict themselves to individualized or decentralized information. They investigate the rate of convergence from the N-population model to the mean-field problem. They also derive a fixed point characterization of the value function and discuss the role of randomized controls. Since in [27] decisions may only depend on the history of the single agent, an additional source of randomness is required so that individuals with the same history may take different actions.
Other recent papers discuss reinforcement learning for mean-field control problems, see e.g. [9,8,17,18]. In the second part of the paper we consider average reward mean-field control problems, which is a new aspect. There are papers on average reward games, like [5], where the transition probability does not depend on the empirical distribution of individuals, and [36], where under some strong ergodicity assumptions the existence of a stationary mean-field equilibrium is shown. Both papers do not consider the vanishing discount approach which we use here. The recent paper [6] considers the vanishing discount approach, but in a continuous-time setting and for a game.
The main contributions of our paper are as follows: We first want to stress the point that mean-field control problems fit naturally into established MDP theory. We start with a problem where N interacting individuals try to maximize their expected discounted reward over an infinite time horizon. Reward and transition functions may depend on the empirical measure of the individuals. Moreover, the transition functions of the individuals depend on an idiosyncratic noise and a common noise. For symmetry reasons, instead of taking the state of each individual as the common state of the system, it is enough to know the empirical measure over the states. This equivalence implies an MDP formulation where the underlying state process consists of empirical measures. A similar observation can be found in [27]; however, there the authors take the mean-field limit first. Letting the number N of individuals tend to infinity implies a mean-field limit by applying the Glivenko-Cantelli theorem. The idiosyncratic noise vanishes in the limit. In our setting state and action spaces are compact Borel spaces. We also discuss the existence of optimal policies, which is rarely done in other papers. E.g. we give explicit conditions under which an optimal deterministic policy exists for the limit problem as well as for the initial N-individuals problem. Moreover, we investigate average optimality in mean-field control problems, an aspect which is neglected in the literature. Applying results from MDP theory leads to an average reward optimality inequality. In some cases we obtain optimal policies in this setting rather easily. Since we use the vanishing discount approach, we can show that these policies are ε-optimal for the initial problem when the number of individuals is large and the discount factor is close to one. Thus, we get some kind of double approximation which is helpful in some applications. Indeed, it turns out that the case where the reward does not depend on the action yields an interesting special case. The average reward problem can then be solved by first finding an optimal measure for a static optimization problem and then using Markov Chain Monte Carlo to find an optimal randomized decision rule which achieves the optimal measure in the limit. We show how this works in a network example where the aim is to avoid congestion. Another interesting feature of the solution is that it is a decentralized control, i.e. individuals can decide optimally based on their own state without knowing the distribution of all individuals; in particular, individuals do not have to communicate. A second example is the optimal placement on a market square.
The paper is organized as follows: In Section 2 we introduce the model with a finite number N of individuals. We give conditions under which the optimality equation holds and optimal policies exist. In Section 3 we show how to formulate an equivalent MDP whose state space consists of the empirical measures of the individuals. Based on this formulation, we let the number N of individuals tend to infinity in Section 4. We prove the convergence of value functions and show how an asymptotically optimal policy can be constructed. In Section 5 we consider the average reward problem via the vanishing discount approach. Under some ergodicity assumptions we prove the existence of average reward optimal policies and verify that the value function satisfies an average reward optimality inequality. Next we show how to use this optimal policy to construct ε-optimal policies for the original problem. We also discuss how to solve average reward problems when the reward depends only on the distribution of individuals and not on the action. Finally, in Section 6 we consider two applications (network congestion and positioning on a market place), which we solve explicitly. The appendix contains additional material: a useful convergence result and the definitions of the Wasserstein distance and Wasserstein ergodicity. Moreover, longer proofs are also deferred to the appendix.

The Mean-Field Model
We consider the following Markov Decision Process with a finite number of individuals: Suppose we have a compact Borel set S of states and N statistically equal individuals. Each individual is at the beginning in one of the states, i.e. the state of the system is described by a vector x = (x^1, ..., x^N) ∈ S^N which represents the states of the individuals. In case we need the time index n, we write x^i_n, i = 1, ..., N. Each individual can choose actions from the same Borel set A. Let D(x) ⊂ A be the actions available for an individual who is in state x ∈ S, i.e. a = (a^1, ..., a^N) ∈ D(x) := D(x^1) × ... × D(x^N) is the vector of admissible actions for all individuals. We denote D := {(x, a) ∈ S × A : a ∈ D(x)} and assume that it contains the graph of a measurable mapping f : S → A. After choosing an action, each individual faces a random transition. In order to define this, suppose that (Z^i_n)_{n ∈ N}, i = 1, ..., N, and (Z^0_n)_{n ∈ N} are sequences of i.i.d. random variables with values in a Borel set Z. The sequence (Z^0_n)_{n ∈ N} will play the role of a common noise. In what follows we need the empirical measure of x, i.e. we denote

µ[x] := (1/N) Σ_{i=1}^N δ_{x^i},

where δ_y is the Dirac measure in point y. µ[x] can be interpreted as a distribution on S. We denote by P(S) the set of all distributions on S and by P_N(S) the set of all distributions which are empirical measures of N points. On these sets we consider the topology of weak convergence. The transition function of the system is now a combination of the individual transition functions, which are given by a measurable mapping T : S × A × P(S) × Z^2 → S such that

x^i_{n+1} = T(x^i_n, a^i_n, µ[x_n], Z^i_{n+1}, Z^0_{n+1}) for i = 1, ..., N.

Note that the individual transition may also depend on the empirical distribution µ[x_n] of all individuals. In total, the transition function for the entire system is a measurable mapping T : D × P_N(S) × Z^{N+1} → S^N of the state x, the chosen actions a ∈ D(x), the empirical measure µ[x] and the disturbances Z_{n+1} := (Z^1_{n+1}, ..., Z^N_{n+1}), Z^0_{n+1} such that

x_{n+1} = T(x_n, a_n, µ[x_n], Z_{n+1}, Z^0_{n+1}).
Last but not least, each individual generates a bounded one-stage reward r : S × A × P(S) → R, given by r(x^i, a^i, µ[x]), i.e. it may also depend on the empirical distribution of all individuals. The total one-stage reward of the system is the average over all individuals. The first aim will be to maximize the joint expected discounted reward of the system over an infinite time horizon, i.e. we consider here the social optimum of the system, or Pareto optimality. In particular, the agents have to work together in order to optimize the system. This is in contrast to mean-field games, where each individual tries to maximize her own expected discounted reward and where the aim is to find Nash equilibria. We make the following assumptions:

(A0) D is compact.
(A1) x → D(x) is upper semicontinuous, i.e. for all x ∈ S: if x_n → x for n → ∞ and a_n ∈ D(x_n), then (a_n) has an accumulation point in D(x).
(A2) (x, a, µ) → r(x, a, µ) is upper semicontinuous.
(A3) (x, a, µ) → T(x, a, µ, z, z^0) is continuous for all z, z^0 ∈ Z.

A policy in this model is given by π = (f_0, f_1, ...) with f_n ∈ F being a decision rule, where F is the set of all decision rules. In case we do not need the time index n, we write f(x) := (f^1(x), ..., f^N(x)). It is not necessary to introduce randomized or history-dependent policies here, since we obtain a classical MDP below and it is well-known that an optimal policy will be among the deterministic Markov ones. We assume that each individual has information about the position of all other individuals. This point of view can be interpreted as a centralized control problem where all information is collected and shared by a central controller.
Together with the distributions of (Z^i_n), (Z^0_n) and the transition function T, a policy π induces a probability measure P^π_x on the measurable space ((S^N)^∞, B(S^N)^⊗∞), where B(S^N) is the Borel σ-algebra on S^N. The corresponding state process is denoted by (X_n), where X_n(ω_1, ω_2, ...) = ω_n ∈ S^N, and the action process is denoted by (A_n), where A_n := f_n(X_n). Our aim is to maximize the expected discounted reward of the system over an infinite time horizon. Hence we define for a policy π = (f_0, f_1, ...)

J^N_π(x) := E^π_x [ Σ_{n=0}^∞ β^n (1/N) Σ_{i=1}^N r(X^i_n, A^i_n, µ[X_n]) ]  and  V_N(x) := sup_π J^N_π(x),

where β ∈ (0, 1) is a discount factor and E^π_x is the expectation w.r.t. P^π_x. V_N(x) is the maximal expected discounted reward over an infinite time horizon, given the initial configuration x of the individuals' states.
Remark 2.1. It is not difficult to see that V_N is symmetric, i.e. V_N(x) = V_N(σ(x)) for any permutation σ(x) of x, because the reward and the transition function are symmetric: r(x, a) = r(σ(x), σ(a)) and T(x, a, µ[x], Z, Z^0) = T(σ(x), σ(a), µ[σ(x)], Z, Z^0). This is a simple observation, but in the end it leads to the conclusion that it is only necessary to know how many individuals are in the different states.
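The symmetry observation in Remark 2.1 can be illustrated with a short sketch (not from the paper): the empirical measure µ[x] is invariant under permutations of the state vector, which is why it suffices as a state description.

```python
# Sketch: the empirical measure mu[x] = (1/N) * sum_i delta_{x^i} of a state
# vector x = (x^1, ..., x^N), and a check that it is permutation invariant.
from collections import Counter

def empirical_measure(x):
    """Return mu[x] as a dict mapping state -> weight."""
    n = len(x)
    return {s: c / n for s, c in Counter(x).items()}

x = (1, 3, 3, 2, 1, 3)
mu = empirical_measure(x)
mu_perm = empirical_measure(tuple(reversed(x)))  # a permutation of x
```

Since only the counts per state enter, any permutation of the individuals yields the same measure, mirroring V_N(x) = V_N(σ(x)).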
In what follows we introduce some notation.
From classical MDP theory we obtain:

Theorem 2.3. Assume (A0)-(A3). Then:
a) The value function V_N is the unique fixed point of the U-operator in M, i.e. it satisfies the optimality equation V_N = U V_N. There exists a maximizer of V_N, and every maximizer f* ∈ F of V_N defines an optimal stationary (deterministic) policy (f*, f*, ...).
The proof of this statement and all other longer proofs can be found in the appendix. We summarize the model data below.

Example 2.4. Suppose individuals move on a triangle. The state space is given by the nodes S = {1, 2, 3}. Admissible actions are adjacent nodes, i.e. D(x) consists of the two other nodes.
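As a small illustration of the triangle example, the following sketch builds the individual action sets and the joint admissible set D(x) = D(x^1) × ... × D(x^N). The assumption that D(x) consists exactly of the two other nodes (no staying put) is ours; adapt it if the model allows waiting.

```python
# Sketch of the triangle model data: S = {1, 2, 3}, D(x) = the adjacent
# nodes, and the joint admissible action set as a Cartesian product.
from itertools import product

S = (1, 2, 3)

def D(x):
    """Admissible actions for an individual in state x (assumed: the other two nodes)."""
    return tuple(s for s in S if s != x)

def joint_D(x_vec):
    """Joint admissible set D(x^1) x ... x D(x^N) for the state vector."""
    return list(product(*(D(xi) for xi in x_vec)))

actions = joint_D((1, 2))  # N = 2 individuals, at nodes 1 and 2
```

For two individuals the joint set has |D(1)| * |D(2)| = 4 elements, showing how the joint action space grows multiplicatively in N.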

The Mean-Field MDP
Suppose that N is large. Even if the state space S is small, the solution of the problem may not be computationally tractable any more, because S^N is large. We seek some simplifications. In particular we want to exploit the symmetry of the problem. In the last section we have seen that the empirical measure of the individuals' states is the essential information. Thus, we define P_N(S) as the new state space. Further we define the following sets: P_N(D) is the set of all probability measures on D which are empirical measures of N points. The set D(µ) consists of probability measures on D which are empirical measures of N points and whose first marginal distribution equals µ. We obtain the following result.
Lemma 3.1. Suppose a ∈ D(x) is an arbitrary action in state x ∈ S^N. Then there exists an admissible Q ∈ D(µ[x]) such that (1/N) Σ_{i=1}^N r(x^i, a^i, µ[x]) = ∫ r(x, a, µ[x]) Q(d(x, a)). Conversely, if Q ∈ D(µ[x]), then there exists an a ∈ D(x) s.t. Q is the empirical measure of the state-action pairs (x^i, a^i).
Proof. Let x and a ∈ D(x) be given and define Q := (1/N) Σ_{i=1}^N δ_{(x^i, a^i)}; then Q ∈ D(µ[x]) by construction. This lemma shows that instead of choosing actions a ∈ D(x) we can choose measures Q ∈ D(µ[x]), and µ = µ[x] is sufficient information which can replace the high-dimensional state x ∈ S^N. Intuitively this is clear from the fact that r(x, a) is symmetric (see Remark 2.1).
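The direction "action vector → measure" of Lemma 3.1 can be sketched as follows: the empirical measure of the state-action pairs is an admissible mean-field action, since its first marginal is µ[x]. The concrete state and action vectors below are illustrative.

```python
# Sketch of Lemma 3.1: lift (x, a) to the empirical measure Q on state-action
# pairs and verify that the first marginal of Q equals mu[x].
from collections import Counter

def empirical(points):
    n = len(points)
    return {p: c / n for p, c in Counter(points).items()}

x = (1, 1, 2, 3)
a = (2, 3, 3, 1)                  # assumed a^i in D(x^i)
Q = empirical(list(zip(x, a)))    # measure on state-action pairs
mu = empirical(list(x))           # mu[x]

# compute the first marginal of Q
marginal = {}
for (s, _), w in Q.items():
    marginal[s] = marginal.get(s, 0.0) + w
```

The check `marginal == mu` is exactly the admissibility condition Q ∈ D(µ[x]).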
We now consider a second MDP with the following data, which we call the mean-field MDP. The state space is P_N(S) and the action space is P_N(D). The one-stage reward r : D → R is given by the expression in Lemma 3.1, i.e.

r(µ, Q) = ∫ r(x, a, µ) Q(d(x, a)),

and the transition law T maps the current empirical measure and the chosen Q, together with the disturbances, to the empirical measure of the new states after a random transition. A policy is here denoted by ψ = (ϕ_0, ϕ_1, ...) with ϕ_n ∈ F, and we denote by (µ_n) the corresponding (random) sequence of empirical measures, i.e. µ_0 = µ and, for n ∈ N_0, µ_{n+1} = T(µ_n, ϕ_n(µ_n), Z_{n+1}, Z^0_{n+1}). We define an action as a joint probability distribution Q on state-action combinations instead of the conditional distribution on actions given the state. Both descriptions are equivalent, since for Q ∈ D(µ) we can disintegrate Q(d(x, a)) = Q(da|x) µ(dx), where Q(·|x) is the regular conditional probability. For short: Q = µ ⊗ Q. The advantage of using the joint distribution is that we have one object to define actions in all states. The disadvantage is that we need to formulate the restriction that the marginal distribution on the states coincides with µ.
We define the value function of the mean-field MDP in the usual way: for state µ ∈ P_N(S) and policy ψ = (ϕ_0, ϕ_1, ...) we set

J^N_ψ(µ) := E^ψ_µ [ Σ_{n=0}^∞ β^n r(µ_n, ϕ_n(µ_n)) ]  and  J_N(µ) := sup_ψ J^N_ψ(µ).

Finally, we show that the N-individual MDP and the mean-field MDP are equivalent.
Theorem 3.3. For all x ∈ S^N with µ[x] = µ we have V_N(x) = J_N(µ).

Proof. Note that µ_0 = µ = µ[x] by definition. Let a_0 = a ∈ D(x) be the first action taken in the N-individual MDP under an arbitrary policy. Then by Lemma 3.1 there exists an admissible Q_0 ∈ D(µ_0) with the same one-stage reward. By induction over time n it follows that a sequence of states and feasible actions (X_0, A_0, X_1, ...) in the N-individual MDP can be coupled with a sequence of states and feasible actions (µ_0, Q_0, µ_1, ...) for the mean-field MDP, and vice versa, such that the same sequence of disturbances is used. The corresponding policies may be history-dependent, but V_N = J_N follows, since it is well-known for MDPs that the maximal value is obtained when we restrict the optimization to Markovian policies.
As in Section 2 we define here a set and an operator for the mean-field MDP.
Due to Theorem 3.3 and Theorem 2.3 we obtain:

Assume (A0)-(A3). Then:
a) The value function J_N is the unique fixed point of the Û-operator in M, i.e. it satisfies the optimality equation J_N = Û J_N. c) There exists a maximizer of J_N, and every maximizer ϕ* ∈ F of J_N defines an optimal stationary policy (ϕ*, ϕ*, ...).
We summarize the model data below. In this example, the transition kernel mentioned in Remark 3.2 is given by the conditional distribution Q(da|x).

The Mean-Field Limit MDP
In this section we let N → ∞ in order to obtain some simplifications. This yields the so-called mean-field limit.
We thus consider a third MDP, the so-called limit MDP. We will later show that it is indeed the limit of the problems studied in the previous section. The limit MDP is defined by the following data: The state space is P(S) and the action space is P(D). We define D(µ) := {Q ∈ P(D) : the first marginal of Q equals µ}. The one-stage reward is r(µ, Q) := ∫ r(x, a, µ) Q(d(x, a)). The transition function is defined by T : D × Z → P(S) with

T(µ, Q, Z^0)(B) := ∫ p_{x,a,µ,Z^0}(B) Q(d(x, a)),  B ∈ B(S),

where p_{x,a,µ,Z^0}(B) := P(T(x, a, µ, Z^i, Z^0) ∈ B | Z^0) is the conditional probability that the next state is in B, given x, a, µ and the common noise random variable Z^0.
Remark 4.1. Recalling that Q ∈ D(µ) means Q = µ ⊗ Q, we can (with the help of Fubini's theorem) equivalently write (4.3) as

T(µ, Q, Z^0)(B) = ∫ P_{Q,µ,Z^0}(B|x) µ(dx),

where P_{Q,µ,Z^0}(dx'|x) = ∫_{D(x)} p_{x,a,µ,Z^0}(dx') Q(da|x). Hence P_{Q,µ,Z^0} is the transition kernel which determines the distribution at the next stage. In general it depends on Q, µ and the common noise Z^0.
A decision rule is here a measurable mapping ϕ from P(S) to P(D) such that ϕ(µ) ∈ D(µ) for all µ. We denote by F the set of all decision rules. Suppose that ψ = (ϕ_0, ϕ_1, ...) is a policy for the limit MDP. As in the previous section we set, for n ∈ N_0, µ_0 := µ and µ_{n+1} := T(µ_n, ϕ_n(µ_n), Z^0_{n+1}), which yields the sequence of distributions of individuals. Note that this sequence is deterministic if T does not depend on the common noise Z^0.
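For finite S and no common noise, the recursion µ_{n+1} = T(µ_n, ϕ_n(µ_n)) is a deterministic update of a probability vector, which the following sketch implements; the kernels are illustrative toy numbers, not taken from the paper.

```python
# Sketch of one deterministic mean-field step on a finite state space:
# mu_{n+1}(x') = sum_x sum_a p(x'|x, a) * Q(a|x) * mu_n(x).

def step(mu, Q_cond, p):
    """One mean-field transition for finite S (no common noise)."""
    nxt = {s: 0.0 for s in mu}
    for x in mu:
        for a, qa in Q_cond[x].items():
            for x2, prob in p[(x, a)].items():
                nxt[x2] += prob * qa * mu[x]
    return nxt

# toy example: S = {0, 1}, one action per state, a mixing kernel
Q_cond = {0: {"go": 1.0}, 1: {"go": 1.0}}
p = {(0, "go"): {0: 0.5, 1: 0.5}, (1, "go"): {0: 0.5, 1: 0.5}}
mu = {0: 1.0, 1: 0.0}
for _ in range(3):
    mu = step(mu, Q_cond, p)
```

Starting from a point mass, the toy kernel mixes the population to the uniform distribution in one step, and the sequence (µ_n) is deterministic exactly as stated above.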
Remark 4.4. We can use the established solution methods like value iteration, policy iteration, linear programs or reinforcement learning to numerically solve the limit MDP ([4, 10, 30]).
The limit problem can be seen as an approximation of the original model when N is large. In order to proceed, we need a more restrictive assumption than (A3): (A3') Z is compact and (x, a, µ, z, z^0) → T(x, a, µ, z, z^0) is continuous.
Remark 4.5. The assumption that Z is compact is not a strong one. Indeed, w.l.o.g. we may choose the disturbances to be uniformly distributed over [0, 1]. This is because if, for example, Z = R and F is the distribution function of Z, then Z has the same distribution as F^{-1}(U) for U uniformly distributed on [0, 1]. Then it is possible to prove the following limit result.

Theorem 4.6. Assume (A0), (A1), (A2') and (A3'). Let µ
In particular, the proof of part b) shows how to obtain an ε-optimal policy for the model with N individuals (N large) when we know the optimal policy for the limit MDP.

Remark 4.7.
a) In case there is no common noise, the limit MDP is completely deterministic. The optimality equation then reads

J(µ) = sup_{Q ∈ D(µ)} { r(µ, Q) + β J(T(µ, Q)) },  (4.7)

where T(µ, Q)(B) = ∫ p_{x,a,µ}(B) Q(d(x, a)) with p_{x,a,µ}(B) = P(T(x, a, µ, Z) ∈ B).

b) If there is no common noise and r and T do not depend on µ, we obtain as a special case a standard MDP. The usual optimality equation for this MDP (for one individual) would be

V(x) = sup_{a ∈ D(x)} { r(x, a) + β E V(T(x, a, Z)) },  x ∈ S,  (4.8)

The results in this paper show that we can equivalently consider the limit MDP, which implies the optimality equation (4.7). It is possible to show by induction that the relation between both value functions is given by J(µ) = ∫ V(x) µ(dx). Moreover, a maximizer of J is given by ϕ(µ) = µ ⊗ δ_{f*(x)} for some f* : S → A with f*(x) ∈ D(x), where f* is a maximizer of V. Here the choice of the conditional distribution Q* does not depend on µ and is concentrated on a single action.

c) The policy ψ_N which is constructed in Theorem 4.6 is deterministic, but has the disadvantage that individuals have to communicate. Another possibility is, given the state (x^1_N, ..., x^N_N), to simulate for each individual i an action a^i_N according to the kernel Q*(·|x^i_N). This is then a randomized policy, but it has the advantage that every individual can act on its own without information about the other states and actions. This is a decentralized control, i.e. f^i(x) = f^i(x^i). Also, the speed of the convergence in Theorem 4.6 depends on the chosen approximation method.
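The decentralized policy of part c) can be sketched as follows: every individual samples its action from the conditional kernel Q*(·|x^i) using only its own state, and by the law of large numbers the empirical action frequencies approach Q*. The kernel values below are illustrative assumptions, not from the paper.

```python
# Sketch of the decentralized randomized policy: each individual i draws
# a^i ~ Q*(.|x^i) independently, using only its own state.
import random

random.seed(0)
Q_star = {1: {"left": 0.3, "right": 0.7},
          2: {"left": 0.6, "right": 0.4}}   # illustrative kernel

def decentralized_actions(x_vec):
    """Sample one action per individual from Q*(.|x^i)."""
    acts = []
    for xi in x_vec:
        dist = Q_star[xi]
        acts.append(random.choices(list(dist), weights=dist.values())[0])
    return acts

N = 10_000
acts = decentralized_actions([1] * N)       # all individuals in state 1
freq_right = acts.count("right") / N        # should be near Q*(right|1) = 0.7
```

No individual needs the empirical distribution of the others; communication is only required if, as discussed later, the optimal measure is not unique.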
We summarize the model data below. Example 4.8. We reconsider Example 2.4. In the limit MDP a state can be any distribution on S.

Average Reward Optimality
In this section we consider the problem of finding the maximal average reward of the mean-field limit problem. So suppose a limit MDP as in the previous section (equation (4.6)) is given. For a fixed policy ψ = (ϕ_1, ϕ_2, ...) define

G_ψ(µ) := lim inf_{n→∞} (1/n) E^ψ_µ [ Σ_{k=0}^{n-1} r(µ_k, ϕ_k(µ_k)) ].   (5.1)

The problem is to find G(µ) := sup_ψ G_ψ(µ) for all µ ∈ P(S). We will construct the solution via the vanishing discount approach, see e.g. [34,33,19,3]. This has the advantage that we immediately get a statement about the approximation of the β-discounted problem by the average reward problem. For this purpose we denote by J_β, J_{β,ψ} the value functions of the discounted reward problem of the previous section, in order to stress that they depend on the discount factor β.
Proof. Using (A4) we obtain: The last term converges to zero when we choose (β_n) s.t. lim_{n→∞} β_n = 1 and lim_{n→∞} ρ(β_n) = ρ, which is possible due to the considerations preceding this lemma. The first term also tends to zero.
Note that part b) of the previous theorem states that it is possible to obtain an average reward optimal policy from optimal policies in the discounted model. What is maybe more interesting is the converse: from the average optimal policy we can construct ε-optimal policies for the limit MDP, and thus also for the N-individual problem if β is close to one. The idea is to use the double approximation (number of agents large, discount factor large) to approximate the discounted finite-agent model by the average reward mean-field problem. We do not tackle the question of convergence speed, or how β depends on N, here. A policy ψ is ε-optimal in state µ ∈ P(S) if J_β(µ) - J_{β,ψ}(µ) ≤ ε. Thus, we obtain:

Corollary 5.4. Under the assumptions of Theorem 5.3, suppose ψ* = (ϕ*, ϕ*, ...) is an optimal stationary policy for the average reward problem and ψ_N is constructed as in Theorem 4.6. Then for all ε > 0 and for all µ ∈ P(S) there exists a β(µ) < 1 such that
a) ψ* is ε-optimal for the limit MDP in state µ for all β ≥ β(µ), and
b) there exists an N(µ, β(µ)) ∈ N s.t. for all N ≥ N(µ, β(µ)) and β ≥ β(µ), the policy ψ_N is ε-optimal for the N-individual problem.
Proof. a) By Theorem 5.3 we know that ρ = G_{ψ*}(µ) is the maximal average reward. Lemma 5.1 and Theorem 5.3 together imply a chain of inequalities which must hold with equality everywhere. Since r is bounded, w.l.o.g. we may assume that r is bounded from below by some C > 0; otherwise we shift the function by a constant. Now for all ε > 0 we can choose, due to the preceding equality, a β(µ) valid for all β ≥ β(µ), which implies the statement.
5.1. Special Case I. We consider the following special case: the reward depends only on µ, i.e. we have r(µ, Q) = r(µ); the transition function is independent of µ; and there is no common noise, i.e. all individuals move independently of each other. Suppose µ* ∈ P(S) is the solution of the static optimization problem

max r(µ)  s.t. µ ∈ P(S),   (5.4)

which exists since r is continuous on the compact space P(S). In the described situation the limit MDP is deterministic, and the evolution of the state process for a given policy is, for k ∈ N,

µ_{k+1} = µ_k P_{Q_k},   (5.5)

where we start with the initial distribution µ_0. Now suppose further that there exists a transition kernel (policy) Q* such that µ* is a stationary distribution of P_{Q*} and P_{Q*} satisfies the Wasserstein ergodicity (see Appendix). Suppose further that (µ*_k) is the state sequence obtained from (5.5) when we replace P_{Q_k} by P_{Q*}. Then µ*_k ⇒ µ* weakly for k → ∞, since convergence in the Wasserstein metric implies weak convergence on compact sets. Problem (5.4) and the solution approach here are similar to the concept of steady-state policies in [12]. Lemma 5.5. Under the assumptions of this subsection, ϕ*(µ) = µ ⊗ Q* defines an average reward optimal stationary policy ψ* = (ϕ*, ϕ*, ...).
Proof. Since µ → r(µ) is continuous (see the proof of Theorem 4.3), we obtain lim_{k→∞} r(µ*_k) = r(µ*). Thus for all µ ∈ P(S) the average reward under ψ* equals r(µ*), which is maximal by the definition of µ*. Hence ψ* is average reward optimal.
We can thus think of the problem as having been transformed into a Markov Chain Monte Carlo problem: sample from µ*. In order to obtain an ε-optimal policy in the N-individual problem with a large discount factor, an individual in state x can sample its action from Q*(·|x) (see the proof of Theorem 4.6 and Remark 4.7 c)). This yields a decentralized decision which does not depend on the complete state of the system, i.e. the individuals do not have to communicate with each other in order to push the system to the social optimum; knowledge of their own state is sufficient. Problems may occur when the solution of (5.4) is not unique. Then the individuals have to communicate which solution is preferred. In particular, the individual's optimal decision coincides with the socially optimal decision. This is because we can interpret µ_k as the distribution of a typical individual at time k. Also note that in this case it can be shown that the required ergodicity assumption is satisfied, where W is the Wasserstein distance of two measures (see Appendix). We will give a more specific application in Section 6.
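The two-step recipe of this subsection can be sketched numerically: first solve the static problem (5.4) (here by a crude grid search over the simplex, for an illustrative reward whose maximizer is known, not the paper's reward), then build a Metropolis-type kernel whose stationary distribution is µ*.

```python
# Sketch of Special Case I: (i) static optimization over P(S),
# (ii) a Metropolis-type kernel with stationary distribution mu*.
S = [0, 1, 2]
target = [0.5, 0.3, 0.2]   # illustrative: r is maximized exactly at 'target'

def r(mu):
    return -sum((mu[i] - target[i]) ** 2 for i in S)

# (i) crude grid search over the simplex with step 0.1
best = max(
    ([i / 10, j / 10, (10 - i - j) / 10]
     for i in range(11) for j in range(11 - i)),
    key=r,
)

# (ii) Metropolis kernel: propose another state w.p. 1/2, accept with
# ratio min(1, mu*(y)/mu*(x)); detailed balance makes mu* stationary.
mu_star = best
P = [[0.0] * 3 for _ in S]
for x in S:
    for y in S:
        if y != x:
            P[x][y] = 0.5 * min(1.0, mu_star[y] / mu_star[x])
    P[x][x] = 1.0 - sum(P[x])

stat = [sum(mu_star[x] * P[x][y] for x in S) for y in S]   # mu* P
```

The check `stat == mu_star` (up to rounding) verifies that applying the kernel Q* forever keeps the population at the optimal measure, which is exactly what Lemma 5.5 exploits.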
5.2. Special Case II. We relax the previous case and allow the transition function to depend on µ. Again we first determine the solution µ* of (5.4). Next we check whether there exists a transition kernel (policy) Q* such that µ* is a stationary distribution of P_{Q*}, with P_{Q*}(B|x) = ∫ p_{x,a,µ*}(B) Q*(da|x) for x ∈ S, B ∈ B(S), and such that P_{Q*} satisfies the Wasserstein ergodicity. Here we need some further properties of the model to obtain the same result as in Case I, because we have to make sure that the system still converges to µ*, even if we choose the 'wrong' transition kernel at stage k. Note that the evolution of the state in this model is given by µ_{k+1} = µ_k P_{Q_k,µ_k}. In particular we want to find an optimal decentralized control. The following assumptions (T1)-(T5) will be useful. The next lemma states that under these assumptions the sequence (µ*_k) still converges to the optimal distribution µ*. Lemma 5.6. Under (T1)-(T5) we obtain W(µ*_{k+1}, µ*) ≤ γ W(µ*_k, µ*), and thus µ*_k ⇒ µ* weakly.
Lemma 5.6 then implies that even in this case the maximal average reward r(µ*) is achieved by applying Q* throughout the process, which corresponds to a decentralized control. An example where (T1), (T3), (T4) are fulfilled is T(x, a, µ, z) = γ_S x + γ_A a + γ_W ∫ x µ(dx) + z.
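The contraction in the linear example T(x, a, µ, z) = γ_S x + γ_A a + γ_W ∫ x µ(dx) + z can be checked numerically at the level of means: the mean m_k of µ_k follows the affine recursion m_{k+1} = (γ_S + γ_W) m_k + γ_A ā + E[z], which contracts when |γ_S + γ_W| < 1. All constants below are illustrative.

```python
# Sketch: geometric contraction of the population mean under the linear
# dynamics, with contraction factor gS + gW (assumed < 1 in modulus).
gS, gA, gW = 0.4, 0.1, 0.3     # gS + gW = 0.7 < 1, illustrative
a_bar, z_mean = 1.0, 0.0       # assumed constant mean action and noise mean

def next_mean(m):
    return (gS + gW) * m + gA * a_bar + z_mean

m_fix = gA * a_bar / (1.0 - (gS + gW))   # fixed point of the affine map
m = 10.0
gaps = []
for _ in range(5):
    gaps.append(abs(m - m_fix))
    m = next_mean(m)
```

Successive gaps shrink by the factor γ_S + γ_W each step, mirroring the Wasserstein contraction W(µ*_{k+1}, µ*) ≤ γ W(µ*_k, µ*) of Lemma 5.6 in this special case.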
Applications

6.1. Avoiding Congestion. We consider the following special case: N individuals move on a graph with nodes S = {1, ..., d} and edges E ⊂ {(x, x') : x, x' ∈ S}. Individuals can move along one edge in one time step. We assume that the graph is connected. The aim is to avoid congestion and to spread the individuals such that they keep a maximum distance. More precisely, suppose that the current empirical distribution of the individuals on the nodes is µ and that the distance between nodes x and x', x, x' ∈ S, is given by ∆(x, x') > 0, where ∆(x, x) = 0 and ∆(x, x') = ∆(x', x). Then the average distance between an individual at position x and all other individuals is

r(x, a, µ) := Σ_{x'} ∆(x, x') µ(x').

Here r(x, a, µ) does not depend on a. Hence

r(µ) = µ ∆ µ^⊤,

where ∆ = (∆(x, x'))_{x,x' ∈ S} is the matrix of distances. Note that ∆ is symmetric. We assume that A = S and D(x) = {x' ∈ S : (x, x') ∈ E} ∪ {x}, i.e. actions in the original model are neighbours on the graph. We interpret actions as intended directions the individual wants to move to, which may however be disturbed by some random external noise. In the mean-field limit the state of the system at time n is just given by a distribution on S. Recall that the general transition equation of the mean-field limit is

µ_{n+1}(x') = Σ_{x,a} p_{x,a,µ_n,z^0}(x') Q_n(a|x) µ_n(x)   (6.1)

if S, A are finite, where p_{x,a,µ,z^0}(x') = P(T(x, a, µ, Z, z^0) = x') and Q_n has first marginal µ_n. Problems where the reward decreases when more individuals share the same state are typical for mean-field problems; see e.g. [25], where a Wardrop equilibrium is computed. In [28] the authors consider spreading contamination on graphs.
6.1.1. No common noise. We now consider the mean-field limit. At the beginning let us assume that p_{x,a,µ,z^0} = p_{x,a} does not depend on µ and z^0, i.e. the individuals move on their own, not affected by others, and there is no common noise. Moreover, it is reasonable to let p_{x,a}(x') be positive only for x' ∈ D(x). With Q(a|x)µ(x) = Q(x, a), equation (6.1) can be written as µ_{n+1} = µ_n P_{Q_n}, where

P_{Q_n}(x'|x) = Σ_a p_{x,a}(x') Q_n(a|x).   (6.2)

Here it is more intuitive to work with the conditional probabilities Q(a|x) instead of the joint distribution Q(x, a).
Obviously the optimization problem

max µ∆µ⊤  s.t. µ ∈ P(S)   (6.3)

has an optimal solution µ* since P(S) is compact and µ ↦ µ∆µ⊤ is continuous. We consider the following special case: For a, x′ ∈ D(x) set p_{x,a}(x′) = α for a = x′ and p_{x,a}(x′) = (1 − α)/(|D(x)| − 1) else. All other probabilities are zero. I.e. if we choose a vertex a we will move there with probability α and move to any other admissible vertex with equal probability.
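The aggregated dynamics µ_{n+1} = µ_n P_{Q_n} are easy to simulate once P_Q is assembled from the individual kernels p_{x,a} and a policy Q(a|x). The following is a minimal sketch; the 4-cycle graph, the value of α and the uniform policy are our own illustrative choices, not taken from the text.

```python
import numpy as np

# Hypothetical 4-cycle graph; D(x) = {x} together with the neighbours of x.
S = 4
D = [[0, 1, 3], [1, 0, 2], [2, 1, 3], [3, 0, 2]]
alpha = 0.8

def p(x, a):
    # Individual kernel: go to the chosen vertex a with probability alpha,
    # to any other admissible vertex with equal probability.
    out = np.zeros(S)
    for xp in D[x]:
        out[xp] = alpha if xp == a else (1 - alpha) / (len(D[x]) - 1)
    return out

def P_of(Q):
    # Aggregate kernel P_Q(x, x') = sum_a p_{x,a}(x') Q(a|x), cf. (6.2).
    P = np.zeros((S, S))
    for x in range(S):
        for a in D[x]:
            P[x] += Q[x][a] * p(x, a)
    return P

# Uniform policy over admissible actions; iterate mu_{n+1} = mu_n P_Q.
Q = [{a: 1 / len(D[x]) for a in D[x]} for x in range(S)]
P = P_of(Q)
mu = np.array([1.0, 0.0, 0.0, 0.0])
for _ in range(100):
    mu = mu @ P
```

By symmetry of this toy graph the uniform policy drives the empirical distribution to the uniform distribution on the four nodes.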
Formally, for x ∈ S, action a ∈ D(x) = {x_1, . . ., x_m} (where x_i = x for one of the x_i's) and disturbance Z ∼ U[0, 1], the transition function in this example is given by partitioning [0, 1]: for a = x_j we set T(x, a, z) = x_j if z ∈ [0, α), and T(x, a, z) = x_i if z falls into the i-th of the remaining intervals of length (1 − α)/(m − 1), i ≠ j, i = 1, . . ., m.

Lemma 6.1. If µ*(x) > 0 for all x ∈ S and α is large enough, then there exists a Q* ∈ P(D) s.t. µ* = µ* P_{Q*}, i.e. µ* is a stationary distribution for the transition kernel P_{Q*} given in (6.2).
Proof. We use a construction similar to the Metropolis algorithm. For x, x′ ∈ S with (x, x′) ∈ E and x ≠ x′ let

P_{Q*}(x, x′) = κ min(1, µ*(x′)/µ*(x)),

and let P_{Q*}(x, x) collect the remaining mass. The parameter κ > 0 should be such that P_{Q*} is a transition matrix. Then the detailed balance equations µ*(x)P_{Q*}(x, x′) = µ*(x′)P_{Q*}(x′, x), x, x′ ∈ S, are satisfied and hence µ* is a stationary distribution of P_{Q*}. We now have to determine Q* s.t. P_{Q*} has the specified form. Let us fix x ∈ S. We have to solve (6.2) for Q*. We claim that (6.2) is solved for

Q*(x′|x) = (P_{Q*}(x, x′) − (1 − α)/(m − 1)) / (α − (1 − α)/(m − 1)),  x′ ∈ D(x), m = |D(x)|.   (6.4)

This can be seen by inserting (6.4) into (6.2): α Q*(x′|x) + (1 − α)/(m − 1) (1 − Q*(x′|x)) = P_{Q*}(x, x′). In order to have Q*(a|x) ∈ [0, 1] we have to make sure that α ≥ P_{Q*}(x, x′) for all x, x′ ∈ S and α ≥ 1/2.

Theorem 6.2. The optimal average reward policy for the limit model considered here is the stationary policy ψ* = (ϕ*, ϕ*, . . .) with ϕ*(µ) = µ ⊗ Q* with Q* from (6.4). Thus, for N large and β close to one, sampling actions from Q* is ε-optimal for the β-discounted problem with N individuals.
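The Metropolis-style construction in the proof is easy to check numerically: once the off-diagonal entries satisfy detailed balance, stationarity of µ* is automatic. A minimal sketch, where the 4-cycle graph, the target distribution and κ are our own illustrative choices.

```python
import numpy as np

# Hypothetical 4-cycle with target distribution mu*; both are our own choices.
S = 4
E = {(0, 1), (1, 2), (2, 3), (3, 0)}
E = E | {(b, a) for (a, b) in E}
mu_star = np.array([0.1, 0.2, 0.3, 0.4])

# Metropolis-style kernel: off-diagonal mass kappa * min(1, mu*(x')/mu*(x))
# on edges, the remaining mass on the diagonal; kappa must be small enough
# that every diagonal entry stays nonnegative.
kappa = 0.25
P = np.zeros((S, S))
for (x, xp) in E:
    P[x, xp] = kappa * min(1.0, mu_star[xp] / mu_star[x])
for x in range(S):
    P[x, x] = 1.0 - P[x].sum()
```

Detailed balance means the matrix diag(µ*) P is symmetric, which immediately gives µ* P = µ*.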
Proof. The statement follows from our previous discussions. Note that when we start with an arbitrary µ*_0, the sequence of distributions generated by µ*_{k+1} = µ*_k P_{Q*} converges to µ* since the matrix P_{Q*} is irreducible by construction and we have a finite state space. Thus, G_ψ(µ*_0) in (5.1) yields the same limit µ*∆(µ*)⊤ which is maximal since it solves (5.4).

Remark 6.3. It is tempting to say that for the discounted problem, once we have reached the stationary distribution after a transient phase, we know that the optimal policy is to choose Q* forever. However, there are only rare cases where the stationary distribution is reached after a finite number of steps (see e.g. [15]), so the transient phase will in most cases last forever.

Example 6.4. We consider a regular 3 × 3 grid, i.e. d = 9 (see Figure 1, left). We set the distance between nodes equal to 1 when there is only one edge between them. Nodes which are connected via 2 edges get the distance 1.4, when there are 3 edges in between 1.7, and finally we set the distance equal to 2.2 when there are 4 edges in between. This determines the distance matrix ∆. The optimal distribution of problem (5.4) is here given by µ* = 1/37 (…).
The masses are illustrated in Figure 1, right picture. The area of each circle is proportional to the corresponding value of µ*. We think of it as the proportion of individuals who occupy this node. We set α = 1 and κ = 0.25. Then we obtain from (6.4) that the optimal decision in every node is given by a transition kernel Q*(a|x) with entries b = 1/8 and c = 1/14. So using this decentralized decision throughout the process yields the maximal average reward. In Figure 2 we see the evolution of the system when all mass starts initially in node 1. The pictures show the distribution of the mass after 2, 4, 8, 16, 32 and 64 time steps. Note that sampling actions from Q* is also ε-optimal for the system when we have a finite but large number of individuals and β is close to one for the discounted reward criterion.

6.1.2. With common noise. Next we suppose that α depends on the common noise Z^0. In this case the maximal average reward which can be achieved is less than or equal to the one in the case without common noise, since the sequence of distributions is stochastic and may deviate from the optimal one. We simplify things a little bit since we assume here that |D(x)| = γ independent of x.
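The distance matrix of Example 6.4 and the static problem (6.3) can be explored numerically. In the following sketch the distance values 1, 1.4, 1.7, 2.2 are taken from the example, while the exponentiated-gradient solver, its step size and iteration count are our own choices; since µ ↦ µ∆µ⊤ need not be concave, the sketch only finds a local maximizer, which we compare against the uniform distribution.

```python
import numpy as np

# 3x3 grid, nodes indexed by coordinates; the graph distance between two
# nodes equals their Manhattan distance, mapped to the distances of the text.
coords = [(i, j) for i in range(3) for j in range(3)]
dist_map = {0: 0.0, 1: 1.0, 2: 1.4, 3: 1.7, 4: 2.2}
Delta = np.array([[dist_map[abs(a - c) + abs(b - d)]
                   for (c, d) in coords] for (a, b) in coords])

def objective(mu):
    # Average pairwise distance mu Delta mu^T as in (6.3).
    return float(mu @ Delta @ mu)

# Exponentiated-gradient ascent on the probability simplex.
mu = np.full(9, 1 / 9)
for _ in range(20000):
    mu = mu * np.exp(0.01 * 2 * Delta @ mu)   # gradient of mu Delta mu^T is 2 Delta mu
    mu /= mu.sum()
```

The multiplicative update keeps the iterate strictly inside the simplex; with a small step size the objective cannot drop below its value at the uniform starting point.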
Figure 2. Distribution of the individuals using the optimal randomized decision when all start in node 1, after n = 2, 4, 8, 16, 32 and 64 time steps.

From the previous section, equation (6.5), we know that we can write P_Q in matrix notation using the d × d matrix U containing ones only and Q̃ = (Q̃(x′|x)). Here the situation is more complicated; in particular, the next empirical distribution of the individuals is stochastic. Plugging it into the reward function and maximizing leads to a static problem which obviously has an optimal solution ν*, since we maximize a continuous function over a compact set. Now ν corresponds to µ_n Q̃_n in (6.6). In case it is possible to choose for all µ ∈ P(S) a matrix Q̃ s.t. µQ̃ = ν*, then this would be the optimal strategy, since we would obtain the maximal expected reward in each step. This is for example possible if the graph is complete.
Then we can simply choose Q̃ as the matrix with identical rows which consist of ν*. In what follows, in order to simplify the computation, we choose d(x, y) = ∥x − y∥² for x, y ∈ S.
We want to solve (5.4) in this case. Let us formulate the problem with the help of random variables. Let X = (X_1, X_2), Y = (Y_1, Y_2) be independent r.v. having distribution µ. Then r(µ) is the same as

E∥X − Y∥² − E∥X − A∥² = Σ_{i=1,2} ( E(X_i − Y_i)² − E(X_i − A_i)² ).

Thus, we can treat the margins separately and the dependence between them is not interesting for the reward. Now obviously, since X and Y both have the same distribution, we can write E(X_i − Y_i)² = 2 Var(X_i). Suppose we fix EX_i for a moment. Since x ↦ x² is convex, the distribution which maximizes the expression is maximal in convex order, given the fixed expectation. But this distribution is, due to the convexity property, concentrated on the endpoints of the interval. Thus we can restrict to random variables X_1 which have mass p ∈ [0, 1] on B_1 and 1 − p on C_1, i.e. we maximize

f(p) = 2p(1 − p)(C_1 − B_1)² − p(B_1 − A_1)² − (1 − p)(C_1 − A_1)².

The solution is given by the maximizer p* of this concave quadratic, clipped to [0, 1]; the second margin is treated analogously. Since the joint distribution does not matter we can choose independent margins and obtain µ* as the product of the two two-point distributions. This is the target distribution which should be attained. For a numerical example we choose B = (0, 0), C = (4, 0), D = (0, 3), E = (4, 3) and A = (2.5, 2). The resulting distribution is illustrated in Figure 3 (right). Depending on how the transition law precisely looks, if one is able to choose Q* such that µ* is the stationary distribution of P_{Q*}, the problem is solved. Of course the optimal distribution µ* depends on what kind of distance d we choose. Varying the metric for the distance leads to interesting optimization problems.
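The per-margin problem admits a closed form. The following sketch assumes the two-point distribution with mass p on the left endpoint b and 1 − p on the right endpoint c, squared-distance cost and vendor coordinate a; the helper names and the derived maximizer formula are ours, so the resulting numbers are illustrative rather than quoted from the text.

```python
# Per-margin objective for mass p at b and 1 - p at c with vendor coordinate a:
#   f(p) = 2 (c - b)^2 p (1 - p) - p (b - a)^2 - (1 - p) (c - a)^2,
# a concave quadratic, so the maximizer (clipped to [0, 1]) is
#   p* = 1/2 + ((c - a)^2 - (b - a)^2) / (4 (c - b)^2).
def f(p, b, c, a):
    return 2 * (c - b) ** 2 * p * (1 - p) - p * (b - a) ** 2 - (1 - p) * (c - a) ** 2

def optimal_p(b, c, a):
    p = 0.5 + ((c - a) ** 2 - (b - a) ** 2) / (4 * (c - b) ** 2)
    return min(1.0, max(0.0, p))

# Numerical example from the text: B=(0,0), C=(4,0), D=(0,3), E=(4,3), A=(2.5,2);
# the margins are treated independently.
p_x = optimal_p(0.0, 4.0, 2.5)   # mass placed at x = 0 vs x = 4
p_y = optimal_p(0.0, 3.0, 2.0)   # mass placed at y = 0 vs y = 3
```

Under these assumptions the sketch yields p_x = 0.4375 and p_y = 5/12; slightly less than half of the mass sits on the side of the rectangle farther from the vendor, which matches the intuition that closeness to A is traded off against spread.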

Conclusion
We have seen that the average reward mean-field problem can in some cases be solved rather easily by computing an optimal measure from a static optimization problem. The policy which is obtained in this way is ε-optimal for the β-discounted N-individuals problem when N is large and β close to one. The static optimization problem for measures gives rise to some interesting mathematical questions.

8.2. Wasserstein Ergodicity. For the following definitions and results see [31].

Definition 8.2. For two probability measures µ, ν on S, the dual representation of the Wasserstein distance is given by

W(µ, ν) = sup { ∫ h dµ − ∫ h dν : h : S → R with |h(x) − h(y)| ≤ |x − y| for all x, y ∈ S }.

Note that convergence in the Wasserstein metric implies weak convergence when we are on compact sets.
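In one dimension the value of the dual problem has the well-known closed form ∫ |F_µ(t) − F_ν(t)| dt, which makes the definition easy to experiment with. A minimal numerical sketch; the support and the two measures are our own toy choices.

```python
import numpy as np

# Two toy probability measures on the common 1-D support {0, 1, 2, 3}.
support = np.array([0.0, 1.0, 2.0, 3.0])
mu = np.array([0.4, 0.3, 0.2, 0.1])
nu = np.array([0.1, 0.2, 0.3, 0.4])

# 1-D closed form of the Wasserstein-1 distance (the value of the dual
# problem over 1-Lipschitz test functions): the L1 distance of the
# distribution functions, W(mu, nu) = integral of |F_mu(t) - F_nu(t)| dt.
W = float(np.sum(np.abs(np.cumsum(mu) - np.cumsum(nu))[:-1] * np.diff(support)))
```

For these two measures the CDF gaps are 0.3, 0.4 and 0.3 on unit intervals, so W = 1.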
The first term converges to zero due to Assumption (A2') and Lemma 8.1. The second term converges to zero since Q_n ⇒ Q for n → ∞ and (A2'). Boundedness follows from the boundedness of r.
Next we show (iii). Boundedness is clear. In order to show continuity we first consider the mapping in (8.2) for fixed z_0. We claim that this mapping is continuous. Let h : S → R be continuous and bounded. By P^Z we denote the distribution of the r.v. Z_n^i. We have to show convergence of the corresponding integrals. In the first term we can interchange the limit lim_{n→∞} and the integral due to dominated convergence and obtain convergence due to (A2') and Lemma 8.1. The second term converges to zero for n → ∞ since (x, a) ↦ h(T(x, a, µ, z, z_0)) is continuous due to (A3). In total we have shown that the mapping in (8.2) is continuous. Finally take v ∈ M and pick a sequence with (µ_n, Q_n) → (µ, Q) for n → ∞. We obtain, with dominated convergence, the continuity of v and the continuity of (8.2), the stated continuity of (µ, Q) ↦ Ev(T̃(µ, Q, Z^0)). Now Proposition 2.4.8 in [2] implies that Ũ : M → M.
The next condition in Theorem 7.3.5 of [2] is that Ũ is contracting on M. But this follows along the same lines as the proof of Theorem 2.3. Finally, the existence of maximizers, which is another assumption in Theorem 7.3.5 of [2], follows again from Proposition 2.4.8 in [2].
In total the statement is a consequence of Theorem 7.3.5 in [2] with the set M.
8.3.4. Proof of Theorem 4.6. We partition the proof into three steps.

Step 1: Suppose (µ^N, Q^N) → (µ, Q). Further, suppose we fix ω ∈ Ω and consider a realization z^N = (z_1^N, . . ., z_N^N) of (Z_1^N, . . ., Z_N^N) and z_0 of Z_1^0. We show that T̃(µ^N, Q^N, z^N, z_0) ⇒ T̃(µ, Q, z_0) where µ is the first margin of Q. In order to show this let h : S → R be bounded and continuous. Since h, T are continuous, D, Z are compact and µ^N ⇒ µ, we can for all ε > 0 choose N large enough s.t. the first term in (8.3) is smaller than ε; hence the first term in (8.3) converges to zero for N → ∞. Let µ_z^N be the empirical measure of z^N. Since Q^N ⊗ µ_z^N ⇒ Q ⊗ P^Z for N → ∞ by the Glivenko-Cantelli Theorem, the r.h.s. of (8.4) converges to ∫∫ h(T(x, a, µ, z, z_0)) Q(d(x, a)) P^Z(dz) = ∫ h(y) T̃(µ, Q, z_0)(dy). Thus, we get T̃(µ^N, Q^N, Z^N, Z^0) ⇒ T̃(µ, Q, Z^0) P-a.s. In the proof of Theorem 4.3 we have shown that this implies lim_{N→∞} r(µ^N, Q^N) = r(µ, Q).
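The Glivenko-Cantelli argument used in Step 1 can be visualized numerically: the empirical measure of N iid draws approaches the underlying law. A sketch in Wasserstein-1 distance for a uniform law on [0, 1]; the law, seed and grid size are our own illustrative choices.

```python
import numpy as np

# Empirical measures of N iid draws from U[0,1] converge to U[0,1]
# (Glivenko-Cantelli); we measure the distance in Wasserstein-1 using the
# 1-D closed form W1 = integral of |F_N(t) - t| dt on a fine grid.
rng = np.random.default_rng(0)

def w1_to_uniform(n):
    z = np.sort(rng.uniform(0.0, 1.0, size=n))
    grid = np.linspace(0.0, 1.0, 2001)
    F_n = np.searchsorted(z, grid, side="right") / n   # empirical cdf at grid
    return float(np.mean(np.abs(F_n - grid)))          # approximates the integral

d_small, d_large = w1_to_uniform(50), w1_to_uniform(50_000)
```

The distance shrinks roughly like N^(-1/2), so the large sample is an order of magnitude closer to the limiting law than the small one.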
Step 2: Suppose ψ^N = (ϕ_0^N, ϕ_1^N, . . .) is an arbitrary policy for the MDP. The induced state-action measures form a sequence on the compact space D. Hence there is a subsequence (m_N) s.t.
it converges. When we consider the first L ∈ N transitions in that way, we find a joint subsequence (for convenience still denoted by (m_N)) s.t. the corresponding state-action measures converge for N → ∞ P-a.s.
and where the limit is by construction an admissible state-action sequence for the MDP. This is because the subsequences are taken such that the limits satisfy Q_n ∈ P(D), the first margin of Q_n is µ_n, and finally because of (8.5), which is by induction not only satisfied for time point one, but also for n = 1, . . ., L. Hence the rewards converge along this subsequence. Since |r| ≤ C we can choose L large enough s.t. the remaining discounted tail is arbitrarily small.
This implies lim sup_{N→∞} J_N(µ_0^N) ≤ J(µ_0).

Step 3: We finally have to show that we can construct from ϕ* a policy ψ^N = (ϕ_0^N, ϕ_1^N, . . .) s.t. lim sup_{N→∞} J_N(µ_0^N) = J(µ_0). This proves a) and b). Suppose ϕ*(µ_0) = Q_0^*. It is possible to construct a sequence Q_0^N ∈ P^N(D) s.t. Q_0^N ⇒ Q_0^* and µ_0^N is the first margin of Q_0^N. This can be done as follows: Suppose Q_0^* = µ_0 ⊗ Q̃_0. Then µ_0^N ⇒ µ_0 by assumption and we set Q_0^N = µ_0^N ⊗ Q̃_0^N, where the kernel Q̃_0^N is an appropriate discretization of Q̃_0 (e.g. by quantization or quasi-Monte Carlo methods). Applying the results in Step 1 we obtain lim_{N→∞} r(µ_0^N, Q_0^N) = r(µ*, Q*) and µ_1^N = T̃(µ_0^N, Q_0^N, Z_1, Z_1^0) ⇒ T̃(µ*, Q*, Z_1^0) = µ_1^* P-a.s. Continuing in that way as in Step 1 we can attain the upper bound J(µ_0) in the limit. In order to implement this strategy the central controller has to know Q_n^* or µ_n^* at time n. If there is no common noise, then the sequence (µ_0^*, Q_0^*, µ_1^*, Q_1^*, . . .) is deterministic and we only have to know the time step n, so the policy is non-stationary. If common noise is present, in order to know Q_n^* the central controller has to keep track of the history (Z_1^0, Z_2^0, . . .), so the policy ψ^N is history-dependent. However, we know from MDP theory that such a policy can always be dominated by a Markovian policy, which yields the statements of the theorem.

Here d is a metric on P(S). Note that h is a limit of bounded, continuous functions which are decreasing in n and is thus at least upper semicontinuous.

Definition 2.2. Let us define: a) The set M := {v : S^N → R | v is bounded and upper semicontinuous}. b) The operator U on M.

which proves the first statement. For the converse, suppose Q ∈ D(µ[x]). By definition this implies that there exists a ∈ D(x) s.t. Q = µ[(x, a)]. Using this relation, (3.1) follows.

Definition 3.4. Let us define: a) The set M := {v : P^N(S) → R | v is bounded and upper semicontinuous}. b) The operator Û on M.

6.2. Positioning on a Market Place. Suppose we have a rectangular market place like in Figure 3. The state µ represents the distribution of individuals over the market place. Point A is an ice cream vendor. The aim of the individuals is to keep distance from the others and be as close as possible to the ice cream vendor. Thus, S ⊂ R² is the rectangle BCED and the one-stage reward is

r(µ) = ∫∫ d(x, y) µ(dx)µ(dy) − ∫ d(x, A) µ(dx).
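This reward is easy to approximate by Monte Carlo for any candidate distribution. A sketch with the squared-distance cost d(x, y) = ∥x − y∥² used later in the section; the uniform distribution on the rectangle, the sample size and the seed are our own illustrative choices.

```python
import numpy as np

# Monte Carlo sketch of the one-stage reward
#   r(mu) = integral integral d(x, y) mu(dx) mu(dy) - integral d(x, A) mu(dx)
# with d(x, y) = ||x - y||^2, evaluated for the (hypothetical) uniform
# distribution on the rectangle [0, 4] x [0, 3] with vendor A = (2.5, 2).
rng = np.random.default_rng(0)
A = np.array([2.5, 2.0])
n = 200_000
X = rng.uniform([0, 0], [4, 3], size=(n, 2))
Y = rng.uniform([0, 0], [4, 3], size=(n, 2))

spread = np.mean(np.sum((X - Y) ** 2, axis=1))   # E||X - Y||^2 = 2 tr Cov(X)
vendor = np.mean(np.sum((X - A) ** 2, axis=1))   # E||X - A||^2
r_mu = spread - vendor
```

For the uniform distribution the closed-form value is tr Cov(X) − ∥EX − A∥² = 25/12 − 1/2 = 19/12 ≈ 1.583, which the Monte Carlo estimate should reproduce up to sampling error.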

Figure 3. Market place with ice cream vendor (left). Optimal distribution in example (right).

Definition 8.3. A transition kernel P(•|x) from S to S is called Wasserstein ergodic when there exist constants ρ ∈ (0, 1) and C > 0 s.t. for all n ∈ N

sup_{x,y∈S, x≠y} W(P^n(•|x), P^n(•|y)) / |x − y| ≤ Cρ^n.

Suppose P is Wasserstein ergodic and has stationary distribution µ*, which means that µ* = ∫ P(•|x)µ*(dx) =: µ*P. Then for any µ_0 ∈ P and µ_n = µ_0 P^n we obtain W(µ_n, µ*) ≤ Cρ^n.

8.3.1. Additional Proofs. Proof of Theorem 2.3. We first show that U : M → M. Hence, let v ∈ M. Since r and v are bounded, U v is bounded. (A2) implies that (x, a) ↦ r(x, a) is upper semicontinuous. This follows since (x_n, a_n) → (x, a) for n → ∞ implies x_n^i → x^i, a_n^i → a^i, i = 1, . . ., N, and µ[x_n] → µ[x] (in the weak topology) for n → ∞. Moreover, the sum of upper semicontinuous functions is upper semicontinuous. And finally, due to (A3) and the fact that v is upper semicontinuous, (x, a) ↦ E v(T(x, a, µ[x], Z, Z^0)) is upper semicontinuous. This together implies that

(x, a) ↦ r(x, a) + E v(T(x, a, µ[x], Z, Z^0))   (8.1)

is upper semicontinuous and U : M → M follows from Proposition 2.4.3 in [2]. Next note that M together with the sup-norm ∥v∥ = sup_{x∈S^N} |v(x)| is a Banach space. Also 0 ∈ M, which is the function identically equal to zero. Moreover for v, w ∈ M:
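Definition 8.3 can be probed numerically on a small chain. A sketch, assuming a hypothetical lazy random walk on three real-valued states (our own choice); the 1-D closed form of W is used for the distances between the rows of P^n.

```python
import numpy as np

# Small chain on the states {0, 1, 2} embedded in R: a lazy random walk.
P = np.array([[0.6, 0.4, 0.0],
              [0.2, 0.6, 0.2],
              [0.0, 0.4, 0.6]])
support = np.array([0.0, 1.0, 2.0])

def W1(mu, nu):
    # 1-D closed form: integral of |F_mu(t) - F_nu(t)| dt.
    return float(np.sum(np.abs(np.cumsum(mu) - np.cumsum(nu))[:-1]
                        * np.diff(support)))

def coeff(n):
    # Wasserstein contraction coefficient sup_{x != y} W(P^n(.|x), P^n(.|y)) / |x - y|.
    Pn = np.linalg.matrix_power(P, n)
    return max(W1(Pn[x], Pn[y]) / abs(support[x] - support[y])
               for x in range(3) for y in range(3) if x != y)

ratios = [coeff(n) for n in (1, 2, 4, 8)]
```

For this chain the coefficient decays geometrically (coeff(1) = 0.6, coeff(2) = 0.36), illustrating the Cρ^n bound of the definition.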
A transition kernel P (•|x) from S to S is called Wasserstein ergodic when there exist constants ρ ∈ (0, 1) and C > 0 s.t. for all n ∈ N sup x,y∈S,x =y W (P n (•|x), P n (•|y)) |x − y| ≤ Cρ n .Suppose P is Wasserstein ergodic and has stationary distribution µ * which means that µ * = P (•|x)µ * (dx) =: µ * P .Then for any µ 0 ∈ P and µ n = µ 0 P n we obtain W (µ n , µ * ) ≤ Cρ n .Additional Proofs.Proof of Theorem 2.We first show that U :M → M. Hence, let v ∈ M. Since r and v are bounded, U v is bounded.(A2) implies that (x, a) → r(x, a) is upper semicontinuous.This follows since (x n , a n ) → (x, a) for n → ∞ implies x i n → x i , a i n → a i , i = 1, . . ., N and µ[x n ] → µ[x](in weak topology) for n → ∞.Moreover, the sum of upper semicontinuous functions is upper semicontinuous.And finally due to (A3) and the fact that v is upper semicontinuous(x, a) → E v T(x, a, µ[x], Z, Z 0 ) is upper semicontinuous.This together implies that (x, a) → r(x, a) + E v T(x, a, µ[x], Z, Z 0 ) (8.1)is upper semicontinuous and U : M → M follows from Proposition 2.4.3 in[2].Next note that M together with the sup-norm v = sup x∈S N |v(x)| is a Banach space.Also 0 ∈ M which is the function identical to zero.Moreover for v, w ∈ M: