Optimal bailout strategies resulting from the drift controlled supercooled Stefan problem

We consider the problem faced by a central bank which bails out distressed financial institutions that pose systemic risk to the banking sector. In a structural default model with mutual obligations, the central agent seeks to inject a minimum amount of cash in order to limit defaults to a given proportion of entities. We prove that the value of the central agent's control problem converges as the number of defaultable institutions goes to infinity, and that it satisfies a drift controlled version of the supercooled Stefan problem. We compute optimal strategies in feedback form by solving numerically a regularized version of the corresponding mean field control problem using a policy gradient method. Our simulations show that the central agent's optimal strategy is to subsidise banks whose equity values lie in a non-trivial time-dependent region.


Introduction
In this paper, we analyse a simple mathematical model for a central bank that optimally injects cash into a banking system with interbank lending in order to prevent systemic default events. By way of introduction, we first review known results on the dynamics without intervention and its relation to the supercooled Stefan problem. We then describe the optimisation problem faced by the central agent and discuss its setting within the literature on Mean Field Control (MFC) problems together with this paper's contributions.

Interbank lending and the supercooled Stefan problem
We study a market with N financial institutions and their equity value processes X = (X^i_t) for t ∈ [0, T], with finite time horizon T > 0 and i = 1, . . . , N. We interpret X^i in the spirit of structural credit risk models as the value of assets minus liabilities. Hence, we consider an institution to be defaulted if its equity value hits 0. We refer the reader to [49] for the classical treatment as well as to [40] and the references therein for a discussion of such models in the present multivariate context with mutual obligations.
We consider specifically a stylised model of interbank lending where all firms are exchangeable, their equity values are driven by Brownian motion, and where the default of one firm leads to a uniform downward jump in the equity value of the surviving entities. The latter effect is the crucial mechanism for credit contagion in our model as it describes how the default of one firm affects the balance sheet of others. Here, we follow [37,50,48] in assuming that the X^i satisfy

X^i_t = X^i_{0−} + B^i_t − (α/N) ∑_{j=1}^N 1_{τ^j ≤ t},   t < τ^i,   (1)

where τ^i = inf{t : X^i_t ≤ 0}, the X^i_{0−} are non-negative i.i.d. random variables, (B^i)_{1≤i≤N} is an N-dimensional standard Brownian motion, independent of X_{0−} = (X^i_{0−})_{1≤i≤N}, and α ≥ 0 is a given parameter measuring the interconnectedness in the banking system. The initial condition reflects the current state of the banking system. This might include minimal capital requirements prescribed by the regulator as conditions to enter the banking system, but we do not consider this question explicitly.
Even this highly stylised system produces complex behaviour for large pools of firms, including systemic events where cascades of defaults caused by interbank lending instantaneously wipe out significant proportions of the firm pool (see [37,50,30]).
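As a minimal illustration of these cascades (not the paper's numerical scheme), one can simulate the particle system above with an Euler scheme, resolving at each time step the chain of defaults triggered by the contagion jumps. All parameter values and the exponential initial law below are illustrative choices of ours.

```python
import numpy as np

def simulate_pool(N=500, T=1.0, dt=1e-3, alpha=1.0, x0_mean=0.5, seed=0):
    """Euler scheme for the interacting pool: each default knocks every
    survivor down by alpha/N; cascades are resolved within the step."""
    rng = np.random.default_rng(seed)
    X = rng.exponential(x0_mean, size=N)      # illustrative initial equity values
    alive = np.ones(N, dtype=bool)
    n_steps = int(T / dt)
    loss = np.empty(n_steps)
    for k in range(n_steps):
        X[alive] += np.sqrt(dt) * rng.standard_normal(alive.sum())
        # resolve the default cascade triggered within this step
        while True:
            newly = alive & (X <= 0.0)
            if not newly.any():
                break
            alive &= ~newly
            X[alive] -= alpha * newly.sum() / N   # contagion jump for survivors
        loss[k] = 1.0 - alive.mean()              # fraction of defaulted banks
    return loss

loss = simulate_pool()
```

For large α the inner cascade loop can absorb a sizeable fraction of the pool in a single step, which is the discrete counterpart of the blow-ups discussed below.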
One way of analysing this is to pass to the mean-field limit N → ∞. It is known (see, e.g., [29]) that the interaction (contagion) term in (1) converges (in an appropriate sense) to a deterministic function Λ : [0, T] → [0, α] as N → ∞, i.e.,

(α/N) ∑_{j=1}^N 1_{τ^j ≤ t} → Λ_t.

Moreover, the X^i are asymptotically independent with the same law as a process X which together with Λ satisfies a probabilistic version of the supercooled Stefan problem, namely

X_t = X_{0−} + B_t − Λ_t,   t < τ,   τ = inf{t ≥ 0 : X_t ≤ 0},   (2)

where Λ is subject to the constraint

Λ_t = α P(τ ≤ t).   (3)

Here, B is a standard Brownian motion independent of the random variable X_{0−}, which has the same law as all X^i_{0−}. We refer to [30] for a discussion on how this probabilistic formulation relates to the classical PDE version of the supercooled Stefan problem.
From a large pool perspective (see [37,50]), X_t may be viewed as the equity value of a representative bank and τ = inf{t ≥ 0 : X_t ≤ 0} as its default time, while Λ_t describes the interaction with other institutions under the assumption of uniform lending and exchangeable dynamics. In particular, P(inf_{0≤s≤t} X_s ≤ 0) can be interpreted as the fraction of defaulted banks at time t and consequently Λ_t as the loss that the default of these entities has caused for the survivors.
It is known that solutions to (2), (3) are not unique in general (see [29,27]), which explains the need to single out so-called physical solutions that are meaningful from an economic and physical perspective. Under appropriate conditions on X_{0−}, these physical solutions are characterised by open intervals on which t ↦ Λ_t is smooth, separated by points at which this dependence may only be Hölder continuous or even exhibit a discontinuity, an event frequently referred to as blow-up (see [30]). If the mean of the initial values is close enough to zero relative to the interaction parameter α, a jump necessarily happens (see [37]).
In case a discontinuity does occur at some t ≥ 0, the following restriction on the jump size defines such a physical solution:

Λ_t − Λ_{t−} = inf{x > 0 : P(τ ≥ t, X_{t−} ∈ (0, x]) < x/α},   (4)

with Λ_{t−} := lim_{s↑t} Λ_s and X_{t−} := lim_{s↑t} X_s. By the results of [30], the above condition on the jumps of Λ uniquely defines a solution to (2) and (3) under some restrictions on the initial condition X_{0−}. For future reference, we also introduce the concept of minimal solutions, which we know to be physical whenever the initial condition is integrable (see [27]). We call a solution Λ minimal if, for any other process X̄ that satisfies (2) with loss process Λ̄ given by (3), we have Λ_t ≤ Λ̄_t for all t ≥ 0.   (5)
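The jump-size condition can be evaluated numerically from an empirical distribution of pre-jump equity values: the physical jump is the smallest x at which the mass in (0, x] drops below x/α. A grid-based sketch (function name, grid step, and test distributions are ours):

```python
import numpy as np

def physical_jump(positions, alpha, grid_step=1e-4):
    """Smallest x >= 0 with P(X_{t-} in (0, x]) < x / alpha, evaluated on a
    grid; `positions` holds the pre-jump equity values of all banks, with
    survivors at positive values."""
    vals = np.asarray(positions, dtype=float)
    n_total = len(vals)
    pos = np.sort(vals[vals > 0.0])                 # surviving banks only
    xs = np.arange(0.0, alpha + grid_step, grid_step)
    # empirical mass in (0, x] relative to the whole pool
    mass = np.searchsorted(pos, xs, side="right") / n_total
    below = mass < xs / alpha
    return float(xs[np.argmax(below)]) if below.any() else float(xs[-1])
```

If little mass sits near zero, the infimum is attained immediately and the jump is (numerically) zero; a jump of macroscopic size emerges only when enough mass piles up just above the boundary.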
Note that, by combining the results of [27] and [30], the minimal solution is the unique physical one, and thus the economically meaningful one, if the initial condition satisfies the assumptions of [30].

The central agent's optimisation problem
The purpose of this paper is to analyse strategies that a central bank (central agent) can take to limit the number of defaults. It achieves this by controlling the rate of capital injected into distressed institutions. That is to say, rather than bailing out firms which are already defaulted, the central agent intervenes already ahead of the time their equity values become critical. The rate of capital received by bank i is determined by processes β^i and added to (1). In the finite-dimensional situation, the X^i then satisfy

X^i_t = X^i_{0−} + ∫_0^t β^i_s ds + B^i_t − (α/N) ∑_{j=1}^N 1_{τ^j ≤ t},   t < τ^i.   (6)

A mathematically similar problem has been studied in [58]. There the question of finding an optimal drift in order to maximize the number of Brownian particles that stay above 0 is treated, however without the singular interaction term appearing in (6).
In anticipation of a propagation of chaos result (proved in Section 2), we therefore consider an extension of (2) and (3) with a drift process β, i.e.,

X_t = X_{0−} + ∫_0^t β_s ds + B_t − Λ_t,   t < τ,   τ = inf{t ≥ 0 : X_t ≤ 0},   (7)

where

Λ_t = α P(τ ≤ t).   (8)

Throughout the paper we will impose the constraint 0 ≤ β_t ≤ b_max, which amounts to the assumption that at any point in time the central agent has limited resources for the capital injections. We will specify further technical conditions on β later, which allow us to show that the finite system indeed converges in a suitable sense to this McKean-Vlasov equation.
We now consider a central agent who injects capital into a representative bank at rate β_t at time t in order to keep

L_{T−}(β) := (1/α) Λ_{T−}(β),

that is the proportion of defaults that occur strictly before a given time T (see footnote 2), below a specified threshold δ, while minimising the expected total cost

C_T(β) := E[∫_0^T β_t dt].

We therefore consider the following constrained optimisation problem: for given δ, the central agent solves

min_β C_T(β)   subject to   L_{T−}(β) ≤ δ.   (9)

Define now for γ ∈ R_+ the Lagrange function L(β, γ) = C_T(β) + γ(L_{T−}(β) − δ) and use it to express the constrained optimization problem as an unconstrained one, namely min_β max_{γ∈R_+} L(β, γ), which holds true since max_{γ∈R_+} γ(L_{T−}(β) − δ) equals 0 if the constraint is satisfied and +∞ otherwise. Assuming the absence of a duality gap (see footnote 3), or equivalently the existence of a saddle point (β*, γ*) of L, i.e. L(β*, γ) ≤ L(β*, γ*) ≤ L(β, γ*) for all β and γ, we can interchange the min and max and solve the dual problem max_{γ∈R_+} min_β L(β, γ).

Footnote 2: We consider L_{T−} rather than L_T in the constraint because Λ may have a jump discontinuity precisely at T, which would considerably complicate the analysis. However, by [46, Corollary 2.3], we know that solutions to (1) cannot have discontinuities after time α²/(2π). We can derive an analogous result for the controlled case using Girsanov's theorem if, in addition to the pointwise bound on β, we assume a bound on the total cost over the infinite horizon, i.e. ∫_0^∞ β_s ds ≤ c_max for some c_max > 0. In this setting, by choosing T sufficiently large, we then do not need to distinguish between L_{T−}(β) and L_T(β).
Footnote 3: Proving the absence of a duality gap seems difficult, as standard minimax theorems cannot be easily applied. Moreover, our numerical experiments suggest that at least for certain values of δ it might fail to hold true.
For these reasons we shall from now on consider the inner optimisation problem for fixed γ > 0 (which, due to the complementary slackness condition, can be optimal only if the constraint is binding, i.e. L_{T−}(β) = δ). If there is no duality gap, the optimal γ for a prespecified threshold δ can in turn be determined by solving the outer optimisation problem, i.e. max_{γ∈R_+} g(γ) where g(γ) := min_β L(β, γ).
Writing X(β) for the solution process associated with the minimal solution Λ(β), defined analogously to (5) but now for (7), we thus minimise the objective function

J(β) := E[∫_0^T β_t dt] + γ L_{T−}(β),   (10)

where X̄ = X_t 1_{τ>t} denotes the absorbed minimal solution process and τ the default time.
Note that the only difference between J(β) and L(β, γ) is the constant −γδ, which does not play a role in the optimisation over β. By varying γ, we can therefore trace out pairs of costs and losses which are solutions to (9) for different δ. The Lagrange multiplier γ (as a function of δ) can then be interpreted as the shadow price of preventing further defaults. Indeed, as for usual constrained optimization problems, the optimal cost C_T, seen as a function of the loss level δ, satisfies under certain technical conditions

(d/dδ) C_T(δ) = −γ(δ).

As we show numerically in Section 3, the optimal loss L_{T−} as a function of γ is monotone decreasing, so that for large enough γ (and b_max), the threshold δ becomes small enough to avoid systemic events. Note that, by the arguments at the end of Section 1.1, using the minimal solution in the optimisation task is the only economically meaningful choice, because non-physical solutions (with a potentially higher probability of default) cannot be realistically justified, in particular when seen as limits of particle systems. We refer to [29, Section 3.1] for examples of such non-physical solutions.
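The effect of sweeping γ can be illustrated with a toy discrete menu of strategies (all numbers below are invented for illustration): for each γ the inner problem picks the strategy minimising C_T + γ L_{T−}, and the loss of the chosen strategy is nonincreasing in γ, as claimed above.

```python
import numpy as np

# hypothetical menu of admissible strategies, each with expected cost C_T
# and resulting loss L_{T-} (numbers invented for illustration)
costs = np.array([0.0, 0.2, 0.5, 1.0, 1.8])
losses = np.array([0.9, 0.6, 0.35, 0.2, 0.12])

def inner_opt(gamma):
    """Solve min over the menu of C_T + gamma * L_{T-}; return (cost, loss)."""
    i = int(np.argmin(costs + gamma * losses))
    return float(costs[i]), float(losses[i])

# sweeping the shadow price gamma traces out cost/loss pairs
frontier = np.array([inner_opt(g) for g in np.linspace(0.0, 25.0, 200)])
```

This monotonicity is generic for linear scalarisations and does not rely on convexity of the menu; only points on the lower convex envelope of (cost, loss) pairs are ever selected, which is one way a duality gap can manifest itself.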
Both from a theoretical and a numerical perspective, we shall analyse the objective function (10) together with the dynamics (7), which constitute a non-standard MFC problem with a singular interaction through hitting the boundary. As we show in Section 2, in particular Theorem 2.8, optimisation of the McKean-Vlasov equation (7) yields the same result as first optimising in the N-particle system and then passing to the limit. In particular, by Theorem 2.10, optimizers of the McKean-Vlasov equation (7) are ε-optimal for the N-particle system. This justifies our numerical implementation described in Section 3, where we deal directly with the MFC problem.

Theory of MFC problems and applications to systemic risk
Given the large body of literature on MFC problems, we focus here on relatively recent works and mostly on MFC rather than on the related concept of Mean Field Games (MFG) as introduced in [43] and [39]. We refer to [21,20,25] for definitions of the MFC and MFG optima in general set-ups and discussions of the differences. As we deal here with a central agent, our optimization problem corresponds to a Pareto optimum where all the agents cooperate to minimize the costs. Therefore MFC is the appropriate concept. Note that instead of MFC the terminology McKean-Vlasov control is also often used.
Similarly as for classical optimal control, dynamic programming principles have also been derived for MFC problems and can be found in [53,31]. We also refer to [54,14,3,44], where in diffusion set-ups formulations using a Hamilton-Jacobi-Bellman (HJB) equation on the space of probability measures are deduced for closed-loop controls (also called feedback controls). In the recent work [36] this has been generalized to jump diffusion processes. For a dynamic programming principle for open-loop controls we refer to [13], and to [20,1] for a characterisation by a stochastic maximum principle.
We are interested here in feedback controls and would therefore need to solve the corresponding HJB equation (as e.g. in [54]), i.e. an infinite-dimensional fully nonlinear partial differential equation (PDE) of second order on the Wasserstein space of probability measures. Solving such an equation is challenging, since it involves computing measure derivatives, which is numerically intractable. In our context the situation is even more intricate due to the singular interaction through the boundary. Indeed, even under the (usually not satisfied) assumption that t ↦ Λ_t is C¹, the problem is far beyond a standard MFC framework. In this case, Λ in (7) can be replaced by ∫_0^· Λ'_t dt, thus by a time derivative of the measure component, which makes the problem 'non-Markovian'. Moreover, we deal with subprobability measures describing the marginal distributions of the absorbed process X̄ = X_t 1_{τ>t} which governs the underlying dynamics. Note also that the total mass of these subprobability measures as well as X̄ itself can exhibit jumps if Λ is discontinuous, and that these jumps emerge endogenously from the feedback mechanism.
This is in contrast to some other recent papers where jumps are exogenously given. For instance, the recent articles [36,6] consider the control of (conditional) McKean-Vlasov dynamics with jumps and associated HJB-PIDEs, while in [7] a stochastic maximum principle is derived to analyse a mean-field game with random jump time penalty.
In the context of systemic risk and contagion via singular interactions through hitting times, the paper [51] is especially relevant. There, a game in which the banks determine their (inhomogeneous) connections strategically is analysed. It turns out that, in equilibrium, systemic events can be avoided by a reduction of lending to critical institutions. A model involving singular interaction through hitting the boundary is also considered in [32]. There, an optimization component is incorporated via a quadratic functional that allows the institutions to control parts of their dynamics in order to minimize their expected risk, which leads to an MFG problem. The quadratic cost functional is inspired by the earlier work [22], which also treats the mean-field game limit of a system of banks who control their borrowing from, and lending to, a central bank, and where the interaction comes from interbank lending. Let us finally mention the very recent article [8], which applies reinforcement learning to a model that can be considered an extension of [22], adding a cooperative game component within certain groups of banks.
In the wider context of interaction through boundary absorption, a few works on mean-field games have also appeared recently. In [17], the players' dynamics depend on the empirical measure representing players who have not exited a domain. This is extended to smooth dependence on boundary losses prior to the present time in [18], and to the presence of common noise in [16]. The economic motivations for these models include, among others, systemic risk and bank runs.

Numerics for MFC and MFG problems
Among the numerical methods proposed for MFC and MFG problems, we refer to [23] for a policy gradient-type method where feedback controls are approximated by neural networks and optimised for a given objective function; to [34] and again to [23] for a mean-field FBSDE method, generalising the deep BSDE method to mean-field dependence and in the former case to delayed effects; and to [5] for a survey of methods for the coupled PDE systems, mainly in the spirit of the seminal works [2,4]; see also a related semi-Lagrangian scheme in [19] and a gradient method and penalisation approach in [52].
Besides these works on PDE systems, a lot of research has recently been conducted on applying (deep) reinforcement or Q-learning to solve MFC and MFG problems or combinations thereof (see e.g. [35,11,10,9] and the references therein). We also refer to two recent survey articles [38,45] on machine learning tools for optimal control and games. The first article focuses on methods that solve the problems by relying on exact computations of gradients, exploiting full knowledge of the model, while the second presents learning tools that aim at solving MFC and MFG problems in a model-free fashion.
In our modelling and numerical approach we have full knowledge of the model, but what distinguishes it from the existing literature is the particular singular interaction through the boundary absorption. This means that all the discussed numerical schemes and methods need to be adapted to accommodate this special situation. We opted for an adaptation of the policy gradient-type method considered in [55], since it shares the same computational complexity as the gradient-based algorithms in e.g. [12,41,52], but enjoys an accelerated convergence rate and can handle general convex nonsmooth costs (including constraints), which allows us to incorporate the current objective function. It exploits a forward-backward splitting approach and iteratively refines the approximate controls based on the gradient of the cost, evaluated through a coupled system of nonlocal linear PDEs. The precise algorithm is outlined in Section 3.

Contributions and findings
As already mentioned, our model differs in a number of fundamental points from the existing literature: first, while [22,17,18,51,32,16] study mean-field game solutions and equilibria of N-player games, where each player maximises their own objective, we study the problem of a central planner who specifically seeks to control the number of defaults. Second, in contrast to [18], where the coefficients of the players' processes may depend on the loss process, and to [16], which further includes a driver that is a smoothed version of L (hence modelling delayed effects of hitting the boundary), we consider firm values driven by L directly, resulting in an instantaneous effect of defaults and the emergence of systemic events. Third, the techniques we use are also entirely different from those in these preceding works. Instead of relying on techniques for martingale problems used e.g. in [42] to derive a limit theory for controlled McKean-Vlasov dynamics, we extend the method from [27] to show the convergence of the finite system to the mean-field limit.
Moreover, we provide a numerical solution adapting the approach of [55] to the current setting. This means, first of all, expressing Λ_t in the dynamics (7) and in the loss function explicitly in terms of the distribution of X_t, which can be (formally) achieved through ∂_x p(t, 0), i.e. the spatial derivative of its density at 0 (assuming it exists). To cast the absorbed process formally into a (more) standard McKean-Vlasov framework, instead of (7) we write the dynamics in a form where the drift and diffusion coefficients are multiplied by the Heaviside function (in the state). Both the computation of ∂_x p(t, 0) and the presence of the Heaviside function require some regularization, which is treated in Section 3.1. This regularized version then allows us to apply the policy gradient descent method of [55], where we compute the gradient via a coupled forward-backward PDE system (instead of a particle method as in [55]). In particular, the forward problem is given by a smoothed version of the Stefan problem with a drift term determined by the feedback control, while the backward problem determines a decoupling field of the adjoint process.
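For intuition, the two regularisations can be sketched as follows: a smoothed Heaviside (here a tanh profile; the paper's specific mollifier may differ) and a one-sided finite-difference estimate of ∂_x p(t, 0), from which the contagion rate can be read off (in the regular regime one expects it to be proportional to ∂_x p(t, 0); cf. [30]). Function names and the smoothing width eps are ours.

```python
import numpy as np

def heaviside_smooth(x, eps=0.05):
    """C^1 surrogate for the Heaviside function 1_{x > 0}; the tanh profile
    and the width eps are illustrative choices, not the paper's mollifier."""
    return 0.5 * (1.0 + np.tanh(x / eps))

def boundary_flux(p, dx):
    """Second-order one-sided estimate of d_x p(t, 0) from interior grid
    values p[0] = p(t, dx), p[1] = p(t, 2*dx), using p(t, 0) = 0."""
    return (4.0 * p[0] - p[1]) / (2.0 * dx)
```

The one-sided stencil is exact for quadratics vanishing at 0, which is the relevant local behaviour of an absorbed density near the boundary.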
From an economic point of view, our findings indicate a high sensitivity to the parameter γ in (10). As shown in Figure 7b, for a certain value of γ, and in a regime where α triggers jump discontinuities in the uncontrolled system, the optimal control strategy switches from not avoiding a jump to avoiding a jump. Moreover, our numerical experiments suggest that it is not possible to vary the capital injection so as to control the size of the jump continuously, since the possible jump size is restricted by the physical jump constraint (4). Viewed differently, a large systemic event can happen if the central agent withdraws a small amount of capital from a scenario without jumps.
Summarizing, the main contributions of the present paper are as follows:
• We show convergence of the system with N agents to the mean-field limit (see Section 2), including well-posedness of the central agent's optimisation problem, i.e. the existence of unique minimal solutions to the Stefan problem as given by (7)-(8) for any suitably regular control process β and the existence of an optimal control which minimises (10).
• We propose a numerical scheme (see Section 3) based on policy gradient iteration, where the gradient is computed via coupled forward and backward PDEs satisfied by the density of a regularized version of the equity value process and a decoupling field corresponding to an adjoint process, respectively.
• We analyse by way of detailed numerical studies the structure of the central agent's optimal strategy in different market environments, and the minimal losses that are attained under optimal strategies with varying cost (see also Section 3).
Convergence to a mean-field limit

In this section, we show the existence of a minimising strategy for the central agent's objective function in the mean-field limit, as well as convergence of the N-agent control problem.

The model setup
We fix a measure ν ∈ P([0, ∞)) and define a reference probability space to be a tuple S = (Ω, F, (F_t)_{t≥0}, P) such that S supports a Brownian motion B that is adapted to (F_t)_{t≥0} and an F_0-measurable random variable X_{0−} with law(X_{0−}) = ν. Note that with this definition, X_{0−} is independent of B by construction. We endow the space S_T of control paths with the topology of weak convergence in L²([0, ∞)). Since S_T is bounded in the L²([0, ∞))-norm and weakly closed, S_T is a compact Polish space. We then define the space of admissible controls B_T as the progressively measurable processes β with 0 ≤ β_t ≤ b_max for all t. Note that the space of admissible controls B_T as well as the objective functional J as defined in (10) depend implicitly on the choice of stochastic basis S. We will sometimes write B_T(S) or J(S) when we wish to emphasize this dependence. To guarantee existence of optimizers and to make the optimization problem independent of the choice of stochastic basis S, we consider the relaxed optimization problem, as is standard in the stochastic optimal control literature (see e.g. [33]). We say that (X_{0−}, B, β, Λ) solve problem (7)-(8) on S if S = (Ω, F, (F_t)_{t≥0}, P) is a reference probability space with Brownian motion B and initial condition X_{0−} such that β ∈ B_T(S) and (7)-(8) hold P-almost surely.
Note that it is not clear a priori that the process X given in (7) is well-defined. Indeed, it is known that the McKean-Vlasov problem (2) and (3) may admit more than one solution, and it is not known that physical solutions exist for general β ∈ B_T, although it is known for β of the special form b(t, X_t), where b is Lipschitz (see e.g. [29,46]). To pin down a meaningful solution concept, we therefore rely on the notion of minimal solutions as defined in (5). By the results of [27], we know that minimal solutions of the uncontrolled system are physical whenever the initial condition X_{0−} is integrable.
Throughout the following sections, P(E) always denotes the set of probability measures on a Polish space E, which we endow with the Lévy-Prokhorov metric, i.e., convergence of probability measures is to be understood in the (probabilistic) weak sense. For function spaces etc. we apply rather standard notation and refer to Section A.1 for more details.

Well-posedness of minimal solutions for general drift
We fix the reference probability space S and show that minimal solutions exist for any β ∈ B_T. Define the operator Γ for a càdlàg function ℓ and β ∈ B_T as

Γ[ℓ, β]_t := P( inf_{0≤s≤t} ( X_{0−} + ∫_0^s β_u du + B_s − αℓ_s ) ≤ 0 ).

Note here that (X(β), τ_β, ℓ) solves (7) if and only if ℓ is a fixed point of Γ[·, β]. We next introduce a function space that is mapped to itself by Γ[·, β]: set

M := { ℓ : [0, ∞) → [0, 1] : ℓ is nondecreasing and càdlàg with ℓ_{0−} = 0 },

where, upon extension to the extended real line, any ℓ ∈ M defines a cumulative distribution function of a probability measure on [0, ∞]. Therefore, equipping M with the topology of weak convergence, i.e., ℓ^n → ℓ in M if and only if ℓ^n_t → ℓ_t for all t ∈ [0, ∞] that are continuity points of ℓ, we obtain that M is a compact Polish space. As in the uncontrolled case, Γ[·, β] is continuous on M.

Theorem 2.1. For any β ∈ B_T, the operator Γ[·, β] : M → M is continuous. Furthermore, there is a (unique) minimal solution to (7), and it is given by the limit of the iterates Γ^(k)[0, β] as k → ∞.

Proof. Using Lemma A.1, this follows from Theorem 2.3 in [26].
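The fixed-point characterisation suggests a simple Monte Carlo scheme (ours, for illustration only): start from ℓ ≡ 0 and iterate ℓ ↦ Γ[ℓ, β] on a time grid, freezing one set of sampled Brownian paths across iterations. For a constant control β the drift is deterministic; the exponential initial law and all parameter values are illustrative.

```python
import numpy as np

def minimal_loss(alpha=0.8, T=1.0, n_steps=200, n_paths=5000, beta=0.0,
                 x0_mean=0.6, n_iter=30, seed=1):
    """Picard iteration ell^{k+1} = Gamma[ell^k, beta] started from ell = 0;
    the iterates increase monotonically towards the minimal solution."""
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    X0 = rng.exponential(x0_mean, n_paths)      # illustrative initial law
    W = np.cumsum(np.sqrt(dt) * rng.standard_normal((n_paths, n_steps)), axis=1)
    drift = beta * dt * np.arange(1, n_steps + 1)   # constant control beta
    base = X0[:, None] + drift[None, :] + W         # paths before subtracting loss
    ell = np.zeros(n_steps)
    for _ in range(n_iter):
        # running-minimum check of X_{0-} + int beta + B - alpha*ell <= 0
        hit = np.minimum.accumulate(base - alpha * ell[None, :], axis=1) <= 0.0
        ell_new = hit.mean(axis=0)
        if np.allclose(ell_new, ell):
            break
        ell = ell_new
    return ell
```

Monotonicity of Γ in ℓ makes the iterates nondecreasing, mirroring the construction of the minimal solution in Theorem 2.1; a positive drift β lowers the resulting loss pathwise.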

Existence of an optimal control
A key step in proving existence of an optimizer is to show that sequences of solutions to (7)-(8) are compact in a certain sense and that their cluster points are again solutions of (7)-(8). This is the content of the next theorem.

Theorem 2.2. For n ∈ N, let (X^n_{0−}, B^n, β^n, Λ^n) solve (7)-(8) on S^n. Then, after passing to subsequences if necessary, there is a reference probability space S such that (X_{0−}, B, β, Λ) solve (7)-(8) on S, and it holds that law(β^n) → law(β) in P(S_T) and (1/α)Λ^n → (1/α)Λ in M.

Proof. See Section A.2 in the Appendix.
Remark 2.3. Note that in Theorem 2.2 we do not assume that either the Λ^n or Λ are minimal solutions. At this point, we do not know how to prove that Λ is minimal if all Λ^n are minimal. Stability of the minimal solution is an open question (cf. [27, Conjecture 6.10]). For the proof of the subsequent theorem, where we prove existence of an optimizer to (13) formulated with the minimal solution, this however does not matter.
Next, we prove that the infinite-dimensional problem (13) admits an optimizer.

Theorem 2.4.
There is an optimizer of (13), i.e., there is a stochastic basis S* and β* ∈ B_T(S*) attaining the infimum in (13).

Proof. Let (S^n, β^n) be a minimising sequence for (13). By Theorem 2.2, after passing to subsequences if necessary, there is a reference probability space S* such that (X*_{0−}, B*, β*, Λ*) solves (7)-(8) on S*, and law(β^n) → law(β*) holds in P(S_T) as well as (1/α)Λ^n → (1/α)Λ* in M. In the following, we simply write J instead of J(S*) etc. By construction and (10), the infimum in (13) is finite, as J attains a value less than or equal to γ for β = 0. We proceed to show that J(β*) ≤ lim inf_{n→∞} J(β^n).
Since the functional b ↦ ∫_0^T b_s ds is continuous and bounded on S_T, it follows that E[∫_0^T β^n_s ds] → E[∫_0^T β*_s ds]. The Portmanteau theorem implies that Λ*_{T−} ≤ lim inf_{n→∞} Λ^n_{T−}, and since Λ* solves (7)-(8) with drift β* and Λ(β*) is the minimal solution on S* with drift β*, it follows that Λ_{T−}(β*) ≤ Λ*_{T−}, which is equivalent to L_{T−}(β*) ≤ L*_{T−}. This concludes the proof.

Properties of the controlled N -particle system
We describe the controlled N-particle system mentioned in the introduction in more detail. We consider a stochastic basis S^N = (Ω^N, F^N, (F^N_t)_{t≥0}, P^N) supporting certain exchangeable random variables, defined as follows.
Definition 2.5. Set X^N := (X^{1,N}, . . . , X^{N,N}), where the X^{i,N} are random variables taking values in some space E. We say that X^N is N-exchangeable if law(X^{σ(1),N}, . . . , X^{σ(N),N}) = law(X^N) for any permutation σ of {1, . . . , N}. We say that β^N is S^N-exchangeable if the vector (X^{i,N}_{0−}, B^{i,N}, β^{i,N})_{i=1}^N is N-exchangeable under P^N.

The stochastic basis S^N is supposed to support an N-dimensional Brownian motion B^N and an N-exchangeable, F^N_0-measurable random vector X^N_{0−}. The particles in the system then satisfy the dynamics

X^{i,N}_t = X^{i,N}_{0−} + ∫_0^t β^{i,N}_s ds + B^{i,N}_t − Λ^N_t,   t < τ^{i,N},   (19)

where β^N is S^N-exchangeable, and

τ^{i,N} = inf{t : X^{i,N}_t ≤ 0},   Λ^N_t = (α/N) ∑_{j=1}^N 1_{τ^{j,N} ≤ t}.

In analogy to the infinite-dimensional case, we denote L^N := (1/α)Λ^N. We then consider the S^N-exchangeable processes as the set of admissible controls. The same examples as in the uncontrolled case show that solutions to (19) are not unique in general (cf. Section 3.1.1 in [29]). Therefore, in [28], physical solutions are introduced. Similarly to the infinite-dimensional case, we can also consider minimal solutions. We call a solution Λ^N to (19) minimal if, for every solution Λ̄^N to (19), Λ^N_t ≤ Λ̄^N_t for all t ≥ 0 holds almost surely. The same argument as in [27, Lemma 3.3] shows that the notions of physical and minimal solution are equivalent in the controlled N-particle system.

In analogy to the infinite-dimensional case, we introduce the operator

Γ^N[L, β^N]_t := (1/N) ∑_{i=1}^N 1{ inf_{0≤s≤t} ( X^{i,N}_{0−} + ∫_0^s β^{i,N}_u du + B^{i,N}_s − αL_s ) ≤ 0 },   (21)

where L is some càdlàg process adapted to the filtration generated by B^N. We will often simply write Γ^N[L] instead of Γ^N[L, β^N]; the statements are then meant to hold for arbitrary, fixed β^N. We then readily see by straightforward induction arguments that (Γ^N)^(k)[0] ≤ L^N holds almost surely, where L^N is (1/α times) any solution to the particle system and (Γ^N)^(k) denotes the k-th iterate of Γ^N. A similar argument as for the system without drift in [27] shows that the iteration ((Γ^N)^(k)[0])_{k∈N} converges to the minimal solution after at most N iterations.

Lemma 2.6. For N ∈ N, let Γ^N be defined as in (21). Then the N-th iterate (Γ^N)^(N)[0, β^N]
is the minimal solution to the particle system with drift β^N, and a corresponding almost sure error bound holds for the k-th iterate.
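At a fixed time, the iteration behind Lemma 2.6 reduces to a simple default cascade: starting from the banks at or below zero, repeatedly absorb every bank whose equity is wiped out by the contagion jump α·(number of defaults)/N, which stabilises after at most N rounds. A sketch (function name ours):

```python
import numpy as np

def cascade_size(positions, alpha):
    """Iterate D <- #{i : X_i <= alpha * D / N} from D = 0; this is the
    single-time analogue of the iterates of Gamma_N started at 0 and it
    reaches its fixed point (the minimal cascade) in at most N steps."""
    X = np.sort(np.asarray(positions, dtype=float))
    N = len(X)
    D = 0
    for _ in range(N + 1):
        # number of banks at or below the current contagion threshold
        D_new = int(np.searchsorted(X, alpha * D / N, side="right"))
        if D_new == D:
            break
        D = D_new
    return D
```

Because the update is monotone and bounded by N, the loop necessarily terminates at the smallest fixed point, mirroring the minimality statement of the lemma.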
Proof.Analogous to the proof of Lemma 3.1 in [27].
Roughly speaking, the next result says that limit points (in distribution) of solutions to the controlled N-particle system converge (along subsequences) to solutions of the controlled McKean-Vlasov equation (7)-(8). By D[−1, ∞) we denote here the space of càdlàg paths on [−1, ∞) equipped with the M1-topology.
Theorem 2.7. For N ∈ N, let (X^N, β^N, Λ^N) be a solution to the particle system (19) on the stochastic basis S^N and define μ^N := (1/N) ∑_{i=1}^N δ_{X^{i,N}}. Suppose that for some measure ν_{0−} ∈ P(R) the empirical measures of the initial values converge weakly to ν_{0−}. Then there is a subsequence (again denoted by N) such that law(μ^N) → law(μ) in P(P(D([−1, ∞)))), where μ coincides almost surely with the law of a solution process X to the McKean-Vlasov problem (7)-(8) satisfying law(X_{0−}) = ν_{0−}.

Proof. See Section A.4
The next theorem shows that when we optimize the policy in the particle system and then take the limit of the resulting optimal values, we obtain the same value that we find by optimizing the infinite-dimensional version of the problem.
Theorem 2.8. Denote by V_N the optimal value of the N-particle control problem (with the initial condition perturbed as in Step 2 below) and by V_∞ the optimal value of the mean-field problem (13). Then lim_{N→∞} V_N = V_∞.

Proof. Step 1: We show the inequality lim inf_{N→∞} V_N ≥ V_∞. To that end, choose S^N and β^N ∈ B_T(S^N) approximately attaining V_N. Arguing as in the proof of Theorem 2.7, we see that the associated sequence of empirical measures is tight, and by Theorem 2.7 it converges to a limit ξ that is supported on the set of solutions to the McKean-Vlasov problem (7)-(8). By Skorokhod representation, we may assume that this convergence happens almost surely on some stochastic basis S. Lower semicontinuity of the cost as a function of (w, b, ℓ) then yields the claim.

Step 2: We show that lim sup_{N→∞} V_N ≤ V_∞. Let S* be a probability space and β* ∈ B_T(S*) be an optimizer attaining V_∞, whose existence was shown in Theorem 2.4. Let S^N be the product space obtained by taking N copies of S*, and consider the (random) cost functional defined via the supremum norm ‖·‖_∞ on [0, ∞). Let M(S^N) be the set of all F-measurable random variables, defined on the stochastic basis S^N, taking values in M, and consider the relaxed problem with value Ṽ_N := inf over this set. Letting (β*)^N be the vector obtained by taking N i.i.d. copies of β*, and choosing L ≡ L(β*), which we abbreviate in the following with L* := L(β*), we obtain an upper bound for Ṽ_N. Noting that Γ^N[L*, (β*)^N] is the empirical cumulative distribution function of the i.i.d. random variables τ^i_{β*} := inf{t ≥ 0 : X^i_{0−} + ∫_0^t β*_s ds + B^i_t − αL*_t ≤ 0}, and that P(τ^i_{β*} ≤ t) = L*_t, the same estimates as in Step 1 of the proof of Proposition 6.1 in [27] show that lim sup_{N→∞} Ṽ_N ≤ V_∞. It remains to compare Ṽ_N with V_N. Here we used that L^N(β̃^N) ≤ 1 and Γ^N[L^N(β̃^N), β̃^N] = L^N(β̃^N). This yields a comparison with Γ̃^N, defined as in (21) with initial condition X̃^{i,N}_{0−} := X^i_{0−} + αN^{−κ}. Combining (29) with (28) and again using the monotonicity of Γ^N, a straightforward induction then shows that L̃^N ≥ (Γ^N)^(k)[0, β̃^N] − N^{−κ} for all k ∈ N, and Lemma 2.6 then yields the corresponding bound for L̃^N(β̃^N), the loss process associated to the particle system with initial condition X̃^N_{0−}. Since we have already shown that lim sup_{N→∞} Ṽ_N ≤ V_∞, this shows lim sup_{N→∞} V_N ≤ V_∞, which concludes the proof.
Remark 2.9. We conjecture that the perturbation in the initial condition of the particle system in Theorem 2.8 is an artefact of our proof technique rather than a necessity.
Theorem 2.10. Let S be a probability space and β ∈ B T (S) be an optimizer attaining V ∞. Let S N be the product space obtained by taking N copies of S and let (β) N be the vector obtained by taking N i.i.d. copies of β. Then (β) N is ε-optimal for the particle system, i.e., for every ε > 0, it holds that

for N sufficiently large.
Proof. Let ε > 0 be given. Recall the notation introduced in Step 2 of the proof of Theorem 2.8. Consider the problem

Proceeding as in Step 2 of the proof of Theorem 2.8, it follows that lim sup N→∞ VN ≤ V ∞. By Theorem 2.8, we have lim N→∞ V N = V ∞, and therefore VN ≤ V N + ε/3 for N large enough. Arguing as in Step 2 of the proof of Theorem 2.8, we can find

Choosing N large enough such that γN −κ < ε/3, we obtain

Numerical solution of the MFC problem
In this section, we present a numerical scheme for the central agent's mean-field control problem. We directly compute the optimal feedback control by a policy gradient method (PGM; see Sections 3.2 and 3.3) applied to a regularised version of the dynamics and the objective function (see Section 3.1). The gradient is approximated by finite difference schemes for the density of the forward process and a decoupling field for an adjoint process (see Section 3.4). This allows us to conduct parameter studies of the optimal strategies as well as the resulting losses and costs in Section 3.5.
Recall from (10) and above the process X(β) corresponding to the minimal solution Λ(β), and write the objective function as

where L t = P(inf 0≤s≤t X s (β) ≤ 0), and derivatives of L are defined in a distributional sense if necessary.
In the case of regular solutions, the absorbed process associated with (7), X = X t 1 {τ >t}, for τ the hitting time of 0, has a sub-probability density p supported on (0, ∞) and an atomic mass at 0. Similarly to [30, Theorem 1.1], it satisfies the forward Kolmogorov equation

where

and where T denotes the set of all t ∈ [0, T ] at which t → L t is differentiable. If t ∉ T, in particular in the event of a blow-up at t, we have the following jump condition for the solution of (32): p(t−, x) = p(t, x − Λ t + Λ t− ).
Assuming again solutions regular enough that we can take the derivative with respect to time of the equation

Moreover, we can rewrite the controlled dynamics of X for t < τ as

Note that the current control problem lies outside the standard MFC context due to the following three main aspects: (i) the interaction through the boundary leads to a time derivative of the measure component, which makes the problem as written in (34) (without replacing L̇ t by ∂ x p(t, 0)/2) 'non-Markovian'; (ii) the drift coefficient is non-Lipschitz in the measure component; (iii) the dynamics are defined by an absorbed process, which moreover has an irregular drift coefficient (as t → L t can be discontinuous in time). We address these points by a regularisation in the next section, which subsequently allows us to apply a policy gradient method inspired by [56].

Regularisation
Denote by ν t the law of X t corresponding to p(t, x)dx in the regular case where a density exists. For (small) h > 0, we approximate (1/2) p x (t, 0) in terms of the measure ν t by

where φ h (x) is a smooth approximation of the Dirac δ distribution with support in [0, h], and where the bracket notation denotes the integral.
We then define a smooth function Φ h : R → R such that Φ h (x) = 0 for x ≤ −κh and Φ h (x) = 1 for x ≥ 0, for some κ > 0, and consider the dynamics

where

For completeness, we give the specific φ h and Φ h used in our computations in Appendix B.1. There, we also show the graphs of φ h and its first two derivatives for the value h = 10 −3 , which is frequently used in our tests below. Under the dynamics (35), the process does not get absorbed at 0, but once it crosses 0 from above, its diffusion and drift coefficients decay rapidly, so that with high probability it remains in the interval [−κh, 0] (see Figure 4 for an illustration of the density of such a process).
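As a concrete illustration, the following sketch shows one plausible choice of the mollifier pair; the actual φ h and Φ h used in the paper are those of its Appendix B.1, which is not reproduced here. We take a normalised polynomial bump for φ h and a quintic smoothstep for Φ h, both C², since the scheme uses the first two derivatives of φ h.

```python
import numpy as np

def phi_h(x, h):
    """C^2 approximation of the Dirac delta with support in [0, h]:
    a normalised polynomial bump (hypothetical choice; the paper's exact
    phi_h is given in its Appendix B.1)."""
    u = np.asarray(x, dtype=float) / h
    inside = (u > 0) & (u < 1)
    out = np.zeros_like(u)
    # 140 u^3 (1-u)^3 integrates to 1 over [0, 1], so phi_h integrates to 1
    out[inside] = 140.0 * u[inside]**3 * (1.0 - u[inside])**3 / h
    return out

def Phi_h(x, h, kappa=0.1):
    """Smooth cutoff with Phi_h = 0 for x <= -kappa*h and Phi_h = 1 for
    x >= 0, here a quintic 'smootherstep' (C^2) on [-kappa*h, 0]."""
    t = np.clip((np.asarray(x, dtype=float) + kappa * h) / (kappa * h), 0.0, 1.0)
    return t**3 * (10.0 - 15.0 * t + 6.0 * t**2)
```

Any other C² bump with unit mass and compact support in [0, h] would serve the same purpose.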
The reason we consider these modified dynamics is to cast the absorbed process X t into a standard McKean–Vlasov framework. The objective function can also be rewritten and becomes

A crucial point here is that both the coefficients in the dynamics and the objective can be written in terms of β and ν alone, i.e. without any time (or spatial) derivatives of the measure flow (ν t ). Note also that (a h , σ h , f h ) satisfy the differentiability assumptions made in [1, Section 3], which we shall need for the Fréchet differentiability of the function F defined in (40) below.
Once we have the optimal control in feedback form and the associated density p of X, we can compute the optimal loss and cost pair as

Policy gradients
We follow here in spirit the approach of [56]. We first consider a slightly more general form of the MFC problem, written as a nonsmooth optimization problem over the Hilbert space H 2 (R) of R-valued, square integrable, progressively measurable processes,

inf

with the functionals F : H 2 (R) → R and G : H 2 (R) → R ∪ {∞} defined as follows: for all β ∈ H 2 (R),

where f h is defined in (36).
The splitting of the objective function into F and G allows for a separate treatment of the smooth component f h and a non-smooth component g. We use g to incorporate the constraints on β; specifically, g(x) = 0 for x ∈ [0, b max ] and g(x) = ∞ otherwise. It is clear that G : H 2 (R) → R ∪ {∞} is convex due to the convexity of g.
Assuming that ν t lies in the Wasserstein space of probability measures on R with finite second moment, denoted by P 2 (R), we introduce the Hamiltonian

with dt ⊗ dP-a.e. Here, X β is the state process controlled by β, satisfying (35), and (Y β , Z β ) are square integrable adapted adjoint processes such that for all t ∈ [0, T ],

Above and hereafter, we use the tilde notation to denote an independent copy of a random variable, as in [1]. We now consider controls in feedback form, namely

Then a sufficiently smooth decoupling field u such that Y t = u(t, X t ) satisfies

with terminal condition u(T, ·) = 0.

Computation of gradient by decoupling fields
In our application, we can express the right-hand side of (46) more explicitly. For the Hamiltonian (41) with a h and f h defined by (35) and (36), respectively, we have, by [24, Section 5.2.2, Example 1],

Consequently, for the decoupling field u,

which can be re-written as

where we will assume that ν t has a density p(t, ·) which satisfies

A proximal policy gradient method (PGM)
We now compute a sequence of approximations to the optimal control in feedback form, namely β m t = β m (t, X m t ). Following [56], we carry out proximal gradient steps with β 0 given, e.g. zero, and thereafter, for step size τ > 0,

where prox τ g : R k → R k is the proximal map of τ g : R → R ∪ {∞} such that

For the considered g, an indicator function, prox is simply the projection onto [0, b max ], i.e. prox τ g (b) = min(max(b, 0), b max ).
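Since g is the indicator of [0, b max ], the proximal map reduces to a clipping operation, independent of the step size τ. A minimal sketch of the resulting update (function names are ours):

```python
import numpy as np

B_MAX = 10.0  # cap on the injection rate, as in the experiments below

def prox_tau_g(b, b_max=B_MAX):
    """Proximal map of tau*g for g the indicator of [0, b_max]: it does not
    depend on tau and is simply the projection onto [0, b_max]."""
    return np.minimum(np.maximum(b, 0.0), b_max)

def pgm_step(b, grad, tau):
    """One proximal gradient update b^{m+1} = prox_{tau g}(b^m - tau*grad)."""
    return prox_tau_g(b - tau * grad)
```

The same two lines apply pointwise to a feedback control stored on a space-time grid.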
Then a sufficiently smooth decoupling field u m such that Y m t = u m (t, X m t ) satisfies

where ν m (dx) = p m dx for the density p m that satisfies

and where

Finally,

Numerical implementation
We pick regularisation parameters h, κ > 0 for φ h and Φ h defined as above. Then, in the m-th iteration, we first solve numerically (51) for p m and then (46) for u m , where ν m is the measure with density p m . We use a semi-implicit finite difference scheme on a non-uniform mesh, as detailed below.
We define a numerical approximation on a time mesh t i = i∆t, i ∈ I = {0, 1, . . ., N }, ∆t = T /N for a positive integer N .
In the following, we drop the iteration index m and use instead superscript i to denote the timestep of any function defined on the space-time mesh and subscript j its spatial index; in particular, for the numerical PDE solutions, p i j ≈ p(t i , x j ) and u i j ≈ u(t i , x j ). We assume a feedback control b i j = β(t i , x j ) is defined on this mesh. Starting with the forward equation (51), for each x j and t i , we approximate the drift coefficient a by

and set s i j = Φ h (x j )². We then define a finite difference scheme by p 0 j = f (x j ) and, for i > 0,

This is an upwind scheme for the first-order terms, treating the dependence of a on p explicitly but everything else implicitly. The form of the scheme is chosen to be consistent with (51) on non-uniform meshes, in particular where the mesh size is piecewise constant.
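To illustrate the structure of such a scheme, here is a simplified sketch: a fully explicit, conservative upwind discretisation of the forward equation on a uniform mesh with zero-flux boundaries. The paper's actual scheme is semi-implicit and non-uniform; `a` and `s` here stand for precomputed drift and diffusion coefficients on the grid. Being in flux form, the step conserves total mass exactly and preserves positivity under a CFL-type restriction on dt.

```python
import numpy as np

def fokker_planck_step(p, a, s, dx, dt):
    """One explicit step for dp/dt = -d/dx(a p) + (1/2) d^2/dx^2(s p),
    in conservative (flux) form: upwinding for the advective part,
    centred differences for the diffusive part, zero-flux boundaries."""
    ap, am = np.maximum(a, 0.0), np.minimum(a, 0.0)
    q = s * p
    # total flux through the interior cell interfaces j+1/2
    flux = ap[:-1] * p[:-1] + am[1:] * p[1:] - 0.5 * (q[1:] - q[:-1]) / dx
    flux = np.concatenate(([0.0], flux, [0.0]))  # no flux through the boundary
    return p - dt / dx * (flux[1:] - flux[:-1])
```

An implicit treatment of the diffusive term, as in the paper, would remove the stability restriction at the cost of solving a tridiagonal system per step.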
For the adjoint equation (47), with p i j now given in addition to b i j , we first define the right-hand side, and then, with u N j = 0, we define for i < N

u i+1 j − u i j ∆t + min(a i j , 0)

This allows us to compute the gradient on the same mesh, from (52), and perform the updates of b. Quantity (37) is then approximated by L in (53), and (38) by

Let us remark that we do not have a convergence proof for this numerical scheme; it also seems out of reach due to the delicate interplay between the discretization and regularization parameters visible from Table 1. Nevertheless, for fixed N and h we can demonstrate convergence of the gradient iterations empirically (see Figure 1), which then allows us to compute approximate optimal policies. In this sense, our numerical tests indicate at least qualitatively what the optimal policies look like. Note that a rigorous convergence proof of a similar policy gradient iteration method in the non-mean-field regime has recently been provided in [57].

Set-up and model parameters
In the rest of the paper, we illustrate the strategies suggested by the model and the resulting loss behaviour in different market scenarios, as influenced by the interaction parameter α, the risk aversion γ, the initial state f , and the maximum cash injection rate b max .
In all examples, we choose a gamma initial density,

The parameters of the initial distribution could be calibrated to CDS spreads if these are traded (see [15]). The default parameters we use are k = 2, θ = 1/3, chosen to give a range of different behaviours by varying the other parameters. In this case, f is differentiable with f (0+) = 0. This choice implies that there are smooth solutions for a short enough time interval (see [37, 30]). It also implies (see [37, Theorem 1.1]) that a blow-up (of the unregularised system) is guaranteed to happen at some time for α > 2E[X 0− ] = 2kθ = 4/3. Conversely, it is known (see [47, Theorem 2.2 and the comment below it]) that the condition α‖f‖ ∞ < 1 leads to the so-called weak feedback regime, in which solutions are always continuous.
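The quoted constants are elementary consequences of the Gamma(k, θ) density and can be checked directly; for k = 2, θ = 1/3 the density peaks at its mode x* = 1/3 with ‖f‖ ∞ = 3/e ≈ 1.10:

```python
import math

k, theta = 2, 1/3

def f(x):
    """Gamma(k, theta) density f(x) = x^{k-1} e^{-x/theta} / (Gamma(k) theta^k)."""
    return x**(k - 1) * math.exp(-x / theta) / (math.gamma(k) * theta**k)

mean = k * theta                 # E[X_{0-}] = 2/3
blow_up_threshold = 2 * mean     # blow-up guaranteed for alpha > 4/3
x_star = (k - 1) * theta         # mode of the density, x* = 1/3
f_sup = f(x_star)                # ||f||_inf = 3/e ~ 1.104
weak_feedback_bound = 1 / f_sup  # weak feedback regime for alpha < e/3 ~ 0.906
```

In particular, f(0+) = 0 since k > 1, matching the smoothness discussion above.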
A simple estimation of meaningful values of α from typical asset volatilities, recovery rates, and mutual lending as a proportion of overall debt is found in [48], suggesting possible values from 0.3 to possibly higher than 5. We shall conduct tests for α ∈ {0.5, 1, 1.5}. With ‖f‖ ∞ ≈ 1.1, it is clear that a jump cannot occur for α = 0.5, but is guaranteed for α = 1.5, as then 2E[X 0− ] < α. The terminal time is chosen as T = 0.02. We find empirically that the uncontrolled system does not jump in this interval for α = 1 (although it may jump eventually), and does jump halfway through the interval for α = 1.5. We have intentionally chosen an initial distribution for which blow-ups can happen at such relatively short time scales in order to illustrate the different effects. In our regularised version of the problem, this manifests in a smooth transition to high loss values, around 60%, over a short period of time. We fix b max = 10 at first, and investigate the effect of larger values later on.
In the following, when not stated otherwise, we choose κ = 1/10 in the construction of Φ h (see Section 3.1 and Appendix B.1), which we found to be a reasonable choice in our experiments. By default, solutions are computed with N = 800 timesteps and a non-uniform mesh on [x min , x max ] = [−2, 6], constructed as described below.

Mesh convergence
We first analyse the convergence of the finite difference approximations for fixed control. In particular, we first choose β = 0. The interaction parameter is α = 0.5.
The mesh is chosen uniformly in each of the intervals [x min , −0.02], [−0.02, 0.05], and [0.05, x max ], such that approximately 5% of the points lie in the first interval, 45% in the second, and 50% in the third, and the total number N x of spatial mesh points is approximately N · (x max − x min )/(8T ). This has the effect that the average mesh size is roughly eight times the time step size, which turns out to be a reasonable ratio in our numerical tests.
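A sketch of this piecewise-uniform mesh construction (the function name and the exact rounding of the point counts are ours):

```python
import numpy as np

def build_mesh(x_min=-2.0, x_max=6.0, n_total=1000):
    """Piecewise-uniform mesh refined around 0, with roughly 5% of the
    points in [x_min, -0.02], 45% in [-0.02, 0.05], 50% in [0.05, x_max]."""
    n1, n2 = int(0.05 * n_total), int(0.45 * n_total)
    n3 = n_total - n1 - n2
    # endpoint=False on the first two pieces avoids duplicated breakpoints
    left = np.linspace(x_min, -0.02, n1, endpoint=False)
    mid = np.linspace(-0.02, 0.05, n2, endpoint=False)
    right = np.linspace(0.05, x_max, n3)
    return np.concatenate([left, mid, right])
```

With these proportions the spacing in the middle interval is roughly 75 times finer than in the right interval, consistent with the strong refinement around zero described below.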
It is of crucial importance to have enough mesh points in the intervals [−κh, 0] and [0, h] to approximate the smoothed Heaviside function and the smoothed delta distribution, together with its first two derivatives. A strong local mesh refinement as above allows this while keeping the total computational complexity feasible. Notice that, for our choice above, the local mesh size around zero is almost 100 times smaller than for larger x.
In Table 1, we report, for a varying number of time-steps N (and proportionally chosen N x ) and smoothing parameter h, the computed loss (columns 5-10, rows 3-8). Let L h N be the loss computed with N time steps and parameter h. From the table we conjecture convergence of L h N as N → ∞ for fixed h, but divergence as h → 0 for fixed N . To investigate this more quantitatively, we report in the third and fourth columns, where h = 10 −3 . The fact that, for fixed h, the increments θ N for successive mesh refinements decrease inversely proportionally to N is consistent with first-order convergence in 1/N and 1/N x . Conversely, we fix N = 3200 in the last two rows. The behaviour indicates a decrease of first order in h as long as 1/N x is small compared to h, but divergence thereafter. Finally, the approximate computational times, reported in the last column, are approximately linear in N N x and independent of h. A similar behaviour is observed for the approximation of the cost and for different parameters, as shown in Appendix B.2.
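The first-order behaviour read off from the table can be quantified by estimating observed convergence orders from successive increments. The sketch below does this on synthetic first-order data; the constants are hypothetical and not taken from Table 1.

```python
import math

def observed_orders(values):
    """Estimate convergence orders from approximations L_N at N, 2N, 4N, ...:
    with theta_N = |L_{2N} - L_N|, the order is about log2(theta_N / theta_{2N})."""
    thetas = [abs(b - a) for a, b in zip(values, values[1:])]
    return [math.log2(t0 / t1) for t0, t1 in zip(thetas, thetas[1:])]

# synthetic first-order data L_N = L + C/N (illustrative values only)
L, C = 0.6, 0.05
vals = [L + C / N for N in (100, 200, 400, 800, 1600)]
orders = observed_orders(vals)  # all close to 1 for first-order data
```

A halving of the increments under mesh doubling, i.e. an estimated order near 1, is exactly the pattern described for the θ N columns.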

Convergence of policy gradient iteration (PGM)
Next, we analyse the convergence of the policy gradient iteration. Here and hereafter, we use a modification whereby (49) is evaluated for x > h, while β m+1 = b max for x ≤ h. As the occupation time of [0, h] is small, this choice has a negligible effect on the expected cost in all cases. We found that this modified iteration converged faster and more reliably in our numerical tests.
We monitor in each iteration the loss at time T, computed as in (53), and the expected cost, computed as in (55). For L (m) and C (m) the terminal loss and total expected cost at the m-th iteration, respectively, we plot in Figure 1 the steps |L (m+1) − L (m) | and |C (m+1) − C (m) |. In these tests, the iteration terminates if either both of these quantities are smaller than 10 −5 or 50 iterations are reached.
The left-hand plot in Figure 1a shows the convergence for different values of γ ∈ {10 −3 , 10 −2 , 10 −1 }. The intermediate value of γ has the largest absolute error, while the smallest γ leads to the smallest one. In the latter case, the cost is very small due to the very small penalty on losses. The asymptotic rate of convergence appears similar for all parameters considered.
In Figure 1b, we analyse the effect of α on the convergence. The error is largest for the smallest of α ∈ {0.5, 1, 1.5}, while the error is smallest for α = 1.5, which is the case where a jump occurs in the uncontrolled setting and losses are the largest.
Further parameter studies are given in the Appendix, where Figure 10a establishes robustness of the convergence under mesh refinement and Figure 10b illustrates the effect of the step size.
In most situations, the number of iterations required for reasonable accuracy, i.e. a relative error below around 10 −3 , was between 10 and 30, so that for the chosen discretisation (with N = 800 timesteps and the mesh as chosen above) the computing time to solve the MFC problem was between 3 and 10 minutes on the laptop specified earlier.

Computational analysis of central agent's strategy
We now move to an analysis of the optimal strategies, and the achievable pairs of costs and losses under the optimal and other strategies.

Analysis of the optimal strategy
The policy gradient method directly produces an approximation to the optimal feedback control β. We found that initialising the iteration with a function of the form β 0 (t, x) = b max for 0 < x < c and 0 elsewhere, for some c > 0 large enough that the support of β 0 covers the support of β, produces more regular controls for small iteration numbers than a zero initialisation. The following plots were produced with c = 0.2 and a tolerance of 10 −5 in the loss and cost (compare Figure 1a). We depict in Figure 2 contours of the optimal feedback control β(t, x) for different γ. As expected from the form of the Hamilton–Jacobi–Bellman equation, the control is close to a 'bang-bang' structure, i.e. a piecewise constant function taking one of the two extreme values, either 0 or b max . The two regions are separated by a narrow strip where the control transitions continuously. We conjecture this to be an effect of the numerical procedure, which is designed for Lipschitz continuous feedback controls.
The (yellow) shaded region closest to x = 0 is where β (t, x) ≥ 0.95 b max , i.e. the central agent subsidises firms closest to default at or close to the maximum rate.The white region furthest from x = 0 is where β (t, x) ≤ 0.05 b max , i.e. the central agent does not subsidise firms with high reserves.
For larger γ, here exemplified by γ = 0.1 in Figure 2a, the contribution of the loss to the objective is large enough for the central agent to act for all t, for values of x up to a decreasing curve in t. Close to the chosen end point, the effect of the control on the overall losses becomes negligible and does not justify the associated cost. In a sense, this behaviour is an artefact of the finite observation interval. For smaller γ, in Figures 2b and 2c, the agent only acts (for sufficiently small states) up to a certain point in time and then does nothing. Combining this with the plots of the resulting loss curves in Figure 3, a possible interpretation is that the agent seeks to delay the onset of the strongly contagious phase until it is no longer viable to do so within a certain cost budget, depending on γ. In particular, as visible from Figure 3, under the current optimisation criterion the jump is not avoided for γ = 0.001. We discuss at the end of the next subsection other strategies for avoiding jumps; for the current optimisation criterion such a strategy is, however, not necessarily optimal.
Next, we analyse the impact of the upper bound b max on the control strategy. For clarity, we show in Figure 4 only the 0.99 b max level set, for different b max . The region under this curve indicates where the agent controls at (or close to) the maximum rate. The region shrinks as b max increases, meaning that the agent is able to control the banks' equity processes more effectively whenever they get close to zero.
Lastly in this section, we analyse the behaviour of the PDE solutions p and u at different times and on different scales in x around 0. Figure 5 shows in the left panel the accumulation of probability mass in the interval [−κh, 0] due to the smooth truncation of the SDE coefficients, approximating the absorption at x = 0. As can be deduced from the right plot in Figure 5, the area under the density for positive x is thus reduced, but only by a small amount in the current parameter setting (see Figure 3 for the corresponding loss function with γ = 0.1).
In Figure 6 we illustrate the behaviour of u on different scales in x. The left-most plot shows the range [0, 2h], where u attains large positive values; in the middle plot, over [0, 0.1], u has moderate negative values; the right-most plot is truncated below at −1, this being the threshold which determines where the control is active. This can be seen from (52) in conjunction with (49): for x > h, the gradient is u + 1, so for a converged control we have β = b max where u + 1 < 0 and β = 0 where u + 1 > 0. From this, the bang-bang structure of the control also becomes clear.
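This thresholding rule can be written down directly. The following sketch also includes the modification described earlier that sets β = b max for x ≤ h; the function name and the vectorised form are ours.

```python
import numpy as np

def bang_bang_control(u, x, h, b_max=10.0):
    """Read off the (near-)optimal control from the decoupling field u:
    for x > h the gradient is u + 1, so the minimiser is b_max where
    u + 1 < 0 and 0 where u + 1 > 0; for x <= h we set beta = b_max,
    per the modified iteration used in the experiments."""
    beta = np.where(u + 1.0 < 0.0, b_max, 0.0)
    beta[np.asarray(x) <= h] = b_max
    return beta
```

Applied to a converged field u on the grid, this reproduces the two regions visible in the contour plots of Figure 2.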

Analysis of optimal cost-loss pairs
Finally, we examine the pairs of costs and losses that are obtained under the optimal policy and other heuristic strategies.
In Figure 7a, we vary γ to trace out the curve (C T (γ), L T (γ)), where C T (γ) and L T (γ) are the costs and losses given by (37) and (38) for the chosen γ. For a given cost, the graph gives the loss achievable under the optimal strategy. To achieve a smaller loss, a higher cost is generally incurred.
We focus first on the data for α = 0.5 and α = 1. In these cases, the uncontrolled system exhibits no jumps and cash injection simply reduces the losses. For small γ, minimising the cost is the priority and the losses approach those of the uncontrolled system. For growing γ, it becomes favourable to increase the cash injection, and a significant reduction of losses can be achieved. This levels off for large γ, as the cap b max on the cash injection rate limits the overall effect of bail-outs.
For strong interaction, here exemplified by α = 1.5, we observe a discontinuity, which is further analysed in Figure 7b. For γ around 0.01, the optimal strategy switches from not preventing a jump to preventing a jump. This is manifested in Figure 7b by a downward discontinuity in the losses. The optimal value of the central agent's control problem is also discontinuous in γ at this point. In other words, it is not possible to vary the capital injection so as to control the size of the jump continuously. Rather, the possible jump size is restricted by the constraint (4) on physical solutions. Conversely, the withdrawal of a small amount of cash by the central agent from a scenario with low losses can trigger a large systemic event.
Note that Figure 7b also allows one to deduce the relation between γ and the threshold δ by looking for γ such that γ ∈ argmax γ∈R + C T (γ) + γ(L T (γ) − δ), as explained in the introduction. Indeed, this corresponds to solving the outer optimization problem max γ∈R + g(γ), where g(γ) = min β L(β, γ) with L(β, γ) denoting the Lagrange function. Under the assumption of no duality gap and a unique optimizer γ, γ is necessarily determined via L T (γ) = δ. As we observe a jump discontinuity of γ → L T (γ), this suggests that there is a duality gap, at least for certain values of δ.
We proceed by comparing the costs and losses under the optimal strategy with those of some heuristic strategies. As a first benchmark, we consider a uniform strategy by which the central agent injects cash at a constant rate b max whenever an agent's value satisfies X t ≤ c for a constant c, which we vary, resulting in pairs (C u T (c), L u T (c)). The total cost here can be computed as

where p u is the density of the regularised process with such a uniform (in time) control.
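Given the density p u on a space-time grid, this double integral is straightforward to evaluate; a sketch with the trapezoidal rule (grid layout and names are ours):

```python
import numpy as np

def _trapezoid(y, x):
    """Trapezoidal rule along the last axis."""
    dx = np.diff(x)
    return np.sum(0.5 * (y[..., 1:] + y[..., :-1]) * dx, axis=-1)

def uniform_strategy_cost(p_u, x, t, c, b_max=10.0):
    """Total cost of the uniform strategy:
    b_max * int_0^T int_{x <= c} p_u(t, x) dx dt,
    where p_u[i, j] approximates p_u(t_i, x_j) on the given grid."""
    mask = x <= c
    inner = _trapezoid(p_u[:, mask], x[mask])  # spatial integral at each t_i
    return b_max * _trapezoid(inner, t)        # then integrate over time
```

In the experiments, p u would come from the finite difference scheme above; here it can be any grid function.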
We also consider a 'front-loaded' strategy whereby at the outset, for some chosen 'floor' d > 0, the central agent injects a lump sum of d − X 0− into all players with X 0− < d, hence lifting their reserves up to d. Again, we vary d to obtain a parametrised curve (C f T (d), L f T (d)). The total cost in this case is found as

The pairs of cost and loss are shown in Figure 8. In particular, Figure 8a illustrates the case without jump for α = 1, whereas in the situation of Figure 8b with α = 1.5 there is a jump in the uncontrolled system, which can be avoided with sufficiently large control. In both cases, the optimal strategy gives lower losses than the heuristic strategies for the same fixed cost. Conversely, less cash injection is required for a given loss tolerance. We observe that we cannot enforce the sufficient condition for avoiding jumps, i.e., α‖f‖ ∞ < 1, for any of these strategies. It is clear that the strategy where the initial capital of all banks is raised to a certain minimum level d satisfies E[X 0− ] ≥ d, and hence the necessary condition for avoiding jumps, E[X 0− ] ≥ α/2, holds for d ≥ α/2. However, the sufficient condition can be violated even when all banks have a high initial capital. What would work to enforce the sufficient condition is to set

Considering the physical jump condition (4), for a jump to occur it matters how much of the surviving mass can be concentrated around zero at any given time. Intuitively, starting with a higher initial condition, the Brownian motion will diffuse the mass sufficiently and make large concentrations at zero less likely, hence preventing a jump. Similarly, sufficiently large β(t, x) for small t and x should transport mass away from zero and prevent a jump as long as the initial density satisfies f (0+) < 1/α (which rules out an instantaneous jump). Therefore, both the constant and optimal strategies should be able to prevent jumps for large enough b max . A rigorous analysis, however, goes beyond the scope of this paper.

motion. We have shown that S is an admissible reference space.
Since ξ N is a random probability measure on C([0, ∞)) × S T × M and the spaces S T and M are compact, ξ N is tight by the same reasoning as in Corollary 4.5 in [27]. Therefore, after passing to a subsequence if necessary, we may assume that law(ξ N ) → law(ξ) for some random probability measure ξ on C([0, ∞)) × S T × M. By Skorokhod representation, we may assume without loss of generality that the convergence happens almost surely on the same probability space S. Arguing in the same fashion as in the proof of Lemma 5.4 in [27], we see that for almost every ω ∈ Ω, if law((W, β, L)) = ξ(ω), then W − W 0 is a Brownian motion with respect to the filtration generated by (W, ∫_0^· β s ds, L). For (w, b, ℓ) ∈ C([0, ∞)) × S T × M, set ι(w, b, ℓ) t := ι(w, ℓ) t + ∫_0^t b s ds, where ι is defined as in (60). By Lemma A.2, the map b → ∫_0^· b s ds is continuous from S T to C([0, ∞)), and therefore also as a map from S T to D([−1, ∞)). Theorem 4.2 in [27] together with Corollary 12.7.4 in [59] then shows that ι is continuous. Since ι(ξ N ) = µ N , the continuous mapping theorem implies that ι(ξ) = µ. Applying the continuous mapping theorem to (w, b, ℓ) → w 0 , we see that law(W 0 ) = ν 0 holds almost surely. It remains to check that if

B Further numerical details and tests
We here report further details and tests of our numerical procedure.

B.2 Mesh convergence
Here, we demonstrate the convergence of the finite difference approximation in two more settings: in Table 2 for the cost, for α = 1.5, and again for the losses, but now for α = 0.5. The notation θ N , ρ N , ϑ N and h is analogous to before.
In these settings, the behaviour in h is somewhat better than in Table 1 (losses for α = 1.5, i.e. with a jump in the uncontrolled case), but, as there, a good approximation is only achieved if the mesh size is sufficiently small in comparison with h.

B.3 Gradient iteration
Finally, we conducted further tests regarding mesh refinement and the role of the step size τ in the gradient iteration. Figure 10a, left, illustrates that the convergence is robust with respect to mesh refinement, i.e., the number of iterations required for a prescribed accuracy does not increase significantly as the number of time steps and mesh points increases simultaneously.
In Figure 10b we investigate the effect of the step size τ. Choosing τ small leads to poor convergence, while τ = 0.3 is optimal among the values presented here. Picking even larger step sizes can lead to divergence.

Declaration
Funding: The authors gratefully acknowledge financial support by the Vienna Science and Technology Fund (WWTF) under grant MA16-021 and by the Austrian Science Fund (FWF) through grant Y 1235 of the START-program.

∫_0^T b s ds + γℓ T − is bounded and lower semicontinuous on C([0, ∞)) × S T × M, Fatou's Lemma and the Portmanteau theorem imply lim inf

For almost every ω, let (w, b, ℓ) denote the canonical process on S. By the arguments in the proof of Theorem 2.7, we have that w is a Brownian motion under ξ(ω) with respect to the filtration generated by (w, b, ℓ), and we see that S(ω) is an admissible reference space for almost every ω. Since ξ(ω) corresponds to the law of a solution to the McKean–Vlasov problem (7)-(8), it follows that ∫ (∫_0^T b s ds + γℓ T −) dξ(w, b, ℓ) ≥ V ∞ almost surely. We have therefore obtained lim inf

(a) α = 0.5, varying γ, N = 800; (b) varying α, γ = 1, N = 800

Figure 7: Cost C T and loss L T in the optimal regime for logarithmically spaced γ ∈ [0.0001, 0.1] and different α in (a), and the dependence of the losses on γ in (b).

Figure 8: Cost-loss pairs (C T , L T ) under the optimal strategy compared to those for the constant strategy, (C u T , L u T ), and the front-loaded strategy, (C f T , L f T ), for two values of α.