Minimizing spectral risk measures applied to Markov decision processes

We study the minimization of a spectral risk measure of the total discounted cost generated by a Markov Decision Process (MDP) over a finite or infinite planning horizon. The MDP is assumed to have Borel state and action spaces and the cost function may be unbounded above. The optimization problem is split into two minimization problems using an infimum representation for spectral risk measures. We show that the inner minimization problem can be solved as an ordinary MDP on an extended state space and give sufficient conditions under which an optimal policy exists. Regarding the infinite dimensional outer minimization problem, we prove the existence of a solution and derive an algorithm for its numerical approximation. Our results include the findings in Bäuerle and Ott (Math Methods Oper Res 74(3):361–379, 2011) in the special case that the risk measure is Expected Shortfall. As an application, we present a dynamic extension of the classical static optimal reinsurance problem, where an insurance company minimizes its cost of capital.


Introduction
In the last decade, there have been various proposals to replace the expectation in the optimization of Markov Decision Processes (MDPs) by risk measures. The idea is to take the risk sensitivity of the decision maker into account. Using simply the expectation models a risk-neutral decision maker whose optimal policy can sometimes be very risky; for an example see Bäuerle and Ott (2011).
The literature can be divided into two streams: papers which apply risk measures recursively and papers which apply the risk measure to the total cost. The recursive approach for general MDPs can be found, for example, in Ruszczyński (2010); Chu and Zhang (2014); Bäuerle and Glauner (2020). The theory for this kind of model is rather different from the one where the risk measure is applied to the total cost, since the recursive approach still yields a recursive solution procedure directly. In this paper, we contribute to the second model class, i.e. we assume that a cost process is generated over discrete time by a decision maker who aims at minimizing the risk measure applied to the cost over either a finite or an infinite time horizon. The risk measures we consider here are the so-called spectral risk measures, which form a class of coherent risk measures including the Expected Shortfall or Conditional Value-at-Risk. More precisely, spectral risk measures are mixtures of Expected Shortfall at different levels.
For Expected Shortfall, the problem has already been treated in e.g. Bäuerle and Ott (2011); Chow et al. (2015); Ugurlu (2017). Whereas Chow et al. (2015) use a decomposition result for the Expected Shortfall shown in Pflug and Pichler (2016), Bäuerle and Ott (2011) use the representation of Expected Shortfall as the solution of a global optimization problem over a real-valued parameter, see Rockafellar and Uryasev (2000). Interchanging the resulting two infima from the optimization problems yields a two-step method to solve the decision problem. Using the recent representation of spectral risk measures as an optimization problem over functions involving the convex conjugate in Pichler (2015), we follow a similar approach here. The problem can again be decomposed into an inner and an outer optimization problem. The inner problem is to minimize the expectation of a convex function of the total cost. It can be solved with MDP techniques after a suitable extension of the original state space. Note that already here we get some difference to the Expected Shortfall problem. In contrast to Bäuerle and Ott (2011), who assume bounded cost, or Ugurlu (2017), who assumes $L^1$ cost, we only require the cost to be bounded from below. No further integrability assumption is necessary here. Moreover, we allow for general Borel state and action spaces and give continuity and compactness conditions under which an optimal policy exists. The major challenge is now the outer optimization problem, since we have to minimize over a function space and the dependence of the value function of the MDP on the functions is involved. However, we are again able to prove the existence of an optimal policy and an optimal function in the representation of the spectral risk measure. Moreover, by approximating the function space in the right way, we are able to reduce the outer optimization problem to a finite dimensional problem with a predetermined error bound. This yields an algorithm for the solution of the original optimization problem. Using an example from optimal reinsurance, we show how our results can be applied.
Note that for the Expected Shortfall, Chow and Ghavamzadeh (2014) and Tamar et al. (2015) have developed gradient-based methods for the numerical computation of the optimal value and policy. For finite state and action spaces, Li et al. (2017) provide an algorithm for quantile minimization of MDPs, which is a similar problem. However, the outer optimization problem for spectral risk measures is much more demanding, since it is infinite dimensional.
The paper is organized as follows: In the next section, we summarize definitions and properties of risk measures and introduce in particular the class of spectral risk measures which we consider here. In Section 3, we introduce the Markov Decision Model and give continuity and compactness assumptions which will later guarantee the existence of optimal policies. At the end of this section we formulate the spectral risk minimization problem of total cost. We also give some interpretations and show relations to other problems. In Section 4, we consider the inner problem with a finite time horizon. The necessary state space extension is explained as well as the recursive solution algorithm. Moreover, the existence of optimal policies is shown. In the next section we turn to the inner problem with infinite time horizon. We characterize the value function and show how optimal policies are obtained. In Section 6, we discuss our model assumptions. In case the state space is the real line, we show that the restrictive assumption of continuity of the transition function, which we need in the general model, can be replaced by semicontinuity if some further monotonicity assumptions are satisfied. In Section 7, we finally treat the outer optimization problem and prove the existence of an optimal function in the representation of the spectral risk measure. The next section then deals with the numerical treatment of this problem. We show there that the infinite dimensional optimization problem can be approximated by a finite dimensional one. Finally, in the last section we apply our findings to an optimal dynamic reinsurance problem. Problems of this type have been treated in a static setting before, see e.g. Chi and Tan (2013); Cui et al. (2013); Lo (2017); Bäuerle and Glauner (2018), but we will consider them in a dynamic framework for the first time. The aim is to minimize the solvency capital calculated with a spectral risk measure by actively choosing reinsurance contracts for the next period. When the premium for the reinsurance contract is calculated by the expected premium principle, we show that the optimal reinsurance contracts are of stop-loss type.

Spectral Risk Measures
Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and $L^0 = L^0(\Omega, \mathcal{A}, \mathbb{P})$ the vector space of real-valued random variables. By $L^1$ we denote the subspace of integrable random variables and by $L^0_{\ge 0}$ the subset which consists of non-negative random variables. We follow the convention of the actuarial literature that positive realizations of random variables represent losses and negative ones gains. Let $\mathcal{X} \subseteq L^0$ be a convex cone. A risk measure is a functional $\rho : \mathcal{X} \to \bar{\mathbb{R}}$. The following properties are relevant in this paper.
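A risk measure is referred to as monetary if it is monotone and translation invariant. It appears to be consensus in the literature that these two properties are a necessary minimal requirement for any risk measure. Monetary risk measures which are additionally positive homogeneous and subadditive are called coherent. We will focus on the following class of risk measures. An increasing function $\phi : [0,1] \to \mathbb{R}_+$ with $\int_0^1 \phi(u)\,du = 1$ is called spectrum, and the functional $\rho_\phi(X) = \int_0^1 F_X^{-1}(u)\,\phi(u)\,du$ (2.1) is referred to as spectral risk measure. Here, $F_X(x) = \mathbb{P}(X \le x)$, $x \in \mathbb{R}$, denotes the distribution function and $F_X^{-1}(u) = \inf\{x \in \mathbb{R} : F_X(x) \ge u\}$, $u \in [0,1]$, the quantile function of $X$.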
Spectral risk measures were introduced by Acerbi (2002). They have all the properties listed in Definition 2.1. Properties a)-e) follow directly from the respective properties of the quantile function. Verifying subadditivity is more involved, see Dhaene et al. (2000). As part of the proof, they showed that spectral risk measures preserve the increasing convex order. Spectral risk measures belong to the larger class of distortion risk measures.
In the special case of a spectral risk measure, the distortion function is given by $\varphi(u) = \int_0^u \phi(s)\,ds$, $u \in [0,1]$, and is convex. This also shows that it is no restriction to assume $\phi$ to be right-continuous (as the right derivative of a convex function). Conversely, for a convex distortion function without a jump at 1, which implies continuity on $[0,1]$, one can always find a representation as in (2.1) with $\phi$ being a spectrum. Consequently, all distortion risk measures with convex and continuous distortion function are spectral. It has been proven by Dhaene et al. (2000) that the convexity of $\varphi$ is equivalent to $\rho_\varphi$ being subadditive. Note that $\rho_\phi$ is finite on $L^1_{\ge 0}$ if the spectrum $\phi$ is bounded. On $L^0_{\ge 0}$ the value $+\infty$ is possible. Shapiro (2013) has shown that a finite risk measure on $L^1_{\ge 0}$ with all the properties in Definition 2.1 is already spectral with bounded spectrum.
Example 2.4. The most widely used spectral risk measure is Expected Shortfall $\mathrm{ES}_\alpha(X) = \frac{1}{1-\alpha} \int_\alpha^1 F_X^{-1}(u)\,du$, $\alpha \in [0,1)$. Especially in optimization, an infimum representation of Expected Shortfall going back to Rockafellar and Uryasev (2000) is very useful: $\mathrm{ES}_\alpha(X) = \inf_{q \in \mathbb{R}} \{ q + \frac{1}{1-\alpha} \mathbb{E}[(X-q)^+] \}$. (2.2) The infimum is attained at $q = F_X^{-1}(\alpha)$.

Henceforth, we assume w.l.o.g. that $\phi$ is right-continuous. Then $\nu([0,t]) = \phi(t)$ defines a Borel measure on $[0,1]$. Let us define a further measure $\mu$ by $\frac{d\mu}{d\nu}(\alpha) = 1 - \alpha$. Every spectral risk measure can be expressed as a mixture of Expected Shortfall over different confidence levels, see e.g. Proposition 8.18 in McNeil et al. (2015).
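The infimum representation of Expected Shortfall can be checked numerically on an empirical distribution. The following is a minimal sketch, assuming the standard Rockafellar and Uryasev form $q + \mathbb{E}[(X-q)^+]/(1-\alpha)$ of the objective; the function names and the sample are hypothetical and chosen only for illustration.

```python
import numpy as np

def es_quantile_integral(sample, alpha):
    """ES_alpha of the empirical distribution: (1/(1-alpha)) * integral of F^{-1} over [alpha, 1]."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    lower = np.arange(n) / n          # the empirical quantile function equals x[i]
    upper = (np.arange(n) + 1) / n    # on the interval (i/n, (i+1)/n]
    overlap = np.clip(upper - np.maximum(lower, alpha), 0.0, None)
    return float(np.dot(x, overlap) / (1.0 - alpha))

def es_infimum(sample, alpha):
    """ES_alpha via inf_q { q + E[(X-q)^+] / (1-alpha) }.

    The objective is convex and piecewise linear in q with kinks at the
    sample points, so it suffices to search over the sample itself.
    """
    x = np.asarray(sample, dtype=float)
    values = [q + np.mean(np.maximum(x - q, 0.0)) / (1.0 - alpha) for q in x]
    k = int(np.argmin(values))
    return float(values[k]), float(x[k])

sample = [3, 10, 1, 7, 5, 9, 2, 8, 6, 4]   # hypothetical losses
alpha = 0.75
es_direct = es_quantile_integral(sample, alpha)
es_inf, q_star = es_infimum(sample, alpha)
print(es_direct, es_inf, q_star)
```

In this example the minimizing $q$ coincides with the lower $\alpha$-quantile of the sample, in line with the statement that the infimum is attained at the quantile.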
Proposition 2.5. Let $\rho_\phi$ be a spectral risk measure. Then $\mu$ is a probability measure on $[0,1]$ and $\rho_\phi$ has the representation $\rho_\phi(X) = \int_0^1 \mathrm{ES}_\alpha(X)\,\mu(d\alpha)$.

If one allows taking the supremum on the right-hand side over probability measures $\mu$, one obtains the superclass of coherent risk measures, see Kusuoka (2001).
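The mixture representation can be illustrated for a step spectrum. The sketch below assumes a discrete mixing measure $\mu = \sum_k w_k \delta_{\alpha_k}$, which corresponds to the spectrum $\phi(u) = \sum_k \frac{w_k}{1-\alpha_k} \mathbf{1}\{u \ge \alpha_k\}$; the levels, weights and sample are hypothetical. It compares the mixture of Expected Shortfalls with a midpoint-rule evaluation of $\int_0^1 F_X^{-1}(u)\phi(u)\,du$, which is exact here because all breakpoints lie on the grid.

```python
import numpy as np

def expected_shortfall(sample, alpha):
    """ES_alpha of the empirical distribution as an average of the quantile function over [alpha, 1]."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    lower = np.arange(n) / n
    upper = (np.arange(n) + 1) / n
    overlap = np.clip(upper - np.maximum(lower, alpha), 0.0, None)
    return float(np.dot(x, overlap) / (1.0 - alpha))

def spectral_risk(sample, levels, weights, grid=1000):
    """Midpoint-rule value of int_0^1 F^{-1}(u) phi(u) du for the step spectrum induced by (levels, weights)."""
    x = np.sort(np.asarray(sample, dtype=float))
    n = len(x)
    u = (np.arange(grid) + 0.5) / grid
    quantiles = x[np.ceil(u * n).astype(int) - 1]   # empirical quantile function at u
    phi = np.zeros(grid)
    for a, w in zip(levels, weights):
        phi += w / (1.0 - a) * (u >= a)             # each level adds a jump of w / (1 - a)
    return float(np.mean(quantiles * phi))

sample = [3, 10, 1, 7, 5, 9, 2, 8, 6, 4]   # hypothetical losses
levels = [0.0, 0.5, 0.9]                   # hypothetical support of mu
weights = [0.2, 0.5, 0.3]                  # hypothetical weights, summing to one

mixture = sum(w * expected_shortfall(sample, a) for a, w in zip(levels, weights))
direct = spectral_risk(sample, levels, weights)
print(mixture, direct)
```

Both numbers agree up to floating-point error, which is what the mixture representation predicts for this spectrum.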
Using Proposition 2.5, the infimum representation (2.2) of Expected Shortfall can be generalized to spectral risk measures.
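Proposition 2.6. Let $\rho_\phi$ be a spectral risk measure with bounded spectrum. We denote by $G$ the set of increasing convex functions $g : \mathbb{R} \to \mathbb{R}$. Then it holds for $X \in L^0_{\ge 0}$ that $\rho_\phi(X) = \inf_{g \in G} \{ \mathbb{E}[g(X)] + \int_0^1 g^*(\phi(u))\,du \}$, where $g^*$ is the convex conjugate of $g \in G$.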
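Proof. For $X \in L^1_{\ge 0}$ the assertion has been proven by Pichler (2015). For non-integrable $X \in L^0_{\ge 0}$ it follows from Proposition 2.5 that $\rho_\phi(X) = \infty$. Now let $g \in G$ and $U_X \sim U(0,1)$ be the generalized distributional transform of $X$, i.e. $F_X^{-1}(U_X) = X$ a.s. By the definition of the convex conjugate it holds $g(X) + g^*(\phi(U_X)) \ge X\phi(U_X)$. Hence, we have $\mathbb{E}[g(X)] + \int_0^1 g^*(\phi(u))\,du \ge \mathbb{E}[X\phi(U_X)] = \int_0^1 F_X^{-1}(u)\,\phi(u)\,du = \rho_\phi(X) = \infty$. Since $g \in G$ was arbitrary, the assertion follows.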
Remark 2.7. The proof by Pichler (2015) shows that for $X \in L^1_{\ge 0}$ the infimum is attained at a function $g_{\phi,X} \in G$ whose derivative is $g'_{\phi,X}(x) = \phi(F_X(x))$ a.e.
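As a consistency check, one can restrict the infimum over $G$ to a one-parameter family and take the spectrum of Expected Shortfall, $\phi = \frac{1}{1-\alpha}\mathbf{1}_{[\alpha,1]}$. The following worked computation is a sketch under the assumption that the representation of Proposition 2.6 has the form $\mathbb{E}[g(X)] + \int_0^1 g^*(\phi(u))\,du$; the family $g_q(x) = q + \frac{1}{1-\alpha}(x-q)^+$, $q \in \mathbb{R}$, is chosen here for illustration.

```latex
\begin{align*}
% conjugate of g_q on the range of the spectrum, 0 \le y \le 1/(1-\alpha):
% x \mapsto xy - g_q(x) is increasing for x \le q and decreasing for x \ge q
g_q^*(y) &= \sup_{x \in \mathbb{R}} \{ xy - g_q(x) \} = qy - q, \\
% penalty term for \phi = \frac{1}{1-\alpha} 1_{[\alpha,1]}
\int_0^1 g_q^*(\phi(u))\,du
  &= \alpha\, g_q^*(0) + (1-\alpha)\, g_q^*\!\big(\tfrac{1}{1-\alpha}\big)
   = -\alpha q + (1-\alpha)\big(\tfrac{q}{1-\alpha} - q\big) = 0, \\
% value of the objective at g_q
\mathbb{E}[g_q(X)] + \int_0^1 g_q^*(\phi(u))\,du
  &= q + \tfrac{1}{1-\alpha}\, \mathbb{E}[(X-q)^+].
\end{align*}
```

Minimizing the last expression over $q$ gives back the Rockafellar and Uryasev objective for Expected Shortfall, so on this family the general representation reduces to the one-dimensional one.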

Markov Decision Model
We consider the following standard Markov Decision Process with general Borel state and action spaces. The state space $E$ is a Borel space with Borel $\sigma$-algebra $\mathcal{B}(E)$ and the action space $A$ is a Borel space with Borel $\sigma$-algebra $\mathcal{B}(A)$. The possible state-action combinations at time $n$ form a measurable subset $D_n$ of $E \times A$ such that $D_n$ contains the graph of a measurable mapping $E \to A$. The $x$-section of $D_n$, $D_n(x) = \{a \in A : (x,a) \in D_n\}$, is the set of admissible actions in state $x \in E$ at time $n$. Note that the sets $D_n(x)$ are non-empty. We assume that the dynamics of the MDP are given by measurable transition functions $T_n : D_n \times Z \to E$ and depend on disturbances $Z_1, Z_2, \dots$ which are independent random elements on a common probability space $(\Omega, \mathcal{A}, \mathbb{P})$ with values in a measurable space $(Z, \mathcal{Z})$. When the current state is $x_n$, the controller chooses action $a_n \in D_n(x_n)$ and $z_{n+1}$ is the realization of $Z_{n+1}$, then the next state is given by $x_{n+1} = T_n(x_n, a_n, z_{n+1})$. The one-stage cost function $c_n : D_n \times E \to \mathbb{R}_+$ gives the cost $c_n(x, a, x')$ for choosing action $a$ if the system is in state $x$ at time $n$ and the next state is $x'$. The terminal cost function $c_N : E \to \mathbb{R}_+$ gives the cost $c_N(x)$ if the system terminates in state $x$. Note that instead of non-negative costs we can equivalently consider costs which are bounded from below.
The model data is supposed to have the following continuity and compactness properties. Under a finite planning horizon $N \in \mathbb{N}$, we consider the model data for $n = 0, \dots, N-1$. The decision model is called stationary if $D$, $T$, $c$ do not depend on $n$ and the disturbances are identically distributed. If the model is stationary and the terminal cost is zero, we allow for an infinite time horizon $N = \infty$.
For $n \in \mathbb{N}_0$ we denote by $H_n$ the set of feasible histories of the decision process up to time $n$.

In order for the controller's decisions to be implementable, they must be based on the information available at the time of decision making, i.e. be functions of the history of the decision process. With $\Pi$ and $\Pi^M$ we denote the sets of all policies and Markov policies, respectively. It will be clear from the context if $N$-stage or infinite-stage policies are meant. An admissible policy always exists as $D_n$ contains the graph of a measurable mapping.
Since risk measures are defined as real-valued mappings of random variables, we will work with a functional representation of the decision process.The law of motion does not need to be specified explicitly.We define for an initial state x 0 ∈ E and a policy π ∈ Π Here, the process (H π n ) n∈N 0 denotes the history of the decision process viewed as a random element, i.e.
Under a Markov policy the recourse on the random history of the decision process is not needed.
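Even though the model is non-stationary, we explicitly introduce discounting by a factor $\beta > 0$, since for the following state space extension it is relevant whether there is discounting. Otherwise, stationary models with discounting would have to be treated separately. For a finite planning horizon $N \in \mathbb{N}$, the total discounted cost generated by a policy $\pi \in \Pi$ if the initial state is $x \in E$ is given by $C_N^{\pi x} = \sum_{k=0}^{N-1} \beta^k c_k(X_k^\pi, A_k^\pi, X_{k+1}^\pi) + \beta^N c_N(X_N^\pi)$, where $X_k^\pi$ and $A_k^\pi$ denote the state and the action at time $k$ under $\pi$. If the model is stationary and the planning horizon infinite, the total discounted cost is given by $C_\infty^{\pi x} = \sum_{k=0}^{\infty} \beta^k c(X_k^\pi, A_k^\pi, X_{k+1}^\pi)$. For a generic total cost regardless of the planning horizon we write $C^{\pi x}$. Our aim is to find a policy $\pi \in \Pi$ which attains $\inf_{\pi \in \Pi} \rho_\phi(C_N^{\pi x})$ (3.1) or $\inf_{\pi \in \Pi} \rho_\phi(C_\infty^{\pi x})$ (3.2), respectively, for a fixed spectral risk measure $\rho_\phi : L^0_{\ge 0} \to \bar{\mathbb{R}}$ with $\phi(1) < \infty$, i.e. $\phi$ is bounded. We can apply Proposition 2.6 to reformulate the optimization problems (3.1) and (3.2) as $\inf_{\pi \in \Pi} \rho_\phi(C^{\pi x}) = \inf_{g \in G} \inf_{\pi \in \Pi} \{ \mathbb{E}[g(C^{\pi x})] + \int_0^1 g^*(\phi(u))\,du \}$. (3.3) For fixed $g \in G$ we will refer to $\inf_{\pi \in \Pi} \mathbb{E}[g(C^{\pi x})]$ (3.4) as inner optimization problem. In the following two sections, we solve (3.4) as an ordinary MDP on an extended state space. If $C^{\pi x} \in L^0_{\ge 0}$ but not in $L^1$, then $\rho_\phi(C^{\pi x}) = \infty$. These policies are not interesting and can be excluded from the optimization.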
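The total discounted cost along one scenario can be sketched in a few lines. The function below is an illustration under simplifying assumptions: the policy is Markov, the disturbances of the scenario are given as a list, and all model ingredients (transition, cost, terminal cost) are hypothetical callables supplied by the user.

```python
def total_discounted_cost(x0, policy, transition, cost, terminal_cost, disturbances, beta):
    """Total discounted cost of one scenario under a Markov policy.

    policy[n](x) returns the action at time n and disturbances[n]
    plays the role of the realization z_{n+1}.
    """
    x, total, discount = x0, 0.0, 1.0
    for n, z in enumerate(disturbances):
        a = policy[n](x)
        x_next = transition(x, a, z)              # next state from state, action, disturbance
        total += discount * cost(x, a, x_next)    # one-stage cost weighted by beta^n
        discount *= beta
        x = x_next
    return total + discount * terminal_cost(x)    # terminal cost weighted by beta^N

# Hypothetical two-stage example on the real line.
example = total_discounted_cost(
    x0=0.0,
    policy=[lambda x: 1.0, lambda x: 0.0],
    transition=lambda x, a, z: x + a + z,
    cost=lambda x, a, y: y + a,
    terminal_cost=lambda x: 2.0 * x,
    disturbances=[1.0, 2.0],
    beta=0.5,
)
print(example)
```

Evaluating this function on many simulated disturbance sequences yields a sample of the total cost, to which an empirical risk measure can then be applied.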
Since an increasing convex function $g : \mathbb{R} \to \mathbb{R}$ can be viewed as a disutility function, optimality criterion (3.4) implies that the expected disutility of the total discounted cost is minimized. If $g$ is strictly increasing, the optimization problem is not changed by applying $g^{-1}$, i.e. by minimizing the corresponding certainty equivalent $g^{-1}(\mathbb{E}[g(C^{\pi x})])$. For bounded one-stage cost functions, such problems are solved in Bäuerle and Rieder (2014). The special case of the exponential disutility function $g(x) = \exp(\gamma x)$, $\gamma > 0$, has been studied first by Howard and Matheson (1972) in a decision model with finite state and action space. The term risk-sensitive MDP goes back to them. The certainty equivalent corresponding to an exponential disutility is the entropic risk measure $\rho(X) = \frac{1}{\gamma} \log \mathbb{E}[e^{\gamma X}]$. It has been shown by Müller (2007) that an exponential disutility is the only case where the certainty equivalent defines a monetary risk measure apart from expectation itself (linear disutility).
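The certainty equivalent of the exponential disutility is easy to evaluate on a sample. The sketch below assumes the standard form $\frac{1}{\gamma}\log \mathbb{E}[e^{\gamma X}]$ of the entropic risk measure; the sample and the value of $\gamma$ are hypothetical.

```python
import numpy as np

def entropic_risk(sample, gamma):
    """Certainty equivalent (1/gamma) * log E[exp(gamma * X)] of the empirical distribution."""
    x = np.asarray(sample, dtype=float)
    shift = np.max(gamma * x)   # log-sum-exp shift for numerical stability
    return float((shift + np.log(np.mean(np.exp(gamma * x - shift)))) / gamma)

losses = np.array([0.0, 1.0, 2.0, 5.0])   # hypothetical losses
gamma = 0.7
base = entropic_risk(losses, gamma)
shifted = entropic_risk(losses + 2.5, gamma)
print(base, shifted - base)
```

Adding a constant to all losses shifts the result by the same constant, which is the translation invariance required of a monetary risk measure; the value also lies between the sample mean and the sample maximum.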
The concepts of spectral risk measures and expected disutilities (or corresponding certainty equivalents) can be combined to so-called rank-dependent expected disutilities of the form $\rho_\phi(u(X))$, where $u$ is a disutility function. The corresponding certainty equivalent is $u^{-1}(\rho_\phi(u(X)))$. In fact, this concept works more generally for distortion risk measures and incorporates both expected disutilities (identity as distortion function) and distortion risk measures (identity as disutility function). The idea is that the expected disutility is calculated w.r.t. a distorted probability instead of the original probability measure. As long as the distorted probability is spectral, using a rank-dependent disutility instead of $\rho_\phi$ leads to structurally the same inner problem as (3.4), only $g$ is replaced by $g(u(\cdot))$. Our results apply here, too. The certainty equivalent of a rank-dependent expected disutility combining an exponential disutility with a spectral risk measure is itself a convex (but not coherent) risk measure. It has been introduced by Tsanakas and Desli (2003) as distortion-exponential risk measure.

Inner Problem: Finite Planning Horizon
Under a finite planning horizon $N \in \mathbb{N}$, we consider the non-stationary version of the decision model and our aim is to solve $\inf_{\pi \in \Pi} \mathbb{E}[g(C_N^{\pi x})]$ (4.1) for an arbitrary but fixed increasing convex function $g \in G$. We assume that for all $x \in E$ there is at least one policy $\pi$ such that $C_N^{\pi x} \in L^1$. Problem (4.1) is well-defined since the target function is bounded from below by $g(0)$. W.l.o.g. we assume $g \ge 0$. Note that the value $+\infty$ is possible.
As the functions $g \in G$ are in general non-linear, the optimization problem cannot be solved with MDP techniques directly. This can be overcome by extending the state space to $E \times \mathbb{R}_+ \times (0, \infty)$ with the corresponding Borel $\sigma$-algebra, following Bäuerle and Rieder (2014). A generic element of the extended state space is denoted by $(x, s, t)$. The idea is that $s$ summarizes the cost accumulated so far and that $t$ keeps track of the discounting. The action space $A$ and the admissible state-action combinations $D_n$, $n = 0, \dots, N-1$, remain unchanged. Formally, one defines the set of admissible actions in the extended state $(x, s, t)$ at time $n$ as $D_n(x, s, t) = D_n(x)$. The transition function on the new state space is given by $(x, s, t, a, z) \mapsto (T_n(x, a, z),\, s + t\,c_n(x, a, T_n(x, a, z)),\, \beta t)$. Feasible histories of the decision model with extended state space up to time $n$ have the form $(x_0, s_0, t_0, a_0, \dots, x_n, s_n, t_n)$, and the set of such histories is denoted by $H_n$. With $\Pi$ and $\Pi^M$ we denote the sets of history-dependent and Markov policies for the decision model with extended state space. We will write $\mathbb{E}_{n h_n}$ for a conditional expectation given $H_n^\pi = h_n$, $h_n \in H_n$. The value of a policy $\pi \in \Pi$ at time $n = 0, \dots, N$ is defined as the conditional expectation $V_{n\pi}(h_n)$ of $g$ applied to the total discounted cost given the history $h_n \in H_n$ (4.2). The corresponding value functions are $V_n(h_n) = \inf_{\pi \in \Pi} V_{n\pi}(h_n)$ (4.3).

In the end, the quantity of interest is $V_0(x, 0, 1)$, which agrees with the infimal value of the original inner optimization problem (4.1). But how do we get an optimal policy for problem (4.1)? When starting in $(x_0, 0, 1)$, a history $(x_0, a_0, x_1, a_1, \dots, x_N) \in H_N$ of the original decision model uniquely determines the history $(x_0, s_0, t_0, a_0, x_1, s_1, t_1, a_1, \dots, x_N, s_N, t_N) \in H_N$ of the decision model with extended state space through $s_0 = 0$, $t_0 = 1$, $s_{n+1} = s_n + t_n c_n(x_n, a_n, x_{n+1})$ and $t_{n+1} = \beta t_n$. Hence, for the initial state $(x_0, 0, 1)$, a Markov policy $\pi = (d_0, \dots, d_{N-1}) \in \Pi^M$ of the extended model, which will turn out to be optimal for (4.3), can be perceived as a history-dependent policy $\pi' = (d'_0, \dots, d'_{N-1}) \in \Pi$ of the original decision model, since we can find measurable functions $d'_n$ of the original history with $d'_n(x_0, a_0, \dots, x_n) = d_n(x_n, s_n, t_n)$. Analogously, a history-dependent policy $\pi \in \Pi$ of the extended model can be regarded as a history-dependent policy of the original decision model. We can now proceed to deriving an iteration for the policy values (4.2).
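How the summary variables $s$ and $t$ evolve along a history can be made concrete with a short sketch. It assumes the update rule $s \mapsto s + t \cdot c$, $t \mapsto \beta t$ with start values $s_0 = 0$, $t_0 = 1$; the cost sequence is hypothetical.

```python
def extend_history(costs, beta):
    """Map one-stage costs c_0, c_1, ... to the summary variables (s_n, t_n), n = 0, 1, ..."""
    s, t = 0.0, 1.0
    path = [(s, t)]
    for c in costs:
        s, t = s + t * c, beta * t   # accumulate discounted cost, update discount factor
        path.append((s, t))
    return path

path = extend_history([2.0, 1.0, 4.0], beta=0.5)
print(path)
```

After $n$ steps, $s_n$ equals the discounted cost $\sum_{k<n} \beta^k c_k$ accumulated so far and $t_n = \beta^n$, so the pair carries exactly the information about the past that the non-linear function $g$ requires.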
Proposition 4.1. The value of a policy $\pi \in \Pi$ can be calculated recursively for $n = 0, \dots, N-1$ and $h_n \in H_n$ as

Proof. The proof is by backward induction. At time $N$ there is nothing to show. Now assume the assertion holds for $n+1$, then the tower property of conditional expectation yields

Remark 4.2. If there is no discounting or if the discounting is included in the non-stationary one-stage cost functions, the second summary variable $t$ is obviously not needed. In the special case that $\rho_\phi$ is the Expected Shortfall, one only has to consider the functions $g_q(x) = (x-q)^+$, $q \in \mathbb{R}$. Due to their positive homogeneity in $(x, q)$, it suffices to extend the state space by only one real-valued summary variable even if there is discounting, cf. Bäuerle and Ott (2011).
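Let us now consider specifically Markov policies $\pi \in \Pi^M$. The space $M$ of lower semicontinuous functions $v$ on the extended state space such that $v(x, \cdot, \cdot)$ is increasing for every $x \in E$ and $v(x, s, t) \ge g(s)$ for all $(x, s, t)$ turns out to be the set of potential value functions under such policies. In order to simplify the notation, we introduce the usual operators on $M$. All $v \in M$ are non-negative and thus at least quasi-integrable.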
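Definition 4.3. For $v \in M$ and a Markov decision rule $d$ on the extended state space we define $\mathcal{L}_n v(x, s, t, a) = \mathbb{E}[v(T_n(x, a, Z_{n+1}),\, s + t\,c_n(x, a, T_n(x, a, Z_{n+1})),\, \beta t)]$, $\mathcal{T}_{nd} v(x, s, t) = \mathcal{L}_n v(x, s, t, d(x, s, t))$ and $\mathcal{T}_n v(x, s, t) = \inf_{a \in D_n(x)} \mathcal{L}_n v(x, s, t, a)$.

Note that the operators are monotone in $v$. Under a Markov policy $\pi = (d_0, \dots, d_{N-1}) \in \Pi^M$ the value iteration can be expressed with the operators. In order to distinguish from the history-dependent case, we denote the Markov value functions by $J$.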
The corresponding Markov value functions are defined for $n = 0, \dots, N$ as $J_n(x, s, t) = \inf_{\pi \in \Pi^M} J_{n\pi}(x, s, t)$. The next result shows that $V_n$ satisfies a Bellman equation and proves that an optimal policy exists and is Markov.
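Theorem 4.4. Let Assumption 3.1 be satisfied. Then, for $n = 0, \dots, N$, the value functions $V_n$ only depend on $(x_n, s_n, t_n)$, i.e. $V_n(h_n) = J_n(x_n, s_n, t_n)$ for all $h_n \in H_n$, lie in $M$ and satisfy the Bellman equation $J_N(x, s, t) = g(s + t\,c_N(x))$, $J_n(x, s, t) = \mathcal{T}_n J_{n+1}(x, s, t)$, $n = 0, \dots, N-1$. Furthermore, for $n = 0, \dots, N-1$ there exist Markov decision rules $d_n^* : E \to A$ with $\mathcal{T}_{n d_n^*} J_{n+1} = \mathcal{T}_n J_{n+1}$, and every sequence of such minimizers constitutes an optimal policy $\pi^* = (d_0^*, \dots, d_{N-1}^*) \in \Pi^M$ for problem (4.3).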
Proof. The proof is by backward induction. At time $N$ we have $V_N(h_N) = g(s_N + t_N c_N(x_N)) = J_N(x_N, s_N, t_N)$, which is
• lower semicontinuous since $g$ is increasing and continuous (as a convex function on $\mathbb{R}$) and $c_N$ is lower semicontinuous,
• increasing in $(s_N, t_N)$ since $g$ is increasing and $c_N$ is non-negative,
• bounded below by $g(s_N)$ since $g$ is increasing and $t_N c_N(x_N) \ge 0$,
i.e. in $M$. Assuming the assertion holds at time $n+1$ we have at time $n$ for

The last equality holds since the minimization does not depend on the entire policy but only on

Here, objective and constraint depend on the history of the process only through $x_n$. Thus, given existence of a minimizing Markov decision rule $d_n^*$, (4.4) equals

Again by the induction hypothesis, there exists an optimal Markov policy $\pi^* \in \Pi^M$ such that $J_{n+1} = J_{n+1\,\pi^*}$. Hence, we have

It remains to show the existence of a minimizing Markov decision rule $d_n^*$ and that $J_n \in M$. We want to apply Proposition 2.4.3 of Bäuerle and Rieder (2011). The set-valued mapping $E \ni (x, s, t) \mapsto D_n(x)$ is compact-valued and upper semicontinuous. Next, we show that $D_n \ni (x, s, t, a) \mapsto \mathcal{L}_n v(x, s, t, a)$ is lower semicontinuous for every $v \in M$. Since $v \ge g \ge 0$, we can apply Fatou's Lemma, which yields that $\mathcal{L}_n v$ is lower semicontinuous. Proposition 2.4.3 of Bäuerle and Rieder (2011) then gives the existence of a minimizing decision rule $d_n^*$ and the lower semicontinuity of $\mathcal{T}_n v$. Now fix $x \in E$. The fact that $(s, t) \mapsto \mathcal{T}_n v(x, s, t)$ is increasing follows as in Theorem 2.4.14 in Bäuerle and Rieder (2011). The inequality $\mathcal{T}_n v(x, s, t) \ge g(s)$, $(x, s, t) \in E$, is obvious. Taken together, we have $\mathcal{T}_n v \in M$ and the proof is complete.
Remark 4.5. From Theorem 4.4 it follows that the sequence

is a sufficient statistic of the decision model with the original state space in the sense of Hinderer (1970).
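The backward recursion on the extended state space can be tried out on a toy model. The following is a minimal sketch, not the general Borel setting of the paper: it assumes a finite state and action space, transition probabilities instead of a transition function with disturbances, and a horizon of $N = 2$; all numbers, the cost functions and the choice $g(c) = (c - 3)^+$ are hypothetical. The result is compared with a brute-force minimization over all history-dependent policies of the original model.

```python
from functools import lru_cache
from itertools import product

STATES = (0, 1)
ACTIONS = (0, 1)
BETA = 0.9
N = 2
# P[a][x][y]: probability of moving from x to y under action a (hypothetical numbers)
P = {0: ((0.8, 0.2), (0.3, 0.7)),
     1: ((0.5, 0.5), (0.9, 0.1))}

def cost(x, a, y):
    return 1.0 + 2.0 * y + 0.5 * a        # hypothetical one-stage cost

def terminal_cost(x):
    return float(x)                       # hypothetical terminal cost

def g(c):
    return max(c - 3.0, 0.0)              # increasing convex function of the total cost

@lru_cache(maxsize=None)
def value(n, x, s, t):
    """Value at time n in the extended state (x, s, t), computed by backward recursion."""
    if n == N:
        return g(s + t * terminal_cost(x))
    return min(
        sum(P[a][x][y] * value(n + 1, y, s + t * cost(x, a, y), BETA * t) for y in STATES)
        for a in ACTIONS
    )

def brute_force(x0):
    """Minimize E[g(total cost)] over all history-dependent policies for N = 2."""
    best = float("inf")
    for a0 in ACTIONS:
        # with x0 and a0 fixed, the second decision is a map x1 -> a1
        for rule in product(ACTIONS, repeat=len(STATES)):
            total = 0.0
            for x1 in STATES:
                a1 = rule[x1]
                for x2 in STATES:
                    prob = P[a0][x0][x1] * P[a1][x1][x2]
                    c = cost(x0, a0, x1) + BETA * cost(x1, a1, x2) + BETA**2 * terminal_cost(x2)
                    total += prob * g(c)
            best = min(best, total)
    return best

for x0 in STATES:
    print(x0, value(0, x0, 0.0, 1.0), brute_force(x0))
```

In this toy model the recursion started in $(x_0, 0, 1)$ reproduces the brute-force optimum, which illustrates that a Markov policy of the extended model suffices even though the criterion is non-linear in the total cost.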

Inner Problem: Infinite Planning Horizon
In this section, we consider the inner optimization problem (3.4) of the risk-sensitive total cost minimization under an infinite planning horizon. This is reasonable if the terminal period is unknown or if one wants to approximate a model with a large but finite planning horizon. Solving the infinite horizon problem will turn out to be easier since it admits a stationary optimal policy.
We study the stationary version of the decision model with no terminal cost, i.e. $D$, $T$, $c$ do not depend on $n$, $c_N \equiv 0$ and the disturbances are identically distributed. Let $Z$ be a representative of the disturbance distribution. Our aim is to solve $\inf_{\pi \in \Pi} \mathbb{E}[g(C_\infty^{\pi x})]$ (5.1) for an arbitrary but fixed increasing convex function $g \in G$. As in the previous section, we assume w.l.o.g. that $g \ge 0$ and that for all $x \in E$ there exists a policy $\pi$ such that $C_\infty^{\pi x} \in L^1$. The remarks in Section 4 regarding connections to the minimization of (rank-dependent) expected disutilities and corresponding certainty equivalents apply in the infinite horizon case as well.
In order to obtain a value iteration, the state space is extended to $E \times \mathbb{R}_+ \times (0, \infty)$ as in Section 4. The action space $A$ and the admissible state-action combinations remain unchanged, i.e. the admissible actions in the extended state $(x, s, t)$ are $D(x, s, t) = D(x)$. The transition function on the new state space is given by $(x, s, t, a, z) \mapsto (T(x, a, z),\, s + t\,c(x, a, T(x, a, z)),\, \beta t)$. Since the model with infinite planning horizon will be derived as a limit of the one with finite horizon, the consideration can be restricted to Markov policies $\pi = (d_1, d_2, \dots) \in \Pi^M$ due to Theorem 4.4. For the relevant initial state $(x_0, 0, 1)$, a Markov policy $\pi \in \Pi^M$ can be perceived as a history-dependent policy of the original decision model, cf. Section 4. When calculating limits, it is more convenient to index the value functions with the distance to the time horizon rather than the point in time. This is also referred to as forward form of the value iteration and is only possible under Markov policies in a stationary model. There, the two ways of indexing are equivalent. The value of a policy $\pi = (d_0, d_1, \dots) \in \Pi^M$ up to a planning horizon $N \in \mathbb{N}$ is denoted by $J_{N\pi}(x, s, t)$, where $(X_n^\pi, s_n^\pi, t_n^\pi)_{n \in \mathbb{N}}$ is the extended decision process under policy $\pi \in \Pi^M$ with initial state $(x, s, t)$. The change of indexing makes it necessary to write the value iteration in terms of the shifted policy $\vec{\pi} = (d_1, d_2, \dots)$ corresponding to $\pi = (d_0, d_1, \dots) \in \Pi^M$: $J_{N\pi} = \mathcal{T}_{d_0} J_{N-1\,\vec{\pi}}$. (5.2) The value function for finite planning horizon $N \in \mathbb{N}$ is given by $J_N(x, s, t) = \inf_{\pi \in \Pi^M} J_{N\pi}(x, s, t)$ and satisfies, due to Theorem 4.4, the Bellman equation $J_N = \mathcal{T} J_{N-1}$ with $J_0(x, s, t) = g(s)$. The value of a policy $\pi \in \Pi^M$ under an infinite planning horizon is defined as $J_{\infty\pi}(x, s, t) = \lim_{N \to \infty} J_{N\pi}(x, s, t)$. Note that $J_{\infty\pi}$ is well-defined since $c \ge 0$ and hence $J_{N\pi}$ is increasing in $N$. The infinite horizon value function is $J_\infty(x, s, t) = \inf_{\pi \in \Pi^M} J_{\infty\pi}(x, s, t)$, $(x, s, t) \in E$. (5.3) The pointwise limit $J = \lim_{N \to \infty} J_N$, which again exists since $J_N$ is increasing in $N$, is referred to as limit value function. Note that $M$ is closed under pointwise convergence and hence $J \in M$.
Theorem 5.1. Let Assumption 3.1 be satisfied. Then it holds:
a) The infinite horizon value function $J_\infty$ is the smallest fixed point of the Bellman operator $\mathcal{T}$ in $M$ and $J_\infty = J$.
b) There exists a Markov decision rule $d^*$ such that $\mathcal{T}_{d^*} J_\infty = \mathcal{T} J_\infty$ and each stationary policy $\pi^* = (d^*, d^*, \dots)$ induced by such a decision rule is optimal for optimization problem (5.3).
Proof. a) First, we show that $J_\infty = J$. For all $N \in \mathbb{N}$ we have $J_{N\pi} \ge J_N$. Taking the limit $N \to \infty$ we obtain $J_{\infty\pi} \ge J$ for all policies $\pi \in \Pi^M$. Thus $J_\infty \ge J$.
For the reverse inequality we start with $J_{N\pi} \le J_{\infty\pi}$, which is true for all policies $\pi \in \Pi^M$ due to the fact that $c \ge 0$. Taking the infimum over all policies yields $J_N \le J_\infty$ and taking the limit $N \to \infty$ we obtain $J \le J_\infty$. In total, we have $J = J_\infty$.
Let now $v \in M$ be another fixed point of $\mathcal{T}$, i.e. $v = \mathcal{T} v$. Iterating this equality yields $v = \mathcal{T}^n v$ for all $n \in \mathbb{N}$. Since $v \in M$ we have $v \ge g$, and because of the monotonicity of the Bellman operator we get $v = \mathcal{T}^n v \ge \mathcal{T}^n g$. Letting $n \to \infty$ finally implies $v \ge J = J_\infty$, thus $J_\infty$ is the smallest fixed point of the Bellman operator.

b) Since $J_\infty \in M$, the existence of a minimizing Markov decision rule follows as in the proof of Theorem 4.4. Furthermore, it holds $J_\infty(x, s, t) \ge g(s)$, $(x, s, t) \in E$, since $J_\infty \in M$. Consequently, we have

i.e. $\pi^*$ is optimal. The first equality is by part a), the inequality thereafter by the monotonicity of the operator $\mathcal{T}_{d^*}$ and the second equality by the value iteration (5.2).
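The approximation of the infinite horizon value by finite horizon values can be observed in a small experiment. The sketch below makes simplifying assumptions that are not part of the general model: finite state and action spaces, transition probabilities instead of a transition function, bounded cost and $\beta < 1$; all numbers and the choice $g(c) = (c-2)^+$ are hypothetical. It iterates $J_0(x, s, t) = g(s)$ and $J_n = \mathcal{T} J_{n-1}$ by recursion on the extended state.

```python
from functools import lru_cache

STATES = (0, 1)
ACTIONS = (0, 1)
BETA = 0.5
C_MAX = 3.5   # upper bound of the hypothetical one-stage cost below
# P[a][x][y]: probability of moving from x to y under action a (hypothetical numbers)
P = {0: ((0.8, 0.2), (0.3, 0.7)),
     1: ((0.5, 0.5), (0.9, 0.1))}

def cost(x, a, y):
    return 1.0 + 2.0 * y + 0.5 * a        # non-negative and bounded by C_MAX

def g(c):
    return max(c - 2.0, 0.0)              # increasing, convex and 1-Lipschitz

@lru_cache(maxsize=None)
def J(n, x, s, t):
    """n-stage value in the extended state (x, s, t): J_0 = g(s), J_n = T J_{n-1}."""
    if n == 0:
        return g(s)
    return min(
        sum(P[a][x][y] * J(n - 1, y, s + t * cost(x, a, y), BETA * t) for y in STATES)
        for a in ACTIONS
    )

values = [J(n, 0, 0.0, 1.0) for n in range(8)]
for n, v in enumerate(values):
    print(n, round(v, 6))
```

The printed values increase in $n$, as expected for non-negative cost. Since this particular $g$ is 1-Lipschitz, consecutive values differ by at most $\beta^n$ times the cost bound, so the sequence stabilizes quickly; this rate is a feature of the toy example and not a statement about the general model.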

Relaxed Assumptions for Monotone Models
The model has been introduced in Section 3 with a general Borel space as state space. In order to solve the optimization problem in Sections 4 and 5 we needed a continuous transition function despite having a semicontinuous model. This assumption on the transition function can be relaxed to semicontinuity if the state space is the real line and the transition and one-stage cost functions have some form of monotonicity. For notational convenience, we consider the stationary model with no terminal cost under both finite and infinite horizon in this section. We replace Assumption 3.1 by Assumption 6.1.
(i) The state space is the real line $E = \mathbb{R}$.
(ii) The sets $D(x)$ are compact and $\mathbb{R} \ni x \mapsto D(x)$ is upper semicontinuous and decreasing, i.e. $D(x) \supseteq D(y)$ for $x \le y$.
(iii) The transition function $T$ is lower semicontinuous in $(x, a)$ and increasing in $x$.
(iv) The one-stage cost $c(x, a, T(x, a, z))$ is lower semicontinuous in $(x, a)$ and increasing in $x$.
Requiring that the one-stage cost function $c$ is lower semicontinuous in $(x, a, x')$ and increasing in $(x, x')$ is sufficient for Assumption 6.1 (iv) to hold due to part (iii) of the assumption.
How do the modified continuity assumptions affect the validity of the results in Sections 4 and 5? The only two results that were proven using the continuity of the transition function $T$ in $(x, a)$ and not only its measurability are Theorems 4.4 and 5.1. All other statements are unaffected.
Proposition 6.2. The assertions of Theorems 4.4 and 5.1 hold under Assumption 6.1, too. Moreover, the value functions $J_n$ and $J_\infty$ are increasing. The set of potential value functions can therefore be replaced by the set of lower semicontinuous and increasing functions $v$ with $v(x, s, t) \ge g(s)$ for all $(x, s, t)$.
Proof. In Theorem 4.4, the continuity of $T$ is used to show that $D \ni (x, s, t, a) \mapsto \mathcal{L} v(x, s, t, a)$ is lower semicontinuous for every $v \in M$. Due to the monotonicity assumptions, the mapping $(x, s, t, a) \mapsto v(T(x, a, Z(\omega)),\, s + t\,c(x, a, T(x, a, Z(\omega))),\, \beta t)$ is lower semicontinuous for every $\omega \in \Omega$ as a composition of an increasing lower semicontinuous function with a lower semicontinuous one. Now, the lower semicontinuity of $D \ni (x, s, t, a) \mapsto \mathcal{L} v(x, s, t, a)$ and the existence of a minimizing decision rule follow as in the proof of Theorem 4.4. The fact that $\mathcal{T} v$ is increasing for every $v \in M$ follows as in Theorem 2.4.14 in Bäuerle and Rieder (2011). In Theorem 5.1, the continuity of $T$ is only used indirectly through Theorem 4.4. Note that $J_\infty \in M$ since the pointwise limit of increasing functions remains increasing.
The monotonicity properties of Assumption 6.1 can be used to construct a convex model.

Lemma 6.3. Let Assumption 6.1 be satisfied, $A$ be a subset of a real vector space, the admissible state-action combinations $D$ be a convex set, the transition function $T$ be convex in $(x, a)$ and the one-stage cost $D \ni (x, a) \mapsto c(x, a, T(x, a, z))$ be a convex function for every $z \in Z$. Then the value functions $J_n(\cdot, \cdot, t)$ and $J_\infty(\cdot, \cdot, t)$ are convex for every $t > 0$.
Proof. We prove by induction that $J_n$ is convex in $(x, s)$ for $n \in \mathbb{N}_0$. Then $J_\infty$ is convex as a pointwise limit of convex functions. For $n = 0$ we know that $J_0(x, s, t) = g(s)$ is convex in $(x, s)$. Now assume that $J_n$ is convex in $(x, s)$. Recall that $J_n$ is increasing by Proposition 6.2. Hence, for every $\omega \in \Omega$ and $t > 0$ the function $(x, s, a) \mapsto J_n(T(x, a, Z(\omega)),\, s + t\,c(x, a, T(x, a, Z(\omega))),\, \beta t)$ is convex as a composition of an increasing convex function with a convex function. By the linearity of expectation, $(x, s, a) \mapsto \mathcal{L} J_n(x, s, t, a)$ is convex, too, for every $t > 0$. Now, the convexity of $J_{n+1}$ follows from Proposition 2.4.18 in Bäuerle and Rieder (2011).
If $c$ is increasing in $x'$, it is sufficient to require that $c$ and $T$ are convex in $(x, a)$. The monotonicity requirements in Assumption 6.1 are only one option. The following alternative is relevant i.a. for the dynamic reinsurance model in Section 9. For a proof see Section 6.1.3 in Glauner (2020).

(iv') $c(x, a, T(x, a, z))$ is lower semicontinuous in $(x, a)$ and decreasing in $x$.
Then, the assertions of Theorems 4.4 and 5.1 still hold with the value functions J_n and J_∞ being decreasing in x and increasing in (s, t). If furthermore A is a subset of a real vector space, D is a convex set, T is concave in (x, a) and D ∋ (x, a) → c(x, a, T(x, a, z)) is convex for every z ∈ Z, then the value functions J_n(·, ·, t) and J_∞(·, ·, t) are convex for every t > 0.

Outer Problem: Existence
In this section, we study the existence of a solution to the outer optimization problem (3.3) under both a finite and an infinite planning horizon. Given a solution of the respective inner problem for every g ∈ G, the two outer problems are essentially the same and are therefore treated together. We have assumed in both cases that for all x ∈ E there exists a policy π such that C^{πx} ∈ L^1 and thus ρ_φ(C^{πx}) = ρ < ∞. Hence, in what follows we can restrict to policies π such that ρ_φ(C^{πx}) ≤ ρ. In this case, we can further restrict the set G in the representation of Proposition 2.6.
Lemma 7.1. It is sufficient to consider functions g ∈ G in the representation of Proposition 2.6 which are φ(1)-Lipschitz and satisfy 0 ≤ g(x) ≤ ḡ(x), x ∈ R, where The space of such functions is denoted by G.
Proof. Set C = C^{πx} to simplify the notation and assume that ρ_φ(C) ≤ ρ. We know from Remark 2.7 that the optimal g ∈ G corresponding to C is with µ from Proposition 2.5. Since C ≥ 0 it follows Furthermore, we have The first inequality uses is by definition of µ. As a convex function, g_{φ,C} is almost everywhere differentiable with derivative g′_{φ,C}(x) = φ(F_C(x)) ≤ φ(1), cf. Remark 2.7. This establishes the Lipschitz continuity with constant L = φ(1).
For a fixed policy π ∈ Π^M, the optimal solution of the outer problem is already given by Remark 2.7 as However, we solved the inner problem for arbitrary but fixed g ∈ G. Hence, the optimal policy depends on g and Proposition 2.6 is not helpful. As a first step in ensuring the existence of a solution of the outer problem, we study the dependence of the value functions of the inner problem on g. In order to do so, we need some structure on G.
Lemma 7.2. (G, m) is a compact metric space, where is the metric of compact convergence.
Proof. Since G ⊆ C(R, R), it suffices to show that G is closed w.r.t. m and to verify the assumptions of the Arzelà-Ascoli theorem. Note that convergence w.r.t. m implies pointwise convergence. Convexity, monotonicity, the common Lipschitz constant φ(1), non-negativity and the pointwise upper bound ḡ are all preserved even under pointwise convergence. Hence, G is closed w.r.t. m. Moreover, G is pointwise bounded and the common Lipschitz constant implies that it is uniformly equicontinuous.
For clarity we index the value functions with g. The value functions J_0^g of the finite horizon inner problem and J_∞^g of the infinite horizon inner problem depend semicontinuously on g.
Proof. The proof is by backward induction. At time N we have to verify that , then g converges in particular pointwise and

Case 1: {c_k}_{k∈N} is bounded above and therefore convergent with limit ĉ. Then since c_N is lower semicontinuous. As the functions {g_k}_{k∈N} and g are all increasing, we get lim inf

Case 2: {c_k}_{k∈N} is unbounded above. Then there exists K ∈ N such that c_k ≥ c_N(x) for all k ≥ K and lim inf

Now assume the assertion holds for n + 1. By Theorem 4.4 we have at time n The integrand J_{n+1}^g(T_n(x, a, Z_{n+1}(ω)), s + t c_n(x, a, T_n(x, a, Z_{n+1}(ω))), βt) is lower semicontinuous in (g, x, s, t, a) for every ω ∈ Ω by the induction hypothesis. Hence, if (g_k, x_k, s_k, t_k) → (g, x, s, t), Fatou's lemma and the monotonicity of expectation yield I.e., (g, x, s, t) → L_n J_{n+1}^g(x, s, t, a) is lower semicontinuous. As the set-valued mapping E ∋ x → D(x) is compact valued and upper semicontinuous, is lower semicontinuous by Proposition 2.4.3 in Bäuerle and Rieder (2011).
Proof. Under an infinite planning horizon, we consider a stationary model and use forward indexing for the value functions J_N^g. They are lower semicontinuous in (g, x, s, t) by Lemma 7.3. Note that the induction basis holds in particular for c_N ≡ 0. Since J_N^g ↑ J_∞^g as N → ∞, the assertion follows from Lemma A.1.4 in Bäuerle and Rieder (2011).
For initial state x ∈ E and finite planning horizon N ∈ N the outer problem is given by inf_{g∈G} J_0^g(x, 0, 1) + ∫_0^1 g^*(φ(u)) du and for infinite planning horizon by inf_{g∈G} J_∞^g(x, 0, 1) + ∫_0^1 g^*(φ(u)) du. In the following, we will only use the semicontinuity of the value functions in g. Hence, we write inf_{g∈G} J(g) for a generic outer problem and suppress initial state and planning horizon.
Theorem 7.5. Under Assumption 3.1 there exists a solution for the outer optimization problem (7.1).
Proof. We want to apply Weierstraß' extreme value theorem. In view of Lemmata 7.2, 7.3 and 7.4 it suffices to show that the functional G ∋ g → The inequality holds generally for the interchange of infimum and supremum, the equality thereafter by Lemma A.1.6 in Bäuerle and Rieder (2011) and the last but one equality since the sequence {g_k}_{k∈N} is in particular pointwise convergent. Moreover, note that for all k ∈ N and u ∈ [0, 1] it holds Now, Fatou's lemma and (7.2) yield with lim inf the assertion.

Outer Problem: Numerical Approximation
Now that we know that a solution to the outer optimization problem (7.1) exists, this section aims to determine it numerically. The idea is to approximate the functions g ∈ G by piecewise linear ones and thereby obtain a finite dimensional optimization problem which can be solved with classical methods of global optimization. We are going to show that the minimal values converge when the approximation is continuously refined and give an error bound. Regarding the second summand of the objective function (7.1), our method coincides with the Fast Legendre-Fenchel Transform (FLT) algorithm studied, among others, by Corrias (1996).
For unbounded cost C_N^{πx} with N ∈ N ∪ {∞}, π ∈ Π, x ∈ E, the functions g ∈ G would have to be approximated on the whole non-negative real line. This is numerically not feasible. The bounded cost allows for a further reduction of the feasible set of the outer problem. On the reduced feasible set, the second summand of the objective function is guaranteed to be finite and easier to calculate. Recall that the convex conjugate of g ∈ G is an R-valued function defined by g^*(y) = sup_{s∈R} {sy − g(s)}, y ∈ R.
a) Fix π ∈ Π, x ∈ E and set C = C_N^{πx} to simplify the notation. We know from Remark 2.7 that the optimal g ∈ G corresponding to C is with µ from Proposition 2.5. Clearly, it is sufficient to consider functions g ∈ G which are optimal for at least one As a convex function, g_{φ,C} is almost everywhere differentiable with derivative g′_{φ,C}(s) = φ(F_C(s)), cf. Remark 2.7, and for s > ĉ it holds F_C(s) = 1.

b) Let g ∈ G and y ∈ [0, φ(1)]. For s ≥ ĉ the function s → sy − g(s) = (y − φ(1))s − g(ĉ) + φ(1)ĉ is decreasing and for s ≤ 0 the function is increasing. Hence, it suffices to consider the supremum over [0, ĉ].
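The reduction of the supremum to the compact interval [0, ĉ] can be illustrated numerically. The following Python sketch approximates g^*(y) by a maximum over an equidistant grid. The function name `conjugate_on_grid`, the grid size and the test function g (of Expected Shortfall type with level `alpha` and kink `q`) are assumptions made here for illustration; they do not appear in the text.

```python
import numpy as np

def conjugate_on_grid(g, y, c_hat, n_grid=1001):
    """Approximate g*(y) = sup over s in [0, c_hat] of (s*y - g(s)) on an equidistant grid."""
    s = np.linspace(0.0, c_hat, n_grid)
    return float(np.max(s * y - g(s)))

# Hypothetical example data: Expected-Shortfall-type function with level alpha and kink q.
alpha, q, c_hat = 0.5, 2.0, 10.0
phi_1 = 1.0 / (1.0 - alpha)  # upper bound phi(1) of the Expected Shortfall spectrum

def g(s):
    # Convex, increasing and phi(1)-Lipschitz.
    return q + np.maximum(s - q, 0.0) / (1.0 - alpha)

conj_at_0 = conjugate_on_grid(g, 0.0, c_hat)
conj_at_phi1 = conjugate_on_grid(g, phi_1, c_hat)
print(conj_at_0, conj_at_phi1)
```

For a general g the grid maximum is only a lower approximation of the supremum; it is exact here because the maximizers lie on grid points.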
The fact that the supremum in the convex conjugate reduces to the maximum of a continuous function over a compact set opens the door for a numerical approximation with the FLT algorithm. By definition of G, it is sufficient to approximate the functions g ∈ G on the interval [0, ĉ]. For the value iteration in Lemma 4.1 and equation (5.2) it may be necessary to evaluate g at some s > ĉ, but there the function is determined as a linear continuation with slope φ(1). On the interval I = [0, ĉ], the metric of compact convergence reduces to the supremum norm ‖·‖_∞. For the piecewise linear approximation we consider equidistant partitions 0 which projects a function g ∈ G to its piecewise linear approximation and its image G_m = {p_m(g) : g ∈ G}. For considering the restriction of the outer optimization problem (7.1) to G_m it is convenient to define for g ∈ G

Proof. The first inequality is obvious and it remains to prove the second. We have for N Moreover, it holds for y ∈ [0, φ(1)] Finally, the assertion follows with

The proposition shows that the infimum of K_m converges to the one of K. The error of restricting the outer problem (7.1) to G_m is bounded by 2φ(1)ĉ/(m − 1). The piecewise linear functions g ∈ G_m are uniquely determined by their values at the kinks s_1, . . . , s_m. Hence, we can identify G_m with the compact set Note that due to translation invariance of ρ_φ it holds under Assumption 8.1 for g ∈ G that g(0) ≤ ḡ(0) = ρ( C) = ρ(ĉ) = ĉ. Thus, the outer problem (7.1) restricted to G_m becomes finite dimensional: where g_y ∈ G_m is the piecewise linear function induced by y ∈ Γ_m, i.e.
How to evaluate J(·) at g_y, y ∈ Γ_m, has been discussed in Sections 4 and 5. The next lemma simplifies the evaluation of the second summand of the objective function (8.1) to calculating the integrals

Lemma 8.4. The convex conjugate of g_y, y ∈ Γ_m, in ξ ∈ [0, φ(1)] is given by , this implies the last equality. The third case c_{m−1} < ξ is analogous.
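For a piecewise linear function, the map s → sξ − g(s) is affine on every segment, so its maximum over [0, ĉ] is attained at a kink. The Python sketch below uses this observation to evaluate the conjugate and a simple midpoint rule for the integral of g^*(φ(u)) over [0, 1]. It is not the closed form of Lemma 8.4; the function names, the quadrature rule and the example data (Expected Shortfall spectrum φ(u) = 1/(1 − α) for u ≥ α and 0 otherwise) are assumptions made for illustration.

```python
import numpy as np

def conjugate_pl(s_kinks, y_vals, xi):
    """Conjugate of the piecewise linear g with g(s_i) = y_i, restricted to [s_1, s_m].

    On each segment s -> s*xi - g(s) is affine, so the maximum sits at a kink.
    """
    return float(np.max(s_kinks * xi - y_vals))

def conjugate_integral(s_kinks, y_vals, phi, n_nodes=1000):
    """Midpoint rule for the integral of g*(phi(u)) over u in [0, 1]."""
    u = (np.arange(n_nodes) + 0.5) / n_nodes
    return float(np.mean([conjugate_pl(s_kinks, y_vals, phi(ui)) for ui in u]))

# Hypothetical data: Expected Shortfall at level alpha, m equidistant kinks on [0, c_hat].
alpha, q, c_hat, m = 0.5, 2.0, 10.0, 11
s_kinks = np.linspace(0.0, c_hat, m)
y_vals = q + np.maximum(s_kinks - q, 0.0) / (1.0 - alpha)

def phi(u):
    # Step spectrum of Expected Shortfall at level alpha.
    return 1.0 / (1.0 - alpha) if u >= alpha else 0.0

penalty = conjugate_integral(s_kinks, y_vals, phi)
print(penalty)
```

In this example the two conjugate values −q and qα/(1 − α) cancel after weighting with α and 1 − α, so the second summand of the objective vanishes.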
The results of this section can be used to set up an algorithm for the problems in (3.1) and (3.2). To solve the problem with error bound ε, first set m = ⌈2φ(1)ĉ/ε⌉ + 1. Then choose y_0 ∈ Γ_m and solve the inner problem with g_{y_0}. Since (8.1) is not guaranteed to be convex in y, use a global optimization procedure such as simulated annealing to determine the next point y_1 and, by iterating, the optimal value of (8.1).
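The first step of this procedure, choosing the number of kinks and projecting onto piecewise linear functions, can be sketched in Python as follows. The rounding up to an integer, the function names `grid_size` and `project`, and the quadratic test function are assumptions made here for illustration. The check at the end uses the elementary fact that linear interpolation of a φ(1)-Lipschitz function on a grid with mesh ĉ/(m − 1) deviates from it by at most φ(1)ĉ/(m − 1) in the supremum norm.

```python
import math
import numpy as np

def grid_size(phi_1, c_hat, eps):
    """Number of kinks m = ceil(2*phi(1)*c_hat/eps) + 1, so that 2*phi(1)*c_hat/(m-1) <= eps."""
    return math.ceil(2.0 * phi_1 * c_hat / eps) + 1

def project(g, c_hat, m):
    """Piecewise linear interpolation of g at m equidistant kinks on [0, c_hat]."""
    s_kinks = np.linspace(0.0, c_hat, m)
    y_vals = g(s_kinks)
    return lambda s: np.interp(s, s_kinks, y_vals)

phi_1, c_hat, eps = 2.0, 10.0, 0.5
m = grid_size(phi_1, c_hat, eps)
error_bound = 2.0 * phi_1 * c_hat / (m - 1)

def g(s):
    # Hypothetical convex, increasing, phi(1)-Lipschitz test function on [0, c_hat].
    return 1.0 + phi_1 * s**2 / (2.0 * c_hat)

g_m = project(g, c_hat, m)
s_fine = np.linspace(0.0, c_hat, 10_001)
sup_error = float(np.max(np.abs(g(s_fine) - g_m(s_fine))))
print(m, error_bound, sup_error)
```

The global search over Γ_m is deliberately left out, since it requires a solver for the inner problem from Sections 4 and 5.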

Dynamic Optimal Reinsurance
As an application, we present a dynamic extension in discrete time of the static optimal reinsurance problem min In this setting, an insurance company incurs an aggregate loss Y ∈ L^1_{≥0} at the end of a fixed period due to insurance claims. In order to reduce its risk, the insurer may cede a portion of it to a reinsurance company and retain only f(Y). Here, the reinsurance treaty f determines the retained loss f(Y(ω)) in each scenario ω ∈ Ω. For the risk transfer, the insurer has to compensate the reinsurer with a reinsurance premium π_R(f) = π_R(Y − f(Y)), where π_R : L^1_{≥0} → R is a premium principle with properties similar to a risk measure. Most widely used is the expected premium principle π_R(X) = (1 + θ)E[X] with safety loading θ > 0. In order to preclude moral hazard, it is standard in the actuarial literature to assume that both f and the ceded loss function id_{R_+} − f are increasing. Hence, the set of admissible retained loss functions is The insurer's target is to minimize its cost of solvency capital, which is calculated as the cost of capital rate r_CoC ∈ (0, 1] times the solvency capital requirement determined by applying the risk measure ρ to the insurer's effective risk after reinsurance.

First research on the optimal reinsurance problem (9.1) dates back to the 1960s. Borch (1960) proved that a stop-loss reinsurance treaty minimizes the variance of the retained loss of the insurer given the premium is calculated with the expected value principle. A similar result has been derived in Arrow (1963), where the expected utility of terminal wealth of the insurer has been maximized. Since then, many generalizations of this problem have been considered. For a comprehensive literature overview, we refer to Albrecher et al. (2017). Since the 2000s, Expected Shortfall has become of special interest. Chi and Tan (2013) identified layer reinsurance contracts as optimal for Expected Shortfall under general premium principles. Their results were extended to general distortion risk measures by Cui et al. (2013). Other generalizations concerned additional constraints, see e.g. Lo (2017), or multidimensional settings induced by a macroeconomic perspective, see Bäuerle and Glauner (2018). We are not aware of any dynamic generalizations in the literature.
Reinsurance treaties are typically written for one year, cf. Albrecher et al. (2017). Hence, it is appropriate to model such an extension in discrete time. The insurer's annual surplus has the dynamics where the bounded, non-negative random variable Z_{n+1} ∈ L^∞_{≥0} represents the insurer's premium income in the n-th period. The premium principle π_R : L^p_{≥0} → R of the reinsurer is assumed to be law-invariant, monotone, normalized and to have the Fatou property. Normalization means that π_R(0) = 0 and the Fatou property is lower semicontinuity w.r.t. dominated convergence.
The Markov Decision Model is given by the state space E = R, the action space A = F, either no constraint or a budget constraint D The insurance company's target is to minimize its solvency cost of capital for the total discounted loss where ρ_φ is a spectral risk measure with bounded spectrum φ, β ∈ (0, 1] and N ∈ N. As it is irrelevant for the minimization, we will in the sequel omit the cost of capital rate r_CoC and instead minimize the capital requirement. For β = 1 we have i.e. due to translation invariance of spectral risk measures the objective reduces to minimizing the capital requirement for the loss (negative surplus) at the planning horizon −X_N^π. This is reminiscent of the static reinsurance problem (9.1), however here the loss distribution at the planning horizon can be controlled by interim action. Throughout, we have required that the one-stage cost c(x, f, T(x, f, Y, Z)) = f(Y) + π_R(f) − Z is non-negative. As f(Y) and π_R(f) are non-negative for all f ∈ F and c(x, id_{R_+}, T(x, id_{R_+}, Y, Z)) = Y − Z due to normalization of π_R, the premium income Z would have to be non-positive. This makes no sense from an actuarial point of view, but since ρ_φ is translation invariant and Z ∈ L^∞ we can add ∑_{k=0}^{N−1} β^k ess sup(Z) without influencing the minimization. This means that the one-stage cost function is changed to ĉ(x, f, T(x, f, Y, Z)) = f(Y) + π_R(f) + ess sup(Z) − Z, which now depends on the deviation from the maximal possible income instead of the actual income. For brevity we write ẑ = ess sup(Z).
As in (3.3) we separate an inner and outer reinsurance problem. For a structural analysis we focus on the inner optimization problem inf with arbitrary g ∈ G, cf. Lemma 7.1. Note that for π = (f, f, . . .) with f = id_{R_+} we obtain ρ_φ(C_N^{πx}) < ∞. On the extended state space E = R × R_+ × (0, 1], the value of a policy π ∈ Π is defined as for n = 0, . . . , N and h_n ∈ H_n. The corresponding value functions are Due to the real state space we want to apply Corollary 6.4 for solving the optimization problem. Note that the one-stage cost ĉ is non-negative and the spectrum φ bounded by assumption. The following lemma shows that also the monotonicity, continuity and compactness assumptions of Corollary 6.4 are satisfied by the dynamic reinsurance model.

The fact that T is increasing in x is obvious.

d) Due to a), we only have to consider the budget-constrained case. Since F is compact it suffices to show that D(x) = {f ∈ F : π_R(f) ≤ x^+} is closed. This is the case since D(x) is a sublevel set of the lower semicontinuous function π_R : F → R_+, cf. Lemma A.1.3 in Bäuerle and Rieder (2011). Furthermore, we show that D is closed to obtain the upper semicontinuity from Lemma A.2.2 in Bäuerle and Rieder (2011).

e) The one-stage cost c(x, f, T(x, f, y, z)) = x − T(x, f, y, z) = f(y) + π_R(f) − z is lower semicontinuous in (x, f) as a sum of lower semicontinuous functions and decreasing in x since it does not depend on x.

Now, Corollary 6.4 yields that it is sufficient to minimize over all Markov policies and the value functions satisfy the Bellman equation J_N(x, s, t) = g(s), All structural properties of the optimal policy which do not depend on g are inherited by the optimal solution of the cost of capital minimization problem (9.2). The structural properties we will focus on in the rest of this section are induced by convexity. Therefore, we assume that the premium principle π_R is convex and that there is no budget constraint. Note that D(x) is non-convex even for convex π_R. Then, we have indeed a convex model: D is trivially convex, the transition function T(x, f, y, z) = x − f(y) − π_R(f) + z is concave in (x, f) as a sum of concave functions and the one-stage cost (x, f) → ĉ(x, f, T(x, f, y, z)) = f(y) + π_R(f) + ẑ − z is convex as a sum of convex functions. Now, Corollary 6.4 yields that the value functions J_n are convex. Under the widely-used expected premium principle, the optimization problem can be reduced to finite dimension.
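One step of the reinsurance model can be sketched in Python using the transition function T(x, f, y, z) = x − f(y) − π_R(f) + z and the shifted one-stage cost f(y) + π_R(f) + ẑ − z stated above. The restriction to stop-loss treaties, the expected premium principle, the discrete loss distribution and all numerical values are assumptions chosen for illustration only.

```python
import numpy as np

def stop_loss(y, a):
    """Retained loss f(y) = min(y, a) of a stop-loss treaty with retention a."""
    return np.minimum(y, a)

def expected_premium(y_values, probs, a, theta):
    """pi_R(f) = (1 + theta) * E[Y - f(Y)] for a discrete loss distribution."""
    ceded = y_values - stop_loss(y_values, a)
    return (1.0 + theta) * float(np.dot(probs, ceded))

def transition(x, y, z, a, premium):
    """T(x, f, y, z) = x - f(y) - pi_R(f) + z."""
    return x - float(stop_loss(y, a)) - premium + z

def shifted_cost(y, z, z_hat, a, premium):
    """Shifted one-stage cost f(y) + pi_R(f) + z_hat - z; non-negative whenever z <= z_hat."""
    return float(stop_loss(y, a)) + premium + z_hat - z

# Hypothetical numbers chosen for illustration only.
y_values = np.array([0.0, 10.0, 30.0])
probs = np.array([0.5, 0.3, 0.2])
premium = expected_premium(y_values, probs, a=10.0, theta=0.25)
next_surplus = transition(x=100.0, y=30.0, z=12.0, a=10.0, premium=premium)
cost = shifted_cost(y=30.0, z=12.0, z_hat=15.0, a=10.0, premium=premium)
print(premium, next_surplus, cost)
```

Here the retention a = 10 caps the retained loss at 10, so only the scenario y = 30 leads to a ceded amount and thus contributes to the premium.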
Example 9.2. Let π_R(·) = (1 + θ)E[·] be the expected premium principle with safety loading θ > 0 and assume there is no budget constraint. We will now show that the optimal reinsurance treaties (i.e. retained loss functions) can be chosen from the class of stop loss treaties f(x) = min{x, a}, a ∈ [0, ∞].
Due to the convexity of J_{n+1}, we can infer from the Bellman equation that reinsurance treaty f_1 is better than where ≤_cx denotes the convex order. Since The inequality holds since f ≤ id_{R_+}. Hence, we have S_{min{Y,a_f}}(y) ≥ S_{f(Y)}(y) for y < a_f and S_{min{Y,a_f}}(y) ≤ S_{f(Y)}(y) for y ≥ a_f. The cut criterion 1.5.17 in Müller and Stoyan (2002) implies min{Y, a_f} ≤_icx f(Y) and, due to the equality in expectation, (9.4) follows, cf. Theorem 1.5.3 in Müller and Stoyan (2002). So the inner optimization problem (9.3) is reduced to finding an optimal non-negative parameter of a stop loss treaty at every stage.
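The comparison between an arbitrary treaty and the stop loss treaty with the same expected retained loss can be checked numerically on a small example. The Python sketch below takes a quota-share treaty f(y) = y/2 and a loss Y uniform on {0, . . . , 20}, both hypothetical choices, finds the matching retention a_f by bisection and compares the stop-loss transforms E[(X − k)^+], which characterize the increasing convex order. The function names and the tolerance are assumptions made for illustration.

```python
import numpy as np

# Hypothetical discrete loss: Y uniform on {0, 1, ..., 20}.
y_values = np.arange(21, dtype=float)
probs = np.full(21, 1.0 / 21.0)

def f(y):
    """Quota-share treaty: f and id - f are both increasing."""
    return 0.5 * y

def mean(values):
    return float(np.dot(probs, values))

def matching_retention(target, n_iter=100):
    """Bisection for a with E[min(Y, a)] = target; a -> E[min(Y, a)] is increasing."""
    lo, hi = 0.0, float(y_values.max())
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mean(np.minimum(y_values, mid)) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

a_f = matching_retention(mean(f(y_values)))
stop_loss_retained = np.minimum(y_values, a_f)

def stop_loss_transform(values, k):
    """E[(X - k)^+] for the discrete distribution above."""
    return mean(np.maximum(values - k, 0.0))

thresholds = np.linspace(0.0, 10.0, 41)
dominated = all(
    stop_loss_transform(stop_loss_retained, k)
    <= stop_loss_transform(f(y_values), k) + 1e-9
    for k in thresholds
)
print(a_f, dominated)
```

A finite set of thresholds gives only a numerical indication of the ordering, not a proof; the proof is the cut criterion cited above.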
sets D_n(x) are compact and E ∋ x → D_n(x) are upper semicontinuous, i.e. if x_k → x and a_k ∈ D_n(x_k), k ∈ N, then (a_k) has an accumulation point in D_n(x).
(ii) The transition functions T_n are continuous in (x, a).
(iii) The one-stage cost functions c_n and the terminal cost function c_N are lower semicontinuous.

policy and a sequence π = (d_0, d_1, . . .) is called policy.
b) A decision rule at time n is called Markov if it depends on the current state only, i.e. d_n(h_n) = d_n(x_n) for all h_n ∈ H_n. If all decision rules are Markov, the (N-stage) policy is called Markov.
c) An (N-stage) policy π is called stationary if π = (d, . . . , d) or π = (d, d, . . .), respectively, for some Markov decision rule d.

Corollary 6.4. Change Assumption 6.1 (ii)-(iv) to
(ii') The sets D(x) are compact and R ∋ x → D(x) is upper semicontinuous and increasing.
(iii') T is upper semicontinuous in (x, a) and increasing in x.
Assumption 8.1. If N ∈ N, we require additionally to Assumption 3.1 that c is bounded from above by a constant c̄. If N = ∞, we also assume that β ∈ (0, 1). Consequently, it holds 0 ≤ C_N^{πx} ≤ ĉ for all N ∈ N ∪ {∞}, π ∈ Π and x ∈ E, where we define ĉ = ∑_{k=0}^{N} β^k c̄ for finite planning horizon N ∈ N and ĉ = c̄/(1 − β) for infinite planning horizon N = ∞.
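The bound ĉ on the total discounted cost is straightforward to compute. The following Python sketch follows the two cases of Assumption 8.1; the function name `total_cost_bound`, the argument name `c_bar` for the constant bounding the one-stage cost and the convention `horizon=None` for the infinite horizon are assumptions introduced here.

```python
def total_cost_bound(c_bar, beta, horizon=None):
    """Upper bound c_hat on the total discounted cost.

    horizon=None stands for the infinite horizon and requires beta in (0, 1).
    For a finite horizon N the sum runs over k = 0, ..., N.
    """
    if horizon is None:
        if not 0.0 < beta < 1.0:
            raise ValueError("infinite horizon requires beta in (0, 1)")
        return c_bar / (1.0 - beta)
    return c_bar * sum(beta**k for k in range(horizon + 1))

finite_bound = total_cost_bound(1.0, 0.5, horizon=2)
infinite_bound = total_cost_bound(1.0, 0.5)
undiscounted_bound = total_cost_bound(3.0, 1.0, horizon=3)
print(finite_bound, infinite_bound, undiscounted_bound)
```

The finite-horizon bound increases to the infinite-horizon one as N grows, which is consistent with treating both cases through the same interval [0, ĉ].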