1 Introduction

In the last decade, there have been various proposals to replace the expectation in the optimization of Markov Decision Processes (MDPs) by risk measures. The idea is to take the risk sensitivity of the decision maker into account. Using the plain expectation models a risk-neutral decision maker whose optimal policy can sometimes be very risky; for an example see Bäuerle and Ott (2011).

The literature can here be divided into two streams: papers which apply risk measures recursively and those which apply the risk measure to the total cost. The recursive approach for general MDPs can for example be found in Ruszczyński (2010); Chu and Zhang (2014); Bäuerle and Glauner (2021). The theory for this kind of model is rather different from the one where the risk measure is applied to the total cost, since the recursive approach directly yields a recursive solution procedure. In this paper, we contribute to the second model class, i.e. we assume that a cost process is generated over discrete time by a decision maker who aims at minimizing the risk measure applied to the cost over either a finite or an infinite time horizon. The class of risk measures we consider here are so-called spectral risk measures, which form a class of coherent risk measures including the Expected Shortfall or Conditional Value-at-Risk. More precisely, spectral risk measures are mixtures of Expected Shortfall at different levels.

For Expected Shortfall, the problem has already been treated e.g. in Bäuerle and Ott (2011), Chow et al. (2015), and Uğurlu (2017). Whereas in Chow et al. (2015) the authors use a decomposition result for the Expected Shortfall shown in Pflug and Pichler (2016), the authors of Bäuerle and Ott (2011) use the representation of Expected Shortfall as the solution of a global optimization problem over a real-valued parameter, see Rockafellar and Uryasev (2000). Interchanging the resulting two infima from the optimization problems yields a two-step method to solve the decision problem. Using the recent representation of spectral risk measures as an optimization problem over functions involving the convex conjugate in Pichler (2015), we follow a similar approach here. The problem can again be decomposed into an inner and an outer optimization problem. The inner problem is to minimize the expectation of a convex function of the total cost. It can be solved with MDP techniques after a suitable extension of the original state space. Note that already here our setting differs from the Expected Shortfall problem: in contrast to Bäuerle and Ott (2011), who assume bounded cost, or Uğurlu (2017), who assumes \(L^1\) cost, we only require the cost to be bounded from below; no further integrability assumption is necessary. Moreover, we allow for general Borel state and action spaces and give continuity and compactness conditions under which an optimal policy exists. The major challenge is the outer optimization problem, since we have to minimize over a function space and the dependence of the value function of the MDP on these functions is involved. However, we are again able to prove the existence of an optimal policy and an optimal function in the representation of the spectral risk measure. Moreover, by approximating the function space in the right way, we are able to reduce the outer optimization problem to a finite dimensional problem with a predetermined error bound. This yields an algorithm for the solution of the original optimization problem. Using an example from optimal reinsurance we show how our results can be applied.

Note that for Expected Shortfall the authors in Chow and Ghavamzadeh (2014) and Tamar et al. (2015) have developed gradient-based methods for the numerical computation of the optimal value and policy. For finite state and action spaces, Li et al. (2017) provide an algorithm for quantile minimization of MDPs, which is a similar problem. However, the outer optimization problem for spectral risk measures is much more demanding, since it is infinite dimensional.

The paper is organized as follows: In the next section, we summarize definitions and properties of risk measures and introduce in particular the class of spectral risk measures which we consider here. In Sect. 3, we introduce the Markov Decision Model and give continuity and compactness assumptions which will later guarantee the existence of optimal policies. At the end of this section, we formulate the spectral risk minimization problem of the total cost. We also give some interpretations and show relations to other problems. In Sect. 4, we present our main results for a finite planning horizon. The necessary state space extension is explained as well as the recursive solution algorithm for the inner optimization problem, and the existence of optimal policies is stated. Then, we treat the outer optimization problem and state the existence of an optimal function in the representation of the spectral risk measure. Afterwards, we deal with the numerical treatment of this problem and show that the infinite dimensional optimization problem can be approximated by a finite dimensional one. In Sect. 5, we extend our results to decision models with infinite planning horizon. Besides, if the state space is the real line, we show that the restrictive assumption of continuity of the transition function, which we need in the general model, can be replaced by semicontinuity if some further monotonicity assumptions are satisfied. In the final Sect. 6, we apply our findings to an optimal dynamic reinsurance problem. Problems of this type have been treated in a static setting before, see e.g. Chi and Tan (2013), Cui et al. (2013), Lo (2017) and Bäuerle and Glauner (2018), but we consider them in a dynamic framework for the first time. The aim is to minimize the solvency capital calculated with a spectral risk measure by actively choosing reinsurance contracts for the next period. When the premium for the reinsurance contract is calculated by the expected premium principle, we show that the optimal reinsurance contracts are of stop loss type. All proofs and detailed derivations of our results are deferred to the appendix.

2 Spectral risk measures

Let \((\Omega , \mathcal {A}, \mathbb {P})\) be a probability space and \(L^0=L^0(\Omega , \mathcal {A}, \mathbb {P})\) the vector space of real-valued random variables thereon. By \(L^1\) we denote the subspace of integrable random variables and by \(L^0_{\ge 0}\) the subset of non-negative random variables. We follow the convention of the actuarial literature that positive realizations of random variables represent losses and negative ones gains. Let \(\mathcal {X}\subseteq L^0\) be a convex cone. A risk measure is a functional \(\rho : \mathcal {X}\rightarrow \mathbb {R}\cup \{\infty \}\). The following properties are relevant in this paper.

Definition 2.1

A risk measure \(\rho : \mathcal {X}\rightarrow \mathbb {R}\cup \{\infty \}\) is called

a) law-invariant if \(\rho (X)=\rho (Y)\) for \(X,Y\) having the same distribution.

b) monotone if \(X\le Y\) implies \(\rho (X) \le \rho (Y)\).

c) translation invariant if \(\rho (X+m)=\rho (X)+m\) for all \(m \in \mathbb {R}\cap \mathcal {X}\).

d) positive homogeneous if \(\rho (\lambda X)=\lambda \rho (X)\) for all \(\lambda \in \mathbb {R}_+\).

e) comonotonic additive if \(\rho (X+Y) = \rho (X)+\rho (Y)\) for all comonotonic \(X,Y\).

f) subadditive if \(\rho (X+Y)\le \rho (X)+\rho (Y)\) for all \(X,Y\).

A risk measure is referred to as monetary if it is monotone and translation invariant. There appears to be consensus in the literature that these two properties are a minimal requirement for any risk measure. Monetary risk measures which are additionally positive homogeneous and subadditive are called coherent. Here, \(F_X(x)=\mathbb {P}(X\le x), \ x \in \mathbb {R}\), denotes the distribution function and \(F^{-1}_X(u)=\inf \{x \in \mathbb {R}: F_X(x)\ge u\}, \ u \in [0,1]\), the quantile function of a random variable X. We will focus on the following class of risk measures.

Definition 2.2

An increasing function \(\phi :[0,1] \rightarrow \mathbb {R}_+\) with \(\int _0^1 \phi (u) {\mathrm d}u=1\) is called spectrum and the functional \(\rho _\phi : L^0_{\ge 0} \rightarrow \mathbb {R}\cup \{\infty \}\) with

$$\begin{aligned} \rho _{\phi }(X)= \int _0^1 F^{-1}_X(u) \phi (u) {\mathrm d}u \end{aligned}$$

is referred to as spectral risk measure.

Spectral risk measures were introduced by Acerbi (2002). They have all the properties listed in Definition 2.1. Properties a)–e) follow directly from the respective properties of the quantile function. Verifying subadditivity is more involved, see Dhaene et al. (2000). As part of the proof, they showed that spectral risk measures preserve the increasing convex order. Spectral risk measures belong to the larger class of distortion risk measures.

Definition 2.3

An increasing right-continuous function \(\varphi :[0,1] \rightarrow [0,1]\) with \(\varphi (0)=0\) and \(\varphi (1)=1\) is called distortion function and the functional \(\rho _\varphi : L^0_{\ge 0} \rightarrow \mathbb {R}\cup \{\infty \}\) with

$$\begin{aligned} \rho _\varphi (X)= \int _0^1 F^{-1}_X(u){\mathrm d}\varphi (u) \end{aligned}$$

is referred to as distortion risk measure.

In the special case of a spectral risk measure, the distortion function is given by

$$\begin{aligned} \varphi (u)= \int _0^u \phi (s) {\mathrm d}s, \qquad u \in [0,1] \end{aligned}$$
(2.1)

and is convex. This also shows that it is no restriction to assume that \(\phi \) is right-continuous (as the right derivative of a convex function). Conversely, for a convex distortion function without a jump at 1, which implies continuity on [0, 1], one can always find a representation as in (2.1) with \(\phi \) being a spectrum. Consequently, all distortion risk measures with convex and continuous distortion function are spectral. It has been proven by Dhaene et al. (2000) that the convexity of \(\varphi \) is equivalent to \(\rho _\varphi \) being subadditive.

Note that \(\rho _\phi \) is finite on \(L^1_{\ge 0}\) if the spectrum \(\phi \) is bounded. On \(L^0_{\ge 0}\) the value \(+\infty \) is possible. Shapiro (2013) has shown that a finite risk measure on \(L^1_{\ge 0}\) with all the properties in Definition 2.1 is already spectral with bounded spectrum.

Example 2.4

The most widely used spectral risk measure is Expected Shortfall

$$\begin{aligned} \text {ES}_{\alpha }(X)= \frac{1}{1-\alpha }\int _{\alpha }^1 F^{-1}_X(u) {\mathrm d}u, \qquad \alpha \in [0,1). \end{aligned}$$

Its spectrum \(\phi (u)=\frac{1}{1-\alpha }\mathbb {1}_{[\alpha ,1]}(u)\) is bounded. Especially in optimization, an infimum representation of Expected Shortfall going back to Rockafellar and Uryasev (2000) is very useful:

$$\begin{aligned} \text {ES}_{\alpha }(X) = \inf _{q \in \mathbb {R}} \left\{ q + \frac{1}{1-\alpha } \mathbb {E}[(X-q)^+] \right\} , \qquad X \in L^0_{\ge 0}. \end{aligned}$$
(2.2)

The infimum is attained at \(q=F^{-1}_X(\alpha )\).
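For illustration, the two representations can be compared numerically. The following self-contained Python sketch (a hypothetical example with invented data, not part of the paper) estimates \(\text {ES}_{\alpha }\) of a simulated lognormal loss once via the tail average of the empirical quantiles and once via a grid minimization of the Rockafellar–Uryasev objective in (2.2):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.95
X = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)   # simulated losses

# ES via the quantile integral: average of the losses beyond the alpha-quantile
X_sorted = np.sort(X)
es_tail = X_sorted[int(alpha * len(X)):].mean()

# ES via the Rockafellar-Uryasev representation (2.2), minimized on a q-grid
qs = np.linspace(0.0, X_sorted[-1], 500)
obj = qs + np.array([np.maximum(X - q, 0.0).mean() for q in qs]) / (1 - alpha)
es_ru = obj.min()

print(es_tail, es_ru)                           # both approximate ES_alpha(X)
print(qs[obj.argmin()], np.quantile(X, alpha))  # minimizer ~ F_X^{-1}(alpha)
```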

Henceforth, we assume w.l.o.g. that \(\phi \) is right-continuous. Then \(\nu ([0,t]) :=\phi (t)\) defines a Borel measure on [0, 1]. Let us define a further measure \(\mu \) by \(\frac{d \mu }{d \nu }(\alpha ):=(1-\alpha )\). Every spectral risk measure can be expressed as a mixture of Expected Shortfall over different confidence levels, see e.g. Proposition 8.18 in McNeil et al. (2015).

Proposition 2.5

Let \(\rho _{\phi }\) be a spectral risk measure. Then \(\mu \) is a probability measure on [0, 1] and \(\rho _{\phi }\) has the representation

$$\begin{aligned} \rho _{\phi }(X) = \int _0^1 \text {ES}_{\alpha }(X) \mu ({\mathrm d}\alpha ), \qquad X\in L^0_{\ge 0}. \end{aligned}$$

Allowing in addition the supremum over all probability measures \(\mu \) on the r.h.s. would yield the superclass of coherent risk measures, see Kusuoka (2001).
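As a concrete numerical check of Proposition 2.5 (a hypothetical example with invented data): for the spectrum \(\phi (u)=2u\) we get \(\nu ({\mathrm d}u)=2\,{\mathrm d}u\) and hence \(\mu ({\mathrm d}\alpha )=2(1-\alpha )\,{\mathrm d}\alpha \), which is indeed a probability measure; the mixture of Expected Shortfall over \(\mu \) then reproduces the quantile integral defining \(\rho _\phi \).

```python
import numpy as np

rng = np.random.default_rng(1)
X = np.sort(rng.exponential(scale=1.0, size=100_000))  # simulated losses
n = len(X)
u = (np.arange(n) + 0.5) / n           # grid for the quantile integral

phi = lambda s: 2.0 * s                # spectrum phi(u) = 2u
rho_direct = np.mean(X * phi(u))       # approximates int_0^1 F^{-1}(u) phi(u) du

def es(a):                             # empirical Expected Shortfall at level a
    return X[int(a * n):].mean()

# mixture representation with mu(d alpha) = 2(1 - alpha) d alpha
alphas = np.linspace(0.0, 0.999, 2_000)
rho_mixture = np.trapz([2.0 * (1.0 - a) * es(a) for a in alphas], alphas)

print(rho_direct, rho_mixture)         # both approximate rho_phi(X) = 1.5 here
```

(For the exponential distribution with mean 1 the exact value is \(\int _0^1 (-\ln (1-u))\, 2u \,{\mathrm d}u = 3/2\).)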

Using Proposition 2.5, the infimum representation (2.2) of Expected Shortfall can be generalized to spectral risk measures.

Proposition 2.6

Let \(\rho _{\phi }\) be a spectral risk measure with bounded spectrum. We denote by G the set of increasing convex functions \(g:\mathbb {R}\rightarrow \mathbb {R}\). Then it holds for \(X \in L^0_{\ge 0}\)

$$\begin{aligned} \rho _{\phi }(X) = \inf _{g \in G} \left\{ \mathbb {E}[g(X)] + \int _0^1 g^*(\phi (u)) {\mathrm d}u\right\} , \end{aligned}$$

where \(g^*\) is the convex conjugate of \(g \in G\).

Proof

For \(X \in L^1_{\ge 0}\) the assertion has been proven by Pichler (2015). For non-integrable \(X \in L^0_{\ge 0}\) it follows from Proposition 2.5

$$\begin{aligned} \rho _{\phi }(X) = \int _0^1 \text {ES}_{\alpha }(X) \mu ({\mathrm d}\alpha ) \ge \text {ES}_{0}(X) =\mathbb {E}[X] =\infty . \end{aligned}$$

Now let \(g \in G\) and \(U_X \sim \mathcal {U}(0,1)\) be the generalized distributional transform of X, i.e. \(F^{-1}_X(U_X)=X\) a.s. By the definition of the convex conjugate it holds \(g(X) + g^*(\phi (U_X)) \ge X \phi (U_X)\). Hence, we have

$$\begin{aligned} \mathbb {E}[g(X)] + \mathbb {E}[g^*(\phi (U_X))] \ge \mathbb {E}[X \, \phi (U_X)] = \mathbb {E}[F^{-1}_X(U_X)\, \phi (U_X)] = \rho _{\phi }(X) = \infty . \end{aligned}$$

Since \(g \in G\) was arbitrary, the assertion follows.\(\square \)

Remark 2.7

The proof by Pichler (2015) shows that for \(X \in L^1_{\ge 0}\) the infimum is attained in \(g_{\phi ,X}: \mathbb {R}\rightarrow \mathbb {R}\), \( g_{\phi ,X}(x) = \int _0^1 \left[ F^{-1}_X(\alpha ) + \frac{1}{1-\alpha }\left( x- F^{-1}_X(\alpha ) \right) ^+ \right] \mu ({\mathrm d}\alpha )\) with \(\mu \) from Proposition 2.5, and that the derivative of this function is \(g_{\phi ,X}'(x) =\phi (F_X(x))\) a.e.
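For instance, in the Expected Shortfall case \(\phi (u)=\frac{1}{1-\alpha }\mathbb {1}_{[\alpha ,1]}(u)\) we have \(\mu =\delta _\alpha \), so the minimizer is \(g_{\phi ,X}(x) = q + \frac{1}{1-\alpha }(x-q)^+\) with \(q=F^{-1}_X(\alpha )\). A direct computation (a worked check added here for illustration) gives

$$\begin{aligned} g_{\phi ,X}^*(y) = q(y-1) \ \text { for } 0 \le y \le \tfrac{1}{1-\alpha }, \qquad \int _0^1 g_{\phi ,X}^*(\phi (u))\, {\mathrm d}u = \alpha (-q) + (1-\alpha )\, \frac{\alpha q}{1-\alpha } = 0, \end{aligned}$$

so that \(\mathbb {E}[g_{\phi ,X}(X)] + \int _0^1 g_{\phi ,X}^*(\phi (u))\,{\mathrm d}u = q + \frac{1}{1-\alpha }\mathbb {E}[(X-q)^+]\), i.e. Proposition 2.6 recovers (2.2) evaluated at its minimizer \(q=F^{-1}_X(\alpha )\).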

3 Markov decision model

We consider the following standard Markov Decision Process with general Borel state and action space. By Borel space we mean a Borel subset of a Polish space. The state space E is a Borel space with Borel \(\sigma \)-algebra \(\mathcal {B}(E)\) and the action space A is a Borel space with Borel \(\sigma \)-algebra \(\mathcal {B}(A)\). The possible state-action combinations at time n form a measurable subset \(D_n\) of \(E \times A\) such that \(D_n\) contains the graph of a measurable mapping \(E \rightarrow A\). The x-section of \(D_n\),

$$\begin{aligned} D_n(x) := \{ a \in A: (x,a) \in D_n \}, \end{aligned}$$

is the set of admissible actions in state \(x \in E\) at time n. Note that the sets \(D_n(x)\) are non-empty. We assume that the dynamics of the MDP are given by measurable transition functions \(T_n:D_n \times \mathcal {Z}\rightarrow E\) and depend on disturbances \(Z_1,Z_2,\dots \) which are independent random elements on a common probability space \((\Omega ,\mathcal {A},\mathbb {P})\) with values in a measurable space \((\mathcal {Z}, \mathfrak {Z})\). If the current state is \(x_n\), the controller chooses action \(a_n\in D_n(x_n)\) and \(z_{n+1}\) is the realization of \(Z_{n+1}\), then the next state is given by

$$\begin{aligned} x_{n+1} := T_n(x_n,a_n,z_{n+1}). \end{aligned}$$
(3.1)

The one-stage cost function \(c_n:D_n\times E \rightarrow \mathbb {R}_+\) gives the cost \(c_n(x,a,x')\) for choosing action a if the system is in state x at time n and the next state is \(x'\). The terminal cost function \(c_N: E \rightarrow \mathbb {R}_+\) gives the cost \(c_N(x)\) if the system terminates in state x. Note that instead of non-negative costs we can equivalently consider costs which are bounded from below.

The model data is supposed to have the following continuity and compactness properties.

Assumption 3.1

(i) The sets \(D_n(x)\) are compact and \(E \ni x \mapsto D_n(x)\) are upper semicontinuous, i.e. if \(x_k\rightarrow x\) and \(a_k\in D_n(x_k)\), \(k\in \mathbb {N}\), then \((a_k)\) has an accumulation point in \(D_n(x)\).

(ii) The transition functions \(T_n\) are continuous in \((x,a)\).

(iii) The one-stage cost functions \(c_n\) and the terminal cost function \(c_N\) are lower semicontinuous.

Under a finite planning horizon \(N \in \mathbb {N}\), we consider the model data for \(n=0,\dots ,N-1\). The decision model is called stationary if \(D\), \(T\), \(c\) do not depend on n and the disturbances are identically distributed. If the model is stationary and the terminal cost is zero, we allow for an infinite time horizon \(N=\infty \).

For \(n \in \mathbb {N}_0\) we denote by \(\mathcal {H}_n\) the set of feasible histories \( h_n\) of the decision process up to time n where

$$\begin{aligned} h_n := {\left\{ \begin{array}{ll} x_0, &{} \text {if } n=0,\\ (x_0,a_0,x_1, \dots , x_n), &{} \text {if } n \ge 1, \end{array}\right. } \end{aligned}$$

with \(a_k \in D_k(x_k)\) for \(k \in \mathbb {N}_0\). In order for the controller’s decisions to be implementable, they must be based on the information available at the time of decision making, i.e. be functions of the history of the decision process.

Definition 3.2

a) A measurable mapping \(f_n: {\mathcal {H}}_n \rightarrow A\) with \(f_n( h_n) \in D_n(x_n)\) for every \( h_n \in {\mathcal {H}}_n\) is called a decision rule at time n. A finite sequence \(\sigma =(f_0, \dots ,f_{N-1})\) is called an N-stage policy and a sequence \(\sigma =(f_0, f_1, \dots )\) is called a policy.

b) A decision rule at time n is called Markov if it depends on the current state only, i.e. \(f_n( h_n)=f_n(x_n)\) for all \(h_n \in {\mathcal {H}}_n\). If all decision rules are Markov, the (N-stage) policy is called Markov.

c) An (N-stage) policy \(\sigma \) is called stationary if \(\sigma =(f, \dots ,f)\) or \(\sigma =(f,f,\dots )\), respectively, for some Markov decision rule f.

With \(\Pi \) and \(\Pi ^M\) we denote the sets of all policies and Markov policies, respectively. It will be clear from the context if N-stage or infinite stage policies are meant. An admissible policy always exists as \(D_n\) contains the graph of a measurable mapping.

Since risk measures are defined as real-valued mappings of random variables, we will work with a functional representation of the decision process. The law of motion does not need to be specified explicitly. We define for an initial state \(x_0 \in E\) and a policy \(\sigma \in \Pi \)

$$\begin{aligned} X^\sigma _0=x_0, \qquad X^\sigma _{n+1}= T_n(X_n^\sigma ,f_n(H_n^\sigma ),Z_{n+1}). \end{aligned}$$

Here, the process \((H_n^\sigma )_{n \in \mathbb {N}_0}\) denotes the history of the decision process viewed as a random element, i.e.

$$\begin{aligned} H_0^\sigma =x_0, \quad H_1^\sigma =\big (X_0^\sigma ,f_0(X_0^\sigma ),X_1^\sigma \big ),\quad \dots , \quad H_{n}^\sigma =(H_{n-1}^\sigma ,f_{n-1}(H_{n-1}^\sigma ),X_{n}^\sigma ). \end{aligned}$$

Under a Markov policy, recourse to the random history of the decision process is not needed.

Even though the model is non-stationary, we explicitly introduce discounting by a factor \(\beta >0\), since for the following state space extension it is relevant whether there is discounting; otherwise, stationary models with discounting would have to be treated separately. For a finite planning horizon \(N \in \mathbb {N}\), the total discounted cost generated by a policy \(\sigma \in \Pi \), if the initial state is \(x \in E\), is given by

$$\begin{aligned} C_N^{\sigma x} := \sum _{k=0}^{N-1} \beta ^k c_k(X_k^\sigma ,f_k( H_k^\sigma ),X_{k+1}^\sigma ) + \beta ^N c_N(X_N^\sigma ). \end{aligned}$$

If the model is stationary and the planning horizon infinite, the total discounted cost is given by

$$\begin{aligned} C_\infty ^{\sigma x} := \sum _{k=0}^{\infty } \beta ^k c(X_k^\sigma ,f_k( H_k^\sigma ),X_{k+1}^\sigma ). \end{aligned}$$

For a generic total cost regardless of the planning horizon we write \(C^{\sigma x}\). Our aim is to find a policy \(\sigma \in \Pi \) which attains

$$\begin{aligned} \inf _{\sigma \in \Pi } \rho _\phi (C_N^{\sigma x}) \end{aligned}$$
(3.2)

or

$$\begin{aligned} \inf _{\sigma \in \Pi } \rho _\phi (C_\infty ^{\sigma x}), \end{aligned}$$
(3.3)

respectively, for a fixed spectral risk measure \(\rho _{\phi }: L^0_{\ge 0} \rightarrow \mathbb {R}\cup \{\infty \}\) with \(\phi (1)<\infty \), i.e. \(\phi \) is bounded. We can apply Proposition 2.6 to reformulate the optimization problems (3.2) and (3.3) to

$$\begin{aligned} \inf _{\sigma \in \Pi } \rho _\phi \left( C^{\sigma x} \right)&= \inf _{\sigma \in \Pi } \inf _{g \in G} \left\{ \mathbb {E}[g(C^{\sigma x})] + \int _0^1 g^*(\phi (u)) {\mathrm d}u\right\} \nonumber \\&= \inf _{g \in G}\inf _{\sigma \in \Pi } \left\{ \mathbb {E}[g(C^{\sigma x})] + \int _0^1 g^*(\phi (u)) {\mathrm d}u\right\} \nonumber \\&= \inf _{g \in G} \left\{ \inf _{\sigma \in \Pi } \mathbb {E}[g(C^{\sigma x})] + \int _0^1 g^*(\phi (u)) {\mathrm d}u\right\} . \end{aligned}$$
(3.4)

For fixed \(g \in G\) we will refer to

$$\begin{aligned} \inf _{\sigma \in \Pi } \mathbb {E}[g(C^{\sigma x})] \end{aligned}$$
(3.5)

as the inner optimization problem. In the following section we solve (3.5) as an ordinary MDP on an extended state space. If \(C^{\sigma x} \in L^0_{\ge 0}\) but \(C^{\sigma x} \notin L^1\), then \(\rho _\phi (C^{\sigma x})=\infty \). Such policies are not interesting and can be excluded from the optimization.

Since an increasing convex function \(g:\mathbb {R}\rightarrow \mathbb {R}\) can be viewed as a disutility function, optimality criterion (3.5) means that the expected disutility of the total discounted cost is minimized. If g is strictly increasing, the optimization problem is not changed by applying \(g^{-1}\), i.e. by minimizing the corresponding certainty equivalent \(g^{-1}\big (\mathbb {E}[g(C^{\sigma x})]\big )\). For bounded one-stage cost functions such problems are solved in Bäuerle and Rieder (2014). The special case of the exponential disutility function \(g(x) = \exp (\gamma x), \ \gamma >0,\) was first studied by Howard and Matheson (1972) in a decision model with finite state and action space; the term risk-sensitive MDP goes back to them. The certainty equivalent corresponding to an exponential disutility is the entropic risk measure

$$\begin{aligned} \rho (X)= \frac{1}{\gamma } \log \mathbb {E}\left[ e^{\gamma X} \right] . \end{aligned}$$
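For illustration (a standard computation, not part of the original text): if \(X \sim \mathcal {N}(m,\sigma ^2)\), the moment generating function yields

$$\begin{aligned} \rho (X)= \frac{1}{\gamma } \log \mathbb {E}\left[ e^{\gamma X} \right] = m + \frac{\gamma \sigma ^2}{2}, \end{aligned}$$

so the entropic risk measure adds a variance loading which grows linearly in the risk-sensitivity parameter \(\gamma \).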

It has been shown by Müller (2007) that an exponential disutility is the only case where the certainty equivalent defines a monetary risk measure apart from expectation itself (linear disutility).

The concepts of spectral risk measures and expected disutilities (or corresponding certainty equivalents) can be combined to so-called rank-dependent expected disutilities of the form \(\rho _{\phi }(u(X))\), where u is a disutility function. The corresponding certainty equivalent is \(u^{-1}\big (\rho _{\phi }(u(X))\big )\). In fact, this concept works more generally for distortion risk measures and incorporates both expected disutilities (identity as distortion function) and distortion risk measures (identity as disutility function). The idea is that the expected disutility is calculated w.r.t. a distorted probability instead of the original probability measure. As long as the distortion risk measure is spectral, using a rank-dependent expected disutility instead of \(\rho _{\phi }\) leads to structurally the same inner problem as (3.5), only with g replaced by \(g(u(\cdot ))\). Our results apply here, too. The certainty equivalent of a rank-dependent expected disutility combining an exponential disutility with a spectral risk measure is itself a convex (but not coherent) risk measure. It was introduced by Tsanakas and Desli (2003) as the distortion-exponential risk measure.

4 Main results: finite planning horizon

4.1 Inner problem

Under a finite planning horizon \(N \in \mathbb {N}\), we consider the non-stationary version of the decision model and our first aim is to solve

$$\begin{aligned} \inf _{\sigma \in \Pi } \mathbb {E}[g(C_N^{\sigma x})] \end{aligned}$$
(4.1)

for an arbitrary but fixed increasing convex function \(g \in G\). We assume that for all \(x\in E\) there is at least one policy \(\sigma \) s.t. \(C_N^{\sigma x}\in L^1\). Problem (4.1) is well-defined since the target function is bounded from below by g(0). W.l.o.g. we assume \(g\ge 0\). Note that the value \(+\infty \) is possible.

As the functions \(g \in G\) are in general non-linear, the optimization problem cannot be solved directly with dynamic programming techniques. This can be overcome by embedding the problem into an extended MDP following Bäuerle and Rieder (2014). The state space of this extended MDP is

$$\begin{aligned} \mathbf {E} := E \times \mathbb {R}_+ \times (0,\infty ) \end{aligned}$$

with corresponding Borel \(\sigma \)-algebra. A generic element of \(\mathbf {E}\) is denoted by \((x,s,t)\). The idea is that s summarizes the cost accumulated so far and that t keeps track of the discounting. The action space A and the admissible state-action combinations \(D_n\), \(n=0,\dots ,N-1,\) remain unchanged. Formally, one defines

$$\begin{aligned} \mathbf {D}_n := \{ (x,s,t,a) \in \mathbf {E} \times A: \ a \in D_n(x) \}, \qquad n=0,\dots ,N-1 \end{aligned}$$

implying \(\mathbf {D}_n(x,s,t) = D_n(x),\ (x,s,t) \in \mathbf {E}\). The transition function on the new state space is given by \(\mathbf {T}_n: \mathbf {D}_n \times \mathcal {Z}\rightarrow \mathbf {E}\),

$$\begin{aligned} \mathbf {T}_n(x,s,t,a,z) := \begin{pmatrix} T_n(x,a,z)\\ s+t c_n(x,a,T_n(x,a,z))\\ \beta t \end{pmatrix}, \qquad n=0,\dots ,N-1. \end{aligned}$$

Feasible histories of the decision model with extended state space up to time n have the form

$$\begin{aligned} \mathbf{h }_n := \left\{ \begin{array}{ll} (x_0,s_0,t_0), &{} n=0,\\ (x_0,s_0,t_0,a_0,x_1,s_1,t_1,a_1, \dots , x_n,s_n,t_n), &{} n \ge 1, \end{array}\right. \end{aligned}$$

where \(a_k \in D_k(x_k)\), \(k=0,\dots ,N-1\); the set of such histories is denoted by \(\varvec{\mathcal {H}}_n\). In particular, we have the same recursion (3.1) for the state process and when we start with \(s_0=0, t_0=1\) we obtain:

$$\begin{aligned} s_n = \sum _{k=0}^{n-1} \beta ^k c_k(x_k,a_k,x_{k+1}) \quad \text {and} \quad t_n=\beta ^n, \qquad n=1,\dots ,N. \end{aligned}$$
(4.2)
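The extended transition can be written down in a few lines of Python; the following sketch uses hypothetical names (T_n, c_n, beta stand in for the model data) and is only meant to make the bookkeeping in (4.2) explicit.

```python
def extended_step(x, s, t, a, z, T_n, c_n, beta):
    """One step of the extended MDP: next state, accumulated cost, discount."""
    x_next = T_n(x, a, z)                 # original state transition (3.1)
    s_next = s + t * c_n(x, a, x_next)    # accumulated discounted cost
    t_next = beta * t                     # current discount factor beta^n
    return x_next, s_next, t_next

# Starting from (x0, 0.0, 1.0) and iterating extended_step yields exactly
# s_n = sum_{k<n} beta^k c_k(x_k, a_k, x_{k+1}) and t_n = beta^n as in (4.2).
```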

By \( \varvec{\Pi }\) we denote the set of all history-dependent policies for the decision model with extended state space. Policies are denoted by \(\pi =(d_0,d_1,\ldots ,d_{N-1})\) with measurable decision rules satisfying \( d_n(\mathbf {h}_n)\in D_n(x_n)\). By \( \varvec{\Pi }^M\) we denote the set of all Markov policies where decision rules are given by \(d_n : \mathbf {E} \rightarrow A\) with \( d_n(x_n,s_n,t_n) \in D_n(x_n)\). For \(\pi =(d_0,\dots ,d_{N-1}) \in \varvec{\Pi }\) the process \((\mathbf {H}^\pi _n)\) denotes the history of the extended MDP viewed as a random element, i.e.

$$\begin{aligned} \mathbf {H}_0^\pi =(x_0,s_0,t_0), \qquad \mathbf {H}_{n}^\pi =\big (\mathbf {H}_{n-1}^\pi ,d_{n-1}(\mathbf {H}_{n-1}^\pi ),X_{n}^\pi ,\mathbf {s}_n^\pi ,\mathbf {t}_n^\pi \big ), \end{aligned}$$

where

$$(X_{n}^\pi ,\mathbf {s}_n^\pi ,\mathbf {t}_n^\pi )= \mathbf {T}_{n-1}\big (X_{n-1}^\pi ,\mathbf {s}_{n-1}^\pi ,\mathbf {t}_{n-1}^\pi ,d_{n-1}(\mathbf {H}_{n-1}^\pi ),Z_n\big ).$$

We will write \(\mathbb {E}_{n \mathbf {h}_n}\) for the conditional expectation given \(\mathbf {H}_n^\pi =\mathbf {h}_n\). The value of a policy \(\pi \in \varvec{\Pi }\) with \(\pi =(d_0,d_1,\ldots ,d_{N-1})\) at time \(n=0,\dots ,N\) is defined as

$$\begin{aligned} \begin{aligned} V_{N\pi }(\mathbf {h}_N)&:= g(s_N+ t_Nc_N(x_N)),\\ V_{n\pi }(\mathbf {h}_n)&:= \mathbb {E}_{n\mathbf {h}_n}\left[ g\left( s_n + t_n\left( \sum _{k=n}^{N-1} \beta ^{k-n} c_k(X_k^\pi ,d_k(\mathbf {H}_k^\pi ),X_{k+1}^\pi ) + \beta ^{N-n} c_N(X_N^\pi )\right) \right) \right] , \end{aligned} \end{aligned}$$
(4.3)

where \(\mathbf {h}_n \in \varvec{\mathcal {H}}_n\). The corresponding value functions are

$$\begin{aligned} V_n(\mathbf {h}_n) := \inf _{\pi \in \varvec{\Pi }} V_{n\pi }(\mathbf {h}_n), \qquad \mathbf {h}_n \in \varvec{\mathcal {H}}_n. \end{aligned}$$
(4.4)

Obviously, we have \(V_0(x,0,1)=\inf _{\sigma \in \Pi } \mathbb {E}[g(C_N^{\sigma x})]\), i.e. the quantity of interest in the end is \(V_0(x,0,1)\).

Remark 4.1

If there is no discounting or if the discounting is included in the non-stationary one-stage cost functions, the second summary variable t is obviously not needed. In the special case that \(\rho _{\phi }\) is the Expected Shortfall, one only has to consider the functions \(g_q(x)= (x-q)^+, \ q \in \mathbb {R}\), see (2.2). Due to their positive homogeneity in (xq), it suffices to extend the state space by only one real-valued summary variable even if there is discounting, cf. Bäuerle and Ott (2011).

4.2 Solution of the extended MDP

We show next how to solve (4.4). It turns out that optimal policies can be found among Markov policies. Hence, let us now consider Markov policies \(\pi \in \varvec{\Pi }^M\), i.e. \(\pi =(d_0,\ldots , d_{N-1})\) with \(d_n : \mathbf {E} \rightarrow A\) such that \(d_n(x,s,t)\in D_n(x)\). The function space

$$\begin{aligned} \mathbb {M}:= \big \{ v: \mathbf {E} \rightarrow \mathbb {R}_+\mid \ &v \text { is lower semicontinuous,}\\ &v(x,\cdot ,\cdot ) \text { is increasing for all } x \in E,\\ &v(x,s,t) \ge g(s) \text { for } (x,s,t) \in \mathbf {E} \big \} \end{aligned}$$

turns out to be the set of potential value functions under such policies. In order to simplify the notation, we introduce the usual operators on \(\mathbb {M}\). All \(v \in \mathbb {M}\) are non-negative. Thus, integrals are well-defined with values in \(\mathbb {R}_+\cup \{\infty \}\).

Definition 4.2

For \(v \in \mathbb {M}\) and a Markov decision rule \(d:\mathbf {E} \rightarrow A\) we define

$$\begin{aligned} L_n v(x,s,t,a)&:= \mathbb {E}\Big [ v\Big ( \mathbf {T}_n(x,s,t,a,Z_{n+1})\Big )\Big ]\\&=\mathbb {E}\Big [ v\Big (T_n(x,a,Z_{n+1}),\, s+tc_n(x,a,T_n(x,a,Z_{n+1})),\, \beta t\Big ) \Big ],&(x,s,t,a) \in \mathbf {D}_n,\\ \mathcal {T}_{nd} v(x,s,t)&:= L_n v(x,s,t,d(x,s,t)),&(x,s,t) \in \mathbf {E},\\ \mathcal {T}_n v(x,s,t)&:= \inf _{a \in D_n(x)} L_n v(x,s,t,a),&(x,s,t) \in \mathbf {E}. \end{aligned}$$
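These operators translate directly into a backward-induction scheme. The following Python sketch (all names are assumptions; the admissible sets are taken finite and expectations are replaced by Monte Carlo averages over disturbance draws) illustrates one application of \(\mathcal {T}_n\); iterating it backwards from the terminal condition implements the Bellman equation of Theorem 4.3 below.

```python
import numpy as np

def bellman_step(J_next, actions, T_n, c_n, beta, Z_samples):
    """Monte Carlo version of the operator T_n from Definition 4.2:
    returns a callable J_n with J_n(x,s,t) = min_a L_n J_next(x,s,t,a)."""
    def J_n(x, s, t):
        def L(a):                          # sample-average estimate of L_n
            vals = []
            for z in Z_samples:
                x1 = T_n(x, a, z)          # next state
                vals.append(J_next(x1, s + t * c_n(x, a, x1), beta * t))
            return np.mean(vals)
        return min(L(a) for a in actions(x))   # D_n(x) assumed finite here
    return J_n

# Backward induction: start from J_N(x,s,t) = g(s + t * c_N(x)), apply
# bellman_step for n = N-1, ..., 0 and read off V_0(x,0,1) = J_0(x, 0.0, 1.0).
# (For efficiency one would tabulate J_n on a grid instead of nesting callables.)
```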

The next result shows that \(V_n(\mathbf {h}_n)\) depends only on \((x_n,s_n,t_n)\), that \(V_n\) satisfies a Bellman equation and that an optimal policy exists and is Markov. All proofs are deferred to the appendix.

Theorem 4.3

Let Assumption 3.1 be satisfied.

a) The value functions \(V_n\) only depend on \((x_n,s_n,t_n)\), i.e. \(V_n(\mathbf {h}_n)=J_n(x_n,s_n,t_n)\) for all \(\mathbf {h}_n \in \varvec{\mathcal {H}}_n\) with \(J_n\in \mathbb {M}\), \(n=0, \dots , N\).

b) The \(J_n \) satisfy for \(n=0, \dots , N\) the Bellman equation

$$\begin{aligned} J_N(x,s,t)&= g(s+tc_N(x)),\\ J_n(x,s,t)&= \mathcal {T}_n J_{n+1}(x,s,t), \qquad (x,s,t) \in \mathbf {E}. \end{aligned}$$

c) There exist Markov decision rules \(d_n^*:\mathbf {E} \rightarrow A\) for \(n=0, \dots , N-1\) with \(\mathcal {T}_{nd_n^*} J_{n+1}=\mathcal {T}_{n} J_{n+1}\), and every sequence of such minimizers constitutes an optimal policy \(\pi ^*=(d_0^*,\dots ,d_{N-1}^*) \in \varvec{\Pi }^M\) for problem (4.4).

d) Given \(\pi ^*=(d_0^*,\dots ,d_{N-1}^*) \in \varvec{\Pi }^M\) as in part c), an optimal policy \(\sigma ^* =(f^*_0,\ldots ,f_{N-1}^*)\in \Pi \) for problem (4.1) is given by

$$\begin{aligned} f_0^*(x_0)&:= d_0^*(x_0,0,1),\\ f_n^*( h_n)&:= d_n^*(x_n,s_n,t_n),\qquad n=1,\ldots ,N-1, \end{aligned}$$

with \(s_n\) and \(t_n\) as in (4.2).

Remark 4.4

From Theorem 4.3 it follows that the sequence \(\{(x_n,s_n,t_n)\}_{n=0}^{N-1}\) with

$$\begin{aligned} (s_n,t_n) = \left( \sum _{k=0}^{n-1} \beta ^k c_k(x_k,a_k,x_{k+1}),\, \beta ^n \right) \end{aligned}$$

is a sufficient statistic of the decision model with the original state space in the sense of Hinderer (1970).

4.3 Outer problem: existence and numerical approximation

In this subsection, we study the existence of a solution to the outer optimization problem (3.4) under a finite planning horizon and its numerical approximation. We have assumed that for all \(x\in E\) there exists a policy \(\sigma \) such that \(C_N^{\sigma x}\in L^1\) and thus \(\rho _\phi (C_N^{\sigma x})=:\bar{\rho }<\infty \). Hence in what follows we can restrict to policies \(\sigma \) such that \(\rho _\phi (C_N^{\sigma x})\le \bar{\rho }\). In this case, we can further restrict the set G in the representation of Proposition 2.6.

Lemma 4.5

It is sufficient to consider functions \(g \in G\) in the representation of Proposition 2.6 which are \(\phi (1)\)-Lipschitz and satisfy

$$\begin{aligned} 0 \le g(x) \le {\bar{g}}(x) := \phi (1) x^+ +\bar{\rho }, \qquad x \in \mathbb {R}. \end{aligned}$$

The space of such functions is denoted by \(\mathcal {G}\).

In order to stress that the value function \(V_0(x,0,1)=J_0(x,0,1)\) in Theorem 4.3 depends on g we write \(J_0(g):= J_0(x,0,1)\) and suppress the dependence on the other variables. For initial state \(x \in E\) and finite planning horizon \(N \in \mathbb {N}\) the outer problem is given by

$$\begin{aligned} \inf _{g \in \mathcal {G}} J_0(g) + \int _0^1 g^*(\phi (u)) {\mathrm d}u \end{aligned}$$
(4.5)

We obtain now:

Theorem 4.6

Under Assumption 3.1 there exists a solution \(g\in \mathcal {G}\) for the outer optimization problem (4.5).

As we know now that a solution to the outer optimization problem (4.5) exists, we aim to determine the solution numerically. The idea is to approximate the functions \(g \in \mathcal {G}\) by piecewise linear ones and thereby obtain a finite dimensional optimization problem which can be solved with classical methods of global optimization. We are going to show that the minimal values converge when the approximation is continuously refined and give an error bound. Regarding the second summand of the objective function (4.5) our method coincides with the Fast Legendre-Fenchel Transform (FLT) algorithm studied for example by Corrias (1996).

For unbounded cost \(C_N^{\sigma x}\) the functions \(g \in \mathcal {G}\) would have to be approximated on the whole non-negative real line. This is numerically not feasible.

Assumption 4.7

In addition to Assumption 3.1, we require that the cost functions \(c_0,\dots ,c_N\) are bounded from above by a constant \({\bar{c}} \in \mathbb {R}_+\).

Consequently, it holds that \(0 \le C_N^{\sigma x} \le {\hat{c}}:= \sum _{k=0}^N \beta ^k {\bar{c}}\). The boundedness of the cost allows for a further reduction of the feasible set of the outer problem. On the reduced feasible set, the second summand of the objective function is guaranteed to be finite and easier to calculate. Recall that the convex conjugate of \(g \in \mathcal {G}\) is the \(\mathbb {R}\cup \{\infty \}\)-valued function defined by \( g^*(y) := \sup _{s \in \mathbb {R}} \{sy - g(s)\}, \ y \in \mathbb {R}. \)

Lemma 4.8

a) Under Assumption 4.7, a minimizer of the outer optimization problem (4.5) lies in

$$\begin{aligned} \widehat{\mathcal {G}}:= \left\{ g \in \mathcal {G}: \ g(s)= g(0) \text { for } s < 0 \text { and } g(s)= g({\hat{c}}) + \phi (1)(s-{\hat{c}}) \text { for } s > {\hat{c}} \right\} . \end{aligned}$$

b) For \(g \in \widehat{\mathcal {G}}\) and \(y \in [0,\phi (1)]\) it holds \(g^*(y) = \max _{s \in [0,{\hat{c}}]} \{sy-g(s)\} < \infty . \)

The fact that the supremum in the convex conjugate reduces to the maximum of a continuous function over a compact set opens the door for a numerical approximation with the FLT algorithm. By definition of \(\widehat{\mathcal {G}}\), it is sufficient to approximate the functions \(g \in \widehat{\mathcal {G}}\) on the interval \(I:=[0,{\hat{c}}]\). For the piecewise linear approximation we consider equidistant partitions \(0=s_1<s_2<\dots <s_m={\hat{c}}\), i.e. \(s_k=(k-1) \frac{{\hat{c}}}{m-1}, \ k=1,\dots ,m, \ m \ge 2\). Let us define the mapping

$$\begin{aligned} p_m(g)(s) := g(s_k) + \frac{g(s_{k+1})-g(s_k)}{s_{k+1}-s_k} (s-s_k), \qquad s \in [s_k, s_{k+1}], \ k=1,\dots ,m-1, \end{aligned}$$

which projects a function \(g \in \widehat{\mathcal {G}}\) to its piecewise linear approximation, and denote its image by \(\widehat{\mathcal {G}}_m:=\{p_m(g): \ g \in \widehat{\mathcal {G}} \}\). For the restriction of the outer optimization problem (4.5) to \(\widehat{\mathcal {G}}_m\) it is convenient to define for \(g \in \widehat{\mathcal {G}}\)

$$\begin{aligned} K_m(g) := J_0(p_m(g)) + \int _0^1 p_m(g)^*(\phi (u)) {\mathrm d}u \qquad \text {and} \qquad K(g) := J_0(g) + \int _0^1 g^*(\phi (u)) {\mathrm d}u. \end{aligned}$$

Proposition 4.9

It holds

$$\left| \inf _{g \in \widehat{\mathcal {G}}}K_m(g) -\inf _{g \in \widehat{\mathcal {G}}}K(g) \right| \le \sup _{g \in \widehat{\mathcal {G}}} |K_m(g)-K(g)| \le 2\phi (1) \frac{{\hat{c}}}{m-1}.$$

The proposition shows that the infimum of \(K_m\) converges to that of K. The error of restricting the outer problem (4.5) to \(\widehat{\mathcal {G}}_m\) is bounded by \(2\phi (1)\frac{{\hat{c}}}{m-1}\). The piecewise linear functions \(g \in \widehat{\mathcal {G}}_m\) are uniquely determined by their values at the kinks \(s_1,\dots ,s_m\). Hence, we can identify \(\widehat{\mathcal {G}}_m\) with the compact set

$$\begin{aligned} \Gamma _m := \left\{ (y_1,\dots , y_m) \in \mathbb {R}^m: \ y_1 \in I, \ 0 \le \frac{y_2-y_1}{s_2-s_1} \le \dots \le \frac{y_m-y_{m-1}}{s_m-s_{m-1}}\le \phi (1) \right\} . \end{aligned}$$

Note that due to translation invariance of \(\rho _\phi \) it holds under Assumption 4.7 for \(g \in \widehat{\mathcal {G}}\) that \(g(0)\le {\bar{g}}(0)=\bar{\rho }\le \rho _\phi ({\hat{c}})={\hat{c}}\). Thus, the outer problem (4.5) restricted to \(\widehat{\mathcal {G}}_m\) becomes finite dimensional:

$$\begin{aligned} \inf _{y \in \Gamma _m} J_0(g_y) + \int _0^1 g_y^*(\phi (u)) {\mathrm d}u, \end{aligned}$$
(4.6)

where \(g_y \in \widehat{\mathcal {G}}_m\) is the piecewise linear function induced by \(y \in \Gamma _m\), i.e.

$$g_y(s):= y_k + \frac{y_{k+1}-y_k}{s_{k+1}-s_k}(s-s_k), \qquad s \in [s_k,s_{k+1}],\ k=1,\dots ,m-1.$$

How to evaluate \(J_0(\cdot )\) at \(g_y, \ y \in \Gamma _m,\) has been discussed in Sect. 4.1. The next lemma simplifies the evaluation of the second summand of the objective function (4.6) to calculating the integrals \(\int _{u_k}^{u_{k+1}} \phi (u) {\mathrm d}u\), where \(u_0:=0\), \(u_k:= \phi ^{-1}\left( \frac{y_{k+1}-y_k}{s_{k+1}-s_k} \right) ,\ k=1,\dots ,m-1,\) and \(u_m:=1\).

Lemma 4.10

The convex conjugate \(g_y^*\) of \(g_y, \ y \in \Gamma _m,\) evaluated at \(\xi \in [0,\phi (1)]\) is given by

$$\begin{aligned} g_y^*(\xi ) = {\left\{ \begin{array}{ll} - y_1,&{} 0\le \xi< \frac{y_2-y_1}{s_2-s_1},\\ s_{k+1} \xi - y_{k+1}, &{} \frac{y_{k+1}-y_k}{s_{k+1}-s_k} \le \xi \le \frac{y_{k+2}-y_{k+1}}{s_{k+2}-s_{k+1}}, \ k=1,\dots , m-2\\ s_m \xi -y_m, &{} \frac{y_m-y_{m-1}}{s_{m}-s_{m-1}} < \xi \le \phi (1). \end{array}\right. } \end{aligned}$$
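A small Python sketch (hypothetical names, invented grid data) of the parametrization \(g_y\) and its conjugate: since \(g_y\) is piecewise linear, the maximum in Lemma 4.8 b) is attained at a kink, which is an equivalent way to evaluate the closed form of Lemma 4.10 and can be cross-checked by brute force.

```python
import numpy as np

def g_y(x, y, s):
    """Piecewise linear function induced by y on the grid s_1 < ... < s_m."""
    return np.interp(x, s, y)

def g_y_conj(xi, y, s):
    """g_y^*(xi) for xi in [0, phi(1)]: the sup is attained at a kink of g_y."""
    return np.max(xi * s - y)

# invented example: c_hat = 5, m = 6, increasing convex values y on the grid
s = np.linspace(0.0, 5.0, 6)
y = np.array([0.0, 0.2, 0.5, 1.0, 1.7, 2.6])   # slopes 0.2 < ... < 0.9
xs = np.linspace(0.0, 5.0, 10_001)             # fine grid for brute force
for xi in [0.0, 0.1, 0.3, 0.5, 0.9]:
    assert abs(g_y_conj(xi, y, s) - np.max(xi * xs - g_y(xs, y, s))) < 1e-9
```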

The results of this section can be used to set up an algorithm for optimization problem (3.2). If we want to solve the problem with error bound \(\epsilon \), we first have to set \(m:= \left\lceil \frac{2\phi (1){\hat{c}} }{\epsilon }\right\rceil +1\). Then choose \(y_0\in \Gamma _m\) and solve the inner problem with \(g_{y_0}\). Use a global optimization procedure, e.g. simulated annealing, to select the next iterate \(y_1\), and eventually determine the optimal value of (4.6). Note that (4.6) is in general not convex in y.

[Algorithm: solution of optimization problem (3.2) via the finite dimensional problem (4.6); a sketch is given below.]
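The following Python sketch (function names are placeholders; solve_inner stands for the MDP solution of Sect. 4.1, and plain random search over \(\Gamma _m\) stands in for the global optimization step, e.g. simulated annealing) assembles these pieces under the stated assumptions:

```python
import numpy as np

def solve_outer(solve_inner, phi, c_hat, eps, n_iter=500, seed=0):
    """Approximate the outer problem (4.6) with error bound eps (Prop. 4.9).
    phi is the (vectorized) spectrum; solve_inner(y, s) must return J_0(g_y)."""
    m = int(np.ceil(2.0 * phi(1.0) * c_hat / eps)) + 1
    s = np.linspace(0.0, c_hat, m)                  # kink grid s_1, ..., s_m
    rng = np.random.default_rng(seed)

    def sample_y():                                  # random element of Gamma_m
        slopes = np.sort(rng.uniform(0.0, phi(1.0), m - 1))  # increasing slopes
        y0 = rng.uniform(0.0, c_hat)                 # g(0) <= c_hat, see above
        return np.concatenate(([y0], y0 + np.cumsum(slopes * np.diff(s))))

    def conj_integral(y):                            # int_0^1 g_y^*(phi(u)) du
        us = np.linspace(0.0, 1.0, 2_001)
        g_star = np.max(phi(us)[:, None] * s[None, :] - y[None, :], axis=1)
        return np.trapz(g_star, us)

    best = np.inf
    for _ in range(n_iter):                          # crude global search
        y = sample_y()
        best = min(best, solve_inner(y, s) + conj_integral(y))
    return best
```

A serious implementation would replace the random search by a proper global optimizer since, as remarked above, (4.6) is not convex in y.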

It is worth noting that an optimal policy \(\sigma ^*=(f_0^*,\dots ,f_{N-1}^*) \in \Pi \) obtained with the algorithm is in general not time consistent. If one implements the policy \(\sigma ^*\) and considers optimization problem (3.2) again at a later point in time \(n \in \{ 1,\dots ,N-1\}\), one can disregard the cost \(\sum _{k=0}^{n-1} \beta ^k c_k(x_k,a_k,x_{k+1})\) which is already realized due to the translation invariance of \(\rho _\phi \) and faces the remaining optimization problem

$$\begin{aligned} \inf _{\sigma \in \Pi } \rho _\phi \left( \sum _{k=n}^{N-1} \beta ^k c_k(X_k^\sigma ,f_k( H_k^\sigma ),X_{k+1}^\sigma ) + \beta ^N c_N(X_N^\sigma ) \right) . \end{aligned}$$
(4.7)

But for (4.7) the remaining policy \((f_n^*,\dots ,f_{N-1}^*)\) will in general not be optimal. The reason is that the optimal function \(g^*\) of the outer optimization problem will change, cf. Remark 2.7. However, for a fixed \(g \in G\) the optimal solution of the inner optimization problem is time consistent due to the Bellman equation in Theorem 4.3. A more detailed discussion of time consistency for risk-sensitive MDPs can be found in Shapiro (2009). Time consistency can alternatively be defined as a property of the risk measure. How this is related to the more general policy-based viewpoint is discussed in Shapiro and Uğurlu (2016).

5 Extensions and further results

5.1 Infinite planning horizon

In this subsection, we consider the risk-sensitive total cost minimization (3.3) under an infinite planning horizon. This is reasonable if the terminal period is unknown or if one wants to approximate a model with a large but finite planning horizon. Solving the infinite horizon problem will turn out to be easier since it admits a stationary optimal policy.

We study the stationary version of the decision model with no terminal cost, i.e. \(D\), \(T\), \(c\) do not depend on n, \(c_N\equiv 0\), and the disturbances are identically distributed. Let Z be a representative of the disturbance distribution. Our first aim is to solve again the inner problem

$$\begin{aligned} \inf _{\sigma \in \Pi } \mathbb {E}[g(C_\infty ^{\sigma x})] \end{aligned}$$
(5.1)

for an arbitrary but fixed increasing convex function \(g \in G\). As in the previous section we assume w.l.o.g. that \(g\ge 0\) and that for all \(x\in E\) there exists a policy \(\sigma \) such that \(C_\infty ^{\sigma x}\in L^1\).

The remarks in Sect. 3 regarding connections to the minimization of (rank-dependent) expected disutilities and corresponding certainty equivalents apply in the infinite horizon case as well.

In order to obtain a solution by value iteration, the state space is extended to \(\mathbf {E} := E \times \mathbb {R}_+ \times (0,\infty )\) as in Sect. 4. The action space A and the admissible state-action combinations \(\mathbf {D}\) remain unchanged, i.e. \(\mathbf {D} := \{ (x,s,t,a) \in \mathbf {E} \times A: \ a \in D(x) \}\) and \(\mathbf {D}(x,s,t) := D(x),\ (x,s,t) \in \mathbf {E}\). The transition function on the new state space is given by \(\mathbf {T}: \mathbf {D} \times \mathcal {Z}\rightarrow \mathbf {E}\),

$$\begin{aligned}\mathbf {T}(x,s,t,a,z) := \begin{pmatrix} T(x,a,z)\\ s+t c(x,a,T(x,a,z))\\ \beta t \end{pmatrix}. \end{aligned}$$

Since the model with infinite planning horizon will be derived as a limit of the one with finite horizon, the consideration can be restricted to Markov policies \(\pi =(d_0,d_1,\dots ) \in {\varvec{\Pi }}^M\) due to Theorem 4.3.

The value of a policy \(\pi =(d_0,d_1,\dots ) \in \varvec{\Pi }^M\) under an infinite planning horizon is defined as

$$\begin{aligned} J_{\infty \pi }(x,s,t) := \mathbb {E}_{0x}\left[ g\left( s + t \sum _{k=0}^{\infty } \beta ^{k} c(X_k^\pi ,d_k(X_k^\pi ,\mathbf {s}_k^\pi ,\mathbf {t}^\pi _k),X_{k+1}^\pi ) \right) \right] , \quad (x,s,t) \in \mathbf {E}. \end{aligned}$$

Note that \(J_{\infty \pi }\) is well-defined since \(c\ge 0\). The infinite horizon value function is

$$\begin{aligned} J_{\infty }(x,s,t) := \inf _{\pi \in \varvec{\Pi }^M} J_{\infty \pi }(x,s,t), \qquad (x,s,t) \in \mathbf {E}. \end{aligned}$$
(5.2)

We obviously get that \(\inf _{\sigma \in \Pi } \mathbb {E}[g(C_\infty ^{\sigma x})]=J_\infty (x,0,1)\). The operators \(\mathcal {T}\) and \(\mathcal {T}_d\) which appear in the next theorem are defined as in Definition 4.2 for the stationary model data.
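Part a) of the following theorem suggests a computational approach: since \(J_\infty \) arises as the limit of the finite horizon value functions, it can be approximated by iterating the stationary Bellman operator, starting from the lower bound \((x,s,t) \mapsto g(s)\). A schematic Python sketch (hypothetical names, value function tabulated on a finite state grid):

```python
def value_iteration(T_op, J_init, max_iter=1_000, tol=1e-8):
    """Iterate the stationary Bellman operator T_op until approximate
    convergence; J is a dict mapping grid points (x, s, t) to values."""
    J = dict(J_init)                      # e.g. J_init[(x, s, t)] = g(s)
    for _ in range(max_iter):
        J_new = {state: T_op(J, state) for state in J}
        if max(abs(J_new[st] - J[st]) for st in J) < tol:
            break
        J = J_new
    return J_new
```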

Theorem 5.1

Let Assumption 3.1 be satisfied. Then it holds:

a) The infinite horizon value function \(J_\infty \) is the smallest fixed point of the Bellman operator \(\mathcal {T}\) in \(\mathbb {M}\).

b) There exists a Markov decision rule \(d^*\) such that \(\mathcal {T}_{d^*} J_\infty = \mathcal {T}J_\infty \) and each stationary policy \(\pi ^*=(d^*,d^*,\dots )\in \varvec{\Pi }^M\) induced by such a decision rule is optimal for optimization problem (5.2).

c) Given \(\pi ^*=(d^*,d^*,\dots )\in \varvec{\Pi }^M\) as in part b), an optimal policy \(\sigma ^* =(f^*_0,f^*_1,\ldots )\in \Pi \) for problem (5.1) is given by

$$\begin{aligned} f_0^*(x_0)&:= d^*(x_0,0,1),\\ f_n^*( h_n)&:= d^*(x_n,s_n,t_n),\qquad n\in \mathbb {N}, \end{aligned}$$

with \(s_n\) and \(t_n\) as in (4.2).

The solution of the outer optimization problem

$$\begin{aligned} \inf _{g \in \mathcal {G}} J_\infty (g) + \int _0^1 g^*(\phi (u)) {\mathrm d}u \end{aligned}$$
(5.3)

follows the same lines as in the case of a finite time horizon. Again we can restrict to policies \(\sigma \) such that \(\rho _\phi (C^{\sigma x}_\infty )\le \bar{\rho }\). Lemma 4.5, which reduces the outer optimization problem to \(\mathcal {G}\), holds in the infinite horizon case as well, as does Theorem 4.6, which states the existence of a solution to the outer problem.

The numerical approximation scheme for the infinite horizon works under the following assumption:

Assumption 5.2

In addition to Assumption 3.1 we require that c is bounded from above by a constant \({\bar{c}}\in \mathbb {R}_+\) and that \(\beta \in (0,1)\).

Hence, it holds that \(0 \le C_\infty ^{\sigma x} \le {\hat{c}}\) with \({\hat{c}}= \frac{{\bar{c}}}{1-\beta }\) and we obtain in the same way as Lemma 4.8:

Lemma 5.3

a) Under Assumption 5.2, a minimizer of the outer optimization problem (5.3) lies in

$$\begin{aligned} \widehat{\mathcal {G}}= \left\{ g \in \mathcal {G}: \ g(s)= g(0) \text { for } s < 0 \text { and } g(s)= g({\hat{c}}) + \phi (1)(s-{\hat{c}}) \text { for } s > {\hat{c}} \right\} . \end{aligned}$$

b) For \(g \in \widehat{\mathcal {G}}\) and \(y \in [0,\phi (1)]\) it holds \(g^*(y) = \max _{s \in [0,{\hat{c}}]} \{sy-g(s)\} < \infty . \)

The remaining part of the numerical algorithm works as in the case of finite time horizon.

5.2 Relaxed assumptions for monotone models

The model has been introduced in Sect. 3 with a general Borel space as state space. In order to solve the optimization problem with finite or infinite time horizon we assumed a continuous transition function despite having a semicontinuous model. This assumption on the transition function can be relaxed to semicontinuity if the state space is the real line and the transition and one-stage cost function have some form of monotonicity. For notational convenience, we consider the stationary model with no terminal cost under both finite and infinite horizon in this section. We replace Assumption 3.1 by

Assumption 5.4

(i) The state space is the real line \(E=\mathbb {R}\).

(ii) The sets D(x) are compact and \(\mathbb {R}\ni x \mapsto D(x)\) is upper semicontinuous and decreasing, i.e. \(D(x) \supseteq D(y)\) for \(x \le y\).

(iii) The transition function T is lower semicontinuous in \((x,a)\) and increasing in x.

(iv) The one-stage cost \(c(x,a,T(x,a,z))\) is lower semicontinuous in \((x,a)\) and increasing in x.

Requiring that the one-stage cost function c is lower semicontinuous in \((x,a,x')\) and increasing in \((x,x')\) is sufficient for Assumption 5.4 (iv) to hold due to part (iii) of the assumption.

How do the modified continuity assumptions affect the validity of the results in Sects. 4.1 and 5.1? The only two results that were proven using the continuity of the transition function T in \((x,a)\) and not only its measurability are Theorems 4.3 and 5.1. All other statements are unaffected.

Proposition 5.5

The assertions of Theorems 4.3 and 5.1 hold under Assumption 5.4, too. Moreover, the value functions \(J_n\) and \(J_\infty \) are increasing. The set of potential value functions can therefore be replaced by

$$\begin{aligned} \mathbb {M}= \big \{ v: \mathbf {E} \rightarrow \mathbb {R}\mid \ &v \text { is lower semicontinuous and increasing,}\\ &v(x,s,t) \ge g(s) \text { for } (x,s,t) \in \mathbf {E} \big \}. \end{aligned}$$

The monotonicity properties of Assumption 5.4 can be used to construct a convex model.

Lemma 5.6

Let Assumption 5.4 be satisfied, A be a subset of a real vector space, the admissible state-action combinations D be a convex set, the transition function T be convex in \((x,a)\) and the one-stage cost \(D \ni (x,a) \mapsto c(x,a,T(x,a,z))\) be a convex function for every \(z \in \mathcal {Z}\). Then, the value functions \(J_n(\cdot ,\cdot ,t)\) and \(J_\infty (\cdot ,\cdot ,t)\) are convex for every \(t> 0\).

If c is increasing in \(x'\), it is sufficient to require that c and T are convex in \((x,a)\). The monotonicity requirements in Assumption 5.4 are only one option. The following alternative is relevant in particular for the dynamic reinsurance model in Sect. 6. For a proof see Section 6.1.3 in Glauner (2020).

Corollary 5.7

Change Assumption 5.4 (ii)–(iv) to

  1. (ii’)

    The sets D(x) are compact and \(\mathbb {R}\ni x \mapsto D(x)\) is upper semicontinuous and increasing.

  2. (iii’)

    T is upper semicontinuous in (xa) and increasing in x.

  3. (iv’)

    c(xaT(xaz)) is lower semicontinuous in (xa) and decreasing in x.

Then, the assertions of Theorems 4.3 and 5.1 still hold with the value functions \(J_n\) and \(J_\infty \) being decreasing in x and increasing in (st).

If furthermore A is a subset of a real vector space, D a convex set, T concave in \((x,a)\) and \(D \ni (x,a) \mapsto c(x,a,T(x,a,z))\) convex for every \(z \in \mathcal {Z}\), then the value functions \(J_n(\cdot ,\cdot ,t)\) and \(J_\infty (\cdot ,\cdot ,t)\) are convex for every \(t >0\).

6 Dynamic optimal reinsurance

As an application, we present a dynamic extension in discrete time of the static optimal reinsurance problem

$$\begin{aligned} \min _{\ell \in \mathcal {L}} \quad r_{\text {CoC}} \cdot \rho \big (\ell (Y) + \pi _R(\ell )\big ). \end{aligned}$$
(6.1)

In this setting, the insurance company incurs an aggregate loss \(Y \in L^1_{\ge 0}\) at the end of a fixed period due to insurance claims. In order to reduce its risk, the insurer concludes a reinsurance contract \(\ell \) to transfer a part of its potential loss to a reinsurance company. The reinsurance contract \(\ell \) determines the loss \(\ell (Y(\omega ))\) retained by the insurance company in each scenario \(\omega \in \Omega \). For the risk transfer, the insurer has to compensate the reinsurer with a reinsurance premium \(\pi _R(\ell ):= \pi _R(Y-\ell (Y))\), where \(\pi _R:L^1_{\ge 0} \rightarrow \mathbb {R}\) is a premium principle with properties similar to a risk measure. Most widely used is the expected premium principle \(\pi _R(X)=(1+\theta )\mathbb {E}[X]\) with safety loading \(\theta >0\). In order to preclude moral hazard, it is standard in the actuarial literature to assume that both \(\ell \) and the ceded loss function \(\text {id}_{\mathbb {R}_+} -\ell \) are increasing. Hence, the set of admissible retained loss functions is

$$\begin{aligned} \mathcal {L}= \{\ell : \mathbb {R}_+ \rightarrow \mathbb {R}_+ \mid \ell (y) \le y \ \forall y \in \mathbb {R}_+, \ \ell \text { increasing}, \ \text {id}_{\mathbb {R}_+}-\ell \text { increasing} \}. \end{aligned}$$

The insurer’s target is to minimize its cost of solvency capital which is calculated as the cost of capital rate \(r_{\text {CoC}} \in (0,1]\) times the solvency capital requirement determined by applying the risk measure \(\rho \) to the insurer’s effective risk after reinsurance.

First research on the optimal reinsurance problem (6.1) dates back to the 1960s. Borch (1960) proved that a stop loss reinsurance contract minimizes the variance of the retained loss of the insurer given the premium is calculated with the expected value principle. A similar result has been derived in Arrow (1963) where the expected utility of terminal wealth of the insurer has been maximized. Since then a lot of generalizations of this problem have been considered. For a comprehensive literature overview, we refer to Albrecher et al. (2017). Since the 2000s, Expected Shortfall has become of special interest. Chi and Tan (2013) identified layer reinsurance contracts as optimal for Expected Shortfall under general premium principles. Their results were extended to general distortion risk measures by Cui et al. (2013). Other generalizations concerned additional constraints, see e.g. Lo (2017), or multidimensional settings induced by a macroeconomic perspective, see Bäuerle and Glauner (2018). We are not aware of any dynamic generalizations in the literature.

Reinsurance treaties are typically written for one year, cf. Albrecher et al. (2017). Hence, it is appropriate to model such an extension in discrete time. The insurer’s annual surplus has the dynamics

$$\begin{aligned} X_0=x, \qquad X_{n+1}= X_n+Z_{n+1}-\ell _n(Y_{n+1})-\pi _R(\ell _n), \end{aligned}$$

where the bounded, non-negative random variable \(Z_{n+1} \in L^\infty _{\ge 0}\) represents the insurer’s premium income from its customers in the n-th period. The premium principle \(\pi _R:L^p_{\ge 0} \rightarrow \mathbb {R}\) of the reinsurer is assumed to be law-invariant, monotone, normalized and to have the Fatou property. Normalization means that \(\pi _R(0)=0\) and the Fatou property is lower semicontinuity w.r.t. dominated convergence.

The Markov Decision Model is given by the state space \(E=\mathbb {R}\), the action space \(A=\mathcal {L}\), either no constraint or a budget constraint \(D(x) = \{\ell \in \mathcal {L}: \pi _R(\ell ) \le x^+ \}\), the independent disturbances \((Y_n,Z_n)_{n \in \mathbb {N}}\) with \(Y_n \in L^1_{\ge 0}\) and \(Z_n \in L^\infty _{\ge 0}\), the transition function \(T(x,\ell ,y,z) = x - \ell (y) - \pi _R(\ell ) + z\) and the one-stage cost function \(c(x,\ell ,x')= x-x'\). A reinsurance policy is a sequence \(\sigma =(f_0,\dots ,f_{N-1})\) of measurable decision rules \(f_n:\mathcal {H}_n \rightarrow \mathcal {L}\) selecting the reinsurance contract at each stage based on the available information. The insurance company's target is to minimize its solvency cost of capital for the total discounted loss

$$\begin{aligned} \inf _{\sigma \in \Pi } \quad r_{\text {CoC}} \cdot \rho _{\phi }\left( \sum _{k=0}^{N-1} \beta ^k \Big ( f_k(H_k^\sigma )(Y_{k+1}) + \pi _R(f_k(H_k^\sigma )) - Z_{k+1} \Big ) \right) , \end{aligned}$$
(6.2)

where \(\rho _\phi \) is a spectral risk measure with bounded spectrum \(\phi \), \(\beta \in (0,1]\) and \(N \in \mathbb {N}\). As it is irrelevant for the minimization, we will in the sequel omit the cost of capital rate \(r_{\text {CoC}}\) and instead minimize the capital requirement. For \(\beta = 1\) we have

$$\begin{aligned} \sum _{k=0}^{N-1} f_k(H^\sigma _k)(Y_{k+1}) + \pi _R(f_k(H^\sigma _k)) - Z_{k+1} = \sum _{k=0}^{N-1} X_k^\sigma - X_{k+1}^\sigma = x-X_N^\sigma , \end{aligned}$$

i.e. due to the translation invariance of spectral risk measures the objective reduces to minimizing the capital requirement for the loss (negative surplus) at the planning horizon, \(-X_N^\sigma \). This is reminiscent of the static reinsurance problem (6.1); however, here the loss distribution at the planning horizon can be controlled by interim action. Throughout, we have required that the one-stage cost \(c(x,\ell ,T(x,\ell ,Y,Z))= \ell (Y)+ \pi _R(\ell ) -Z\) is non-negative. As \(\ell (Y)\) and \( \pi _R(\ell )\) are non-negative for all \(\ell \in \mathcal {L}\) and \(c(x,\text {id}_{\mathbb {R}_+},T(x,\text {id}_{\mathbb {R}_+},Y,Z))=Y-Z\) due to the normalization of \(\pi _R\), the premium income Z would have to be non-positive. This makes no sense from an actuarial point of view, but since \(\rho _\phi \) is translation invariant and \(Z \in L^{\infty }\) we can add \(\sum _{k=0}^{N-1} \beta ^k \text {ess sup}(Z)\) without influencing the minimization. This means that the one-stage cost function is changed to \({\hat{c}}(x,\ell ,x')= x-x'+\text {ess sup}(Z)\). The economic interpretation is that the one-stage cost

$$\begin{aligned} {\hat{c}}(x,\ell ,T(x,\ell ,Y,Z)) = \ell (Y)+ \pi _R(\ell )+ \text {ess sup}(Z) -Z \end{aligned}$$

now depends on the deviation from the maximal possible income instead of the actual income. For brevity we write \({\hat{z}}= \text {ess sup}(Z)\).

As in (3.4) we separate an inner and outer reinsurance problem. For a structural analysis we can focus on the inner optimization problem

$$\begin{aligned} \inf _{\sigma \in \Pi } \mathbb {E}\left[ g\left( \sum _{k=0}^{N-1} \beta ^k \Big ( f_k(H_k^\sigma )(Y_{k+1}) + \pi _R(f_k(H_k^\sigma ))+ {\hat{z}} - Z_{k+1} \Big ) \right) \right] \end{aligned}$$
(6.3)

with arbitrary \(g \in \mathcal {G}\), cf. Lemma 4.5. Note that for \(\sigma =(f,\dots ,f)\) with the constant decision rule \(f \equiv \text {id}_{\mathbb {R}_+}\) representing full retention at all stages we obtain \(\rho _\phi (C_N^{\sigma x})<\infty \). On the extended state space \(\mathbf {E}= \mathbb {R}\times \mathbb {R}_+\times (0,1]\), the value of a policy \(\pi = (d_0,\dots ,d_{N-1}) \in \varvec{\Pi }\) is defined as

$$\begin{aligned} V_{N\pi }(\mathbf {h}_N)&= g(s_N),\\ V_{n\pi }(\mathbf {h}_n)&= \mathbb {E}_{n\mathbf {h}_n}\left[ g\left( s_n + t_n\sum _{k=n}^{N-1} \beta ^{k-n} \Big ( d_k(\mathbf {H}_k^\pi )(Y_{k+1}) + \pi _R(d_k(\mathbf {H}_k^\pi ))+ {\hat{z}} - Z_{k+1} \Big )\right) \right] , \end{aligned}$$

for \(n=0,\dots,N-1\). The corresponding value functions are \(V_n(\mathbf{h}_n) = \inf_{\pi \in \varvec{\Pi}} V_{n\pi}(\mathbf{h}_n)\).

Since the state space is real, we can apply Corollary 5.7 to solve the optimization problem. Note that the one-stage cost \({\hat{c}}\) is non-negative and the spectrum \(\phi\) is bounded by assumption. The following lemma shows that the monotonicity, continuity and compactness assumptions of Corollary 5.7 are also satisfied by the dynamic reinsurance model.

Lemma 6.1

a) The retained loss functions \(\ell \in \mathcal{L}\) are Lipschitz continuous with constant \(L\le 1\). Moreover, \(\mathcal{L}\) is a Borel space as a compact subset of the metric space \((C(\mathbb{R}_+),m)\) of continuous real-valued functions on \(\mathbb{R}_+\) with the metric of compact convergence.

b) The functional \(\pi_R:\mathcal{L}\rightarrow \mathbb{R}_+, \ \ell \mapsto \pi_R(\ell)\) is lower semicontinuous.

c) The transition function T is upper semicontinuous and increasing in x.

d) D(x) is a compact subset of \(\mathcal{L}\) for all \(x \in \mathbb{R}\) and the set-valued mapping \(\mathbb{R}\ni x \mapsto D(x)\) is upper semicontinuous and increasing.

e) The one-stage cost \(D \ni (x,\ell)\mapsto c(x,\ell,T(x,\ell,y,z))\) is lower semicontinuous and decreasing in x.

Now, Corollary 5.7 yields that it is sufficient to minimize over all Markov policies and the value functions satisfy the Bellman equation

$$\begin{aligned} J_N(x,s,t)&= g(s),\nonumber \\ J_n(x,s,t)&= \inf _{\ell \in D(x)} \mathbb {E}\Big [ J_{n+1}\Big (x - \ell (Y) - \pi _R(\ell ) + Z,\, s+t\big (\ell (Y)+\pi _R(\ell )+{\hat{z}}- Z\big ),\, \beta t\Big ) \Big ] \end{aligned}$$
(6.4)

for \((x,s,t) \in \mathbf{E}\) and \(n=0,\dots,N-1\). Moreover, there exists a Markov decision rule \(d_n^*: \mathbf{E} \rightarrow \mathcal{L}\) attaining the minimum in (6.4), and every sequence \(\pi=(d_0^*,\dots,d_{N-1}^*) \in {\varvec{\Pi}}^M\) of such minimizers is a solution to (6.3).
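To give an impression of how the Bellman equation (6.4) can be evaluated numerically, the following sketch performs backward induction on a grid for the accumulated-cost component s. Everything concrete in it is an illustrative assumption rather than part of the model: proportional contracts \(\ell_b(y)=by\), the expected premium principle, exponentially distributed claims, a deterministic income (so that \({\hat{z}}-Z=0\)) and Monte Carlo expectations. Without a budget constraint the value functions do not depend on x, and the t-component is deterministic, so both can be dropped from the state.

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta, theta = 2, 1.0, 0.1
Y = rng.exponential(1.0, size=5_000)        # Monte Carlo claim samples

def g(s):                                   # some fixed g in G, e.g. g(s) = s^+
    return np.maximum(s, 0.0)

bs = np.linspace(0.0, 1.0, 21)              # proportional contracts l_b(y) = b*y
prem = {b: (1 + theta) * b * Y.mean() for b in bs}   # pi_R(l_b), expected premium

s_max = N * (Y.max() + prem[bs[-1]])        # accumulated cost cannot exceed this
s_grid = np.linspace(0.0, s_max, 400)       # grid for the s-component
J = g(s_grid)                               # terminal value J_N(s) = g(s)

for n in reversed(range(N)):
    t_n = beta ** n                         # deterministic t-component
    # one Bellman step: minimize E[J_{n+1}(s + t_n * one-stage cost)] over b;
    # np.interp clamps values beyond the grid, acceptable for a sketch
    J = np.min(
        [np.interp(s_grid[:, None] + t_n * (b * Y + prem[b]),
                   s_grid, J).mean(axis=1)
         for b in bs],
        axis=0,
    )

print("J_0(0) =", np.interp(0.0, s_grid, J))
```

The same scheme extends to the budget-constrained case by carrying x along as a state component and restricting the contract set to D(x) in each minimization.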

All structural properties of the optimal policy which do not depend on g are inherited by the optimal solution of the cost of capital minimization problem (6.2). The structural properties we focus on in the rest of this section are induced by convexity. Therefore, we assume that the premium principle \(\pi_R\) is convex and that there is no budget constraint; note that with a budget constraint, D(x) would in general be non-convex even for convex \(\pi_R\). Under these conditions, we indeed have a convex model: D is trivially convex, the transition function \(T(x,\ell,y,z) = x - \ell(y) - \pi_R(\ell) + z\) is concave in \((x,\ell)\) as a sum of concave functions and the one-stage cost \((x,\ell) \mapsto {\hat{c}}(x,\ell,T(x,\ell,y,z)) = \ell(y)+\pi_R(\ell)+{\hat{z}}- z\) is convex as a sum of convex functions. Now, Corollary 5.7 yields that the value functions \(J_n\) are convex. Under the widely used expected premium principle, the optimization problem can moreover be reduced to a finite-dimensional one.

Example 6.2

Let \(\pi _R(\cdot ) = (1+\theta )\mathbb {E}[\cdot ]\) be the expected premium principle with safety loading \(\theta >0\) and assume there is no budget constraint. We will now show that the optimal reinsurance contracts (i.e. retained loss functions) can be chosen from the class of stop loss contracts

$$\begin{aligned} \ell (x) = \min \{x,a\}, \qquad a \in [0,\infty ]. \end{aligned}$$

Due to the convexity of \(J_{n+1}\), we can infer from the Bellman equation (6.4) that a reinsurance contract \(\ell_1\) is at least as good as \(\ell_2\) if

$$\begin{aligned} \ell _1(Y)+ \pi _R(\ell _1)\le _{cx} \ell _2(Y)+ \pi _R(\ell _2), \end{aligned}$$

where \(\le_{cx}\) denotes the convex order. Since \(Y_1 \le_{cx} Y_2\) implies \(\mathbb{E}[Y_1]=\mathbb{E}[Y_2]\), the premia under the expected premium principle then coincide, and adding the same constant to both sides preserves the convex order. Hence, it suffices to find an \(a_\ell \in [0,\infty]\) such that

$$\begin{aligned} \min \{Y,a_\ell \} \le _{cx} \ell (Y). \end{aligned}$$
(6.5)

The mapping \([0,\infty ] \ni a \mapsto \min \{Y(\omega ),a\}\) is continuous for all \(\omega \in \Omega \) and \(0 \le \min \{Y,a\} \le Y \in L^1\). Thus, it follows from dominated convergence that \([0,\infty ] \ni a \mapsto \mathbb {E}[\min \{Y,a\}]\) is continuous. Furthermore,

$$\begin{aligned} \mathbb {E}[\min \{Y,0\}] \le \mathbb {E}[\ell (Y)] \le \mathbb {E}[\min \{Y,\text {ess sup}(Y)\}]. \end{aligned}$$

Hence, by the intermediate value theorem there is an \(a_\ell \in [0,\infty ]\) such that \(\mathbb {E}[\ell (Y)] = \mathbb {E}[\min \{Y,a_\ell \}]\). Let us compare the survival functions:

$$\begin{aligned} S_{\min \{Y,a_\ell \}}(y)&=\mathbb {P}(\min \{Y,a_\ell \}>y)=\mathbb {P}(Y>y)\mathbb {1}\{a_\ell>y\},\\ S_{\ell (Y)}(y)&= \mathbb {P}(\ell (Y)>y)\le \mathbb {P}(Y>y). \end{aligned}$$

The inequality holds since \(\ell \le \text {id}_{\mathbb {R}_+}\). Hence, we have \(S_{\min \{Y,a_\ell \}}(y) \ge S_{\ell (Y)}(y)\) for \(y<a_\ell \) and \(S_{\min \{Y,a_\ell \}}(y) \le S_{\ell (Y)}(y)\) for \(y\ge a_\ell \). The cut criterion 1.5.17 in Müller and Stoyan (2002) implies \(\min \{Y,a_\ell \} \le _{icx} \ell (Y)\) and (6.5) follows due to the equality in expectation, cf. Theorem 1.5.3 in Müller and Stoyan (2002). So the inner optimization problem (6.3) is reduced to finding an optimal nonnegative retention level of a stop loss contract at every stage. With this reduction, the dynamic reinsurance problem becomes numerically solvable without requiring a parametric approximation of the retained loss functions.
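For a given contract \(\ell\), the matching level \(a_\ell\) from (6.5) can also be found numerically by root finding, since \(a \mapsto \mathbb{E}[\min\{Y,a\}]\) is continuous and increasing. A minimal sketch on Monte Carlo samples, with a hypothetical contract \(\ell(y)=0.6y\):

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(1)
Y = rng.exponential(1.0, size=100_000)      # illustrative claim samples

def ell(y):                                 # example contract with 0 <= ell <= id
    return 0.6 * y

target = ell(Y).mean()                      # E[ell(Y)]
# a -> E[min(Y, a)] increases continuously from 0 to E[Y] >= target,
# so a sign change is guaranteed on [0, max(Y)] and bisection applies.
a_ell = brentq(lambda a: np.minimum(Y, a).mean() - target, 0.0, Y.max())
print("a_ell =", a_ell)                     # E[min(Y, a_ell)] = E[ell(Y)]
```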

Let us apply the algorithm derived at the end of Sect. 4.3 and study the optimal retention levels in a two-stage setting without discounting and with deterministic income \(Z\equiv z\). Due to the translation invariance of \(\rho_\phi\) a deterministic income can be disregarded in the optimization. The stage-wise insurance claims Y are assumed to be exponentially distributed with parameter \(\lambda>0\), which is a classical choice in actuarial science. For numerical feasibility we truncate the distribution at the \(99.9\%\)-quantile. The safety loading of the expected premium principle is \(\theta=0.1\). As a concrete risk measure we consider Expected Shortfall \(\text{ES}_{0.99}\) at the typical level \(\alpha=0.99\). Due to (2.2) we have to consider \(\mathcal{G}=\{g_q: q \in \mathbb{R}\}\) with \(g_q(s)=\frac{(s-q)^+}{1-\alpha}\) and \(\int_0^1 g_q^*(\phi(u)) \,{\mathrm d}u =q\) for the outer optimization problem. We can infer recursively from the Bellman equation (6.4) that the value functions depend only on the accumulated cost s and not on the current capital x, since there is no budget constraint. The same therefore holds for the optimal retention level.
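The sketch below outlines this computation, mirroring the backward-induction scheme sketched after (6.4). The grid and sample sizes are illustrative, and the outer problem is written in the equivalent Rockafellar–Uryasev form \(\text{ES}_\alpha(C)=\inf_q \{q + \mathbb{E}[(C-q)^+]/(1-\alpha)\}\) with the two infima interchanged.

```python
import numpy as np

# Two-stage example: truncated exponential claims, stop loss contracts,
# expected premium principle; the deterministic income drops out by
# translation invariance. All numerical choices are illustrative.
rng = np.random.default_rng(2)
lam, theta, alpha, N = 1.0, 0.1, 0.99, 2
Y = rng.exponential(1 / lam, size=2_000)
Y = np.minimum(Y, np.quantile(Y, 0.999))    # truncate at the 99.9%-quantile

a_levels = np.linspace(0.0, Y.max(), 21)    # stop loss retention levels
prem = {a: (1 + theta) * np.minimum(Y, a).mean() for a in a_levels}

s_max = N * (Y.max() + prem[a_levels[-1]])
s_grid = np.linspace(0.0, s_max, 300)

def inner(q):
    """Inner problem (6.3) with g_q(s) = (s - q)^+ / (1 - alpha), beta = 1."""
    J = np.maximum(s_grid - q, 0.0) / (1 - alpha)
    for _ in range(N):
        # one-stage cost of retention level a is min(Y, a) + premium
        J = np.min(
            [np.interp(s_grid[:, None] + np.minimum(Y, a) + prem[a],
                       s_grid, J).mean(axis=1)
             for a in a_levels],
            axis=0,
        )
    return np.interp(0.0, s_grid, J)        # value when starting from s = 0

# Outer problem: search over candidate thresholds q.
qs = np.linspace(0.0, s_max, 40)
print("ES_0.99 of the optimally reinsured total cost =",
      min(q + inner(q) for q in qs))
```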

Table 1 Relative optimal retention level of the stop loss contract at time \(t=0\) for different parameters of the claim size distribution

Table 1 shows the optimal retention parameter at time \(t=0\) for different parameters of the truncated exponential distribution. For better comparability the parameter is shown relative to the maximal claim size \(\text{ess sup}(Y)\).

Fig. 1 Relative optimal retention level of the stop loss contract at time \(t=1\) as a function of the accumulated cost s for different parameters of the claim size distribution

Figure 1 shows the optimal decision rule at time \(t=1\), i.e. the optimal retention level as a function of the accumulated cost s for different parameters of the truncated exponential distribution, again relative to the maximal loss. Hence, the curves are given by \(s \mapsto \frac{a_1^*(s)}{\text{ess sup}(Y)}\). It is worth noting that the parameter \(\lambda\) of the truncated exponential distribution has a structurally different influence than at time \(t=0\). At time \(t=1\), a smaller \(\lambda\) (i.e. a higher expected insurance claim) leads to a more conservative reinsurance contract in the form of less risk retention, whereas at time \(t=0\) it increases the optimal retention level. By comparison, the influence of the safety loading \(\theta\) and the level \(\alpha\) of Expected Shortfall turned out to be negligible and is therefore not presented here.