On Risk-Sensitive Piecewise Deterministic Markov Decision Processes

We consider a piecewise deterministic Markov decision process, where the expected exponential utility of total (nonnegative) cost is to be minimized. The cost rate, transition rate and post-jump distributions are under control. The state space is Borel, and the transition and cost rates are locally integrable along the drift. Under natural conditions, we establish the optimality equation, justify the value iteration algorithm, and show the existence of a deterministic stationary optimal policy. Applied to special cases, the obtained results already significantly improve some existing results in the literature on finite horizon and infinite horizon discounted risk-sensitive continuous-time Markov decision processes.


Introduction
application, an open problem in insurance was recently solved in [1] in the framework of risk-sensitive DTMDP. There are notable differences between risk-sensitive and risk-neutral DTMDPs. For instance, in a finite model, i.e., when the state and action spaces are both finite, there is always a deterministic stationary optimal policy in a discounted risk-neutral DTMDP, but not always in a discounted risk-sensitive DTMDP, see [17].
One of the first works on risk-sensitive continuous-time Markov decision processes (CTMDPs) is [21], where only verification theorems were presented. Recently, there has been renewed interest in this topic; see e.g., [8,14,20,24,25,27]. A finite horizon total undiscounted risk-sensitive CTMDP was considered in [14,21,24], whose arguments can be summarized as follows. Firstly, the optimality equation is shown to admit a solution out of a small enough class. Secondly, by using the Feynman-Kac formula, this solution is shown to be the value function, and any Markov policy providing the minimizer in the optimality equation is optimal. The proofs of [14,24] reveal that the main technicalities lie in the first step, for which the state space was assumed to be denumerable. This assumption is important for the diagonalization argument used in [24], which extends [14] from a bounded transition rate to a possibly unbounded transition rate whose growth is bounded by a Lyapunov function. The latter requirement and the boundedness of the cost rate then validate the Feynman-Kac formula applied in the second step. Wei [24] mentioned that it was unclear how to extend his argument to an unbounded cost rate, see Sect. 7 therein. Following an argument similar to the one described above, a discounted risk-sensitive CTMDP was also considered in [14], although now the first step becomes, to quote the authors' words (see p. 658 therein), "surprisingly far more involved", for which the state space was further assumed to be finite, see Remark 3.6 therein. It is a corollary of the present paper that the restrictive conditions in [14,24] can be significantly weakened, see Sect. 3 below.
The present paper is concerned with a risk-sensitive piecewise deterministic Markov decision process (PDMDP), where the expected exponential utility of the total cost is to be minimized. The state space is a general Borel space, and the transition and the nonnegative cost rates only need to be locally integrable along the drift. A PDMDP is an extension of a CTMDP: between two consecutive jumps, the process now evolves according to a deterministic Markov process. For simplicity and to keep the conditions as weak as possible, we do not consider control on the drift. Although there is a vast literature on PDMDPs (see the well-known monographs [9,10] and the references therein), to the best of our knowledge, risk-sensitive PDMDPs have not been systematically studied before.
Our main contributions are the following. We establish the optimality equation satisfied by the value function, justify the value iteration algorithm and show the existence of a deterministic stationary optimal policy. As an application and corollary, finite horizon and infinite horizon discounted risk-sensitive CTMDPs are reformulated as total undiscounted risk-sensitive PDMDPs, and are thus treated in a unified way and under much weaker conditions than in [14,24]. This is possible because we follow a different argument. Namely, we directly show that the value function satisfies the optimality equation, by reducing the total undiscounted risk-sensitive PDMDP to a risk-sensitive DTMDP. This method, which does not refer to the Feynman-Kac formula, was originally developed by Yushkevich [26] for risk-neutral CTMDPs. Later, it was employed in [2,3,9,10,13,23] for studies of risk-neutral PDMDPs, and in [27] for risk-sensitive CTMDPs. In [8], restricted to stationary policies, the discounted risk-sensitive CTMDP with bounded transition rates was reduced to a DTMDP problem, using the uniformization technique. The induced DTMDP is less standard (with a random cost), and was not further investigated there.
The rest of the paper is organized as follows. In Sect. 2 we describe the concerned optimal control problem. In Sect. 3 we present the main results, the proofs of which are postponed to Sect. 4. We finish the paper with a conclusion in Sect. 5. Some relevant facts are collected in the Appendix for ease of reference.

Notations and Conventions
In what follows, B(X) is the Borel σ-algebra of the topological space X, I stands for the indicator function, and δ_{x}(·) is the Dirac measure concentrated on the singleton {x}, assumed to be measurable. A measure is σ-additive and [0, ∞]-valued. Below, unless stated otherwise, the term of measurability is always understood in the Borel sense. Throughout this paper, we adopt the conventions in (1). If a mapping f is defined on X, and {X_i} is a partition of X, then when f is piecewise defined as f(x) = g_i(x) for all x ∈ X_i, the notation $f(x) = \sum_i I\{x \in X_i\}\, g_i(x)$ is used.
Let S be a nonempty Borel state space, A be a nonempty Borel action space, and q stand for a signed kernel q(dy|x, a) on B(S) given (x, a) ∈ S × A such that
$$\tilde q(\Gamma_S|x, a) := q(\Gamma_S \setminus \{x\}|x, a) \ge 0 \tag{2}$$
for all Γ_S ∈ B(S). Throughout this article we assume that q(·|x, a) is conservative and stable, i.e.,
$$q(S|x, a) = 0, \qquad \bar q_x := \sup_{a \in A} q_x(a) < \infty, \tag{3}$$
where q_x(a) := −q({x}|x, a). The signed kernel q is often called the transition rate. Between two consecutive jumps, the state of the process evolves according to a measurable mapping φ from S × [0, ∞) to S, see (5) below. It is assumed that for each x ∈ S,
$$\phi(x, t + s) = \phi(\phi(x, t), s) \ \text{ for all } s, t \ge 0; \qquad \phi(x, 0) = x, \tag{4}$$
and t → φ(x, t) is continuous.
Finally let the cost rate c be a [0, ∞)-valued measurable function on S × A. For simplicity, we do not consider the case of different admissible action spaces at different states.

Condition 2.1 (a) For each bounded measurable function f on S and each x ∈ S,
The integrals in the above condition are well defined: the integrands are universally measurable in s ∈ [0, ∞); see Chapter 7 of [5].
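Let us take the sample space Ω by adjoining to the countable product space $S \times ((0, \infty) \times S)^\infty$ the sequences of the form $(x_0, \theta_1, \dots, \theta_n, x_n, \infty, x_\infty, \infty, x_\infty, \dots)$, where x_0, x_1, . . . , x_n belong to S, θ_1, . . . , θ_n belong to (0, ∞), and x_∞ ∉ S is the isolated point. We equip Ω with its Borel σ-algebra F.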
Let t_0(ω) := 0 =: θ_0, and for each n ≥ 0, and each element ω := Obviously, (t_n(ω)) are measurable mappings on (Ω, F). In what follows, we often omit the argument ω ∈ Ω from the presentation for simplicity. Also, we regard x_n and θ_{n+1} as the coordinate variables, and note that the pairs {t_n, x_n} form a marked point process with the internal history {F_t}_{t≥0}, i.e., the filtration generated by {t_n, x_n}; see Chapter 4 of [19] for more details. The marked point process {t_n, x_n} defines the stochastic process {ξ_t, t ≥ 0} on (Ω, F) of interest by where we accept 0 · x := 0 and 1 · x := x for each x ∈ S_∞, and below we denote
A (history-dependent) policy π is given by a sequence (π_n) such that, for each n = 0, 1, 2, . . . , π_n(da|x_0, θ_1, . . . , x_n, s) is a stochastic kernel on A, and for each ω = (x_0, θ_1, (6) where a_∞ ∉ A is some isolated point. A policy π is called Markov if, with slight abuse of notation, π(da|ω, s) = π^M(da|ξ_{s−}, s) for some stochastic kernel π^M. A Markov policy is further called deterministic if the stochastic kernels π^M(da|x, for some measurable mapping f from S to A. We shall identify such a deterministic stationary policy by the underlying measurable mapping f. The class of all policies is denoted by Π.
Under a fixed policy π = (π_n), for each initial distribution γ on (S, B(S)), by using the Ionescu-Tulcea theorem, one can build a probability measure P^π_γ on (Ω, F) such that P^π_γ(x_0 ∈ Γ) = γ(Γ) for each Γ ∈ B(S), and the conditional distribution of (θ_{n+1}, x_{n+1}) with the condition on x_0, θ_1, x_1, . . . , θ_n, x_n is given on {ω : ( 2 |φ(x_n, t), a)π_n(da|x_0, θ_1, . . . , θ_n, x_n, t)dt, A q_{φ(x_n,s)}(a)π_n(da|x_0, θ_1, . . . , θ_n, x_n, s)ds , and given on {ω : Below, when γ is a Dirac measure concentrated at x ∈ S, we use the notation P^π_x. Expectations with respect to P^π_γ and P^π_x are denoted as E^π_γ and E^π_x, respectively.
Roughly speaking, the uncontrolled version of the process evolves as follows: given the current state, the process evolves deterministically according to the mapping φ, up to the next jump, which takes place after a random time whose distribution is (nonstationary) exponential, and the dynamics continue in a similar manner. A detailed book treatment with many examples of this and more general types of processes, allowing deterministic jumps, can be found in [10].
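The informal description above can be illustrated by a small simulation. The following is only a minimal sketch under stated assumptions, not part of the model definition: the names `simulate_pdmp`, `drift`, `rate` and `jump` are hypothetical, the state is a real number, and the hazard integral along the drift is approximated by a left Riemann sum with step `dt`. The sojourn time θ is sampled by solving ∫_0^θ rate(φ(x_n, s)) ds = E with E exponentially distributed with mean one, which is one standard way to sample a nonstationary exponential time.

```python
import random


def simulate_pdmp(x0, drift, rate, jump, horizon, dt=1e-3, rng=None):
    """Simulate one path of an uncontrolled piecewise deterministic process.

    drift(x, t): state reached from x after flowing for time t (the mapping phi)
    rate(x):     jump intensity at state x (hypothetical name)
    jump(x, rng): samples the post-jump state, given the state x just before the jump
    Returns the list of (jump_time, post_jump_state) pairs up to `horizon`,
    starting with (0.0, x0).
    """
    rng = rng or random.Random(0)
    t_n, x_n = 0.0, x0
    path = [(t_n, x_n)]
    while True:
        # The sojourn time solves: integral_0^theta rate(phi(x_n, s)) ds = E, E ~ Exp(1).
        threshold = rng.expovariate(1.0)
        acc, s = 0.0, 0.0
        while acc < threshold and t_n + s < horizon:
            acc += rate(drift(x_n, s)) * dt  # left Riemann sum of the hazard
            s += dt
        if t_n + s >= horizon:
            return path  # no further jump before the horizon
        t_n += s
        x_n = jump(drift(x_n, s), rng)
        path.append((t_n, x_n))


# Illustrative model: the state grows linearly, the jump rate equals the state,
# and every jump resets the state to zero.
path = simulate_pdmp(0.0, lambda x, t: x + t, lambda x: x, lambda x, rng: 0.0, horizon=10.0)
```

The fixed step `dt` trades accuracy for simplicity; when the hazard integral has a closed-form inverse, the inner loop can be replaced by that inverse.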
For each x ∈ S and policy π = (π_n),
$$V(x, \pi) := E^\pi_x\left[\exp\left(\sum_{n=0}^{\infty} \int_0^{\theta_{n+1}} \int_A c(\phi(x_n, s), a)\, \pi_n(da|x_0, \theta_1, \dots, x_n, s)\, ds\right)\right]$$
defines the concerned performance measure of the policy π ∈ Π given the initial state x ∈ S. Here and below, we put c(x_∞, a) := 0 for each a ∈ A, and φ(x_∞, t) = x_∞ for each t ∈ [0, ∞). We are interested in the following optimal control problem for each x ∈ S:
$$\text{Minimize over } \pi \in \Pi: \ V(x, \pi). \tag{8}$$
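To make the performance measure concrete, consider the simplest possible case, chosen here purely for illustration: a single state with constant cost rate c, left at constant rate q > 0 for the cemetery x_∞, where the cost rate is zero. The total cost is then cθ with θ exponentially distributed with rate q, and the exponential utility equals q/(q − c) when c < q and is infinite otherwise. The sketch below compares this closed form with a Monte Carlo estimate; the function names are hypothetical.

```python
import math
import random


def exp_utility_closed_form(q, c):
    """E[exp(c * theta)] for theta ~ Exponential(rate q), with q > 0 and c >= 0."""
    if c == 0:
        return 1.0
    return q / (q - c) if c < q else math.inf


def exp_utility_monte_carlo(q, c, n_samples=20000, seed=0):
    """Monte Carlo estimate of the same quantity (meaningful when c < q)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        theta = rng.expovariate(q)    # sojourn time before the single jump
        total += math.exp(c * theta)  # exponential utility of the accumulated cost
    return total / n_samples


exact = exp_utility_closed_form(4.0, 1.0)
estimate = exp_utility_monte_carlo(4.0, 1.0)
```

The example also shows why finiteness of the value is a genuine restriction in the risk-sensitive setting: the expected cost c/q is always finite here, whereas the exponential utility is infinite as soon as c ≥ q.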
The objective of this paper is to show, under the imposed conditions, the existence of a deterministic stationary optimal policy, and to establish the corresponding optimality equation satisfied by the value function V * , together with its value iteration. Evidently, V * (x) ≥ 1 for each x ∈ S. Under the next condition, it will be seen that for each x ∈ S, V * (φ(x, s)) is absolutely continuous in s.
The above condition is mainly assumed for notational convenience. In fact, the main optimality results (such as the existence of a deterministic stationary optimal policy) obtained in this paper can be established without assuming Condition 2.3, at the cost of some additional notation. In a nutshell, one has to consider the sets Ŝ := {x ∈ S : V*(x) < ∞} and S \ Ŝ separately, and note that if x ∈ Ŝ, then φ(x, t) ∈ Ŝ for each t ∈ [0, ∞). The reasoning presented under Condition 2.3 can be followed in an obvious manner. We formulate the corresponding optimality results in Remarks 3.1 and 3.2 below.

Main Statements
We first present the main optimality results concerning problem (8) for the PDMDP model. Their proofs are postponed to the next section.
In particular, V*(φ(x, t)) is absolutely continuous in t for each x ∈ S. (b) There exists a deterministic stationary optimal policy f, which can be taken as any measurable mapping from S to A such that In particular, There exists a deterministic stationary optimal policy f, which can be taken as any measurable mapping from S to A such that
Next, we present the value iteration algorithm for the value function V*.
such that V^(n+1)(φ(x, t)) is absolutely continuous in t for each x ∈ S. (For each n ≥ 0, such a solution always exists.) Furthermore, {V^(n)} is a monotone nondecreasing sequence of measurable functions on S such that for each x ∈ S,
We can apply our theorems to the special case of a CTMDP, that is, φ(x, t) ≡ x for each x ∈ S. The following α-discounted risk-sensitive CTMDP problem was considered in [14]: Here α > 0 is a fixed constant. In fact, Ghosh and Saha [14] restricted themselves to Markov policies, bounded transition and cost rates, i.e., sup_{x∈S} q̄_x < ∞ and sup_{x∈S, a∈A} c(x, a) < ∞, and a finite state space S. These restrictions, e.g., the finiteness of S, were needed for their investigations, see e.g., [14, Remark 3.6]. Under the compactness-continuity condition (Condition 2.1), the authors of [14] showed that there exists an optimal Markov policy for the discounted risk-sensitive CTMDP, and established the optimality equation. By using the theorems presented earlier in this section, we can obtain these optimality results for problem (10) in a much more general setup: the state space S is Borel, there is no boundedness requirement on the transition rate with respect to the state x ∈ S, and the optimality is over the class of history-dependent policies. Furthermore, we let the CTMDP model be nonhomogeneous, i.e., the transition rate q(dy|t, x, a) is allowed to depend on the time t ∈ [0, ∞), satisfying the corresponding version of (3); the notation q̃ is kept as before, see (2), with the extra argument t in addition to x. Similarly, the nonnegative cost rate c is allowed to be a measurable function on [0, ∞) × S × A.
Then the value function, say L*, of the α-discounted risk-sensitive CTMDP problem (10) (with c(ξ_t, a) being replaced by c(t, ξ_t, a)) is given by There exists an optimal deterministic Markov policy f for the α-discounted risk-sensitive CTMDP problem (10) (with c(ξ_t, a) being replaced by c(t, ξ_t, a)). One can take f as any measurable mapping from for each u ∈ [0, ∞) and x ∈ S.
Proof We prove this by reformulating the nonhomogeneous version of the α-discounted risk-sensitive CTMDP problem (10) in the form of problem (8) for a PDMDP, which we introduce as follows. We use the "hat" notation to distinguish this model from the original (nonhomogeneous) CTMDP model.
• The action space is the same as in the CTMDP: for each (t, x) ∈ Ŝ and a ∈ Â.
• The drift is given by φ̂((t, x), s) := (t + s, x) for each x ∈ S and t, s ≥ 0. Clearly it satisfies the corresponding version of (4).
• The cost rate is given by
Now the marked point process {t̂_n, x̂_n} and the controlled process ξ̂_t in this PDMDP model are connected to those in the original (nonhomogeneous) CTMDP model, namely {t_n, x_n} and ξ_t, via t̂_n = t_n, x̂_n = (t_n, x_n), and ξ̂_t = (t, ξ_t). For example, under a fixed strategy π̂ and initial distribution γ̂ in this PDMDP model, the version of the first equation in (7) now reads on {ω : Clearly, Conditions 2.1, 2.2 and 2.3 are satisfied by this PDMDP model. It remains to apply Theorem 3.1.
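The time-augmentation used in this proof can be sketched in code. The drift below is the one stated above, φ̂((t, x), s) = (t + s, x). The form of the augmented cost rate, e^{−αt} c(t, x, a), is an assumption made for this illustration: it is the natural way to absorb the discount factor into the cost rate once the clock is part of the state. The names `augment`, `drift_hat`, `cost_hat` and `total_cost_without_jumps` are hypothetical.

```python
import math


def augment(c, alpha):
    """Build the drift and cost rate of the time-augmented model.

    c(t, x, a): cost rate of the nonhomogeneous CTMDP
    alpha:      discount factor (alpha > 0)
    """
    def drift_hat(state, s):
        t, x = state
        return (t + s, x)  # only the clock moves between jumps

    def cost_hat(state, a):
        t, x = state
        # Assumed form: the discount factor is folded into the cost rate.
        return math.exp(-alpha * t) * c(t, x, a)

    return drift_hat, cost_hat


def total_cost_without_jumps(drift_hat, cost_hat, state, a, horizon=50.0, h=0.01):
    """Midpoint-rule integral of the augmented cost rate along the drift."""
    n = int(horizon / h)
    return sum(cost_hat(drift_hat(state, (k + 0.5) * h), a) * h for k in range(n))


drift_hat, cost_hat = augment(lambda t, x, a: 2.0, alpha=1.0)
# With constant cost rate 2 and alpha = 1, the undiscounted total cost in the
# augmented model should be close to the discounted total cost 2 / alpha = 2.
total = total_cost_without_jumps(drift_hat, cost_hat, (0.0, "x"), None)
```

The design point is that discounting disappears as a separate feature: the augmented model is a total undiscounted problem, which is the setting of problem (8).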
The condition in the previous corollary is much weaker than in [14], and can be further weakened; one only needs the reformulated PDMDP to satisfy Conditions 2.1, 2.2 and 2.3. Moreover, the boundedness of the cost rate c was assumed in the previous corollary only to ensure that Condition 2.3 is satisfied. It can be relaxed if one formulates the previous corollary using the statements in Remarks 3.1 and 3.2.
One can also consider the risk-sensitive nonhomogeneous CTMDP problem on the finite horizon [0, T] with T > 0 being a fixed constant: where g is a [0, ∞)-valued measurable function; g(x) represents the terminal cost incurred when ξ_T = x ∈ S. Let us put g(x_∞) := 0. Here α is a fixed nonnegative finite constant. A simpler version of this problem was considered in [24] with α = 0 and a bounded cost rate, where additional restrictions were put on the growth of the transition rate. We can reformulate this problem into the PDMDP problem (8) just as above. The only difference is that now we put q_(t,x)(a) ≡ 0 for each x ∈ S and t ≥ T, and introduce the following cost rate for each x ∈ S, t ≥ 0 and a ∈ A:
topology. For each μ ∈ P(A),
Let R denote the set of (Borel) measurable mappings t ∈ (0, ∞) → ρ_t(da) ∈ P(A). Here, we do not distinguish two measurable mappings in t ∈ (0, ∞) which coincide almost everywhere with respect to the Lebesgue measure. Let us equip R with the Young topology, which is the weakest topology with respect to which the function ρ ∈ R → is lower semicontinuous for each x ∈ S.
Proof One can legitimately consider the following DTMDP (discrete-time Markov decision process): according to [9,Lemma 2.29], all the involved mappings are measurable.
• The transition kernel p on B(X) from X × A, cf. (7), is given for each ρ ∈ A by
• The cost function l is a [0, ∞]-valued measurable function on X × A × X given by l((θ, x), ρ, (τ, y))
The relevant facts and statements for the DTMDP are included in the Appendix. (dz|(θ, x), a) is continuous for each bounded measurable function f on X; for each (θ, x) ∈ X and (τ, y) ∈ X, a ∈ A → l((θ, x), ρ, (τ, y)) is lower semicontinuous, and A is a compact Borel space. Hence, Condition A.1 for the DTMDP model ((θ, x)) does not depend on θ ∈ (0, ∞) for each x ∈ S. We preserve the term of policy for the PDMDP and the term of strategy for the DTMDP.

One can show that under Conditions 2.1 and 2.2, for each
According to Proposition A.1, the function is the minimal [1, ∞]-valued measurable solution to the optimality equation
The policy π* in the proof of the previous lemma is actually optimal for problem (8). However, it is not necessarily deterministic or stationary. Also, the reduction of the risk-sensitive PDMDP problem (8) to a risk-sensitive problem for the DTMDP model {X, A, p, l}, as seen in the proof of the above theorem, will be used without special reference in what follows.
Proof Let 0 ≤ t_1 < t_2 < ∞ be arbitrarily fixed. We need to show Without loss of generality, we may assume Then all four terms in (11) are nonnegative and finite, and (11) is equivalent to which is verified as follows. Let δ > 0 be arbitrarily fixed. By Lemma 4.1, there exists some ν̃ ∈ R such that

Then routine calculations lead to
Since δ > 0 was arbitrarily fixed, it now follows that the term in the parentheses in (12) is nonnegative, and thus inequality (12) holds.
Proof Let x ∈ S be fixed, and let ρ* ∈ R be such that V*(x) = W(x, ρ*), see Lemma 4.1. Suppose t ∈ [0, ∞) is arbitrarily fixed. Consider ρ̃ ∈ R defined by ρ̃_s = ρ*_{t+s} for each s > 0. Then recall (4). On the other hand, by Lemma 4.2, , t)) .
The statement of this lemma is thus proved.
are absolutely continuous in τ and are finite for each τ ∈ [0, ∞). Since φ(x, 0) = x, see (4), where f is a measurable mapping from S to A such that for each x ∈ S; the existence of such a mapping follows from a well-known measurable selection theorem, cf. Proposition D.5 of [15]. ,v),ρ_v))dv is bounded and separated from zero in τ ∈ [0, t] for each ρ ∈ R; recall Condition 2.2. So which contradicts (14). Therefore, Then is absolutely continuous on [0, t]. After legitimately differentiating the above expression with respect to v, and applying Lemma 4.2, we see This and (14) imply Recall that t ∈ [0, ∞) was arbitrarily fixed. The first part of (a) is thus verified, and we postpone the justification of the second part of (a) until after the proof of part (b).
(b) We use the same notation as above. Note that ,s))))ds . (15) Indeed, if either , s))))ds is finite, then equality holds in the above inequality; and if both φ(x, s)))ds and ∫_0^∞ c(φ(x, s), f(φ(x, s))) ds are infinite, then the right-hand side of the inequality is zero according to (1).
In the proof of part (a), it was observed that are absolutely continuous in t and are thus finite for each t ∈ [0, ∞). As in the proof of part (a), calculations similar to those in (14) imply that for each t ∈ [0, ∞), where the last equality is by what was established in part (a). Therefore, for each t ∈ [0, ∞), where the inequality holds because V*(x) ≥ 1 for each x ∈ S. Taking the limit as t → ∞ on both sides of the previous equality yields: ,s))))ds with the inequality following from (15). Hence Here it is clear that s ∈ [0, ∞) → f(φ(x, s)) can be identified as an element of R, say f̃^x, whereas x ∈ S → f̃^x ∈ R is measurable. This measurable mapping x ∈ S → f̃^x ∈ R defines a deterministic stationary optimal strategy for the risk-sensitive DTMDP problem (20) by Proposition A.1. It is clear that the measurable mapping x ∈ S → f(x) ∈ A defines an optimal deterministic stationary policy for the PDMDP problem (8).
Finally, we show the remaining part of (a). Let H* be a measurable [1, ∞)-valued function on S such that There exists a measurable mapping h from S to A such that cf. Proposition D.5 of [15]. It follows that ∫_0^s ∫_S H*(y) q(dy|φ(x, τ), h(φ(x, τ))) dτ is absolutely continuous in s ∈ [0, t] for each t ≥ 0. As in the proof of part (b), h(φ(x,v))))dv S H*(y)q(dy|φ(x, s), h(φ(x, s)))ds and by passing to the lower limit as t → ∞, (dy|φ(x, s), h(φ(x, s)))ds It remains to refer to Proposition A.1 for that
Proof of Theorem 3.2 Let V*_0(x) := 1 for each x ∈ S. For each n ≥ 0, one can legitimately define Let n ≥ 0 be fixed. As in Lemma 4.3, for each x ∈ S, there is some ρ* ∈ R such that Also, the relevant version of Lemma 4.2 holds: for each x ∈ S and ρ ∈ R, is monotone nondecreasing in t ∈ [0, ∞). Clearly, V*_{n+1}(φ(x, t)) is absolutely continuous in t ∈ [0, ∞) for each x ∈ S.
Corresponding to (14), we now have where τ ∈ [0, t] → U*_{n+1}(x, τ) is integrable and coincides with ,t)) ∂t almost everywhere, and f is some measurable mapping from S to A, whose existence is guaranteed by [15, Proposition D.5]. Continuing from the above relation, the reasoning in the proof of the first assertion in part (a) of Theorem 3.1 can be followed: eventually we see (φ(x, τ), a))V(φ(x, τ)) dτ, t ∈ [0, ∞), x ∈ S, (18) is satisfied by V = V*_{n+1}. Recall that V*_0 = V^(0). Suppose the recursive definition in (9) is valid up to step n, and V*_n(x) = V^(n)(x) for each x ∈ S. Consider an arbitrarily fixed [1, ∞)-valued measurable solution V to (18), and let f* be a measurable mapping from S to A such that One can follow the reasoning in the last part of the proof of Theorem 3.1, and see, cf. (16), φ(x,v)))−c(φ(x,v), f*(φ(x,v))))dv where the last equality is by (17). Thus, V*_{n+1} is the minimal [1, ∞)-valued measurable solution to (18), and coincides with V^(n+1). Therefore, by induction, V*_n = V^(n) for each n ≥ 0. It now follows that V^(n)(x) ↑ V*(x) as n ↑ ∞ for each x ∈ S.
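The monotone convergence V^(n) ↑ V* can be illustrated numerically in a much simpler setting than the one treated here. The following minimal sketch makes several assumptions that are not part of the paper: the state and action spaces are finite, there is no drift (φ(x, t) ≡ x, i.e., a CTMDP), and the action is held constant between jumps. Under these assumptions, one step of the iteration uses the fact that, for a sojourn time θ with exponential rate q, E[exp(cθ)] = q/(q − c) if c < q and is infinite otherwise. The names `value_iteration`, `q` and `c` as data structures are hypothetical.

```python
import math


def value_iteration(states, actions, q, c, n_iter=100):
    """Risk-sensitive value iteration for a finite CTMDP (illustrative sketch).

    q[x][a]: dict mapping y != x to the rate of jumping from x to y under action a
    c[x][a]: nonnegative cost rate at x under action a
    Starts from V_0 = 1 and applies V_{n+1}(x) = min_a E[exp(c * theta) * V_n(next state)].
    """
    V = {x: 1.0 for x in states}
    for _ in range(n_iter):
        V_next = {}
        for x in states:
            best = math.inf
            for a in actions:
                total_rate = sum(q[x][a].values())
                cost = c[x][a]
                if total_rate == 0 and cost == 0:
                    cand = 1.0  # stays forever at zero cost: exp(0) = 1
                elif total_rate > cost:
                    weighted = sum(r * V[y] for y, r in q[x][a].items() if r > 0)
                    cand = weighted / (total_rate - cost)
                else:
                    cand = math.inf  # exponential moment of the sojourn cost is infinite
                best = min(best, cand)
            V_next[x] = best
        V = V_next
    return V


# Three states; state 2 is absorbing with zero cost.
states, actions = [0, 1, 2], ["a", "b"]
q = {
    0: {"a": {1: 3.0}, "b": {2: 1.0}},
    1: {"a": {2: 2.0}, "b": {2: 4.0}},
    2: {"a": {}, "b": {}},
}
c = {
    0: {"a": 1.0, "b": 0.6},
    1: {"a": 1.0, "b": 3.0},
    2: {"a": 0.0, "b": 0.0},
}
V = value_iteration(states, actions, q, c)
```

In this toy model the iteration stabilizes after two steps: V(2) = 1, V(1) = min{2/(2 − 1), 4/(4 − 3)} = 2, and V(0) = min{3·2/(3 − 1), 1/(1 − 0.6)} = 2.5.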

Conclusion
In this paper, we considered a total undiscounted risk-sensitive PDMDP in Borel state and action spaces with a nonnegative cost rate. The transition and cost rates are assumed to be locally integrable along the drift. Under quite natural conditions, we showed that the value function is a solution to the optimality equation, justified the value iteration algorithm, and showed the existence of a deterministic stationary optimal policy. As a corollary, the obtained results were applied to significantly improve known results for finite horizon undiscounted and infinite horizon discounted risk-sensitive CTMDPs in the literature.
Then (V^(n)(x)) increases to V*(x) for each x ∈ X, where V* is the value function for problem (19). Furthermore, there exists a deterministic stationary strategy ϕ satisfying (21), and so in particular, there exists a deterministic stationary optimal strategy for the risk-sensitive DTMDP problem (19).