Markov Decision Processes with Risk-Sensitive Criteria: An Overview

The paper provides an overview of the theory and applications of risk-sensitive Markov decision processes. The term 'risk-sensitive' refers here to the use of the Optimized Certainty Equivalent as a means to measure expectation and risk. This comprises the well-known entropic risk measure and Conditional Value-at-Risk. We restrict our considerations to stationary problems with an infinite time horizon. Conditions are given under which optimal policies exist and solution procedures are explained. We present both the theory when the Optimized Certainty Equivalent is applied recursively as well as the case where it is applied to the cumulated reward. Discounted as well as non-discounted models are reviewed.


Introduction
The theory of Markov decision processes (MDPs) deals with stochastic, dynamic optimization problems. In the classical situation, the aim is to maximize an expected cumulated or averaged reward of a system. Since the first formulations by Richard Bellman in the 1950s, the theory has developed tremendously. In particular, one branch of literature is devoted to extending this theory beyond the simple expectation, since there is evidence from various fields that the expectation should be replaced by some criterion which allows one to model risk-sensitivity of the decision maker. This evidence comes from disciplines like psychology, economics and biology. For instance, Braun et al (2011) reviewed evidence for risk-sensitivity in motor control tasks.
From a mathematical point of view, the decision problem of course gets more complicated when risk-sensitivity is taken into account. Loosely speaking, risk-sensitivity weights the possible fluctuations around the mean. A simple way to deal with this is to consider a weighted criterion of the expectation and the variance of a random income, i.e. to include the second moment into the decision. This has for example been propagated in Markowitz (1952). Naturally, one can generalize this idea to higher moments. One of the ways is to use an exponential function, which plays a prominent role in risk-sensitive MDPs. Then, all moments of a random payoff are taken into account if we consider the expectation of an exponential function of this random payoff. This fact can be seen via the Taylor series expansion of the exponential function around 0. To be more precise, let us consider for example the following expression

-(1/γ) log E_x^π [ exp( -γ Σ_{k=0}^∞ β^k r(X_k, A_k) ) ],

where (X_k, A_k)_k is a controlled state-action process, r is a one-stage reward function, β a discount factor, γ ≠ 0 is a risk-sensitivity parameter and the transition law is determined by a policy π. The initial state is X_0 = x. A target function like this has first been studied in Howard and Matheson (1972). Indeed, for small γ this is approximately equal to

E_x^π [ Σ_{k=0}^∞ β^k r(X_k, A_k) ] - (γ/2) Var_x^π ( Σ_{k=0}^∞ β^k r(X_k, A_k) ).

However, from a mathematical point of view it is more tractable than the variance.
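As a quick numerical illustration (a sketch, not taken from the paper), the entropic certainty equivalent of a random payoff can be compared against its mean-variance approximation. For a Gaussian payoff the two agree exactly; the distribution parameters below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.5, size=200_000)  # a random payoff X

gamma = 0.05  # small risk-sensitivity parameter
entropic = -np.log(np.mean(np.exp(-gamma * x))) / gamma  # -(1/g) log E e^{-gX}
approx = x.mean() - 0.5 * gamma * x.var()                # EX - (g/2) Var X

print(entropic, approx)  # the two values nearly coincide for small gamma
```

For γ > 0 the entropic value lies below the mean, which is the risk-averse case discussed next.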
From the approximation it is also obvious that γ > 0 models a risk-averse decision maker (since then the variance is subtracted), whereas γ < 0 corresponds to a risk-loving decision maker. The preceding target function is a special case of the situation we consider here in this paper. It can also be interpreted as a Certainty Equivalent of the exponential utility function. This point of view can then be generalized to Optimized Certainty Equivalents, which we consider in this survey. The aim of this paper is to provide an overview of the ideas, concepts and literature in this area. We will also discuss the situation where the Optimized Certainty Equivalent is applied to the single-stage rewards in a recursive way. However, we will stay within the setting where optimal policies are stationary in a certain sense and can be computed from optimality equations, thus naturally avoiding time-inconsistency issues. Our point of view is mainly from the economics and operations research perspective. We do not consider problems with a finite time horizon, nor do we treat problems in continuous time. For this direction the reader is referred to the recent survey by Biswas and Borkar (2023).
The outline of our survey is the following. In the next section we explain and discuss our main building block for the target function: the Optimized Certainty Equivalent.
The Optimized Certainty Equivalents have been introduced by Ben-Tal and Teboulle (2007) and provide a useful generalization of Certainty Equivalents. They comprise important cases like the entropic risk measure and the Conditional Value-at-Risk and are still tractable from a mathematical point of view. In Section 3 we introduce the theory of Markov decision processes. We restrict our attention to stationary problems (i.e. the model data do not depend on the time point) with an infinite time horizon. Conditions are given under which optimal policies exist and a solution procedure is explained. Section 4 presents the theory when the Optimized Certainty Equivalent is applied recursively. Some generalizations and related problems are discussed at the end. Afterwards, Section 5 treats the situation when the Optimized Certainty Equivalent is applied to the cumulated reward. Here the presented solution technique is via an extension of the state space. Finally, in Section 6 we provide an overview on the risk-sensitive average cost case. Section 7 summarizes some typical applications of the presented theory. The appendix contains two proofs.
Notation. As usual, the symbol N denotes the set of positive integers and N_0 = N ∪ {0}. By R we denote the set of all real numbers. We use the following abbreviations: w.r.t. means with respect to, r.h.s. means right-hand side and l.h.s. means left-hand side.

Certainty Equivalents and Optimized Certainty Equivalents
Decision makers are often risk averse when faced with decisions, in particular when monetary rewards or costs have to be optimized. Consider for example the following two lotteries:
• Lottery 1: receive a reward of 1000 with probability 0.05 and 0 else.
• Lottery 2: receive a reward of 50 with probability 1.
Both lotteries have an expected value of 50. However, when confronted with this choice in reality, most people prefer lottery 2, since they are risk averse and consider the probability of 0.05 to be very low. Thus, it is reasonable to model risk aversion in decision making. This can be done for example by using risk measures.
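The lottery comparison can be made quantitative with a Certainty Equivalent, CE(X) = u^{-1}(E u(X)), which is introduced formally below. The following sketch uses an exponential utility with an arbitrarily chosen risk-aversion level g = 0.01 (an assumption for illustration, not a value from the paper):

```python
import math

# Exponential utility u(x) = 1 - exp(-g*x) with inverse
# u^{-1}(y) = -(1/g) * log(1 - y); risk aversion g is an arbitrary choice.
g = 0.01

def certainty_equivalent(outcomes_probs):
    """CE(X) = u^{-1}(E[u(X)]) for a finite lottery given as (value, prob) pairs."""
    eu = sum(p * (1 - math.exp(-g * x)) for x, p in outcomes_probs)
    return -math.log(1 - eu) / g

ce1 = certainty_equivalent([(1000, 0.05), (0, 0.95)])  # lottery 1
ce2 = certainty_equivalent([(50, 1.0)])                # lottery 2

print(round(ce1, 2), round(ce2, 2))
```

Although both lotteries have mean 50, the risk-averse agent assigns lottery 1 a certainty equivalent of only about 5, so lottery 2 is preferred, exactly as described above.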
In what follows, let (Ω, F, P) be a probability space. All random variables which appear here are defined on this space. We will consider Certainty Equivalents and Optimized Certainty Equivalents. Let u : R → [−∞, ∞) be a strictly increasing, strictly concave utility function. The main purpose of the utility function is to provide a systematic way to rank alternatives that captures the principle of risk aversion, see Von Neumann and Morgenstern (2007). This is accomplished whenever the utility function is concave. The degree of risk aversion exhibited by the utility function corresponds to the magnitude of the bend in the function, i.e. the stronger the bend the greater the risk aversion. The degree of risk aversion is formally defined by the Arrow-Pratt absolute risk aversion coefficient (Arrow (1971); Pratt (1964))

ρ_u(x) = −u''(x)/u'(x).

Basically, this parameter shows how risk aversion changes with the wealth level.
Although the actual value of the expected utility of a random outcome is meaningless except in comparison with other alternatives, there is a derived measure with units that has an intuitive meaning. The Certainty Equivalent of a bounded random income X ∈ L^∞(Ω, F, P) is defined as

CE(X) = u^{−1}( E u(X) ),    (1)

where E is the expectation operator with respect to the probability measure P. CE(X) is the sure amount which yields the same utility as the random outcome. The Optimized Certainty Equivalent is defined as follows (Ben-Tal and Teboulle (2007)): Let u : R → [−∞, ∞) be a proper, closed, concave and non-decreasing utility function with u(0) = 0 and u′_+(0) ≤ 1 ≤ u′_−(0), where u′_+ and u′_− are the right and left derivatives of u. Further, let X ∈ L^∞(Ω, F, P) be a bounded random variable. The Optimized Certainty Equivalent (OCE) for X is the map

S_u(X) = sup_{η ∈ R} { η + E u(X − η) },

which is assumed to be a proper function, which means that the domain dom S_u := {X ∈ L^∞(Ω, F, P) : S_u(X) > −∞} is not empty and S_u is finite on this domain.
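For a finite distribution, the defining supremum over η can be evaluated numerically. The sketch below simply scans a grid of η values; the piecewise-linear utility and the two-point distribution are arbitrary choices for illustration, not from the paper:

```python
import numpy as np

def oce(values, probs, u, etas):
    """Optimized Certainty Equivalent sup_eta { eta + E[u(X - eta)] },
    approximated by maximizing over a finite grid of eta values."""
    values, probs = np.asarray(values, float), np.asarray(probs, float)
    objective = [eta + np.sum(probs * u(values - eta)) for eta in etas]
    return max(objective)

# A concave, non-decreasing utility with u(0) = 0 and u'_+(0) <= 1 <= u'_-(0):
# gains are discounted by 1/2, losses are penalized by a factor of 2.
u = lambda t: np.where(t >= 0, 0.5 * t, 2.0 * t)

values, probs = [0.0, 100.0], [0.5, 0.5]
grid = np.linspace(-50, 150, 2001)
s = oce(values, probs, u, grid)
print(s)  # lies below the mean 50, reflecting risk aversion
```

Here the optimal consumption is η = 0 and the OCE equals 25, strictly below the expectation 50 of the gamble.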
The interpretation here is that the decision maker may consume the amount η today and obtain the present value η + E u(X − η) as a result. Optimizing over the consumption then yields the present value of X. Among others, S_u(X) has the following properties for X, Y ∈ L^∞(Ω, F, P) (see Ben-Tal and Teboulle (2007)):

(P1) Monotonicity: X ≤ Y implies S_u(X) ≤ S_u(Y).
(P2) Translation invariance: S_u(X + c) = S_u(X) + c for every c ∈ R.
(P3) S_u(X) ≤ EX.
(P4) S_u(c) = c for every constant c ∈ R.

Indeed, it can be shown that −S_u is a convex risk measure in the sense of Föllmer and Schied (2010). A random variable X is now preferred over Y if S_u(X) ≥ S_u(Y). Thus (P3) and (P4) imply that this preference order models risk aversion since S_u(X) ≤ EX = S_u(EX), i.e. the sure amount EX is preferred over a random amount X with the same expectation. Moreover, it holds that lim_{δ→0} (1/δ) S_u(δX) = EX, which means that the risk-neutral setting is achieved in the limit.
A further representation of S_u, due to Ben-Tal and Teboulle (2007), is given by

S_u(X) = inf_{Q ∈ Q} { E_Q X + I_φ(Q, P) },

where Q is the set of all probability measures Q absolutely continuous w.r.t. P such that dQ/dP ∈ L^1(Ω, F, P) and I_φ is the usual φ-divergence defined by

I_φ(Q, P) = E[ φ( dQ/dP ) ].

Here φ : R → [0, +∞] is a proper closed convex function with a closed interval (containing 1) as domain and φ(1) = 0. This representation can be exploited in the analysis of risk-sensitive problems and in order to construct a connection to robust decision making, see Dai Pra et al (1996); Bäuerle and Glauner (2022a). Indeed, this representation consists of the risk-neutral part E_Q X where, however, the infimum over a set Q of probability measures is taken. This resembles a robust approach. The φ-divergence term penalizes the distance of Q to P.
The following examples list important special cases of the Optimized Certainty Equivalent.
Example 1. a) When we choose u(x) = (1 − e^{−γx})/γ for some γ > 0, then

S_u(X) = −(1/γ) log E e^{−γX}.    (2)

The quantity −S_u(X) is known as the entropic risk measure, see p. 184 in Föllmer and Schied (2010). However, we shall further also refer to (2) as the entropic risk measure. It is easy to see that in this case S_u(X) coincides with the Certainty Equivalent of X w.r.t. u. A Taylor series expansion yields

S_u(X) ≈ EX − (γ/2) Var(X),

which connects the entropic risk to the mean-variance criterion. The entropic risk measure is the most widely used functional which is applied in risk-sensitive dynamic decision making. This is mainly because it is still mathematically tractable. Indeed, the paper of Howard and Matheson (1972), which is considered to be the first work in this field, coined the name risk-sensitive Markov decision process. Since then the adjective 'risk-sensitive' is often used as a synonym for applying the entropic risk measure.

b) If for α ∈ (0, 1) we choose u(x) = (1/α) min(x, 0), then S_u(X) = −CVaR_α(−X), where the risk measure Conditional Value-at-Risk (CVaR) of a random loss Y is defined as

CVaR_α(Y) = inf_{η ∈ R} { η + (1/α) E(Y − η)_+ }.

The Conditional Value-at-Risk is sometimes also called Average Value-at-Risk or Expected Shortfall. In case of a continuous random variable Y we also have CVaR_α(Y) = E[ Y | Y ≥ VaR_α(Y) ], where VaR_α(Y) is the (1 − α)-quantile of Y.

c) When we choose the quadratic utility u(x) = x − (γ/2)x^2 for some γ > 0, then S_u(X) = EX − (γ/2) Var(X), provided that X − EX ≤ 1/γ almost surely. The mean-variance criterion is a popular decision criterion in finance since its first appearance in Markowitz (1952). However, the interpretation is here restricted to random variables with bounded support.
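To make the first two special cases concrete, the following sketch (the discrete distribution and the parameters γ, α are chosen arbitrarily for illustration) evaluates the defining supremum on a grid of η values and compares it with the closed-form expressions:

```python
import numpy as np

# A discrete payoff distribution, chosen arbitrarily.
vals = np.array([-10.0, 0.0, 5.0, 20.0])
probs = np.array([0.1, 0.3, 0.4, 0.2])
etas = np.linspace(-30, 30, 60001)

def oce(u):
    """sup_eta { eta + E[u(X - eta)] } approximated on the eta grid."""
    return max(eta + np.sum(probs * u(vals - eta)) for eta in etas)

gamma, alpha = 0.2, 0.1

# a) exponential utility  ->  entropic risk measure (2)
s_ent = oce(lambda t: (1 - np.exp(-gamma * t)) / gamma)
closed_form = -np.log(np.sum(probs * np.exp(-gamma * vals))) / gamma

# b) piecewise-linear utility  ->  lower-tail Conditional Value-at-Risk
s_cvar = oce(lambda t: np.minimum(t, 0) / alpha)

print(s_ent, closed_form, s_cvar)
```

The piecewise-linear case returns the expectation of X over its worst α-tail, here the outcome −10 (which carries exactly probability α = 0.1), in line with S_u(X) = −CVaR_α(−X).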

Remark 1.
In what follows we consider optimization problems with rewards. Thus, we maximize S_u. In case we want to minimize cost, we have to define the criterion in a different way. In this case, let ℓ : R → (−∞, ∞] be a proper, closed, convex and non-decreasing function bounded from below with ℓ(0) = 0 and define

ρ_ℓ(X) = inf_{η ∈ R} { η + E ℓ(X − η) },

which is assumed to be a proper function. For X being a cost, this criterion has to be minimized.

The Model
For dynamic decision making we consider the following controlled Markov process in discrete time, see Puterman (2014); Hernández-Lerma and Lasserre (1996); Bäuerle and Rieder (2011). We define the set of histories of the process. At time k = 0 we have H_0 = E. For k ≥ 1 the set of histories is given by H_k = H_{k−1} × A × E = (E × A)^k × E. A policy π = (π_k)_{k∈N_0} is a sequence of measurable mappings π_k : H_k → A such that π_k(h_k) ∈ D(x_k) for every history h_k = (x_0, a_0, ..., x_k) ∈ H_k. The set of all policies is denoted by Π. Let F be the set of all measurable mappings f : E → A such that f(x) ∈ D(x) for every x ∈ E. By our assumption F ≠ ∅. A Markovian policy is a sequence (f_k)_{k∈N_0} where each f_k ∈ F. The class of Markovian policies is denoted by Π_M. A Markovian policy (f_k)_{k∈N_0} is stationary if there is some f ∈ F such that f_k = f for every k ∈ N_0, i.e. the same decision rule f is used throughout the time. We identify a stationary policy with the element of the sequence. Therefore, the set of all stationary policies will be denoted by F. We have F ⊂ Π_M ⊂ Π. Let (Ω, F) be a measurable space consisting of the sample space Ω = (E × A)^∞ with the corresponding product σ-algebra F on Ω. The elements of Ω are the sequences ω = (x_0, a_0, x_1, a_1, ...) ∈ H_∞ with x_n ∈ E and a_n ∈ A for n ∈ N_0. The random variables X_0, A_0, X_1, A_1, ... are defined by X_k(ω) = x_k and A_k(ω) = a_k and represent the state and action process, respectively. Let π ∈ Π and the initial state x be fixed. Then according to the Ionescu-Tulcea theorem there exists a unique probability measure P_x^π on (Ω, F), which is supported on H_∞, i.e. P_x^π(H_∞) = 1. Moreover, for k ∈ N_0:

P_x^π( X_{k+1} ∈ B | X_0, A_0, ..., X_k, A_k ) = q( B | X_k, A_k )

for all measurable B ⊂ E.

Risk Neutral Decision Maker
One of the standard optimization problems for Markov decision processes is to find the maximal expected discounted reward

J_β^*(x) := sup_{π ∈ Π} J_β(x, π),   J_β(x, π) := E_x^π [ Σ_{k=0}^∞ β^k r(X_k, A_k) ],

where β ∈ [0, 1) is a discount coefficient and, if possible, an optimal policy π* with J_β^*(x) = J_β(x, π*). Under some continuity and compactness assumptions, the maximal value J_β^* and an optimal policy can be characterized via the Bellman equation. In order to establish this equation we may use one of two different sets of conditions which are common in the literature, see Schäl (1975, 1983). In what follows, let U(E) be the set of all bounded, non-negative upper semicontinuous functions on E and B(E) the set of all bounded, non-negative Borel measurable functions on E. We equip these spaces with the supremum norm ||·||.

Theorem 1. There exists a function v (in U(E) or B(E), depending on the imposed set of conditions) and a decision rule f* ∈ F such that for all x ∈ E:

v(x) = sup_{a ∈ D(x)} [ r(x, a) + β ∫ v(y) q(dy | x, a) ],

where f*(x) attains the supremum and v(x) = J_β^*(x) = J_β(x, f*) for all x ∈ E, i.e. f* ∈ F is an optimal stationary policy.
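The Bellman equation can be solved by fixed-point iteration on a finite model. The sketch below uses a made-up two-state, two-action kernel q and reward r purely for illustration:

```python
import numpy as np

# Value iteration for a tiny finite MDP (2 states, 2 actions).
q = np.array([[[0.8, 0.2], [0.3, 0.7]],     # q[x, a, y] = q(y | x, a)
              [[0.5, 0.5], [0.9, 0.1]]])
r = np.array([[1.0, 0.0],                   # r[x, a]
              [0.5, 2.0]])
beta = 0.9

v = np.zeros(2)
for _ in range(1000):
    # Bellman operator: (Tv)(x) = max_a [ r(x,a) + beta * sum_y q(y|x,a) v(y) ]
    v_new = np.max(r + beta * q @ v, axis=1)
    if np.max(np.abs(v_new - v)) < 1e-12:
        break
    v = v_new

policy = np.argmax(r + beta * q @ v, axis=1)  # maximizer defines f* in F
print(v, policy)
```

Since T is a β-contraction in the supremum norm, the iterates converge geometrically to the unique bounded fixed point, which is the value function of the discounted problem.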
Theorem 1 can be used to establish the link between the expected discounted reward and the long-run average reward defined as

J(x, π) := liminf_{n→∞} (1/n) E_x^π [ Σ_{k=0}^{n−1} r(X_k, A_k) ]

for any initial state x ∈ E and π ∈ Π. The aim is to find a policy π* ∈ Π such that J(x) := sup_{π∈Π} J(x, π) = J(x, π*) for every x ∈ E. A first relation between discounted reward and long-run average reward is provided by the Hardy-Littlewood theorem. It claims that for bounded sequences of real numbers (R_k)_{k∈N_0} it holds

liminf_{n→∞} (1/n) Σ_{k=0}^{n−1} R_k ≤ liminf_{β↑1} (1−β) Σ_{k=0}^∞ β^k R_k,

and consequently J(x, π) ≤ liminf_{β↑1} (1−β) J_β(x, π). A second relation is given via (4). Let z ∈ E be a fixed state and put h_β(x) := V_β(x) − V_β(z). Under a certain set of conditions and letting β → 1, the pair ((1−β)V_β(z), h_β(·)) would converge to a pair (ξ, h(·)) that satisfies the average reward optimality equation

ξ + h(x) = sup_{a ∈ D(x)} [ r(x, a) + ∫ h(y) q(dy | x, a) ].    (5)

If a set of "reasonably mild" assumptions is imposed on the family of functions {h_β(·)}, then a pair (ξ, h(·)) meets the average reward optimality inequality

ξ + h(x) ≤ sup_{a ∈ D(x)} [ r(x, a) + ∫ h(y) q(dy | x, a) ].    (6)

If the maximizer, say f* ∈ F, of the r.h.s. in (5) or (6) exists, it constitutes an optimal stationary policy, i.e. J(x, f*) = sup_{π∈Π} J(x, π) for every x ∈ E and moreover, the optimal average reward is independent of the initial state and ξ = J(x, f*). This approach is well-described in the literature. The reader is referred to Hernández-Lerma and Lasserre (1996); Piunovskiy (2013), where also other methods are presented with comments and illustrative examples.
There are a number of established computational approaches which can, often after modifications, also be applied to the risk-sensitive cases which we discuss later. For example, if we consider the setting of Theorem 1, the operator

(Tv)(x) = sup_{a ∈ D(x)} [ r(x, a) + β ∫ v(y) q(dy | x, a) ]    (7)

is a contraction on a suitable function space into the same function space. Then, applying Banach's fixed point theorem, the value function and the optimal policy can be approximated by iterating the T-operator. Alternatively, one can start with an arbitrary stationary policy, given by a decision rule f ∈ F, compute the corresponding value V_f and improve it by computing the maximum points on the r.h.s. of (7) with v replaced by V_f. Under mild assumptions this procedure converges to the optimal solution. For computational purposes it is often more convenient to consider the so-called Q-function, which is defined as follows:

Q(x, a) = r(x, a) + β ∫ sup_{a' ∈ D(y)} Q(y, a') q(dy | x, a).

Note that we have J_β^*(x) = sup_{a ∈ D(x)} Q(x, a). This representation has the advantage that the maximization can be done before the integration. The algorithms discussed so far are only applicable when the state and action spaces are of low dimension and all data of the model are known. Modern approximate solution techniques are summarized under the name Reinforcement Learning (RL). The aim of these methods is to find an optimal strategy while simultaneously learning the right model. A popular approach is Q-learning, where the learned action-value function Q^{(t)} directly approximates the Q-function. First we initialize Q^{(0)} arbitrarily. Then we repeat the following steps:

1. Choose an admissible pair (x, a) at random and observe the next state y (or generate y ~ q(·|x, a)).
2. Update at (x, a):

Q^{(t+1)}(x, a) = (1 − α_t) Q^{(t)}(x, a) + α_t ( r(x, a) + β sup_{a' ∈ D(y)} Q^{(t)}(y, a') ),

where the learning rates (α_t) have to be chosen appropriately.
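A minimal tabular sketch of these two steps might look as follows; the two-state model and the decaying learning-rate schedule are made-up illustrative choices, not prescribed by the survey:

```python
import numpy as np

# Tabular Q-learning for a made-up 2-state, 2-action model.
rng = np.random.default_rng(1)
q = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])   # q[x, a, y]
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[x, a]
beta = 0.9

Q = np.zeros((2, 2))
for t in range(100_000):
    x, a = rng.integers(2), rng.integers(2)          # admissible pair at random
    y = rng.choice(2, p=q[x, a])                     # next state ~ q(.|x,a)
    alpha = 100 / (100 + t)                          # decaying learning rate
    Q[x, a] = (1 - alpha) * Q[x, a] + alpha * (r[x, a] + beta * Q[y].max())

# The exact Q-function is the fixed point of
# Q(x,a) = r(x,a) + beta * sum_y q(y|x,a) * max_a' Q(y,a').
Q_exact = np.zeros((2, 2))
for _ in range(2000):
    Q_exact = r + beta * q @ Q_exact.max(axis=1)
print(np.round(Q, 2), np.round(Q_exact, 2))
```

With learning rates satisfying the usual Robbins-Monro conditions and all pairs visited infinitely often, the iterates approach the exact Q-function, here computed for comparison by fixed-point iteration.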
Under mild assumptions this method is known to converge to the Q-function. Further methods parametrize the class of policies, and thus the value function, and estimate the optimal parameters. Though not necessarily optimal, in this situation it is more convenient to work with randomized policies. In order to find the best parameters in this setting, often the gradient is computed and parameters are updated by a gradient ascent rule. For computational issues consult, among others, Sutton and Barto (2018); Powell (2022); Hambly et al (2023).
As discussed in the previous section, this criterion does not account for deviations around the mean, or in other words, the risk of the decision maker. Thus, in what follows we consider risk-sensitive optimization criteria.

Markov Decision Processes with Recursive Risk-Sensitive Preferences
Measuring risk in a stochastic dynamic process is much more complicated than in a single-step situation. Risk may be measured at every stage and then aggregated, or measured by a nested application of risk measures, or a single-step risk measure may be applied to the aggregated discounted reward.
In what follows we assume that the underlying controlled stochastic dynamic process is Markovian (like in the previous section) and that the Optimized Certainty Equivalent risk measures are applied recursively. This setting guarantees that the optimality principle holds and optimal policies are stationary. For k ∈ N_0 let B(H_k) be the set of bounded, measurable functions on H_k, equipped with the supremum norm ||·||. Let π = (π_k)_{k∈N_0} ∈ Π be an arbitrary policy. For v_{k+1} ∈ B(H_{k+1}) and h_k ∈ H_k we define a conditional Optimized Certainty Equivalent

S_u^{h_k}( v_{k+1}(h_k, π_k(h_k), X_{k+1}) ) = sup_{η ∈ R} { η + E[ u( v_{k+1}(h_k, π_k(h_k), X_{k+1}) − η ) ] },

where the random variable X_{k+1} has the distribution q(·|x_k, π_k(h_k)). Then we define the operator L_{π_k} as follows:

(L_{π_k} v_{k+1})(h_k) = r(x_k, π_k(h_k)) + β S_u^{h_k}( v_{k+1}(h_k, π_k(h_k), X_{k+1}) ),

where β ∈ [0, 1) is a discount factor. The operator L_{π_k} is monotone by (P1), i.e. v ≤ w implies L_{π_k} v ≤ L_{π_k} w.
Let now N ∈ N. For the N-stage decision model we apply these operators recursively. Thus, for an initial state x ∈ E, the total discounted recursive risk-sensitive reward under policy π is given by

J_N(x, π) = ( L_{π_0} L_{π_1} · · · L_{π_{N−1}} 0 )(x),

where 0 is the function 0(h_k) ≡ 0 for all h_k ∈ H_k, k ∈ N_0. For N = 2 this equation reads

J_2(x, π) = r(x, π_0(x)) + β S_u^x ( r(X_1, π_1(x, π_0(x), X_1)) ).

Aggregation over time is still additive in this approach. By our assumptions and (P1), the sequence (J_N(x, π))_{N∈N} is non-decreasing and bounded from below by 0 for all x ∈ E and π ∈ Π. Moreover, by (8) we obtain that the sequence is bounded from above. Hence the limit lim_{N→∞} J_N(x, π) exists for x ∈ E and π ∈ Π.
Problem 1. For an initial wealth x ∈ E and a policy π ∈ Π we define the total discounted recursive risk-sensitive reward by

J(x, π) := lim_{N→∞} J_N(x, π).

The aim of the decision maker is to find the maximal value, i.e.

J*(x) := sup_{π ∈ Π} J(x, π).
In order to solve the problem we use dynamic programming. We need essentially the same assumptions as in the risk-neutral case.
A proof of the following theorem can be found in the appendix.
Theorem 2. a) There exists a function V ∈ U(E) [or B(E), depending on the imposed set of conditions] and a decision rule f* ∈ F such that for all x ∈ E:

V(x) = sup_{a ∈ D(x)} [ r(x, a) + β S_u^{(x,a)}( V(X_1) ) ],    (9)

where S_u^{(x,a)} indicates that X_1 has the distribution q(·|x, a), and f*(x) attains the supremum.
b) Moreover, V(x) = J*(x) = J(x, f*) for all x ∈ E, i.e. f* ∈ F is an optimal stationary policy.
If u is an exponential utility, then we obtain in the previous case that S_u is the entropic risk measure (Example 1 a)) and the optimality equation (9) reduces to (see Asienkiewicz and Jaśkiewicz (2017))

V(x) = sup_{a ∈ D(x)} [ r(x, a) − (β/γ) log ∫ e^{−γ V(y)} q(dy | x, a) ]    (10)

for x ∈ E. The expression in brackets on the r.h.s. is also referred to as risk-sensitive Koopmans operator (see Miao (2020); Sargent and Stachurski (2023)). By applying the exponential function on both sides and setting Ṽ(x) := e^{−γV(x)}, the equation can also be written as

Ṽ(x) = inf_{a ∈ D(x)} e^{−γ r(x,a)} ( ∫ Ṽ(y) q(dy | x, a) )^β,

which yields a multiplicative Bellman equation.
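On a finite model the recursive entropic optimality equation can again be solved by fixed-point iteration, since the entropic OCE is nonexpansive and β < 1. The sketch below reuses a made-up two-state, two-action model for illustration:

```python
import numpy as np

# Fixed-point iteration for the recursive entropic optimality equation
# V(x) = max_a [ r(x,a) - (beta/gamma) * log sum_y q(y|x,a) exp(-gamma V(y)) ].
q = np.array([[[0.8, 0.2], [0.3, 0.7]],
              [[0.5, 0.5], [0.9, 0.1]]])   # q[x, a, y]
r = np.array([[1.0, 0.0], [0.5, 2.0]])     # r[x, a]
beta, gamma = 0.9, 1.0

V = np.zeros(2)
for _ in range(2000):
    # entropic OCE of V(X_1), computed per state-action pair
    S = -np.log(q @ np.exp(-gamma * V)) / gamma
    V_new = np.max(r + beta * S, axis=1)
    if np.max(np.abs(V_new - V)) < 1e-12:
        break
    V = V_new

S = -np.log(q @ np.exp(-gamma * V)) / gamma
policy = np.argmax(r + beta * S, axis=1)   # stationary optimal decision rule
print(V, policy)
```

For γ → 0 the inner term approaches the ordinary expectation and the iteration reduces to risk-neutral value iteration.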
A discounted, recursive entropic cost linear quadratic Gaussian regulator problem with infinite time horizon has been treated in Hansen and Sargent (1995). Conditional consistency of the recursive entropic risk measure is discussed in Dowson et al (2020). An efficient learning algorithm for recursive Optimized Certainty Equivalents based on value iteration and upper confidence bounds can be found in Xu et al (2023); Fei et al (2021), where the latter concentrates on the entropic risk measure.
CVaR optimization (which according to Example 1 b) is another special case) for a finite time horizon applied at the terminal wealth has been considered in Rudloff et al (2014); Pflug and Pichler (2016), and for the infinite time horizon in Ugurlu (2018). The authors also discuss time-consistency issues of optimal policies. Shapiro et al (2013) consider risk averse approaches (in terms of a weighted criterion of expectation and CVaR) to multistage (linear) stochastic programming problems based on the Stochastic Dual Dynamic Programming method. For further computational approaches see Kozmík and Morton (2015). The recursive CVaR is very popular for applications (see Section 7).
Some papers have studied the more general class of convex risk measures for a nested application to stochastic dynamic decision problems. For example, Shen et al (2013, 2014); Chu and Zhang (2014); Bäuerle and Glauner (2022b) consider the infinite time horizon, unbounded cost functions and establish optimality equations and existence of optimal policies. Martyr et al (2022) consider an iterated G-expectation for non-Markovian optimal switching problems. In Dowson et al (2022) the problem is tackled as a multistage stochastic program. Algorithms based on stochastic dual dynamic programming and the special role of the entropic risk measure in this class are discussed in Shapiro (2021); Dupačová and Kozmík (2015). Philpott et al (2013) use inner and outer approximations based on dynamic programming. Further algorithms can be found in Le Tallec (2007); Tamar et al (2016); Guigues (2016); Huang et al (2021). Algorithms for a finite time horizon and convex risk measures based on reinforcement learning are studied in Coache and Jaimungal (2023).
There are further recursive risk-sensitive preferences in the literature which are not covered by our model. Kreps and Porteus (1978) and Epstein and Zin (1989) propose an alternative specification of lifetime value that separates and independently parametrizes temporal elasticity of substitution and risk aversion. To be more precise, Kreps and Porteus (1978) consider finite time horizon recursive preferences with the conditional Certainty Equivalent defined with u(x) = x^{1−γ}, γ > 0 and γ ≠ 1, see (1). Here, the parameter γ is responsible for the level of relative risk aversion. Epstein and Zin (1989) generalize their approach to the infinite time horizon and suggest a CES form of aggregation. The function v_n denotes the future payoff from period n ∈ N_0 onwards when the process is governed by a Markovian policy (f_n) ∈ Π_M. Moreover, we assume that ρ > 0 and ρ ≠ 1. The value 1/ρ represents a Constant Elasticity of Intertemporal Substitution (CES). Therefore, the Epstein-Zin aggregator (named after its authors) is also called a CES time aggregator. Epstein and Zin (1989) obtain a remarkable result for the existence of recursive utilities across a broad set of parameters γ and ρ. Their results have been further strengthened by Ozaki and Streufert (1996), who provide an extensive analysis of existence and uniqueness of recursive utilities by introducing the notion of biconvergence. This concept requires that returns can be sufficiently discounted from above and sufficiently discounted from below. Moreover, their results are useful for studying dynamic programming with non-additive stochastic objectives in a pretty general setting. The Epstein-Zin time aggregator has also been examined by Weil (1993), but with the conditional Certainty Equivalent defined by an exponential utility function. The aforementioned recursive preferences are very popular among economists (see for instance Sargent and Stachurski (2023); Miao (2020) and references cited therein), who put a lot of criticism on the standard expected discounted utility. To learn more on this subject the reader is referred to the notes following Chapter 7 in Sargent and Stachurski (2023). It is worth mentioning that the CES time aggregator and different conditional Certainty Equivalents have also been exploited within the dynamic programming framework by a number of authors, see the references in Ren and Stachurski (2018), Chapter 8 in Sargent and Stachurski (2023). Marinacci and Montrucchio (2010) propose a new class of Thompson aggregators and study a class of quasi-arithmetic Certainty Equivalent operators that generalize those of Kreps and Porteus (1978). Based on specific properties of such operators and the time aggregator, they provide a comprehensive analysis of existence, uniqueness and global attractivity of a continuation value process. Particularly, they make use of monotonicity and concavity of the Thompson aggregator and subhomogeneity of the quasi-arithmetic operator. These facts allow them to define a contraction within the Thompson metric. Bloise and Vailakis (2018) develop an approach to convex programs for bounded recursive utilities. Their technique relies upon the theory of monotone concave operators. An extension is given in Bloise et al (2021). Iwamoto (1999), on the other hand, treats optimization problems with nested recursive utilities given by applying appropriate functions. A dynamic programming approach is used to solve the problems.
Further extensions include Feinstein and Rudloff (2017) and Schlosser (2020). In the latter paper a multi-valued dynamic programming approach is considered that allows one to control the moments of the distributions of future rewards. The former paper is devoted to the development of set-valued risk measures and recursive algorithms for a dynamic setting.

Markov Decision Processes with Risk-Sensitive Discounted Reward
Instead of applying the Optimized Certainty Equivalent recursively, one can also apply it to the discounted sum of the rewards. Within such a framework the optimal policies need not be time-consistent. We say that a multiperiod stochastic decision problem is time-consistent if, resolving the problem at later stages (i.e., after observing some random outcomes), the original solutions remain optimal for the later stages. For a recent survey of different approaches to dynamic decision problems with risk measures and their connection to time-consistency, see Homem-de-Mello and Pagnoncelli (2016).
We use the same MDP model as in the previous section. For a fixed history ω = (x_0, a_0, x_1, a_1, ...) ∈ H_∞ let us define the sum of the discounted rewards by

R_β^∞(ω) = Σ_{k=0}^∞ β^k r(x_k, a_k),    (12)

where we always assume that the initial state x_0 = x. We also put

S_u^π(R_β^∞) = sup_{η ∈ R} { η + E_x^π [ u(R_β^∞ − η) ] },

where with a little abuse of notation R_β^∞ in (12) is now understood as a random variable on (Ω, F) with the distribution P_x^π supported on H_∞. In other words, S_u^π indicates that the distribution of R_β^∞ is P_x^π. Then we consider the following problem.
Problem 2. For an initial wealth x ∈ E and a policy π ∈ Π we define the total discounted risk-sensitive reward by

J(x, π) := S_u^π(R_β^∞).

The aim of the decision maker is to find the maximal value, i.e.

J*(x) := sup_{π ∈ Π} J(x, π).
A comparison between the values obtained when a coherent risk measure is applied outside or recursively (without a control problem) can be found in Iancu et al (2015). Note that in the case of no discounting (β = 1), Problem 1 and Problem 2 are equivalent. This follows from (P2) and (P4). However, discounting ensures that the value of the problem is finite since we have bounded rewards. Without discounting, it depends on the distribution of (X_k, A_k)_k whether the expectations are finite. The motivation or interpretation of applying the risk measure outside is somewhat easier than for the recursive application of the risk measure. It can be deduced in particular from the different representations of S_u in Example 1.
In order to solve Problem 2, note that by the definition of S_u,

sup_{π ∈ Π} S_u^π(R_β^∞) = sup_{η ∈ R} { η + sup_{π ∈ Π} E_x^π [ u(R_β^∞ − η) ] }.    (13)

Thus, we essentially have to solve the inner optimization problem sup_{π ∈ Π} E_x^π [ u(R_β^∞ − η) ]. The challenge here is that there is no obvious optimality equation for solving the problem. A way to work around this is to enlarge the state space. This has been done in Bäuerle and Rieder (2014). More precisely, it is helpful to introduce a new MDP on an extended state space Ẽ := E × [−η, ∞) × [0, 1]. Decision rules f are now measurable mappings from Ẽ to A respecting f(x, y, z) ∈ D(x) for every (x, y, z) ∈ Ẽ. Denote this set of decision rules by F̃. Policies are defined in an obvious way and with a little abuse of notation denote the set of all policies in this new MDP by Π̃. For any policy π ∈ Π̃ let V_n(x, y, z), n ∈ N ∪ {∞}, be the corresponding value functions on the extended state space. Thus, we are looking for V_∞(x, −η, 1), which is the value of the inner optimization problem in (13). Let us denote by U(Ẽ) the set of all upper semicontinuous functions v such that v(x, ·, ·) is continuous and increasing in both variables for all x, and v(x, y, z) ≥ u(y). Moreover, denote b(y, z) := u(y + z d/(1−β)), where d is a lower bound for r (possibly zero). The next theorem summarizes the solution.
If we denote the operator T : U(Ẽ) → U(Ẽ) by

(Tv)(x, y, z) = sup_{a ∈ D(x)} ∫ v( x', y + z r(x, a), βz ) q(dx' | x, a),

then it can also be shown that V_∞ is a fixed point of T and lim_{n→∞} T^n b = V_∞. This implies that value iteration works here and yields numerical bounds on the value function.
In Bäuerle and Rieder (2014) it has also been shown that the policy improvement converges.
If u is an exponential utility, we obtain in the previous case that S_u^π is related to the entropic risk measure (Example 1 a)). Here we can drop the component y and obtain the optimality equation

V(x, z) = sup_{a ∈ D(x)} [ r(x, a) − 1/(γz) log ∫ e^{−γ z β V(x', βz)} q(dx' | x, a) ].

Note here the difference to the optimality equation given in (10), where we use the nested application of the entropic risk measure. In case β = 1 the value function V_∞ does not depend on z and both equations coincide.
Next we give a simple example from Jaquette (1976) to show the difference in optimal policies within the aforementioned frameworks.
From state 2 and from state 3 the process always jumps to state 1 with probability 1. Obviously, there are two stationary strategies f and g.
Assume that β = 1/2 and the initial state is x_0 ≡ 1. Then, the decision maker essentially chooses between two independent gambles every other period. The first gamble, call it X_f, gives the payoff 0 or 4 with equal probabilities, whilst the second gamble, call it X_g, yields the reward 1 with probability 0.9 or 5 with probability 0.1. Since EX_f = 2 > EX_g = 1.4, the risk-neutral decision maker prefers the stationary policy f to g. Hence, the maximal expected discounted reward is equal to

Σ_{n=0}^∞ (1/2)^{2n} EX_f = 8/3.

Let us suppose that the decision maker uses the Optimized Certainty Equivalent defined in (2) with γ = 1. Consider first Problem 1. Then, solving the optimality equation of Theorem 2, g is an optimal stationary policy and the maximal reward is ≈ 1.4035. Now let us turn to Problem 2. In our case the aim is to maximize over the set of all policies π ∈ Π the functional

−log E_1^π [ exp( −Σ_{k=0}^∞ (1/2)^k r(X_k, A_k) ) ].

This is equivalent to the minimization of the expression

E_1^π [ exp( −Σ_{k=0}^∞ (1/2)^k r(X_k, A_k) ) ]

over the set of all π ∈ Π. Since the decision maker chooses in each period between two independent gambles X_f and X_g, then

E_1^π [ exp( −Σ_{k=0}^∞ (1/2)^k r(X_k, A_k) ) ] = Π_{n=0}^∞ E [ exp( −(1/2)^{2n} X_{2n} ) ],

where X_0, X_2, ... are independent random variables with the distribution of X_f or X_g, depending on whether the policy π = (π_k) indicates to use f or g in period k = 0, 2, 4, .... Clearly, π_k ≡ a for k = 1, 3, 5, .... Therefore, each factor can be minimized separately. Observe that for s > 0,

E[ e^{−s X_g} ] = 0.9 e^{−s} + 0.1 e^{−5s} < 0.5 ( 1 + e^{−4s} ) = E[ e^{−s X_f} ].

This holds for s > 0.455904. Hence, for the decision maker g is better than f in periods 2n for which (1/2)^{2n} > 0.455904. This is equivalent to 2n < 1.1332. Summing up, the optimal policy is (g, f, f, f, ...). Obviously, the policy is not stationary and it is not time-consistent. However, this policy is ultimately stationary, i.e., there is a period such that from this period onwards the policy is stationary. In fact, Jaquette (1976) proves that an MDP with a finite state space and the entropic risk measure must be ultimately stationary. This need not be true for MDPs with an infinite state space.
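The switching threshold in this example can be checked numerically by comparing the exponential moments of the two gambles and bisecting on the preference flip:

```python
import numpy as np

# Exponential moments E[exp(-s*X)] of Jaquette's two gambles.
def m_f(s):  # X_f: 0 or 4 with equal probabilities
    return 0.5 * (1 + np.exp(-4 * s))

def m_g(s):  # X_g: 1 w.p. 0.9, 5 w.p. 0.1
    return 0.9 * np.exp(-s) + 0.1 * np.exp(-5 * s)

# g is preferred (smaller exponential moment) for large s, f for small s;
# bisect on the sign of the difference to locate the crossing point.
lo, hi = 0.1, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if m_g(mid) < m_f(mid):
        hi = mid
    else:
        lo = mid
print(round(lo, 4))  # approx. 0.4559, the threshold quoted in the text
```

Since the discount weights are s = (1/2)^(2n), only n = 0 exceeds the threshold, confirming that the optimal policy uses g once and f in all later decision periods.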
For other examples illustrating the lack of stationarity and time-consistency the reader is referred to Brau-Rojas et al (1998).
The first studies of this entropic setting are due to Howard and Matheson (1972) and Jaquette (1976). Linear-quadratic problems with a finite time horizon and the entropic risk measure are considered in Jacobson (1973); Whittle (1981). A more general approach can be found in Chung and Sobel (1987), where fixed point theorems for the whole distribution of the infinite time horizon discounted reward in a finite MDP are considered. In Collins and McNamara (1998) the authors deal with a finite time horizon problem in which they maximize a strictly concave functional of the distribution of the terminal state. Coraluppi and Marcus (1999) connect the problem with the entropic risk measure to a minimax criterion for finite-state MDPs. A turnpike theorem for a risk-sensitive MDP model with stopping is shown in Denardo and Rothblum (2006). Although Di Masi and Stettner (1999) consider the average reward criterion, they also solve, as a by-product, the infinite time horizon discounted model with Borel state and action spaces.
Numerical methods for MDPs with the entropic risk measure and finite and infinite time horizons are given in Hau et al (2023). A finite time horizon, non-discounted MDP with Borel state and action spaces and the entropic risk measure is considered in Chapman and Smith (2021). General Certainty Equivalents for MDPs with Borel state and action spaces and finite and infinite time horizons are treated in Bäuerle and Rieder (2014). Partially observable MDPs with entropic risk measures are examined in James et al (1994); Fernández-Gaucherand and Marcus (1997); Bäuerle and Rieder (2015, 2017).
The special case of optimizing the CVaR of R_β^∞ with bounded reward has been considered in Bäuerle and Ott (2011). A numerical algorithm and the connection to robust optimization problems are discussed in Chow et al (2015); Ding and Feinberg (2022). Unbounded cost problems with CVaR are treated in Ugurlu (2017). In Chapman et al (2023) the authors minimize the CVaR of a maximum random cost over a finite time horizon. Kadota et al (2006) maximize the expected utility of the total discounted reward subject to multiple expected utility constraints.

Markov Decision Processes with Other Risk-Sensitive Payoff Criteria

In this section we focus on payoff criteria other than those considered in Sections 4 and 5. We start with average risk-sensitive payoff criteria, where the controller is equipped with a constant Arrow-Pratt risk coefficient, i.e. she evaluates her future income using an exponential utility function. However, sometimes instead of a reward r in the MDP we shall study a cost c : D → R_+. This is because the papers published so far with this criterion mainly deal with a minimization problem and, moreover, cost minimization is not equivalent to reward maximization under a change of sign in the cost function, as it is in the risk-neutral case (see also Remark 1).
Problem 3. For an initial state x ∈ E and a policy π ∈ Π we shall consider the following cost functional: Here, in order to ensure that the average risk-sensitive cost is well-defined, we assume as before that c is bounded. The objective is to find the minimal cost The policy π* is optimal for the ergodic risk-sensitive control problem if Note that then the optimal cost ξ(x) must be independent of x.
The paper of Howard and Matheson (1972) is a pioneering work that deals with the aforementioned problem for MDPs with finite state and action spaces. They assume that the Markov chain is aperiodic and comprises one communicating class under any stationary policy. The Perron-Frobenius theory of positive matrices allows them to establish a solution to the optimality equation, which is of the form for every x ∈ E. Here ξ_o is a real number and h : E → R is a given function.
If the equation holds, two facts can be proved. Firstly, the optimal cost is ξ(x) = ξ_o/γ for every x ∈ E. Secondly, the minimizer of the r.h.s. in (15) (if it exists), say f*, defines an optimal stationary policy f* ∈ F, which means ξ_o/γ = J(x, f*), x ∈ E. It should be noted that the optimal cost need not be constant (unlike in the risk-neutral case) if the Markov chain induced by a stationary policy has transient states; see Brau-Rojas et al (1998) for counterexamples. The communication properties of the Markov chains in the analysis of the ergodic risk-sensitive control problem are highlighted in Cavazos-Cadena and Hernández-Hernández (2002). Since then, finite state space models have been extensively developed and the Perron-Frobenius theory has been employed; see, among others, Sladký (2008, 2018); Rothblum (1984); Cavazos-Cadena and Hernández-Hernández (2009) and the references cited therein. In addition, the Perron-Frobenius theory provides a link between risk-sensitive control and the Donsker-Varadhan theory of large deviations. It is known that, under suitable recurrence conditions, the occupation measure of a Markov process satisfies the large deviation principle with rate function given by the convex conjugate of a long-run expected rate of exponential growth function. Such a variational formula for the optimal growth rate of reward in the spirit of the Donsker-Varadhan formula is given in Anantharam and Borkar (2017), where the existence of a Perron-Frobenius eigenvalue and an associated eigenfunction is analyzed via the nonlinear Krein-Rutman theorem. For further results in this direction the reader is referred to Cavazos-Cadena (2018); Arapostathis et al (2016).
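For a fixed stationary policy the optimality equation reduces to a Perron-Frobenius eigenvalue problem for the positive matrix with entries e^{γc(x,f(x))} q(y|x), which power iteration solves. The sketch below is our own illustration; the costs, transition matrix and γ are hypothetical, chosen only for concreteness:

```python
import numpy as np

# For a FIXED stationary policy f, the multiplicative Poisson equation
#   e^{xi_o} e^{h(x)} = e^{gamma c(x, f(x))} * sum_y q(y|x) e^{h(y)}
# is the eigenvalue problem M v = lambda v with
#   M[x, y] = exp(gamma * c[x]) * q[x, y],  lambda = e^{xi_o},  v = e^h.

gamma = 0.5                               # risk-sensitivity parameter
c = np.array([1.0, 2.0, 0.5])             # one-stage costs c(x, f(x))
q = np.array([[0.2, 0.5, 0.3],            # transition matrix under f
              [0.6, 0.1, 0.3],
              [0.3, 0.3, 0.4]])

M = np.exp(gamma * c)[:, None] * q        # positive matrix

# Power iteration for the dominant (Perron) eigenvalue and eigenvector.
v = np.ones(3)
for _ in range(500):
    w = M @ v
    lam = w.max()
    v = w / lam

xi_o = np.log(lam)                        # Perron root on log scale
h = np.log(v)                             # relative value function (up to a constant)
average_cost = xi_o / gamma               # optimal ergodic risk-sensitive cost
print(average_cost)
```

The quantity `average_cost` corresponds to ξ_o/γ for this fixed policy; optimizing over policies would wrap a minimization over actions around the same fixed-point structure.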
A nice characterization of the optimal cost via a minimization problem in a finite-dimensional Euclidean space is given in Cavazos-Cadena and Hernández-Hernández (2005), where the transition law of the Markov chain satisfies a simultaneous Doeblin condition. This result is generalized to an MDP model on a Borel state space in Cavazos-Cadena and Salem-Silva (2010).
The second approach to solving the ergodic risk-sensitive control problem is based on an approximation technique. This can be done either via discounted risk-sensitive cost models, Cavazos-Cadena and Fernández-Gaucherand (2000); Cavazos-Cadena and Cruz-Suárez (2017) (as in Problem 2), or via certain discounted risk-sensitive dynamic games, see Cavazos-Cadena and Hernández-Hernández (2002, 2011); Hernández-Hernández and Marcus (1996, 1999) for the countable state space case and Di Masi and Stettner (1999, 2000); Jaśkiewicz (2007a,b) for the general state space case. This technique leads, via the vanishing discount factor approach, to the optimality equation or to the optimality inequality (when the sign '=' in (15) is replaced by '≥'). For instance, the existence of a solution to the optimality inequality is established in Hernández-Hernández and Marcus (1999); Jaśkiewicz (2007a), where a generalization of the Hardy-Littlewood formula, known as a uniform Tauberian theorem, is needed; see Jaśkiewicz (2007a) and Proposition 1 in Jaśkiewicz and Nowak (2014). The essential ingredient in this approach is the variational formula for the logarithmic moment-generating function (see Fleming and Hernández-Hernández (1997); Dai Pra et al (1996); Dembo and Zeitouni (1998)). It should be noted that, in contrast to the risk-neutral case, to obtain a solution to the optimality equation or inequality one needs to assume, in addition to ergodicity conditions, that the absolute value of the risk coefficient is sufficiently small. This condition is imposed either explicitly or implicitly, i.e. other conditions in fact enforce this requirement, see Example 1 in Jaśkiewicz (2007a). There is only one exception: the so-called invariant models, in which the transition probabilities are independent of the state, see Jaśkiewicz (2007b). A further discussion of the conditions under which the optimality equation or the optimality inequality holds is provided in Cavazos-Cadena (2010).
The ergodic risk-sensitive control problem has also been attacked from other directions. Borkar and Meyn (2002) apply an ergodic multiplicative theorem and assume a simple growth condition on the one-stage cost function. They establish the optimality equation for a countable state Markov decision chain. More recent results for countable state space models have been developed in Biswas and Pradhan (2022); Chen and Wei (2023). Finally, an approximation by uniformly ergodic controlled Markov processes for a general state space model under a minorization condition is studied in Di Masi and Stettner (2007). The mutual relationships between the aforementioned works, an extensive discussion of other results and a list of further references are given in the excellent survey of Biswas and Borkar (2023). Lastly, we mention that the nested form of an average risk-sensitive reward is discussed in Shen et al (2013).
Parallel to the theoretical results, much effort has been put into developing efficient algorithms for the ergodic risk-sensitive control problem. Value iterations are established in Bielecki et al (1999b); Cavazos-Cadena and Montes-de-Oca (2003) for stationary models and in Cavazos-Cadena and Montes-de-Oca (2005) for nonstationary models. A Q-learning algorithm is proposed in Borkar (2002) and a version of an actor-critic algorithm is considered in Borkar (2001). However, these algorithms do not incorporate any approximation of the value function in order to defeat the curse of dimensionality. Such an approximation, in terms of a linear combination of a moderate number of basis functions, is developed in Basu et al (2008). The learning scheme iteratively learns the coefficients in the linear combination instead of the whole value function. Other tools are applied in Arapostathis and Borkar (2021) and Borkar (2017), where equivalent linear and dynamic programs are derived. The former work deals with minimization of the asymptotic growth rate of the cumulative cost, whereas the latter uses a variational representation of the asymptotic growth rate of the risk-sensitive reward obtained in Anantharam and Borkar (2017). This technique makes it possible to link the average risk-sensitive reward with linear programming without assuming irreducibility of the Markov chain.
Apart from the average cost/reward criteria defined with the help of an exponential utility function, there are papers that deal with other average risk-sensitive payoff criteria for which traditional dynamic programming fails. For example, in Cavazos-Cadena and Hernández-Hernández (2016) a finite-state irreducible risk-sensitive MDP is considered where the usual exponential utility is replaced by an arbitrary utility function (see also Stettner (2023)). The authors prove a connection to the exponential utility criterion. Xia (2020) studies the optimization of a combined mean-variance metric, assuming that the finite state Markov decision chain is ergodic under any stationary policy. More precisely, for f ∈ F and an initial state x ∈ E he defines where λ > 0 is a trade-off parameter and Note that J_av and J_0 are independent of the initial state because of the ergodicity condition. The objective is to find a stationary policy f* ∈ F which maximizes the associated value, i.e. f* ∈ arg max_{f∈F} J_0(x, f) for all x ∈ E. Since the optimality equation does not hold, the theory of sensitivity-based optimization is utilized. A version of the value iteration algorithm is proposed to find an optimal policy. The theory of sensitivity-based optimization is also applied in Xia and Glynn (2022) to ergodic Markov decision chains when the CVaR measure is used. In this work Xia and Glynn (2022) consider cost functions and aim at the cost functional where F^{-1}_{c(X_k, A_k)}(α) denotes the upper α-quantile of the random variable c(X_k, A_k). The objective is to find an optimal policy, i.e. f* ∈ F such that f* ∈ arg min_{f∈F} CVaR_α^f. In particular, the authors establish the local optimality equation and develop a policy iteration procedure that turns out to be more efficient than solving the bilevel MDP problem examined, among others, for risk-sensitive discounted rewards in Bäuerle and Ott (2011).
Finally, let us mention the undiscounted models, i.e. models in which the discount factor β = 1 and the time horizon is infinite. MDPs with non-positive payoffs and an entropic risk measure are studied in Jaśkiewicz (2008), where a non-recursive case is treated (as in Problem 2). The aim is to show the existence of an optimal stationary policy and the convergence of the value iteration algorithm. In Çavuş and Ruszczyński (2014), on the other hand, a recursive undiscounted cost is defined with the aid of Markov risk measures. For so-called uniformly risk transient Markov decision processes the optimality equation is established and the existence of an optimal stationary policy is proved.

Applications
In this section we summarize some applications of risk-sensitive criteria in dynamic, discrete-time optimization problems. This is not a complete list, but simply a biased selection of examples. We start with the entropic risk measure.

Entropic Risk Criterion
One area of applications where the entropic risk criterion is used is financial mathematics and economics. In Bielecki et al (1999a) the authors consider an investment problem in a financial market with a factor process given by a Markov chain (X_t)_{t∈N}. The evolution of the wealth is defined by where r is a fixed interest rate, (Z_t)_t are the relative price vectors, conditionally independent given the states of the Markov chain at times t and t+1, and (π_t)_t are the proportions of wealth invested in the risky assets. The aim is to maximize the lim inf over all investment strategies. Under some irreducibility assumptions an optimal investment strategy is stationary and is characterized by the optimality equation given in (15). Stettner (1999) considers a similar problem, which, however, stems from a discretized version of a continuous Black-Scholes model with several factors. The optimization criterion is again (16). Under a uniform ergodicity condition an optimal investment strategy is characterized via the optimality equation. The cases with (proportional) and without transaction costs are considered. The model with proportional transaction costs and consumption is taken up in Stettner (2005). Finally, the assumptions are further relaxed in Pitera and Stettner (2023) for the same optimization criterion. Bäuerle and Jaśkiewicz (2018) consider a stochastic optimal growth model with nested entropic risk measures. The model is as follows: an agent obtains the output x_t, which is divided between consumption a_t and investment (saving) y_t = x_t − a_t. From consumption a_t the agent receives utility u(a_t). Investment is used for production with input y_t, yielding output x_{t+1} = f(y_t, ξ_t), where (ξ_t) is a sequence of i.i.d. shocks and f is a production function. The criterion of Problem 1 is used for the aggregation of the utilities. The value function and an optimal policy are again characterized via the optimality equation. Properties of the optimal consumption strategy are also shown. The problem is solved explicitly for special utility and production functions. The results are extended in Goswami et al (2022) to include regime switches.
Other applications in economics touch on the problem of precautionary savings, which is one of the most studied issues in the theory of choice under uncertainty. For example, Luo and Young (2010) study the consumption-savings behavior of households who have risk-sensitive preferences and suffer from limited information-processing capacity (rational inattention). The value iteration is, as for Problem 1, given by where x is the present value of lifetime resources, c is consumption and c̄ denotes a bliss point. The authors solve the model explicitly and show that rational inattention increases precautionary savings by interacting with income uncertainty and risk sensitivity. They show that the model displays a wide range of observational equivalence properties, implying that consumption and savings data cannot distinguish between risk sensitivity, robustness, or the discount factor, in any combination. Bommier and Le Grand (2019), on the other hand, examine non-stationary models of precautionary savings with recursive risk-sensitive preferences (as in Problem 1) of infinitely-lived agents. Agents are endowed with an exogenous income process (Z_t). The value function in period t is given by the equation where x_t is the wealth at time t, a_t is the consumption at time t and z^t = (z_0, ..., z_t) is the realized exogenous income trajectory. Here, ũ is the one-stage utility of a household. It is assumed that the function (z_0, ..., z_t) → P(Z_{t+1} ≥ z | z_0, ..., z_t) is nondecreasing. Moreover, a_t > 0, x_t + Z_t − y_t = a_t and x_{t+1} = r_{t+1} y_t, where y_t is investment and r_{t+1} is the deterministic (but time-varying) gross interest rate between periods t and t+1. Additionally, the constraint y_t ≥ ȳ_t(z^t) allows the agent to borrow, but no more than she can repay in the worst scenario. The main result states that greater risk aversion (a greater absolute value of γ) implies a higher propensity to save at any time. This leads to the conclusion that greater risk aversion implies greater accumulated wealth, i.e. larger precautionary savings. It should be stressed that this is not the case when other recursive preferences are considered, for instance, the Epstein-Zin-Weil preferences, see Epstein and Zin (1989); Weil (1990), or the preferences developed in Weil (1993). The reader is referred to the numerical results obtained in Bommier and Le Grand (2019).
It is worth mentioning that Pareto-optimal consumption allocations are studied by Anderson (2005), who also assumes that the agents have recursive risk-sensitive preferences defined by an exponential utility function.
Nested entropic risk measures are used in actuarial theory as well. In this respect the reader is referred to the works of Bäuerle and Jaśkiewicz (2015, 2017). In the latter paper, within the recursive preference framework, they determine the optimal dividend strategy for an insurance company and derive a policy improvement algorithm.
The next prominent applications can be found in operations research. The paper of Bouakiz and Sobel (1992) is one of the first to apply the exponential utility function to the multiperiod newsvendor model. The authors minimize the risk-sensitive discounted cost, i.e. as in Problem 2. It is shown that a base-stock policy is optimal and depends on the length of the time horizon, the discount factor and the risk parameter. For the infinite time horizon an optimal policy is ultimately stationary. These considerations are extended to models with dependent demands in Choi and Ruszczyński (2011), where the asymptotic behavior of the solution as the risk aversion coefficient converges to zero or infinity is analyzed. Another interesting issue, from the area of revenue management, can be found in Barz and Waldmann (2007). The approach is explained in the setting of optimal airline ticket booking, where the airline has to decide whether or not to accept a request for a certain fare given the remaining capacity. The target function is the one from Problem 2. The optimal strategy is computed and compared to the risk-neutral setting. Further applications to revenue management with different risk-averse target functions can be found in Schlosser (2015, 2016). A survey of risk-sensitive and robust revenue management problems can be found in Gönsch (2017), where among other issues capacity control and dynamic pricing are considered. Finally, Denardo et al (2007) consider the multi-armed bandit problem with an exponential utility and the criterion of Problem 2. They show the optimality of a certain index policy using analytical arguments.
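To illustrate the flavor of such inventory results, here is a minimal single-period sketch (our own construction; Bouakiz and Sobel (1992) treat the multiperiod model). It compares the entropic-optimal order quantity with the risk-neutral one; price, cost, demand distribution and γ are all hypothetical:

```python
import numpy as np

# Single-period newsvendor: order q units at unit cost, sell min(q, D) at
# the given price. The risk-sensitive agent minimizes E[exp(-gamma*profit)].

gamma = 0.1
price, cost = 4.0, 1.0
demand = np.arange(0, 21)                     # hypothetical demand support
probs = np.ones_like(demand, dtype=float) / len(demand)   # uniform demand

def entropic_objective(q):
    profit = price * np.minimum(q, demand) - cost * q
    return np.exp(-gamma * profit) @ probs

def expected_profit(q):
    return (price * np.minimum(q, demand) - cost * q) @ probs

orders = np.arange(0, 21)
risk_sensitive_q = min(orders, key=entropic_objective)
risk_neutral_q = max(orders, key=expected_profit)

print(risk_sensitive_q, risk_neutral_q)
```

Because the exponential criterion penalizes losses disproportionately, the risk-sensitive order quantity comes out below the risk-neutral critical-fractile quantity, in line with the qualitative findings cited above.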
Applications in computer science and engineering are as follows. One of the first papers is Koenig and Simmons (1994). The authors discuss goal-reaching problems (e.g. for robots) under risk-sensitive criteria. They obtain the following optimality equation (there is no discounting):

$$V(x) = \min_{a \in D(x)} \Big[ \sum_{y \notin G} q(y|x,a)\, e^{\gamma c(x,a,y)}\, V(y) + \sum_{y \in G} q(y|x,a)\, e^{\gamma (c(x,a,y)+r(y))} \Big],$$

where G ⊂ E is the set of goal states, c(x, a, y) is the cost of executing action a in state x and proceeding to state y, and r is the terminal reward function. Solution algorithms, in particular under a change of measure, are discussed and some block-world problems are considered. In Medina et al (2012); Befekadu et al (2015) the authors consider a finite time horizon linear-quadratic problem with a target function as in Problem 2 with an exponential utility. In Medina et al (2012) the setting is to optimize a human-robot interaction such that the physically coupled human-robot pair follows a desired trajectory. Befekadu et al (2015) study the impact of cyber-attacks in control systems with partial observation. Further, Mazouchi et al (2022) investigate a risk-averse preview-based Q-learning planner for navigation of autonomous vehicles on a multi-lane road. The criterion is that of Problem 2 with an exponential utility function.
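The goal-reaching criterion of Koenig and Simmons (1994) discussed above can be illustrated with a toy numerical sketch (our own construction, not their block-world examples): a line of states with a single "step" action, unit cost per step and γ > 0 yields a risk-adjusted cost-to-go that exceeds the risk-neutral expected number of steps:

```python
import numpy as np

# States 0..3, goal G = {3}; the action moves right with prob 0.8 and stays
# with prob 0.2, at cost 1 per step, with zero terminal reward. Then
#   V(x) = E^x[exp(gamma * total cost to reach the goal)],
# satisfying V(x) = sum_y q(y|x) e^{gamma c} V(y) with V == 1 on G.

gamma = 0.3
n, goal = 4, 3
V = np.ones(n)

for _ in range(2000):                          # value iteration to a fixed point
    V_new = V.copy()
    for x in range(goal):                      # goal state is absorbing
        stay = 0.2 * np.exp(gamma) * V[x]
        move = 0.8 * np.exp(gamma) * V[x + 1]
        V_new[x] = stay + move
    V = V_new

certainty_equivalent = np.log(V) / gamma       # risk-adjusted cost-to-go
print(certainty_equivalent)
```

From state 2 the risk-neutral expected number of steps is 1/0.8 = 1.25, while the certainty equivalent here is strictly larger, reflecting risk aversion toward the cost.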

CVaR Risk Criterion
Another popular optimization criterion is the CVaR.
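For concreteness, recall that for a discrete loss distribution the CVaR can be computed from the standard Rockafellar-Uryasev variational formula, CVaR_α(X) = min_s { s + E[(X − s)^+]/α }; this representation is standard and not specific to any paper cited below. A minimal sketch:

```python
import numpy as np

# CVaR_alpha of a discrete loss X via the Rockafellar-Uryasev formula:
#   CVaR_alpha(X) = min_s { s + E[(X - s)^+] / alpha },
# i.e. the average of the worst alpha-fraction of outcomes.

def cvar(losses, probs, alpha):
    losses = np.asarray(losses, dtype=float)
    probs = np.asarray(probs, dtype=float)
    # For discrete distributions the minimum over s is attained at some
    # realized loss value, so searching over the support suffices.
    candidates = [s + np.maximum(losses - s, 0.0) @ probs / alpha
                  for s in losses]
    return min(candidates)

losses = [0.0, 1.0, 10.0]
probs = [0.5, 0.4, 0.1]
print(cvar(losses, probs, alpha=0.1))   # average of the worst 10% of outcomes
```

As α → 1 the criterion recovers the plain expectation, while small α concentrates on the tail, which is exactly the knob the applications below tune.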
We start with some examples from operations research and engineering. Gönsch et al (2018) consider dynamic pricing with a risk-averse seller maximizing the CVaR over the selling horizon. The aim is to dynamically adjust the price during the selling horizon in order to sell a fixed capacity of a perishable product, where demand is stochastic, such that the total expected/risk-averse revenue is maximized. As optimization criterion they use the CVaR of the cumulated revenue. More precisely, they consider the setting of Section 5 with a finite time horizon and CVaR, i.e.
where A_k is the price offered at time k by the firm. The state x is the remaining capacity of the good and Y_k are i.i.d. continuous random variables which represent the willingness to pay of a potential customer arriving in period k. The authors use recursive algorithms to solve the problem, based on specific properties of the CVaR, given by V_0(x, α) = 0 for x ≥ 0 and

$$V_t(x, \alpha) = \max_a \mathrm{CVaR}_\alpha\Big( 1_{\{Y_t \ge a\}}\big(a + V_{t-1}(x-1, \alpha z_{t-1,x-1})\big) + 1_{\{Y_t < a\}}\, V_{t-1}(x, \alpha z_{t-1,x}) \Big),$$

where z_{t−1,x−1} and z_{t−1,x} are certain constants arising from the CVaR minimization. A nested formulation with CVaR is considered in Schur et al (2019).
Wozabal and Rameseder (2020) consider multistage stochastic programming approaches to optimize the bidding strategy of a virtual power plant operating on the Spanish spot market for electricity. They consider different setups, among others a nested CVaR approach. Maceira et al (2015) deal with hydrothermal generation planning in Brazil. The aim is to optimize the system operation, taking into account the expected value of thermal generation and possible load curtailment costs over a given set of inflow scenarios for the reservoirs in the future. Risk aversion is crucial here to avoid unacceptable amounts of load curtailment in critical inflow scenarios. The authors use nested CVaR and stochastic dual dynamic programming to solve the problem.
The PhD thesis of Ott (2010) treats several problems of surveillance of critical infrastructures, modeled as stochastic dynamic optimization problems. The author uses CVaR as the criterion in total discounted cost problems and average cost problems. Jiang and Powell (2016) investigate a dynamic decision problem faced by the manager of an electric vehicle charging station, who aims to satisfy the charging demand of the customer while minimizing cost. Since the total time needed to charge the electric vehicle up to capacity is often less than the amount of time that the customer is away, there are opportunities to exploit electricity spot price variations. The authors formulate this problem as a combination of nested CVaR and expectation over a finite time horizon. They identify structural properties of the optimal policy and propose an approximation algorithm based on regression and polynomial optimization to solve the problem.
Zhang et al (2016) consider five decompositions of nested CVaR applied in multistage stochastic linear programming. They apply the proposed formulations to a water management problem in the southeastern portion of Tucson, AZ, to make the best use of the limited water resources available to that region.
Finally, Ahmed et al (2007) solve a multiperiod inventory model with a nested approach of coherent risk measures. For a finite time horizon they prove that the optimal policy has a similar structure to that of the expected value problem. Moreover, an analysis of the monotonicity properties of the optimal order quantity with respect to the degree of risk aversion is conducted for certain risk measures like CVaR.
Applications in financial mathematics and economics are as follows: Staino and Russo (2020) treat portfolio optimization problems with nested CVaR when asset log-returns are stage-wise dependent through a single factor. Using cubic spline interpolation the authors numerically solve the problem with a finite time horizon by backward recursion. A dynamic mean-risk problem, where the risk constraint is given by the CVaR, is considered in Bäuerle and Mundt (2009). The financial market is a binomial model which allows for explicit solutions. Since the problem is solved via a Lagrange function, the CVaR appears in the optimization criterion. It is applied to the cumulated gain/loss and the problem is solved explicitly by recursion.
An application in biology is given in Bushaj et al (2022), where the authors apply a mean-CVaR multistage stochastic mixed-integer programming model to optimize a manager's decisions about the surveillance and control of a non-native forest insect, the emerald ash borer.
As mentioned before, this is just a selection of applications.Further examples can be found in the literature.
(a) The state space E is a Borel space (a non-empty Borel subset of a Polish space).
(b) The action space A is a Borel space.
(c) D ⊂ E × A is the set of admissible state-action combinations. D contains the graph of a measurable mapping f : E → A. The sets D(x) = {a ∈ A : (x, a) ∈ D} of admissible actions in state x are assumed to be compact.
(d) q is a regular conditional distribution from D to E.
(e) The one-stage reward r : D → R_+ is a bounded Borel measurable function, r(x, a) ≤ d for all (x, a) ∈ D, for some constant d > 0.
(a) The sets D(x), x ∈ E, are compact.
(b) For each x ∈ E and every Borel set C ⊂ E the function q(C|x, ·) is continuous on D(x).
(c) The reward r(x, ·) is upper semicontinuous on D(x) for each x ∈ E.

Condition (W):
(a) The sets D(x), x ∈ E, are compact and the mapping x → D(x) is upper semicontinuous.
(b) The transition law q is weakly continuous on D, i.e. the function (x, a) → ∫ h(y) q(dy|x, a) is continuous for each continuous bounded function h.
(c) The reward r is upper semicontinuous on D.