First-order sensitivity of the optimal value in a Markov decision model with respect to deviations in the transition probability function

Markov decision models (MDMs) used in practical applications are most often less complex than the underlying 'true' MDM. The reduction of model complexity is performed for several reasons. However, it is obviously of interest to know what kind of model reduction is reasonable (in regard to the optimal value) and what kind is not. In this article we propose a way to address this question. We introduce a sort of derivative of the optimal value as a function of the transition probabilities, which can be used to measure the (first-order) sensitivity of the optimal value w.r.t. changes in the transition probabilities. 'Differentiability' is obtained for a fairly broad class of MDMs, and the 'derivative' is specified explicitly. Our theoretical findings are illustrated by means of optimization problems in inventory control and mathematical finance.


Introduction
Already in the 1990s, Müller [27] pointed out that the impact of the transition probabilities of a Markov decision process (MDP) on the optimal value of a corresponding Markov decision model (MDM) cannot be ignored for practical purposes. For instance, in most cases the transition probabilities are unknown and have to be estimated by statistical methods. Moreover, in many applications the 'true' model is replaced by an approximate version or by a simplified and thus less complex variant. As a result, in practical applications the optimal strategy (and thus the optimal value) is most often computed on the basis of transition probabilities that differ from the underlying true transition probabilities. Therefore the sensitivity of the optimal value w.r.t. deviations in the transition probabilities is obviously of interest.
Müller [27] showed that under some structural assumptions the optimal value in a discrete-time MDM depends continuously on the transition probabilities, and he established bounds for the approximation error. In doing so, he measured the distance between transition probabilities by means of suitable probability metrics. Even earlier, Kolonko [20] obtained analogous bounds in a MDM in which the transition probabilities depend on a parameter; here the distance between transition probabilities was measured by means of the distance between the respective parameters. Error bounds for the expected total reward of discrete-time Markov reward processes were also derived by van Dijk [40] and van Dijk and Puterman [41]. In the latter reference the authors also discussed the case of discrete-time Markov decision processes with countable state and action spaces.
In this article, we focus on the situation where the 'true' model is replaced by a less complex version (for a simple example, see Subsection 5.4.3 in the supplementary material). The reduction of model complexity in practical applications is common and is performed for several reasons. Apart from computational aspects and the difficulty of taking all relevant factors into account, one major point is that statistical inference for certain transition probabilities can be costly in terms of both time and money. However, it is obviously of interest to know what kind of model reduction is reasonable and what kind is not. In the following we propose a way to address the latter question.
Our original motivation comes from the field of optimal logistics transportation planning, where ongoing projects like SYNCHRO-NET [38] aim at stochastic decision models based on transition probabilities estimated from historical route information. Due to the lack of historical data for unlikely events, transition probabilities are often modeled in a simplified way. In fact, events with small probabilities are often ignored in the model. However, the impact of these events on the optimal value (here the minimal expected transportation costs) of the corresponding MDM may nevertheless be significant. The identification of unlikely but potentially cost-sensitive events is therefore a major challenge. In logistics planning, operations engineers have indeed become increasingly interested in comprehensibly quantifying the sensitivity of the optimal value w.r.t. the incorporation of unlikely events into the model. For background see, for instance, [15,16]. The assessment of rare but risky events is also gaining importance in other areas of application; see, for instance, [21,44] and the references cited therein.
By the incorporation of an unlikely event into the model we mean, for instance, that under performance of an action a at some time n a previously impossible transition from one state x to another state y is now assigned a small but strictly positive probability ε. Mathematically this means that the transition probability P_n((x, a), ·) is replaced by (1 − ε)P_n((x, a), ·) + εQ_n((x, a), ·) with Q_n((x, a), ·) := δ_y[·], where δ_y is the Dirac measure at y. More generally, one could consider a change of the whole transition function (the family of all transition probabilities) P to (1 − ε)P + εQ with ε > 0 small. For operations engineers it is interesting to know how this change affects the optimal value V_0(P). If the effect is minor, then the incorporation can be seen as superfluous, at least from a pragmatic point of view. If on the other hand the effect is significant, then the engineer should consider extending the model and making an effort to get access to statistical data for the extended model.
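For a finite state space, this perturbation is just a convex combination of probability vectors. The following minimal sketch (with hypothetical states and probabilities) incorporates a previously impossible transition by mixing in a Dirac measure:

```python
# Hypothetical one-step transition probabilities P_n((x, a), .) for a fixed
# action a on the finite state space {0, 1, 2}; from state 0 the transition
# to state 2 is impossible under P.
P_row = [0.7, 0.3, 0.0]   # P_n((0, a), .)
Q_row = [0.0, 0.0, 1.0]   # Q_n((0, a), .) = Dirac measure delta_2

eps = 0.01
# Perturbed one-step transition probability (1 - eps) * P + eps * Q:
# state 2 now carries the small but strictly positive probability eps.
P_eps_row = [(1 - eps) * p + eps * q for p, q in zip(P_row, Q_row)]
```

The mixture is again a probability measure, since the weights (1 − ε) and ε sum to one.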
At this point it is worth mentioning that a change of the transition function from P to (1 − ε)P + εQ with ε > 0 small can also have a different interpretation than an incorporation of an (unlikely) new event. It could also be associated with an incorporation of an (unlikely) divergence from the normal transition rules. See Subsection 4.5 for an example.
In this article, we introduce an approach for quantifying the effect on the optimal value V_0(P) of the MDM of changing the transition function from P to (1 − ε)P + εQ, with ε > 0 small. In view of (1 − ε)P + εQ = P + ε(Q − P), we feel that it is reasonable to quantify the effect by a sort of derivative of the value functional V_0 at P evaluated at the direction Q − P. To some extent the 'derivative' V̇_{0;P}(Q − P) specifies the first-order sensitivity of V_0(P) w.r.t. a change of P as above. Note that

V_0(P + ε(Q − P)) − V_0(P) ≈ ε · V̇_{0;P}(Q − P) for ε > 0 small. (1)
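The first-order approximation (1) can be checked numerically in a toy one-stage model. The sketch below uses hypothetical rewards and kernels; since the optimal action does not change for small ε, the optimal value is exactly linear in ε there and the approximation is exact up to rounding:

```python
# One state x, two actions, terminal rewards r over the states {0, 1, 2}.
# All numbers are hypothetical.
r = [0.0, 1.0, 5.0]                  # terminal rewards r_N(y)
P = {0: [0.2, 0.8, 0.0],             # P((x, a), .) for actions a = 0, 1
     1: [0.5, 0.5, 0.0]}
Q = {0: [0.0, 0.0, 1.0],             # Q((x, a), .) = Dirac measure at state 2
     1: [0.0, 0.0, 1.0]}

def value(eps):
    """Optimal value under the mixed transition law (1 - eps)*P + eps*Q."""
    return max(sum(((1 - eps) * P[a][y] + eps * Q[a][y]) * r[y]
                   for y in range(3)) for a in (0, 1))

# First-order sensitivity: the linear coefficient at the unique maximizer a* = 0.
a_star = 0
dV = sum((Q[a_star][y] - P[a_star][y]) * r[y] for y in range(3))

eps = 1e-3
approx_error = abs(value(eps) - (value(0.0) + eps * dV))
```

Here `dV` plays the role of the 'derivative' evaluated at the direction Q − P.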
To be able to compare the first-order sensitivity for (infinitely) many different Q, it is favourable to know that the approximation in (1) is uniform in Q ∈ K for preferably large sets K of transition functions. Moreover, it is not always possible to specify the relevant Q exactly. For that reason it would also be good to have robustness (i.e. some sort of continuity) of V̇_{0;P}(Q − P) in Q. These two requirements led us to focus on a variant of tangential S-differentiability as introduced by Sebastião e Silva [36] and Averbukh and Smolyanov [1] (here S is a family of sets K of transition functions). In Section 3 we present a result on 'S-differentiability' of V_0 for the family S of all relatively compact sets of admissible transition functions and a reasonably broad class of MDMs, where we measure the distance between transition functions by means of metrics based on probability metrics as in [27].
The 'derivative' V̇_{0;P}(Q − P) of the optimal value functional V_0 at P quantifies the effect of a change from P to (1 − ε)P + εQ, with ε > 0 small, assuming that after the change the strategy π (the tuple of the underlying decision rules) is chosen such that it optimizes the target value V^π_0(P′) (e.g. the expected total costs or rewards) in π under the new transition function P′ := (1 − ε)P + εQ. On the other hand, practitioners are also interested in quantifying the impact of a change of P when the optimal strategy (under P) is kept after the change. Such a quantification would answer the question: How differently does a strategy derived in a simplified MDM perform in a more complex (more realistic) variant of the MDM? Since the 'derivative' V̇^π_{0;P}(Q − P) of the functional V^π_0 under a fixed strategy π turns out to be a building block for the derivative V̇_{0;P}(Q − P) of the optimal value functional V_0 at P, our elaborations cover both situations anyway. For a fixed strategy π we obtain 'S-differentiability' of V^π_0 even for the broader family S of all bounded sets of admissible transition functions.
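The distinction between re-optimizing under the new transition function and keeping the strategy that was optimal under P can also be illustrated numerically. In the hypothetical one-stage example below, the perturbation makes a different action optimal, so the two quantifications differ:

```python
# Hypothetical one-stage example: compare the value obtained by keeping the
# strategy that is optimal under P with the value obtained by re-optimizing
# under the perturbed law P' = (1 - eps)*P + eps*Q.
r = [0.0, 1.0, 2.0]                  # terminal rewards
P = {0: [0.0, 1.0, 0.0],             # under P the optimal action is a = 0
     1: [0.1, 0.9, 0.0]}
Q = {0: [1.0, 0.0, 0.0],             # divergence scenario for each action
     1: [0.0, 0.0, 1.0]}

def value_of_action(a, eps):
    return sum(((1 - eps) * P[a][y] + eps * Q[a][y]) * r[y] for y in range(3))

eps = 0.1
v_keep = value_of_action(0, eps)                        # keep old optimum a = 0
v_reopt = max(value_of_action(a, eps) for a in (0, 1))  # re-optimize under P'
```

Here the strategy derived in the unperturbed model performs strictly worse than the one re-optimized under the perturbation, which is exactly the gap the fixed-strategy 'derivative' measures to first order.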
The 'derivative' which we propose to regard as a measure for the first-order sensitivity will formally be introduced in Definition 3.9. This definition is applicable to quite general finite time horizon MDMs and might look somewhat cumbersome at first glance. However, in the special case of a finite state space and finite action spaces, a situation one faces in many practical applications, the proposed 'differentiability' boils down to a rather intuitive concept. This will be explained in Section 5 of the supplementary material with a minimum of notation and terminology. In Section 5 of the supplementary material we will also reformulate a backward iteration scheme for the computation of the 'derivative' (which can be deduced from our main result, Theorem 3.14) in the discrete case, and we will discuss an example.
In Section 2 we formally introduce quite general MDMs in the fashion of the standard monographs [2,12,13,30]. Since it is important to have an elaborate notation in order to formulate our main result, we are very precise in Section 2. As a result, this section is a little longer compared to the respective sections in other articles on MDMs. In Section 3 we carefully introduce our notion of 'differentiability' and state our main result concerning the computation of the 'derivative' of the value functional.
In Section 4 we apply the results of Section 3 to assess the impact of one or more unlikely but substantial shocks in the dynamics of an asset on the solution of a terminal wealth problem in a (simple) financial market model free of shocks. This example also motivates the general set-up chosen in Sections 2-3. All results of this article are proved in Sections 7-9 of the supplementary material. For the convenience of the reader, we recall in Section 10 of the supplementary material a result on the existence of optimal strategies in general MDMs. Section 11 of the supplementary material contains an auxiliary topological result.

Formal definition of Markov decision model
Let E be a non-empty set equipped with a σ-algebra E, referred to as state space. Let N ∈ N be a fixed finite time horizon (or planning horizon) in discrete time. For each point of time n = 0, . . . , N − 1 and each state x ∈ E, let A_n(x) be a non-empty set. The elements of A_n(x) will be seen as the admissible actions (or controls) at time n in state x. For each n = 0, . . . , N − 1, let A_n := ⋃_{x∈E} A_n(x) and D_n := {(x, a) ∈ E × A_n : a ∈ A_n(x)}.
The elements of A_n can be seen as the actions that may basically be selected at time n, whereas the elements of D_n are the possible state-action combinations at time n. For our subsequent analysis, we equip A_n with a σ-algebra A_n, and let D_n := (E ⊗ A_n) ∩ D_n be the trace of the product σ-algebra E ⊗ A_n in D_n. Recall that a map P_n : D_n × E → [0, 1] is said to be a probability kernel (or Markov kernel) from (D_n, D_n) to (E, E) if P_n(·, B) is a (D_n, B([0, 1]))-measurable map for any B ∈ E, and P_n((x, a), ·) ∈ M_1(E) for any (x, a) ∈ D_n. Here M_1(E) is the set of all probability measures on (E, E).

Markov decision process
In this subsection, we give a formal definition of an E-valued (discrete-time) Markov decision process (MDP) associated with a given initial state, a given transition function, and a given strategy. By definition, a (Markov decision) transition (probability) function is an N-tuple P = (P_0, . . . , P_{N−1}) whose n-th entry P_n is a probability kernel from (D_n, D_n) to (E, E). In this context, P_n will be referred to as the one-step transition (probability) kernel at time n (or from time n to n + 1), and the probability measure P_n((x, a), ·) is referred to as the one-step transition probability at time n (or from time n to n + 1) given state x and action a. We denote by P the set of all transition functions. We assume that the actions are performed according to a so-called N-stage strategy (or N-stage policy). An (N-stage) strategy is an N-tuple π = (f_0, . . . , f_{N−1}) of decision rules at times n = 0, . . . , N − 1, where a decision rule at time n is an (E, A_n)-measurable map f_n : E → A_n satisfying f_n(x) ∈ A_n(x) for all x ∈ E. Note that a decision rule at time n is (deterministic and) 'Markovian' since it only depends on the current state and is independent of previous states and actions. We denote by F_n the set of all decision rules at time n, and assume that F_n is non-empty. Hence a strategy is an element of the set F_0 × · · · × F_{N−1}, and this set can be seen as the set of all strategies. Moreover, we fix for any n = 0, . . . , N − 1 some F_n ⊆ F_n which can be seen as the set of all admissible decision rules at time n. In particular, the set Π := F_0 × · · · × F_{N−1} can be seen as the set of all admissible strategies.
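In the finite setting, these objects have a direct computational counterpart. A minimal sketch (with hypothetical state and action spaces) of state-dependent admissible action sets and a deterministic Markovian decision rule:

```python
# Minimal sketch (hypothetical spaces): E = {0, 1}, horizon N = 2, with
# state-dependent admissible actions A_n(x) and a Markovian decision rule
# f_n : E -> A_n satisfying f_n(x) in A_n(x).
E = [0, 1]
N = 2
A = {n: {0: ["stay", "move"], 1: ["stay"]} for n in range(N)}   # A_n(x)

def f(n, x):
    """A deterministic Markovian decision rule: depends only on (n, x)."""
    return "move" if x == 0 and n == 0 else "stay"

# An N-stage strategy pi = (f_0, ..., f_{N-1}); admissibility check:
admissible = all(f(n, x) in A[n][x] for n in range(N) for x in E)
```

The admissibility check mirrors the requirement f_n(x) ∈ A_n(x) for all x ∈ E.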
For any P = (P_n)^{N−1}_{n=0} ∈ P and π = (f_n)^{N−1}_{n=0} ∈ Π, let P^π_n(x, ·) := P_n((x, f_n(x)), ·), which defines a probability kernel P^π_n from (E, E) to (E, E). The probability measure P^π_n(x, ·) can be seen as the one-step transition probability at time n given state x when the transitions and actions are governed by P and π, respectively. Now, consider the measurable space (Ω, F) := (E^{N+1}, E^{⊗(N+1)}).
For any x_0 ∈ E, P = (P_n)^{N−1}_{n=0} ∈ P, and π ∈ Π, define the probability measure

P^{x_0,P;π} := δ_{x_0} ⊗ P^π_0 ⊗ · · · ⊗ P^π_{N−1} (3)

on (Ω, F), where x_0 should be seen as the initial state of the MDP to be constructed. The right-hand side of (3) is the usual product of the probability measure δ_{x_0} and the kernels P^π_0, . . . , P^π_{N−1}; for details see display (59) in Section 6 of the supplementary material. Moreover, let X = (X_0, . . . , X_N) be the identity on Ω, i.e.

X_n((ω_0, . . . , ω_N)) := ω_n for n = 0, . . . , N. (4)
Note that, for any x_0 ∈ E, P = (P_n)^{N−1}_{n=0} ∈ P, and π ∈ Π, the map X can be regarded as an (E^{N+1}, E^{⊗(N+1)})-valued random variable on the probability space (Ω, F, P^{x_0,P;π}) with distribution δ_{x_0} ⊗ P^π_0 ⊗ · · · ⊗ P^π_{N−1}. It follows from Lemma 6.1 in the supplementary material that assertions (i)-(iv) below hold for any x_0, x_1, . . . , x_n ∈ E, P = (P_n)^{N−1}_{n=0} ∈ P, π = (f_n)^{N−1}_{n=0} ∈ Π, and n = 1, . . . , N − 1. The formulation of (ii)-(iv) is somewhat sloppy, because in general a (regular version of the) factorized conditional distribution of X given Y under P^{x_0,P;π} (evaluated at a fixed set B ∈ E) is only P^{x_0,P;π}_Y-a.s. unique. So assertion (iv) in fact means that the probability kernel P_n((·, f_n(·)), ·) provides a (regular version of the) factorized conditional distribution of X_{n+1} given X_n under P^{x_0,P;π}, and analogously for (ii) and (iii). Note that the factorized conditional distribution in part (ii) is constant w.r.t. x_0 ∈ E. Assertions (iii) and (iv) together imply that the temporal evolution of X is Markovian. This justifies the following terminology.

Markov decision model and value function
Maintain the notation and terminology introduced in Subsection 2.1. In this subsection, we first define a (discrete-time) Markov decision model (MDM) and subsequently introduce the corresponding value function. The latter will be derived from a reward maximization problem. Fix P ∈ P, and let, for each point of time n = 0, . . . , N − 1, r_n : D_n → R be a (D_n, B(R))-measurable map, referred to as one-stage reward function. Here r_n(x, a) specifies the one-stage reward when action a is taken at time n in state x. Let r_N : E → R be an (E, B(R))-measurable map, referred to as terminal reward function. The value r_N(x) specifies the reward of being in state x at terminal time N.
Denote by A the family of all sets A n (x), n = 0, . . . , N −1, x ∈ E, and set r := (r n ) N n=0 . Moreover let X be defined as in (4) and recall Definition 2.1. Then we define our MDM as follows.

Definition 2.2 (MDM)
The quintuple (X, A, P , Π, r) is called (discrete-time) Markov decision model (MDM) associated with the family of action spaces A, transition function P ∈ P, set of admissible strategies Π, and reward functions r.
In the sequel we will always assume that a MDM (X, A, P, Π, r) satisfies the following Assumption (A). In Subsection 3.1 we discuss some conditions on the MDM under which Assumption (A) holds. We use E^{x_0,P;π}_{n,x_n} to denote the expectation w.r.t. the factorized conditional distribution P^{x_0,P;π}[· | X_n = x_n]. For n = 0 we clearly have E^{x_0,P;π}_{0,x_0} = E^{x_0,P;π} for every x_0 ∈ E, where E^{x_0,P;π} denotes the expectation w.r.t. P^{x_0,P;π}; see Lemma 6.1 in the supplementary material. In what follows we use the convention that the sum over the empty set is zero.
Under Assumption (A) we may define in a MDM (X, A, P, Π, r), for any π = (f_n)^{N−1}_{n=0} ∈ Π and n = 0, . . . , N, a map V^{P;π}_n : E → R through

V^{P;π}_n(x_n) := E^{x_0,P;π}_{n,x_n}[ Σ^{N−1}_{k=n} r_k(X_k, f_k(X_k)) + r_N(X_N) ], x_n ∈ E. (5)

As a factorized conditional expectation this map is (E, B(R))-measurable (for any π ∈ Π and n = 0, . . . , N). Note that for n = 1, . . . , N the right-hand side of (5) does not depend on x_0; see Lemma 6.2 in the supplementary material. Therefore the map V^{P;π}_n(·) need not be equipped with an index x_0.
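For finite state and action spaces, the maps V^{P;π}_n can be evaluated by the standard backward recursion (policy evaluation), starting from the terminal reward. A minimal sketch with hypothetical data:

```python
# Policy evaluation by backward recursion in a finite MDM with hypothetical
# data (E = {0, 1}, N = 2): starting from V_N = r_N, it computes
#   V_n(x) = r_n(x, f_n(x)) + sum_y P_n((x, f_n(x)), y) * V_{n+1}(y).
E = [0, 1]
N = 2
r_N = {0: 0.0, 1: 1.0}                 # terminal reward function r_N

def r(n, x, a):
    return 0.1 if a == "go" else 0.0   # one-stage reward r_n(x, a)

def f(n, x):
    return "go" if x == 0 else "wait"  # fixed strategy pi = (f_0, f_1)

def P(n, x, a):
    # one-step transition probabilities P_n((x, a), .) over E
    return [0.5, 0.5] if a == "go" else [1.0, 0.0]

V = {N: dict(r_N)}
for n in range(N - 1, -1, -1):
    V[n] = {x: r(n, x, f(n, x))
               + sum(P(n, x, f(n, x))[y] * V[n + 1][y] for y in E)
            for x in E}

V0 = V[0]   # expected total reward under pi for each initial state
```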
The value V^{P;π}_n(x_n) specifies the expected total reward from time n to N of X under P^{x_0,P;π} when strategy π is used and X is in state x_n at time n. It is natural to ask for those strategies π ∈ Π for which the expected total reward from time 0 to N is maximal for all initial states x_0 ∈ E. This results in the following optimization problem:

V^{P;π}_0(x_0) −→ max (in π ∈ Π) for all x_0 ∈ E. (6)

If a solution π_P to the optimization problem (6) (in the sense of Definition 2.4 ahead) exists, then the corresponding maximal expected total reward is given by the so-called value function (at time 0).

Definition 2.3 (Value function)
For a MDM (X, A, P, Π, r) the value function at time n ∈ {0, . . . , N} is the map V^P_n : E → R defined by

V^P_n(x) := sup_{π∈Π} V^{P;π}_n(x), x ∈ E. (7)

Note that the value function V^P_n is well defined due to Assumption (A) but not necessarily (E, B(R))-measurable. Measurability holds true, for example, if the sets F_n, . . . , F_{N−1} are at most countable or if conditions (a)-(c) of Theorem 10.3 in the supplementary material are satisfied; see also Remark 10.4(i) in the supplementary material.
Definition 2.4 (Optimal strategy) In a MDM (X, A, P, Π, r) a strategy π_P ∈ Π is called optimal w.r.t. P if

V^{P;π_P}_0(x_0) = V^P_0(x_0) for all x_0 ∈ E. (8)

In this case V^{P;π_P}_0(x_0) is called the optimal value (function), and we denote by Π(P) the set of all optimal strategies w.r.t. P. Further, for any given δ > 0, a strategy π_{P;δ} ∈ Π is called δ-optimal w.r.t. P in a MDM (X, A, P, Π, r) if

V^P_0(x_0) − δ ≤ V^{P;π_{P;δ}}_0(x_0) for all x_0 ∈ E, (9)

and we denote by Π(P; δ) the set of all δ-optimal strategies w.r.t. P.
Note that condition (8) requires that π_P ∈ Π be an optimal strategy for all possible initial states x_0 ∈ E. However, in some situations it might be sufficient to ensure that π_P ∈ Π is an optimal strategy only for some fixed initial state x_0. For a brief discussion of the existence and computation of optimal strategies, see Section 10 of the supplementary material.
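For finite state and action spaces, the value function and an optimal strategy can be computed by classical backward induction (the Bellman scheme). A minimal sketch with hypothetical data; the recursion below is the standard one and not specific to this article:

```python
# Backward induction (Bellman scheme) for a finite MDM with hypothetical data:
#   V_n(x) = max_{a in A(x)} { r_n(x, a) + sum_y P_n((x, a), y) * V_{n+1}(y) },
# together with a maximizing decision rule f_n(x).
E = [0, 1]
N = 2
A = {0: ["go", "wait"], 1: ["wait"]}        # admissible actions, A_n(x) = A[x]
r_N = {0: 0.0, 1: 1.0}

def r(n, x, a):
    return -0.05 if a == "go" else 0.0      # one-stage rewards (a cost to "go")

def P(n, x, a):
    return [0.2, 0.8] if a == "go" else ([1.0, 0.0] if x == 0 else [0.0, 1.0])

V = {N: dict(r_N)}
f = {}                                      # maximizing decision rules f_n
for n in range(N - 1, -1, -1):
    V[n], f[n] = {}, {}
    for x in E:
        vals = {a: r(n, x, a) + sum(P(n, x, a)[y] * V[n + 1][y] for y in E)
                for a in A[x]}
        f[n][x] = max(vals, key=vals.get)
        V[n][x] = vals[f[n][x]]
```

The tuple (f_0, f_1) collected in `f` is an optimal strategy in the sense of Definition 2.4 for this toy model.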
Remark 2.5 (i) In practice, the choice of an action can possibly be based on historical observations of states and actions. In particular, one could relinquish the Markov property of the decision rules and allow them to depend also on previous states and actions. Then one might hope that the corresponding (deterministic) history-dependent strategies improve the optimal value of a MDM (X, A, P, Π, r). However, it is known that the optimal value of a MDM (X, A, P, Π, r) cannot be enhanced by considering history-dependent strategies; see, e.g., Theorem 18.4 in [13] or Theorem 4.5.1 in [30].
(ii) Instead of considering the reward maximization problem (6) one could as well be interested in minimizing expected total costs over the time horizon N. In this case, one can maintain the previous notation and terminology when regarding the functions r n and r N as the one-stage costs and the terminal costs, respectively. The only thing one has to do is to replace "sup" by "inf" in the representation (7) of the value function. Accordingly, a strategy π P ;δ ∈ Π will be δ-optimal for a given δ > 0 if in condition (9) "−δ" and "≤" are replaced by "+δ" and "≥". ✸

'Differentiability' in P of the optimal value
In this section, we show that the value function of a MDM, regarded as a real-valued functional on a set of transition functions, is 'differentiable' in a certain sense. The notion of 'differentiability' we use for functionals that are defined on a set of admissible transition functions will be introduced in Subsection 3.4. The motivation of our notion of 'differentiability' was discussed subsequent to (1). Before defining 'differentiability' in a precise way, we will explain in Subsections 3.2-3.3 how we measure the distance between transition functions. In Subsections 3.5-3.6 we will specify the 'Hadamard derivative' of the value function. At first, however, we will discuss in Subsection 3.1 some conditions under which Assumption (A) holds true. Throughout this section, A, Π, and r are fixed.

Bounding functions
Recall from Section 2 that P stands for the set of all transition functions, i.e. of all N-tuples P = (P_n)^{N−1}_{n=0} of probability kernels P_n from (D_n, D_n) to (E, E). Let ψ : E → R_{≥1} be an (E, B(R_{≥1}))-measurable map, referred to as gauge function, where R_{≥1} := [1, ∞). Denote by M(E) the set of all (E, B(R))-measurable maps h ∈ R^E, and let M_ψ(E) be the set of all h ∈ M(E) satisfying ||h||_ψ := sup_{x∈E} |h(x)|/ψ(x) < ∞. The following definition is adapted from [2,27,43]. Conditions (a)-(c) of this definition are sufficient for the well-definedness of V^{P;π}_n (and V^P_n); see Lemma 3.2 ahead.
Definition 3.1 (Bounding function) A gauge function ψ is called a bounding function for the family of MDMs {(X, A, P, Π, r) : P ∈ P′} if there exist finite constants K_1, K_2, K_3 > 0 such that the following conditions hold for any n = 0, . . . , N − 1 and P = (P_n)^{N−1}_{n=0} ∈ P′:
(a) |r_n(x, a)| ≤ K_1 ψ(x) for all (x, a) ∈ D_n;
(b) |r_N(x)| ≤ K_2 ψ(x) for all x ∈ E;
(c) ∫_E ψ(y) P_n((x, a), dy) ≤ K_3 ψ(x) for all (x, a) ∈ D_n.
If P′ = {P} for some P ∈ P, then ψ is called a bounding function for the MDM (X, A, P, Π, r).
Note that the conditions in Definition 3.1 do not depend on the set Π. That is, the terminology bounding function is independent of the set of all (admissible) strategies. Also note that conditions (a) and (b) can be satisfied by unbounded reward functions.
The following lemma, whose proof can be found in Subsection 7.1 of the supplementary material, ensures that Assumption (A) is satisfied when the underlying MDM possesses a bounding function. Lemma 3.2 Let P ′ ⊆ P. If the family of MDMs {(X, A, P , Π, r) : P ∈ P ′ } possesses a bounding function ψ, then Assumption (A) is satisfied for any P ∈ P ′ . Moreover, the expectation in Assumption (A) is even uniformly bounded w.r.t. P ∈ P ′ , and V P ;π n (·) is contained in M ψ (E) for any P ∈ P ′ , π ∈ Π, and n = 0, . . . , N.

Metric on set of probability measures
In Subsection 3.4 we will work with a (semi-)metric (on a set of transition functions) to be defined in (11) below. As is common in the theory of probability metrics (see, e.g., p. 10 ff. in [31]), we allow the distance between two probability measures and the distance between two transition functions to be infinite. That is, we adopt the axioms of a (semi-)metric but allow a (semi-)metric to take values in R̄_{≥0} := R_{≥0} ∪ {∞} rather than only in R_{≥0} := [0, ∞).
Let ψ be any gauge function, and denote by M^ψ_1(E) the set of all µ ∈ M_1(E) for which ∫_E ψ dµ < ∞. Note that the integral ∫_E h dµ exists and is finite for any h ∈ M_ψ(E) and µ ∈ M^ψ_1(E). For any fixed M ⊆ M_ψ(E), the distance between two probability measures µ, ν ∈ M^ψ_1(E) can be measured by

d_M(µ, ν) := sup_{h∈M} | ∫_E h dµ − ∫_E h dν |. (10)

Note that (10) indeed defines a map d_M : M^ψ_1(E) × M^ψ_1(E) → R̄_{≥0} which is symmetric and fulfills the triangle inequality, i.e. d_M provides a semi-metric. If M separates points in M^ψ_1(E) (i.e. if any two µ, ν ∈ M^ψ_1(E) coincide when ∫_E h dµ = ∫_E h dν for all h ∈ M), then d_M is even a metric. It is sometimes called integral probability metric or probability metric with a ζ-structure; see [28,45]. In some situations the (semi-)metric d_M (with M fixed) can be represented by the right-hand side of (10) with M replaced by a different subset M′ of M_ψ(E). Each such set M′ is said to be a generator of d_M. The largest generator of d_M is called the maximal generator of d_M and is denoted by M̂; that is, M̂ is defined to be the set of all h ∈ M_ψ(E) satisfying |∫_E h dµ − ∫_E h dν| ≤ d_M(µ, ν) for all µ, ν ∈ M^ψ_1(E). For instance, if (E, d_E) is a metric space, E := B(E), and M is the set of all Lipschitz continuous functions h : E → R with Lipschitz constant 1, then d_M coincides with the Kantorovich metric. In this case the ψ-weak topology is also referred to as the L1-weak topology. Note that the L1-Wasserstein metric is a conventional metric for measuring the distance between probability distributions; see, for instance, [7,18,39] for the general concept and [4,19,22,24] for recent applications. ✸ Although the Kantorovich metric is a popular and well established metric, for the application in Section 4 we will need the following generalization from α = 1 to α ∈ (0, 1].
Example 3.7 Assume that (E, d_E) is a metric space and let E := B(E). For some fixed x′ ∈ E and α ∈ (0, 1], let ψ(x) := 1 + d_E(x, x′)^α, and let M_Höl,α be the set of all functions h : E → R satisfying |h(x) − h(y)| ≤ d_E(x, y)^α for all x, y ∈ E. The set M_Höl,α separates points in M^ψ_1(E) (this follows with similar arguments as in the proof of Lemma 9.3.2 in [8]). Then d_M provides a metric on M^ψ_1(E), which we denote by d_Höl,α and refer to as the Hölder-α metric. Especially when dealing with risk averse utility functions (as, e.g., in Section 4) this metric can be beneficial. Lemma 11.1 in Section 11 of the supplementary material shows that if E is complete and separable then d_Höl,α metricizes the ψ-weak topology on M^ψ_1(E). ✸
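For probability measures on the real line, the Kantorovich (L1-Wasserstein) metric admits the well-known representation as the integral of the absolute difference of the two distribution functions. A minimal sketch for two discrete distributions with hypothetical supports and weights; note that the distance between µ and the mixture (1 − ε)µ + εδ_2 scales linearly in ε:

```python
# Kantorovich (L1-Wasserstein) distance between two discrete distributions on
# the real line via the CDF representation d(mu, nu) = integral |F_mu - F_nu| dt.
# Supports and weights are hypothetical.
support = [0.0, 1.0, 2.0]
mu = [0.7, 0.3, 0.0]
nu = [0.7 * 0.99, 0.3 * 0.99, 0.01]   # nu = 0.99 * mu + 0.01 * delta_2

def w1(p, q, xs):
    dist, Fp, Fq = 0.0, 0.0, 0.0
    for i in range(len(xs) - 1):
        Fp += p[i]                     # cumulative distribution functions
        Fq += q[i]
        dist += abs(Fp - Fq) * (xs[i + 1] - xs[i])
    return dist

dist = w1(mu, nu, support)             # = 0.01 * w1(mu, delta_2, support)
```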

Metric on set of transition functions
Maintain the notation from Subsection 3.2. Let us denote by P_ψ the set of all transition functions P = (P_n)^{N−1}_{n=0} ∈ P satisfying ∫_E ψ(y) P_n((x, a), dy) < ∞ for all (x, a) ∈ D_n and n = 0, . . . , N − 1. That is, P_ψ consists of those transition functions P = (P_n)^{N−1}_{n=0} ∈ P with P_n((x, a), ·) ∈ M^ψ_1(E) for all (x, a) ∈ D_n and n = 0, . . . , N − 1. Hence, for the elements P = (P_n)^{N−1}_{n=0} of P_ψ all integrals of the form ∫_E h(y) P_n((x, a), dy), h ∈ M_ψ(E), (x, a) ∈ D_n, n = 0, . . . , N − 1, exist and are finite. In particular, for two transition functions P = (P_n)^{N−1}_{n=0} and Q = (Q_n)^{N−1}_{n=0} from P_ψ the distance d_M(P_n((x, a), ·), Q_n((x, a), ·)) is well defined for all (x, a) ∈ D_n and n = 0, . . . , N − 1 (recall that M ⊆ M_ψ(E)). So we can define the distance between two transition functions P = (P_n)^{N−1}_{n=0} and Q = (Q_n)^{N−1}_{n=0} from P_ψ by

d^φ_{∞,M}(P, Q) := max_{n=0,...,N−1} sup_{(x,a)∈D_n} d_M(P_n((x, a), ·), Q_n((x, a), ·)) / φ(x) (11)

for another gauge function φ : E → R_{≥1}. Note that (11) defines a semi-metric d^φ_{∞,M} : P_ψ × P_ψ → R̄_{≥0} on P_ψ which is even a metric if M separates points in M^ψ_1(E). Maybe apart from the factor 1/φ(x), the definition of d^φ_{∞,M}(P, Q) in (11) is quite natural and in line with the definition of a distance introduced by Müller [27, p. 880]. In [27], Müller considers time-homogeneous MDMs, so that the transition kernels do not depend on n. He fixes a state x and takes the supremum only over all admissible actions a in state x. That is, for any x ∈ E he defines the distance between P((x, ·), ·) and Q((x, ·), ·) by sup_{a∈A(x)} d_M(P((x, a), ·), Q((x, a), ·)). To obtain a reasonable distance between P_n and Q_n it is, however, natural to take the supremum of the distance between P_n((x, ·), ·) and Q_n((x, ·), ·) uniformly over a and over x.
The factor 1/φ(x) in (11) makes the (semi-)metric d^φ_{∞,M} less strict than the (semi-)metric d^1_{∞,M}, which is defined as in (11) with φ :≡ 1. For a motivation for considering the factor 1/φ(x), see part (iii) of Remark 3.10 and the discussion afterwards.
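For finite state spaces, a distance of the form (11) can be evaluated directly. The sketch below uses hypothetical kernels, the total variation distance as d_M (the integral probability metric generated by indicators of measurable sets), and a hypothetical gauge function φ:

```python
# Distance between two transition functions on a finite state space, sketched
# with d_M = total variation distance and a hypothetical gauge function phi.
# Kernels are dicts (n, x, a) -> distribution over E; here N = 1.
E = [0, 1, 2]
D = [(0, 0, "a"), (0, 1, "a")]        # state-action pairs (n, x, a)
P = {(0, 0, "a"): [0.7, 0.3, 0.0], (0, 1, "a"): [0.0, 1.0, 0.0]}
Q = {(0, 0, "a"): [0.0, 0.0, 1.0], (0, 1, "a"): [0.0, 1.0, 0.0]}

def tv(mu, nu):
    """Total variation distance sup_B |mu(B) - nu(B)| on a finite space."""
    return 0.5 * sum(abs(m - n) for m, n in zip(mu, nu))

def phi(x):
    return 1.0 + x                    # hypothetical gauge function, phi >= 1

# Supremum over all time indices and state-action pairs, scaled by 1/phi(x):
dist = max(tv(P[k], Q[k]) / phi(k[1]) for k in D)
```

The supremum is attained at the pair whose kernels differ; the factor 1/φ(x) down-weights deviations at states where φ is large.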

Definition of 'differentiability'
Let ψ be any gauge function, and fix some P ψ ⊆ P ψ being closed under mixtures (i.e. (1 − ε)P + εQ ∈ P ψ for any P , Q ∈ P ψ , ε ∈ (0, 1)). The set P ψ will be equipped with the distance d φ ∞,M introduced in (11). In Definition 3.9 below we will introduce a reasonable notion of 'differentiability' for an arbitrary functional V : P ψ → L taking values in a normed vector space (L, · L ). It is related to the general functional analytic concept of (tangential) S-differentiability introduced by Sebastião e Silva [36] and Averbukh and Smolyanov [1]; see also [9,11,37] for applications. However, P ψ is not a vector space. This implies that Definition 3.9 differs from the classical notion of (tangential) S-differentiability. For that reason we will use inverted commas and write 'S-differentiability' instead of S-differentiability. Due to the missing vector space structure, we in particular need to allow the tangent space to depend on the point P ∈ P ψ at which V is differentiated. The role of the 'tangent space' will be played by the set P P ;± ψ := {Q − P : Q ∈ P ψ } whose elements Q − P := (Q 0 − P 0 , . . . , Q N −1 − P N −1 ) can be seen as signed transition functions. In Definition 3.9 we will employ the following terminology.
Definition 3.8 ((M, φ)-continuity) Let M ⊆ M_ψ(E), φ be another gauge function, and fix P ∈ P_ψ. A map W : P^{P;±}_ψ → L is said to be (M, φ)-continuous if W(Q_m − P) → W(Q − P) in (L, ||·||_L) for any Q, Q_1, Q_2, . . . ∈ P_ψ with d^φ_{∞,M}(Q_m, Q) → 0. For the following definition it is important to note that P + ε(Q − P) lies in P_ψ for any P, Q ∈ P_ψ and ε ∈ (0, 1]. Definition 3.9 ('S-differentiability') Let M ⊆ M_ψ(E), φ be another gauge function, and fix P ∈ P_ψ. Moreover, let S be a system of subsets of P_ψ. A map V : P_ψ → L is said to be 'S-differentiable' at P w.r.t. (M, φ) if there exists an (M, φ)-continuous map V̇_P : P^{P;±}_ψ → L such that

sup_{Q∈K} || (V(P + ε_m(Q − P)) − V(P))/ε_m − V̇_P(Q − P) ||_L −→ 0 (12)

for every K ∈ S and every sequence (ε_m) ∈ (0, 1]^N with ε_m → 0. In this case, V̇_P is called the 'S-derivative' of V at P w.r.t. (M, φ).
Note that in Definition 3.9 the derivative is not required to be linear (in fact the derivative is not even defined on a vector space). This is another point where Definition 3.9 differs from the functional analytic definition of (tangential) S-differentiability. However, non-linear derivatives are common in the field of mathematical optimization; see, for instance, [32,37]. Remark 3.10 (i) At least in the case L = R, the 'S-derivative' V̇_P evaluated at Q − P, i.e. V̇_P(Q − P), can be seen as a measure for the first-order sensitivity of the functional V : P_ψ → R w.r.t. a change of the argument from P to (1 − ε)P + εQ, with ε > 0 small, for some given transition function Q.
(ii) The prefix 'S-' in Definition 3.9 provides the following information. Since the convergence in (12) is required to be uniform in Q ∈ K, the values of the first-order sensitivitiesV P (Q − P ), Q ∈ K, can be compared with each other with clear conscience for any fixed K ∈ S. It is therefore favorable if the sets in S are large. However, the larger the sets in S, the stricter the condition of 'S-differentiability'.
(iii) The subset M (⊆ M_ψ(E)) and the gauge function φ tell us in a way how 'robust' the 'S-derivative' V̇_P is w.r.t. changes in Q: The smaller the set M and the 'steeper' the gauge function φ, the less strict the metric d^φ_{∞,M}(P, Q) (given by (11)), and therefore the more robust V̇_P(Q − P) is in Q. It is thus favorable if the set M is small and the gauge function φ is 'steep'. However, the smaller M and the 'steeper' φ, the stricter the condition of 'S-differentiability'. More precisely, if M_1 ⊆ M_2 and φ_1 ≥ φ_2, then 'S-differentiability' w.r.t. (M_1, φ_1) implies 'S-differentiability' w.r.t. (M_2, φ_2). Also note that in general the choice of S in Definition 3.9 is not influenced by the choice of the pair (M, φ), and vice versa. ✸ In the general framework of our main result (Theorem 3.14) we cannot choose φ 'steeper' than the gauge function ψ, which plays the role of a bounding function there. Indeed, the proof of (M, ψ)-continuity of the map V̇_P : P^{P;±}_ψ → R in Theorem 3.14 no longer works if d^ψ_{∞,M} is replaced by d^φ_{∞,M} for any gauge function φ 'steeper' than ψ. And here it does not matter how exactly S is chosen.
In the application in Section 4, the set {Q ∆,τ : ∆ ∈ [0, δ]} should be contained in S (for details see Remark 4.8). This set can be shown to be (relatively) compact w.r.t. d φ ∞,M for φ(x) = ψ(x) (:= 1 + u α (x)) but not for any 'flatter' gauge function φ. So, in this example, and certainly in many other examples, relatively compact subsets of P ψ w.r.t. d ψ ∞,M should be contained in S. It is thus often beneficial to know that the value functional is 'differentiable' in the sense of part (b) of the following Definition 3.11.
The terminology of Definition 3.11 is motivated by the functional analytic analogues. Bounded and relatively compact sets in the (semi-)metric space (P ψ , d φ ∞,M ) are understood in the conventional way. The system of all bounded sets and the system of all relatively compact sets (w.r.t. d φ ∞,M ) are larger the 'steeper' the gauge function φ is.
The following lemma, whose proof can be found in Subsection 7.2 of the supplementary material, provides an equivalent characterization of 'Hadamard differentiability'.

Lemma 3.12 Let M ⊆ M ψ (E), let φ be another gauge function, let V : P ψ → L be any map, and fix P ∈ P ψ . Then the following two assertions hold.

'Differentiability' of the value functional
Recall that A, Π, and r are fixed, and let V P ;π n and V P n be defined as in (5) and (7), respectively. Moreover let ψ be any gauge function and fix some set P ψ ⊆ P ψ which is closed under mixtures.
In view of Lemma 3.2 (with P ′ := {P }), condition (a) of Theorem 3.14 below ensures that Assumption (A) is satisfied for any P ∈ P ψ . Under condition (a) of Theorem 3.14 we may therefore define, for any x n ∈ E, π ∈ Π, and n = 0, . . . , N, functionals V xn;π n : P ψ → R and V xn n : P ψ → R by V xn;π n (P ) := V P ;π n (x n ) and V xn n (P ) := V P n (x n ), respectively. Note that V xn n (P ) specifies the maximal value of the expected total reward in the MDM (given state x n at time n) when the underlying transition function is P . By analogy with the name 'value function' we refer to V xn n as the value functional given state x n at time n. Part (ii) of Theorem 3.14 provides (under some assumptions) a 'Hadamard derivative' of the value functional V xn n in the sense of Definition 3.11. Conditions (b) and (c) of Theorem 3.14 involve the so-called Minkowski (or gauge) functional, where we use the convention inf ∅ := ∞, M is any subset of M ψ (E), and we set R >0 := (0, ∞). We note that Müller [27] also used the Minkowski functional to formulate his assumptions.
Example 3.13 provides, for the sets M (and the corresponding gauge functions ψ) from the examples above, explicit expressions for the Minkowski functional. Recall from Definition 2.4 that for given P ∈ P ψ and δ > 0 the sets Π(P ; δ) and Π(P ) consist of all δ-optimal strategies w.r.t. P and of all optimal strategies w.r.t. P , respectively. Generators M ′ of d M were introduced subsequent to (10).

Theorem 3.14 ('Differentiability' of V xn;π n and V xn n ) Let M ⊆ M ψ (E) and let M ′ be any generator of d M . Fix P = (P n ) N −1 n=0 ∈ P ψ , and assume that the following three conditions hold.
(a) ψ is a bounding function for the MDM (X, A, Q, Π, r) for any Q ∈ P ψ .
Then the following two assertions hold.
The proof of Theorem 3.14 can be found in Section 8 of the supplementary material. Note that the set Π(P ; δ) shrinks as δ decreases, so the right-hand side of (17) is well defined. The supremum in (18) ranges over all optimal strategies w.r.t. P . If, for example, the MDM (X, A, P , Π, r) satisfies conditions (a)-(c) of Theorem 10.3 in the supplementary material, then by part (iii) of this theorem an optimal strategy exists, i.e. Π(P ) is non-empty. The existence of an optimal strategy is also ensured if the sets F 0 , . . . , F N −1 are finite (a situation one often faces in applications). In the latter case the 'Hadamard derivative' V xn n;P (Q − P ) can easily be determined by computing the finitely many values V xn;π n;P (Q − P ), π ∈ Π(P ), and taking their maximum. The discrete case will be discussed in more detail in Subsection 5.5 of the supplementary material.
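When Π(P ) is finite, this recipe (compute the per-strategy 'derivatives' for the optimal strategies and take their maximum) can be checked numerically. The following toy sketch, in Python with illustrative names, assumes a finite setting in which each strategy's value functional is affine in a finite-dimensional transition parameter p (as in the discrete case of Section 5), and compares the resulting directional derivative of the optimal value with a finite-difference quotient.

```python
def V_pi(w, p):
    """Value of a fixed strategy: an affine functional w·p of the transition parameter p."""
    return sum(wi * pi for wi, pi in zip(w, p))

def V(p, strategies):
    """Optimal value: maximum over the (finitely many) strategies."""
    return max(V_pi(w, p) for w in strategies)

def derivative_of_optimal_value(p, q, strategies, tol=1e-12):
    """Maximum over the *optimal* strategies of the per-strategy derivatives V_pi(q) - V_pi(p)."""
    v = V(p, strategies)
    optimal = [w for w in strategies if abs(V_pi(w, p) - v) <= tol]
    return max(V_pi(w, q) - V_pi(w, p) for w in optimal)

# Two strategies, both optimal at p = (0.5, 0.5):
strategies = [(1.0, 0.0), (0.0, 1.0)]
p, q = (0.5, 0.5), (0.9, 0.1)
d = derivative_of_optimal_value(p, q, strategies)

# Finite-difference quotient (V(p + eps*(q - p)) - V(p)) / eps:
eps = 1e-6
p_eps = tuple(pi + eps * (qi - pi) for pi, qi in zip(p, q))
fd = (V(p_eps, strategies) - V(p, strategies)) / eps
```

Note that taking the maximum over a non-optimal strategy, or averaging over all strategies, would not reproduce the finite-difference quotient; the restriction to Π(p) is essential.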
If there exists a unique optimal strategy π P ∈ Π w.r.t. P , then Π(P ) is nothing but the singleton {π P }, and in this case the 'Hadamard derivative' V x 0 0;P of the optimal value (functional) V x 0 0 at P coincides with V x 0 ;π P 0;P .

Remark 3.15 (i) The 'Fréchet differentiability' in part (i) of Theorem 3.14 holds even uniformly in π ∈ Π; see Theorem 8.1 in the supplementary material for the precise meaning.
(ii) We do not know whether 'Hadamard differentiability' can be replaced by 'Fréchet differentiability' in part (ii) of Theorem 3.14. The following arguments rather cast doubt on this possibility. The proof of part (ii) is based on the decomposition of the value functional V xn n in display (69) of the supplementary material and a suitable chain rule, where the decomposition (69) involves the sup-functional Ψ introduced in display (70) of the supplementary material. However, Corollary 1 in [6] (see also Proposition 4.6.5 in [35]) shows that in normed vector spaces sup-functionals are in general not Fréchet differentiable. This could be an indication that 'Fréchet differentiability' of the value functional indeed fails. We cannot make a reliable statement in this regard.
In that case, the sets K for whose elements the first-order sensitivities can be compared with each other with clear conscience are smaller, and the 'derivative' is less robust.
(iii) In many situations, condition (c) of Theorem 3.14 holds trivially.
(iv) The conditions (b) and (c) of Theorem 3.14 can also be verified directly in some cases; see, for instance, the proof of Lemma 9.2 in Subsection 9.3.1 of the supplementary material. ✸

In applications it is not necessarily easy to specify the set Π(P ) of all optimal strategies w.r.t. P . While in most cases an optimal strategy can be found with little effort (one can use the Bellman equation; see part (i) of Theorem 10.3 in Section 10 of the supplementary material), it is typically more involved to specify all optimal strategies or to show that the optimal strategy is unique. The following remark may help in some situations; for an application see Subsection 4.4.

Remark 3.17
In some situations it turns out that for every P ∈ P ψ the solution of the optimization problem (6) does not change if Π is replaced by a subset Π ′ ⊆ Π (being independent of P ). Then in the definition (7) of the value function (at time 0) the set Π can be replaced by the subset Π ′ , and it follows (under the assumptions of Theorem 3.14) that in the representation (18) of the 'Hadamard derivative' V x 0 0;P of V x 0 0 at P the set Π(P ) can be replaced by the set Π ′ (P ) of all optimal strategies w.r.t. P from the subset Π ′ . Of course, in this case it suffices to ensure that conditions (a)-(b) of Theorem 3.14 are satisfied for the subset Π ′ instead of Π. ✸

Two alternative representations of V xn;π n;P

In this subsection we present two alternative representations (see (19) and (20)) of the 'Fréchet derivative' V xn;π n;P in (16). The representation (19) will be beneficial for the proof of Theorem 3.14 (see Lemma 8.2 in Subsection 8.1 of the supplementary material), and the representation (20) will be used to derive the 'Hadamard derivative' of the optimal value of the terminal wealth problem in (28) below (see the proof of Theorem 4.6 in Subsection 9.3 of the supplementary material).

Remark 3.18 (Representation I) By rearranging the sums in (16), we obtain under the assumptions of Theorem 3.14 that for every fixed P = (P n ) N −1 n=0 ∈ P ψ the 'Fréchet derivative' V xn;π n;P of V xn;π n at P can be represented as in (19) for every π ∈ Π and n = 0, . . . , N. ✸

Remark 3.19 (Representation II) For every fixed P = (P n ) N −1 n=0 ∈ P ψ , and under the assumptions of Theorem 3.14, the 'Fréchet derivative' V xn;π n;P of V xn;π n at P admits the representation (20) for every π ∈ Π and n = 0, . . . , N. Indeed, it is easily seen that V P ,Q;π n (x n ) coincides with the right-hand side of (19). Note that it can be verified iteratively by means of condition (a) of Theorem 3.14 and Lemma 3.2 (with P ′ := {Q}) that V P ,Q;π n (·) ∈ M ψ (E) for every Q ∈ P ψ , π ∈ Π, and n = 0, . . . , N. In particular, this implies that the integrals on the right-hand side of (21) exist and are finite. Also note that the iteration scheme (21) involves the family (V P ;π k ) N k=1 , which itself can be seen as the solution of a backward iteration scheme; see Proposition 10.1 of the supplementary material. ✸

Application to a terminal wealth optimization problem in mathematical finance
In this section we will apply the theory of Sections 2-3 to a particular optimization problem in mathematical finance. First, we introduce in Subsection 4.1 the basic financial market model and subsequently formulate the terminal wealth problem, a classical optimization problem in mathematical finance. The market model is in line with standard literature such as [2, Chapter 4] or [10, Chapter 5]. To keep the presentation as clear as possible we restrict ourselves to a simple variant of the market model (only one risky asset). In Subsection 4.2 we will see that the market model can be embedded into the MDM of Section 2. It turns out that the existence (and computation) of an optimal (trading) strategy can be obtained by solving iteratively N one-stage investment problems; see Subsection 4.3. In Subsection 4.4 we will specify the 'Hadamard derivative' of the optimal value functional of the terminal wealth problem, and Subsection 4.5 provides some numerical examples.

Basic financial market model, and the target
Consider an N-period financial market consisting of one riskless bond B = (B 0 , . . . , B N ) and one risky asset S = (S 0 , . . . , S N ). Further assume that the value of the bond evolves deterministically according to B n = B n−1 r n , n = 1, . . . , N, for some fixed constants r 1 , . . . , r N ∈ R ≥1 , and that the value of the asset evolves stochastically according to S n = S n−1 R n , n = 1, . . . , N, for some independent R ≥0 -valued random variables R 1 , . . . , R N on some probability space (Ω, F , P) with distributions m 1 , . . . , m N , respectively. Throughout Section 4 we will assume that the financial market satisfies the following Assumption (FM), where α ∈ (0, 1) is fixed and chosen as in (24) below. In Examples 4.4 and 4.5 we will discuss specific financial market models which satisfy Assumption (FM).
Note that for any n = 0, . . . , N − 1 the value r n+1 (resp. R n+1 ) corresponds to the relative price change B n+1 /B n (resp. S n+1 /S n ) of the bond (resp. asset) between time n and n + 1. Let F 0 be the trivial σ-algebra, and set F n := σ(S 0 , . . . , S n ) = σ(R 1 , . . . , R n ) for any n = 1, . . . , N. Now, an agent invests a given amount of capital x 0 ∈ R ≥0 in the bond and the asset according to some self-financing trading strategy. By trading strategy we mean an (F n )-adapted R 2 ≥0 -valued stochastic process ϕ = (ϕ 0 n , ϕ n ) N −1 n=0 , where ϕ 0 n (resp. ϕ n ) specifies the amount of capital that is invested in the bond (resp. asset) during the time interval [n, n + 1). Here we require that both ϕ 0 n and ϕ n are nonnegative for any n, which means that taking loans and short selling of the asset are excluded. The corresponding portfolio process X ϕ = (X ϕ 0 , . . . , X ϕ N ) associated with ϕ = (ϕ 0 n , ϕ n ) N −1 n=0 is given by X ϕ 0 := x 0 and X ϕ n+1 := ϕ 0 n r n+1 + ϕ n R n+1 , n = 0, . . . , N − 1. It is easily seen that for any self-financing trading strategy ϕ = (ϕ 0 n , ϕ n ) N −1 n=0 w.r.t. x 0 the corresponding portfolio process admits the representation X ϕ n+1 = r n+1 (X ϕ n − ϕ n ) + ϕ n R n+1 , n = 0, . . . , N − 1. Note that X ϕ n − ϕ n corresponds to the amount of capital which is invested in the bond between time n and n + 1. Also note that it can be verified easily by means of Remark 3.1.6 in [2] that under condition (c) of Assumption (FM) the financial market introduced above is free of arbitrage opportunities.
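The self-financing recursion above can be sketched in a few lines of Python (the function name and the constant-return sanity check below are illustrative, not part of the model):

```python
def simulate_portfolio(x0, strategy, bond_returns, asset_returns):
    """Run X_{n+1} = r_{n+1} (X_n - phi_n) + phi_n R_{n+1}, where the amount
    phi_n invested in the asset is given by a Markovian rule phi_n = f_n(X_n)."""
    x = x0
    for f, r, R in zip(strategy, bond_returns, asset_returns):
        phi = f(x)
        assert 0.0 <= phi <= x          # no loans, no short selling
        x = r * (x - phi) + phi * R     # self-financing one-step transition
    return x

# Sanity check: investing nothing in the asset compounds at the bond returns.
N = 3
all_in_bond = [lambda x: 0.0] * N
x_N = simulate_portfolio(100.0, all_in_bond, [1.02] * N, [1.1, 0.9, 1.05])
```

In a stochastic run one would draw the `asset_returns` from the distributions m 1 , . . . , m N instead of fixing them.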
In view of (22), we may and do identify a self-financing trading strategy w.r.t. x 0 with an (F n )-adapted R ≥0 -valued stochastic process ϕ = (ϕ n ) N −1 n=0 satisfying ϕ 0 ∈ [0, x 0 ] and ϕ n ∈ [0, X ϕ n ] for all n = 1, . . . , N − 1. We restrict ourselves to Markovian self-financing trading strategies ϕ = (ϕ n ) N −1 n=0 w.r.t. x 0 , which means that ϕ n only depends on n and X ϕ n . To put it another way, we assume that for any n = 0, . . . , N − 1 there exists some Borel measurable map f n : R ≥0 → R ≥0 such that ϕ n = f n (X ϕ n ). Then, in particular, X ϕ is an R ≥0 -valued (F n )-Markov process whose one-step transition probability at time n ∈ {0, . . . , N − 1} given state x n ∈ R ≥0 and strategy ϕ is the distribution of r n+1 (x n − f n (x n )) + f n (x n )R n+1 (see the map η n,(x,a) in (23)). The agent's aim is to find a self-financing trading strategy w.r.t. x 0 for which her expected utility of the discounted terminal wealth is maximized. We assume that the agent is risk averse and that her attitude towards risk is set via the power utility function u α : R ≥0 → R ≥0 defined by (24) for some fixed α ∈ (0, 1) (as in Assumption (FM)). The coefficient α determines the degree of risk aversion of the agent: the smaller the coefficient α, the greater her risk aversion. Hence the agent is interested in those self-financing trading strategies ϕ = (ϕ n ) N −1 n=0 w.r.t. x 0 that maximize the expected utility of the discounted terminal wealth X ϕ N /B N . In the following subsections we will assume for notational simplicity that r 1 , . . . , r N are fixed and that m 1 , . . . , m N are a sort of model parameters. In this case the factor 1/B N in (25) is superfluous; it indeed does not influence the maximization problem or any 'derivative' of the optimal value. On the other hand, if also the (Dirac-) distributions of r 1 , . . . , r N were allowed to be variable, then this factor could matter for the derivative of the optimal value w.r.t. changes in the (deterministic) dynamics of B N .

Embedding into MDM, and optimal trading strategies
The setting introduced in Subsection 4.1 can be embedded into the setting of Sections 2-3 as follows. Let r 1 , . . . , r N ∈ R ≥1 be a priori fixed constants. Let the state space be E := R ≥0 , let the set of admissible actions in state x ∈ E (amounts invested in the asset) be the interval [0, x], and let the set F n of all decision rules at time n consist of all those Borel measurable functions f n : R ≥0 → R ≥0 satisfying f n (x) ∈ [0, x] for all x ∈ R ≥0 . For any n = 0, . . . , N − 1, let the set F n of all admissible decision rules at time n be equal to F n , and let as before Π denote the corresponding set of strategies. Moreover let r n :≡ 0 for any n = 0, . . . , N − 1, and let the terminal reward function be given by the utility function u α . Consider the gauge function ψ defined by (26), i.e. ψ(x) := 1 + u α (x). Let P ψ be the set of all transition functions P = (P n ) N −1 n=0 ∈ P consisting of transition kernels of the shape (27), where the underlying distributions m 1 , . . . , m N belong to M α 1 (R ≥0 ) and the map η n,(x,a) is defined as in (23). In particular, P ψ ⊆ P ψ (with P ψ defined as in Subsection 3.3), and it can be verified easily that ψ given by (26) is a bounding function for the MDM (X, A, Q, Π, r) for any Q ∈ P ψ (see Lemma 9.2(i) of the supplementary material). Note that X plays the role of the portfolio process X ϕ from Subsection 4.1, and that, for some fixed x 0 ∈ R ≥0 , any self-financing trading strategy w.r.t. x 0 corresponds to a strategy π ∈ Π. Then, for every fixed x 0 ∈ R ≥0 and P ∈ P ψ , the terminal wealth problem introduced at the very end of Subsection 4.1 reads as the maximization problem (28). A strategy π P ∈ Π is called an optimal (self-financing) trading strategy w.r.t. P (and x 0 ) if it solves the maximization problem (28).
Of course, one could also assume that the decision rules of a trading strategy π also depend on past actions and past values of the portfolio process X ϕ . However, as already discussed in Remark 2.5(i), the corresponding history-dependent trading strategies do not lead to an improved optimal value for the terminal wealth problem (28). ✸

Computation of optimal trading strategies
In this subsection we discuss the existence and computation of solutions to the terminal wealth problem (28), maintaining the notation of Subsection 4.2. We will adapt the arguments of Section 4.2 in [2]. As before r 1 , . . . , r N ∈ R ≥1 are fixed constants. Basically the existence of an optimal trading strategy for the terminal wealth problem (28) can be ensured with the help of a suitable analogue of Theorem 4.2.2 in [2]. In order to specify the optimal trading strategy explicitly one has to determine the local maximizers in the Bellman equation; see Theorem 10.3(i) in Section 10 of the supplementary material. However this is not necessarily easy. On the other hand, part (ii) of Theorem 4.3 ahead (a variant of Theorem 4.2.6 in [2]) shows that, for our particular choice of the utility function (recall (24)), the optimal investment in the asset at time n ∈ {0, . . . , N − 1} has a rather simple form insofar as it depends linearly on the wealth. The respective coefficient can be obtained by solving the one-stage optimization problem in (29) ahead. That is, instead of finding the optimal amount of capital (possibly depending on the wealth) to be invested in the asset, it suffices to find the optimal fraction of the wealth (being independent of the wealth itself) to be invested in the asset.
For the formulation of the one-stage optimization problem note that every transition function P ∈ P ψ is generated through (27) by some (m 1 , . . . , m N ) ∈ M α 1 (R ≥0 ) N . For every P ∈ P ψ , we use (m P 1 , . . . , m P N ) to denote any such set of 'parameters'. Now, consider for any P ∈ P ψ and n = 0, . . . , N − 1 the optimization problem Note that 1 + γ(y/r n+1 − 1) lies in R ≥0 for any γ ∈ [0, 1] and y ∈ R ≥0 , and that the integral on the left-hand side (exists and) is finite (this follows from displays (77)-(79) in Subsection 9.1 of the supplementary material) and should be seen as the expectation of u α (1 + γ(R n+1 /r n+1 − 1)) under P.
The following lemma, whose proof can be found in Subsection 9.1 of the supplementary material, shows in particular that the supremum v P n of the objective function in (29) over γ ∈ [0, 1] is attained, i.e. that v P n is the maximal value of the optimization problem (29).
Lemma 4.2 For any P ∈ P ψ and n = 0, . . . , N − 1, there exists a unique solution γ P n ∈ [0, 1] to the optimization problem (29).

Part (i) of the following Theorem 4.3 involves the value function introduced in (7). In the present setting this function has a comparatively simple form; see (30) for any x n ∈ R ≥0 , P ∈ P ψ , and n = 0, . . . , N. Part (ii) involves the subset Π lin of Π which consists of all linear trading strategies, i.e. of all π ∈ Π whose decision rules are linear in the wealth as in (31). In part (i) and elsewhere we use the convention that the product over the empty set is 1.

Theorem 4.3 (i) The value function V P n given by (30) admits a product representation for any x n ∈ R ≥0 and n = 0, . . . , N. (ii) For any n = 0, . . . , N − 1, let γ P n ∈ [0, 1] be the unique solution to the optimization problem (29) and define a decision rule f P n by f P n (x) := γ P n x. Then π P := (f P n ) N −1 n=0 ∈ Π lin forms an optimal trading strategy w.r.t. P . Moreover, there is no further optimal trading strategy w.r.t. P which belongs to Π lin .

Example 4.4 (Cox-Ross-Rubinstein model) Let r 1 = · · · = r N = r for some r ∈ R ≥1 . Moreover let P ∈ P be any transition function defined as in (27) with m 1 = · · · = m N = m P for some m P := p P δ u P + (1 − p P )δ d P , where p P ∈ [0, 1] and d P , u P ∈ R >0 are some given constants (depending on P ) satisfying d P < r < u P . Then P ∈ P ψ and conditions (a)-(c) of Assumption (FM) are clearly satisfied. In particular, the corresponding financial market is arbitrage-free, and the optimization problem (29) simplifies (up to the factor r −α ) to the maximization of p P u α (1 + γ(u P /r − 1)) + (1 − p P ) u α (1 + γ(d P /r − 1)) over γ ∈ [0, 1]; see (33). Lemma 4.2 ensures that (33) has a unique solution, γ P CRR , and it can be checked easily (see, e.g., [2, p. 86]) that this solution admits the representation (34), where κ α := (1 − α) −1 . Note that only fractions from the interval [0, 1] are admissible, and that the expression in the middle line of (34) lies in (0, 1) when p P ∈ (p P ,0 , p P ,1 ).
Thus, part (ii) of Theorem 4.3 shows that the strategy π P CRR defined by (32) (with γ P n replaced by γ P CRR ) is optimal w.r.t. P and unique among all π ∈ Π lin (P ). ✸

In the following example (Example 4.5) the bond and the asset evolve according to the ordinary differential equation dB t = νB t dt and the Itô stochastic differential equation dS t = µS t dt + σS t dW t , respectively, where ν, µ ∈ R ≥0 and σ ∈ R >0 are constants and W is a one-dimensional standard Brownian motion. We assume that the trading period is (without loss of generality) the unit interval [0, 1] and that the bond and the asset can be traded only at N equidistant time points in [0, 1], namely at t N,n := n/N, n = 0, . . . , N − 1. Then, in particular, the relative price changes r n+1 and R n+1 are given by the respective ratios of consecutive bond and asset prices. In particular, r n+1 = exp(ν/N) and R n+1 is distributed according to the log-normal distribution LN (µ−σ 2 /2)/N,σ 2 /N for any n = 0, . . . , N − 1.
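In the Cox-Ross-Rubinstein example the one-stage problem (29) reduces to maximizing a concave function of γ on [0, 1], so a simple grid search already recovers γ P CRR to good accuracy. The following Python sketch (the function name and the parameter values are illustrative) also exhibits the corner solution γ = 0 that occurs when p P lies below the threshold p P ,0 :

```python
def one_stage_crr(p, u, d, r, alpha, grid=2001):
    """Grid search for gamma in [0, 1] maximizing the CRR one-stage objective
    p * u_alpha(1 + g*(u/r - 1)) + (1 - p) * u_alpha(1 + g*(d/r - 1)),
    with power utility u_alpha(x) = x**alpha."""
    def objective(g):
        return (p * (1.0 + g * (u / r - 1.0)) ** alpha
                + (1.0 - p) * (1.0 + g * (d / r - 1.0)) ** alpha)
    gammas = [i / (grid - 1) for i in range(grid)]
    return max(gammas, key=objective)

# Illustrative values with d < r < u:
g_interior = one_stage_crr(0.35, 1.2, 0.9, 1.0, 0.5)   # interior optimum in (0, 1)
g_corner = one_stage_crr(0.20, 1.2, 0.9, 1.0, 0.5)     # 'up' probability too small: gamma = 0
```

Since the objective is concave (Lemma 4.2 guarantees a unique maximizer), the grid search is reliable here; in the interior case one could also solve the first-order condition in closed form as in (34).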
For the formulation of Theorem 4.6 recall from (14) the definition of the functionals V x 0 ;π 0 and V x 0 0 , where the maps V P ;π 0 and V P 0 are given by (5) and (7), respectively. In the specific setting of Subsection 4.2 we know from (30) that these functionals admit explicit representations for any x 0 ∈ R ≥0 , P ∈ P ψ , and π ∈ Π; see also (37). Further recall that any γ = (γ n ) N −1 n=0 ∈ [0, 1] N induces a linear trading strategy π γ := (f γ n ) N −1 n=0 ∈ Π lin through (31). Let v P ;γn n be defined as on the left-hand side of (29), and let v P ;γ denote the corresponding product over n = 0, . . . , N − 1.

Theorem 4.6 ('Differentiability' of V x 0 ;π 0 and V x 0 0 ) In the setting above let P ∈ P ψ , γ ∈ [0, 1] N , and x 0 ∈ R ≥0 . Then the following two assertions hold.
Remark 4.7 Basically, Theorem 3.14 yields the first "=" in (39) with Π lin (P ) replaced by Π(P ). Since part (ii) of Theorem 4.3 ensures that for any P ∈ P ψ there exists an optimal trading strategy which belongs to Π lin , we may, for any P ∈ P ψ , replace the set Π by Π lin (⊆ Π) in the representation (30) of the value function V P 0 (x 0 ) (or, equivalently, in the representation (37) of the value functional V x 0 0 (P )). Therefore one can use Theorem 3.14 to derive the first "=" in (39). The second "=" in (39) is ensured by the second assertion in part (ii) of Theorem 4.3. For details see the proof, which is carried out in Subsection 9.3 of the supplementary material.

Numerical examples for the 'Hadamard derivative'
In this subsection we quantify by means of the 'Hadamard derivative' (of the optimal value functional V x 0 0 ) the effect of incorporating an unlikely but significant jump in the dynamics S = (S 0 , . . . , S N ) of an asset price on the optimal value of the corresponding terminal wealth problem (28). At the end of this subsection we will also study the effect of incorporating more than one jump.
We specifically focus on the setting of the discretized Black-Scholes-Merton model from Example 4.5 with (mainly) N = 12. That is, we let r 1 = · · · = r N = r for r := exp(ν/N), where ν ∈ R ≥0 . Moreover let P correspond to m 1 = · · · = m N = m P for m P := LN (µ P −σ 2 P /2)/N,σ 2 P /N , where µ P ∈ R ≥0 and σ P ∈ R >0 are chosen such that µ P > (1 − α)σ 2 P . In fact we let specifically µ P = 0.05 and σ P = 0.2. This set of parameters is often used in numerical examples in the field of mathematical finance; see, e.g., [25, p. 898]. For the initial state we choose x 0 = 1. For the drift ν of the bond we will consider different values, all of them lying in {0.01, 0.02, 0.03, 0.035, 0.04}. Moreover, we let (mainly) α ∈ {0.25, 0.5, 0.75}. Recall that α determines the degree of risk aversion of the agent; a small α corresponds to high risk aversion.
By a price jump at a fixed time n ∈ {0, . . . , N − 1} we mean that the asset's return R n+1 is not anymore drawn from m P but is given by a deterministic value ∆ ∈ R ≥0 essentially 'away' from 1. As appears from Table 1, in the case N = 12 it seems reasonable to speak of a 'jump' at least if ∆ ≤ 0.8 or ∆ ≥ 1.25. The probability under m P for a realized return smaller than 0.8 (resp. larger than 1.25) is smaller than 0.0001. A realized return of ≤ 0.5 (resp. ≥ 1.5) is practically impossible; its probability under m P is smaller than 10 −30 (resp. 10 −10 ). That is, the choice ∆ = 0.5 or ∆ = 1.5 doubtlessly corresponds to a significant price jump.

If at a fixed time τ ∈ {0, . . . , N − 1} a formerly nearly impossible 'jump' ∆ can now occur with probability ε, then instead of m τ +1 = m P one has m τ +1 = (1 − ε)m P + εδ ∆ . That is, instead of P the transition function is now given by (1 − ε)P + εQ ∆,τ with Q ∆,τ generated through (27) by m n+1 = m Q ∆,τ ;n , n = 0, . . . , N − 1, where m Q ∆,τ ;n := δ ∆ if n = τ , and m Q ∆,τ ;n := m P otherwise.
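The tail probabilities quoted above can be reproduced directly from the log-normal distribution of R n+1 with the stated parameters µ P = 0.05, σ P = 0.2, and N = 12 (the helper function name is ours):

```python
from math import erf, log, sqrt

def lognormal_cdf(x, mean_log, sd_log):
    """P[R <= x] for log R ~ Normal(mean_log, sd_log**2), x > 0."""
    return 0.5 * (1.0 + erf((log(x) - mean_log) / (sd_log * sqrt(2.0))))

N, mu_P, sigma_P = 12, 0.05, 0.2
mean_log = (mu_P - sigma_P ** 2 / 2.0) / N     # m_P = LN((mu - sigma^2/2)/N, sigma^2/N)
sd_log = sigma_P / sqrt(N)

p_below_080 = lognormal_cdf(0.80, mean_log, sd_log)        # return below 0.8
p_above_125 = 1.0 - lognormal_cdf(1.25, mean_log, sd_log)  # return above 1.25
p_below_050 = lognormal_cdf(0.50, mean_log, sd_log)        # practically impossible
p_above_150 = 1.0 - lognormal_cdf(1.50, mean_log, sd_log)  # practically impossible
```

These values confirm the bounds 0.0001, 10 −30 , and 10 −10 used to justify calling ∆ = 0.5 and ∆ = 1.5 significant jumps.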
By part (ii) of Theorem 4.6 the 'Hadamard derivative' V x 0 0;P of the optimal value functional V x 0 0 evaluated at Q ∆,τ − P can be written as in (41)-(43) for n = 0, . . . , N − 1, where ν P ,α := µ P − (1 − α)σ 2 P (∈ (0, µ P )). Note that V x 0 0;P (Q ∆,τ − P ) is independent of τ , which can be seen from (40)-(43). That is, the effect of a jump is independent of the time at which the jump takes place. Also note that V x 0 0;P (Q ∆,τ − P ) ≡ 0 when ν ∈ [µ P , ∞). This is not surprising, because in this case the optimal fraction γ P BSM to be invested in the asset is equal to 0 (see (36)) and the agent performs a complete investment in the bond at each trading time n.
Remark 4.8 As mentioned before, the 'Hadamard derivative' V x 0 0;P evaluated at Q ∆,τ − P can be seen as the first-order sensitivity of the optimal value V x 0 0 (P ) w.r.t. a change of P to (1 − ε)P + εQ ∆,τ , with ε > 0 small. It is a natural wish to compare these values for different ∆ ∈ R >0 . In Subsection 9.4 of the supplementary material it is proven that the family {Q ∆,τ : ∆ ∈ [0, δ]} is relatively compact w.r.t. d ψ ∞,M Höl,α for any fixed δ ∈ R >0 (the proof does not work if d ψ ∞,M Höl,α is replaced by d φ ∞,M Höl,α for any gauge function φ 'flatter' than ψ). As a consequence the approximation (1) with Q = Q ∆,τ holds uniformly in ∆ ∈ [0, δ], and therefore the values V x 0 0;P (Q ∆,τ − P ), ∆ ∈ [0, δ], can be compared with each other with clear conscience. ✸

As appears from Figure 1, the negative effect of incorporating a 'jump' ∆ = 0.5 in the dynamics S = (S 0 , . . . , S N ) of an asset price is larger than the positive effect of incorporating a 'jump' ∆ = 1.5, for every choice of the agent's degree of risk aversion. Figure 1 also shows the unsurprising effect that a high risk aversion (small value of α) leads to a negligible sensitivity.
Next we compare the values of V x 0 0;P (Q ∆,τ − P ) for trading horizons N ∈ {4, 12, 52} in dependence on the drift ν of the bond and the 'jump' ∆. These choices of N correspond respectively to a quarterly, monthly, and weekly time discretization. We will restrict ourselves to 'jumps' ∆ ≤ 0.8. On the one hand, this ensures that the 'jumps' are significant; see the discussion above. On the other hand, as just discerned from Figure 1, jumps 'down' have a more significant effect than jumps 'up'.

Finally, let us briefly touch on the case where more than one jump may appear. More precisely, instead of Q ∆,τ (with τ ∈ {0, . . . , N − 1}) consider the transition function Q ∆,τ (ℓ) (with 1 ≤ ℓ ≤ N, τ (ℓ) = (τ 1 , . . . , τ ℓ ), τ 1 , . . . , τ ℓ ∈ {0, . . . , N − 1} pairwise distinct) which is still generated by means of (40) but with the difference that at the ℓ different times τ 1 , . . . , τ ℓ the distribution m P is replaced by δ ∆ . Just as in the case ℓ = 1, it turns out that it does not matter at which times τ 1 , . . . , τ ℓ exactly these ℓ jumps occur. Figure 4 shows the value of V x 0 0;P (Q ∆,τ (ℓ) − P ) in dependence on ℓ and ∆. It seems that for any fixed ∆ ∈ [0, 0.8] the first-order sensitivity increases approximately linearly in ℓ.

Supplementary material
The supplementary material illustrates the setting of Sections 2-3 in the case of finite state and action spaces, and contains the proofs of the results from Sections 3-4. Moreover, supplemental definitions and results for Section 2 are given, and the existence of optimal strategies in general MDMs is discussed. Finally, a supplemental topological result is shown.

Supplement: The discrete case as an illustrating example
In Sections 2-3 we work in a rather general set-up. This implies that we cannot avoid dealing with 'advanced' objects. In the special case where the state space and the action spaces are finite the situation is different. In this case it is possible to present the basic definitions and the main result (Theorem 3.14) in a more comprehensible way. For the moment we assume that the reader is already familiar with the basic terminology of MDMs; otherwise we advise the reader to first read Section 2. In Section 5.5 it will be discussed how the following elaborations fit into the general set-up of Sections 2-3.

Basic model components
Let E = {x 1 , . . . , x s } be a finite state space, N ∈ N be a fixed finite time horizon, and A n (x i ) = {a n,i;1 , . . . , a n,i;t n,i } be the finite set of possible actions that can be performed when the MDP is in state x i at time n ∈ {0, . . . , N − 1}. For any n = 0, . . . , N − 1, i = 1, . . . , s, and a ∈ A n (x i ), the (one-step transition) probability measure on E from which the state of the MDP at time n + 1 is drawn, given that the MDP is in state x i and action a is selected at time n, can be identified with an element p n,i;a = (p n,i;a (1), . . . , p n,i;a (s)) of R s ≥0,1 . Here R s ≥0,1 is the set of all vectors from R s whose entries are nonnegative and sum up to 1, and p n,i;a (j) specifies the probability that the MDP will be in state x j at time n + 1, given it is in state x i and action a ∈ A n (x i ) is selected at time n. In particular, if the initial state x 0 ∈ E is fixed and i 0 refers to the corresponding index (i.e. x 0 = x i 0 ), the vector p obtained by glueing together all the vectors p n,i;a , which is an element of R d with d := (t 0,i 0 + ∑ N −1 n=1 ∑ s i=1 t n,i ) s, can be identified with the transition probability function, i.e. with the ensemble of all transition probabilities. Here ⊕ is the 'glueing operator' defined by (α 1 , . . . , α s ) ⊕ (β 1 , . . . , β t ) := (α 1 , . . . , α s , β 1 , . . . , β t ). In fact p is even an element of the subset P of R d consisting of those vectors whose glued blocks each lie in R s ≥0,1 . If V 0 (p) denotes the optimal value of the MDM based on transition probability function p, then V 0 = V 0 (·) can be seen as a map from P (⊆ R d ) to R.
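A minimal Python illustration of this encoding (the function names are ours): the glueing operator ⊕ is just tuple concatenation, and membership of p in P amounts to each glued block being a probability vector in R s ≥0,1 .

```python
def glue(*vectors):
    """The glueing operator: (a_1,...,a_s) + (b_1,...,b_t) -> (a_1,...,a_s,b_1,...,b_t)."""
    return tuple(x for v in vectors for x in v)

def is_probability_vector(row, tol=1e-12):
    """Membership in R^s_{>=0,1}: nonnegative entries summing up to 1."""
    return min(row) >= 0.0 and abs(sum(row) - 1.0) <= tol

# One row p_{n,i;a} per (time, state, action) triple; glueing them yields p in R^d.
rows = [(0.2, 0.8), (0.5, 0.5), (1.0, 0.0)]
p = glue(*rows)
```

Here d = 3 · 2 = 6, matching the count d = (number of (n, i, a) triples) · s.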

Definition of first-order sensitivity in the discrete case
It is tempting to consider the classical Fréchet (or total) derivative V 0;p of V 0 at p in order to obtain a tool for measuring the first-order sensitivity of the optimal value w.r.t. a change from p to (1 − ε)p + εq; see (46), which is required to hold for any (h m ) ∈ R N 0 with h m → 0, where B 1 (p) is the closed ball in R d around p with radius 1 and R 0 := R \ {0}. This approach is indeed expedient to some extent. However, one has to note that p + h m (q − p) may lie outside the domain P of V 0 . To avoid this problem, we replace condition (46) by the variant (47), which is required to hold for any (ε m ) ∈ (0, 1] N with ε m → 0. Take into account that p + ε(q − p) lies in P for any p, q ∈ P and ε ∈ (0, 1]. Also note that, if R d is equipped with the max-norm, P is contained in B 1 (p) for any p ∈ P. For classical Fréchet (or total) differentiability the derivative V 0;p is required to be linear and continuous. On the one hand, for 'Fréchet differentiability' (see Definition 5.1) we will also require a sort of continuity, namely that the mapping q → V 0;p (q − p) from P to R is continuous, where P is equipped with the relative topology of R d . On the other hand, the domain of V 0;p is given by P p;± := {q − p : q ∈ P} and is thus not a linear space. Therefore linearity of V 0;p is not a well-defined requirement.
In view of (1), the quantityV 0;p (q − p) can be seen as a measure for the first-order sensitivity of the optimal value V 0 (p) under transition probability function p w.r.t. a change from p to (1 − ε)p + εq, with ε > 0 small. For this interpretation it is actually not necessary to require thatV 0;p ( · − p) is continuous or that the convergence in (47) holds uniformly in q ∈ P. One can indeed be content with the directional derivative, i.e. with the convergence in (47) for fixed q. Nevertheless continuity and uniformity are natural wishes in this context, because they ensure stability of the first-order sensitivity w.r.t. small modifications of q as well as comparability of the first-order sensitivity of (infinitely) many different q. We refer to the discussion subsequent to (1).
Definition 5.1 A map V : P → R is said to be 'Fréchet differentiable' at p ∈ P if there exists a map V p : P p;± → R for which (47) (with V 0 , V 0;p replaced by V , V p , respectively) holds and for which the mapping q → V p (q − p) from P to R is continuous. In this case V p is called the 'Fréchet derivative' of V at p.

Computation of first-order sensitivity in the discrete case
To specify the 'Fréchet derivative' of V 0 at p we need some further notation. For any strategy π = (f n ) N −1 n=0 , we use V π 0 (p) to denote the expected total reward (from time 0 to N) when p is the underlying transition probability function and the decisions are performed according to π. In the finite setting there exists under p at least one optimal strategy π p , i.e. a strategy π p with V π p 0 (p) = max π V π 0 (p). We will write Π(p) for the (finite) set of all optimal strategies w.r.t. p. Then the results of Subsection 3.5 show that the 'Fréchet derivative' of V 0 at p is given by (48), where V π 0;p refers to the 'Fréchet derivative' of V π 0 at p. The latter can be obtained from a suitable iteration scheme. According to Remark 3.19 we indeed have V π 0;p (q − p) = V p,q;π 0 (x i 0 ) (recall that x i 0 ∈ E is the initial state and that π = (f n ) N −1 n=0 refers to a strategy), where the V p,q;π n (·) can be computed iteratively for i = 1, . . . , s, and where the V p;π n (·) are given by the usual backward iteration scheme (see, e.g., Lemma 3.5 in [13] or p. 80 in [30]) for the computation of V π 0 (p):
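The backward iteration scheme just mentioned, together with the maximum over Π(p) implicit in (48), can be sketched as follows (Python; the function names and the tiny example at the end are illustrative):

```python
def strategy_value(p, reward, terminal, strategy):
    """Backward iteration for V^pi_0(p): starting from the terminal reward,
    V^pi_n(i) = r_n(i, f_n(i)) + sum_j p_n(i, f_n(i))(j) * V^pi_{n+1}(j)."""
    V = list(terminal)
    for n in reversed(range(len(p))):
        f = strategy[n]
        V = [reward[n][i][f[i]]
             + sum(pr * V[j] for j, pr in enumerate(p[n][i][f[i]]))
             for i in range(len(V))]
    return V  # V[i] = V^pi_0(x_i)

def optimal_value(p, reward, terminal):
    """Bellman backward iteration: maximize over the available actions at each stage."""
    V = list(terminal)
    for n in reversed(range(len(p))):
        V = [max(reward[n][i][a]
                 + sum(pr * V[j] for j, pr in enumerate(p[n][i][a]))
                 for a in range(len(p[n][i])))
             for i in range(len(V))]
    return V

# Two states, one period; state 0 has two actions (move to state 0 or to state 1),
# state 1 has a single action; only a terminal reward, paid in state 1.
p = [[[(1.0, 0.0), (0.0, 1.0)], [(0.0, 1.0)]]]
reward = [[[0.0, 0.0], [0.0]]]
terminal = [0.0, 10.0]
```

Enumerating Π(p) and taking the maximum of the per-strategy derivatives, as described above, then yields the 'Fréchet derivative' of V 0 at p.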

An example: stochastic inventory control
In this subsection we will consider an inventory control problem, which is a classical example in discrete dynamic optimization; see, e.g., [5, 12, 30].

Basic inventory control model, and the target
Consider an N-period inventory control system in which a supplier of a single product seeks optimal inventory management to meet random commodity demand in such a way that a measure of profit over a time horizon of N periods is maximized. For the formulation of the model, let I_1, …, I_N be ℕ_0-valued independent random variables on some probability space (Ω, F, ℙ), where I_{n+1} can be seen as the random demand for the single product in the period between time n and time n + 1. We denote by p_{n+1} = (p_{n+1;k})_{k∈ℕ_0} the counting density of I_{n+1} (i.e. p_{n+1;k} := ℙ[I_{n+1} = k]) and assume that p_{n+1} is known for any n = 0, …, N − 1. Note that p_{n+1} ∈ ℝ^{ℕ_0}_{≥0,1}, where ℝ^{ℕ_0}_{≥0,1} denotes the space of all real-valued sequences whose entries are nonnegative and sum up to 1. Let F_0 be the trivial σ-algebra, and set F_n := σ(I_1, …, I_n) for any n = 1, …, N.
We suppose that within each period of time the available inventory level of the single product is restricted to K units (for some fixed K ∈ N) and that there is no backlogging of unsatisfied demand at the end of each period. The latter means that if at the end of a period the demand exceeds the current inventory, then the whole inventory is sold and the surplus demand gets lost.
Given an initial inventory level y_0 ∈ {0, …, K}, the supplier intends to find optimal order quantities according to an order strategy so as to maximize some measure of profit. By an order strategy we mean an (F_n)-adapted {0, …, K}-valued stochastic process φ = (φ_n)_{n=0}^{N−1}, where φ_n specifies the number of units of the single product ordered at the beginning of period n. Here we suppose that the delivery of any order occurs instantaneously. Since excess demand is lost by assumption, the corresponding inventory process Y^φ = (Y^φ_0, …, Y^φ_N) is given by

Y^φ_0 := y_0 and Y^φ_{n+1} := Y^φ_n + φ_n − min{I_{n+1}, Y^φ_n + φ_n}, n = 0, …, N − 1. (51)

Note that min{I_{n+1}, Y^φ_n + φ_n} corresponds to the number of units of the single product sold in the period between time n and time n + 1. Hence we refer to the process Z^φ := (Z^φ_0, …, Z^φ_N) defined by

Z^φ_0 := 0 and Z^φ_{n+1} := min{I_{n+1}, Y^φ_n + φ_n}, n = 0, …, N − 1 (52)

as the sales process associated with φ = (φ_n)_{n=0}^{N−1}. In view of (51) and since the inventory capacity is restricted to K units, we may and do identify any order strategy with an (F_n)-adapted {0, …, K}-valued stochastic process φ = (φ_n)_{n=0}^{N−1} satisfying φ_0 ∈ {0, …, K − y_0} and φ_n ∈ {0, …, K − Y^φ_n} for all n = 1, …, N − 1. We restrict ourselves to Markovian order strategies φ = (φ_n)_{n=0}^{N−1}, which means that φ_n depends only on n and (Y^φ_n, Z^φ_n). To put it another way, we suppose that for any n = 0, …, N − 1 there is some map f_n : {0, …, K}^2 → {0, …, K} such that φ_n = f_n(Y^φ_n, Z^φ_n). Hence, for a given strategy φ = (φ_n)_{n=0}^{N−1} (resp. π = (f_n)_{n=0}^{N−1}) the process X^φ := (Y^φ, Z^φ) is a {0, …, K}^2-valued (F_n)-Markov process whose one-step transition probability for the transition from state x = (y, z) ∈ {0, …, K}^2 at time n ∈ {0, …, N − 1} to state x′ = (y′, z′) ∈ {0, …, K}^2 at time n + 1 is given by η^{p_{n+1}}_{(y, f_n(y,z))}(z′) 𝟙{y′ = y + f_n(y,z) − z′}, where

η^{p_{n+1}}_{(y,a)}(z′) := p_{n+1;z′} 𝟙{z′ < y + a} + (Σ_{k ≥ y+a} p_{n+1;k}) 𝟙{z′ = y + a}. (53)

The supplier's aim is to find an order strategy φ = (φ_n)_{n=0}^{N−1} (resp. π = (f_n)_{n=0}^{N−1}) for which the expected total profit is maximized.
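The lost-sales mechanism behind the kernel η can be sketched as a small helper; truncating the demand density to a finite vector of length K + 1 is an implementation convenience, not part of the model.

```python
import numpy as np

def sales_kernel(p, y, a, K):
    """Counting density of the sales z' = min(I, y + a) under demand density p
    (truncated to a finite vector of length K + 1 for convenience)."""
    stock = y + a
    eta = np.zeros(K + 1)
    eta[:stock] = p[:stock]          # a demand z' < y + a is served completely
    eta[stock] = p[stock:].sum()     # any demand >= y + a sells the whole stock
    return eta

# demand 1, 2, 3 with probabilities 1/4, 1/2, 1/4 (as in the later example)
p_demand = np.array([0.0, 0.25, 0.5, 0.25, 0.0])
eta = sales_kernel(p_demand, y=1, a=1, K=4)
print(eta)   # sales of 2 units occur with probability P[I >= 2] = 3/4
```

The next inventory level is then y′ = y + a − z′, matching the indicator in the one-step transition probability.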
Here the profit can be seen as the difference between the sales revenue and the costs for ordering and holding the single product. For the sake of simplicity, we suppose that the sales revenue as well as the ordering and holding costs are known and linear in each period. Hence, we are interested in those order strategies φ = (φ_n)_{n=0}^{N−1} (resp. π = (f_n)_{n=0}^{N−1}) which maximize the expectation of the total profit over the N periods, i.e. the accumulated sales revenue minus the accumulated ordering and holding costs. Note here that s_rev, c_ord, c_fix, and c_hol denote the sales revenue, the ordering costs, the fixed ordering costs, and the holding costs per unit of the single product, respectively. Let the component p_{n,i;a_{n,i;k}} = (p_{n,i;a_{n,i;k}}(1), …, p_{n,i;a_{n,i;k}}(s)) of the vector p from (44) be given by

p_{n,i;a_{n,i;k}}(j) := η^{p_{n+1}}_{(y_i, a_{n,i;k})}(z_j) 𝟙{y_j = y_i + a_{n,i;k} − z_j}, j = 1, …, s

Embedding into MDM, and optimal order strategies
for some predetermined p_{n+1} ∈ ℝ^{ℕ_0}_{≥0,1} and for η^{p_{n+1}}_{(y,a)}(·) introduced in (53). In fact, any element p of P is generated via (53)–(54) by some N-tuple p = (p_1, …, p_N) of counting densities p_1, …, p_N on ℕ_0; here p_1, …, p_N should be seen as the counting densities of I_1, …, I_N. The value in (54) should be seen as the probability of a transition from state (y_i, z_i) to state (y_j, z_j) between time n and time n + 1 (this transition probability is even independent of z_i).
Then for every fixed p ∈ P the inventory control problem introduced in Subsection 5.4.1 reads as

maximize V_0^π(p) over all strategies π,   (58)

where V_0^π(p) := V_0^{p;π}(x_{i_0}) is given by (50) with (55)–(57) (x_{i_0} ∈ E is the initial state). A strategy π_p is called an optimal order strategy w.r.t. p if it solves the maximization problem (58).
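A backward-induction sketch for a problem of the form (58) follows. The dynamic-programming recursion itself is standard, but the per-period profit convention used here (revenue s_rev per unit sold, variable cost c_ord per unit ordered, fixed cost c_fix per order, holding cost c_hol per unit of end-of-period inventory) is an assumption made for illustration and need not match the displays (55)–(57) exactly.

```python
import numpy as np

K, N = 4, 3                                        # capacity and horizon
s_rev, c_ord, c_fix, c_hol = 8.0, 2.0, 4.0, 1.0    # profit/cost parameters
p_demand = np.array([0.0, 0.25, 0.5, 0.25, 0.0])   # per-period demand density

V = np.zeros(K + 1)                                # V_N(y) := 0
policy = []                                        # decision rules f_0, ..., f_{N-1}
for n in reversed(range(N)):
    V_new = np.empty(K + 1)
    f_n = np.empty(K + 1, dtype=int)
    for y in range(K + 1):
        best = -np.inf
        for a in range(K + 1 - y):                 # order up to capacity K
            stock = y + a
            eta = np.zeros(K + 1)                  # density of sales min(I, stock)
            eta[:stock] = p_demand[:stock]
            eta[stock] = p_demand[stock:].sum()
            z = np.arange(stock + 1)
            # expected one-period profit plus continuation value; holding cost
            # is charged on the end-of-period inventory stock - z (assumed)
            gain = eta[:stock + 1] @ (s_rev * z - c_hol * (stock - z) + V[stock - z])
            gain -= c_ord * a + (c_fix if a > 0 else 0.0)
            if gain > best:
                best, f_n[y] = gain, a
        V_new[y] = best
    V, policy = V_new, [f_n] + policy
print("V_0 =", V)
print("f_0 =", policy[0])
```

The state here is reduced to the inventory level y, which is possible because the transition probability in (53)–(54) does not depend on the sales component z.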

Numerical examples for the 'Fréchet derivative'
Let us take up the numerical example on p. 41 in [30], where N := 3, K := 4, s_rev := 8, c_ord := 2, c_fix := 4, and c_hol := 1. We fix p := (p_•, p_•, p_•) with p_• := (0, 1/4, 1/2, 1/4, 0, 0, …), and denote by p the unique element of P generated by p through (53)–(54). This choice of p means that in each period the demand is 1, 2, or 3 with probability 1/4, 1/2, and 1/4, respectively. Table 2 provides the (unique) optimal order strategy π_p = (f_0^p, f_1^p, f_2^p), and the second column of Table 3 displays the maximal expected total reward V_0^{π_p}(p) of the inventory control problem (58) for all possible initial inventory levels y_0 := y_{i_0} ∈ {0, …, 4}. Moreover, the last two columns in Table 3 display the 'Fréchet derivative' V̇_{0;p}^{π_p}(·) of V_0^{π_p} at p evaluated at the direction q^(0) − p and at the direction q^(4) − p (calculated with the iteration scheme (49)), again for all possible initial inventory levels y_0. Here q^(0) and q^(4) are generated through (53)–(54) by the counting densities that put all mass on demand 0 and on demand 4, respectively. As the optimal strategy π_p is unique in our example, we even have V̇_{0;p}(·) = V̇_{0;p}^{π_p}(·). Note that for i ∈ {0, 4} the value V̇_{0;p}(q^(i) − p) (in our case it equals V̇_{0;p}^{π_p}(q^(i) − p)) quantifies the first-order sensitivity of V_0(p) (respectively of V_0^{π_p}(p)) w.r.t. a change of the underlying transition probability function from p to p^(i) := (1 − ε)p + εq^(i) with ε ∈ (0, 1) small. It can be easily seen that p^(i) is generated through (53)–(54) by the counting density (1 − ε)p_• + εδ_i in each period, where δ_i denotes the Dirac counting density at i (take into account that the case differentiation in (53) does not depend on the counting density p_{n+1}). That is, the change from p to p^(i) means that the formerly impossible demand i now gets assigned a small but strictly positive probability ε in each period.

Table 2: Optimal order strategy π_p = (f_0^p, f_1^p, f_2^p) for p as above.
Table 2 tabulates f_n^p(y, z) for the states (y, z) ranging over (0, 0), (0, 1), …, (4, 4).

Table 3: Optimal value V_0^{π_p}(p) and the 'Fréchet derivative' V̇_{0;p}^{π_p}(q^(i) − p) (in our example it equals V̇_{0;p}(q^(i) − p)) with q^(i) as above, i ∈ {0, 4}, in dependence on the initial inventory level y_0. As appears from Table 3, the negative effect of incorporating demand 0 into the counting density p_• with small probability ε is roughly twice as large as the positive effect of incorporating demand 4 into p_• with the same small probability ε, no matter what the initial inventory level is. So, when worrying about robustness of the optimal value w.r.t. changes in the demand's counting density p_•, it seems somewhat more important to analyse in detail the adequacy of the assumption that a demand of 0 is impossible than the adequacy of the assumption that a demand of 4 is impossible.
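The qualitative finding can be reproduced approximately by finite differences. This is a sketch: the profit convention inside `opt_value` is an assumption for illustration, so the numbers need not match Table 3; the point is the sign pattern (perturbing towards demand 0 hurts, towards demand 4 helps).

```python
import numpy as np

K, N = 4, 3
s_rev, c_ord, c_fix, c_hol = 8.0, 2.0, 4.0, 1.0

def opt_value(p_dem, y0=0):
    """Optimal expected total profit via backward induction; the per-period
    profit convention here is an assumption made for illustration."""
    V = np.zeros(K + 1)
    for _ in range(N):
        V_new = np.empty(K + 1)
        for y in range(K + 1):
            best = -np.inf
            for a in range(K + 1 - y):
                stock = y + a
                eta = np.zeros(K + 1)
                eta[:stock] = p_dem[:stock]
                eta[stock] = p_dem[stock:].sum()
                z = np.arange(stock + 1)
                gain = eta[:stock + 1] @ (s_rev * z - c_hol * (stock - z)
                                          + V[stock - z])
                gain -= c_ord * a + (c_fix if a > 0 else 0.0)
                best = max(best, gain)
            V_new[y] = best
        V = V_new
    return V[y0]

p_base = np.array([0.0, 0.25, 0.5, 0.25, 0.0])
eps = 1e-4
quotients = {}
for i in (0, 4):
    delta = np.zeros(K + 1)
    delta[i] = 1.0                                 # Dirac direction at demand i
    quotients[i] = (opt_value((1 - eps) * p_base + eps * delta)
                    - opt_value(p_base)) / eps
print(quotients)
```

The difference quotient for i = 0 should come out negative (formerly impossible zero demand destroys revenue) and smaller than the quotient for i = 4.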

Embedding the discrete case into the set-up of Sections 2-3
In this subsection we will explain how the elaborations in Subsections 5.1–5.3 match the general theory introduced in Sections 2–3. Assume that the state space E as well as the set A_n(x) of all admissible actions for each point of time n = 0, …, N − 1 and each state x ∈ E are finite. Let s := #E ∈ ℕ and E := P(E) (the power set of E), and note that the sets A_n as well as D_n are finite for any n = 0, …, N − 1.
Let us measure the distance between two probability measures µ and ν from M_1(E) by the total variation metric d_TV, i.e. by

d_TV(µ, ν) := sup_{B∈E} |µ[B] − ν[B]|.

This fits the setting of Subsection 3.2 with M := M_TV and ψ :≡ 1; see Example 3.3. Since E was assumed to be finite with s := #E ∈ ℕ, we may and do identify any probability measure µ ∈ M_1(E) with some element p_µ = (p_µ(1), …, p_µ(s)) of ℝ^s_{≥0,1} (with ℝ^s_{≥0,1} as in Subsection 5.1). Hence the total variation distance d_TV between µ, ν ∈ M_1(E) can be identified (up to the factor 1/2) with the ℓ^1-distance between p_µ and p_ν:

d_TV(µ, ν) = (1/2) ‖p_µ − p_ν‖_{ℓ^1}.

That is, the map Λ : M_1(E) → ℝ^s_{≥0,1/2}, µ ↦ p_µ/2, provides a surjective isometry (here ℝ^s_{≥0,1/2} is the set of all vectors from ℝ^s whose entries are nonnegative and sum up to 1/2), and therefore the metric spaces (M_1(E), d_TV) and (ℝ^s_{≥0,1/2}, ‖·‖_{ℓ^1}) are isometrically isomorphic. This implies in particular that the set M_1(E) is compact w.r.t. d_TV, because ℝ^s_{≥0,1/2} is clearly compact w.r.t. ‖·‖_{ℓ^1}. For the distance between two transition functions we employ the metric d^1_{∞,M_TV}, which is defined as in (11) with ψ :≡ 1. As the sets D_0, …, D_{N−1} are finite, we can identify the set P with a finite product of copies of M_1(E). The metric d^1_{∞,M_TV} obviously metricizes the product topology on P and, as seen above, the set M_1(E) is compact w.r.t. d_TV. It follows from Tychonoff's theorem (see, e.g., [8, Theorem 2.2.8]) that P is compact w.r.t. d^1_{∞,M_TV} and therefore in particular relatively compact w.r.t. d^1_{∞,M_TV}. Hence, Definition 3.11(b) of 'Hadamard differentiability' (i.e. Definition 3.9 with S := S_rc) simplifies insofar as one can simply require that the convergence in (12) holds uniformly in all Q ∈ P for every sequence (ε_m) ∈ (0, 1]^ℕ. Under the imposed assumptions we may via (44) identify any transition function P ∈ P with a vector p, and since the metric d^1_{∞,M_TV} then corresponds (up to the factor 1/2) to the maximum of the ℓ^1-distances ‖p_{n,i;a_{n,i;k}} − q_{n,i;a_{n,i;k}}‖_{ℓ^1} on P, it is apparent that Definition 5.1 is a special case of Definition 3.9 with S := S_rc.
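The identification of d_TV with half the ℓ^1-distance can be checked by brute force on a three-point state space (toy data):

```python
import itertools
import numpy as np

# Two probability vectors on a 3-point state space (toy data)
mu = np.array([0.5, 0.3, 0.2])
nu = np.array([0.2, 0.3, 0.5])

# half the l1-distance of the probability vectors ...
d_l1_half = 0.5 * np.abs(mu - nu).sum()

# ... equals sup_B |mu[B] - nu[B]| over all 2^3 events B, checked exhaustively
d_sup = max(abs(mu[list(B)].sum() - nu[list(B)].sum())
            for r in range(4) for B in itertools.combinations(range(3), r))
print(d_l1_half, d_sup)   # both equal 0.3
```

The supremum is attained at the event B = {x : p_µ(x) > p_ν(x)}, which is why the factor 1/2 appears in the ℓ^1 identification.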
Note that in the finite setting there exists for any fixed P ∈ P an optimal strategy π P ∈ Π w.r.t. P , which means that the set Π(P ) is non-empty; see, e.g., [30,Proposition 4.4.3]. Also note that ψ :≡ 1 provides a bounding function for the MDM (X, A, Q, Π, r) for any Q ∈ P. Thus condition (a) of Theorem 3.14 is satisfied for ψ :≡ 1. According to Remark 3.16(ii)-(iii), conditions (b) and (c) of Theorem 3.14 are satisfied for M ′ := M TV and ψ :≡ 1, where M TV is defined as in Example 3.3. Hence, in the finite setting the assumptions of Theorem 3.14 (with M := M TV , M ′ := M TV , and ψ :≡ 1) are always fulfilled so that the representation (48) of the 'Fréchet derivative' of the value functional (with fixed initial state x 0 ∈ E) always follows from part (ii) of Theorem 3.14. Take into account that in the finite setting 'Fréchet differentiability' and 'Hadamard differentiability' are equivalent.
6 Supplement: Auxiliary definitions and results to Section 2

In this section we supplement the definitions and results of Section 2. The precise meaning of the definition in display (3) of the probability measure ℙ_{x_0,P;π} on (Ω, F) := (E^{N+1}, E^{⊗(N+1)}) is, in view of (2),

ℙ_{x_0,P;π}[B] = ∫_E ⋯ ∫_E 𝟙_B(x_0, x_1, …, x_N) P_{N−1}((x_{N−1}, f_{N−1}(x_{N−1})), dx_N) ⋯ P_0((x_0, f_0(x_0)), dx_1)

for B ∈ F, for any given x_0 ∈ E, P = (P_n)_{n=0}^{N−1} ∈ P, and π = (f_n)_{n=0}^{N−1} ∈ Π. By a (regular version of the) factorized conditional distribution of X given Y under ℙ_{x_0,P;π} we mean a probability kernel ℙ_{x_0,P;π}^{X|Y}(·, •) for which, for every B ∈ E, the random variable ω ↦ ℙ_{x_0,P;π}^{X|Y}(Y(ω), B) is a conditional probability of {X ∈ B} given Y under ℙ_{x_0,P;π}. This object is only (ℙ_{x_0,P;π} ∘ Y^{−1})-a.s. unique. Thus the formulation of (ii)–(viii) in the following lemma is somewhat sloppy. Assertion (v) in fact means that the probability kernel P_n((·, f_n(·)), •) provides a (regular version of the) factorized conditional distribution of X_{n+1} given X_n under ℙ_{x_0,P;π}, and analogously for parts (ii)–(iv) and (vi)–(viii). Note that it is also customary to write ℙ_{x_0,P;π}[X ∈ • | Y = ·] instead of ℙ_{x_0,P;π}^{X|Y}(·, •); see, for instance, (ii)–(iv) in Subsection 2.1.

(61)
Proof First of all it is clear that assertion (i) holds. Thus it suffices to show assertions (ii)-(viii).
(ii): The claim holds true, because for any B ∈ E and B 1 ∈ E.
(v): As in the proof of (iv) we obtain E x 0 ,P ;π P n (X n , f n (X n )), for any B ∈ E and B 1 ∈ E.
(vii): As in the proof of (vi) we obtain by iterating (62) along with part (v) and (61) for any B ∈ E.
(viii): Analogously to the proof of (ii) we obtain by means of part (vi) for any B ∈ E and B_1 ∈ E. This completes the proof. ✷ Note that the factorized conditional distributions in parts (ii)–(iii) and (vi) of Lemma 6.1 are constant w.r.t. x_0 ∈ E. Also note that, in view of part (vii) of Lemma 6.1, the probability measure ℙ_{x_0,P;π}^{X_k|X_n}(x_n, •) can be seen as a (k − n)-step transition probability from stage n to stage k given state x_n.
Proof First of all, it is easily seen that the identities (64) and (65) hold for any x_j ∈ E and 0 ≤ j ≤ m ≤ N.
(i): The claim is an immediate consequence of (64) and part (i) of Lemma 6.1.
Clearly, in view of part (vi) of Lemma 6.1, the assertions in (66) and (67) are valid for indicator functions and thus by linearity for simple functions. The latter assertions can be extended by the Monotone Convergence theorem to arbitrary nonnegative maps h ∈ M(E). Since the integrals on the left-hand sides of (66) and (67) exist and are finite (recall that h(X n ) ∈ L 1 (Ω, F , P x 0 ,P ;π ) for all n = 0, . . . , N by assumption), it follows that the equalities in (66) and (67) hold even for all h ∈ M(E).
The additional assertions can be verified easily by means of (60) and (61) with the same arguments as in the proof of (66) and (67). This completes the proof. ✷ Note that (for any given x 0 ∈ E, P ∈ P, and π ∈ Π) the assumption h(X n ) ∈ L 1 (Ω, F , P x 0 ,P ;π ) (for some h ∈ M(E) and any n = 0, . . . , N) is not trivially satisfied. It holds, for example, if ψ is a bounding function for the MDM (X, A, P , Π, r) (in the sense of Definition 3.1 with P ′ := {P }) and if h ∈ M ψ (E) (with M ψ (E) as in Subsection 3.1). In this case it can be easily verified by means of part (c) of Definition 3.1 (with P ′ := {P }) that indeed h(X n ) ∈ L 1 (Ω, F , P x 0 ,P ;π ) for all n = 0, . . . , N.
Note that f_n^P(·, γ) is clearly Borel measurable for any γ ∈ [0, 1], and it is easily seen that |f_n^P(y, γ)| ≤ u_α(1 + y) for every y ∈ ℝ_{≥0} and γ ∈ [0, 1]. Therefore, the function f_n^P is dominated in absolute value by the Borel measurable function h : ℝ_{≥0} → ℝ_{≥0} given by h(y) := u_α(1 + y). Set m^P := max_{k=0,…,N−1} ∫_{ℝ_{≥0}} u_α dm^P_{k+1} and note that m^P ∈ ℝ_{>0}. Since h satisfies ∫_{ℝ_{≥0}} h dm^P_{n+1} < ∞ (i.e. h is m^P_{n+1}-integrable) and f_n^P(y, ·) is continuous on [0, 1] for any y ∈ ℝ_{≥0}, we may apply the continuity lemma (see, e.g., [3, Lemma 16.1]) to obtain that the mapping F_n^P : [0, 1] → ℝ_{>0} given by F_n^P(γ) := ∫_{ℝ_{≥0}} f_n^P(y, γ) m^P_{n+1}(dy) is continuous. Along with the compactness of the set [0, 1] this ensures the existence of a solution γ_n^P ∈ [0, 1] to the optimization problem (29). Moreover, it can be verified easily by means of part (c) of Assumption (FM) that F_n^P is strictly concave; take into account that ∫_{ℝ_{≥0}} f_n^P(y, γ) m^P_{n+1}(dy) can be seen for any γ ∈ [0, 1] as the expectation of u_α(1 + γ(R_{n+1}/r_{n+1} − 1)) under ℙ. This implies that the solution γ_n^P is even unique.
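The one-stage problem (29), maximizing the strictly concave F_n^P(γ) = ∫ u_α(1 + γ(y/r_{n+1} − 1)) m^P_{n+1}(dy) over γ ∈ [0, 1], can be sketched numerically. The discrete return distribution and the power utility u_α(y) := y^α below are made-up stand-ins, not the paper's data.

```python
import numpy as np

alpha, r_bond = 0.5, 1.02                  # power-utility exponent, riskless return
returns = np.array([0.6, 1.0, 1.1, 1.5])   # hypothetical gross risky returns
probs = np.array([0.2, 0.3, 0.3, 0.2])     # their probabilities

def F(gamma):
    """F(gamma) = E[u_alpha(1 + gamma*(R/r - 1))] with u_alpha(y) = y**alpha."""
    return probs @ (1.0 + gamma * (returns / r_bond - 1.0)) ** alpha

# F is strictly concave on [0, 1], so golden-section search converges to the
# unique maximizer gamma* of the one-stage problem.
lo, hi = 0.0, 1.0
invphi = (np.sqrt(5.0) - 1.0) / 2.0
while hi - lo > 1e-10:
    a, b = hi - invphi * (hi - lo), lo + invphi * (hi - lo)
    if F(a) < F(b):
        lo = a
    else:
        hi = b
gamma_star = 0.5 * (lo + hi)
print("gamma* =", gamma_star)
```

With these stand-in numbers the maximizer lies strictly inside (0, 1), mirroring the uniqueness statement of the lemma; on the boundary cases the search would simply converge to 0 or 1.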
with F := F n (recall that F n = F n and that F n is independent of n). It is easily seen that M P n = M ′ is a subset of M P n (R ≥0 ) for any n = 0, . . . , N − 1, where M P n (R ≥0 ) is defined as in (94) in Section 10. Moreover we obviously have F ′ n = F ′ ⊆ F n for any n = 0, . . . , N − 1.
Below we will show that conditions (a)–(c) of Theorem 10.3 are met. Thus we may apply part (i) of Theorem 10.3 (Bellman equation) to obtain part (i) of Theorem 4.3. In fact, for n = N the asserted representation holds for any x_N ∈ ℝ_{≥0}, where v_N^P := 1. Now, suppose that the assertion holds for k ∈ {n + 1, …, N}. Then, using again part (i) of Theorem 10.3, we obtain for any x_n ∈ ℝ_{≥0} the identity (81). For x_n = 0 we have f_n(x_n) = 0 for any f_n ∈ F_n and therefore (in view of (81)) V_n^P(x_n) = 0. For x_n ∈ ℝ_{>0} we obtain from (81) the representation (82), where we used for the second "=" that the value of f_n(x_n) ranges over the interval [0, x_n] when f_n ranges over F_n; we can then indeed replace f_n(x_n) by γx_n when "sup_{f_n∈F_n}" is replaced by "sup_{γ∈[0,1]}". For the last step we employed the identity v_n^P = v_{n+1}^P v_n^{P;γ_n^P}. Hence we have verified the representation of the value function asserted in part (i). It remains to show that conditions (a)–(c) of Theorem 10.3 (in Section 10) are indeed satisfied.
(a): In view of (25) we obtain r_N ∈ M′ by choosing ϑ := 1 (∈ ℝ_{>0}) and κ := B_N (∈ ℝ_{≥1}). In particular, r_N ∈ M^P_{N−1}. (b): Let n ∈ {1, …, N − 1} and h ∈ M^P_n = M′, i.e. h(x) = ϑ u_α(x/κ), x ∈ ℝ_{≥0}, for some ϑ ∈ ℝ_{>0} and κ ∈ ℝ_{≥1}. Then, as in (81), we obtain the representation (83) for any x ∈ ℝ_{≥0}. For x = 0 we have f_n(x) = 0 for any f_n ∈ F_n and therefore (in view of (83)) T_n^P h(x) = 0. For x ∈ ℝ_{>0} we obtain from (83) (analogously to (82)) the representation (84), where ϑ′ := ϑ r^α_{n+1} v_n^P ∈ ℝ_{>0} is finite due to (77)–(79). Altogether we have shown that T_n^P h ∈ M′. In particular, T_n^P h ∈ M^P_{n−1}. (c): Let n ∈ {0, …, N − 1} and h ∈ M^P_n = M′ (with corresponding ϑ and κ as in (b)). Moreover, let f_n^P be the map defined in (32), and note that f_n^P ∈ F_n. Then, similarly to (83), we obtain a representation of T^P_{n,f_n}h(x) for any x ∈ ℝ_{≥0} and f_n ∈ F_n. For x = 0 we obviously have T^P_{n,f_n}h(x) = 0 and thus T^P_{n,f_n^P}h(0) = T_n^P h(0). For x ∈ ℝ_{>0} we have, similarly to (84), an analogous representation for any f_n ∈ F_n. By Lemma 4.2, the map γ ↦ ∫_{ℝ_{≥0}} u_α(1 + γ(y/r_{n+1} − 1)) m^P_{n+1}(dy) has exactly one maximum point, γ_n^P, in [0, 1]. Thus, since the second line in (84) coincides with T_n^P h(x), we obtain T^P_{n,f_n^P}h(x) = T_n^P h(x) also for any x ∈ ℝ_{>0}. Therefore the map f_n^P provides a maximizer f_n^P ∈ F_n of h with f_n^P ∈ F′_n. (ii): In the proof of (i) we have seen that the assumptions of Theorem 10.3 are fulfilled. Thus, part (i) of this theorem gives V^P_{n+1} ∈ M^P_n for any n = 0, …, N − 1. In particular, the above elaborations under (c) show that for any n = 0, …, N − 1 the map f_n^P defined by (32) provides a maximizer f_n^P ∈ F_n of V^P_{n+1} with f_n^P ∈ F′_n. Hence, part (iii) of Theorem 10.3 ensures that the strategy π_P := (f_n^P)_{n=0}^{N−1} ∈ Π_lin forms an optimal trading strategy w.r.t. P.
For the second part of the assertion we assume that there exists another optimal trading strategy π̃_P w.r.t. P with π̃_P ∈ Π_lin. Then, by the definition of Π_lin, there exists γ̃_P = (γ̃_n^P)_{n=0}^{N−1} ∈ [0, 1]^N such that π̃_P = π_{γ̃_P} := (f_n^{γ̃_P})_{n=0}^{N−1}. In particular, we have V_0^P(x_0) = V_0^{P;π_{γ̃_P}}(x_0) for any x_0 ∈ ℝ_{≥0}. Along with part (i) of this theorem and Lemma 9.1 this yields (85). Below we will show that (85) implies

v_n^P = v_n^{P;γ̃_n^P} for all n = 0, …, N − 1. (86)
Then it follows from (86) that for any n = 0, …, N − 1 the fraction γ̃_n^P ∈ [0, 1] is a solution to the optimization problem (29). However, according to Lemma 4.2, this optimization problem has exactly one solution, γ_n^P, in [0, 1]. Hence γ̃_n^P = γ_n^P for any n = 0, …, N − 1, and we arrive at π̃_P = π_P, which implies that π_P is unique among all π ∈ Π_lin(P).
It remains to show that (85) implies (86). Assume that (86) fails, i.e. that there exists some n ∈ {0, …, N − 1} with v_n^P ≠ v_n^{P;γ̃_n^P}. Then necessarily v_n^P > v_n^{P;γ̃_n^P}, because the reverse inequality would contradict the maximality of v_n^P. By assumption (85), this implies that there exists k ∈ {0, …, N − 1} with k ≠ n such that v_k^P < v_k^{P;γ̃_k^P}. This, however, contradicts the maximality of v_k^P. Hence (85) indeed implies (86). ✷

Proof of Theorem 4.6
The following Lemmas 9.1–9.3 involve the map V_n^{P;π} given by (5). In the specific setting of Subsection 4.2 this map admits the representation (87) for any x_n ∈ ℝ_{≥0}, P ∈ P_ψ, π ∈ Π, and n = 0, …, N.

Auxiliary lemmas
Lemma 9.1 Let P = (P_n)_{n=0}^{N−1} ∈ P_ψ and γ = (γ_n)_{n=0}^{N−1} ∈ [0, 1]^N be fixed. Then the map V_n^{P;π_γ} given by (87) admits the representation (88) for any x_n ∈ ℝ_{≥0} and n = 0, …, N, with the constants v_n^{P;π_γ} specified in the proof. Proof We prove the assertion in (88) by (backward) induction on n. For n = N we obtain the asserted representation by means of (87), part (iii) of Lemma 6.2, and (25) for any x_N ∈ ℝ_{≥0}, where v_N^{P;π_γ} := 1. Now, suppose that the assertion in (88) holds for k ∈ {n + 1, …, N}. Note that V_{n+1}^{P;π_γ}(·) ∈ M′ (with M′ defined as in (80)) by choosing ϑ := v_{n+1}^{P;π_γ} (∈ ℝ_{>0}) as well as κ := B_{n+1} (∈ ℝ_{≥1}), and that it can be verified easily that M′ is a subset of M^P_n(ℝ_{≥0}), where M^P_n(ℝ_{≥0}) is defined as in (94) in Section 10. Then, in view of part (i) of Proposition 10.1, for any x_n ∈ ℝ_{≥0} we obtain the asserted representation, where we used for the fifth "=" the definition of the map f_n^γ in (31). For the last step we employed v_n^{P;π_γ} = v_{n+1}^{P;π_γ} v_n^{P;γ}. Thus we have verified the representation of the map V_n^{P;π_γ} in (88). ✷ Lemma 9.2 Let M_Höl,α be defined as in Example 3.7, and let ψ be the gauge function from (26). Then the following three assertions hold.
(i) ψ is a bounding function for the MDM (X, A, Q, Π, r) for any Q ∈ P ψ .
As a consequence, part (ii) of Theorem 3.14 implies that the value functional V_0^{x_0} is 'Hadamard differentiable' at P w.r.t. M_Höl,α with the 'Hadamard derivative' V̇_{0;P}^{x_0} given by the representation asserted in Theorem 4.6. This completes the proof of Theorem 4.6. ✷

Supplement: Existence of optimal strategies
Consider the setting of Subsection 2.2, that is, let (X, A, P, Π, r) be a MDM in the sense of Definition 2.2 with fixed transition function P = (P_n)_{n=0}^{N−1} ∈ P. In this section we recall from [2] a statement on the existence of optimal strategies in the sense of Definition 2.4; see Theorem 10.3 below. Moreover, Proposition 10.1 below recalls the so-called reward iteration from [2], which is used for the proof of Theorem 10.3 (see [2, p. 23]) and in our elaborations in Sections 3–4.
Recall that we used E to denote the state space of the MDP X and that E was equipped with a σ-algebra E. For any n = 0, . . . , N − 1 we used F n to denote the set of all decision rules at time n and we fixed some F n ⊆ F n which was regarded as the set of all admissible decision rules at time n. We referred to Π := F 0 × · · · × F N −1 as the set of all admissible strategies, and we defined M(E) to be the set of all (E, B(R))-measurable functions in R E .
For any n = 0, …, N − 1, let M^P_n(E) be the set of all h ∈ M(E) satisfying ∫_E |h(y)| P_n((x, f_n(x)), dy) < ∞ for all x ∈ E and f_n ∈ F_n.
For any h ∈ M^P_n(E), n = 0, …, N − 1, and f_n ∈ F_n we may define maps T^P_{n,f_n}h : E → ℝ and T^P_n h : E → (−∞, ∞] by

T^P_{n,f_n}h(x) := r_n(x, f_n(x)) + ∫_E h(y) P_n((x, f_n(x)), dy) and T^P_n h(x) := sup_{f_n∈F_n} T^P_{n,f_n}h(x). (95)

Note that T^P_{n,f_n} and T^P_n can be seen as maps from M^P_n(E) to M(E) and from M^P_n(E) to (−∞, ∞]^E, respectively, and that T^P_n is also called the maximal reward operator at time n. Finally, recall from (5) the definition of the map V_n^{P;π}. This map can be computed via the so-called reward iteration: Proposition 10.1 Let π = (f_n)_{n=0}^{N−1} ∈ Π be fixed. If V_{n+1}^{P;π}(·) ∈ M^P_n(E) for any n = 0, …, N − 1, then the following two assertions hold.
Proof The proof of Theorem 2.3.4 in [2] can be transferred verbatim. ✷ Note that the assumption V P ;π n+1 (·) ∈ M P n (E) (for any n = 0, . . . , N − 1) is not trivially satisfied. It holds, for example, if the MDM (X, A, P , Π, r) possesses a bounding function ψ (in the sense of Definition 3.1 with P ′ := {P }). This is ensured by Lemma 3.2 with P ′ := {P }, taking into account that by (c) of Definition 3.1 we have M ψ (E) ⊆ M P n (E) (with M ψ (E) as in Subsection 3.1) for any n = 0, . . . , N − 1. In some cases the assumption in Proposition 10.1 can also be shown directly; see e.g. the proof of Lemma 9.1 in Subsection 9.3.1. Theorem 10.3 below is concerned with the existence of optimal strategies. It invokes the following definition.
Definition 10.2 For any n = 0, …, N − 1, a decision rule f_n^P ∈ F_n is called a maximizer of h ∈ M^P_n(E) if T^P_{n,f_n^P}h(x) = T^P_n h(x) for all x ∈ E. The following result, which is also known as the structure theorem, provides sufficient conditions for the existence of optimal strategies. Recall from (7) the definition of the value function V_n^P.
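On a finite state space the operators in (95) and the notion of a maximizer (Definition 10.2) reduce to finite maximizations. A toy sketch with made-up data:

```python
import numpy as np

# Toy data (hypothetical): finite state space E = {0, 1, 2}, two actions.
s, n_actions = 3, 2
r_n = np.array([[0.0, 1.0], [2.0, 0.5], [1.0, 1.0]])           # r_n(x, a)
P_n = np.random.default_rng(0).dirichlet(np.ones(s), size=(s, n_actions))
# P_n[x, a] is the next-state distribution from state x under action a

def T_f(h, f):
    """One-stage operator T^P_{n,f}h(x) = r_n(x, f(x)) + sum_y h(y) P_n((x, f(x)), y)."""
    return np.array([r_n[x, f[x]] + P_n[x, f[x]] @ h for x in range(s)])

def T(h):
    """Maximal reward operator T^P_n h(x) = sup_f T^P_{n,f}h(x); on a finite
    action set the sup over decision rules reduces to a pointwise max."""
    return np.max([[r_n[x, a] + P_n[x, a] @ h for a in range(n_actions)]
                   for x in range(s)], axis=1)

h = np.array([0.0, 1.0, 2.0])
# a maximizer of h in the sense of Definition 10.2: pointwise argmax
f_max = np.array([int(np.argmax([r_n[x, a] + P_n[x, a] @ h
                                 for a in range(n_actions)])) for x in range(s)])
print(T_f(h, f_max), T(h))   # the two vectors coincide
```

Because the action set is finite, a maximizer always exists here, so condition (c) of the structure theorem is automatic in this discrete setting.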
Theorem 10.3 Suppose that there exist for any n = 0, . . . , N − 1 sets M P n ⊆ M P n (E) and F ′ n ⊆ F n such that the following conditions hold.
(a) r N ∈ M P N −1 .
(b) For any n = 1, . . . , N − 1 and h ∈ M P n we have T P n h ∈ M P n−1 .
(c) For any n = 0, . . . , N − 1 and h ∈ M P n , there exists a maximizer f P n ∈ F n of h with f P n ∈ F ′ n .
Then the following three assertions are valid: (i) V_0^P ∈ M(E), and V_{n+1}^P ∈ M^P_n for any n = 0, …, N − 1. Moreover, V_N^P = r_N and V_n^P = T_n^P V_{n+1}^P for any n = 0, …, N − 1.
Proof The proof of Theorem 2.3.8 in [2] can be transferred verbatim. ✷ The iteration scheme in part (i) of Theorem 10.3 is known as the Bellman equation. Note that conditions (a)–(c) of Theorem 10.3 are not trivially satisfied. It is discussed in Subsection 2.4 of the monograph [2] that these conditions hold in so-called structured MDMs. In some situations, however, these conditions can be verified directly; see Subsection 9.2 (proof of Theorem 4.3) for an example. For original work on the existence of optimal strategies in MDMs see, for instance, [13, 34]. Also note that Theorem 10.3 shows that a solution to the (Markov decision) optimization problem (6) can be obtained by solving iteratively N (one-stage) optimization problems.

(i) Part (i) of Theorem 10.3 in particular implies that the value function V_n^P(·) is (E, B(ℝ))-measurable for any n = 0, …, N. The measurability of the value function is also ensured if the sets F_n, …, F_{N−1} are at most countable; take into account that the right-hand side of (7) involves the map V_n^{P;π} (as defined in (5)), which depends only on the last N − n components (f_n, …, f_{N−1}) of the strategy π = (f_n)_{n=0}^{N−1} ∈ Π. The measurability of the value function has been discussed in the literature several times; see, for instance, [13, 34].
(ii) It follows from Theorem 10.3 that any N-tuple (f P n ) N −1 n=0 of maximizers provides an optimal strategy π P w.r.t. P in the MDM (X, A, P , Π, r) via π P := (f P n ) N −1 n=0 . The reverse statement, however, is not true since even under the assumptions of Theorem 10.3 optimal strategies are not necessarily composed of maximizers; see, e.g., [2, Example 2.3.10]. Hence, Theorem 10.3 provides only a sufficient criterion for the existence of optimal strategies.
(iii) In view of the second part of (ii), an optimal strategy in a MDM can in general be non-unique. However, this does not exclude that in specific situations there is exactly one optimal strategy. For an example see Subsection 4.3.
(iv) In the case where we are interested in minimizing expected total costs in the MDM (X, A, P, Π, r) (see Remark 2.5(ii)), the integral operator T_n^P is given by (95) with "sup" replaced by "inf", and in Definition 10.2 we have to replace "maximizer" by "minimizer". ✸

11 Supplement: Topology generated by the Hölder-α metric

We use the notation and terminology introduced in Subsection 3.2. In particular, the Hölder-α metric d_Höl,α was introduced in Example 3.7 of Subsection 3.2.
Conversely, assume that µ_n → µ ψ-weakly. We have to show that for every ε > 0 there exists some n_0 ∈ ℕ such that

sup_{h∈M_Höl,α} |∫_E h dµ_n − ∫_E h dµ| ≤ ε for all n ≥ n_0. (96)
For any K > 0, the left-hand side of (96) is bounded above by the sum in (97). The first summand in (100) converges to 0 as n → ∞, because µ_n → µ ψ-weakly. Thus we can find n_0 ∈ ℕ such that it is bounded above by ε/5 for every n ≥ n_0. Since µ ∘ ψ^{−1}, as a probability measure on the real line, has at most countably many atoms, we may and do assume that K > 0 is chosen such that µ[{ψ = K}] = 0. Since µ_n → µ (ψ-weakly and thus) weakly, it follows by the portmanteau theorem that the second summand in (100) converges to 0 as n → ∞. By possibly increasing n_0 we obtain that the second summand in (100) is at most ε/5 for all n ≥ n_0. So far we have shown that the second summand in (97) is bounded above by 4ε/5 for all n ≥ n_0. As the functions in M_{Höl,α;K} := {h_K : h ∈ M_Höl,α} are uniformly bounded and equicontinuous, Corollary 11.3.4 in [8] ensures that one can increase n_0 further such that the first summand in (97) is bounded above by ε/5 for all n ≥ n_0. That is, we arrive at (96). ✷