Solving optimal stopping problems under model uncertainty via empirical dual optimisation

In this work, we consider optimal stopping problems with model uncertainty incorporated into the formulation of the underlying objective function. Typically, the robust, efficient hedging of American options in incomplete markets may be described as an optimal stopping problem of this kind. Based on a generalisation of the additive dual representation of Rogers (Math. Financ. 12:271–286, 2002) to the case of optimal stopping under model uncertainty, we develop a novel regression-based Monte Carlo algorithm for the approximation of the corresponding value function. The algorithm involves optimising a penalised empirical dual objective functional over a class of martingales. This formulation allows us to construct upper bounds for the optimal value with reduced complexity. Finally, we carry out a convergence analysis of the proposed algorithm and illustrate its performance by several numerical examples.


Introduction
In this paper, we consider optimal stopping problems under model uncertainty in terms of ambiguity aversion. By representation results, this means that we look at stochastic optimisation problems of the form

V_0 := sup_{τ∈T} sup_{Q∈Q} (E_Q[Y_τ] − β(Q)),   (1.1)

where T and Q denote the set of stopping times and a set of probability measures, respectively, whereas β stands for a convex penalty function (see Maccheroni et al. [27]). In the special case where β vanishes on Q, (1.1) reduces to the robust optimal stopping problem

sup_{τ∈T} sup_{Q∈Q} E_Q[Y_τ].   (1.2)

A typical example arises in the hedging of American options in incomplete markets (see Föllmer and Schied [20, Chap. 8]). If the seller of such an American option is only willing to invest an amount c strictly smaller than the superhedging price, then for any stopping time τ ∈ T, the random variable Y_τ may represent the shortfall risk of a hedging strategy with initial investment c when the American option is exercised at τ. Then sup_{Q∈Q} (E_Q[Y_τ] − β(Q)) gives a robust quantification of the shortfall risk at time τ, reflecting the seller's model uncertainty.
The aim of the present paper is to solve the stopping problem (1.1) numerically. We restrict ourselves to penalty functions in the form of divergence functionals with respect to a reference probability measure. In this case, (1.1) reads as

sup_{τ∈T} sup_{Q∈Q} (E_Q[Y_τ] − E[Φ(dQ/dP)]),   (1.3)

where Φ : [0, ∞) → [0, ∞] denotes a lower semicontinuous convex function and Q consists of all probability measures Q which are absolutely continuous with respect to some reference probability measure P. Besides standard optimal stopping, prominent specialisations of (1.3) are optimal stopping under average value at risk and under the family of entropic risk measures. Our investigations are built upon a specific representation of (1.3) established in Belomestny and Krätschmer [8] (with a refinement in Belomestny and Krätschmer [9]). The crucial point is that we may reformulate the optimal stopping problem in terms of a family of standard optimal stopping problems parametrised by a set of real numbers. This allows us to derive a so-called additive dual representation generalising the well-known dual representation of Rogers [30] for standard optimal stopping problems, given by

V* = inf_{M∈M} E[sup_{t∈[0,T]} (Z_t − M_t)],   (1.4)

where (Z_t) is an adapted cash-flow process and M is the set of all (F_t)-martingales starting in 0 at t = 0. We use this new generalised dual representation to efficiently construct Monte Carlo upper bounds for the value of optimal stopping problems under model uncertainty. As to standard optimal stopping problems, several Monte Carlo algorithms for constructing upper-biased estimators for V* based on (1.4) have been suggested in the literature. They typically consist of two steps: a) apply some numerical method to construct a martingale M which is close to optimality; b) estimate E[sup_{t∈[0,T]} (Z_t − M_t)] by the sample mean, using a new independent sample (testing sample).
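The two-step scheme can be sketched as follows. This is a toy Python sketch: the cash-flow process and the deliberately crude martingale in step a) are our own placeholders, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)

def dual_upper_bound(Z, M):
    """Step b): sample mean of the pathwise maxima of Z - M.

    Z, M : arrays of shape (n_paths, n_times); M[:, 0] must be 0.
    Returns the upper-biased estimate and its standard error.
    """
    pathwise_max = np.max(Z - M, axis=1)
    return pathwise_max.mean(), pathwise_max.std(ddof=1) / np.sqrt(len(pathwise_max))

# toy cash-flow process Z_t = max(X_t, 0) for a Gaussian random walk X
n_paths, n_times = 10_000, 50
X = np.cumsum(rng.normal(size=(n_paths, n_times)), axis=1)
Z = np.maximum(X, 0.0)

# step a) placeholder: X is itself a martingale, so M_t = X_t - X_0
# is a martingale starting in 0 (a better choice would tighten the bound)
M = X - X[:, [0]]

ub, se = dual_upper_bound(Z, M)
```

In practice, step a) would supply a near-optimal martingale; the closer M is to the Doob martingale, the smaller both the upper bias and the reported standard error.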
All the existing dual Monte Carlo algorithms can be divided into two broad categories depending on how the martingale M is constructed. In the first class of algorithms (see for example Andersen and Broadie [2] and Glasserman [22, Chap. 8]), one approximates the Doob martingale M* of the Snell envelope, which satisfies

sup_{t∈[0,T]} (Z_t − M*_t) = V*   P-a.s.   (1.5)

Because of (1.5), we say that the Doob martingale is surely or strongly optimal. In the second class of algorithms, one tries to solve the dual optimisation problem (1.4) directly by using methods of stochastic approximation and some parametric subclasses of M. Let us mention Desai et al. [18], where the authors essentially applied the sample average approximation (SAA) approach and used a nested Monte Carlo method to construct a suitable finite-dimensional linear space of martingales, thus casting the resulting minimisation problem into a linear program. However, it was demonstrated later in Belomestny [6] that the approach in Desai et al. [18] may end up with martingales M that are close to optimal, but only in expectation, with the variance of the random variable sup_{t∈[0,T]} (Z_t − M_t) being relatively high. In contrast, due to (1.5), for a martingale that is close to the Doob martingale M* (in an L² sense, for instance), this variance will be close to zero. Consequently, the estimation in step b) can be done more efficiently for such a martingale. Thus it is essential to find martingales that are "close" to the Doob martingale, or at least "close" to a surely optimal martingale. In this respect, Belomestny [6] proposed a modification of the plain SAA approach based on variance penalisation. The convergence analysis of this algorithm reveals that the variance of the random variable sup_{t∈[0,T]} (Z_t − M_t) converges to zero as the number of paths used to build M increases.
The contribution of the current work is twofold. First, we generalise the approach of Belomestny [6] to the case of optimal stopping problems under model uncertainty by using the dual representation of Belomestny and Krätschmer [8]. Second, we provide a thorough convergence analysis of the proposed algorithm. The main theoretical challenge is to extend the analysis of Belomestny [6] to objective functions involving empirical expectations and empirical variances of much more complicated objects than in Belomestny [6]; see Sect. 3. We use essentially different techniques (e.g. different concentration inequalities) and derive faster convergence rates that improve upon those in Belomestny [6] for standard optimal stopping problems. We also illustrate our results in a diffusion setting, where, by the martingale representation theorem, martingales can be written as stochastic integrals with respect to the driving Brownian motion. As compared to Belomestny [6], we consider here not only parametric linear families of martingales, but rather general nonparametric ones defined as stochastic integrals with smooth integrands.
Putting our contribution into perspective, it should be emphasised that one cannot utilise any general device suggested in the literature to analyse the optimal stopping problem (1.1) or even (1.3). To the best of our knowledge, there exist two general strategies, both based on some underlying filtered probability space (Ω, F, (F_t)_{0≤t≤T}, P). The first focuses on sets Q for which we may find conditional nonlinear expectations extending the functional ρ and satisfying a property called time-consistency, which extends the tower property of conditional expectations. Time-consistency, sometimes also called recursiveness, allows extending the dynamic programming principle from standard optimal stopping problems to optimal stopping problems of the form (1.1). Studies following this line of reasoning may be found e.g. in Trevino-Aguilar [34, Sects. 4.1 and 4.2], Bayraktar and Yao [3], Bayraktar and Yao [4], Ekren et al. [19] and Bayraktar and Yao [5] (see also Riedel [29], Krätschmer and Schoenmakers [25] and Föllmer and Schied [20, Chap. 6] for the discrete-time case). Unfortunately, this approach requires very restrictive conditions on Q, at least for the optimal stopping problem (1.2) (see e.g. Delbaen [17], or Belomestny et al. [7] for the case where Q consists of probability measures that are equivalent to P). Even worse, it is known from Kupper and Schachermayer [26] that for the optimal stopping problem (1.3), Q meets this requirement in two cases only. These choices of Φ correspond to standard optimal stopping and optimal stopping under entropic risk measures.
The second approach proposed very recently in Huang and Yu [24] and Huang et al. [23] offers a way to solve the optimal stopping problem (1.2) when a dynamic programming principle cannot be applied. The main idea in these papers is to tackle optimal stopping within a game-theoretic framework and look for Nash subgame perfect equilibria. This line of reasoning refers to a long history in economics on how to deal with time-inconsistent dynamic utility maximisation, going back to Strotz [33], Selten [31] and Selten [32]. It has become popular for applications in stochastic finance due to the contributions by Björk and Murgoci [13] and Björk et al. [12], where the authors treat stochastic control problems which do not admit a Bellman optimality principle. Formally, the expected payoffs corresponding to the equilibria approximate the optimal values of (1.2) from below. However, this approach cannot be used directly for the optimal stopping problem (1.3) since this reduces to (1.2) only in a few cases (see Ben-Tal and Teboulle [11]), with optimal stopping under average value at risk as the outstanding representative. Moreover, a numerical method to calculate the payoffs at the equilibria is missing.
In conclusion, the existing literature on robust optimal stopping does not lead in general to a constructive numerical approach to calculate the optimal values of (1.3). This paper offers a method to deal with this problem and is completed by studying its theoretical properties.
The paper is organised as follows. In Sect. 2, we introduce convex risk measures and give some examples. Then we introduce primal and dual representations for our optimal stopping problems under model uncertainty. In Sect. 3, we develop a Monte Carlo functional optimisation algorithm based on the derived dual representation. Then we analyse its convergence towards the solution, depending on the number of Monte Carlo paths and complexity of the underlying functional class. The results are specified to a setting of diffusion processes in Sect. 4. Afterwards, we present some numerical results in Sect. 5. The proofs of the results from Sects. 3 and 4 are given in Sects. 6-8.

Setup
Let 0 < T < ∞ and let (Ω, F, (F_t)_{0≤t≤T}, P) be a filtered probability space, where (F_t)_{t∈[0,T]} is a right-continuous filtration with F_0 complete and trivial. We also impose the following requirement: for each t > 0, the space (Ω, F_t, P|_{F_t}) is atomless and L¹(Ω, F_t, P|_{F_t}) is weakly separable. Consider a lower semicontinuous convex mapping Φ : [0, ∞) → [0, ∞] whose convex conjugate

Φ*(y) := sup_{x≥0} (xy − Φ(x))

is a finite nondecreasing convex function whose restriction to [0, ∞) is a finite Young function, that is, Φ* : [0, ∞) → [0, ∞) is convex, nondecreasing and satisfies Φ*(0) = 0 as well as lim_{x→∞} Φ*(x) = ∞. Consider the space

H^{Φ*} := {X ∈ L⁰ : E[Φ*(γ|X|)] < ∞ for all γ > 0},

where L⁰ is the class of all (equivalence classes of) finite-valued random variables. For abbreviation, let us introduce the functional ρ : H^{Φ*} → R defined by

ρ(X) := sup_{Q∈Q} (E_Q[X] − E[Φ(dQ/dP)]),

where Q stands for the set of all probability measures Q which are absolutely continuous with respect to P and such that Φ(dQ/dP) is P-integrable. Note that X dQ/dP is P-integrable for every Q ∈ Q and any X ∈ H^{Φ*} due to Young's inequality.

Example 2.1
Let us illustrate our setup in the case of the so-called average value at risk, also known as expected shortfall or conditional value at risk. The average value at risk at level α ∈ (0, 1] is defined as the functional

AV@R_α(X) := (1/(1 − α)) ∫_α^1 F^←_X(u) du,

where X is P-integrable and F^←_X denotes the left-continuous quantile function of the distribution function F_X of X, defined by F^←_X(u) := inf{x ∈ R : F_X(x) ≥ u}. It admits the representation ρ = AV@R_α with Φ = Φ_α, where Φ_α stands for the function defined by Φ_α(x) = 0 for x ≤ 1/(1 − α), whereas Φ_α(x) = ∞ otherwise (cf. Föllmer and Schied [20, Theorem 4.52]). Observe that the corresponding set Q_α consists of all probability measures Q on F with dQ/dP ≤ 1/(1 − α) P-a.s.
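With the upper-tail-average convention above, an empirical version of AV@R_α can be sketched as follows; the cell-weighting discretisation of the quantile integral is our own.

```python
import numpy as np

def avar(samples, alpha):
    """Empirical average value at risk at level alpha in [0, 1):
    the mean of the upper (1 - alpha)-tail of the distribution,
    i.e. 1/(1 - alpha) times the integral of the quantile function
    over (alpha, 1)."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = len(x)
    u = (np.arange(n) + 1.0) / n            # right endpoints of quantile cells
    lo = np.maximum(u - 1.0 / n, alpha)     # clip each cell at alpha
    w = np.clip(u - lo, 0.0, None)          # cell mass lying above alpha
    return float(np.sum(w * x) / (1.0 - alpha))

x = np.array([1.0, 2.0, 3.0, 4.0])
# alpha = 0.5: average of the top half -> (3 + 4) / 2 = 3.5
print(avar(x, 0.5))
```

For α → 0 the functional reduces to the plain expectation, for α → 1 it approaches the essential supremum.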
Consider now a right-continuous nonnegative stochastic process (Y_t) adapted to (F_t). Furthermore, let T consist of all [0, T]-valued stopping times τ with respect to (F_t). The main object of our study is the optimal stopping problem

V_0 := sup_{τ∈T} ρ(Y_τ).   (2.1)

Let int(dom(Φ)) denote the topological interior of the effective domain of the mapping Φ : [0, ∞) → [0, ∞]. We assume that Φ is a lower semicontinuous convex function satisfying 1 ∈ int(dom(Φ)). Denote by M_0 the set of all martingales (M_t)_{t∈[0,T]} with M_0 = 0 such that sup_{t∈[0,T]} |M_t| is P-integrable. The following result was proved in Belomestny and Krätschmer [8] along with Belomestny and Krätschmer [9]. We point out that its proof uses that (Ω, F_t, P|_{F_t}) is atomless and that L¹(Ω, F_t, P|_{F_t}) is weakly separable, for each t > 0.

Theorem 2.2
If there is some p > 1 such that sup_{t∈[0,T]} |Φ*(x + Y_t)| is P-integrable of order p for any x ∈ R, then we have the dual representations

V_0 = inf_{x∈K} (sup_{τ∈T} E[Φ*(x + Y_τ)] − x) = inf_{x∈K} inf_{M∈M_0} (E[sup_{t∈[0,T]} (Φ*(x + Y_t) − M_t)] − x) = inf_{x∈K} (E[sup_{t∈[0,T]} (Φ*(x + Y_t) − M^{*,x}_t)] − x).   (2.2)

Here M^{*,x} is the martingale part of the Doob–Meyer decomposition of the Snell envelope V^x, and K ⊆ R denotes a suitably chosen compact set.

Remark 2.3
The above dual representation is remarkable for at least two reasons.
Firstly, it allows one to construct upper bounds for the value V_0 by choosing a martingale M from the set M_0. Secondly, if the optimal martingale M^{*,x} is found, then a single trajectory of the reward process Y and of the martingale M^{*,x} suffices to compute V_0 with no error. In this sense, such a dual representation can be computationally more efficient than the primal one.
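A toy illustration of this zero-error property, in the simplest possible setting where the cash-flow process is itself a martingale (so that its Snell envelope coincides with it and the Doob martingale is M*_t = Z_t − Z_0):

```python
import numpy as np

rng = np.random.default_rng(1)

# a martingale cash-flow: symmetric random walk started at 5
n, L = 1000, 30
Z = 5.0 + np.concatenate(
    [np.zeros((n, 1)), np.cumsum(rng.choice([-1.0, 1.0], size=(n, L)), axis=1)],
    axis=1,
)

# For a martingale Z, the Snell envelope equals Z and its Doob martingale
# is M*_t = Z_t - Z_0, so Z_t - M*_t = Z_0 on every single path.
M_star = Z - Z[:, [0]]
pathwise_max = np.max(Z - M_star, axis=1)

print(pathwise_max.var())   # exactly 0: the dual estimator has zero variance
```

On every path the maximum of Z − M* equals the constant Z_0 = 5, so the resulting estimator has zero variance.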

Remark 2.4
We may describe more precisely how to choose the compact set K in Theorem 2.2. First of all, observe that under the assumptions of this theorem, the representation results yield an estimate that holds for any real number x. Next, by assumption we may find 0 ≤ x_0 < 1 < x_1 such that x_0, x_1 belong to the effective domain of Φ. Then, using the definition of Φ*, it is easy to check that the infimum over x in (2.2) may be restricted to a bounded interval [a_ℓ, a_u]. Hence any compact set K ⊇ [a_ℓ, a_u] may be used in Theorem 2.2. We can derive a more accessible choice for the set K in the case of the average value at risk AV@R_α: by nonnegativity of the process (Y_t), any compact K ⊇ [a^α_ℓ, a^α_u] is a proper choice in Theorem 2.2 for ρ = AV@R_α.
In the next section, we propose a Monte Carlo method for solving the dual optimisation problem (2.2) empirically.

Dual empirical minimisation
The representation result in Theorem 2.2, in particular (2.2), is the starting point for our method to solve the optimal stopping problem (2.1). We start by fixing a metric space Ψ and a family (M_t(ψ))_{t∈[0,T]} of martingales parametrised by ψ ∈ Ψ, adapted to (F_t)_{t∈[0,T]} and satisfying M_0(ψ) = 0. Define the process Z = (Z(x, ψ)) via

Z(x, ψ) := sup_{t∈[0,T]} (Φ*(x + Y_t) − M_t(ψ)) − x.

We shall find the "best" pair (x, ψ) by solving an empirical optimisation problem on a set of trajectories. To this end, we define the product space (Ω^N, F^⊗N, P^N) and its natural projections, as well as the processes Z^(i), i = 1, 2, ..., on Ω^N × R × Ψ, where Z^(i)(x, ψ) denotes the value of Z(x, ψ) along the i-th trajectory. Fix some λ > 0 and let (x_n, ψ_n) denote one of the random solutions of the random optimisation problem (3.1), in which the empirical mean of Z^(1)(x, ψ), ..., Z^(n)(x, ψ) is penalised by λ times their empirical variance, and where K is a compact set in R as in Theorem 2.2. As n → ∞, this optimisation problem becomes P^N-a.s. close to its population counterpart, whose variance term vanishes at a surely optimal martingale (see Rogers [30]). In this way, a variance reduction effect can be achieved, as we shall illustrate in Sect. 5.
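A minimal sketch of a variance-penalised empirical dual criterion of this type. Our assumptions for illustration only: an entropic-type conjugate Φ*(z) = e^z − 1, a polynomial-in-time integrand for the martingale, and a crude random search in place of a proper optimiser; the paper's basis and solver differ.

```python
import numpy as np

rng = np.random.default_rng(2)

def penalised_objective(theta, Y, dW, lam, t_grid):
    """Empirical mean of Z(x, psi) plus lam times its empirical variance.

    theta = (x, coefficients of psi); Phi* is taken to be the entropic
    conjugate exp(.) - 1 purely for illustration.
    """
    x, coefs = theta[0], theta[1:]
    psi = np.polyval(coefs, t_grid[:-1])          # integrand, polynomial in t
    M = np.concatenate(
        [np.zeros((Y.shape[0], 1)), np.cumsum(psi * dW, axis=1)], axis=1
    )                                             # M_0 = 0 by construction
    Z = np.max(np.expm1(np.clip(x + Y, None, 30.0)) - M, axis=1) - x
    return Z.mean() + lam * Z.var(ddof=1)

# toy data: reward Y_t = |W_t| on an equidistant grid
n, L, T = 500, 20, 1.0
t_grid = np.linspace(0.0, T, L + 1)
dW = rng.normal(scale=np.sqrt(T / L), size=(n, L))
W = np.concatenate([np.zeros((n, 1)), np.cumsum(dW, axis=1)], axis=1)
Y = np.abs(W)

# crude random search as an optimiser placeholder
candidates = [np.zeros(4)] + [rng.normal(scale=0.5, size=4) for _ in range(200)]
best = min(penalised_objective(th, Y, dW, 1.0, t_grid) for th in candidates)
```

In the paper, the minimisation runs over a compact set K × Ψ and the martingale family is nonparametric; the sketch only conveys the shape of the objective.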
Let us now analyse the properties of the measurable selector (x_n, ψ_n). For any n ∈ N, denote by D_n the set of Monte Carlo paths of the process Z used to construct (x_n, ψ_n). With a slight abuse of notation, we set for (x, ψ) ∈ K × Ψ

Q_λ(x, ψ) := E[Z(x, ψ)] + λ Var[Z(x, ψ)].

Now let (K × Ψ)_η denote the set of centres of a covering of K × Ψ by a minimal number of η-balls with respect to the (semi)metric d. Then define

γ(K × Ψ, n) := inf{η > 0 : log N(K × Ψ, η) ≤ nη},

where N(K × Ψ, ε) stands for the minimal number of open d-balls with radius ε > 0 needed to cover the set K × Ψ. We tacitly set N(K × Ψ, ε) = ∞ if no finite cover is available.
Then it holds for all n ∈ N that

Corollary 3.2 Let all assumptions of Theorem 3.1 be valid and further assume that
Then for all ε > 0, the corresponding tail bound holds. In some situations, the bounds of Theorem 3.1 can be improved. Suppose that (3.2) holds, that is, the set Ψ is assumed to be rich enough such that the solution (x*, ψ*) satisfies M(ψ*) = M^{*,x*}, where M^{*,x*} is the martingale part of the Doob–Meyer decomposition of V^{x*}. As already mentioned above, in this case it holds that Var[Z(x*, ψ*)] = 0. In the case that M(ψ*) = M^{*,x*} for some ψ*, the selectors (x_n, ψ_n) are nothing else but so-called M-estimators, so we may invoke the established theory on the asymptotics of M-estimation. The reader is referred to van der Vaart and Wellner [35, Chap. 3] for comprehensive information. In this theory, a starting point is the so-called "well-separated minimum" condition (3.3), a basic assumption needed to find general criteria which ensure that the sequence (x_n, ψ_n)_{n∈N} converges in probability to (x*, ψ*) (see van der Vaart and Wellner [35, Corollary 3.2.3]). Since our metric d is assumed to be totally bounded, the topological closure of the set {Z(x, ψ) : (x, ψ) ∈ K × Ψ} with respect to the L¹-norm is compact. Note then that condition (3.3) is satisfied if and only if the restriction of the expectation operator to this L¹-closure attains a well-separated minimum. If we are interested in convergence rates, we must complement the "well-separated minimum" condition. The following type of identifiability condition (3.4) is now standard in the literature on M-estimation (see van der Vaart and Wellner [35, Theorem 3.2.5]): there exist C, δ > 0 such that (3.4) holds. Now we are prepared to improve the convergence rates.
where c_1, c_2 > 0 are some universal constants. In this sense, the bound in Theorem 3.3 is better than the one in Theorem 3.1.

Specification analysis for the class Ψ
In this section, we specify the convergence rates in (3.5) depending on the properties of the parameter space Ψ. The convergence rate strongly depends on the quantity γ(K × Ψ, n). This quantity in turn depends on the set Ψ. Thus to analyse the convergence rate, we have to study the covering number of Ψ. In what follows, we consider parametric families of martingales arising in the setting of diffusion processes. Let (S_t)_{t∈[0,T]} denote a d-dimensional diffusion process solving the system of SDEs

dS_t = μ(S_t) dt + σ(S_t) dW_t,   S_0 = s_0,   (4.1)

driven by a d-dimensional Brownian motion (W_t). Then any square-integrable martingale adapted to the filtration generated by (W_t) and starting in 0 has a representation (4.2) as a stochastic integral with respect to (W_t). More specifically, we may choose the integrand in the form (4.3) below; see Ye and Zhou [37, Theorem 5]. Therefore it is reasonable to parametrise a subclass of square-integrable martingales adapted to (F_t) by stochastic integrals of this form. Note that this type of representation was already used to solve optimal stopping/control problems in a dual formulation; see e.g. Wang and Caflisch [36] and Ye and Zhou [37]. Denote by H^s_p(R^d) the Sobolev space consisting of all functions f ∈ L_p(R^d) such that for every multi-index α with |α| ≤ s, the mixed partial derivative D^α f exists in the weak sense and belongs to L_p(R^d). Further let β ∈ R, and let π_t denote the density function of S_t. Now let us first look at convergence rates in the case that Var[Z(x*, ψ*)] does not vanish. Built upon an integrability condition on the density process (π_t), they will be described in terms of the degree s of smoothness that the functions in Ψ fulfil, and the dimension d of their domain. Recalling the process Z = (Z(x, ψ)) introduced at the beginning of Sect. 3, the following result is an application of Theorem 3.1.
In addition, suppose that, as in case 2), α > s − (d + 1)/2 and s/(s + d + 1) > 1/2. The parameter α in Theorem 4.1 may be viewed as a degree of integrability for the density process (π_t). The terms s/(s + d + 1) and (α/(d + 1) + 1/2)/(α/(d + 1) + 3/2) occurring in the result are nondecreasing in s and α, respectively. So Theorem 4.1 tells us that for a fixed degree of integrability, the convergence rates are nondecreasing with respect to the degree of smoothness. However, the second and fourth cases show that in the case of a significant degree of smoothness in comparison with the dimension d, there is always a point of saturation beyond which the convergence rates cannot be improved by higher degrees of smoothness. In addition, for a given degree of smoothness, the higher the degree of integrability, the better the convergence rates, again with certain points of saturation. Let us turn to the situation when the assumptions of Theorem 3.3 hold. We may derive from Theorem 3.3 the next result, which is qualitatively of the same nature as Theorem 4.1, but with doubled convergence rates.
Remark 4.3 Theorem 4.1 implies that Q_λ(x_n, ψ_n) converges to Q_λ(x*, ψ*) at a rate depending on the smoothness of the density π_t(x) and on its decay for |x| → ∞. It is well known (see Friedman [21, Theorem 9.8]) that if the diffusion coefficient σ is uniformly elliptic and the coefficients μ and σ are infinitely differentiable on [0, T] × R^d with bounded derivatives of any order, then ∂^s_t ∂^r_x π_t(x) exists for all positive integers r and s. Moreover, a corresponding bound holds for all x ∈ R^d and t > 0 with some c > 0.
Here ≲ means that the above inequality holds up to a constant depending only on s and r. Hence (4.4) holds for an arbitrarily large α ≥ β, with arbitrary but fixed β ≤ 0, and s > d + 1. Here we refer to the norm introduced in Theorem 4.1.
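The parametrisation of martingales by discretised stochastic integrals used in this section can be checked numerically. The driftless diffusion and the integrand below are hypothetical placeholders; the check verifies M_0 = 0 and the martingale property E[M_T] = 0.

```python
import numpy as np

rng = np.random.default_rng(3)

def stochastic_integral(psi, S, dW):
    """Euler approximation of M_t(psi) = int_0^t psi(S_u) dW_u:
    M_{k+1} = M_k + psi(S_k) * dW_k, hence M_0 = 0 by construction."""
    incr = psi(S[:, :-1]) * dW
    return np.concatenate(
        [np.zeros((S.shape[0], 1)), np.cumsum(incr, axis=1)], axis=1
    )

# toy driftless diffusion dS = 0.5 * S dW (Euler scheme)
n, L, T = 100_000, 50, 1.0
dW = rng.normal(scale=np.sqrt(T / L), size=(n, L))
S = np.empty((n, L + 1))
S[:, 0] = 1.0
for k in range(L):
    S[:, k + 1] = S[:, k] * (1.0 + 0.5 * dW[:, k])

# any bounded integrand yields a square-integrable martingale starting in 0
M = stochastic_integral(np.sin, S, dW)
```

Because each increment ψ(S_k) dW_k has conditional mean zero, the sample mean of M_T over many paths is close to zero regardless of the integrand chosen.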

Numerical results
We use the Euler scheme with L = 200 discretisation points to approximate the solution of the SDE (4.1). In particular, we discretise the interval [0, T] with the equidistant grid t_l = lT/L, l = 0, ..., L. Then, for computational reasons, we smooth our objective function by replacing the maximum over the time grid with a soft-max of p-norm type, yielding a smoothed version Z_p of Z. Note that for p → ∞, the pointwise convergence Z_p → Z holds. This follows from well-known relationships between L_p-norms (see e.g. Aliprantis and Border [1, Lemma 13.1]). For our numerical study, we focus on the optimal stopping problems sup_{τ∈T} AV@R_{1−α}(Y_τ), where AV@R_{1−α} denotes the risk measure average value at risk at level 1 − α as introduced in Example 2.1. The real-valued martingale M_t(ψ) can be approximated by the Riemann-type sum of the increments ψ(t_k, S_{t_k})(W_{t_{k+1}} − W_{t_k}). For the space Ψ, we take a linear span of trigonometric basis functions and use a gradient-based method to solve the resulting optimisation problem. Next, we present numerical examples of pricing American put and Bermudan max-call options. Some of these examples were discussed for standard optimal stopping in Glasserman [22, Chap. 8] and in Belomestny [6]. Note also that for the stopping problems considered in this section, some examples were presented in Belomestny and Krätschmer [8], albeit with different parameters.
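The soft-max smoothing by p-norms can be sketched as follows; this is a generic stabilised p-norm over the time grid, and the exact form of Z_p in (5.1) may differ.

```python
import numpy as np

def soft_max(z, p):
    """Smooth p-norm surrogate for the maximum over the time grid
    (for z >= 0); converges pointwise to z.max() as p -> infinity."""
    z = np.asarray(z, dtype=float)
    m = z.max()                  # factor out the max for numerical stability
    if m == 0.0:
        return 0.0
    return m * (np.sum((z / m) ** p)) ** (1.0 / p)

z = np.array([0.2, 1.0, 0.7, 0.4])
for p in (2, 8, 32, 128):
    print(p, soft_max(z, p))     # decreases towards max(z) = 1.0
```

Unlike the hard maximum, the p-norm is differentiable in the underlying parameters, which is what makes gradient-based optimisation over Ψ feasible.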
Here K_{c,p} denotes the strike price. Under these conditions, our algorithm approximates the solution of the corresponding optimal stopping problem sup_{τ∈T} AV@R_{1−α}(Y_τ). In our implementation, we let Ψ be a linear space of functions ψ spanned by the chosen trigonometric basis. First we generate n = 10'000 paths to obtain the optimal values (x_n, ψ_n). Then we generate 100'000 new paths to test the solution. For K_{c,p} = 100, D = 10 and α = 0.05, the results are presented in Table 1. It is interesting to see how the upper bounds depend on α. Setting S_0 = 100 and using the same parameter values as above, we obtain the corresponding results; in the computation, we first evaluate the expectation of the positive part and then divide the result by α. This allowed us to increase p in (5.1) and to get better results.
Here r, δ, σ are constants. This system of SDEs describes two identically distributed assets, each yielding a dividend rate δ. At any time t ∈ {t_0, ..., t_I}, the holder of the option may exercise it and receive the payoff. The results are presented in Table 3. As in Example 5.1, it is interesting to vary α. Fixing S^1_0 = S^2_0 = 100, we get the results presented in Table 4. In order to compare the current approach with the one used in Belomestny and Krätschmer [8, Table 1], we take S^1_0 = S^2_0 = 90 and α ∈ {0.33, 0.5, 0.67, 0.75}. The corresponding results are presented in Table 5. The upper bounds are worse than those in [8]; note, however, that in [8] a nested approach to compute martingales was used.
We define our reward function accordingly. For I = 9, T = 0.5, r = 0.06, δ = 0, K_{c,p} = 100, σ = 0.6 and with the basis functions used in Example 5.2, we get the results presented in Table 6. By varying α, we get the results for S^1_0 = S^2_0 = 100 which are presented in Table 7.
In all the above examples, it is important to find a suitable compact subset K of R in Theorem 2.2. Using the notation of Remark 2.4, this can be reduced to finding a lower estimate of a^{1−α}_ℓ and an upper estimate of a^{1−α}_u. For this purpose, note that in any of the above examples, the desired estimates may be derived from upper estimates for the quantity sup_{τ∈T} E[S_τ], where, for some μ, σ ∈ R, dS_t = μS_t dt + σS_t dW_t, S_0 = s_0. We may invoke the reflection principle for Brownian motion to obtain such upper estimates. Once we have found a suitable interval K := [a_ℓ, a_u], we proceed in the following way. First we fix a grid X = {a_ℓ = x_0 < x_1 < ... < x_J = a_u}. Then, for a fixed x ∈ X, we use the Longstaff–Schwartz algorithm to approximate the value of the corresponding standard optimal stopping problem, where X is the underlying Markov process with values in R^d and f : R^d → R. To this end, we use a time discretisation by fixing a time grid 0 = t_0 < t_1 < ... < t_L = T on [0, T]. The LS algorithm is now used to obtain estimates C^x_0, ..., C^x_L for the corresponding continuation functions based on polynomials of degree 3 and with 100'000 Monte Carlo paths of the process X. After that, we approximate the value by the sample mean over n trajectories of the process X that are independent of those used to approximate the continuation values. Note that due to the discretisation in x, we may incur an additional upward bias in the estimate (5.2). On the other hand, the time discretisation introduces a downward bias that can partly compensate it. Our numerical experiments suggest that both biases are negligible (for large enough J and L) compared to the downward bias due to the error of approximating the underlying continuation functions.
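A minimal Longstaff–Schwartz sketch for a single-asset Bermudan put, with polynomial regression of degree 3 as above; the parameters and payoff are illustrative, not those of the tables.

```python
import numpy as np

rng = np.random.default_rng(4)

def longstaff_schwartz(S, payoff, df, degree=3):
    """Minimal Longstaff-Schwartz: regress discounted continuation values
    on polynomials of the state and exercise when the immediate payoff
    beats the estimated continuation value."""
    n, L1 = S.shape
    cash = payoff(S[:, -1])                       # value at the last date
    for k in range(L1 - 2, 0, -1):                # backwards, skip t = 0
        itm = payoff(S[:, k]) > 0                 # regress on ITM paths only
        cash = df * cash                          # discount to time t_k
        if itm.sum() > degree + 1:
            coefs = np.polyfit(S[itm, k], cash[itm], degree)
            cont = np.polyval(coefs, S[itm, k])
            ex = payoff(S[itm, k]) >= cont
            cash[np.flatnonzero(itm)[ex]] = payoff(S[itm, k])[ex]
    return df * cash.mean()                       # discount t_1 -> t_0

# GBM paths for an at-the-money Bermudan put
n, L, T, r, sigma, K, S0 = 50_000, 50, 1.0, 0.06, 0.4, 100.0, 100.0
dt = T / L
z = rng.normal(size=(n, L))
S = S0 * np.exp(np.cumsum((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z, axis=1))
S = np.concatenate([np.full((n, 1), S0), S], axis=1)

price = longstaff_schwartz(S, lambda s: np.maximum(K - s, 0.0), np.exp(-r * dt))
```

The returned price is a lower-biased estimate; in the procedure above, the continuation-function estimates Ĉ^x would additionally be evaluated on an independent testing sample.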

Preparations and notations
To prove Theorems 3.1 and 3.3, we need some preparation. Since the two proofs start in the same way, the preparation is valid for both. Let η > 0. By (K × Ψ)_η, we denote the set of centres of a minimal number of η-balls needed to cover K × Ψ with respect to the semimetric d. Fix n ∈ N and λ > 0. By (x_{n,η}, ψ_{n,η}), we denote a measurable selector of the corresponding arg min set over (K × Ψ)_η, and by (x*_η, ψ*_η), we denote an element of the set (K × Ψ)_η satisfying (6.1) for a solution (x*, ψ*) of (3.1). Due to the construction of (K × Ψ)_η, there always exists such an (x*_η, ψ*_η), but it need not be unique. For (x, ψ) ∈ K × Ψ, let h_{n,λ}(x, ψ) denote the penalised empirical objective, and let Z̃ = (Z̃(x, ψ)) be an independent copy of Z. With the above definitions, we have P^N-a.s. for c ≥ 0 that

2E[h_{n,λ}(x, ψ)] − (1 + c)h_{n,λ}(x, ψ)
Let us start with the first term and note the elementary bounds that hold for all n ∈ N. At this point, it makes sense to separate the further steps of the proofs for the two theorems. But for both theorems, we have to analyse the following terms, the aim being to find upper bounds that hold with a given probability:

Outline for the proof of Theorem 3.1
The idea is to derive bounds for T_1, T_2, T_3. To this end, we use some concentration inequalities: the Hoeffding inequality, the Bernstein inequality and a new one based on a bounded differences approach. Let c = 1 and fix n ∈ N and η > 0. We show in Sect. 6.6.1 that (6.5) and (6.6) hold. For the analysis of T_3, we use the decomposition above, and in Sect. 6.6.1 we obtain for ε > 0 the bound (6.8). With the help of these concentration inequalities, we can derive bounds for T_1, T_2, T_3 holding with a given probability if we choose ε well. After deriving bounds for T_1, T_2, T_3, we can easily find a bound for T_1 + T_2 + T_3 holding with a given probability. Then the same bound holds for 2Q_λ(x_n, ψ_n) − 2Q_λ(x*, ψ*) with the given probability because 2Q_λ(x_n, ψ_n) − 2Q_λ(x*, ψ*) ≤ T_1 + T_2 + T_3 P^N-a.s.

Proof of Theorem 3.1
Fix n ∈ N and η > 0, as well as δ ∈ (0, 1). Further, we impose that log N(K × Ψ, η) ≤ nη. Then we choose ε appropriately and derive, with the inequalities (6.5) and (6.6), the estimates for T_1 and T_2. Therefore, by elementary calculations, we arrive at (6.9). Concerning T_3, let us again choose ε appropriately. Then we can first derive

This leads, again with elementary calculations, to (6.10).
Now we only need an upper estimate of E[h_{n,λ}(x*_η, ψ*_η)], which is provided by (6.11). Above, the equality follows directly by definition. The first inequality is derived by applying the factorisation x² − y² = (x − y)(x + y) in connection with the boundedness of Z and Z̃. The second inequality holds because Z and Z̃ are independent with identical distribution. The final inequality results from the definition of (x*_η, ψ*_η); see (6.1). So by (6.10) and (6.11), we get the bound (6.12) for T_3. Now, combining (6.9) and (6.12), we derive the desired estimate, and since we have P^N-a.s. that 2Q_λ(x_n, ψ_n) − 2Q_λ(x*, ψ*) ≤ T_1 + T_2 + T_3, we finish with the resulting bound. Setting η = γ(K × Ψ, n), the assumption log N(K × Ψ, η) ≤ nη is always satisfied, and the statement of Theorem 3.1 follows immediately.

Outline for the proof of Theorem 3.3
The proof of Theorem 3.3 is similar to the proof of Theorem 3.1, but relies on some different concentration inequalities given below. Let c ≥ 2, n ∈ N and η > 0, as well as ε > 0. With L from (6.18) below, we have (6.13) and (6.14) (see Sect. 6.6.2). Moreover, with C_{λ,b} as in Theorem 3.3, we derive in Sect. 6.6.2 the estimates (6.15) and, for κ > 0, (6.16).

Proof of Theorem 3.3
In the following, let δ ∈ (0, 1), c ≥ 2, and let n ∈ N and η > 0 satisfy the condition

log N(K × Ψ, η) ≤ nη.   (6.17)

Let us introduce the quantity L defined in (6.18). First of all, the sequence (Var[Z(x_k, ψ_k)]) is bounded because the random variables Z(x, ψ) are P^N-essentially bounded, uniformly in (x, ψ) ∈ K × Ψ. Therefore, in order to show the finiteness of L, it suffices to restrict our considerations to the case E[Z(x_k, ψ_k)] − E[Z(x*, ψ*)] → 0. In this situation, the "well-separated minimum" property (3.3) implies the convergence d((x_k, ψ_k), (x*, ψ*)) → 0. Therefore, by the identifiability condition (3.4), we may find some C > 0 and k_0 ∈ N such that the required estimate holds for k ≥ k_0. Next, in view of (3.2), the corresponding bound follows for k ∈ N, and thus L is finite. This completes the proof due to the choice of the sequence (x_k, ψ_k)_{k∈N}.

Proofs of the concentration inequalities
Let us first give an auxiliary result which will turn out to be useful.

Proofs of the concentration inequalities for Theorem 3.1
Let us now prove the concentration inequalities used to prove Theorem 3.1. We start with (6.5).
Proof of (6.5) Due to Lemma 6.2, there exists (x, ψ) ∈ (K × Ψ)_η with the required approximation property. Using Corollary A.3, we derive the corresponding tail bound, and finally we obtain (6.5), since we have chosen c = 1.
To prove (6.6), we need the following result for preparation.
Finally, (6.7) may be proved by an application of Corollary A.3, whereas (6.8) follows from Theorem 6.3.

Proofs of the concentration inequalities for Theorem 3.3
Let us now prove the inequalities used for the proof of Theorem 3.3. Under the additional assumption that Var[Z(x*, ψ*)] = 0, we first give a lemma, recalling the semimetric d on K × Ψ introduced in Sect. 6.1.

Lemma 6.4 Under the condition (3.2), we have
Proof Assumption (3.2) means that Z(x*, ψ*) and Z̃(x*, ψ*) coincide P-a.s. Hence the first assertion follows. Since Z and Z̃ are bounded by the constant b and are identically distributed, using x² − y² = (x − y)(x + y) along with the triangle inequality yields the claim. This completes the proof.
The following auxiliary result is a useful consequence of Lemma 6.4.

Lemma 6.5 If (3.2)
is satisfied, then for (x, ψ) ∈ K × Ψ and ε > 0, the stated bound holds. In particular, 2g_n(x, ψ) is a so-called U-statistic with kernel q : R² → R defined by q(s, t) = (s − t)². Hence we may draw on a Bernstein inequality for U-statistics (see e.g. Clémençon et al. [16, Appendix A]) to conclude the corresponding tail estimate, where ⌊n/2⌋ denotes the integer part of n/2. By using (3.2) again, we obtain the required variance bound (see e.g. the proof of Lemma 6.4). Then the statement of Lemma 6.5 follows immediately from Lemma 6.4. Now we are ready to verify the concentration inequalities. Let us start with (6.13), recalling that L as defined in (6.18) is finite by Lemma 6.1.
Let us turn now to (6.14).
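As an aside, the identification of 2g_n as a U-statistic with kernel q(s, t) = (s − t)² rests on the elementary identity that this U-statistic equals twice the unbiased sample variance, which can be checked numerically:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=200)
n = len(x)

# U-statistic with kernel q(s, t) = (s - t)^2 over all ordered pairs i != j
diffs = x[:, None] - x[None, :]
u_stat = (diffs ** 2).sum() / (n * (n - 1))

# identity: sum_{i != j} (x_i - x_j)^2 = 2 n (n - 1) * sample variance,
# so the U-statistic equals exactly twice the unbiased variance estimator
print(u_stat, 2.0 * x.var(ddof=1))
```

This is precisely why Bernstein-type bounds for U-statistics control the empirical variance penalty in the objective.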

Proof of Remark 3.4
It is easy to check that for a constant k not depending on n and γ, the stated bound holds. With the assumption that lim_{n→∞} γ(K × Ψ, n) = 0, and then by lim_{n→∞} nγ(K × Ψ, n) = ∞, the claim of Remark 3.4 follows.

Proofs of Theorems 4.1 and 4.2
Let the assumptions of Theorem 4.1 be fulfilled, and retain the notation from its formulation. To prove Theorems 4.1 and 4.2, we need some preparations. These mainly concern estimates of different semimetrics, but may be of independent interest.

Preparations and notations
Firstly, we endow the space K × Ψ with the semimetric d introduced in Sect. 3. This is well defined because Z(x, ψ) is assumed to be essentially bounded uniformly in (x, ψ) ∈ K × Ψ. Secondly, by assumption, we may equip the set Ψ with the L²-metric d_π. Next, we want to find a suitable semimetric on the space K × Ψ. It is based on the following observation.

Lemma 8.1
There exists some C_1 > 0 such that for ψ_1, ψ_2 ∈ Ψ, the inequality E[sup_{t∈[0,T]} |M_t(ψ_1) − M_t(ψ_2)|] ≤ C_1 d_π(ψ_1, ψ_2) holds.

Proof By the Burkholder–Davis–Gundy inequality (for p = 1), we may find some constant C_1 > 0 such that E[sup_{t∈[0,T]} |M_t(ψ_1) − M_t(ψ_2)|] is bounded by C_1 times the expectation of the square root of the quadratic variation of M(ψ_1) − M(ψ_2). Invoking Jensen's inequality, we end up with the claimed bound. This completes the proof.
The quantities ρ_1 and ρ_2 so defined are obviously alternative semimetrics on K × Ψ. In the next step, we want to find an upper estimate of the semimetric d in terms of the semimetrics ρ_1 and ρ_2.
Proof Set x̄ := (max K)^+ + 1. The proof is based on a representation for convex functions: for any x,

Φ*(x) = Φ*(0) + ∫_0^x Φ*'_+(y) dy,

where Φ*'_+ denotes the right derivative of Φ*. Since Φ* and its right derivative are both nondecreasing, we may observe for x, x′ ∈ K and η ≥ 0 the corresponding difference bound, where the last step additionally uses that x̄ ≥ 1 and that Φ* is nonnegative on [0, ∞).
As a consequence, we may conclude by nonnegativity of (Y_t) the estimate (8.2). Since |Z| ≤ b P-a.s., we further obtain the asserted bound for x, x′ ∈ K and ψ, ψ′ ∈ Ψ. The proof is complete.
Let us introduce some further notation.
Henceforth, we denote by N(Ψ, ε) the covering number of Ψ by ε-balls with respect to d_π. Finally, let us introduce the following notation:

Proofs of Theorems 4.1 and 4.2
Setting d̃^C_{K×Ψ} := ρ_1 + Cρ_2, we may find by Theorem 8.2 some C > 1 such that d ≤ d̃^C_{K×Ψ}. The idea of the proofs of Theorems 4.1 and 4.2 is based on a result of Nickl and Pötscher [28]. Under the imposed assumptions, this result enables us to give analytical upper estimates for γ(K × Ψ, n, d̃^C_{K×Ψ}). We then use these analytical estimates and apply Theorems 3.1 and 3.3, respectively, to derive an analytical bound for the deviations Q_λ(x_n, ψ_n) − Q_λ(x*, ψ*).
Note that here, the bounds can even be derived for all n ∈ N. To see this, check the definition of κ in the proof of Theorem 3.3. The proof of Theorem 4.2 is complete.

Proof of Remark 4.3
Let π(x, t) = π_t(x) denote the density of the diffusion process given in (4.1). Furthermore, let σ̄ := ½ σσ^⊤, where σ^⊤ denotes the transposed matrix. Then the Fokker–Planck equation states that

∂π(x, t)/∂t = − Σ_{i=1}^d ∂/∂x_i (μ_i(x, t) π(x, t)) + Σ_{i,j=1}^d ∂²/(∂x_i ∂x_j) (σ̄_{i,j}(x, t) π(x, t)).   (9.1)

This is a parabolic partial differential equation. To show that, under some conditions on μ and σ, the density π is infinitely differentiable in space and time, we want to make use of Friedman [21, Theorem 3.11]. To apply that theorem, we need to impose that σ̄ is uniformly elliptic, i.e., there exists λ > 0 such that ξ^⊤σ̄(x, t)ξ ≥ λ|ξ|² for all (t, x) ∈ [0, T] × R^d and all ξ ∈ R^d, and that every mixed derivative of the coefficients of order 0 ≤ k, ℓ < ∞ exists and is Hölder-continuous.
Let C > 0 and let X_1, ..., X_n be independent random variables with 0 ≤ X_i ≤ C for i = 1, ..., n, and set v := Σ_{i=1}^n Var[X_i]. In our setting, we consider random variables that are not only independent, but also have the same distribution. The following corollaries can be derived as immediate consequences of the Hoeffding and the Bernstein inequality, respectively.