1 Introduction

The present article is motivated by Guyon and Henry-Labordère [13], where the authors proposed a particle method for the calibration of local stochastic volatility models (e.g. stock price models). For ease of presentation, let us assume zero interest rates and recall that local volatility models

$$ dX_{t} = \sigma (t,X_{t}) X_{t} dW_{t} , $$
(1.1)

where \(W\) denotes a one-dimensional Brownian motion under a risk-neutral measure and \(X\) the price of a stock, can replicate any sufficiently regular implied volatility surface, provided that we choose the local volatility according to Dupire’s formula, symbolically \(\sigma := \sigma _{\text{Dup}}\); see Dupire [8]. (In case of deterministic nonzero interest rates, the discussion below remains virtually unchanged after passing to forward stock and option prices). Unfortunately, it is well understood that Dupire’s model exhibits unrealistic random price behaviour despite perfect fits to market prices of options. On the other hand, stochastic volatility models

$$dX_{t} = \sqrt{v_{t}} \, X_{t} dW_{t} $$

where \((v_{t})\) is a suitably chosen stochastic variance process, may lead to realistic (in particular, time-homogeneous) dynamics, but are typically difficult or impossible to fit to observed implied volatility surfaces. We refer to Gatheral [11] for an overview of stochastic and local volatility models. Local stochastic volatility models can combine the advantages of both local and stochastic volatility models. Indeed, if the stock price is given by

$$ dX_{t} = \sqrt{v_{t}}\, \sigma (t,X_{t}) X_{t} dW_{t}, $$

then it exactly fits the observed market option prices provided that

$$ \sigma ^{2}_{\text{Dup}}(t,x) = \sigma ^{2}(t,x) \mathbb{E}[ v_{t} | X_{t} = x ]. $$

This is a simple consequence of Gyöngy’s celebrated Markovian projection theorem; see Gyöngy [14, Theorem 4.6] and also Brunick and Shreve [4, Corollary 3.7]. With this choice of \(\sigma \), we have

$$ dX_{t} = \sigma _{\text{Dup}}(t,X_{t}) X_{t} \frac{\sqrt{v_{t}} }{\sqrt{\mathbb{E}\left [ v_{t} | X_{t} \right ]}} dW_{t}. $$
(1.2)

Note that \(v\) in (1.2) can be any integrable and positive adapted stochastic process. In a sense, (1.2) may be considered as an inversion of the Markovian projection due to [14], applied to Dupire’s local volatility model, i.e., (1.1) with \(\sigma = \sigma _{\text{Dup}}\).

Thus the stochastic local volatility model of McKean–Vlasov type (1.2) solves the smile calibration problem. However, equation (1.2) is singular in a sense explained below and very hard to analyse and solve. Even the problem of proving existence or uniqueness for (1.2) (under various assumptions on \(v\)) turned out to be notoriously difficult and only a few results are available; we refer to Lacker et al. [17] for an extensive discussion and literature review. Let us recall that the theory of standard McKean–Vlasov equations of the form

$$ d Z_{t} = \widetilde{H}\left (t, Z_{t}, \mu _{t} \right ) d t + \widetilde{F}\left (t, Z_{t}, \mu _{t} \right ) d W_{t} $$

with \(\mu _{t} = \mathrm{Law}(Z_{t})\) is well understood under appropriate regularity conditions, in particular, Lipschitz-continuity of \(\widetilde{H}\) and \(\widetilde{F}\) with respect to the standard Euclidean distances in the first two arguments and with respect to the Wasserstein distance in \(\mu _{t}\); see Funaki [10], Carmona and Delarue [6, Chap. 4.2], Mishura and Veretennikov [20]. Denoting \(Z_{t} := (X_{t}, Y_{t})\), it is not difficult to see that the conditional expectation \((x, \mu _{t}) \mapsto \mathbb{E}\left [ A(Y_{t}) \mid X_{t} = x \right ]\) is in general not Lipschitz-continuous in the above sense. Therefore the standard theory does not apply to (1.2).
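To see why such Lipschitz-continuity fails, consider the following toy example (ours, purely for illustration): take \(A(y)=y\) and, for small \(\varepsilon >0\), let \(\mu ^{\varepsilon}\) put mass \(1/2\) on each of the points \((0,0)\) and \((\varepsilon ,1)\), while \(\nu \) puts mass \(1/2\) on each of \((0,0)\) and \((0,1)\). Then

$$ \mathbb{E}_{\mu ^{\varepsilon}}[ A(Y) | X=0 ] = 0 \qquad \text{whereas} \qquad \mathbb{E}_{\nu}[ A(Y) | X=0 ] = \tfrac{1}{2}, $$

although \(\mathbb{W}_{1}(\mu ^{\varepsilon},\nu )\le \varepsilon \). Hence the value of the conditional expectation at \(x=0\) does not depend continuously on the joint law with respect to \(\mathbb{W}_{1}\), let alone in a Lipschitz way.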

There are a number of results available in the literature where the Lipschitz condition on drift and diffusion is not imposed. Bossy and Jabir [3] considered singular McKean–Vlasov (MV) systems of the form:

$$\begin{aligned} dX_{t} &= \mathbb{E}[\ell (X_{t}) | Y_{t}] dt + \mathbb{E}[\gamma (X_{t}) | Y_{t}] dW_{t}, \end{aligned}$$
(1.3)
$$\begin{aligned} dY_{t} &= b(X_{t}, Y_{t}) dt + \sigma (Y_{t}) dB_{t} , \end{aligned}$$
(1.4)

or, alternatively, the seemingly even less regular equation

$$ dX_{t} = \sigma \big(p(t, X_{t})\big) dW_{t}, $$
(1.5)

where \(p(t, \,\cdot \,)\) denotes the density of \(X_{t}\). Bossy and Jabir [3] establish well-posedness of (1.3)–(1.5) under suitable regularity conditions (in particular, ellipticity) based on energy estimates of the corresponding nonlinear PDEs. Interestingly, these techniques break down when the roles of \(X\) and \(Y\) are reversed in (1.3), (1.4), that is, when \(\mathbb{E}[\gamma (X_{t}) | Y_{t}]\) is replaced by \(\mathbb{E}[\gamma (Y_{t}) | X_{t}]\) in (1.3), and similarly for the drift term. Hence the results of [3] do not imply well-posedness of (1.2). Lacker et al. [17] studied the two-dimensional SDE

$$\begin{aligned} dX_{t}&=b_{1}(X_{t})\frac{h(Y_{t})}{\mathbb{E}[h(Y_{t})|X_{t}]}\,dt+ \sigma _{1}(X_{t}) \frac{f(Y_{t})}{\sqrt{\mathbb{E}[f^{2}(Y_{t})|X_{t}]}}\, dW_{t}, \end{aligned}$$
(1.6)
$$\begin{aligned} dY_{t}&=b_{2}(Y_{t})\,dt+\sigma _{2}(Y_{t})\,dB_{t}, \end{aligned}$$
(1.7)

where \(W\) and \(B\) are two independent one-dimensional Brownian motions. Clearly, this can be seen as a generalisation of (1.2) with a non-zero drift and with the process \(v\) chosen in a special way. The authors proved strong existence and uniqueness of solutions to (1.6), (1.7) in the stationary case. In particular, this entails strong conditions on \(b_{1}\) and \(b_{2}\), and it also requires the initial value \((X_{0},Y_{0})\) to be random with the stationary distribution. Existence and uniqueness of solutions to (1.6), (1.7) in the general case (without the stationarity assumptions) remain open. Finally, let us mention the result of Jourdain and Zhou [16, Theorem 2.2], which establishes weak existence of a solution to (1.2) in the case where \(v\) is a jump process taking finitely many values.

Another question apart from well-posedness of these singular McKean–Vlasov equations is how to solve them numerically (in a certain sense). Let us recall that even for standard SDEs with singular or irregular drift, where existence and uniqueness have been known for quite some time, the convergence of the corresponding Euler scheme with non-vanishing rate has been established only very recently; see Butkovsky et al. [5], Jourdain and Menozzi [15]. The situation with the singular McKean–Vlasov equations presented above is much more complicated, and very few results are available in the literature. In particular, the results of Lacker et al. [17] do not provide a way to construct a numerical algorithm for solving (1.2) even in the stationary case considered there.

In this paper, we study the problem of numerically solving singular McKean–Vlasov (MV) equations of a more general form than (1.2), namely

$$ d X_{t} = H\big(t, X_{t}, Y_{t}, \mathbb{E}[A_{1}(Y_{t}) | X_{t}] \big) d t + F\big(t, X_{t}, Y_{t}, \mathbb{E}[A_{2}(Y_{t}) | X_{t}] \big) d W_{t}, $$
(1.8)

where \(H\), \(F\), \(A_{1}\), \(A_{2}\) are sufficiently regular functions, \(W\) is a \(d\)-dimensional Brownian motion and \(Y\) is a given stochastic process, for example a diffusion process. A key issue is how to approximate the conditional expectations \(\mathbb{E}[A_{i}(Y_{t}) | X_{t} = x]\), \(i=1,2\), \(x\in \mathbb{R}^{d}\).

In their seminal paper, Guyon and Henry-Labordère [13] suggested an approach to tackle this problem (see also Antonelli and Kohatsu-Higa [1]). They used the “identity”

$$ \mathbb{E}[A(Y_{t}) | X_{t} = x] \,\text{``}{=}\text{''}\, \frac{\mathbb{E}[A(Y_{t}) \delta _{x}(X_{t})]}{\mathbb{E}[\delta _{x}(X_{t})]}, $$

where \(\delta _{x}\) is the Dirac delta function concentrated at \(x\). This suggests the approximation

$$ \mathbb{E}[A(Y_{t}) | X_{t} = x] \approx \frac{\sum _{i=1}^{N} A(Y^{i,N}_{t}) k_{\varepsilon}(X^{i,N}_{t} - x)}{\sum _{i=1}^{N} k_{\varepsilon}(X^{i,N}_{t} - x)}. $$
(1.9)

Here \((X^{i,N}, Y^{i,N})_{i=1,\dots , N}\) is a particle system, \(k_{\varepsilon}(\,\cdot \,) \approx \delta _{0}(\,\cdot \,)\) is a regularising kernel and \(\varepsilon >0\) is a small parameter. This technique for solving (1.8) (assuming for a moment that (1.8) has a solution) works very well in practice, especially when coupled with interpolation on a grid in \(x\)-space. Due to the local nature of the regression, the method can be justified under only weak regularity assumptions on the conditional expectation. Note, however, that the interpolation part might require higher-order regularity.
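To fix ideas, the following minimal sketch (in Python/NumPy; the Gaussian regularising kernel and all names are our own choices, purely for illustration) shows how the estimator (1.9) is evaluated for a given particle cloud.

```python
import numpy as np

def local_regression_estimate(x, X, Y, A, eps):
    """Nadaraya-Watson-type estimate (1.9) of E[A(Y_t) | X_t = x]
    from particles (X^i, Y^i), with a Gaussian regularising kernel of bandwidth eps."""
    w = np.exp(-0.5 * ((X - x) / eps) ** 2)   # k_eps(X^i - x), up to a constant factor
    return np.dot(w, A(Y)) / np.sum(w)        # the normalising constant cancels in the ratio

# toy usage: N particles with Y approximately equal to X, A(y) = y
rng = np.random.default_rng(0)
X = rng.normal(size=10_000)
Y = X + 0.1 * rng.normal(size=10_000)
print(local_regression_estimate(0.5, X, Y, lambda y: y, eps=0.05))  # close to 0.5
```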

On the other hand, the method has an important disadvantage shared by all local regression methods: For any given point \(x\), only points \((X^{i},Y^{i})\) in a neighbourhood around \(x\) of size proportional to \(\varepsilon \) contribute to the estimate of \(\mathbb{E}[A(Y_{t}) | X_{t} = x]\) as we have \(k_{\varepsilon}(\,\cdot \, - x)\approx 0\) outside that neighbourhood. Hence local regression cannot take advantage of “global” information about the structure of the function \(x \mapsto \mathbb{E}[A(Y_{t}) | X_{t} = x]\). If for example the conditional expectation can be globally approximated via a polynomial, it is highly inefficient (from a computational point of view) to approximate it locally using (1.9). Taken to the extreme, if we assume a compactly supported kernel \(k\) and formally take \(\varepsilon = 0\), then the estimator (1.9) collapses to \(\mathbb{E}[A(Y_{t}) | X_{t} = X^{i,N}_{t}] \approx A(Y^{i,N}_{t})\) since only \(X^{i,N}_{t}\) is close enough to itself to contribute to the estimator. In the context of the stochastic local volatility model (1.2), this means that the dynamics silently collapses to a pure local volatility dynamics if \(\varepsilon \) is chosen too small.

This disadvantage of local regression methods can be avoided by using global regression techniques. Indeed, taking advantage of global regularity and global structural features of the unknown target function, global regression methods are often seen to be more efficient than their local counterparts; see e.g. Bach [2]. On the other hand, the global regression methods require more regularity (e.g. global smoothness) than the minimal assumptions needed for local regression methods. In addition, the choice of basis functions can be crucial for global regression methods.

In fact, the starting point of this work was to replace (1.9) by global regression based on, say, \(L\) basis functions. However, it turns out that the Lipschitz constants of the resulting approximation to the conditional expectations in terms of the particle distribution explode as \(L \to \infty \), unless the basis functions are carefully chosen.

As an alternative to Guyon and Henry-Labordère [13], we propose in this paper a novel approach based on ridge regression in the context of reproducing kernel Hilbert spaces (RKHSs) which in particular does not have either of the above mentioned disadvantages, even when the number of basis functions is infinite.

Recall that an RKHS ℋ is a Hilbert space of real-valued functions \({f: \mathcal{X}\rightarrow \mathbb{R}}\) such that the evaluation map \(\mathcal{H}\ni f \mapsto f(x)\) is continuous for every \(x\in \mathcal{X}\). This crucial property implies that there exists a positive symmetric kernel \(k: \mathcal{X}\times \mathcal{X}\rightarrow \mathbb{R}\), i.e., for any \(c_{1},\dots ,c_{n}\in \mathbb{R}\), \(x_{1},\dots ,x_{n}\in \mathcal{X}\), one has

$$ \sum _{i,j=1}^{n}c_{i}c_{j}k(x_{i},x_{j})\geq 0, $$

such that \(k_{x} := k(\,\cdot \,,x) \in \mathcal{H}\) for every \(x\in \mathcal{X}\), and one has \(\langle f,k_{x} \rangle _{\mathcal{H}}= f(x)\) for all \(f\in \mathcal{H}\). As a main feature, any positive definite kernel \(k\) uniquely determines an RKHS ℋ and vice versa. In our setting, we consider \(\mathcal{X}\subseteq \mathbb{R}^{d}\). For a detailed introduction and further properties of RKHSs, we refer to the literature, for example Steinwart and Christmann [25, Chap. 4]. We recall that the RKHS framework is popular in machine learning, where it is widely used for computing conditional expectations. In the learning context, kernel methods are most prominently used, via the kernel trick, to avoid the curse of dimensionality when dealing with high-dimensional features. We stress that this issue is not relevant in the application to calibration of equity models, but it might be interesting for more general, high-dimensional singular McKean–Vlasov systems.
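A standard example, which we also use in the numerical experiments of Sect. 5, is the Gaussian kernel

$$ k(x,y)=\exp \bigg( -\frac{|x-y|^{2}}{2\gamma} \bigg), \qquad x,y\in \mathbb{R}^{d},\ \gamma >0, $$

which is symmetric and positive definite and whose associated RKHS consists of smooth functions.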

Consider a pair of random variables \((X,Y)\) taking values in \(\mathcal{X}\times \mathcal{X}\) with finite second moments and denote \(\nu := \operatorname{Law}(X,Y)\). Suppose that \(A\colon \mathcal{X} \to \mathbb{R}\) is sufficiently regular and ℋ is large enough so that we have \(\mathbb{E}[ A(Y) | X = \,\cdot \, ] \in \mathcal{H}\). Then formally,

$$\begin{aligned} c_{A}^{\nu}(\,\cdot \,):= \int _{\mathcal{X}\times \mathcal{X}}k(\,\cdot \, ,x)A(y)\nu (dx,dy) & =\int _{\mathcal{X}}k(\,\cdot \,,x)\bigg(\int _{\mathcal{X}}A(y)\nu (dy|x)\bigg)\nu (dx, \mathcal{X}) \\ & =\int _{\mathcal{X}}k(\,\cdot \,,x)\mathbb{E}[ A(Y)|X=x ] \nu (dx, \mathcal{X}) \\ & =:\mathcal{C}^{\nu}\mathbb{E}[ A(Y)|X=\,\cdot \, ], \end{aligned}$$

where

$$ \mathcal{C}^{\nu}f(\,\cdot \,):= \int _{\mathcal{X}}k(\,\cdot \,,x)f(x)\nu (dx, \mathcal{X}), \qquad f\in \mathcal{H}. $$

Unfortunately, in general, the operator \(\mathcal{C}^{\nu}\) is not invertible. As \(\mathcal{C}^{\nu}\) is positive definite, it is, however, possible to regularise the inversion by replacing \(\mathcal{C}^{\nu}\) by \(\mathcal{C}^{\nu }+ \lambda I_{\mathcal{H}}\) for some \(\lambda > 0\), where \(I_{\mathcal{H}}\) is the identity operator on ℋ. Indeed, it turns out that

$$ m^{\lambda}_{A}(\,\cdot \,;\nu ) :=(\mathcal{C}^{\nu }+ \lambda I_{ \mathcal{H}})^{-1} c^{\nu}_{A} $$
(1.10)

is the solution to the minimisation problem

$$ m^{\lambda}_{A}(\,\cdot \,;\nu ):=\operatorname*{arg\, min}_{f\in \mathcal{H}}\Big( \mathbb{E}\big[ \big(A(Y)-f(X)\big)^{2}\big]+\lambda \| f\|_{ \mathcal{H}}^{2}\Big); $$
(1.11)

see Proposition 3.3. On the other hand, one also has

$$ \mathbb{E}[ A(Y) | X = \,\cdot \,] = \operatorname*{arg\, min}_{f\in L^{2} (\mathbb{R}^{d}, \operatorname{Law}(X))}\mathbb{E}\big[\big(A(Y)-f(X)\big)^{2}\big], $$

and therefore it is natural to expect that if \(\lambda >0\) is small enough and ℋ is large enough, then \(m^{\lambda}_{A}(\,\cdot \,; \nu )\approx \mathbb{E}[A(Y) | X = \, \cdot \,]\), that is, \(m^{\lambda}_{A}(\,\cdot \,; \nu )\) is close to the true conditional expectation.
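For orientation, we note already here the finite-sample counterpart of (1.10), (1.11), which underlies the numerical scheme discussed in Sect. 4: if \(\nu \) is replaced by the empirical measure of samples \((X_{1},Y_{1}),\dots ,(X_{N},Y_{N})\), then (1.11) becomes a standard kernel ridge regression problem, and the representer theorem yields a minimiser of the form

$$ m^{\lambda}_{A}(x) = \sum _{i=1}^{N}\alpha _{i}\, k(x,X_{i}) \qquad \text{with} \qquad (K+N\lambda I_{N})\alpha = \big(A(Y_{i})\big)_{i=1,\dots ,N}, $$

where \(K=(k(X_{i},X_{j}))_{i,j=1,\dots ,N}\) is the Gram matrix and \(I_{N}\) the \(N\times N\) identity matrix.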

The main result of the article is that the regularised MV system obtained by replacing the conditional expectations with their regularised versions (1.10) in (1.8) is well posed and propagation of chaos holds for the corresponding particle system; see Theorems 2.2 and 2.3. To establish these results, we study the joint regularity of \(m^{\lambda}_{A}(x;\nu )\) in the space variable \(x\) and the measure \(\nu \) for fixed \(\lambda >0\). Such results are almost absent in the literature on RKHSs, and we fill this gap here. In particular, we prove that under suitable conditions, \(m^{\lambda}_{A}(x; \nu )\) is Lipschitz in both arguments, that is, with respect to the standard Euclidean norm in \(x\) and the Wasserstein-1 distance in \(\nu \), and can be calculated numerically in an efficient way; see Sect. 2. Additionally, in Sect. 3, we study the convergence of \(m^{\lambda}_{A}(\,\cdot \,; \nu )\) in (1.10) to the true conditional expectation for fixed \(\nu \) as \(\lambda \searrow 0\).

Let us note that as a further nice feature of the RKHS approach compared to the kernel method of [13], one may incorporate, at least in principle, global prior information concerning properties of \(\mathbb{E}[ A(Y) | X = \,\cdot \,]\) into the choice of the RKHS-generating kernel \(k\). In a nutshell, if one anticipates beforehand that \(\mathbb{E}[ A(Y) | X = x]\approx f(x)\) for some known “nice” function \(f\), one may pass to a new kernel given by setting \(\widetilde{k}(x,y) := k(x,y)+f(x)f(y)\). This degree of freedom is similar to, for example, the possibility of choosing basis functions in line with the problem under consideration in the usual regression methods for American options. We also note that the Lipschitz constants for \(m^{\lambda}_{A}(x;\nu )\) with respect to both arguments are expressed in terms of bounds involving only \(A\) and the kernel \(k\); see Theorem 2.4. In contrast, if we dealt with standard ridge regression, that is, ridge regression based on a fixed system of basis functions, we would have to impose restrictions on the regression coefficients, leading to a nonconvex constrained optimisation problem.

In summary, the contribution of the current work is fourfold. First, we propose an RKHS-based approach to regularise (1.8) and prove the well-posedness of the regularised equation. Second, we show convergence of the approximation (1.11) to the true conditional expectation as \(\lambda \searrow 0\). Third, we suggest a particle-based approximation of the regularised equation and analyse its convergence. Finally, we apply our algorithm to the problem of smile calibration in finance and illustrate its performance on simulated data. In particular, we validate our results by solving numerically a regularised version of (1.2) (with \(m^{\lambda}_{A}\) in place of the conditional expectation). We show that our system is indeed an approximate solution to (1.2) in the sense that we get very close fits of the implied volatility surface – the final goal of the smile calibration problem.

The rest of the paper is organised as follows. Our main theoretical results are given in Sect. 2. Convergence properties of the regularised conditional expectation \(m^{\lambda}_{A}\) are established in Sect. 3. A numerical algorithm for solving (1.8) and an efficiently implementable approximation of \(m^{\lambda}_{A}\) are discussed in Sect. 4. Section 5 contains numerical examples. The results of the paper are summarised in Sect. 6. Finally, all proofs are deferred to Sect. 7.

Convention on constants. Throughout the paper, \(C\) denotes a positive constant whose value may change from line to line. The dependence of constants on parameters, if needed, will be indicated e.g. by \(C(\lambda )\).

2 Main results

We begin by introducing the basic notation. For \(a\in \mathbb{R}\), we set \(a^{+}:=\max (a,0)\). Let \((\Omega , \mathcal{F},\mathbb{P})\) be a probability space. For \(d\in \mathbb{N}\), let \(\mathcal{X}\subseteq \mathbb{R}^{d}\) be an open subset and \(\mathcal{P}_{2}(\mathcal{X})\) the set of all probability measures on \((\mathcal{X},\mathcal{B}(\mathcal{X}))\) with finite second moment. If \(\mu ,\nu \in \mathcal{P}_{2}(\mathcal{X}) \), \(p\in [1,2]\), we denote the Wasserstein-p (Kantorovich) distance between them by

$$ \mathbb{W}_{p}(\mu ,\nu ):=\inf (\mathbb{E}[|X-Y|^{p}])^{1/p}, $$

where the infimum is taken over all random variables \(X\), \(Y\) with \(\operatorname{Law}(X)=\mu \) and \({\operatorname{Law}(Y)=\nu}\). Let \(k:\mathcal{X}\times \mathcal{X}\to \mathbb{R}\) be a symmetric, positive definite kernel and ℋ a reproducing kernel Hilbert space of functions \(f\colon \mathcal{X}\to \mathbb{R}\) associated with the kernel \(k\). That is, for any \(x\in \mathcal{X}\), \(f\in \mathcal{H}\), one has

$$ f(x)=\langle f,k(x,\,\cdot \,)\rangle _{\mathcal{H}}. $$

In particular, \(\langle k(x,\,\cdot \,),k(y,\,\cdot \,)\rangle _{\mathcal{H}}=k(x,y)\) for any \(x,y\in \mathcal{X}\). We refer to Steinwart and Christmann [25, Chap. 4] for further properties of RKHSs.

Let \(A\colon \mathcal{X}\to \mathbb{R}\) be a measurable function such that \(|A(x)|\le C(1+|x|)\) for some universal constant \(C>0\) and all \(x\in \mathcal{X}\). For \(\nu \in \mathcal{P}_{2}(\mathcal{X}\times \mathcal{X})\), \(\lambda \ge 0\), consider the optimisation problem (ridge regression)

$$ m^{\lambda}_{A}(\,\cdot \,; \nu ) := \operatorname*{arg\, min}_{f\in \mathcal{H}} \bigg( \int _{\mathcal{X}\times \mathcal{X}}|A(y)-f(x)|^{2}\,\nu (dx,dy)+ \lambda \| f\|_{\mathcal{H}}^{2}\bigg). $$
(2.1)

We fix \(T>0\), \(d\in \mathbb{N}\) and consider the system

$$\begin{aligned} dX_{t} &= H\big(t,X_{t}, Y_{t}, \mathbb{E}[A_{1}(Y_{t}) | X_{t}] \big) d t \!+ F\big(t,X_{t}, Y_{t}, \mathbb{E}[A_{2}(Y_{t}) | X_{t}] \big) d W_{t}^{X}, \end{aligned}$$
(2.2)
$$\begin{aligned} dY_{t}&=b(t,Y_{t})dt+\sigma (t,Y_{t})dW_{t}^{Y}, \end{aligned}$$
(2.3)

where \(H\colon [0,T]\times \mathbb{R}^{d}\times \mathbb{R}^{d} \times \mathbb{R}\to \mathbb{R}^{d}\), \(F\colon [0,T]\times \mathbb{R}^{d}\times \mathbb{R}^{d} \times \mathbb{R}\to \mathbb{R}^{d\times d}\), \(A_{i}\colon \mathbb{R}^{d}\to \mathbb{R}\), \(b\colon [0,T]\times \mathbb{R}^{d}\to \mathbb{R}^{d}\), \(\sigma \colon [0,T]\times \mathbb{R}^{d}\to \mathbb{R}^{d\times d}\) are measurable functions, \(W^{X}\), \(W^{Y}\) are two (possibly correlated) \(d\)-dimensional Brownian motions on \((\Omega , \mathcal{F},\mathbb{P})\), and \(t\in [0,T]\). We note that our choice of \(Y\) as a diffusion process in (2.3) is mostly for convenience, and we expect our results to hold in more generality when appropriately modified.

Denote \(\mu _{t}:=\operatorname{Law}(X_{t},Y_{t})\). As mentioned above, the functional

$$ (x, \mu _{t}) \mapsto \mathbb{E}\left [ A_{i}(Y_{t}) | X_{t} = x \right ] $$

is not Lipschitz-continuous even if \(A_{i}\) is smooth. Therefore the classical results on well-posedness of McKean–Vlasov equations are not applicable to (2.2), (2.3). The main idea of our approach is to replace the conditional expectation by the corresponding RKHS approximation (2.1) which has “nice” properties (in particular, it is Lipschitz-continuous). This implies strong existence and uniqueness of the new system. Furthermore, we demonstrate numerically that the solution to the new system is still “close” to the solution of (2.2), (2.3) in a certain sense. Thus we consider the system

$$\begin{aligned} d \widehat{X}_{t} &= H\big(t,\widehat{X}_{t}, Y_{t}, m^{\lambda}_{A_{1}}( \widehat{X}_{t};\widehat{\mu}_{t}) \big) d t + F\big(t,\widehat{X}_{t}, Y_{t}, m^{\lambda}_{A_{2}}(\widehat{X}_{t};\widehat{\mu}_{t})\big)\, d W_{t}^{X}, \end{aligned}$$
(2.4)
$$\begin{aligned} dY_{t}&=b(t,Y_{t})dt+\sigma (t,Y_{t})\, dW_{t}^{Y}, \end{aligned}$$
(2.5)
$$\begin{aligned} \widehat{\mu}_{t}&=\operatorname{Law}(\widehat{X}_{t},Y_{t}), \end{aligned}$$
(2.6)

where \(t\in [0,T]\). We need the following assumptions on the kernel \(k\) (formulated in a slightly redundant manner for ease of notation).

Assumption 2.1

The kernel \(k\) is twice continuously differentiable in both variables, \(k(x,x)>0\) for all \(x\in \mathcal{X}\), and

$$\begin{aligned} D_{k}^{2}:=\sup _{ \substack{(x,y)\in \mathcal{X}\times \mathcal{X}\\1\leq i,j\leq d}} \max \{& |\partial _{x_{i}}\partial _{y_{j}}k^{2}(x,y)|,|\partial _{x_{i}} \partial _{y_{j}}k(x,y)|,|\partial _{x_{i}}k(x,y)|, \\ & |\partial _{y_{j}}k(x,y)|,|k(x,y)| \} < \infty . \end{aligned}$$
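For instance (a routine check that we include only for illustration), the Gaussian kernel \(k(x,y)=\exp (-|x-y|^{2}/(2\gamma ))\), \(\gamma >0\), satisfies Assumption 2.1: one has \(k(x,x)=1>0\), and since \(k\) and \(k^{2}\) are smooth Gaussian-type functions of \(x-y\), all derivatives appearing in the definition of \(D_{k}^{2}\) are of the form (polynomial in \(x-y\)) \(\times \) (Gaussian in \(x-y\)) and hence are bounded uniformly in \((x,y)\), so that \(D_{k}<\infty \).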

Let \(\mathcal{C}^{1}(\mathcal{X};\mathbb{R})\) be the space of all functions \(f\colon \mathcal{X}\to \mathbb{R}\) such that

$$ \|f\|_{\mathcal{C}^{1}}:=\sup _{x\in \mathcal{X}} |f(x)|+\sup _{ \substack{x\in \mathcal{X}\\i=1,\dots ,d}} |\partial _{x_{i}} f(x)|< \infty . $$

Now we are ready to state our main results. Their proofs are given in Sect. 7.

Theorem 2.2

Suppose that Assumption 2.1 is satisfied for the kernel \(k\) with \(\mathcal{X}= \mathbb{R}^{d}\) and

1) \(A_{i}\in \mathcal{C}^{1}(\mathbb{R}^{d};\mathbb{R})\), \(i=1,2\);

2) there exists a constant \(C>0\) such that for any \(t\in [0,T]\), \({x,y,x',y'\in \mathbb{R}^{d}}\), \({z,z'\in \mathbb{R}}\), we have

$$\begin{aligned} &|H(t,x,y,z)-H(t,x',y',z')|+|F(t,x,y,z)-F(t,x',y',z')| \\ &\qquad +|b(t,y)-b(t,y')|+|\sigma (t,y)-\sigma (t,y')| \\ &\leq C(|x-x'|+|y-y'|+|z-z'|); \end{aligned}$$

3) for any fixed \(x,y\in \mathbb{R}^{d}\), \(z\in \mathbb{R}\), we have

$$ \int _{0}^{T} \big(|H(t,x,y,z)|^{2}+|F(t,x,y,z)|^{2}+|b(t,y)|^{2}+| \sigma (t,y)|^{2}\big)\,dt < \infty ; $$

4) \(\mathbb{E}[ |\widehat{X}_{0}|^{2} ] + \mathbb{E}[|Y_{0}|^{2}]<\infty \).

Then for any \(\lambda >0\), the system (2.4)–(2.6) with the initial condition \((\widehat{X}_{0}, Y_{0})\) has a unique strong solution.

To analyse a numerical scheme solving (2.4)–(2.6), we consider the particle system

$$\begin{aligned} d X^{N,n}_{t} &= H\bigl(t,X^{N,n}_{t}, Y^{N,n}_{t}, m^{\lambda}_{A_{1}}(X^{N,n}_{t}; \mu _{t}^{N}) \bigr) d t \\ & \phantom{=:} + F\bigl(t, X^{N,n}_{t}, Y^{N,n}_{t}, m^{\lambda}_{A_{2}}(X^{N,n}_{t}; \mu _{t}^{N})\bigr) d W_{t}^{X,n}, \end{aligned}$$
(2.7)
$$\begin{aligned} dY^{N,n}_{t}&=b(t,Y^{N,n}_{t})\,dt+\sigma (t,Y^{N,n}_{t})\,dW_{t}^{Y,n}, \end{aligned}$$
(2.8)
$$\begin{aligned} \mu _{t}^{N}&=\frac{1}{N}\sum _{n=1}^{N}\delta _{(X_{t}^{N,n},Y_{t}^{N,n})}, \end{aligned}$$
(2.9)

where \(N\in \mathbb{N}\), \(n=1,\ldots ,N\), \(t\in [0,T]\) and the pairs of \(d\)-dimensional Brownian motions \((W^{X,n},W^{Y,n})\), \(n=1,\ldots ,N\), are independent and have the same law as \((W^{X},W^{Y})\). The following propagation of chaos result holds; it establishes both weak and strong convergence of \(X^{N,n}\).

Theorem 2.3

Suppose that all the conditions of Theorem 2.2 are satisfied. Suppose the initial values \((X_{0}^{N,n},Y_{0}^{N,n})\) are independent and have the same law as \((\widehat{X}_{0},Y_{0})\). Moreover, suppose that \(\mathbb{E}[ |\widehat{X}_{0}|^{q}] + \mathbb{E}[|Y_{0}|^{q}]<\infty \) for some \(q>4\). Then there exists a constant \(C=C(\lambda ,T,\mathbb{E}[|\widehat{X}_{0}|^{q}],\mathbb{E}[|Y_{0}|^{q}])>0\) such that for any \(n=1,\ldots ,N\), \(N\in \mathbb{N}\), we have

$$\begin{aligned} \mathbb{E}\Big[ \sup _{0\leq t\leq T}|X_{t}^{N,n}-\widehat{X}_{t}^{n}|^{2} \Big] +\sup _{0\leq t\leq T} \mathbb{E}\big[\big(\mathbb{W}_{2}(\mu _{t}^{N}, \widehat{\mu}_{t})\big)^{2}\big]\le C \epsilon _{N}, \end{aligned}$$
(2.10)

where the process \(\widehat{X}^{n}\) solves (2.4)–(2.6) with \(W^{X,n}\), \(W^{Y,n}\) in place of \(W^{X}\), \(W^{Y}\), respectively, and where

$$ \epsilon _{N}= \textstyle\begin{cases} N^{-1/2}&\quad \textit{if }d=1, \\ N^{-1/2}\log N &\quad \textit{if }d=2, \\ N^{-1/d} &\quad \textit{if }d>2. \end{cases} $$

A crucial step which allows us to obtain these results is the Lipschitz-continuity of \(m^{\lambda}\). The following holds.

Theorem 2.4

Assume that the kernel \(k\) satisfies Assumption 2.1. Let \(A\in \mathcal{C}^{1}(\mathcal{X};\mathbb{R})\). Then for any \(x,y\in \mathcal{X}\), \(\mu ,\nu \in \mathcal{P}_{2}(\mathcal{X}\times \mathcal{X})\), we have

$$ |m^{\lambda}_{A}(x;\mu )-m^{\lambda}_{A}(y;\nu )|\le C_{1} \mathbb{W}_{1}( \mu ,\nu )+C_{2}|x-y|, $$

where

$$ C_{1}:=\bigg( \frac{D_{k}}{\lambda ^{2}}+\frac{1}{\lambda}\bigg) dD_{k}^{2} \| A\|_{\mathcal{C}^{1}} \qquad \textit{and} \qquad C_{2}:= \frac{\sqrt{d}}{\lambda}D_{k}^{2}\|A\|_{\mathcal{C}^{1}} $$

may be considered to be (possibly suboptimal) Lipschitz constants with respect to the Wasserstein metric and Euclidean norm, respectively.

This result is interesting for at least two reasons. First, it shows that \(m^{\lambda}_{A}\) is Lipschitz-continuous in both arguments, provided that the kernel \(k\) is smooth enough. That is, the Lipschitz-continuity property depends on ℋ only through the smoothness of the kernel \(k\). Second, this result gives an explicit dependence of the corresponding (possibly suboptimal) Lipschitz constants on \(\lambda \) and \(k\).

Remark 2.5

Let us stress that Theorem 2.2 establishes the existence and uniqueness of a solution to (2.4)–(2.6) only for a fixed regularisation parameter \(\lambda >0\) and cannot be used to study the limiting case \(\lambda \to 0\). Indeed, it follows from Theorem 2.4 that the Lipschitz constants \(C_{1}\), \(C_{2}\) obtained there blow up as \(\lambda \to 0\). However, this does not imply that the optimal Lipschitz constants blow up as \(\lambda \to 0\), nor that the solution to (2.4)–(2.6) blows up. We demonstrate numerically in Sect. 5 that for \(\lambda \to 0\), in the examples there, the solution to (2.4)–(2.6) does not blow up. On the contrary, it weakly converges to a limit; this suggests that (at least) weak existence of a solution to (2.2), (2.3) may hold. Verifying this theoretically remains, however, an important open problem.

Remark 2.6

A natural question is whether (2.2), (2.3) can be formulated for a different state space, that is, for \(X\), \(Y\) taking values in \(\mathcal{X}\), \(\mathcal{Y}\) rather than \(\mathbb{R}^{d}\). Indeed, for equity models, \(\mathcal{X}= \mathcal{Y} = \mathbb{R}_{+}\) is clearly a more natural choice for both the price process and the variance process. Heuristically, the theory should hold for more general \(\mathcal{X}\) and \(\mathcal{Y}\), provided that those sets are invariant under the dynamics (2.2), (2.3) as well as under the regularised dynamics. It is, however, difficult to derive meaningful assumptions guaranteeing this kind of invariance, which prompts us to work with \(\mathbb{R}^{d}\) instead.

3 Approximation of conditional expectations

In this section, we study the approximation \(m^{\lambda}_{A}\) introduced in (2.1) in more detail. Throughout this section, we fix an open set \(\mathcal{X}\subseteq \mathbb{R}^{d}\) and a measure \(\nu \in \mathcal{P}_{2}(\mathcal{X}\times \mathcal{X})\), and impose the following relatively weak assumptions on the function \(A\colon \mathcal{X}\to \mathbb{R}\) and the positive kernel \(k\colon \mathcal{X}\times \mathcal{X}\to \mathbb{R}\).

Assumption 3.1

The function \(A\) has sublinear growth, i.e., there exists a constant \(C>0\) such that \(|A(x)|\le C(1+|x|)\) for all \(x\in \mathcal{X}\).

Assumption 3.2

The kernel \(k(\,\cdot \,,\,\cdot \,)\) is continuous on \(\mathcal{X}\times \mathcal{X}\) and satisfies

$$ 0< k(x,x)\le C(1+|x|^{2}) $$

for some \(C>0\).

It is easy to see that Assumption 3.2 implies for any \(x\in \mathcal{X}\) that

$$ \|k(x,\,\cdot \,)\|_{\mathcal{H}}^{2}=\langle k(x,\,\cdot \,),k(x,\, \cdot \,)\rangle _{\mathcal{H}}=k(x,x)\le C(1+|x|^{2}). $$
(3.1)

Due to Assumption 3.2 and Steinwart and Christmann [25, Lemma 4.33], ℋ is a separable RKHS and one has for any \(f\in \mathcal{H}\), \(x\in \mathcal{X}\) that

$$ | f(x)| =|\langle k(x,\,\cdot \,),f\rangle _{\mathcal{H}}|\le \| k(x, \,\cdot \,)\|_{\mathcal{H}} \| f\|_{\mathcal{H}}\le {C}(1+|x|)\| f\|_{\mathcal{H}}, $$
(3.2)

where we also used (3.1). Hence every \(f\in \mathcal{H}\) has sublinear growth and, as a consequence, the objective functional in (2.1) is finite for any fixed \(\nu \in \mathcal{P}_{2}(\mathcal{X}\times \mathcal{X})\). It is also easy to see that (3.2) and (3.1) imply that for any \(x,y\in \mathcal{X}\),

$$ |k(x,y)|\le {C}(1+|x|)\|k(\,\cdot \,,y)\|_{\mathcal{H}}\le C(1+|x|) (1+|y|). $$
(3.3)

Therefore, the Bochner integrals

$$ c_{A}^{\nu}:=\int _{\mathcal{X}\times \mathcal{X}} k(\,\cdot \,,x)A(y) \nu (dx,dy), \qquad \mathcal{C}^{\nu}f :=\int _{\mathcal{X}\times \mathcal{X}} k(\,\cdot \,,x)f(x)\nu (dx,dy) $$

are well-defined functions in ℋ for every \(f\in \mathcal{H}\). Moreover, it is clear that the operator \({\mathcal{C}^{\nu}\colon \mathcal{H}\to \mathcal{H}}\) is symmetric and positive semidefinite since

$$ \langle g,\mathcal{C}^{\nu}f \rangle _{\mathcal{H}} =\int _{\mathcal{X}}\left \langle g,k(\,\cdot \,,x)\right \rangle _{\mathcal{H}} f(x)\nu (dx,\mathcal{X}) =\int _{\mathcal{X}} g(x)f(x)\nu (dx,\mathcal{X}). $$

Thus by the Hellinger–Toeplitz theorem (see e.g. Reed and Simon [21, Sect. III.5]), \(\mathcal{C}^{\nu}\) is a bounded self-adjoint linear operator on ℋ. As a consequence, for any \(\lambda \ge 0\), the operator \(\mathcal{C}^{\nu}+\lambda I_{\mathcal{H}}\) is a bounded self-adjoint operator on ℋ with spectrum contained in the interval \([ \lambda ,\|\mathcal{C}^{\nu}\| +\lambda ]\). Hence if \(\lambda >0\), then \((\mathcal{C}^{\nu}+\lambda I_{\mathcal{H}})^{-1}\) exists and is a bounded self-adjoint operator on ℋ with norm

$$\|(\mathcal{C}^{\nu}+\lambda I_{\mathcal{H}})^{-1}\|_{\mathcal{H}} \le \lambda ^{-1}. $$

We are now ready to state the following useful representation for the solution to (2.1).

Proposition 3.3

Under Assumptions 3.1, 3.2, for any fixed \(\nu \in \mathcal{P}_{2}(\mathcal{X}\times \mathcal{X})\) and \(\lambda >0\), the solution to (2.1) can be represented as

$$ m^{\lambda}_{A}(\,\cdot \,;\nu )=(\mathcal{C}^{\nu}+\lambda I_{ \mathcal{H}})^{-1}c^{\nu}_{A}. $$
(3.4)

This representation may be seen as an infinite sample version of the usual solution representation for a ridge regression problem based on finite samples. We thus consider it as not essentially new, but in order to keep our paper as self-contained as possible, we present a proof in Sect. 7. Proposition 3.3 allows us to prove Lipschitz-continuity of \(m^{\lambda}_{A}\), that is, Theorem 2.4.

Let us now proceed with investigating when the function \(m^{\lambda}_{A}=m^{\lambda}_{A}(\,\cdot \,;\nu )\) is a “good” approximation to the true conditional expectation

$$ m_{A}=m_{A}(x;\nu ):=\mathbb{E}_{(X,Y)\sim \nu}[ A(Y)|X=x] $$
(3.5)

for small enough \(\lambda >0\). Consider the Hilbert space \(\mathcal{L}_{2}^{\nu}:=L_{2}(\mathcal{X}, \nu (dx,\mathcal{X}))\), where \(\nu (U,\mathcal{X}) := \nu (U \times \mathcal{X})\), \(U\in \mathcal{B}(\mathcal{X})\), denotes the first marginal of \(\nu \). For \(f\in \mathcal{L}_{2}^{\nu}\), put

$$ T^{\nu }f(\,\cdot\,):=\int _{\mathcal{X}}k(\,\cdot \,,x)f(x)\nu (dx,\mathcal{X}). $$
(3.6)

Recalling (3.3), it is easy to see that \(T^{\nu}\) is a linear operator \(\mathcal{L}_{2}^{\nu}\to \mathcal{L}_{2}^{\nu}\). Note that \(\mathcal{H}\subseteq \mathcal{L}_{2}^{\nu}\) due to (3.2); thus \(\mathcal{C}^{\nu}\) is the restriction of \(T^{\nu}\) to ℋ. Further, since

$$ |k(x,y)| \leq \sqrt{k(x,x)} \sqrt{k(y,y)}, $$

the kernel \(k\) is Hilbert–Schmidt on \(\mathcal{L}_{2}(\mathcal{X\times X},\nu (dx,\mathcal{X}) \otimes \nu (dy,\mathcal{X}))\), i.e.,

$$ \int k^{2}(x,y)\nu (dx,\mathcal{X})\nu (dy,\mathcal{X})< \infty $$

due to Assumption 3.2. As a consequence of standard results from functional analysis, one then has (see for example [21, Sect. VI]) that

(i) the operator \(T^{\nu}\) is self-adjoint and compact;

(ii) there exists an orthonormal system \(\left ( a_{n}\right ) _{n\in \mathbb{N}}\) in \(\mathcal{L}^{\nu}_{2}\) of eigenfunctions corresponding to nonnegative eigenvalues \(\sigma _{n}\) of \(T^{\nu}\), and \(\sigma _{1}\ge \sigma _{2}\ge \sigma _{3}\ge \cdots \);

(iii) if \(J:=\{n\in \mathbb{N}: \sigma _{n}>0\}\), one has

$$ T^{\nu }f=\sum _{n\in J}\sigma _{n}\left \langle f,a_{n}\right \rangle _{\mathcal{L}^{\nu }_{2}}a_{n}, \qquad f\in \mathcal{L}_{2}^{ \nu}, $$
(3.7)

with \(\lim _{n\rightarrow \infty}\sigma _{n}=0\) if \(J=\mathbb{N}\).

A generalisation of Mercer’s theorem to unbounded domains, see Sun [26], implies the following statement.

Proposition 3.4

Let \(k\) be a kernel satisfying Assumption 3.2 and assume that \(\nu (\,\cdot \,,\mathcal{X})\) is a nondegenerate Borel measure, that is, for every nonempty open set \(U\subseteq \mathcal{X}\), one has \(\nu (U, \mathcal{X})>0\). Then one may take the eigenfunctions \(a_{n}\) in (3.7) to be continuous and \(k\) has a series representation

$$ k(x,y)=\sum _{n\in J}\sigma _{n}a_{n}(x)a_{n}(y),\qquad x,y\in \mathcal{X}, $$

with uniform convergence on compact sets. Moreover, \((\widetilde{a}_{n})_{n\in J}\) with \(\widetilde{a}_{n}:=\sqrt{\sigma _{n}}\,a_{n}\) is an orthonormal basis of ℋ, and the scalar product in ℋ takes the form

$$ \langle f,g\rangle _{\mathcal{H}}=\sum _{n\in J} \frac{\langle f,a_{n}\rangle _{\mathcal{L}_{2}^{\nu}} \langle g,a_{n}\rangle _{\mathcal{L}_{2}^{\nu}}}{\sigma _{n}} \qquad \,\, \textit{for }f,g\in \mathcal{H}. $$
(3.8)

Now we are ready to present the main result of this section, which quantifies the convergence properties of \(m^{\lambda}_{A}(\,\cdot \,,\nu )\) as \(\lambda \to 0\) for a fixed measure \(\nu \). Recall the notation (3.5). Let \(P_{\overline{\mathcal{H}}}\) denote the orthogonal projection in \(\mathcal{L}_{2}^{\nu}\) onto \(\overline{\mathcal{H}}\), the closure of ℋ in \(\mathcal{L}_{2}^{\nu}\). Then for any \(f\in \mathcal{L}_{2}^{\nu}\),

$$ P_{\overline{\mathcal{H}}}f=\sum _{n\in J} \langle f,a_{n} \rangle _{ \mathcal{L}_{2}^{\nu}} \, a_{n}\quad \text{and}\quad \langle P_{ \overline{\mathcal{H}}}f,a_{m} \rangle _{\mathcal{L}_{2}^{\nu}} = \langle f,a_{m} \rangle _{ \mathcal{L}_{2}^{\nu}} ,\qquad m\in J, $$
(3.9)

since \((a_{n})_{n\in J}\) is an orthonormal system in \(\mathcal{L}_{2}^{\nu}\).

Theorem 3.5

Assume that the kernel \(k\) satisfies Assumption 3.2, \(\nu (\,\cdot \,,\mathcal{X})\) is a nondegenerate Borel measure, and that \(m_{A}(\,\cdot \,;\nu )\in \mathcal{L}_{2}^{\nu}\) (for instance when \(A\) is bounded and measurable). Then for any \(\lambda >0\),

$$ \|P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu )-m_{A}^{\lambda}(\, \cdot \,;\nu ) \|_{\mathcal{L}_{2}^{\nu}}^{2}=\sum _{n\in J} \frac{\lambda ^{2}}{\left ( \sigma _{n}+\lambda \right ) ^{2}}\left \langle m_{A}(\,\cdot \,; \nu ),a_{n}\right \rangle _{\mathcal{L}_{2}^{ \nu}}^{2}. $$
(3.10)

In particular, \(\| P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu )-m_{A}^{\lambda}( \,\cdot \,;\nu ) \|_{\mathcal{L}_{2}^{\nu}}\rightarrow 0\) as \(\lambda \searrow 0\). If we have in addition \(P_{ \overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu )\in \mathcal{H}\), then

$$ \| P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu )-m_{A}^{\lambda}(\,\cdot \,;\nu ) \| _{\mathcal{H}}^{2}=\sum _{n\in J} \frac{\lambda ^{2}}{\left ( \sigma _{n}+\lambda \right ) ^{2}\sigma _{n}}\left \langle m_{A}(\,\cdot \,;\nu ),a_{n} \right \rangle _{\mathcal{L}_{2}^{\nu}}^{2}, $$
(3.11)

and thus \(\| P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu )-m_{A}^{\lambda }( \,\cdot \,;\nu ) \|_{\mathcal{H}}\rightarrow 0\) for \(\lambda \searrow 0\).

Theorem 3.5 establishes convergence of \(m_{A}^{\lambda}(\,\cdot \,;\nu )\) as \(\lambda \to 0\), but without a rate. Its proof is placed in Sect. 7. Additional assumptions are needed to guarantee a convergence rate. This is done in the following corollary.

Corollary 3.6

Suppose that the conditions of Theorem 3.5 are satisfied and that, moreover, for some \(\theta \in (0,1]\), we have

$$ \sum _{n\in J}\sigma _{n}^{-\theta}\left \langle m_{A}(\,\cdot \,;\nu ),a_{n}\right \rangle _{\mathcal{L}_{2}^{\nu}}^{2}< \infty . $$
(3.12)

Then

$$\begin{aligned} & \| P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu ) -m^{\lambda}_{A}( \,\cdot \,;\nu ) \| _{\mathcal{L}_{2}^{\nu}}^{2} \\ &\leq \left ( 1-\frac{\theta}{2}\right ) ^{2}\left ( \frac{\lambda \theta }{2-\theta}\right ) ^{\theta}\sum _{n\in J} \sigma _{n}^{-\theta}\left \langle m_{A}(\,\cdot \,;\nu ),a_{n} \right \rangle _{\mathcal{L}_{2}^{\nu}}^{2}. \end{aligned}$$
(3.13)

In particular, if \(\theta =1\), then \(P_{\overline{\mathcal{H}}}m_{A} \in \mathcal{H}\) and we get

$$ \| P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu ) -m^{\lambda}_{A}( \,\cdot \,;\nu ) \Vert _{\mathcal{L}_{2}^{\nu}} \leq \frac{\sqrt{\lambda}}{2} \| P_{\overline{\mathcal{H}}}m_{A}(\,\cdot \,;\nu ) \|_{\mathcal{H}}. $$
(3.14)

Proof

Inequality (3.13) follows from (3.10), (3.12) and the fact that the maximum of the function \(x\mapsto \lambda ^{2}x^{\theta}/(x+\lambda )^{2}\), \(x>0\), is attained at \(x=\lambda \theta /(2-\theta )\) and is equal to \({(1-\theta /2) ^{2} (\lambda \theta /(2-\theta ) ) ^{\theta}} \). Inequality (3.14) follows from (3.8), (3.9) and (3.13). □

Remark 3.7

If the operator \(T^{\nu}\) defined in (3.6) is injective, that is, \(T^{\nu}f = 0\) for \(f\in \mathcal{L}_{2}^{\nu}\) implies \(f = 0\) \(\nu \)-a.s., then \(P_{\overline{\mathcal{H}}}=I_{\mathcal{L}_{2}^{\nu}}\). In this case, \(J=\mathbb{N}\) and Theorem 3.5 and Corollary 3.6 quantify the convergence to the true conditional expectation. A sufficient condition for \(T^{\nu}\) to be injective is that the kernel \(k\) is integrally strictly positive definite (ispd) in the sense that

$$ \int _{\mathcal{X}\times \mathcal{X}}k(x,y)\mu (dx)\mu (dy)>0 $$

for all non-zero signed Borel measures \(\mu \) defined on \(\mathcal{X}\). Indeed, for any \(f\in \mathcal{L}_{2}^{\nu}\), we may define a signed Borel measure \(\mu _{f}(U):=\int _{U}f(x)\nu (dx,\mathcal{X})\), \(U\in \mathcal{B}(\mathcal{X})\), which is finite since, by the Cauchy–Schwarz inequality, \(\vert \mu _{f}(U) \vert \leq (\int f^{2}(x)\nu (dx,\mathcal{X}))^{1/2}<\infty \). Hence if \(k\) is an ispd kernel, then \(T^{\nu }f = 0\) implies

$$\begin{aligned} 0 = \langle T^{\nu}f,f \rangle _{\mathcal{L}_{2}^{\nu}}&=\int _{\mathcal{X}\times \mathcal{X}}k(y,x)f(x)f(y)\nu (dx, \mathcal{X})\nu (dy,\mathcal{X}) \\ & =\int _{\mathcal{X}\times \mathcal{X}}k(y,x)\mu _{f}(dx)\mu _{f}(dy), \end{aligned}$$

which in turn implies \(\mu _{f}=0\), i.e., \(f=0\) \(\nu \)-a.s. Furthermore, it should be noted that any ispd kernel is strictly positive definite in the usual sense, but the converse is not true. Examples of ispd kernels are Gaussian kernels, Laplace kernels and many more. For details on ispd kernels, we refer to Sriperumbudur et al. [24].

In summary, we have shown in this section that under certain conditions, \(m^{\lambda}_{A}(\,\cdot \,,\nu )\) may converge at least in the \({\mathcal{L}_{2}^{\nu}}\)-sense to the true conditional expectation \(m_{A}(\,\cdot \,,\nu )\) as \({\lambda \to 0}\). This makes the heuristic discussion around (1.10) and (1.11) in Sect. 1 more rigorous.

Remark 3.8

Note that the measure \(\widehat{\mu}_{t}\) in the solution of (2.4)–(2.6) depends on \(\lambda \) so that in fact \(\widehat{\mu}_{t}=\widehat{\mu}_{t}^{\lambda}\). Therefore, even when \(m^{\lambda}_{A}(\,\cdot \,,\nu )\to m_{A}(\,\cdot \,,\nu )\) for fixed \(\nu \) and \({\lambda \downarrow 0}\), the question whether \(m^{\lambda}_{A_{i}}(\,\cdot \,,\widehat{\mu}_{t}^{\lambda})\) converges in some sense is still not answered. We believe that this question is intimately linked to the problem of existence of a solution to (2.2), (2.3). As already explained, this is an open problem and is considered beyond our scope here. However, loosely speaking, assuming that the latter system has indeed a solution (in some sense) with solution measure \(\mu _{t}\), say, it is natural to expect that for a suitable “rich enough” RKHS, we obtain \(m^{\lambda}_{A_{i}}(\,\cdot \,,\mu _{t})\to m_{A_{i}}(\,\cdot \,, \mu _{t})\) (the true conditional expectation) as \(\lambda \searrow 0\).

4 Numerical algorithm

Let us now describe in detail our numerical algorithm to construct solutions to (1.8). We begin by discussing an efficient way of calculating \(m^{\lambda}_{A}\).

4.1 Estimation of the conditional expectation

Let us recall that in order to solve the particle system (2.7)–(2.9), we need to compute

$$ m^{\lambda}_{A}(\,\cdot \,; \mu ^{N}_{t}) = \operatorname*{arg\, min}_{f \in \mathcal{H}} \bigg( \frac{1}{N} \sum _{n=1}^{N} |A(Y_{t}^{N,n}) - f(X_{t}^{N,n})|^{2} + \lambda \left \lVert f\right \rVert _{\mathcal{H}}^{2} \bigg) $$
(4.1)

for \(t\) belonging to a certain partition of \([0,T]\) and fixed large \(N\in \mathbb{N}\); here \(A=A_{1}\) or \(A=A_{2}\). It follows from the representer theorem for RKHSs in Schölkopf et al. [23, Theorem 1] that \(m^{\lambda}_{A}\) has the representation

$$ m^{\lambda}_{A}(\,\cdot \,; \mu ^{N}_{t}) = \sum _{i=1}^{N} \alpha _{i} k(X_{t}^{N,i}, \,\cdot \,) $$
(4.2)

for some \(\alpha = (\alpha _{1}, \ldots , \alpha _{N})^{\top }\in \mathbb{R}^{N}\). Note that the optimal \(\alpha \) can be calculated explicitly by plugging the representation (4.2) into the minimisation problem (4.1) in place of \(f\) and minimising over \(\alpha \). However, computing the optimal \(\alpha \) directly takes \(O(N^{3})\) operations, which is prohibitively expensive keeping in mind that the number of particles \(N\) is going to be very large. Furthermore, even evaluating (4.2) at \(X_{t}^{N,n}\), \(n=1,\ldots , N\), for a given \(\alpha \in \mathbb{R}^{N}\) is rather expensive; it requires \(O(N^{2})\) operations and is thus impractical for the large numbers of particles we have in mind.

To develop an efficient algorithm, let us note that many particles \(X_{t}^{N,i}\) – and as a consequence the implied basis functions \(k(X_{t}^{N,i}, \,\cdot \,)\) – will be close to each other. Therefore we can considerably reduce the computational cost by only using \({L\ll N}\) rather than \(N\) basis functions as suggested in (4.2). More precisely, we choose \(Z^{1}, \ldots , Z^{L}\) among \(X_{t}^{N,1}, \ldots , X_{t}^{N,N}\) – e.g. by random choice or by taking every \(\frac{N}{L}\)th point among the ordered sequence \(X_{t}^{N,(1)}, \ldots , X_{t}^{N,(N)}\) when \(X\) is one-dimensional – and approximate

$$\sum _{i=1}^{N} \alpha _{i} k(X_{t}^{N,i}, \,\cdot \,) \approx \sum _{j=1}^{L} \beta _{j} k(Z^{j}, \,\cdot \,), $$

where \(\beta = (\beta _{1}, \ldots , \beta _{L})^{\top }\in \mathbb{R}^{L}\). It is easy to see that

$$\begin{aligned} \bigg\| \sum _{j=1}^{L} \beta _{j} k(Z^{j},\,\cdot \,)\bigg\| _{ \mathcal{H}}^{2}&= \bigg\langle \sum _{j=1}^{L} \beta _{j} k(Z^{j},\, \cdot \,),\sum _{j=1}^{L} \beta _{j} k(Z^{j},\,\cdot \,)\bigg\rangle _{ \mathcal{H}} \\ &=\sum _{j,\ell =1}^{L} \beta _{j}\beta _{\ell }\langle k(Z^{j},\, \cdot \,),k(Z^{\ell},\,\cdot \,)\rangle _{\mathcal{H}} \\ &=\sum _{j,\ell =1}^{L} \beta _{j}\beta _{\ell }k(Z^{j},Z^{\ell})= \beta ^{\top }R \beta , \end{aligned}$$

where \(R:= (k(Z^{j},Z^{\ell}))_{j,\ell =1,\ldots ,L}\) is an \(L\times L\) matrix. Thus recalling (4.1), we see that we have to solve

$$ \operatorname*{arg\, min}_{\beta \in \mathbb{R}^{L}} \bigg(\frac{1}{N}(G-K\beta )^{ \top}(G-K\beta )+\lambda \beta ^{\top }R \beta \bigg), $$

where \(G:=(A(Y_{t}^{N,n}))_{n=1,\ldots ,N}\) and \(K := (k(Z^{j},X_{t}^{N,n}))_{n=1,\ldots , N, j=1,\ldots ,L}\) is an \((N\times L)\)-matrix. Differentiating with respect to \(\beta \), we obtain that the optimal value \(\widehat{\beta}=\widehat{\beta}((X_{t}^{N}),(Y_{t}^{N}))\) satisfies

$$ (K^{\top }K+N\lambda R)\widehat{\beta}=K^{\top }G, $$
(4.3)

and we approximate the expectation as

$$ m^{\lambda}_{A}(x; \mu ^{N}_{t}) \approx \sum _{j=1}^{L} \widehat{\beta}_{j} k(Z^{j},x) =: \widehat{m}_{A}^{\lambda}(x; \mu ^{N}_{t}). $$
(4.4)
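The computation in (4.3), (4.4) is straightforward to implement. The following NumPy sketch is purely illustrative for the one-dimensional case (the Gaussian kernel, the quantile-based choice of landmark points, cf. (5.9) below, and all names are our own choices); it returns the landmarks, the fitted coefficients \(\widehat{\beta}\) and an evaluator for \(\widehat{m}^{\lambda}_{A}\).

```python
import numpy as np

def gauss_kernel(u, v, gamma=0.1):
    """Gaussian kernel k(u, v) = exp(-|u - v|^2 / (2 * gamma)) for 1-d arrays u, v."""
    return np.exp(-0.5 * (u[:, None] - v[None, :]) ** 2 / gamma)

def fit_m_lambda(X, AY, L=100, lam=1e-9):
    """Reduced kernel ridge regression (4.3): landmarks Z and coefficients beta."""
    N = len(X)
    Z = np.quantile(X, np.arange(1, L + 1) / (L + 1))        # landmark points Z^1,...,Z^L
    K = gauss_kernel(X, Z)                                    # (N x L) matrix K
    R = gauss_kernel(Z, Z)                                    # (L x L) matrix R
    beta = np.linalg.solve(K.T @ K + N * lam * R, K.T @ AY)   # normal equations (4.3)
    return Z, beta

def m_hat(x, Z, beta):
    """Evaluate the approximation (4.4) at the points x."""
    return gauss_kernel(np.atleast_1d(x), Z) @ beta
```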

Remark 4.1

The method of choosing basis points \(Z^{1}, \ldots , Z^{L}\) can be seen as a systematic and adaptive approach of choosing basis functions \(k(Z^{j}, \,\cdot \,)\), \(j=1, \ldots , L\), in a global regression method. We note that the technique of evaluating the conditional expectation only in points on a grid \(G_{f,t}\) coupled with spline-type interpolation between grid points suggested in Guyon and Henry-Labordère [13] is motivated by similar concerns regarding the explosion of computational costs.

Remark 4.2

Let us see how many operations we need to calculate \(\widehat{\beta}\), taking into account that \(L\ll N\). We need \(O(NL)\) to calculate \(K\), \(O(L^{2})\) to calculate \(R\), \(O(N L^{2})\) to calculate \(K^{\top }K\) (this is the bottleneck), \(O(L^{3})\) to invert \(K^{\top }K+N\lambda R\) and \(O(NL)\) to calculate \(K^{\top }G\) and solve (4.3). Thus in total, we need \(O(N L^{2})\) operations.

4.2 Solving the regularised McKean–Vlasov equation

With the function \(\widehat{m}_{A}^{\lambda}\) in hand, we now consider the Euler scheme for the particle system (2.7)–(2.9). We fix a time interval \([0,T]\), the number \(M\) of time steps and for simplicity consider a uniform time increment \(\delta :=T/M\). Let \(\Delta W_{i}^{X,n}\) and \(\Delta W_{i}^{Y,n}\) denote copies of \(W^{X}_{(i+1) \delta} - W^{X}_{i\delta}\) and \(W^{Y}_{(i+1) \delta} - W^{Y}_{i \delta}\), respectively, for \(n=1, \ldots , N\), \(i=0,\ldots , M-1\); the increments are independent across \(n\) and \(i\), while within each pair \(\Delta W_{i}^{X,n}\) and \(\Delta W_{i}^{Y,n}\) may be correlated, since for stochastic volatility models the Brownian motions driving the stock price and the variance process are usually correlated. We now define \(\widetilde{X}_{0}^{n}=X_{0}^{n}\), \(\widetilde{Y}_{0}^{n}=Y_{0}^{n}\) and for \(i=0,\ldots , M-1\),

$$\begin{aligned} \widetilde{X}^{n}_{i+1} &= \widetilde{X}^{n}_{i} + H\big(i\delta , \widetilde{X}^{n}_{i}, \widetilde{Y}^{n}_{i}, \widehat{m}^{\lambda}_{A_{1}}( \widetilde{X}^{n}_{i}; \widetilde{\mu}^{N}_{i}) \big) \delta \\ & \hphantom{=:} +F\big(i\delta , \widetilde{X}^{n}_{i}, \widetilde{Y}^{n}_{i}, \widehat{m}^{\lambda}_{A_{2}}(\widetilde{X}^{n}_{i}; \widetilde{\mu}^{N}_{i}) \big) \Delta W_{i}^{X,n}, \end{aligned}$$
(4.5)
$$\begin{aligned} \widetilde{Y}^{n}_{i+1} &= \widetilde{Y}^{n}_{i} + b(i\delta , \widetilde{Y}^{n}_{i}) \delta + \sigma (i\delta ,\widetilde{Y}^{n}_{i}) \Delta W^{Y,n}_{i}, \end{aligned}$$
(4.6)

where \(\widetilde{\mu}_{i}^{N}=\frac{1}{N}\sum _{n=1}^{N}\delta _{( \widetilde{X}_{i}^{N,n},\widetilde{Y}_{i}^{N,n})}\). Thus at each discretisation time step of (4.5), (4.6), we need to compute approximations of the conditional expectations \(\widehat{m}^{\lambda}_{A_{r}}(\widetilde{X}^{n}_{i}; \widetilde{\mu}^{N}_{i})\), \({r=1,2}\). This is done using the algorithm discussed in Sect. 4.1 and takes \(O(NL^{2})\) operations; see Remark 4.2. Thus the total number of operations needed to implement (4.5), (4.6) is \(O(MNL^{2})\).
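Continuing the sketch from Sect. 4.1 (and with the same caveats: this is an illustrative NumPy implementation with our own naming, not the code used for the experiments below), the loop (4.5), (4.6) can be organised as follows for the one-dimensional case, with user-supplied vectorised coefficient functions \(H\), \(F\), \(b\), \(\sigma \), vectorised \(A_{1}\), \(A_{2}\), and correlation \(\rho \) between the driving Brownian motions.

```python
def simulate_particles(X0, Y0, H, F, b, sig, A1, A2, T, M, rho=-0.9, lam=1e-9, L=100):
    """Euler scheme (4.5)-(4.6) for the particle system (2.7)-(2.9).
    X0, Y0: arrays with the N initial particle positions."""
    N, dt = len(X0), T / M
    X, Y = X0.copy(), Y0.copy()
    rng = np.random.default_rng(1)
    for i in range(M):
        t = i * dt
        # regularised conditional expectations m^lambda_{A_1}, m^lambda_{A_2} at the particles
        Z1, beta1 = fit_m_lambda(X, A1(Y), L=L, lam=lam)
        Z2, beta2 = fit_m_lambda(X, A2(Y), L=L, lam=lam)
        m1, m2 = m_hat(X, Z1, beta1), m_hat(X, Z2, beta2)
        # correlated one-dimensional Brownian increments for the X- and Y-components
        dWX = np.sqrt(dt) * rng.standard_normal(N)
        dWY = rho * dWX + np.sqrt((1.0 - rho ** 2) * dt) * rng.standard_normal(N)
        X = X + H(t, X, Y, m1) * dt + F(t, X, Y, m2) * dWX
        Y = Y + b(t, Y) * dt + sig(t, Y) * dWY
    return X, Y
```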

5 Numerical examples and applications to local stochastic volatility models

As a main application of the regularisation approach presented above, we consider the problem of calibration of stochastic volatility models to market data. Fix a time period \(T>0\). To simplify the calculations, we suppose that the interest rate is \(r=0\). Let \(C(t,K)\), \(t\in [0,T]\), \(K\ge 0\), be the price at time 0 of a European call option with strike \(K\) and maturity \(t\) on a non-dividend paying stock. We assume that the market prices \((C(t,K))_{t\in [0,T], K\ge 0}\) are given and satisfy the following conditions: (i) \(C\) is continuous and increasing in \(t\) and twice continuously differentiable in \(x\), (ii) \(\partial _{xx}C(t,x)>0\), (iii) \(C(t,x)\to 0\) as \(x\to \infty \) for any \(t\ge 0\) and \(C(t,0)={\mathrm{const}}\). It is known by Lowther [19, Theorem 1.3 and Sect. 2.1] and Dupire [8] that under these conditions, there exists a diffusion process \((S_{t})_{t\in [0,T]}\) which is able to perfectly replicate the given call option prices, that is, \(\mathbb{E}[ (S_{t}-K)^{+}]=C(t,K)\). Furthermore, \(S\) solves the stochastic differential equation

$$ dS_{t}=\sigma _{\text{Dup}}(t,S_{t})S_{t}\,dW_{t},\qquad t\in [0,T], $$
(5.1)

where \(W\) is a Brownian motion and \(\sigma _{\text{Dup}}\) is the Dupire local volatility given by

$$ \sigma _{\text{Dup}}^{2}(t,x):= \frac{2\partial _{t}C(t,x)}{x^{2}\partial _{xx}C(t,x)},\qquad x>0,\,t \in [0,T]. $$
(5.2)
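In the experiments below, \(\sigma _{\text{Dup}}\) is computed from the given call price surface via (5.2). A simple finite-difference version of this step (our own illustrative sketch; conditions (i)–(iii) above guarantee that the quotient is well defined) reads as follows.

```python
import numpy as np

def dupire_local_vol(C, t_grid, k_grid):
    """Finite-difference approximation of Dupire's formula (5.2).
    C[i, j] = C(t_grid[i], k_grid[j]) is a given grid of call prices."""
    dC_dt = np.gradient(C, t_grid, axis=0)           # partial_t C(t, x)
    dC_dk = np.gradient(C, k_grid, axis=1)
    d2C_dk2 = np.gradient(dC_dk, k_grid, axis=1)     # partial_xx C(t, x)
    return np.sqrt(2.0 * dC_dt / (k_grid[None, :] ** 2 * d2C_dk2))
```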

We study local stochastic volatility (LSV) models. That is, we assume that the stock price \(X\) follows the dynamics

$$ dX_{t}=\sqrt{Y_{t}} \,\sigma _{\text{LV}}(t,X_{t})X_{t}\, dW_{t}^{X}, \qquad t\in [0,T], $$
(5.3)

where \(W^{X}\) is a Brownian motion and \((Y_{t})_{t \in [0,T]}\) is a strictly positive variance process, both adapted to some filtration \(( \mathcal{F}_{t} )_{t\geq 0}\). If the function \(\sigma _{\text{LV}}\) is given by

$$ \sigma _{\text{LV}}^{2}(t,x) := \frac{\sigma _{\text{Dup}}^{2}(t,x)}{\mathbb{E}[ Y_{t} \vert X_{t}=x ] },\qquad x>0, t\in [0,T], $$

and \(\int _{0}^{T} \mathbb{E}[ Y_{t}\sigma _{\text{LV}}(t,X_{t})^{2} X_{t}^{2} ] \,dt<\infty \), then the one-dimensional marginal distributions of \(X\) coincide with those of \(S\) (see Gyöngy [14, Theorem 4.6], Brunick and Shreve [4, Corollary 3.7]). Thus

$$ C(T,K)=\mathbb{E}[(X_{T}-K)^{+}],\qquad T,K>0. $$
(5.4)

In particular, the choice \(Y\equiv 1\) recovers the local volatility model. If \(Y\) is a diffusion process,

$$ dY_{t}= b(t,Y_{t})dt+\sigma (t,Y_{t}) d W_{t}^{Y}, $$
(5.5)

where \(W^{Y}\) is a Brownian motion possibly correlated with \(W^{X}\), we see that the model (5.3)–(5.5) is a special case of the general McKean–Vlasov equation (2.2), (2.3). To solve (5.3)–(5.5), we implement the algorithm described in Sect. 4; see (4.5), (4.6) together with (4.4). We validate our results by doing two different checks. First, we verify that the one-dimensional distribution of \(\widetilde{X}^{1}_{M}\) is close to the correct marginal distribution \(\operatorname{Law}(X_{T})=\operatorname{Law}(S_{T})\). To do this, we compare the call option prices obtained by the algorithm (that is, \(N^{-1} \sum _{n=1}^{N}(\widetilde{X}_{M}^{n}-K)^{+}\)) with the given prices \(C(T,K)\) for various \(T>0\) and \(K>0\). If the algorithm is correct and if \(\widetilde{\mu}^{N}_{M}\approx \operatorname{Law}(X_{T},Y_{T})\), then according to (5.4), one must have

$$ C(T,K)\approx N^{-1} \sum _{n=1}^{N}(\widetilde{X}_{M}^{n}-K)^{+}=: \widetilde{C}(T,K). $$
(5.6)

On the other hand, if the algorithm is not correct and \(\operatorname{Law}(X_{T},Y_{T})\) is very different from \(\widetilde{\mu}^{N}_{M}\), then (5.6) will not hold.

Second, we also examine the multivariate (path) distribution of \((\widetilde{X}_{i})_{i=0,\dots ,M}\). Recall that for any \(t\in [0,T]\), we have \(\operatorname{Law}(X_{t})=\operatorname{Law}(S_{t})\). We want to make sure that the dynamics of the process \(\widetilde{X}\) is different from the dynamics of the local volatility process \(S\). As a test case, we compare option values on the quadratic variation of the logarithm of the price. More precisely, for each \(K>0\), we compare European options on quadratic variation,

$$\begin{aligned} &QV_{S}(K):= \frac{1}{N} \sum _{n=1}^{N} \bigg(\sum _{i=0}^{M-1} (\log S_{(i+1)T/M}^{n} -\log S_{iT/M}^{n} )^{2} - K\bigg)^{+}, \\ &QV_{\widetilde{X}}(K):= \frac{1}{N} \sum _{n=1}^{N} \bigg( \sum _{i=0}^{M-1} (\log \widetilde{X}^{n}_{i+1} -\log \widetilde{X}^{n}_{i} )^{2}-K\bigg)^{+}, \end{aligned}$$

and verify that these two curves are different. Here, \((S^{n}_{i})_{i=0,\dots ,M}\), \(n=1,\dots ,N\), denote Euler approximations of (5.1) on the same time grid. We also check that the prices of European options on quadratic variation converge as \(N \to \infty \).
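The Monte Carlo estimators \(QV_{S}(K)\) and \(QV_{\widetilde{X}}(K)\) are straightforward to evaluate from simulated log-price paths; a minimal sketch (names ours, with NumPy imported as np as in the previous sketches) is the following.

```python
def qv_option_prices(log_paths, strikes):
    """Monte Carlo prices of European options on the realised quadratic variation
    of the log-price. log_paths has shape (N, M + 1): one path per row."""
    qv = np.sum(np.diff(log_paths, axis=1) ** 2, axis=1)          # realised QV per path
    return np.array([np.mean(np.maximum(qv - K, 0.0)) for K in strikes])
```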

We consider two different ways to generate market prices \(C(T,K)\). First, we assume that the stock follows the Black–Scholes (BS) model, that is, we assume that \(\sigma _{\text{Dup}}\equiv {\mathrm{const}} =0.3\) and \(S_{0}=1\). Second, we consider a stochastic volatility model for the market, that is, we set \(C(T,K):=\mathbb{E}[ (\overline{S}_{T}-K)^{+} ]\), where \((\overline{S}_{t})_{t\geq 0}\) follows the Heston model

$$\begin{aligned} d\overline{S}_{t}&={\sqrt {v _{t}}}\,\overline{S}_{t}\,dW_{t}, \end{aligned}$$
(5.7)
$$\begin{aligned} dv_{t}&=\kappa (\theta -v _{t})\,dt+ \xi {\sqrt {v _{t}}}\,dB_{t}, \end{aligned}$$
(5.8)

with the following parameters: \(\kappa = 2.19\), \(\theta = 0.17023\), \(\xi = 1.04\) and correlation \(\rho = -0.83\) between the driving Brownian motions \(W\) and \(B\), with initial values \(\overline{S}_{0}=1\), \(v_{0}=0.0045\); cf. similar parameter choices in Lemaire et al. [18, Table 1]. We compute option prices based on (5.7), (5.8) with the COS method; see Fang and Oosterlee [9]. We then calculate \(\sigma _{\text{Dup}}\) from \(C(T,K)\) using (5.2).

As our baseline stochastic volatility model for \(Y\), we choose a capped-from-below Heston-type model, but with different parameters than the data-generating Heston model. Specifically, we set \(b(t,x)=\kappa _{Y}(\mu _{Y}-x)\) and \(\sigma (t,x)=\eta \sqrt{x}\) in (5.5), where \(Y_{0}=0.0144\), \(\kappa _{Y}=1\), \(\mu _{Y}=0.0144\) and \(\eta =0.5751\) (we write \(\kappa _{Y}\), \(\mu _{Y}\) to avoid confusion with the regularisation parameter \(\lambda \) and the Heston parameters above). We cap the solution of (5.5) from below at the level \(\varepsilon _{{\mathrm{CIR}}}=10^{-3}\) to avoid the singularity at 0. Numerical experiments have shown that such capping is necessary. We assume that the correlation between \(W^{X}\) and \(W^{Y}\) is very strong and equals −0.9, which makes calibration more difficult. Since the variance process has different parameters than the price-generating stochastic volatility model, a non-trivial local volatility function is required to match the implied volatility. Hence even though the generating model is of the same class, the calibration problem is still non-trivial and involves a singular McKean–Vlasov SDE.
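One possible implementation of the capped baseline variance dynamics (our own sketch, with a simple per-step cap; the exact capping mechanism is not specified further in the text) is the following, again with NumPy imported as np.

```python
def simulate_capped_cir(Y0, T, M, N, kappa_Y=1.0, mu_Y=0.0144, eta=0.5751,
                        eps_cir=1e-3, rng=None):
    """Euler scheme for dY = kappa_Y*(mu_Y - Y) dt + eta*sqrt(Y) dW,
    with the solution capped from below at eps_cir after every step."""
    rng = rng or np.random.default_rng(2)
    dt = T / M
    Y = np.full(N, Y0, dtype=float)
    for _ in range(M):
        dW = np.sqrt(dt) * rng.standard_normal(N)
        Y = Y + kappa_Y * (mu_Y - Y) * dt + eta * np.sqrt(Y) * dW
        Y = np.maximum(Y, eps_cir)   # cap from below, as described in the text
    return Y
```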

We took ℋ to be the RKHS associated with the Gaussian kernel \(k\) with variance 0.1. We fix the number of time steps as \(M=500\) and take \(\lambda =10^{-9}\), \(L=100\). At each time step of the Euler scheme, we choose \((Z^{j}_{m})_{j=1,\dots , L}\) by the rule that

$$ \text{$Z^{j}_{m}$ is the $j\frac{100}{L+1}$-percentile of the sequence $( \widetilde{ X}^{n}_{m})_{n=1, \dots , N}$,}$$
(5.9)

an approach comparable to the choice of the evaluation grid \(G_{f,t}\) suggested in Guyon and Henry-Labordère [13].

Figure 1 compares the theoretical and the calculated prices (in terms of implied volatilities) in the (a) Black–Scholes and (b)–(d) Heston settings for various strikes and maturities. That is, we first calculate \(C(T,K)\) using the Black–Scholes model (“Black–Scholes setting”) or (5.7), (5.8) (“Heston setting”); then we calculate \(\sigma ^{2}_{\text{Dup}}\) by (5.2); then we calculate \(\widetilde{X}^{n}_{M}\), \(n=1,\ldots ,N\), using the algorithm (4.5), (4.6) with \(H\equiv 0\), \(A_{2}(x)=x\) and

$$ F(t,x,y,z):=x \sigma _{\text{Dup}}(t,x) \frac{\sqrt {y}}{\sqrt {z\vee \varepsilon } }, $$

where \(\varepsilon =10^{-3}\) (see also Reisinger and Tsianni [22]); then we calculate \(\widetilde{C}(T,K)\) using (5.6); finally, we transform the prices \(C(T,K)\) and \(\widetilde{C}(T,K)\) to the implied volatilities. We should like to note that this additional capping of the function \(F\) is less critical than the capping of the baseline process \(Y\).
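To make the whole procedure concrete, the sketch below shows one possible implementation of a single Euler step of the particle scheme: the conditional expectation \(\mathbb{E}[Y_{t}\mid X_{t}=x]\) is replaced by a kernel ridge regression estimate restricted to the span of the \(L\) kernel functions \(k(\,\cdot \,,Z^{j}_{m})\), and the capped diffusion coefficient \(F\) above is then evaluated particle-wise. All function names, the normalisation of the least-squares system and other details are our own choices and may differ from the implementation used to produce the figures.

```python
import numpy as np

def gauss_kernel(x, z, var=0.1):
    """Gaussian kernel with variance 0.1 (our reading of the bandwidth in Sect. 5)."""
    return np.exp(-(x[:, None] - z[None, :]) ** 2 / (2.0 * var))

def ridge_cond_expectation(X, Y, Z, lam=1e-9, var=0.1):
    """Kernel-ridge estimate of x -> E[Y | X = x], restricted to span{k(., Z_j)}:
    minimise (1/N) sum_n (Y_n - f(X_n))^2 + lam * ||f||_H^2 over such f."""
    G = gauss_kernel(X, Z, var)              # (N, L): k(X_n, Z_j)
    Kzz = gauss_kernel(Z, Z, var)            # (L, L): k(Z_j, Z_l)
    N = len(X)
    alpha = np.linalg.solve(G.T @ G / N + lam * Kzz, G.T @ Y / N)
    return lambda x: gauss_kernel(np.atleast_1d(x), Z, var) @ alpha

def euler_step(X, Y, sigma_dup, t, dt, dW, lam=1e-9, L=100, eps=1e-3):
    """One Euler step of the particle system for X with diffusion
    F(t, x, y, z) = x * sigma_Dup(t, x) * sqrt(y) / sqrt(z v eps)."""
    Z = np.percentile(X, 100.0 * np.arange(1, L + 1) / (L + 1))   # rule (5.9)
    m = ridge_cond_expectation(X, Y, Z, lam)
    cond = np.maximum(m(X), eps)             # the capping z v eps in F
    return X + X * sigma_dup(t, X) * np.sqrt(Y) / np.sqrt(cond) * dW
```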

Fig. 1

Fit of the smile for different number of particles: (a) Black–Scholes setting, \(T=1\) year; (b) Heston setting, \(T=1\) year; (c) Heston setting, \(T=4\) years; and (d) Heston setting, \(T=10\) years

We plot in Fig. 1 implied volatilities for a wide range of strikes and maturities. More precisely, we consider all strikes \(K\) such that \({\mathbb{P}[S_{T}< K]}\in [0.05,0.95]\) – this corresponds to all but very far in-the-money and out-of-the-money options. One can see from Fig. 1 that already for \(N=10^{3}\) trajectories, the identity (5.6) holds up to a small error for all the considered strikes and maturities. This error further diminishes as the number of trajectories increases. At \(N=10^{5}\), the true implied volatility curve and the one calculated from our approximation model become almost indistinguishable.

We plot the prices of the options on the quadratic variation of the log-price in Fig. 2. Note that in the Black–Scholes model ((5.1) with \(\sigma _{{\mathrm{Dup}}}\equiv \sigma \)), we have \(\langle \log S\rangle _{T}=\sigma ^{2} T\) and thus \(\mathbb{E}[(\langle \log S\rangle _{T} -K)^{+}]=(\sigma ^{2} T-K)^{+}\). As shown in Fig. 2(a), the prices of the options on the quadratic variation of \(X\) are vastly different from these deterministic values. This implies that although the marginal distributions of \(X\) and \(S\) are identical, their dynamics are markedly dissimilar. We also see that these curves converge as the number of particles increases to infinity, which shows that the dynamics of \((\widetilde{X}^{n})\) are stable with respect to \(N\). Options on the quadratic variation of the log-price for the Heston setting are presented in Fig. 2(b). We see that in this case, the dynamics of \(X\) and \(S\) are different as expected, and the dynamics of \((\widetilde{X}^{n})\) are again stable.

Fig. 2

Prices of options on log of quadratic variation for different number of particles: (a) Black–Scholes setting, \(T=1\) year; (b) Heston setting, \(T=1\) year

It is interesting to compare our approach with the algorithms of Guyon and Henry-Labordère [13, 12]. We consider a numerical setup similar to [12, p. 10], taking \(N=10^{6}\) particles to calculate implied volatilities. However, we calibrate our model and calculate the approximation of the conditional expectation using only \(N_{1}=1000\) of these particles. We compare our results in the Black–Scholes (a) and Heston (b) settings against implied volatilities calculated via the Euler method for the local volatility model \(S\). Figure 3 shows close agreement between the results of the two methods.

Fig. 3

Comparison with Guyon and Henry-Labordère [13]: (a) Black–Scholes setting, \(T=1\) year; (b) Heston setting, \(T=1\) year

Remark 5.1

The computational time needed to run our algorithm is comparable with that of the algorithm of [12]; in both cases, it depends strongly on implementation details.

Figure 4 shows that not only do the marginal distributions of \(X\) calculated with our method and [13] agree with each other, but so do the distributions of the processes. We also observe that in both settings, the dynamics of \(X\) are different from the dynamics of \(S\).

Fig. 4

Comparison with Guyon and Henry-Labordère [13]. Options on quadratic variation: (a) Black–Scholes setting, \(T=1\) year; (b) Heston setting, \(T=1\) year

Now let us discuss the stability of our model as the regularisation parameter \(\lambda \to 0\). We studied the absolute error in the implied volatility of the 1-year ATM call option for various \(\lambda \in [10^{-9},1]\) in the Black–Scholes and Heston settings described above. We used \(N=10^{6}\) trajectories and \(L=100\) at each step according to (5.9), and performed 100 repetitions at each considered value of \(\lambda \). The results are presented in Fig. 5. The vertical lines in Figs. 5–7 denote the standard deviation in the absolute errors of the implied volatilities. We observe that in both settings, the error initially drops as \(\lambda \) decreases and then stabilises around \(\lambda \approx 10^{-9}\). Therefore, for all of our calculations, we took \(\lambda =10^{-9}\). It is evident that the error does not blow up as \(\lambda \) becomes very small.

Fig. 5

Mean absolute implied volatility error for different values of \(\lambda \): (a) Black–Scholes setting; (b) Heston setting

Let us examine how the error in call option prices in (5.6) (and therefore the distance between the laws of the true and approximated solutions) depends on the number \(N\) of trajectories. Recall that it follows from Theorem 2.3 that this error should decrease as \(N^{-1/4}\) (note the square in the left-hand side of (2.10)). Figure 6 shows how the absolute error in the implied volatility of a 1-year ATM call option decreases as the number of trajectories increases in (a) the Black–Scholes setting and (b) the Heston setting. We took \(\lambda =10^{-9}\), \(L=100\), \(N\in [250, 2^{8}\times 250]\) and performed 100 repetitions at each value of \(N\). We see that the error decreases as \(O(N^{-1/2})\) in both settings, which is even better than predicted by theory.
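The black line in Fig. 6 is obtained by fitting a power law to the measured errors; one simple way to do this (a sketch, with hypothetical arrays `Ns` and `mean_abs_errors` holding the results of the repetitions) is an ordinary least-squares fit in log-log coordinates.

```python
import numpy as np

def fit_rate(Ns, mean_abs_errors):
    """Fit error ~ C * N^p by linear regression in log-log coordinates."""
    p, logC = np.polyfit(np.log(Ns), np.log(mean_abs_errors), 1)
    return p, np.exp(logC)   # empirical exponent p and constant C

# An exponent p close to -0.5 reproduces the O(N^{-1/2}) behaviour reported above.
```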

Fig. 6

Mean absolute implied volatility error versus number of trajectories. The black line is the approximation: error \(=C N^{-1/2}\); (a) Black–Scholes setting, \(C=0.469\); (b) Heston setting, \(C=0.303\)

We collect average errors in implied volatilities of 1-year European call options for different strikes in Table 1. We considered the Heston setting and, as above, used \(\lambda =10^{-9}\), \(L=100\), \(N=10^{5}\).

Table 1 Average error in implied volatility of 1-year options with given strike. Heston setting

We also investigate the dependence of the error in the implied volatility on the number \(L\) of basis functions in the representation (4.4). Recall that since the number of operations depends on \(L\) quadratically (it equals \(O(MNL^{2})\)), it is very expensive to take \(L\) large. In Fig. 7, we plot the dependence on \(L\) of the absolute error in the implied volatility of a 1-year ATM call option. We used \(N=10^{6}\) trajectories, \(\lambda =10^{-9}\), \(L\in \{1,\ldots ,100\}\) and performed 100 repetitions at each value of the number of basis functions. We see that as the number of basis functions increases, the error first drops significantly and then stabilises at \(L\approx 80\).

Fig. 7

Mean absolute implied volatility error versus number of basis functions: (a) Black–Scholes setting; (b) Heston setting

5.1 On the choice of \((\varepsilon ,\varepsilon _{{\mathrm{CIR}}})\)

We recall that there are two different truncations involved in the model. First, we cap the CIR process from below at the level of \(\varepsilon _{{\mathrm{CIR}}}=10^{-3}\). Second, in the Euler scheme (4.5), (4.6), we take as a diffusion coefficient

$$ F(t,x,y,z):=x \sigma _{\text{Dup}}(t,x) \frac{\sqrt {y}}{\sqrt {z\vee \varepsilon } } $$

with \(\varepsilon =10^{-3}\). We claim that both of these truncations are necessary.

Figure 8 shows the fit of the smile for 1-year European call options depending on \(\varepsilon \) and \(\varepsilon _{{\mathrm{CIR}}}\). We use the model of Sect. 5 with \(M=500\) time steps and \(N=10^{6}\) trajectories. We see from these plots that if \(\varepsilon \) or \(\varepsilon _{{\mathrm{CIR}}}\) is either too small or too large, the smile produced by the model may not closely match the true implied volatility curve. Hence both truncation levels need to be chosen with some care; in particular, a certain lower capping of the CIR process is indeed necessary.

Fig. 8

Fit of the smile of 1-year call options for different truncation levels: (a) \(\varepsilon =10^{-3}\), \(\varepsilon _{{\mathrm{CIR}}}\) varies; (b) \(\varepsilon _{{\mathrm{CIR}}}=10^{-3}\), \(\varepsilon \) varies; and (c) \(\varepsilon =\varepsilon _{{\mathrm{CIR}}}\) varies

6 Conclusion and outlook

In this paper, we study the problem of calibrating local stochastic volatility models via the particle approach pioneered in Guyon and Henry-Labordère [13]. We suggest a novel RKHS-based regularisation method and prove that this regularisation guarantees well-posedness of the underlying McKean–Vlasov SDE and the propagation of chaos property. Our numerical results suggest that the proposed approach is efficient for the calibration of various local stochastic volatility models and achieves an accuracy comparable to that of the widely used local regression methods; see [13]. Several questions remain open. First, it is unclear whether the regularised McKean–Vlasov SDE remains well posed when the regularisation parameter \(\lambda \) tends to zero; this limiting case requires a separate study. Another important issue is the choice of the RKHS and of the number of basis functions, which ideally should be adapted to the problem at hand. This adaptation problem is left for future research.

7 Proofs

In this section, we present the proofs of the results from Sects. 2 and 3.

Proof of Proposition 3.3

Since ℋ is separable, there exist \(I\subseteq \mathbb{N}\) and a total orthonormal system \(e:=(e_{i})_{i\in I}\) in ℋ (note that \(I\) is finite if ℋ is finite-dimensional). Define the vector \(\gamma ^{\nu}\in \ell _{2}(I)\) by

$$\begin{aligned} \gamma ^{\nu}_{i}:=\langle e_{i},c^{\nu}_{A}\rangle _{\mathcal{H}} & = \int _{\mathcal{X}\times \mathcal{X}} \langle e_{i},k(\,\cdot \,,x) \rangle _{\mathcal{H}}\,A(y)\nu (dx,dy) \\ & =\int _{\mathcal{X}\times \mathcal{X}} e_{i}(x)A(y)\nu (dx,dy), \qquad i\in I. \end{aligned}$$
(7.1)

Since the operator \(\mathcal{C}^{\nu}\) is bounded, it may be described by the (possibly infinite) symmetric matrix

$$ B^{\nu}:= ( \langle e_{i},\mathcal{C}^{\nu}e_{j}\rangle _{\mathcal{H}} ) _{(i,j)\in I\times I}=\bigg( \int _{\mathcal{X}}e_{i}(x)e_{j}(x)\, \nu (dx,\mathcal{X})\bigg) _{(i,j)\in I\times I}, $$
(7.2)

which acts as a bounded positive semidefinite operator on \(\ell _{2}(I)\). Denote

$$ \beta ^{\nu}=(B^{\nu}+\lambda I)^{-1}\gamma ^{\nu}. $$
(7.3)

For \(f\in \mathcal{H}\), write \(f=\sum _{i\in I} \beta _{i} e_{i}\). Then, recalling (7.1) and (7.2), we derive

$$\begin{aligned} &\operatorname*{arg\, min}_{f\in \mathcal{H}}\bigg( \int _{\mathcal{X}\times \mathcal{X}}|A(y)-f(x)|^{2}\,\nu (dx,dy)+\lambda \| f\|_{\mathcal{H}}^{2} \bigg) \\ & =\operatorname*{arg\, min}_{\beta \in \ell _{2}(I)}\bigg( \int _{\mathcal{X}\times \mathcal{X}}|A(y)-\sum _{i\in I} \beta _{i} e_{i}|^{2}\,\nu (dx,dy)+ \lambda \|\beta \|_{\ell _{2}(I)}^{2}\bigg) \\ & =\operatorname*{arg\, min}_{\beta \in \ell _{2}(I)} \big( -2 \langle \beta ,\gamma ^{ \nu}\rangle _{\ell _{2}(I)}+ \langle \beta ,(B^{\nu}+\lambda I) \beta \rangle _{\ell _{2}(I)} \big) \\ & =\operatorname*{arg\, min}_{\beta \in \ell _{2}(I)}\big( \langle \beta -\beta ^{\nu},(B^{ \nu}+\lambda I) (\beta -\beta ^{\nu})\rangle _{\ell _{2}(I)} \big) \\ & = \beta ^{\nu}, \end{aligned}$$

where we inserted the definition (7.3) and used the fact that \(B^{\nu}+\lambda I\) is strictly positive definite for \(\lambda >0\). To complete the proof, it remains to note that

$$ \sum _{i\in I}\beta ^{\nu}_{i} e_{i}=(\mathcal{C}^{\nu}+ \lambda I_{\mathcal{H}})^{-1}c^{\nu}_{A}, $$

which shows (3.4). □
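Proposition 3.3 is constructive, and for an empirical measure \(\nu \) (the case arising in the particle scheme), the coefficient formula (7.3) can be checked numerically against the equivalent kernel (representer) form of ridge regression. The following toy sketch uses a hand-picked three-dimensional feature map rather than the Gaussian RKHS of Sect. 5 and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
N, lam = 200, 1e-3
X = rng.normal(size=N)
Y = np.sin(X) + 0.1 * rng.normal(size=N)     # take A(y) = y for simplicity

def phi(x):
    """Toy feature map phi(x) = (1, x, x^2), so k(x, x') = phi(x).phi(x')."""
    return np.stack([np.ones_like(x), x, x ** 2], axis=-1)

Phi = phi(X)                                 # (N, 3)

# Coefficients as in (7.3) for the empirical measure nu = (1/N) sum_n delta_{(X_n, Y_n)}:
B = Phi.T @ Phi / N
gamma = Phi.T @ Y / N
beta = np.linalg.solve(B + lam * np.eye(3), gamma)

# Equivalent kernel (representer) form: f = sum_n alpha_n k(., X_n),
# with alpha = (K + N*lam*I)^{-1} Y and K_{nm} = k(X_n, X_m).
K = Phi @ Phi.T
alpha = np.linalg.solve(K + N * lam * np.eye(N), Y)

x_test = np.linspace(-2, 2, 5)
f_feature = phi(x_test) @ beta
f_kernel = phi(x_test) @ Phi.T @ alpha
print(np.allclose(f_feature, f_kernel))      # True: both evaluate m_A^lambda
```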

Proof of Theorem 2.4

Let us write

$$\begin{aligned} |m_{A}^{\lambda}(x;\mu )-m_{A}^{\lambda}(y;\nu )| & \le | m_{A}^{ \lambda}(x;\mu )-m_{A}^{\lambda}(x;\nu )| +| m_{A}^{\lambda}(x;\nu )-m_{A}^{ \lambda}(y;\nu )| \\ & =:I_{1}+I_{2}. \end{aligned}$$
(7.4)

Working with respect to the orthonormal basis introduced in the proof of Proposition 3.3, see (7.3), we derive for the first term in (7.4) that

$$\begin{aligned} I_{1} & =|\langle k(x,\,\cdot \,),m^{\lambda}_{A}(\,\cdot \,;\mu )-m^{ \lambda}_{A}(\,\cdot \,;\nu )\rangle _{\mathcal{H}}| \\ &\le \|k(x,\,\cdot \,)\|_{\mathcal{H}}\|m^{\lambda}_{A}(\,\cdot \,; \mu )-m^{\lambda}_{A}(\,\cdot \,;\nu )\|_{\mathcal{H}} \\ & \leq \sqrt{k(x,x)}\| \beta ^{\mu}-\beta ^{\nu}\|_{\ell _{2}(I)} \\ & \le D_{k} \| \beta ^{\mu}-\beta ^{\nu}\|_{\ell _{2}(I)}, \end{aligned}$$
(7.5)

where we used (3.1) and Assumption 2.1.

Denote \(Q^{\nu}:=B^{\nu}+\lambda I\) and \(Q^{\mu}:= B^{\mu}+\lambda I\). Recalling that these are bounded operators from \(\ell _{2}(I)\) to \(\ell _{2}(I)\) with bounded inverses, it is easy to see that

$$ \| (Q^{\mu})^{-1}-(Q^{\nu})^{-1}\| _{\ell _{2}(I)}\leq \| (Q^{\mu})^{-1} \|_{\ell _{2}(I)}\| (Q^{\nu})^{-1}\|_{\ell _{2}(I)}\|Q^{\mu}-Q^{\nu} \|_{\ell _{2}(I)}. $$

Therefore we get

$$\begin{aligned} \| \beta ^{\mu}-\beta ^{\nu}\|_{\ell _{2}(I)} & =\|(Q^{\mu})^{-1} \gamma ^{\mu}-(Q^{\nu})^{-1}\gamma ^{\nu}\|_{\ell _{2}(I)} \\ & \le \big\| \big( (Q^{\mu})^{-1}-(Q^{\nu})^{-1}\big) \gamma ^{\mu} \big\| _{\ell _{2}(I)}+ \| (Q^{\nu})^{-1}( \gamma ^{\mu}-\gamma ^{\nu}) \|_{\ell _{2}(I)} \\ & \leq \| (Q^{\mu})^{-1} \| _{\ell _{2}(I)} \| (Q^{\nu})^{-1} \|_{ \ell _{2}(I)} \| Q^{\mu}-Q^{\nu} \| _{\ell _{2}(I)} \Vert \gamma ^{ \mu} \| _{\ell _{2}(I)} \\ & \hphantom{=:} + \| (Q^{\nu})^{-1} \Vert _{\ell _{2}(I)} \Vert \gamma ^{\mu}-\gamma ^{ \nu} \Vert _{\ell _{2}(I)} \\ & \leq \frac{1}{\lambda ^{2}} \| B^{\mu}-B^{\nu} \|_{\ell _{2}(I)} \Vert \gamma ^{\mu} \Vert _{\ell _{2}(I)}+\frac{1}{\lambda } \| \gamma ^{\mu}-\gamma ^{\nu} \|_{\ell _{2}(I)}. \end{aligned}$$
(7.6)

Now observe that for any \(i,j\in I\), we have

$$\begin{aligned} ( B_{ij}^{\mu}-B_{ij}^{\nu}) ^{2} & =\bigg( \int _{\mathcal{X}} e_{i}(x)e_{j}(x) \big( \mu (dx,\mathcal{X})-\nu (dx,\mathcal{X})\big) \bigg) ^{2} \\ & =\int _{\mathcal{X}}\int _{\mathcal{X}} e_{i}(x)e_{j}(x)e_{i}(y)e_{j}(y) \\ & \hphantom{=:\int _{\mathcal{X}}\int _{\mathcal{X}}} \times \big( \mu (dx,\mathcal{X})-\nu (dx,\mathcal{X})\big) \big( \mu (dy,\mathcal{X})-\nu (dy,\mathcal{X})\big) . \end{aligned}$$

Hence by using the identity

$$\begin{aligned} \sum _{i\in I}e_{i}(x)e_{i}(y)&=\sum _{i\in I} \langle k(x,\,\cdot \,),e_{i} \rangle _{\mathcal{H}} \, \langle k(y,\,\cdot \,),e_{i} \rangle _{ \mathcal{H}} \\ & = \langle k(x,\,\cdot \,) , k(y,\,\cdot \,) \rangle _{\mathcal{H}}=k(x,y), \end{aligned}$$
(7.7)

we get

$$\begin{aligned} & \|B^{\mu}-B^{\nu} \|_{\ell _{2}(I)}^{2} \\ &\le \| B^{\mu}-B^{\nu} \|_{{\mathrm{HS}}}^{2} \\ &=\int _{\mathcal{X}}\big( \mu (dx,\mathcal{X})-\nu (dx,\mathcal{X}) \big) \int _{\mathcal{X}} k^{2} (x,y)\big( \mu (dy,\mathcal{X})-\nu (dy, \mathcal{X})\big). \end{aligned}$$
(7.8)

By the Kantorovich–Rubinstein duality formula (see Villani [27, Chap. 1]), for every \(h:\mathcal{X} \rightarrow \mathbb{R}\) with \(h\in C^{1}(\mathcal{X})\), one has

$$\begin{aligned} \bigg| \int _{\mathcal{X}} h(x)\big( \mu (dx,\mathcal{X})-\nu (dx, \mathcal{X})\big) \bigg| & =\bigg| \int _{\mathcal{X}\times \mathcal{X}} h(x) \big( \mu (dx,dy)-\nu (dx,dy)\big) \bigg| \\ & \leq \mathbb{W}_{1}(\mu ,\nu ) \sup _{x\in \mathcal{X}} \vert \partial _{x} h(x) \vert , \end{aligned}$$

where \(\partial _{x}\) denotes the gradient with respect to \(x\). So we continue (7.8) with

$$ \|B^{\mu}-B^{\nu}\|_{\ell _{2}(I)}^{2}\leq \mathbb{W}_{1}(\mu ,\nu ) \sup _{x\in \mathcal{X}}\bigg| \int _{\mathcal{X}}\partial _{x}k^{2}(x,y)\big( \mu (dy,\mathcal{X})-\nu (dy,\mathcal{X})\big) \bigg|, $$
(7.9)

and for each particular \(x\in \mathcal{X}\), we have similarly

$$\begin{aligned} &\bigg|\int _{\mathcal{X}}\partial _{x}k^{2}(x,y)\big(\mu (dy, \mathcal{X})-\nu (dy,\mathcal{X})\big)\bigg| \\ & \leq \sum _{i=1}^{d}\bigg|\int _{\mathcal{X}}\partial _{x_{i}}k^{2}(x,y) \big(\mu (dy,\mathcal{X})-\nu (dy,\mathcal{X})\big)\bigg| \\ & \leq \mathbb{W}_{1}(\mu ,\nu ) \sum _{i=1}^{d}\sup _{y\in \mathcal{X}}|\partial _{y}\partial _{x_{i}}k^{2}(x,y)| \\ & \leq d^{2}D_{k}^{2}\mathbb{W}_{1}(\mu ,\nu ), \end{aligned}$$

where the last inequality follows from Assumption 2.1. Combining this with (7.9), we deduce

$$ \|B^{\mu}-B^{\nu}\|_{\ell _{2}(I)} \leq D_{k}\mathbb{W}_{1}(\mu ,\nu )d. $$
(7.10)

By a similar argument, using (7.7), we derive

$$\begin{aligned} & \| \gamma ^{\mu}-\gamma ^{\nu} \|_{\ell _{2}(I)}^{2} \\ & \le \sum _{i\in I} \int _{\mathcal{X}\times \mathcal{X}}\int _{ \mathcal{X}\times \mathcal{X}} e_{i}(x)e_{i}(x')A(y)A(y')(\mu -\nu )(dx,dy)( \mu -\nu )(dx',dy') \\ & \le \int _{\mathcal{X}\times \mathcal{X}}\int _{\mathcal{X}\times \mathcal{X}} k(x,x')A(y)A(y')(\mu -\nu )(dx,dy)(\mu -\nu )(dx',dy') \\ & \leq d^{2}\mathbb{W}_{1}^{2}(\mu , \nu )\| A\|_{\mathcal{C}^{1}}^{2}D_{k}^{2} , \end{aligned}$$
(7.11)

where again Assumption 2.1 was used. Next note that

$$\begin{aligned} \| \gamma ^{\mu}\|_{\ell _{2}(I)}^{2} &=\int _{\mathcal{X}\times \mathcal{X}}\int _{\mathcal{X}\times \mathcal{X}} k(x,x')A(y)A(y') \mu (dx,dy)\mu (dx',dy') \\ & \leq \int _{\mathcal{X}\times \mathcal{X}}\int _{\mathcal{X}\times \mathcal{X}} | A(y) | \sqrt{k(x,x)} \, | A(y') | \sqrt{k(x',x')} \, \mu (dx,dy)\mu (dx',dy') \\ & =\Bigl( \int _{\mathcal{X}\times \mathcal{X}} | A(y) | \sqrt{k(x,x)} \, \mu (dx,dy)\Bigr)^{2} \\ & \leq \int _{\mathcal{X}\times \mathcal{X}} | A(y) | ^{2}\mu (dx,dy) \int _{\mathcal{X}\times \mathcal{X}} k(x,x)\mu (dx,dy) \\ &\leq D_{k}^{2}\| A\|_{\mathcal{C}^{1}}^{2} \end{aligned}$$
(7.12)

due to Assumption 2.1. Substituting now (7.10)–(7.12) into (7.6) and then into (7.5), we finally get

$$ I_{1} \le (\lambda ^{-1}D_{k}+1)\lambda ^{-1}D_{k}^{2}\mathbb{W}_{1}( \mu ,\nu )d\|A\|_{\mathcal{C}^{1}}. $$
(7.13)

Now let us bound \(I_{2}\) in (7.4). We clearly have

$$ I_{2} =| \langle k(x,\,\cdot \,)-k(y,\,\cdot \,),m_{A}^{\lambda}(\, \cdot \,;\nu )\rangle | \le \| k(x,\,\cdot \,)-k(y,\,\cdot \,)\|_{ \mathcal{H}}\|m_{A}^{\lambda}(\,\cdot \,;\nu )\|_{\mathcal{H}}. $$
(7.14)

Note that

$$\begin{aligned} &\| k(x,\,\cdot \,)-k(y,\,\cdot \,)\|_{\mathcal{H}}^{2} \\ & =\langle k(x,\,\cdot \,)-k(y,\,\cdot \,),k(x,\,\cdot \,)-k(y,\, \cdot \,)\rangle _{\mathcal{H}} \\ & =k(x,x)-k(x,y)-\big(k(y,x)-k(y,y)\big) \\ & =\bigg( \int _{0}^{1} \partial _{2} k\big(x,x+\xi (y-x)\big)\,d\xi \bigg)^{\top}(x-y) \\ & \hphantom{=:} -\bigg( \int _{0}^{1} \partial _{2} k\big(y,x+\xi (y-x)\big)\,d\xi \bigg)^{\top}(x-y) \\ & =(x-y)^{\top}\bigg( \int _{0}^{1}\int _{0}^{1} \partial _{1} \partial _{2} k\big(x+\eta (y-x),x+\xi (y-x)\big)\,d\xi d\eta \bigg)^{ \top}(x-y), \end{aligned}$$

with \(\partial _{1}\), \(\partial _{2}\) denoting the vector of derivatives of \(k\) with respect to the first and second argument, respectively. Recalling Assumption 2.1, we derive

$$ \| k(x,\,\cdot \,)-k(y,\,\cdot \,)\|_{\mathcal{H}}^{2} \leq dD_{k}^{2} \left \vert x-y\right \vert ^{2}. $$
(7.15)

Further, using (7.12), we see that

$$\begin{aligned} \|m_{A}^{\lambda}(\,\cdot \,;\nu )\|_{\mathcal{H}}&=\|\beta ^{\nu}\|_{ \ell _{2}(I)}\le \|(B^{\nu}+\lambda I)^{-1}\|_{\ell _{2}(I)}\|\gamma ^{ \nu}\|_{\ell _{2}(I)}\le \lambda ^{-1}D_{k}\|A\|_{\mathcal{C}^{1}}. \end{aligned}$$

Combining this with (7.15) and substituting into (7.14), we get

$$ I_{2} \le \sqrt {d} \, \lambda ^{-1}D_{k}^{2} \|A\|_{\mathcal{C}^{1}}|x-y|. $$

This together with (7.13) and (7.4) finally yields

$$ |m_{A}^{\lambda}(x;\mu )-m_{A}^{\lambda}(y;\nu )| \le C_{1} \mathbb{W}_{1}(\mu ,\nu ) +C_{2} |x-y|, $$

where \(C_{1}=(\lambda ^{-1}D_{k}+1)\lambda ^{-1}D_{k}^{2}d\|A\|_{ \mathcal{C}^{1}}\) and \(C_{2}=\sqrt {d} \,\lambda ^{-1}D_{k}^{2} \|A\|_{\mathcal{C}^{1}}\). This completes the proof. □

Now we are ready to prove the main results of Sect. 2. They follow from Theorem 2.4 obtained above.

Proof of Theorem 2.2

It follows from Theorem 2.4, the assumptions of Theorem 2.2 and the fact that the \(\mathbb{W}_{1}\)-metric can be bounded from above by the \(\mathbb{W}_{2}\)-metric that the drift and diffusion of the system (2.4)–(2.6) are Lipschitz and satisfy the conditions of Carmona and Delarue [6, Theorem 4.21]. Hence it has a unique strong solution. □

Proof of Theorem 2.3

We see that Theorem 2.4 and the conditions of Theorem 2.3 imply that all the assumptions of Carmona and Delarue [7, Theorem 2.12] hold (note that the total state dimension is \(2d\) in our case). This implies (2.10). □

Proof of Theorem 3.5

Consider the operator \(\mathcal{C}^{\nu}\) in the orthonormal basis \((\widetilde{a}_{n})_{n\in J}\) of ℋ. Put

$$\begin{aligned} D^{\nu}:=( \langle \widetilde{a}_{i},\mathcal{C}^{\nu}\widetilde{a}_{j} \rangle _{\mathcal{H}} ) _{(i,j)\in J\times J}=( \langle \widetilde{a}_{i},T^{ \nu}\widetilde{a}_{j}\rangle _{\mathcal{H}} ) _{(i,j)\in J\times J}=( \sigma _{j} \delta _{ij})_{(i,j)\in J\times J}, \end{aligned}$$

since \(\widetilde{a}_{j}\) is an eigenvector of \(T^{\nu}\) with eigenvalue \(\sigma _{j}\). Since \(\mathcal{C}^{\nu}\) is diagonal in this basis, we see that for \(\lambda >0\), one has for \(i \in J\) that

$$ (\mathcal{C}^{\nu}+\lambda I_{\mathcal{H}})^{-1} \widetilde{a}_{i}=( \sigma _{i}+\lambda )^{-1}\widetilde{a}_{i}. $$
(7.16)

Consider also the function \(c^{\nu}_{A}\) in this basis. We write for \(i\in J\) similarly to (7.1)

$$ \eta ^{\nu}_{i}:=\langle c^{\nu}_{A},\widetilde{a}_{i}\rangle _{ \mathcal{H}}=\int _{\mathcal{X}\times \mathcal{X}} \widetilde{a}_{i}(x)A(y) \nu (dx,dy),\qquad i\in J, $$

and we clearly have \(c^{\nu}_{A}=\sum _{i\in J}\eta _{i}^{\nu }\widetilde{a}_{i}\). Then, using Proposition 3.3 and (7.16), we derive for \(\lambda >0\) that

$$\begin{aligned} m^{\lambda}_{A}(\,\cdot \,;\nu )&=(\mathcal{C}^{\nu}+\lambda I_{ \mathcal{H}})^{-1}c^{\nu}_{A} =\sum _{i\in J}\eta _{i}^{\nu }( \mathcal{C}^{\nu}+\lambda I_{\mathcal{H}})^{-1}\widetilde{a}_{i} \\ &=\sum _{i\in J}\eta _{i}^{\nu }(\sigma _{i}+\lambda )^{-1} \widetilde{a}_{i}. \end{aligned}$$
(7.17)

Next, since \(m_{A}\in \mathcal{L}_{2}^{\nu}\), we have

$$ P_{\overline{\mathcal{H}}} m_{A} =\sum _{i\in J} \langle \mathbb{E}_{(X,Y) \sim \nu} [ A(Y)|X=\,\cdot \, ] ,a_{i} \rangle _{\mathcal{L}_{2}^{\nu}}a_{i}. $$
(7.18)

Further, for \(i\in J\), we deduce that

$$\begin{aligned} \langle \mathbb{E}_{(X,Y)\sim \nu}[A(Y)|X=\,\cdot \,] ,a_{i} \rangle _{ \mathcal{L}_{2}^{\nu}}&= \int _{\mathcal{X}}\mathbb{E}_{(X,Y)\sim \nu} \left [ A(Y)|X=x\right ] a_{i}(x)\nu (dx,\mathcal{X}) \\ &=\mathbb{E}_{(X,Y)\sim \nu} \big[a_{i}(X)\mathbb{E}[A(Y)|X]\big] \\ &=\mathbb{E}_{(X,Y)\sim \nu} [a_{i}(X) A(Y)] \\ &=\sigma _{i}^{-1/2}\eta _{i}^{\nu}, \end{aligned}$$

where we used that \(\widetilde{a}_{n}=\sqrt{\sigma _{n}}\,a_{n}\). Substituting this into (7.18) and combining with (7.17), we get

$$ P_{\overline{\mathcal{H}}} m_{A} -m_{A}^{\lambda}=\sum _{i\in J} \big(\eta _{i}^{\nu}\sigma _{i}^{-1}- \eta _{i}^{\nu}(\sigma _{i}+ \lambda )^{-1}\big)\widetilde{a}_{i}= \sum _{i\in J} \eta _{i}^{\nu} \frac{\lambda}{\sigma _{i}(\sigma _{i}+\lambda )}\widetilde{a}_{i}. $$

Thus the orthonormality of the \(\widetilde{a}_{i}\) gives

$$ \|P_{\overline{\mathcal{H}}} m_{A} -m_{A}^{\lambda} \|_{\mathcal{L}_{2}^{ \nu}}^{2}= \sum _{i\in J} (\eta _{i}^{\nu})^{2} \frac{\lambda ^{2}}{\sigma _{i}(\sigma _{i}+\lambda )^{2}}= \sum _{i \in J} \langle m_{A} ,a_{i}\rangle _{\mathcal{L}_{2}^{\nu}}^{2} \frac{\lambda ^{2}}{(\sigma _{i}+\lambda )^{2}}, $$

which is (3.10). Similarly, recalling (3.8), we get

$$ \|P_{\overline{\mathcal{H}}} m_{A} -m_{A}^{\lambda} \|_{\mathcal{H}}^{2}= \sum _{i\in J} (\eta _{i}^{\nu})^{2} \frac{\lambda ^{2}}{\sigma _{i}^{2}(\sigma _{i}+\lambda )^{2}}= \sum _{i \in J} \langle m_{A} ,a_{i}\rangle _{\mathcal{L}_{2}^{\nu}}^{2} \frac{\lambda ^{2}}{\sigma _{i}(\sigma _{i}+\lambda )^{2}}, $$

which is finite whenever \(P_{\overline{\mathcal{H}}} m_{A} \in \mathcal{H}\), that is, \(\sum _{i\in J} \langle m_{A} ,a_{i}\rangle _{\mathcal{L}_{2}^{\nu}}^{2} \sigma _{i}^{-1}<\infty \). This shows (3.11). It is easily seen by dominated convergence that the left-hand side of (3.10) goes to zero, and if \(P_{\overline{\mathcal{H}}} m_{A} \in \mathcal{H}\), the left-hand side of (3.11) goes to zero as well. □