1 Introduction

In practice, local volatility models (LVMs), stochastic volatility models (SVMs), and hybrid models (LSVMs) that combine them are widely used to model asset prices. However, exact analytical solutions for the prices of European options written on such underlying assets often do not exist. Meanwhile, in practice, the financial instruments that have sufficient liquidity to be used for model calibration are mostly limited to European options. Therefore, the feasibility of an efficient calculation of European option prices is a bottleneck in the modeling of underlying assets. Among the important studies, we refer to Rubinstein (1983), Dupire (1994), and Marris (1999) for LVMs and Hull and White (1987), Heston (1993), and Schöbel and Zhu (1999) for SVMs.

In LSVMs, a stock price S and its volatility v are assumed to follow the stochastic differential equation (SDE):

$$\begin{aligned} \left\{ \begin{array}{rcl} \displaystyle \frac{\textrm{d}S_t}{S_t} &=& (r(t) - d(t)) \textrm{d}t + \sigma (S_t, v_t) \textrm{d}W^S_t, \\ \textrm{d}v_t &=& \left( \theta (t) - \kappa (t) v_t \right) \textrm{d}t + \gamma (v_t) \textrm{d}W^v_t \end{array} \right. \end{aligned}$$
(1.1)

where the short rate r(t), the dividend rate d(t), \(\theta (t)\), and \(\kappa (t)\) are deterministic functions of time t, \(\sigma (s,v)\) is a deterministic function of the asset price and volatility, \(\gamma (v)\) is a deterministic function of volatility, and \(W^{S}\) and \(W^{v}\) are standard Brownian motions under the risk-neutral probability measure with \(\textrm{d}W^{S}_t \textrm{d}W^{v}_t = \rho \textrm{d}t\). It is assumed throughout this study that \(\sigma (s,v)\) and \(\gamma (v)\) are infinitely differentiable with respect to (s, v) and v, respectively. See, for example, Funahashi (2014) for LSVMs.

Many previous studies have addressed the pricing of financial derivatives that lack analytic solutions. For example, methods that numerically solve the stochastic differential equations using the finite difference method (FDM) or Monte Carlo (MC) simulation have been widely proposed. See, for example, Duffy (2006) for FDM and Brigo and Mercurio (2006) and Valentina (2023) for MC. However, calibrating the model using these methods becomes computationally intensive, as it involves solving an optimization problem through techniques such as Newton–Raphson, simplex, and Levenberg–Marquardt. These techniques iteratively update the model parameters until a stopping criterion is met. Practically, frequent re-calibration is required for trading desks and risk management purposes. Hence, the efficient calculation of European option prices remains a significant bottleneck in the modeling of underlying assets.

Many approximation methods have been proposed to fill this gap. In this study, we refer to some approximation approaches that have high generality. Fouque et al. (2003) used the singular perturbation method for stochastic volatility models, asymptotically expanding the partial differential equation around the invariant distribution of a stochastic volatility model to calculate an approximate solution for the option price. Hagan et al. (2002) derived the pricing formula for European call and put options when the underlying asset follows the SABR model using the singular perturbation method. Methods for approximating the transition probability and likelihood function of diffusion processes have also been proposed in Aït-Sahalia (2002, 2008). The author approximated the transition probability by expanding it in Hermite polynomials and calculated an approximate solution of the log-likelihood function when the underlying asset follows a multidimensional process. The asymptotic expansion method is also an active area of research. This approach is based on the idea of expanding the solution of a stochastic differential equation (SDE) into a power series of a small parameter, such as volatility. By truncating the series at a certain order, one can obtain an approximate solution for derivative prices or their Greeks. Takahashi (1999) applied this method to price European and average call options under general Markovian processes for underlying asset prices. Funahashi (2014) applied the Wiener–Itô chaos expansion to derive approximate closed-form formulas for vanilla option prices under LVMs and SVMs.

The use of artificial intelligence in the financial industry has made remarkable progress in the past 10 years. Particularly, in the field of financial engineering and quantitative finance, deep learning (DL) with artificial neural networks (ANNs) has been used to solve hedging and derivative pricing problems. Efforts to use DL and ANNs as alternatives to numerical computation and approximate solutions are being actively researched. Neural networks have the potential to handle derivatives with complex payoffs that cannot be solved analytically by traditional mathematical models because of their high ability to approximate nonlinear functions. Moreover, they can efficiently price many derivatives because of their rapid computation and parallel processing. An excellent, comprehensive review of the literature can be found in Ruf and Wang (2020).

Neural networks comprise two processes: training, in which the parameters of the neural network model are fitted to the training data, and prediction, in which values are estimated using the learned results. The former requires generating a large amount of training data and adjusting the hidden-layer parameters of the neural network through supervised learning, which is a time-consuming process; the latter, however, can estimate values rapidly. One advantage of using neural networks for pricing derivatives is that the time-consuming learning process can be performed offline, while online prediction is used in daily trading to swiftly calculate prices.

Hernandez (2017) applied neural networks to the calibration procedure that fixes the pricing-model parameters to highly liquid financial instruments. The author trained a feedforward neural network to directly return the calibrated parameters of a derivative pricing model. More specifically, after all of the training sets were generated using an option pricing formula, he swapped the roles of the option prices and model parameters so that the network directly returns the calibrated parameters of the pricing model. He showed that this inverse mapping method performs the calibration task directly with neural networks. An important advantage of this method is that it eliminates the need for iterative calculations by pre-learning the relationship between model parameters and model prices with a neural network.

Itkin (2015) suggested some limitations of the inverse mapping method proposed by Hernandez (2017), such as the lack of control over the inversion function. Therefore, the two-step process has recently become the dominant method: First, a feedforward neural network is trained offline with simulated data to estimate the value function for a given asset pricing model. Subsequently, model parameters are calibrated online with a traditional optimization method. In this process, the price calculation formula depends on the rapid predictions made by the neural network. Similar methods have been adopted in various studies, such as Liu et al. (2019a, b) and Horvath et al. (2021), and the references they cite. These studies examined different pricing models, such as the Black-Scholes, Heston (1993), and Bates (1996) models, and the rough Bergomi model. The results showed that ANNs can greatly reduce computation time, thus reducing the importance of calibration speed in model selection.

However, these methods incur a high computational cost for the offline training. This is because directly training an ANN model for option pricing requires a large number of numerical simulations to produce training and testing data. To achieve a high level of accuracy in derivative pricing, neural networks usually need between 100,000 and 1,000,000 training data points. The actual number of data points depends on factors such as the contract's expiry date, the volatility of the underlying asset, and the volatility of volatility. Moreover, financial firms handle thousands of products, each associated with several pricing models. To generate these training data points offline, they must run numerical simulations such as MC and PDE methods for each product and model combination. This task is remarkably computationally intensive, even if it is performed only once or a few times a year.

Efficiently training neural networks is an important area of research in the field of derivative pricing. McGhee (2018) proposed an accurate integration scheme for the SABR model (instead of the two-factor finite difference scheme, which is more accurate but time-consuming) and ran it 300,000 times to generate data sets for training and testing ANN models. The author showed that an ANN can construct highly efficient representations of both the integration scheme and the two-factor finite difference scheme. Funahashi (2021a) combined the advantages of asymptotic expansion (AE) and neural networks (ANNs) by training an ANN to learn the residual term between the option price C and its asymptotic approximation \({\bar{C}}\). This improved the stability and approximation accuracy of previous methods because (i) the option price, C, can start from an approximated price \({\bar{C}}\) that is adjacent to the original value, and the variance of the training data is reduced significantly, (ii) the residual term is a smooth and infinitely differentiable function, and (iii) the derivative of the residual term with respect to volatility is no longer bell-shaped, so exploding gradients are less likely to occur. See also Buccioni (2023) for a detailed discussion of this approach. Funahashi (2021a) showed empirically that this method can safely reduce the training set size to roughly one-hundredth to one-thousandth of that required by standard ANN training, with fewer layers and nodes, making the ANN training and prediction more robust. This method lowers the computational cost of the computationally expensive offline procedure and simultaneously increases the stability and accuracy of the online prediction of derivative prices. Funahashi (2023) applied the same method to price options under the SABR model.
The author trained an ANN to learn the difference between the implied volatility values obtained by numerical computation and those obtained by Hagan’s approximation formula. This enables one to calculate the implied volatility of deep-in-the-money and deep-out-of-the-money options more accurately and efficiently than conventional approximation methods. However, this method is not flawless either. Approximation methods are common in option pricing and many useful ones exist in the literature. Nonetheless, as the model and product become more realistic and complex, approximation formulas often become either unavailable or cumbersome and challenging to compute.

A similar approach was proposed by Kienitz et al. (2020). The authors used the difference between a target option price C on an original underlying asset process S and the option price \({\bar{C}}\) on a completely different model \({\bar{S}}\) that admits a tractable solution for \({\bar{C}}\). The authors regard this method as a control variate (CV) for neural networks; similar to its application in MC simulations, they used a completely different model for the approximate price to improve the quality of deep learning applied to option pricing problems. The ANN with quasi-process correction based on a different model has lower accuracy and slower convergence than the asymptotic correction based on the same model, as in Funahashi (2021a). However, it is more widely applicable because it does not require a complex approximation formula. This makes it easier to implement and use; hence, it can be applied to a wider range of options. Notably, as the approximation order increases, the asymptotic approximation becomes tedious and messy, and the number of expansion terms increases exponentially. However, Kienitz et al. (2020) do not reveal how to decide on a suitable model and the corresponding parameters; hence, their method is not yet suitable for practical applications. Accordingly, one of the aims of this study is to establish a unified approach to determining suitable parameters and appropriate models for the ANN with quasi-process correction. As will be demonstrated shortly, if one chooses a model with a distribution different from the original one, convergence will be slow, especially in deep-in-the-money and deep-out-of-the-money cases, resulting in worse predictions than direct ANN learning.

This paper is organized as follows: The next section introduces three previous studies that form the basis of this study. First, we provide an overview of the asymptotic approximation of derivative prices. Second, we introduce Funahashi's (2021a) method for improving the stability of neural networks by learning the difference between the price of derivatives and its asymptotic approximation. Third, we explain the quasi-process correction for neural networks. In Sect. 3, we propose and establish a new unified approach for determining suitable parameters and appropriate models for the ANN with quasi-process correction. Section 4 is devoted to numerical examples. By comparing the methods of Funahashi (2021a) and Kienitz et al. (2020) using European and Barrier options, we show that the former method has significantly higher accuracy for learning and prediction than the latter. However, as will be observed, the latter method does not require a complex approximation, and if one appropriately selects the base approximation model and sets the correct parameters of the selected model, it requires only one-tenth of the training data needed to directly learn the price of derivatives, even if a relatively simple model is used. The latter method proves particularly advantageous in cases where efficient approximation prices for derivatives are unavailable. In Sect. 5, we examine the circumstances in which our proposed neural network effectively learns derivative prices in the context of a complex local stochastic volatility model, using the SABR model as the base quasi-process. Finally, Sect. 6 concludes this paper.

2 Background

Before proposing our deep learning method for derivative pricing, this section summarizes the results obtained in previous studies to provide a foundation for the analysis given in the next section. To enhance intuitive understanding, in this section, we begin with a relatively simple model and progressively extend it to LSVM (1.1).

The price of the underlying asset \(S_t(\omega ) = S_t\) for \({0 \le t \le T}\) is assumed to follow the SDE

$$\begin{aligned} \frac{\textrm{d}S_t}{S_t} = \left( r(t) - d(t) \right) \textrm{d}t + \sigma (S_t, t) \textrm{d}W^S_t \end{aligned}$$
(2.1)

where \(\{W^S_t\}_{t \ge 0}\) is a standard Brownian motion under the risk-neutral measure. This model is called a local volatility model and is a special case of (1.1).

Suppose that the SDE (2.1) admits a solution. Denoting \(\Vert g \Vert _{t}^2 = \int _{0}^{t} g^2(u)\textrm{d}u\) and \(J_t(g)=\int _{0}^{t} g(u)\textrm{d}W^S_u\), we apply Itô's formula to obtain

$$\begin{aligned} S_t = F(0,t) \exp \left[ J_t(\sigma ) - \frac{1}{2} \Vert \sigma \Vert ^2_t \right] \end{aligned}$$
(2.2)

where \(F(0,t)= S_0 \textrm{e}^{\int _0^t (r(s)-d(s)) \textrm{d}s}\) is the forward price.
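To make the representation (2.2) concrete, the following sketch checks it by simulation in the constant-coefficient case \(\sigma (S_t,t) = \sigma \), where \(J_t(\sigma ) = \sigma W_t\) and \(\Vert \sigma \Vert ^2_t = \sigma ^2 t\); the parameter values are hypothetical.

```python
import numpy as np

# Monte Carlo sanity check of (2.2) with sigma(S_t, t) = sigma constant:
# then S_T = F(0,T) * exp(sigma*W_T - 0.5*sigma^2*T), and E[S_T] = F(0,T)
# because the stochastic exponential is a martingale.
rng = np.random.default_rng(0)
S0, r, d, sigma, T = 100.0, 0.02, 0.01, 0.2, 1.0
F0T = S0 * np.exp((r - d) * T)                        # forward price F(0,T)

W_T = np.sqrt(T) * rng.standard_normal(400_000)       # terminal Brownian values
S_T = F0T * np.exp(sigma * W_T - 0.5 * sigma**2 * T)  # equation (2.2)

rel_err = abs(S_T.mean() / F0T - 1.0)
print(F0T, S_T.mean(), rel_err)
```

The sample mean of \(S_T\) matches the forward price up to Monte Carlo error, consistent with the exponential-martingale form of (2.2).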

2.1 Wiener Itô Chaos expansion

We assume the following condition, which can be regarded as a stochastic version of the Picard iteration:

Assumption 2.1

Let \(S^{(0)}_t = F(0,t)\), where \(F(0,t)= S_0 \textrm{e}^{\int _0^t (r(s)-d(s)) \textrm{d}s}\), and \(S_t^{(m)}\) is defined successively by

$$\begin{aligned} S_t^{(m+1)} = F(0,t) \exp \left[ J_t(\sigma ^{(m)}) - \frac{1}{2} \Vert \sigma ^{(m)} \Vert ^2_t \right] , \end{aligned}$$
(2.3)

where \(\sigma ^{(m)}(t) = \sigma (S_t^{(m)}, t)\). It is assumed throughout the rest of the study that \(S_t^{(m)}(\omega )\) converges to \(S_t(\omega )\) as \(m \rightarrow \infty \) for P-a.s. \(\omega \in \Omega \).

Although Funahashi and Kijima (2015) gave a sufficient condition for the convergence in Assumption 2.1, the condition is often too strong for practical use. Hence, we only assume the successive substitution (2.3) in the following development.
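The successive substitution (2.3) can be illustrated numerically along a single discretized Brownian path; the displaced-diffusion-type local volatility and all parameter values below are hypothetical choices.

```python
import numpy as np

# Stochastic Picard iteration (2.3) along one fixed discretized Brownian path,
# with a displaced-diffusion local volatility sigma(s) = eps*(beta*s+(1-beta)*S0)/s
# and r = d = 0, so that F(0,t) = S0.  All parameter values are hypothetical.
rng = np.random.default_rng(1)
S0, eps, beta, T, n = 1.0, 0.3, 0.5, 1.0, 2_000
dt = T / n
dW = np.sqrt(dt) * rng.standard_normal(n)

def sigma(s):
    return eps * (beta * s + (1.0 - beta) * S0) / s

S = np.full(n + 1, S0)                        # S^(0)_t = F(0,t) = S0
gaps = []
for m in range(12):
    vol = sigma(S[:-1])                       # sigma^(m)(t_i), left endpoint
    J = np.concatenate(([0.0], np.cumsum(vol * dW)))       # J_t(sigma^(m))
    Q = np.concatenate(([0.0], np.cumsum(vol**2 * dt)))    # ||sigma^(m)||_t^2
    S_next = S0 * np.exp(J - 0.5 * Q)         # S^(m+1) from (2.3)
    gaps.append(np.max(np.abs(S_next - S)))
    S = S_next

print(gaps)   # sup-norm gap between successive iterates
```

On this path the gap between successive iterates shrinks rapidly, illustrating the pathwise convergence assumed above.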

Under this assumption, the third-order chaos expansion approximation of the process is as follows:

$$\begin{aligned} X_t : =\frac{S_t}{F(0,t)} - 1 = a_1(t) + a_2(t) + a_3(t) + R_4 , \end{aligned}$$
(2.4)

where \(R_n\) represents the contributions of multiple stochastic integrals of order n or higher, and \(a_1\), \(a_2\), and \(a_3\) are the first-, second-, and third-order chaos expansion terms, respectively:

$$\begin{aligned} a_1(t)&= \int \limits _{0}^{t} p_{1}(s) \textrm{d}W^S_{s}, \quad a_2(t) = \int \limits _{0}^{t} p_{2}(s) \left( \int \limits _{0}^{s} \sigma _{0}(u) \textrm{d}W^S_{u} \right) \textrm{d}W^S_{s}, \\ a_3(t)&= \int \limits _{0}^{t} p_{3}(s) \left( \int \limits _{0}^{s} \sigma _{0}(u) \left( \int \limits _{0}^{u} \sigma _{0}(r) \textrm{d}W^S_{r} \right) \textrm{d}W^S_{u} \right) \textrm{d}W^S_{s} \\ &\quad + \int \limits _{0}^{t} p_{4}(s) \left( \int \limits _{0}^{s} p_{5}(u) \left( \int \limits _{0}^{u} \sigma _{0}(r) \textrm{d}W^S_{r} \right) \textrm{d}W^S_{u} \right) \textrm{d}W^S_{s}. \end{aligned}$$

Notably, \(a_1(t)\) follows a normal distribution with zero mean and variance \(\Sigma _t = \int _0^t p_1^2(s) \textrm{d}s\). The \(p_k(t)\) are all deterministic functions:

$$\begin{aligned} p_{1}(s)&:= \sigma _{0}(s) + F(0,s) \sigma '_{0}(s) \left( \int \limits _{0}^{s} \sigma ^{2}_{0}(u) \textrm{d}u \right) + \frac{1}{2} F^2(0,s) \sigma ''_{0}(s) \left( \int \limits _{0}^{s} \sigma ^{2}_{0}(u) \textrm{d}u \right) ,\\ p_{2}(s)&:= \sigma _{0}(s) + F(0,s) \sigma '_{0}(s), \\ p_{3}(s)&:= \sigma _{0}(s) + 3 F(0,s) \sigma '_{0}(s) + F^2(0,s) \sigma ''_{0}(s),\\ p_{4}(s)&:= \sigma _{0}(s) + F(0, s) \sigma '_{0}(s), \\ p_{5}(s)&:= F(0, s) \sigma '_{0}(s), \end{aligned}$$

with \(\sigma _0(t) = \sigma (F(0,t), t)\), \(\sigma '_0(t) = \partial _x \sigma (x, t)|_{x=F(0,t)}\), \(\sigma ''_0(t) = \partial _{xx} \sigma (x, t)|_{x=F(0,t)}\).

We can justify this approximation through an analysis based on a small-volatility expansion. Let us denote \(f_0(t) = \sigma _0(t)\), \(f_i(t) = p_i(t)\) for \(i = 1, \ldots , 5\), and \(\bar{f}(t) = \max _k f_k(t) \in L_2([0,t])\) for all t. We then have

$$\begin{aligned} \mathbb {E}[a_n^2] \le \Vert \bar{f}\Vert _t^{2n} /n! \end{aligned}$$
(2.5)

Therefore, if \(\Vert \bar{f}\Vert _t\) is sufficiently small, the sum of iterated integrals beyond the nth order can be approximated as zero. More intuitively, to emphasize that the volatility is small, we rewrite \(\sigma _0 \rightarrow \epsilon \sigma _0\) and \(p_i(t) \rightarrow \epsilon p_i(t)\) and obtain

$$\begin{aligned} a_1(t) \rightarrow \epsilon a_1(t), \ \ a_2(t) \rightarrow \epsilon ^2 a_2(t), \ \ a_3(t) \rightarrow \epsilon ^3 a_3(t), \ \ R_4 \rightarrow \epsilon ^4 R_4 . \end{aligned}$$
(2.6)

We now insert these results into (2.4) to get

$$\begin{aligned} X_t: =\frac{S_t}{F(0,t)} - 1 \approx \epsilon a_1(t) + \epsilon ^2 a_2(t) + \epsilon ^3 a_3(t) + O(\epsilon ^4). \end{aligned}$$
(2.7)

Moreover, we note that because the right-hand side of (2.5) is divided by the factorial of n, it accelerates the convergence \(\mathbb {E}[a_n^2] \rightarrow 0 \ (n \rightarrow \infty )\) and improves the approximation accuracy of (2.4). In this study, we omit terms involving iterated integrals higher than the third order; that is, we set \(R_n \approx 0\) for \(n \ge 4\).
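The bound (2.5) can be checked by simulation in the simplest non-trivial case; the constant-coefficient setting below is a hypothetical test case in which the bound for n = 2 is attained exactly.

```python
import numpy as np

# Monte Carlo check of the moment bound (2.5) for n = 2 with constant
# coefficients p_2(s) = sigma_0(s) = c (a hypothetical test case).  Then
# a_2(t) = c^2 * int_0^t W_s dW_s and E[a_2^2] = c^4 t^2 / 2, which equals
# the bound ||f_bar||_t^{2n}/n! with f_bar = c and n = 2.
rng = np.random.default_rng(2)
c, T, n_steps, n_paths = 0.3, 1.0, 400, 100_000
dt = T / n_steps

W = np.zeros(n_paths)        # Brownian paths
I = np.zeros(n_paths)        # Ito sums approximating int_0^t W_s dW_s
for _ in range(n_steps):
    dW = np.sqrt(dt) * rng.standard_normal(n_paths)
    I += W * dW              # left-endpoint (Ito) evaluation
    W += dW

a2 = c**2 * I
bound = (c**2 * T) ** 2 / 2.0       # ||f_bar||_t^4 / 2!
print(np.mean(a2**2), bound)
```

The simulated second moment matches the theoretical value \(c^4 t^2/2\) up to Monte Carlo and discretization error.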

Let \(\Psi ^S(\xi )\) be the characteristic function of \(X_t\). Substituting (2.4), we obtain

$$\begin{aligned} \Psi ^S(\xi ) &= E[\textrm{e}^{i \xi X_t}] \approx E[\textrm{e}^{i \xi (a_1(t) + a_2(t) + a_3(t))}] \nonumber \\ &= E \left[ \textrm{e}^{i \xi a_1(t)} \left\{ 1 + i \xi a_2(t) + i \xi a_3(t) - \frac{1}{2} \xi ^2 a^2_2(t) + R_4 \right\} \right] \end{aligned}$$
(2.8)

Following the same approximation strategy, the remainder term can be ignored because, by the Cauchy–Schwarz inequality,

$$\begin{aligned} E[\textrm{e}^{i \xi a_1(t)} R_4] \le E[|\textrm{e}^{i \xi a_1(t)}|^2]^{\frac{1}{2}} E[|R_4|^2]^{\frac{1}{2}} = E[|R_4|^2]^{\frac{1}{2}} \approx 0. \end{aligned}$$

Hence, \(\Psi ^S(\xi )\) reduces to

$$\begin{aligned} \Psi ^S(\xi ) &\approx E \left[ \textrm{e}^{i \xi a_1(t)} \right] + i \xi E \left[ \textrm{e}^{i \xi a_1(t)} E[a_2(t) | a_1(t) ] \right] \nonumber \\ &\quad + i \xi E \left[ \textrm{e}^{i \xi a_1(t)} E[a_3(t) | a_1(t) ] \right] - \frac{1}{2} \xi ^2 E \left[ \textrm{e}^{i \xi a_1(t)} E[a^2_2(t) | a_1(t)] \right] . \end{aligned}$$
(2.9)

Now, using formulas provided in Appendix D of Funahashi and Kijima (2015), which are one-dimensional (1D) versions of Lemma 2.1 in Takahashi (1999), we can explicitly compute the conditional expectations as follows:

$$\begin{aligned} \mathbb {E}[ a_{2}(t) | a_{1}(t) = x ] = q^S_{1}(t) \left( \frac{x^{2}}{(\Sigma ^S_{t})^{2}}- \frac{1}{\Sigma ^S_t} \right) , \end{aligned}$$
(2.10)
$$\begin{aligned} \mathbb {E}[ a_{3}(t) | a_{1}(t) = x ] = q^S_{2}(t) \left( \frac{x^{3}}{(\Sigma ^S_{t})^{3}}- \frac{3x}{(\Sigma ^S_{t})^{2}} \right) , \end{aligned}$$
(2.11)
$$\begin{aligned} \mathbb {E}[ a^2_{2}(t) | a_{1}(t) = x ] = q^S_3(t) \left( \frac{x^{4}}{(\Sigma ^S_{t})^{4}} - \frac{6x^{2}}{(\Sigma ^S_{t})^{3}} + \frac{3}{(\Sigma ^S_{t})^{2}} \right) + q^S_{4}(t) \left( \frac{x^{2}}{(\Sigma ^S_{t})^{2}}- \frac{1}{\Sigma ^S_{t}} \right) + q^S_{5}(t), \end{aligned}$$
(2.12)

where the exact formulas of the deterministic functions \(\Sigma ^S_t\) and \(q^S_i(t)\) are given as

$$\begin{aligned} \Sigma ^S_t&= \int \limits _{0}^{t} p_1^2(s) \textrm{d}s, \\ q^S_{1}(t)&= \int \limits _{0}^{t} p_{1}(s) p_{2}(s) \left( \int \limits _{0}^{s} \sigma _{0}(u) p_{1}(u) \textrm{d}u \right) \textrm{d}s, \\ q^S_{2}(t)&= \int \limits _{0}^{t} p_{1}(s) p_{3}(s) \left( \int \limits _{0}^{s} \sigma _{0}(u) p_{1}(u) \left( \int \limits _{0}^{u} \sigma _{0}(r) p_{1}(r) \textrm{d}r \right) \textrm{d}u \right) \textrm{d}s \\ &\quad +\int \limits _{0}^{t} p_{1}(s) p_{4}(s) \left( \int \limits _{0}^{s} p_{1}(u) p_{5}(u) \left( \int \limits _{0}^{u} \sigma _{0}(r) p_{1}(r) \textrm{d}r \right) \textrm{d}u \right) \textrm{d}s, \\ q^S_3(t)&= \left( q^S_1(t) \right) ^2,\\ q^S_{4}(t)&= 2 \int \limits _{0}^{t} p_{1}(s) p_{2}(s) \left( \int \limits _{0}^{s} p_{1}(u) p_{2}(u) \left( \int \limits _{0}^{u} \sigma ^2_{0}(r) \textrm{d}r \right) \textrm{d}u \right) \textrm{d}s \\ &\quad + 2 \int \limits _{0}^{t} p_{1}(s) p_{2}(s) \left( \int \limits _{0}^{s} \sigma _{0}(u) p_{2}(u) \left( \int \limits _{0}^{u} \sigma _{0}(r) p_{1}(r) \textrm{d}r \right) \textrm{d}u \right) \textrm{d}s \\ &\quad + \int \limits _{0}^{t} p^2_{2}(s) \left( \int \limits _{0}^{s} \sigma _{0}(u) p_{1}(u) \textrm{d}u \right) ^{2} \textrm{d}s, \\ q^S_{5}(t)&= \int \limits _{0}^{t} p^2_{2}(s) \left( \int \limits _{0}^{s} \sigma ^2_{0}(u) \textrm{d}u \right) \textrm{d}s. \end{aligned}$$

Remark 2.1

As an example, when S(t) follows a displaced diffusion (DD) model

$$\begin{aligned} \textrm{d}S_t = (r(t) - d(t)) S_t \textrm{d}t + \bar{\epsilon } \left( \bar{\beta } S_t + (1 - \bar{\beta }) F(0,t) \right) \textrm{d}W^S_t, \end{aligned}$$
(2.13)

we have \(\sigma _0(s) = \bar{\epsilon }\), \(p_1(s) = \bar{\epsilon }\), \(p_2(s) = p_3(s) = p_4(s) = \bar{\beta } \bar{\epsilon }\), and \(p_5(s) = -(1 - \bar{\beta }) \bar{\epsilon }\). Hence, the six deterministic functions \(\Sigma ^S_t\) and \(q^S_i \ (i=1, \ldots , 5)\) become \(\Sigma ^S_t = \bar{\epsilon }^2 t\), \(q_1^S(t) = \frac{1}{2} \bar{\beta } \bar{\epsilon }^4 t^2\), \(q_2^{S}(t) = \frac{1}{6} \bar{\beta }^2 \bar{\epsilon }^6 t^3\), \(q_3^S(t) = \frac{1}{4} \bar{\beta }^2 \bar{\epsilon }^8 t^4\), \(q_4^S(t) = \bar{\beta }^2 \bar{\epsilon }^6 t^3\), and \(q_5^{S}(t) = \frac{1}{2} \bar{\beta }^2 \bar{\epsilon }^4 t^2\).
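The closed-form expressions in Remark 2.1 can be verified symbolically by substituting the displaced-diffusion coefficients into the iterated-integral definitions of \(\Sigma ^S_t\) and \(q^S_i(t)\); a sketch using sympy:

```python
import sympy as sp

# Symbolic check of Remark 2.1: with sigma_0 = eps, p_1 = eps,
# p_2 = p_3 = p_4 = beta*eps, and p_5 = -(1-beta)*eps, the deterministic
# functions Sigma_t and q_i reduce to the stated monomials in t.
t, s, u, r_ = sp.symbols('t s u r', positive=True)
eps, beta = sp.symbols('epsilon beta', positive=True)
sig0 = p1 = eps
p2 = p3 = p4 = beta * eps
p5 = -(1 - beta) * eps

Sigma = sp.integrate(p1**2, (s, 0, t))
q1 = sp.integrate(p1*p2*sp.integrate(sig0*p1, (u, 0, s)), (s, 0, t))
inner = sp.integrate(sig0*p1, (r_, 0, u))          # innermost integral
q2 = (sp.integrate(p1*p3*sp.integrate(sig0*p1*inner, (u, 0, s)), (s, 0, t))
      + sp.integrate(p1*p4*sp.integrate(p1*p5*inner, (u, 0, s)), (s, 0, t)))
q3 = q1**2
q4 = (2*sp.integrate(p1*p2*sp.integrate(p1*p2*sp.integrate(sig0**2, (r_, 0, u)),
                                        (u, 0, s)), (s, 0, t))
      + 2*sp.integrate(p1*p2*sp.integrate(sig0*p2*inner, (u, 0, s)), (s, 0, t))
      + sp.integrate(p2**2*sp.integrate(sig0*p1, (u, 0, s))**2, (s, 0, t)))
q5 = sp.integrate(p2**2*sp.integrate(sig0**2, (u, 0, s)), (s, 0, t))
print(Sigma, q1, q2, q3, q4, q5)
```

Each integral collapses to the monomial stated in the remark, which also provides a template for checking other local volatility specifications.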

Because \(a_1(t)\) follows a normal distribution with zero mean and variance \(\Sigma ^S_t\) and the conditional expectations in \(\Psi ^S(\xi )\) are polynomial functions, one can directly apply the Fourier inversion formula to obtain the probability density function \(f_{X_t}(x)\). More specifically, for any polynomial functions h(x) and g(x), we have

$$\begin{aligned} \frac{1}{2 \pi } \int \limits _{\mathcal {R}} \textrm{e}^{-iky} g(-ik) \mathbb {E}\big [h(Y) \textrm{e}^{ikY} \big ] \textrm{d}k = g\left( \frac{\partial }{\partial y}\right) h(y) n(y;0,\Sigma ) , \end{aligned}$$
(2.14)

where \(Y \sim N(0, \Sigma )\) and \(n(x;a,b)\) is the normal density function with mean a and variance b. Notably, the aforementioned formula is easily obtained by differentiating both sides of

$$\begin{aligned} \frac{1}{2 \pi } \int \limits _{\mathcal {R}} \textrm{e}^{-i k y} \mathbb {E}\big [h(Y) \textrm{e}^{ikY}\big ] \textrm{d}k = h(y) n(y;0,\Sigma ) \end{aligned}$$

with respect to y.

By applying (2.14) to each term in (2.9), the probability density function of \(X_t\) is approximated as

$$\begin{aligned} {\tilde{f}}_{X_t}(x) &= \frac{1}{2} n\left( x; 0, \Sigma _{t} \right) \bigg [ \frac{q_{3}(t)}{\Sigma _{t}^{3}} h_{6} \left( \frac{x}{\sqrt{\Sigma _{t}}} \right) + \frac{\left( 2 q_{2}(t) + q_{4}(t) \right) }{\Sigma _{t}^{2}} h_{4} \left( \frac{x}{\sqrt{\Sigma _{t}}} \right) \nonumber \\ &\quad + \frac{2 q_{1}(t)}{\left( \sqrt{\Sigma _{t}} \right) ^{3}} h_{3} \left( \frac{x}{\sqrt{\Sigma _{t}}} \right) + \frac{q_{5}(t)}{\Sigma _{t}} h_{2} \left( \frac{x}{\sqrt{\Sigma _{t}}} \right) + 2 \bigg ], \end{aligned}$$
(2.15)

where \(h_n(x)\) is the Hermite polynomial of order n:

$$\begin{aligned} h_{n}(x) = (-1)^{n} \textrm{e}^{x^2/2} \frac{\textrm{d}^{n}}{\textrm{d}x^{n}} \textrm{e}^{-x^2/2}, \quad n=1,2, \dots , \end{aligned}$$
(2.16)

with \(h_0(x)=1\).
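As a quick consistency check, the polynomials defined by (2.16) can also be generated by the classical three-term recurrence \(h_{n+1}(x) = x h_n(x) - n h_{n-1}(x)\):

```python
import sympy as sp

# Probabilists' Hermite polynomials: the Rodrigues-type formula (2.16)
# versus the three-term recurrence h_{n+1}(x) = x h_n(x) - n h_{n-1}(x).
x = sp.symbols('x')

def h_rodrigues(n):
    """h_n from the Rodrigues-type formula (2.16)."""
    expr = (-1)**n * sp.exp(x**2 / 2) * sp.diff(sp.exp(-x**2 / 2), x, n)
    return sp.expand(sp.simplify(expr))

h = [sp.Integer(1), x]                    # h_0(x) = 1, h_1(x) = x
for n in range(1, 6):
    h.append(sp.expand(x * h[n] - n * h[n - 1]))

print(h)   # h_0, ..., h_6
```

In particular, \(h_2(x)=x^2-1\), \(h_3(x)=x^3-3x\), \(h_4(x)=x^4-6x^2+3\), and \(h_6(x)=x^6-15x^4+45x^2-15\), which are the polynomials appearing in (2.15).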

Note that the value of the European call option with maturity T and strike K is given by

$$\begin{aligned} C^S(t) = \mathbb {E}\left[ \textrm{e}^{-\int \limits _{0}^{t} r(s) \textrm{d}s} \left( S_t - K \right) ^{+} \right] = S(0) \int \limits _{-\widetilde{K}}^{\infty } \left( x + \widetilde{K} \right) f_{X_t} (x) \textrm{d}x, \end{aligned}$$

where \(\widetilde{K}:=1-\frac{K}{F(0,t)}\). Hence, it follows that

$$\begin{aligned} C^S(t) \approx C^S_\textrm{App}(t) \end{aligned}$$

where the approximation holds by following our approximation strategy, and \(C^S_\textrm{App}(t) \) is given by

$$\begin{aligned} C^S_\textrm{App}(t) &= \frac{S_0 n(\widetilde{K};0, \Sigma _t)}{2} \bigg [ \frac{q_3(t)}{\Sigma _t^{2}} h_4 \left( \frac{\widetilde{K}}{\sqrt{\Sigma _t}} \right) + \frac{\left( q_4(t) + 2 q_2(t) \right) }{\Sigma _t} h_2 \left( \frac{\widetilde{K}}{\sqrt{\Sigma _t}} \right) \nonumber \\ &\quad -2 \frac{q_1(t)}{\sqrt{\Sigma _t}} h_1 \left( \frac{\widetilde{K}}{\sqrt{\Sigma _t}} \right) + q_5(t) + 2 \Sigma _t \bigg ] \nonumber \\ &\quad + S_0 \widetilde{K} \left( 1 - \Phi (-\widetilde{K} /\sqrt{\Sigma _t}) \right) , \end{aligned}$$
(2.17)

where \(\Phi (x)\) is the cumulative distribution function of the standard normal distribution.
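The formula (2.17) can be implemented in a few lines. The sketch below specializes it to the displaced-diffusion coefficients of Remark 2.1 with r = d = 0 and compares it with the exact DD price, which is available in closed form because \(\bar{\beta } S_t + (1-\bar{\beta }) S_0\) is lognormal; all parameter values are hypothetical.

```python
import numpy as np
from math import erf, exp, log, pi, sqrt

def Phi(x):                                   # standard normal CDF
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def c_app_dd(S0, K, eps, beta, T):
    """Third-order approximation (2.17) with the DD coefficients of
    Remark 2.1, assuming r = d = 0 so that F(0,T) = S0."""
    Sig = eps**2 * T
    q1 = 0.5 * beta * eps**4 * T**2
    q2 = beta**2 * eps**6 * T**3 / 6.0
    q3 = 0.25 * beta**2 * eps**8 * T**4
    q4 = beta**2 * eps**6 * T**3
    q5 = 0.5 * beta**2 * eps**4 * T**2
    Kt = 1.0 - K / S0                         # \tilde{K} with F(0,T) = S0
    z = Kt / sqrt(Sig)
    h1, h2, h4 = z, z*z - 1.0, z**4 - 6.0*z*z + 3.0
    n_Kt = exp(-Kt*Kt / (2.0*Sig)) / sqrt(2.0*pi*Sig)   # n(Kt; 0, Sigma_t)
    bracket = (q3/Sig**2*h4 + (q4 + 2.0*q2)/Sig*h2
               - 2.0*q1/sqrt(Sig)*h1 + q5 + 2.0*Sig)
    return 0.5*S0*n_Kt*bracket + S0*Kt*(1.0 - Phi(-z))

def c_exact_dd(S0, K, eps, beta, T):
    """Exact DD call price via the shifted lognormal representation:
    Y = beta*S + (1-beta)*S0 is lognormal with volatility beta*eps."""
    KY = beta*K + (1.0 - beta)*S0
    v = beta*eps*sqrt(T)
    d1 = (log(S0/KY) + 0.5*v*v) / v
    return (S0*Phi(d1) - KY*Phi(d1 - v)) / beta

strikes = np.linspace(0.6, 1.4, 9)
errs = [abs(c_app_dd(1.0, K, 0.3, 0.5, 1.0) - c_exact_dd(1.0, K, 0.3, 0.5, 1.0))
        for K in strikes]
print(max(errs))
```

For \(\bar{\beta } = 0\) the DD model is Gaussian, all \(q_i\) vanish, and (2.17) collapses to the exact Bachelier price; for \(\bar{\beta } > 0\) the correction terms capture the lognormal skew up to the truncation order.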

Remark 2.2

Thus far, we have derived the approximation formula for European options in the framework of the local volatility model. Notably, as demonstrated in Funahashi (2014), the formula (2.17) retains its identical form even when influenced by stochastic volatility, with \(\Sigma _t\) and \(p_i(t)\) for \(i = 1, \ldots , 5\) undergoing adjustments. The functions in the stochastic local volatility model are detailed in Appendix A.

Moreover, approximate closed-form solutions for a wide range of exotic derivatives, including Barrier, Asian, Basket, and VWAP options, can be obtained using the deterministic functions \(\Sigma ^S_t\) and \(q^S_i(t) \ (i=1, \ldots , 5)\). See Funahashi and Higuchi (2018) and Funahashi and Kijima (2014) for detailed discussions. In the following sections, we denote the price of these options by \(C^S\) and their approximate closed-form formula by \(C^S_\textrm{App}\).

2.2 ANN with asymptotic correction

Let \(\pmb {\xi }^\textrm{M} = \{ \xi ^\textrm{M}_1, \xi ^\textrm{M}_2, \ldots \}\) denote the input data observed in the market or determined in the contract, including the interest rate r, strike K, spot asset price \(S_0\), and maturity T: \(\pmb {\xi }^\textrm{M} = \{ r, K, S_0, T, \ldots \}\). \(\pmb {\xi }^\textrm{P} = \{ \xi ^\textrm{P}_1, \xi ^\textrm{P}_2, \ldots \}\) denotes the model parameters; for example, the DD model in (2.13) has two model parameters, \(\pmb {\xi }^\textrm{P}= \{ {\bar{\beta }}, {\bar{\epsilon }} \}\). Then, the option price with strike K and maturity T written on the asset S is given by

$$\begin{aligned} C^S(\pmb {\xi }) = \mathbb {E}[\textrm{e}^{-rT} g(S_T) ] \end{aligned}$$
(2.18)

where \(g(\cdot )\) is a payoff function and \(\pmb {\xi } = \{ \pmb {\xi }^\textrm{M}, \pmb {\xi }^\textrm{P} \}\).

Various ANN approaches have emerged for derivative pricing and financial asset pricing model calibration. One commonly employed technique involves utilizing deep learning to predict option prices. This process generally follows these steps:

(1) generate N sets of input vectors \(\pmb {\xi }_k\) for \(k = 1, 2, \ldots , N\),

(2) obtain \(C^S(\pmb {\xi }_k)\) for \(k = 1, 2, \ldots , N\) using numerical methods such as PDE and MC methods,

(3) train the ANN model, that is, minimize the loss function to determine the weights and biases of the ANN model using all pairs of inputs and outputs \(\{ \pmb {\xi }_k, C^S(\pmb {\xi }_k) \}_{k=1, \ldots , N}\), to obtain a map

$$\begin{aligned} \mathcal {M}_C: \pmb {\xi } \mapsto C^S_\textrm{ANN}, \end{aligned}$$

where \(C^S_\textrm{ANN}\) is the ANN prediction of the derivative price \(C^S\), and

(4) predict the option price at an arbitrary parameter set \(\pmb {\xi }\):

$$\begin{aligned} C^S(\pmb {\xi }) \approx C^S_\textrm{ANN}(\pmb {\xi }) \end{aligned}$$
(2.19)

One of the benefits of applying ANNs in the financial sector is that they can split the pricing process into two parts: Offline training steps (1)–(3), which use ANN models to obtain an accurate estimate of option prices, and prediction step (4), which forecasts and calculates option prices online with the trained ANN. The offline training is computationally intensive because it needs abundant numerical simulations to generate training and testing data, whereas the online prediction is fast enough for real-world applications. After the ANN models have learned the connection weights, the network can be reused to predict option prices for new input patterns. Therefore, practitioners can benefit from the online prediction with quick computations in their daily pricing work and perform the ANN training, which consumes much time, offline when they have enough time at hand.
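Steps (1)-(4) above can be sketched end-to-end with a tiny fully connected network. The stand-in "numerical pricer" below is the Bachelier formula, and the network size, learning rate, and parameter ranges are hypothetical choices made only to keep the example self-contained and fast:

```python
import numpy as np
from math import erf

# Direct pricing pipeline: (1) sample inputs, (2) price them with a stand-in
# "numerical" pricer (Bachelier formula), (3) fit a one-hidden-layer network
# by full-batch gradient descent, (4) predict at a new parameter set.
rng = np.random.default_rng(3)
Phi = np.vectorize(lambda x: 0.5 * (1.0 + erf(x / np.sqrt(2.0))))

def pricer(K, sigma, T=1.0, S0=1.0):          # Bachelier call price
    v = sigma * np.sqrt(T)
    d = (S0 - K) / v
    return v * np.exp(-d*d/2) / np.sqrt(2*np.pi) + (S0 - K) * Phi(d)

# (1)-(2): training inputs xi_k = (K, sigma) and target prices C(xi_k)
N = 512
X = np.column_stack([rng.uniform(0.8, 1.2, N), rng.uniform(0.1, 0.4, N)])
C = pricer(X[:, 0], X[:, 1])[:, None]

# (3): train a tiny MLP (2 -> 16 -> 1, tanh) with gradient descent
W1 = 0.5 * rng.standard_normal((2, 16)); b1 = np.zeros(16)
W2 = 0.5 * rng.standard_normal((16, 1)); b2 = np.zeros(1)
lr, losses = 0.05, []
for _ in range(3000):
    H = np.tanh(X @ W1 + b1)                  # hidden layer activations
    Y = H @ W2 + b2                           # predicted prices
    E = Y - C
    losses.append(float(np.mean(E**2)))
    gY = 2.0 * E / N                          # dLoss/dY
    gH = (gY @ W2.T) * (1.0 - H**2)           # back-prop through tanh
    W2 -= lr * H.T @ gY; b2 -= lr * gY.sum(0)
    W1 -= lr * X.T @ gH; b1 -= lr * gH.sum(0)

# (4): fast online prediction at a new parameter set
pred = np.tanh(np.array([1.0, 0.25]) @ W1 + b1) @ W2 + b2
print(losses[0], losses[-1], float(pred), float(pricer(1.0, 0.25)))
```

The split is visible in the code: the expensive loop in step (3) runs offline, whereas step (4) is a handful of matrix products that can be evaluated online in microseconds.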

The offline training is performed only once or a few times a year, but financial firms handle various products with many pricing models. Hence, for offline training, they need to run several hundred thousand to a few million numerical simulations for each product, which demands huge computational costs. Funahashi (2021a) combined the advantages of asymptotic expansion (AE) and ANNs by training an ANN on the residual term

$$\begin{aligned} D^S(\pmb {\xi }) = C^S(\pmb {\xi }) - C^S_\textrm{App}(\pmb {\xi }) \end{aligned}$$

between the option price, \(C^S(\pmb {\xi })\), and its asymptotic approximation, \(C^S_\textrm{App}(\pmb {\xi })\), to improve the stability and approximation accuracy. More specifically, the author proposed a mapping

$$\begin{aligned} \mathcal {M}_D: \pmb {\xi } \mapsto D_\textrm{ANN} \end{aligned}$$

and predicted the option value of arbitrary input \(\pmb {\xi }\) to be

$$\begin{aligned} C^S(\pmb {\xi }) \approx C^S_\textrm{App}(\pmb {\xi }) + D^S_\textrm{ANN}(\pmb {\xi }) \end{aligned}$$
(2.20)

If the base approximation is chosen appropriately, the variance of the target outputs \(\{ D^S(\pmb {\xi }_k) \}\) for \(k = 1, \ldots , N\) is very small compared to that of \(\{ C^S(\pmb {\xi }_k) \}\); hence, the prediction and convergence of the mapping \(\mathcal {M}_D\) become stable and fast.
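The variance reduction achieved by learning residuals can be illustrated with a self-contained toy example; it is not the LSVM of this paper. Here the "true" prices \(C^S\) are Black-Scholes prices with randomly drawn volatilities, and the base approximation \(C^S_\textrm{App}\) is the same formula with the volatility frozen at a reference level. All parameter ranges are hypothetical.

```python
import math
import random

def norm_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bs_call(S0, K, r, sigma, T):
    """Black-Scholes call price, used here as a stand-in for the target price C^S."""
    d1 = (math.log(S0 / K) + (r + 0.5 * sigma**2) * T) / (sigma * math.sqrt(T))
    d2 = d1 - sigma * math.sqrt(T)
    return S0 * norm_cdf(d1) - K * math.exp(-r * T) * norm_cdf(d2)

random.seed(0)
# Hypothetical parameter sets xi_k: vary the strike and the volatility.
params = [(100.0, 80.0 + 40.0 * random.random(), 0.01,
           0.15 + 0.20 * random.random(), 1.0) for _ in range(2000)]

prices = [bs_call(*p) for p in params]
# Crude base approximation C_App: same model, but with a frozen volatility of 0.25.
approx = [bs_call(S0, K, r, 0.25, T) for (S0, K, r, sigma, T) in params]
resid = [c - a for c, a in zip(prices, approx)]

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(var(prices), var(resid))  # the residual targets have much smaller variance
```

Even with this deliberately crude base approximation, the residual targets are far less dispersed than the raw prices, which is the property that stabilizes the training of \(\mathcal {M}_D\).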

Moreover, when, for instance, the WIC expansion is used, the approximate option price is given by a sum of products of polynomials with the CDF and PDF of the standard normal distribution,Footnote 5 and hence \(D^S(\pmb {\xi })\) is a smooth, infinitely differentiable function. This is useful not only for improving the efficiency of the ANN training but also for stabilizing the computation of the Greeks, whose approximability is guaranteed by Hornik et al. (1990).

Consequently, using European and barrier options within the LSVM framework in (1.1), Funahashi (2021a) empirically shows that this ANN training and prediction method is robust. Additionally, the approach reduces the required training set size by a factor of roughly 100 to 1000 while using fewer layers and nodes in the ANN architecture. This indicates that the extensive numerical computations required for derivative price calculations, which constitute a significant portion of the calculation time, can be reduced by a factor of 100–1000. Thus, the computational cost of the resource-intensive offline procedures can be substantially alleviated. Moreover, this reduction in computational overhead simultaneously improves the stability and accuracy of the online predictions of derivative prices.

2.3 ANN with quasi-process correction

ANNs with an asymptotic correction approach require the derivation of an approximate option formula and the prediction of the impact of the residual terms of the approximation option formula using ANNs. Generally, it can improve the ANN prediction as the approximation order increases. Moreover, it can estimate the degree of error, and hence, one can safely choose an appropriate \(C^S_\textrm{App}\). Note that \(C^S_\textrm{App}\) could be a general approximation formula; Kienitz et al. (2020) and Funahashi (2022) examined SABR models using the approximate implied normal and lognormal volatility derived by Hagan et al. (2002), respectively, and Funahashi (2022) tested the free-boundary SABR model using the approximation formula proposed in Antonov et al. (2015).

However, as the approximation order increases, the calculation becomes tedious and the number of expansion terms grows exponentially. For exotic derivatives under an LSVM, it is generally difficult to obtain an approximate solution. Although Funahashi and Higuchi (2018) derived an approximate pricing formula for the barrier option under the Heston model using a WIC expansion, the derivation is messy and requires many calculations.

Conversely, Kienitz et al. (2020) treat the approximation as a control variate (CV) for neural networks. Similar to its application in MC simulations, they used a completely different model, \({\bar{S}}\), for the approximate price to improve the quality of deep learning applied to option pricing problems. They exploited the fact that, for each parameter set, \(\pmb {\xi }_k = \{ \pmb {\xi }^M, \pmb {\xi }^P \}\), of the original process, if the model parameters of the approximation process, \(\bar{\pmb {\xi }}_k = \{ \pmb {\xi }_k^M, \bar{\pmb {\xi }}_k^P \}\), are suitably chosen, then a large portion of the price is already mimicked by the approximate price, leaving the residual

$$\begin{aligned} D(\pmb {\xi }_k, \bar{\pmb {\xi }}_k) = C^S(\pmb {\xi }_k) - C^{{\bar{S}}}_\textrm{App}(\bar{\pmb {\xi }}_k) . \end{aligned}$$
(2.21)

Motivated by this fact, they generated the mapping

$$\begin{aligned} \mathcal {M}_{{\bar{D}}}: \{ \pmb {\xi }_k, \bar{\pmb {\xi }}_k \} \mapsto D_\textrm{ANN}. \end{aligned}$$

The authors examined the use of the Black-Scholes price as an approximate benchmark for pricing European options in the Heston (1993) model and, similarly, employed a co-terminal European swaption as an approximation for pricing Bermudan swaptions in the Hull-White model.

Although its accuracy and speed of convergence are inferior to those of the direct approximation formula for the same model proposed in Funahashi (2021a), the ANN with quasi-process correction is more widely applicable; the error arising from the model differences is left to the learning process of the neural network, without the need to derive a complex asymptotic formula. However, in most cases, how to determine an appropriate model for \(C^S_\textrm{App}\) and suitable parameters, \(\bar{\pmb {\xi }}_k\), is not clear; hence, their method cannot be applied to general problems. As will be subsequently discussed, it is essential to select \(C^S_\textrm{App}\) carefully. If one selects a model whose distribution differs from the original one, the convergence will be slow, especially in deep-in-the-money and deep-out-of-the-money cases. Moreover, the prediction will be inferior to direct ANN learning.

Before running the ANN process, \(\bar{\pmb {\xi }}_k = \{ \pmb {\xi }^M_k, \bar{\pmb {\xi }}^P_k \}\) is estimated from \(\pmb {\xi }_k = \{ \pmb {\xi }^M_k, \pmb {\xi }^P_k \}\) under the condition \(\pmb {\xi }^M_k = \{r, K, S_0, T, \ldots \}\). Recall that the ANN first generates a mapping \(\mathcal {M}_{{\bar{D}}}\) using the training data

$$\begin{aligned} \{ \{ \pmb {\xi }_k, \bar{\pmb {\xi }}_k \}, D(\{ \pmb {\xi }_k, \bar{\pmb {\xi }}_k \}) \}_{k=1, 2, \ldots , N} \end{aligned}$$

where \(D(\{ \pmb {\xi }_k, \bar{\pmb {\xi }}_k \})\) is defined in (2.21), and then predicts the option price at an arbitrary parameter set \(\{ \pmb {\xi }, \bar{\pmb {\xi }} \}\):

$$\begin{aligned} C^S(\pmb {\xi }) \approx C^{{\bar{S}}}(\bar{\pmb {\xi }}) + D_\textrm{ANN}(\{ \pmb {\xi }, \bar{\pmb {\xi }} \}) . \end{aligned}$$
(2.22)

The \(\bar{\pmb {\xi }}\) used in training (2.21) should be consistent with that used in prediction (2.22); otherwise, the base approximation oscillates, and the predicted value jumps and produces poor results.

Notably, for offline training, one can utilize the prices of the target derivatives, \(C^S(\pmb {\xi }_k)\), to calibrate \(\bar{\pmb {\xi }}_k\), because the training data have already been generated by numerical simulation. However, recall that the original goal is to find \(C^S(\pmb {\xi }_k)\) from \(\pmb {\xi }_k\) and \(\bar{\pmb {\xi }}_k\) using the mapping \(\mathcal {M}_{{\bar{D}}}\); hence, for online prediction, one does not have \(C^S(\pmb {\xi }_k)\) and cannot obtain \(\bar{\pmb {\xi }}_k\) by calibration. Thus, by some means, we must estimate the appropriate parameter \(\bar{\pmb {\xi }}_k\) from the parameter \(\pmb {\xi }_k\) without using prices. Moreover, recall that only one strike is available in \(\pmb {\xi }_k\) to generate \(\bar{\pmb {\xi }}_k\). Hence, even if one obtains a suitable implied volatility at the strike K, \(\bar{\pmb {\xi }}^P\) changes from strike to strike because the pricing model incorporates the volatility skew and smile across different strikes, which makes the base approximation unstable.

In the next section, we propose a replication strategy,

$$\begin{aligned} S \approx {\bar{S}}, \end{aligned}$$

that minimizes the error and the unsuitability for ANN training and prediction. In the following section, we show that this approach is free from the contract parameters, including strikes; hence, it fits our purpose.

3 Proposed method

One aim of this study is to establish a unified approach to determining the suitable parameters and appropriate models for the ANN with quasi-process correction. To achieve this goal, we use a replication technique proposed by Funahashi (2021b) to replicate a complex model \(S_t\) from a simpler model \({\bar{S}}_t\), for which the closed-form solution of the target contingent claim is available.

From (2.9), if one can set the parameters of the simpler model, \({\bar{S}}_t\), such that

$$\begin{aligned} \Sigma _t^{{\bar{S}}} = \Sigma _t^S \ \ \text{ and } \ \ q_i^{{\bar{S}}}(t) = q_i^{S}(t) \end{aligned}$$
(3.23)

for \(i = 1, \ldots , 5\), the characteristic functions of the target process, \(S_t\), can be approximated by that of \({\bar{S}}_t\), that is,

$$\begin{aligned} \Psi ^S(\xi ) \approx \Psi ^{{\bar{S}}}(\xi ). \end{aligned}$$

From the one-to-one relationship between the distribution function and the characteristic function, the marginal distribution of \(S_t\) can be approximated by that of \({\bar{S}}_t\), and the European call option price of S with maturity T and strike K is approximated by that of \({\bar{S}}\) because

$$\begin{aligned} C^S_\textrm{App}(t) = C^{{\bar{S}}}_\textrm{App}(t). \end{aligned}$$

Before proceeding, we review the roles of the functions \(\Sigma ^S_t\) and \(q^S_i(t)\) for \(i = 1, \ldots , 5\), and determine the priority of the equations in (3.23) with an analysis based on a small-volatility expansion. From (2.6), \(a_1(t)\) is \(O(\epsilon )\) and follows a normal distribution \(N(0, \Sigma ^S_t)\), where \(\Sigma ^S_t\) is \(O(\epsilon ^2)\). Hence, if \(\epsilon \) is sufficiently small, then \(S_t\) is approximated by the normal process

$$\begin{aligned} S_t \approx F(0,t) ( 1 + a_1(t) ) \end{aligned}$$
(3.24)

with mean F(0, t) and variance \(F(0,t)^2 \Sigma _t\). Meanwhile, \(a_2(t)\) and \(a_3(t)\) are \(O(\epsilon ^2)\) and \(O(\epsilon ^3)\), respectively, and they increase the accuracy of \(X_t\) (and thus of \(S_t\)) when \(\epsilon \) is not negligible. \(q^S_1(t)\), \(q^S_2(t)\), and \(q_3^S(t) - q_5^S(t)\) are derived from the conditional expectations \(\mathbb {E}[ a_{2}(t) | a_{1}(t) = x ]\), \(\mathbb {E}[ a_{3}(t) | a_{1}(t) = x ]\), and \(\mathbb {E}[ a^2_{2}(t) | a_{1}(t) = x ]\) in (2.9), whose asymptotic orders are \(O(\epsilon ^2)\), \(O(\epsilon ^3)\), and \(O(\epsilon ^4)\), respectively. These deterministic functions correct the characteristic function, \(\Psi (\xi )\), by including the influence of \(a_2(t)\), \(a_3(t)\), and \(a_2^2(t)\), respectively. Thus, the functions \(q_1(t)\), \(q_2(t)\), and \(q_3(t) - q_5(t)\) determine the sizes of the corrections at the asymptotic orders \(O(\epsilon ^2)\), \(O(\epsilon ^3)\), and \(O(\epsilon ^4)\), respectively. Therefore, the first priority is matching \(\Sigma _t\) and \(q_1(t)\), followed by \(q_2(t)\), and then \(q_3(t) - q_5(t)\).

As an example of a simple process, the DD model in (2.13) has only two parameters, \({\bar{\beta }}\) and \({\bar{\epsilon }}\); hence, no solution exists for the six equations in (3.23). Therefore, we match the first two: \(\Sigma ^{{\bar{S}}}_t = \Sigma ^S_t\) and \(q^{{\bar{S}}}_1 = q^S_1\). Because

$$\begin{aligned} \Sigma ^{{\bar{S}}}_t = ({\bar{\epsilon }}^{*})^2 t, \quad q_1^{{\bar{S}}}(t) = \frac{1}{2} {\bar{\beta }}^{*} ({\bar{\epsilon }}^{*})^4 t^2 , \end{aligned}$$
(3.25)

optimal \({\bar{\epsilon }}^*\) and \({\bar{\beta }}^*\) can be explicitly determined as

$$\begin{aligned} {\bar{\epsilon }}^* = \sqrt{\frac{\Sigma ^S_t}{t}}, \qquad {\bar{\beta }}^* = \frac{2 q_1^S(t)}{(\Sigma ^S_t)^2} . \end{aligned}$$
(3.26)
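The replication step in (3.26) can be sketched in a few lines; the inputs \(\Sigma^S_t\) and \(q_1^S(t)\) below are placeholder values, not outputs of the WIC expansion, and the round-trip check simply verifies consistency with (3.25).

```python
import math

def dd_replication_params(Sigma_t, q1_t, t):
    """Optimal displaced-diffusion parameters from (3.26): match the variance
    Sigma_t and the first correction function q1_t of the target model."""
    eps_bar = math.sqrt(Sigma_t / t)       # matches Sigma^Sbar_t = eps^2 t
    beta_bar = 2.0 * q1_t / Sigma_t**2     # matches q1^Sbar(t) = beta eps^4 t^2 / 2
    return eps_bar, beta_bar

# Round trip: plug the optimal parameters back into (3.25).
t, Sigma_t, q1_t = 2.0, 0.08, 0.002       # illustrative values only
eps_b, beta_b = dd_replication_params(Sigma_t, q1_t, t)
assert abs(eps_b**2 * t - Sigma_t) < 1e-12
assert abs(0.5 * beta_b * eps_b**4 * t**2 - q1_t) < 1e-12
```

Because both equations can be inverted in closed form, no numerical calibration is needed at this stage, which is precisely why the replication is stable across strikes.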

The European option price under the DD model can be analytically obtained as

$$\begin{aligned} C_\textrm{DD}^{{\bar{S}}}(T, K; {\bar{\beta }}, {\bar{\epsilon }}) = \textrm{e}^{-\int \limits _0^T r(s) \textrm{d}s} \textrm{Bl} \left( \frac{F(0,T)}{{\bar{\beta }}}, K+\frac{1-{\bar{\beta }}}{{\bar{\beta }}} F(0,T), {\bar{\beta }} {\bar{\epsilon }} \sqrt{T} \right) , \end{aligned}$$
(3.27)

where

$$\begin{aligned} \textrm{Bl}(F,K,v) = F \Phi (d_1(F,K,v)) - K \Phi (d_2(F,K,v)) , \end{aligned}$$
(3.28)

with

$$\begin{aligned} d_1(F, K, v) = \frac{\log \left( \frac{F}{K} \right) + \frac{v^2}{2}}{v}, \quad d_2(F, K, v) = \frac{\log \left( \frac{F}{K} \right) - \frac{v^2}{2}}{v}. \end{aligned}$$

Hence, the European option price under the SDE (2.1) is approximated by

$$\begin{aligned} C^{S}(S,T) \approx C_\textrm{DD}^{{\bar{S}}}(T, K; {\bar{\beta }}^*, {\bar{\epsilon }}^*) \end{aligned}$$

where the error can be explicitly estimated as

$$\begin{aligned} C^{S}(S,T) - C_\textrm{DD}^{{\bar{S}}}(T, K; {\bar{\beta }}^*, {\bar{\epsilon }}^*)&\approx C^{S}_\textrm{App}(\pmb {\xi }) - C^{{\bar{S}}}_\textrm{App}(\bar{\pmb {\xi }}) \nonumber \\&= \frac{S_0 n(\widetilde{K};0, \Sigma _t)}{2 \Sigma _t^{4}} \bigg [ D_3(t) (\widetilde{K}^{4}-6 \widetilde{K}^{2} \Sigma _t + 3 \Sigma _t^{2}) \nonumber \\&\quad + \Sigma _t^{2} \left( D_4(t) + 2 D_2(t) \right) \left( \widetilde{K}^{2} - \Sigma _t \right) + D_5(t) \Sigma _t^4 \bigg ] \end{aligned}$$
(3.29)

where \(\Sigma _t = \Sigma ^S_t = \Sigma ^{{\bar{S}}}_t\) and \(D_i(t) = q_i^S(t) - q_i^{{\bar{S}}}(t)\) for \(i =2, \ldots , 5\). Note that the leading terms in (2.17) vanish in (3.29).
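The displaced-diffusion pricing formula (3.27) with the Black-type function (3.28) can be sketched as follows; a flat short rate is assumed for the discount factor, and all numerical inputs are illustrative.

```python
import math

def Phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def bl(F, K, v):
    """Black-type formula Bl(F, K, v), cf. (3.28)."""
    d1 = (math.log(F / K) + 0.5 * v * v) / v
    d2 = d1 - v
    return F * Phi(d1) - K * Phi(d2)

def dd_call(F0T, K, T, beta_bar, eps_bar, r=0.0):
    """European call under the displaced-diffusion model, cf. (3.27).
    A flat short rate r replaces the integral in the discount factor."""
    F_sh = F0T / beta_bar                        # shifted forward
    K_sh = K + (1.0 - beta_bar) / beta_bar * F0T # shifted strike
    v = beta_bar * eps_bar * math.sqrt(T)        # effective total volatility
    return math.exp(-r * T) * bl(F_sh, K_sh, v)

# Sanity check: with beta_bar = 1 and r = 0 the DD price collapses to the
# plain Black price.
p_dd = dd_call(100.0, 100.0, 1.0, 1.0, 0.2)
p_bl = bl(100.0, 100.0, 0.2)
assert abs(p_dd - p_bl) < 1e-12
```

The shift of both the forward and the strike by \((1-{\bar{\beta }})/{\bar{\beta }}\,F(0,T)\) is what lets a lognormal formula reproduce the skew implied by the displacement.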

The Black-Scholes model is another useful example:

$$\begin{aligned} \frac{\textrm{d}S_t}{S_t} = r \textrm{d}t + {\bar{\epsilon }} \textrm{d}W^S_t . \end{aligned}$$
(3.30)

In this case, only one parameter can be controlled; hence, this model is not very accurate at reproducing complex models. However, it is very flexible and can easily match the variance, \(\Sigma ^{{\bar{S}}}_t = \Sigma ^S_t\), with the optimal parameter \({\bar{\epsilon }}^*\):

$$\begin{aligned} {\bar{\epsilon }}^* = \sqrt{\frac{\Sigma ^S_t}{t}}. \end{aligned}$$
(3.31)

In Sect. 5, we compare the direct mapping, \(\mathcal {M}_C \), and our methods, \(\mathcal {M}_D\) and \(\mathcal {M}_{{\bar{D}}}\), with three base approximations: the approximate closed-form solution using the Wiener-Itô chaos expansion, and the replication method using the DD and BS models as base approximations.

4 Artificial neural networks

ANNs are a type of machine learning model inspired by biological neural networks. The perceptron, the prototype of ANNs, was proposed by Rosenblatt (1958); later, the backpropagation algorithm was developed and popularized by Rumelhart et al. (1986), which made it possible to efficiently perform the calculations necessary for updating the parameters when training multilayer neural networks. Today, ANNs are successful in many fields. The background of this success includes the development of the internet and related infrastructure, the availability of large-scale data that makes it possible to train neural networks on complex real-world problems without overfitting, and the dramatic improvement of computing hardware such as GPUs and multicore CPUs. For the history of multilayer neural networks, we cite Okatani (2015) and the references therein.

4.1 Feedforward neural network

Figure 1 shows the outline of a feedforward neural network. A perceptron (Fig. 1a) is a simple unit of a neural network that receives the input \(\pmb {x}=\{x_1, \ldots , x_n \}\), multiplies it by the weights \( W=\{w_1, \ldots , w_n \}\), which represent the strength of the connections between the layers, and adds a term called the bias, b, to perform the linear transformation

$$\begin{aligned} u = \sum _{i=1}^n w_i x_i + b. \end{aligned}$$

Subsequently, it applies a nonlinear activation function f to the value and computes the output

$$\begin{aligned} z=f(u). \end{aligned}$$

The activation function determines whether a neuron should be activated. Examples of activation functions include sigmoid, tanh, ReLU, and softmax; refer to Nwankpa et al. (2018) for an overview.
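The perceptron computation above (linear transform followed by an activation) can be sketched in a few lines; the weights and inputs below are arbitrary illustrative values.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

def relu(u):
    return max(0.0, u)

def perceptron(x, w, b, f):
    """Single unit: linear transform u = w.x + b followed by activation z = f(u)."""
    u = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(u)

# Here u = 0.5*1.0 + (-0.25)*2.0 + 0.0 = 0, so ReLU gives 0 and sigmoid gives 0.5.
z = perceptron([1.0, 2.0], [0.5, -0.25], 0.0, relu)
z2 = perceptron([1.0, 2.0], [0.5, -0.25], 0.0, sigmoid)
```

The choice of f only changes the last step; the linear part u is identical for every activation listed above.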

Fig. 1

Left-hand side panel indicates the perceptron, and the right-hand side panel shows a feedforward neural network

A feedforward neural network has a structure in which units (perceptrons) arranged in layers are connected between adjacent layers, and information propagates from the input side to the output side in one direction without any feedback loops; hence, it is sometimes called a multi-layer perceptron. The feedforward neural network in Fig. 1b consists of L layers, which we index from left to right as \(l=1, \ldots , L\). Here, \(l=1\) is the input layer, \(l=L\) is the output layer, and \(l=2, \ldots , L-1\) are called hidden layers. Moreover, we assume that each layer l has \(n_l\) nodes. From now on, we attach the layer number as a superscript to each variable to distinguish the inputs and outputs of each layer. When the input \(\pmb {z}^{(1)} = \pmb {x}\) is given to this network, the \((l+1)\)-th layer \((l=1, \ldots , L - 1)\) receives the output \(\pmb {z}^{(l)}\) from the previous layer and calculates

$$\begin{aligned} \pmb {u}^{(l+1)} = W^{(l+1)} \pmb {z}^{(l)} + \pmb {b}^{(l+1)} \end{aligned}$$
(4.1)

and then applies the activation function f to obtain the output

$$\begin{aligned} \pmb {z}^{(l+1)} = f(\pmb {u}^{(l+1)}) \end{aligned}$$
(4.2)

This way, information is propagated from the input layer through the hidden layer to the output layer, resulting in the final output being obtained as

$$\begin{aligned} \pmb {y} = \pmb {z}^{(L)}. \end{aligned}$$

Therefore, a feedforward neural network can be regarded as a deep nested function that gives the output

$$\begin{aligned} \pmb {y}(\pmb {x}; \pmb {w}) = {\bar{f}}(W^{(L)}f(W^{(L-1)}f(W^{(L-2)} \cdots f(W^{(2)} \pmb {x}+\pmb {b}^{(2)})+ \pmb {b}^{(L-1)}) + \pmb {b}^{(L)}) \end{aligned}$$
(4.3)

depending on the values of the weights \(W^{(l)}\) and biases \(\pmb {b}^{(l)}\) between the layers, given the input \(\pmb {x}\). Here, \(\pmb {w}=\{w_1, \ldots , w_P \}\) is a vector consisting of all elements of the \(W^{(l)}\) and \(\pmb {b}^{(l)}\), and \({\bar{f}}\) is the activation function of the output layer.
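The nested function (4.3) amounts to alternating affine maps (4.1) and activations (4.2). A minimal forward-pass sketch, with a linear output activation and hand-picked weights that are purely illustrative:

```python
def relu(u):
    return [max(0.0, ui) for ui in u]

def affine(W, z, b):
    """u^{(l+1)} = W^{(l+1)} z^{(l)} + b^{(l+1)}, cf. (4.1)."""
    return [sum(wij * zj for wij, zj in zip(row, z)) + bi
            for row, bi in zip(W, b)]

def forward(x, layers, f=relu, f_out=lambda u: u):
    """Evaluate the nested function (4.3); `layers` is a list of (W, b) pairs.
    An identity output activation is assumed here."""
    z = x
    for W, b in layers[:-1]:
        z = f(affine(W, z, b))
    W, b = layers[-1]
    return f_out(affine(W, z, b))

# Tiny 2-2-1 network: hidden u = (2-1, 1+0.5) = (1, 1.5), ReLU leaves it
# unchanged, and the output sums the hidden units: y = 2.5.
layers = [([[1.0, -1.0], [0.0, 1.0]], [0.0, 0.5]),
          ([[1.0, 1.0]], [0.0])]
y = forward([2.0, 1.0], layers)
```

Every layer repeats the same two operations, which is why the whole network can be written as the single composed expression in (4.3).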

In neural networks, when training data consisting of pairs of inputs \(\pmb {x}_i\) and outputs \(\pmb {y}_i\) are given

$$\begin{aligned} \{ (\pmb {x}_1, \pmb {y}_1), (\pmb {x}_2, \pmb {y}_2), \ldots , (\pmb {x}_N, \pmb {y}_N) \} \end{aligned}$$
(4.4)

the goal is to adjust \(\pmb {w}\) to reproduce these input–output pairs

$$\begin{aligned} \pmb {y}(\pmb {x}_i; \pmb {w}) \approx \pmb {y}_i , \end{aligned}$$
(4.5)

and to estimate an appropriate output \(\pmb {y}\) for an unknown input \(\pmb {x}\). An error function \(E(\pmb {w})\) is used as a measure of the closeness of (4.5). Additionally, it is common to normalize the data as a preprocessing step, because biases in the training data can hinder learning. The most widely used method normalizes the input–output data \(\pmb {x}_i=\{ x_{i1}, \ldots , x_{in} \} \ (i=1, \ldots , N)\) by

$$\begin{aligned} x_{ij} = \frac{x'_{ij} - \bar{x}'_j }{\sigma _j}, \quad \bar{x}'_j = \frac{1}{N} \sum _{i=1}^N x'_{ij}, \ \ \sigma _j = \sqrt{ \frac{1}{N} \sum \nolimits _{i=1}^N (x'_{ij} - \bar{x}'_j)^2 }. \end{aligned}$$

for each component \(j \ (j=1, \ldots , n)\) of the original input–output vector \(\pmb {x}'_i=\{ x'_{i1}, \ldots , x'_{in} \} \ (i=1, \ldots , N)\).
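The componentwise z-score normalization above can be sketched as follows; the sample matrix is illustrative, and the population (biased) standard deviation is used, as in the formula.

```python
import math

def zscore_normalize(X):
    """Normalize each component j across the N samples:
    x_ij = (x'_ij - mean_j) / std_j, with the population standard deviation."""
    N, n = len(X), len(X[0])
    means = [sum(X[i][j] for i in range(N)) / N for j in range(n)]
    stds = [math.sqrt(sum((X[i][j] - means[j]) ** 2 for i in range(N)) / N)
            for j in range(n)]
    return [[(X[i][j] - means[j]) / stds[j] for j in range(n)] for i in range(N)]

# Each column of the result has mean 0 and (population) variance 1.
Xn = zscore_normalize([[1.0, 10.0], [3.0, 30.0], [5.0, 50.0]])
```

After this transformation every input component lives on a comparable scale, which prevents components with large raw magnitudes from dominating the gradient updates.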

In this study, we use the squared error as the error function

$$\begin{aligned} E(\pmb {w}) = \sum _{i=1}^N E_i (\pmb {w}), \quad E_i (\pmb {w})= \frac{1}{2} \Vert \pmb {y}_i - \pmb {y}(\pmb {x}_i; \pmb {w}) \Vert ^2 . \end{aligned}$$
(4.6)

In other words, the purpose of learning is to find

$$\begin{aligned} \pmb {w}^* = \mathop {\arg \min }_{\pmb {w}} E(\pmb {w}) . \end{aligned}$$
(4.7)

4.2 Backpropagation

Generally, an optimization problem such as (4.7) is solved using methods such as Newton–Raphson, simplex, or Levenberg–Marquardt, which iteratively update the model parameters until a stopping criterion is met. However, in neural network problems, the scale of the optimization becomes large, making it difficult to calculate second- or higher-order derivatives. Therefore, gradient descent methods, which only require first derivatives, are used. In gradient descent, starting from a preset initial value \(\pmb {w}^0\), the current weight \(\pmb {w}^{m}\) is updated repeatedly as

$$\begin{aligned} \pmb {w}^{(m+1)} = \pmb {w}^{(m)} - \epsilon \Delta E, \quad \Delta E = \left\{ \frac{\partial E}{\partial w_1} , \ldots , \frac{\partial E}{\partial w_P} \right\} ^\textrm{T} \end{aligned}$$
(4.8)

to search for a local minimum point \(\pmb {w}^*\). Here, \(\epsilon \) is a coefficient that determines the size of the update and is called the learning rate.

In (4.6), the error function was calculated using all the training data; instead, at each step m, a suitable subset \(A_m \) (called a mini-batch, with \(|A_m|=M_m \le N\)) is selected from the training data, and the weights are updated using that mini-batch:

$$\begin{aligned} \pmb {w}^{(m+1)} = \pmb {w}^{(m)} - \epsilon \Delta E_{A_m}(\pmb {w}), \quad E_{A_m}(\pmb {w}) = \frac{1}{M_m} \sum _{i \in A_m} E_i(\pmb {w}). \end{aligned}$$
(4.9)

In particular, updating the parameters with the mini-batch size set to \(M_m = 1\) is called stochastic gradient descent (SGD). In (4.8), the objective function to be minimized is always the same, so once the iteration falls into a local minimum, it cannot escape; in SGD, by randomly selecting samples at each step m, the objective function differs every time, which greatly reduces the risk of getting stuck at an undesirable local minimum. In practice, \(M_m = 8\)–128 is often used to enjoy both the advantages of SGD and the benefits of efficient parallel computing.
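The mini-batch update (4.9) can be sketched on a toy least-squares problem; the data, learning rate, and batch size below are illustrative choices, not recommendations.

```python
import random

def sgd(w0, grad_Ei, N, epochs, M=8, eps=0.05, seed=0):
    """Mini-batch SGD as in (4.9): at each step, average the per-sample
    gradients over a randomly chosen mini-batch of size (at most) M."""
    rng = random.Random(seed)
    w = list(w0)
    idx = list(range(N))
    for _ in range(epochs):
        rng.shuffle(idx)                 # fresh random mini-batches each epoch
        for s in range(0, N, M):
            batch = idx[s:s + M]
            g = [0.0] * len(w)
            for i in batch:
                gi = grad_Ei(w, i)
                g = [gj + gij / len(batch) for gj, gij in zip(g, gi)]
            w = [wj - eps * gj for wj, gj in zip(w, g)]
    return w

# Toy problem: fit y_i = a x_i + b with E_i = 0.5 (a x_i + b - y_i)^2.
xs = [i / 10.0 for i in range(50)]
ys = [2.0 * x + 1.0 for x in xs]

def grad_Ei(w, i):
    r = w[0] * xs[i] + w[1] - ys[i]
    return [r * xs[i], r]               # (dE_i/da, dE_i/db)

a, b = sgd([0.0, 0.0], grad_Ei, len(xs), epochs=500)
```

On this noise-free data the iterates recover \(a \approx 2\) and \(b \approx 1\); the reshuffling at each epoch is exactly the mechanism described above for varying the objective from step to step.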

As discussed thus far, the efficient calculation of the gradient of the error function is crucial for performing gradient descent. However, in deep layers (layers close to the input) of a feedforward neural network, the gradients of the error function, \(\frac{\partial E(\pmb {w})}{\partial w^{(l)}_{ji}}\) and \(\frac{\partial E(\pmb {w})}{\partial b^{(l)}_{j}}\), become intricate. The backpropagation method offers an efficient approach to computing these gradients. Consider differentiating \(E_n(\pmb {w})\) with respect to the weights of the lth layer; from the chain rule, we get

$$\begin{aligned} \frac{\partial E_n(\pmb {w})}{\partial w^{(l)}_{j,i}} = \frac{\partial E_n(\pmb {w})}{\partial u^{(l)}_{j}} \frac{\partial u^{(l)}_{j}}{\partial w^{(l)}_{j,i}} \end{aligned}$$
(4.10)

Because \(u^{(l)}_{j} = \sum _{i=1}^{n_{l-1}} w^{(l)}_{j,i} z^{(l-1)}_{i} + b^{(l)}_j\), the second term on the right-hand side is given by

$$\begin{aligned} \frac{\partial u^{(l)}_{j}}{\partial w^{(l)}_{j,i}} = z^{(l-1)}_i . \end{aligned}$$
(4.11)

Concurrently, when considering the first term on the right-hand side, the effect of a change in \(u^{(l)}_{j}\) on \(E_n(\pmb {w})\) is transmitted through \(u^{(l+1)}_{k} = \sum _{j=1}^{n_{(l)}} w^{(l+1)}_{k,j} f(u^{(l)}_{j}) + b^{(l+1)}_k\); hence, using the chain rule again, we obtain

$$\begin{aligned} \frac{\partial E_n(\pmb {w})}{\partial u^{(l)}_{j}} = \sum _{k=1}^{n_{l+1}} \frac{\partial E_n(\pmb {w})}{\partial u^{(l+1)}_{k}} \frac{\partial u^{(l+1)}_{k}}{\partial u^{(l)}_{j}} = \sum _{k=1}^{n_{l+1}} \frac{\partial E_n(\pmb {w})}{\partial u^{(l+1)}_{k}} w^{(l+1)}_{k,j} f'(u^{(l)}_{j}) \end{aligned}$$

If we define \(\delta ^{(l)}_j = \frac{\partial E_n(\pmb {w})}{\partial u^{(l)}_{j}}\), then

$$\begin{aligned} \delta ^{(l)}_j= f'(u^{(l)}_{j}) \sum _{k=1}^{n_{l+1}} \delta ^{(l+1)}_k w^{(l+1)}_{k,j} . \end{aligned}$$
(4.12)

Therefore, substituting (4.11) and (4.12) into (4.10), we obtain

$$\begin{aligned} \frac{\partial E_n(\pmb {w})}{\partial w^{(l)}_{j,i}} = \delta ^{(l)}_j z^{(l-1)}_{i} \end{aligned}$$
(4.13)

Ultimately, we observe that the effect of a variation of \(w^{(l)}_{j,i}\), which represents the strength of the connection between unit i in layer \(l-1\) and unit j in layer l, on \(E_n(\pmb {w})\) is determined only by the delta of unit j, \(\delta ^{(l)}_j\), and the output of unit i in layer \(l-1\), \( z^{(l-1)}_{i}\). Notably, the delta of layer l, \(\delta ^{(l)}_j\), can be calculated according to (4.12) once the deltas of layer \(l+1\) are obtained. This can be repeated sequentially back from the output layer, and because the delta of the output layer

$$\begin{aligned} \delta ^{(L)}_j = \frac{\partial E_n(\pmb {w})}{\partial u^{(L)}_{j}} \end{aligned}$$

is given, we can calculate the delta of any layer. Therefore, from (4.12) and (4.13), we can calculate the gradients needed in (4.9). In this method, because the deltas are propagated from the output layer to the input layer, the error is corrected in the opposite direction of forward propagation; hence, the method is termed backpropagation.
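The delta recursion (4.12) and the gradient formula (4.13) can be verified numerically on a tiny network; the weights are arbitrary illustrative values, the hidden activation is a sigmoid (so \(f'(u) = z(1-z)\)), the output is linear, and the backpropagated gradient is checked against a finite difference.

```python
import math

def sigmoid(u):
    return 1.0 / (1.0 + math.exp(-u))

# Tiny network: 2 inputs -> 2 sigmoid hidden units -> 1 linear output.
W2 = [[0.3, -0.2], [0.1, 0.4]]; b2 = [0.05, -0.05]
W3 = [[0.7, -0.3]];             b3 = [0.1]
x, y_target = [1.0, 2.0], [0.5]

def forward(W2, b2, W3, b3):
    u2 = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W2, b2)]
    z2 = [sigmoid(u) for u in u2]
    u3 = [sum(w * zi for w, zi in zip(row, z2)) + b for row, b in zip(W3, b3)]
    return u2, z2, u3

def loss(W2, b2, W3, b3):
    _, _, u3 = forward(W2, b2, W3, b3)
    return 0.5 * sum((y - u) ** 2 for y, u in zip(y_target, u3))

# Backpropagation: output delta, then (4.12) back to the hidden layer,
# then the gradient via (4.13): dE/dw^{(l)}_{j,i} = delta^{(l)}_j z^{(l-1)}_i.
u2, z2, u3 = forward(W2, b2, W3, b3)
delta3 = [u - y for u, y in zip(u3, y_target)]             # linear output layer
delta2 = [z2[j] * (1 - z2[j])                              # f'(u) = z (1 - z)
          * sum(delta3[k] * W3[k][j] for k in range(len(delta3)))
          for j in range(len(z2))]
grad_W2 = [[delta2[j] * x[i] for i in range(len(x))] for j in range(len(delta2))]

# Finite-difference check of one gradient component.
h = 1e-6
W2p = [row[:] for row in W2]
W2p[0][1] += h
fd = (loss(W2p, b2, W3, b3) - loss(W2, b2, W3, b3)) / h
assert abs(grad_W2[0][1] - fd) < 1e-5
```

A single backward pass produces every component of the gradient at roughly the cost of one forward pass, whereas the finite-difference check requires one extra forward pass per weight; this cost difference is the practical point of backpropagation.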

5 Numerical example

In this section, we compare the accuracy and effectiveness of the ANN predictions of our method based on asymptotic correction with those based on quasi-process corrections using numerical examples. To this end, we examine call option and barrier option prices under an LSVM

$$\begin{aligned} \left\{ \begin{array}{rcl} \displaystyle \textrm{d}S_t &{}=&{} r(t) S_t \textrm{d}t + v_t \left[ \beta S_t + (1 - \beta ) F(0,t) \right] \textrm{d}W^S_t, \\ \textrm{d}v_t &{}=&{} \left( \theta (t) - \kappa (t) v_t \right) dt + \nu v_t \textrm{d}W^v_t \end{array} \right. \end{aligned}$$
(5.14)

The ANN approximations of the implied volatilities (respectively, barrier option prices) using our proposed mapping \(\mathcal {M}_D\) and \(\mathcal {M}_{{\bar{D}}}\) are denoted by \(\sigma ^D_\textrm{ANN}\) (respectively, \(B^D_\textrm{ANN}\)), while those using the direct mapping \(\mathcal {M}\) are denoted by \(\sigma _\textrm{ANN}\) (respectively, \(B_\textrm{ANN}\)).

The implied volatilities \(\sigma _\textrm{MC}\), \(\sigma _\textrm{WIC}\), \(\sigma _\textrm{DD}\), and \(\sigma _\textrm{BS}\) are calculated using the MC simulation, the Wiener-Itô chaos expansion, the replicated DD model, and the mimicked BS model, respectively. Similarly, the barrier option prices \(B_\textrm{MC}\), \(B_\textrm{WIC}\), and \(B_\textrm{BS}\) are calculated using those methods.

In this study, we generate the training and testing data by following the method used in Funahashi (2023). Notably, our method trains the ANN model to predict implied volatilities, \(\mathcal {M}_D: \xi \mapsto \sigma ^D_\textrm{ANN}(\xi )\), using the differences between the Monte Carlo and approximate implied volatilities, \(D(\xi _k) = \sigma _\textrm{MC}(\xi _k) - \sigma _\textrm{App}(\xi _k)\), for \(k = 1, \ldots , N\) with respect to the parameters \(\xi _k = \{ (S_0)_k, r_k, \beta _k, (v_0)_k, \nu _k, \rho _k, \kappa _k, \theta _k, K_k \}\). To prepare N sets of vectors \(\{ \xi _k \}\) for \(k = 1\) to N, we first generate M sets of vectors \(\{ v_l \}_{l = 1, \ldots , M}\), where \(M=N/21\). Each \(v_l\) is a vector of 9 elements, namely \(T_l, (S_0)_l, r_l, \beta _l, (v_0)_l, \nu _l, \rho _l, \kappa _l, \theta _l\). The elements of each \(v_l\) are generated from a uniform distribution over the given ranges.

To generate appropriate strikes, we do not use fixed values because, depending on the combination of the 9 elements, the volatility of the underlying asset can become extremely small or large. Instead, we use each \(v_l\) to run an MC simulation with W trials and then set

$$\begin{aligned} K_{1}= & {} \max (\mu - 2 \sqrt{V}, 0.6 F(0,T_l)), \end{aligned}$$
(5.15)
$$\begin{aligned} K_{21}= & {} \min (\mu + 2 \sqrt{V}, 1.5 F(0,T_l)), \end{aligned}$$
(5.16)

and \(K_k = K_1 + (k-1) \Delta K\) for \(k = 2, \ldots , 20\), where \(\Delta K = \frac{K_{21} - K_1}{20}\). Here, for each sample path \(w = 1, \ldots , W\), we generate \(F(\omega _w; 0,T)\) and compute the mean \(\mu = E[F(0,T)]\) and variance \(V = E[(F(0,T) - \mu )^2]\). More specifically, we run M Monte Carlo simulations to create training and testing data of size \(N = M \times 21\). For each Monte Carlo simulation, W trials are run. The training and testing data \(\xi _{i,k} = \{ v_i, K_k \}\) and target values \(\sigma _\textrm{MC}(\xi _{i,k})\) (respectively, \(B_\textrm{MC}(\xi _{i,k})\)) are obtained simultaneously.
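The strike-grid construction (5.15)–(5.16) can be sketched as follows; the sample of simulated forwards is a hypothetical stand-in for the W Monte Carlo paths.

```python
import math

def strike_grid(F_samples, F0T):
    """Build the 21-point strike grid: K_1 and K_21 from (5.15)-(5.16) use the
    MC mean and variance of F(0,T); the remaining strikes are equally spaced."""
    W = len(F_samples)
    mu = sum(F_samples) / W
    V = sum((f - mu) ** 2 for f in F_samples) / W
    K1 = max(mu - 2.0 * math.sqrt(V), 0.6 * F0T)
    K21 = min(mu + 2.0 * math.sqrt(V), 1.5 * F0T)
    dK = (K21 - K1) / 20.0
    return [K1 + k * dK for k in range(21)]

# Toy sample of simulated forwards (a real run would use W = 500,000 paths).
Ks = strike_grid([95.0, 100.0, 105.0, 110.0, 90.0], 100.0)
assert len(Ks) == 21
```

Anchoring the grid to the simulated mean and variance keeps the strikes within a region where implied volatilities are well defined, regardless of how volatile the sampled parameter set happens to be.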

For each test, we run M = 1000–20,000 MC simulations to create N = 21,000–420,000 datasets and split them in an 80/20 ratio, where 80% is used for training and 20% for testing, which is a common practice in data science. All of the results of the following tests are created from the out-of-sample inputs, that is, the latter 20%, which is not used for the training.

In the training stage, the hyper-parameters are set with Adam as the optimizer, ReLU as the activation function,Footnote 6 six hidden layers, and 32 nodes for each layer. The epoch length and batch sizes are set to 100 and 128, respectively.

5.1 Call option

We first examine call option prices under the LSVM (5.14) using the WIC approximation, the replicated DD model with the optimal parameters in (3.26), and the replicated BS model with the parameter in (3.31) as the base approximations. Here, we use the Euler-Maruyama scheme with W = 500,000 trialsFootnote 7 and 100 simulation time steps.

We generated uniformly distributed random vectors \(\pmb {\xi }^\textrm{P} = \{ \beta , v_0, \nu , \rho , \kappa , \theta \}\) and \(\pmb {\xi }^\textrm{M} = \{ T, S_0, r \}\) within the ranges given in Table 1. The strikes are computed following the strategy discussed in the introduction of this section to create \(\pmb {\xi } = \{ \pmb {\xi }^\textrm{P}, \pmb {\xi }^\textrm{M} \}\), where \(\pmb {\xi }^\textrm{M} = \{ T, S_0, r, K \}\). \(\sigma _\textrm{MC}(\pmb {\xi })\) and \(\sigma _\textrm{WIC}(\pmb {\xi })\) were computed using \(\pmb {\xi }\). To ensure the stability of the estimation, we discard any implied volatilities that are too small or too large:

$$\begin{aligned} \sigma _\textrm{A}(\xi _l) < 0.05 \ \textrm{or} \ \sigma _\textrm{A}(\xi _l) > 0.8 . \end{aligned}$$
(5.17)

Therefore, for the actual ANN training and testing, we use \(N' (\le N)\) data sets, \(\{ \xi _n \}_{n = 1, \ldots , N'}\), which exclude these cases.

Table 1 Upper and lower limits of the input model parameters \(\{ T, S_0, r, \beta , v_0, \nu , \rho , \kappa , \theta \}\) generated by uniform random variables

For the WIC approximation, the deterministic functions of the LSVM, \(p_1\) to \(p_8\) defined in Appendix A, can be explicitly computed as

$$\begin{aligned} p_1= & {} \frac{e^{-s (2 \kappa +r)} \left( p_{1a} +p_{1b}-p_{1c}+p_{1d}+p_{1e}-p_{1f} \right) }{\kappa ^3}, \ p_2 = \frac{\beta e^{-\kappa s} \left[ \theta \left( e^{\kappa s}-1\right) +\kappa v_0 \right] }{\kappa }, \\ p_3= & {} e^{-\kappa s}, \ p_4 = \nu \left[ \frac{\theta \left( e^{\kappa s}-1\right) }{\kappa } + v_0 \right] , p_5 = \frac{\beta e^{-\kappa s} \left[ \theta \left( e^{\kappa s}-1\right) +\kappa v_0 \right] }{\kappa }, \ p_6 = 0, \\ p_7= & {} \left( \beta +e^{r s}-1\right) e^{-s (\kappa +r)}, \ p_8 = (\beta -1) e^{-\kappa s} \left[ \frac{\theta \left( e^{\kappa s}-1\right) }{\kappa } + v_0 \right] \end{aligned}$$

where \(p_{1a} = \theta e^{s (2 \kappa +r)} \left( \theta \nu \rho +\kappa ^2\right) \), \(p_{1b} = \kappa e^{s (\kappa +r)} \left[ -2 \theta ^2 \nu \rho s+\theta \kappa (2 \nu \rho s v_0-1)-2 \theta \nu \rho v_0 + \kappa v_0 (\kappa +\nu \rho v_0) \right] \), \(p_{1c} = \nu \rho e^{r s} (\theta -\kappa v_0)^2\), \(p_{1d} = (\beta -1) \theta ^2 \nu \rho e^{2 \kappa s}\), \(p_{1e} = (\beta -1) \kappa \nu \rho e^{\kappa s} \left( -2 \theta ^2\,s+2 \theta v_0 (\kappa s-1)+\kappa v_0^2\right) \), and \(p_{1f} = (\beta -1) \nu \rho (\theta -\kappa v_0)^2\). The optimal DD parameters, \(\bar{\pmb {\xi }}^\textrm{P}_i = \left\{ {\bar{\epsilon }}^*, {\bar{\beta }}^* \right\} \), and BS parameter, \(\bar{\pmb {\xi }}^\textrm{P}_i = \left\{ {\bar{\epsilon }}^* \right\} \), can be explicitly determined using (3.26) and (3.31), respectively. \(\sigma _\textrm{DD}(\bar{\pmb {\xi }}_i)\)Footnote 8 and \(\sigma _\textrm{BS}(\bar{\pmb {\xi }}_i)\) are computed using the optimized parameters, where \(\bar{\pmb {\xi }}_i = \{ \pmb {\xi }^\textrm{M}_i, \bar{\pmb {\xi }}^\textrm{P}_i \}\).

Here, we generate N = 21,000, 210,000, and 420,000 sets of \(\xi \) (M = 1000, 10,000, and 20,000 MC simulations \(\times \) 21 strikes per trial) and remove 1513, 12,327, and 25,724 data sets (i.e., \(N'\) = 19,487, 197,673, and 394,276), respectively, following the condition (5.17). We use the MC scheme with 500,000 trials and 100 simulation time steps to compute \(\sigma _\textrm{MC}\). A single Monte Carlo trial requires approximately 2.46 s to complete on a PC with an Intel Core i9-10980XE CPU with 18 cores and 36 threads. The test is performed as a multi-threaded application running on a multi-core processor with 20 cores, and it takes 41 min to compute M = 20,000 MC simulations. In contrast, the computational costs of the WIC approximation, the replicated DD model, and the replicated BS model for a call option, listed in Table 3, are swift enough for practical usage.

Figures 2 and 3 compare the implied volatilities of our proposed methods and MC results for M = 1000 (N = 21,000) and M = 10,000 (N = 210,000), respectively. The upper left-, upper right-, lower left-, and lower right-hand panels plot the implied volatilities \(\sigma _\textrm{ANN}\) vs. \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) vs. \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) vs. \(\sigma _\textrm{MC}\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) vs. \(\sigma _\textrm{MC}\), respectively. The upper left-, upper right-, lower left-, and lower right-hand panels of Figs. 4, 5 and 6 show the frequency histograms of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) and \(\sigma _\textrm{MC}\), respectively. Here, \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) represent the implied volatilities calculated by the ANN through the direct mapping \(\mathcal {M}\), with WIC correction, using the DD model as quasi-process correction, and using the BS model as quasi-process correction, respectively. More specifically, \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) is derived using the mapping \(\mathcal {M}_D\), while \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) are computed using the mapping \(\mathcal {M}_{{\bar{D}}}\). Recall that the test data are the latter \(20\%\) of the N samples.

Fig. 2

Comparison of the implied volatilities derived by artificial neural network (ANN) and Monte Carlo (MC) results. The upper left, upper right, lower left, and lower right panels plot the implied volatilities \(\sigma _\textrm{ANN}\) vs. \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\textrm{WIC})\) vs. \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\textrm{DD})\) vs. \(\sigma _\textrm{MC}\), and \(\sigma ^D_\textrm{ANN}(\textrm{BS})\) vs. \(\sigma _\textrm{MC}\), respectively. Here, \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) represent the implied volatilities calculated by MC simulation, ANN through direct mapping \(\mathcal {M}\), ANN with WIC correction, ANN using DD model as quasi-process correction, and ANN using BS model as quasi-process correction, respectively. The test data, that is, \(20\%\) of \(N=21,000\) (\(M=1000\)) samples, are used

Fig. 3

Comparison of the implied volatilities derived by artificial neural network (ANN) and Monte Carlo (MC) results. The upper left, upper right, lower left, and lower right panels plot the implied volatilities \(\sigma _\textrm{ANN}\) vs. \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\textrm{WIC})\) vs. \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\textrm{DD})\) vs. \(\sigma _\textrm{MC}\), and \(\sigma ^D_\textrm{ANN}(\textrm{BS})\) vs. \(\sigma _\textrm{MC}\), respectively. Here, \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) represent the implied volatilities calculated by MC simulation, ANN through direct mapping \(\mathcal {M}\), ANN with WIC correction, ANN using DD model as quasi-process correction, and ANN using BS model as quasi-process correction, respectively. The test data, that is, \(20\%\) of \(N=210,000\) (\(M=10,000\)) samples, are used

Fig. 4

Frequency histograms of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\) (upper left), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\) (upper right), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\) (lower left), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) and \(\sigma _\textrm{MC}\) (lower right). Here, \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) represent the implied volatilities calculated by Monte Carlo (MC) simulation, artificial neural network (ANN) through direct mapping \(\mathcal {M}\), ANN with WIC correction, ANN using DD model as quasi-process correction, and ANN using BS model as quasi-process correction, respectively. The x-axis shows the difference in implied volatilities between MC results and those obtained by an ANN using four methods. The y-axis indicates how often each difference occurs. The test data, that is, 20% of \(N = 21,000\) (\(M=1000\)) samples, are used

Fig. 5

Frequency histograms of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\) (upper left), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\) (upper right), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\) (lower left), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) and \(\sigma _\textrm{MC}\) (lower right). Here, \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) represent the implied volatilities calculated by Monte Carlo (MC) simulation, artificial neural network (ANN) through direct mapping \(\mathcal {M}\), ANN with WIC correction, ANN using DD model as quasi-process correction, and ANN using BS model as quasi-process correction, respectively. The x-axis shows the difference in implied volatilities between MC results and those obtained by an ANN using four methods. The y-axis indicates how often each difference occurs. The test data, that is, 20% of \(N = 210,000\) (\(M=10,000\)) samples, are used

Fig. 6

Frequency histograms of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\) (upper left), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\) (upper right), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\) (lower left), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) and \(\sigma _\textrm{MC}\) (lower right). Here, \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) represent the implied volatilities calculated by Monte Carlo (MC) simulation, artificial neural network (ANN) through direct mapping \(\mathcal {M}\), ANN with WIC correction, ANN using DD model as quasi-process correction, and ANN using BS model as quasi-process correction, respectively. The x-axis shows the difference in implied volatilities between MC results and those obtained by an ANN using four methods. The y-axis indicates how often each difference occurs. The test data, that is, 20% of \(N = 420,000\) (\(M=20,000\)) samples are used

To understand this more intuitively, we compare the implied volatilities \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) under various test data sizes, using the parameters given in Table 2. Figures 13, 14, and 15 in Appendix C show the cases where the test data, that is, the latter 20% of \(N = 21,000\), 210,000, and 420,000 (i.e., \(M = 1000\), 10,000, and 20,000), respectively, are used.

Table 2 Parameter sets of the comparative statics used in Figs. 13, 14 and 15. The table shows the values of the parameters T, \(S_0\), r, \(\beta \), \(v_0\), \(\nu \), \(\rho \), \(\kappa \), \(\theta \) for each of the four scenarios considered in the analysis. The scenarios are labeled as A, B, C, and D

As observed from Figs. 2–6 and 13–15, \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) converges most swiftly. From Fig. 13, it is evident that \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) mostly converges to \(\sigma _\textrm{MC}\) even in the case of \(M = 1000\), while the other methods do not. For \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), the error is already kept within \(1\%\) even with \(N = 21,000\). When N is set to 210,000, \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) converges next, and when N is set to 420,000, \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) converges. Conversely, \(\sigma _\textrm{ANN}\) exceeds \(1\%\) error even with \(N = 420,000\). From Fig. 14, \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) has converged but \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) has not in Case A. From Fig. 15, \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) has converged but \(\sigma _\textrm{ANN}\) still shows some error in Cases B, C, and D.

As the distribution of the approximating model becomes more similar to that of the original, the amount of training data required for ANN learning decreases. The WIC expansion is the closest because it approximates the original distribution itself. Moreover, using the DD model, which approximates the original distribution well, reduces the training data much more than using the BS model with its log-normal distribution. Remarkably, however, even the BS model, whose distribution is very different from the original, lowers the required data amount compared with direct ANN learning. We provide a detailed discussion in Sect. 6.
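The underlying mechanism can be illustrated with a toy regression experiment. This is a hypothetical stand-in: a low-order polynomial fit replaces the ANN, and simple closed-form curves replace \(\sigma _\textrm{MC}\) and the analytic approximation; none of the functions below come from the paper. Learning the residual of a good prior needs far less data than learning the target directly:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma_true(x):
    # toy "true" target (stand-in for sigma_MC): smooth but wiggly
    return 0.2 + 0.05 * np.sin(3 * x) + 0.02 * x**2

def sigma_app(x):
    # toy analytic approximation (stand-in for sigma_WIC): captures the wiggle
    return 0.2 + 0.05 * np.sin(3 * x)

def fit_and_eval(train_x, target, deg=3):
    # fit a cubic to few samples, report max error on a dense test grid
    coef = np.polyfit(train_x, target(train_x), deg)
    test_x = np.linspace(-1.0, 1.0, 201)
    return np.max(np.abs(np.polyval(coef, test_x) - target(test_x)))

x = rng.uniform(-1.0, 1.0, 8)                     # only 8 training samples
err_direct = fit_and_eval(x, sigma_true)          # learn the target directly
err_resid = fit_and_eval(x, lambda t: sigma_true(t) - sigma_app(t))
# the residual is a smooth, low-order function (here 0.02*x**2),
# so it is recovered almost exactly from the same 8 samples
```

The direct fit must resolve the oscillatory component from scratch, while the residual fit only has to learn the small, smooth correction; this mirrors why the WIC-corrected and quasi-process-corrected mappings converge with far fewer samples.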

These results can also be confirmed from Table 3, which reports, with respect to N, the mean and variance of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) and \(\sigma _\textrm{MC}\), together with the computational times of online prediction and offline learning. In the table, “times (online) \(D_\textrm{ANN}\)” and “times (online) \(\sigma _\textrm{App}\)” show the computational times of the online prediction and approximation methods, respectively, whereas “time (offline)” indicates the computational time of the offline learning (excluding the numerical simulation). The direct mapping \(\mathcal {M}\) converges to the MC result as N increases, but the convergence speed is slower than that of the mapping \(\mathcal {M}_D\). This is consistent with the findings of Funahashi (2021a). To conduct a more detailed analysis, in Appendix D, we assess the impact and performance of both the new and previous methods across a range of ANN configurations, including different activation functions, numbers of nodes, and numbers of hidden layers.

Table 3 Mean and variance of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\), and \(\sigma ^D_\textrm{ANN}(\mathrm BS)\) and \(\sigma _\textrm{MC}\). The number of Monte Carlo simulations performed is \(M = 1000\), 10,000, and 20,000 (i.e., \(N = 21,000\), 210,000, and 420,000, respectively). “times (online) \(D_\textrm{ANN}\)” and “times (online) \(\sigma _\textrm{App}\)” show the computational times of the online prediction and approximation methods, respectively, whereas “time (offline)” indicates the computational time of the offline learning

5.2 Barrier option

This subsection examines up-and-in barrier options as an example of an exotic derivative. These options cannot be exercised until the price of the underlying asset reaches or exceeds a predetermined barrier level, B. Once the barrier level is reached, the option becomes exercisable, and the holder can buy or sell the underlying asset at the strike price, depending on whether the option is a call or put.
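The knock-in mechanism described above can be sketched in a minimal Monte Carlo simulation. This is an illustrative stand-in only: geometric Brownian motion replaces the paper's LSVM dynamics, and all parameter values are hypothetical; the point is simply that the payoff is paid only on paths whose running maximum reaches the barrier.

```python
import numpy as np

rng = np.random.default_rng(0)
S0, K, B, r, sigma, T = 100.0, 100.0, 120.0, 0.01, 0.2, 1.0
n_paths, n_steps = 100_000, 100
dt = T / n_steps

S = np.full(n_paths, S0)
hit = np.zeros(n_paths, dtype=bool)
for _ in range(n_steps):
    z = rng.standard_normal(n_paths)
    # GBM step as a stand-in for the LSVM (5.14) dynamics
    S *= np.exp((r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)
    hit |= S >= B          # the option knocks in once the barrier is reached

# up-and-in call: a vanilla call payoff, but only on knocked-in paths
payoff = np.where(hit, np.maximum(S - K, 0.0), 0.0)
ui_price = np.exp(-r * T) * payoff.mean()
```

By construction the up-and-in price is bounded above by the vanilla call price on the same paths, with equality only if every in-the-money path also touched the barrier.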

We use the approximate barrier option formula, \(B_\textrm{WIC}(\pmb {\xi })\), proposed by Funahashi and Higuchi (2018). The price of an up-and-in barrier option with barrier level B, maturity T, and strike K is approximated by

$$\begin{aligned} \textrm{UI}(T,K)= & {} \textrm{e}^{-\int \limits _0^T r(s) \textrm{d}s} \bigg [ \frac{\textrm{e}^{\Omega _T} }{2 \sqrt{2 \pi } \Sigma _T} \left( \textrm{e}^{-\frac{(\omega ^1_T(K) - {\dot{\omega }}_T)^2}{2T}} X_1(T) - \textrm{e}^{-\frac{(\omega ^1_T(B) - {\dot{\omega }}_T)^2}{2T}} X_2(T) \right) \nonumber \\{} & {} + \frac{\textrm{e}^{\Omega _T}}{2 \Sigma _T} X_3(T) \left( \Phi \left( \frac{\omega ^1_T(B) - {\dot{\omega }}_T}{\sqrt{T}} \right) - \Phi \left( \frac{\omega ^1_T(K) - {\dot{\omega }}_T}{\sqrt{T}} \right) \right) \nonumber \\{} & {} + \frac{F(0,T)}{\sqrt{2 \pi } \Sigma _T^{\frac{5}{2}}} \textrm{e}^{-\frac{{\bar{B}}^2}{2 \Sigma _T}} \left\{ {\bar{B}}^2 ({\bar{B}} + {\bar{K}}) q(T) - {\bar{K}} q(T) \Sigma _T + \Sigma _T^3 \right\} \nonumber \\{} & {} + F(0,T) {\bar{K}} \left( 1- \Phi \left( \frac{{\bar{B}}}{\sqrt{\Sigma _T}} \right) \right) \bigg ]. \end{aligned}$$
(5.19)

where \({\bar{K}} := 1 - K / F(0,T)\) and \({\bar{B}} := B/F(0,T)-1\). \(\Phi (x)\) is the cumulative distribution function of the standard normal distribution. q(t), \(\Sigma _T\), \(\omega _t^{1}(B)\), \(\Omega _T\), \({\dot{\omega }}_T\), and \(X_i(T)\) are defined in Appendix B.

This formula is based on a second-order chaos expansion. Hence, compared with the third-order expansion used for the call option cases in Sect. 5.1, the accuracy of the approximation gradually worsens as volatility increases and maturity lengthens. The approximation can be extended to higher-order terms; see Section 5.4 of Funahashi and Higuchi (2018). However, for typical exotic derivatives, such an approximation either does not exist or requires very complex calculations. Therefore, we keep the approximation at the second order and observe the effect of the replication method, which is expected to be applicable to more general cases.

We consider two types of datasets. The first type, Case E, has volatilities at the level observed in a normal market, whereas the second type, Case F, allows for higher volatilities, at the same level as those used in the previous subsection, with higher barrier levels and interest rates. More specifically, we generate uniformly distributed random vectors \(\pmb {\xi }^\textrm{P} = \{ \beta , v_0, \nu , \rho , \kappa , \theta , \epsilon \}\) and \(\pmb {\xi }^\textrm{M} = \{ T, S_0, r, U \}\) within the ranges given in Table 4. Here, \(U_1\) and \(U_2\) are used to generate the barrier level \(B = S_0 \times U_1\) and the volatility of volatility \(\nu = v_0 \times U_2\), respectively.

Table 4 Upper and lower limits of the input model parameters \(\{ T, S_0, r, \beta , v_0, \nu , \rho , \kappa , \theta , U_1, U_2 \}\) generated by uniform random variables. \(U_1\) and \(U_2\) are used to generate \(B = S_0 \times U_1\) and \(\nu = v_0 \times U_2\), respectively

Notably, for an up-and-in barrier option, if the strike price exceeds the barrier level, that is, \(K > B\), then the option reduces to a standard call option, which we have already considered in the previous section. Therefore, we omit these cases from our analysis. For Case E, to generate appropriate strikes, we consider strikes

$$\begin{aligned} K_{1}= & {} \max (\mu - 1.1 \sqrt{V}, 0.9 F(0,T_l)), \end{aligned}$$
(5.20)
$$\begin{aligned} K_{21}= & {} \min (\min (\mu + 1.1 \sqrt{V}, 1.1 F(0,T_l)), B), \end{aligned}$$
(5.21)

and \(K_k = K_1 + (k-1) \Delta K\) for \(k = 2, \ldots , 20\), where \(\Delta K = \frac{K_{21} -K_1}{20}\), giving 21 strikes per trial. Whereas for Case F, we use (5.15) for \(K_1\) but modify \(K_{21}\) as

$$\begin{aligned} K_{21} = \min (\min (\mu + 2 \sqrt{V}, 1.5 F(0,T_l)), B). \end{aligned}$$
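The Case E grid construction in (5.20)–(5.21) can be sketched as follows. The values of \(\mu \), V, \(F(0,T_l)\), and B below are hypothetical placeholders chosen only to exercise the formulas:

```python
import numpy as np

# Hypothetical inputs (placeholders, not the paper's test values)
mu, V, F0T, B = 1.02, 0.04, 1.0, 1.25

# Case E endpoints, following (5.20)-(5.21)
K1 = max(mu - 1.1 * np.sqrt(V), 0.9 * F0T)
K21 = min(min(mu + 1.1 * np.sqrt(V), 1.1 * F0T), B)
# (Case F widens the upper endpoint: min(min(mu + 2*sqrt(V), 1.5*F0T), B))

dK = (K21 - K1) / 20.0
strikes = [K1 + (k - 1) * dK for k in range(1, 22)]   # K_1, ..., K_21
```

Capping \(K_{21}\) at B implements the observation above that strikes beyond the barrier reduce the contract to a vanilla call and are therefore excluded.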

For Case F, we adopt the same condition (5.17) as used in the call option case. For Case E, we lower the upper limits of the parameters that express maturity and volatility. This can make the volatility too small and cause the MC results to be inaccurate for deep-in-the-money or deep-out-of-the-money cases. Therefore, we further limit the acceptable range of implied volatility, removing data sets satisfying

$$\begin{aligned} \sigma _\textrm{A}(\xi _l) < 0.1 \ \textrm{or} \ \sigma _\textrm{A}(\xi _l) > 0.8 \end{aligned}$$
(5.22)

As in the call option cases, for the actual ANN training and testing, we use \(N' (\le N)\) data sets, \(\{ \xi _n \}_{n = 1, \ldots , N'}\), which exclude these cases.
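In code, this filtering step amounts to a boolean mask over the sample implied volatilities. The sketch below uses synthetic uniform draws in place of \(\sigma _\textrm{A}(\xi _l)\), so the counts are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(1)
sigma_a = rng.uniform(0.0, 1.0, 1000)   # hypothetical implied vols (stand-ins)

# keep the complement of condition (5.22): drop sigma < 0.1 or sigma > 0.8
keep = (sigma_a >= 0.1) & (sigma_a <= 0.8)
sigma_train = sigma_a[keep]             # the N' <= N sets used for training/testing
```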

For Case E, we prepare N = 21,000 and 420,000 (M = 1000 and 20,000) data sets and filter out 651 and 15,414 (i.e., \(N'\) = 20,349 and 404,586) data sets, respectively, due to the condition (5.22). Conversely, for Case F, we prepare N = 105,000 and 420,000 (M = 5000 and 20,000) sets of \(\xi \) and remove 15,183 and 61,110 (i.e., \(N'\) = 89,817 and 358,890) data sets, respectively, following the condition (5.17). We use the MC scheme with 500,000 trials and 1000 simulation time steps.

A single Monte Carlo trial takes approximately 24.4 s to complete on a PC with an Intel Core i9-10980XE CPU with 18 cores and 36 threads. The test is performed as a multi-threaded application running on a multi-core processor with 20 cores, and it takes seven hours to perform M = 20,000 MC simulations. Because the deterministic functions \(\Sigma _t\) and q(t) (defined in Appendix B) used in the WIC approximation for the barrier option can be computed explicitly as follows, it takes only 0.175 ms to compute the WIC approximation for a barrier option, which is fast enough even compared with the online prediction of the neural network.

$$\begin{aligned} \Sigma _s= & {} \frac{e^{-2 \kappa s}}{2 \kappa ^3} \left( \theta ^2 \left( e^{2 \kappa s} (2 \kappa s - 3 ) + 4 e^{\kappa s}-1 \right) + \kappa ^2 v_0^2 \left( e^{2 \kappa s}-1 \right) + 2 \theta \kappa v_0 \left( e^{\kappa s}-1\right) ^2\right) , \\ q(s)= & {} \frac{e^{-3 \kappa s}}{6 \kappa ^6} \left( A(s) + B(s) \right) . \end{aligned}$$

where

$$\begin{aligned} A(s){} & {} = 3 \beta e^{\kappa s} \left[ \sinh (\kappa s) \left( \theta ^2 (\kappa s-1)+\kappa ^2 v_0^2\right) +\theta \cosh (\kappa s) (\theta (\kappa s-2)+2 \kappa v_0) \right. \\{} & {} \quad \left. + 2 \theta (\theta -\kappa v_0) \right] ^2, \\ B(s){} & {} = \kappa \nu \rho \left[ e^{3 \kappa s} \left( -16 \theta ^3+6 \theta ^2 \kappa (\theta s+v_0)+3 \theta \kappa ^2 v_0^2+\kappa ^3 v_0^3\right) \right. \\{} & {} \quad \left. + 6 \theta e^{2 \kappa s} \left( 3 \theta ^2-\kappa ^2 v_0 (2 \theta s+v_0)+\theta \kappa (2 \theta s-v_0)\right) \right. \\{} & {} \quad \left. - 3 \kappa e^{\kappa s} (\kappa v_0-\theta ) (\kappa v_0 (2 \theta s+v_0)-2 \theta (\theta s+v_0)) - 2 (\theta -\kappa v_0)^3 \right] . \end{aligned}$$
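These closed forms can be transcribed and sanity-checked numerically. The parameter values below are illustrative, not taken from the paper's test cases; the checks themselves (\(\Sigma _s \approx v_0^2 s\) and \(q(s) \rightarrow 0\) as \(s \rightarrow 0\)) follow from Taylor-expanding the displayed expressions.

```python
import numpy as np

def Sigma(s, kappa, theta, v0):
    """Direct transcription of the closed-form Sigma_s above."""
    e1, e2 = np.exp(kappa * s), np.exp(2 * kappa * s)
    return np.exp(-2 * kappa * s) / (2 * kappa**3) * (
        theta**2 * (e2 * (2 * kappa * s - 3) + 4 * e1 - 1)
        + kappa**2 * v0**2 * (e2 - 1)
        + 2 * theta * kappa * v0 * (e1 - 1) ** 2)

def q(s, kappa, theta, v0, beta, nu, rho):
    """Direct transcription of q(s) = e^{-3 kappa s} (A(s) + B(s)) / (6 kappa^6)."""
    e1, e2, e3 = np.exp(kappa*s), np.exp(2*kappa*s), np.exp(3*kappa*s)
    A = 3 * beta * e1 * (
        np.sinh(kappa*s) * (theta**2 * (kappa*s - 1) + kappa**2 * v0**2)
        + theta * np.cosh(kappa*s) * (theta * (kappa*s - 2) + 2 * kappa * v0)
        + 2 * theta * (theta - kappa * v0)) ** 2
    B = kappa * nu * rho * (
        e3 * (-16*theta**3 + 6*theta**2*kappa*(theta*s + v0)
              + 3*theta*kappa**2*v0**2 + kappa**3*v0**3)
        + 6*theta*e2 * (3*theta**2 - kappa**2*v0*(2*theta*s + v0)
                        + theta*kappa*(2*theta*s - v0))
        - 3*kappa*e1 * (kappa*v0 - theta) * (kappa*v0*(2*theta*s + v0)
                                             - 2*theta*(theta*s + v0))
        - 2 * (theta - kappa*v0)**3)
    return np.exp(-3*kappa*s) / (6 * kappa**6) * (A + B)

pars = dict(kappa=1.0, theta=0.2, v0=0.2)   # illustrative parameters
```

Closed forms like these are why the WIC price is in the sub-millisecond range: evaluation involves only exponentials and polynomials, with no path simulation.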

Using the Case E parameters in Table 4, Fig. 7 shows the comparisons between the ANN predictions and MC results of the up-and-in barrier option prices using the test data. The upper, middle, and lower panels show \(B_\textrm{ANN}\) vs. \(B_\textrm{MC}\), \(B^D_\textrm{ANN}(\mathrm WIC)\) vs. \(B_\textrm{MC}\), and \(B^D_\textrm{ANN}(\mathrm BS)\) vs. \(B_\textrm{MC}\), respectively. Here, \(B_\textrm{ANN}\), \(B^D_\textrm{ANN}(\mathrm WIC)\), and \(B^D_\textrm{ANN}(\mathrm BS)\) represent the barrier option prices calculated by the ANN through the direct mapping \(\mathcal {M}\), the ANN with WIC correction, and the ANN using the BS model as quasi-process correction, respectively. The left- and right-hand side panels indicate N = 21,000 (M = 1000) and N = 420,000 (M = 20,000), respectively. Concurrently, Fig. 8 shows frequency histograms of the differences between \(B_\textrm{ANN}\) and \(B_\textrm{MC}\) (upper), \(B^D_\textrm{ANN}(\mathrm WIC(2nd))\) and \(B_\textrm{MC}\) (middle), and \(B^D_\textrm{ANN}(\mathrm BS)\) and \(B_\textrm{MC}\) (lower panels).

Fig. 7

Comparisons between artificial neural network (ANN) prediction and Monte Carlo (MC) results of the up-and-in barrier option prices. The upper, middle, and lower panels show \(B_\textrm{ANN}\) vs. \(B_\textrm{MC}\), \(B^D_\textrm{ANN}(\mathrm WIC)\) vs. \(B_\textrm{MC}\), and \(B^D_\textrm{ANN}(\mathrm BS)\) vs. \(B_\textrm{MC}\), respectively. Here, \(B_\textrm{MC}\), \(B_\textrm{ANN}\), \(B^D_\textrm{ANN}(\mathrm WIC)\), and \(B^D_\textrm{ANN}(\mathrm BS)\) represent the barrier option price calculated by MC simulation, ANN through direct mapping \(\mathcal {M}\), ANN with WIC correction, and ANN using BS model as quasi-process correction, respectively. The left- and right-hand side panels indicate N = 21,000 (M = 1000) and N = 420,000 (M = 20,000), respectively. The Case E parameters in Table 4 are used

Fig. 8

Frequency histograms of the differences between \(B_\textrm{ANN}\) and \(B_\textrm{MC}\) (upper), \(B^D_\textrm{ANN}(\mathrm WIC(2nd))\) and \(B_\textrm{MC}\) (middle), and \(B^D_\textrm{ANN}(\mathrm BS)\) and \(B_\textrm{MC}\) (lower panels). Here, \(B_\textrm{MC}\), \(B_\textrm{ANN}\), \(B^D_\textrm{ANN}(\mathrm WIC(2nd))\), and \(B^D_\textrm{ANN}(\mathrm BS)\) represent the barrier option price calculated by Monte Carlo (MC) simulation, artificial neural network (ANN) through direct mapping \(\mathcal {M}\), ANN with WIC correction, and ANN using BS model as quasi-process correction, respectively. The x-axis shows the difference in barrier option prices between MC results and those obtained by an ANN using three methods. The y-axis indicates how often each difference occurs. The left- and right-hand side panels indicate N = 21,000 (M = 1000) and N = 420,000 (M = 20,000), respectively. The Case E parameters in Table 4 are used

Figures 7 and 8 indicate that the neural network of \(B^D_\textrm{ANN}(\mathrm WIC)\) converges most quickly. \(B^D_\textrm{ANN}(\mathrm WIC)\) has already converged for the most part with the learning data of N = 21,000 (M = 1000), whereas the estimation error of \(B^D_\textrm{ANN}(\mathrm BS)\) is noticeable at the right tail of the distribution, with errors exceeding \(-\)0.5. Even with a second-order WIC approximation, the convergence is faster than with the BS model. Conversely, \(B_\textrm{ANN}\) has not converged even with N = 420,000 (M = 20,000) and is insufficient for practical use.

Notably, however, the estimated results of \(B^D_\textrm{ANN}(\mathrm WIC)\) and \(B^D_\textrm{ANN}(\mathrm BS)\) do not differ as much as in the call option cases. The reduction to the second-order approximation has a significant impact on the accuracy of the model. At the same time, up-and-in barrier options have additional conditions built in, which severely limit the downside compared with their equivalent vanilla counterparts. Hence, even if we use an approximation based on the Black–Scholes model as an alternative to the Wiener–Itô chaos approximation, we can expect sufficient convergence if we keep N = 105,000–420,000.

Using the Case F parameters, Figs. 9 and 10 show the comparisons and the frequency histograms of the differences, respectively, between the ANN predictions and MC results of the up-and-in barrier option prices. In Fig. 9, the upper and lower panels show \(B_\textrm{ANN}\) vs. \(B_\textrm{MC}\) and \(B^D_\textrm{ANN}(\mathrm BS)\) vs. \(B_\textrm{MC}\), respectively, while the left- and right-hand side panels indicate N = 105,000 (M = 5000) and N = 420,000 (M = 20,000), respectively. In Fig. 10, the upper and lower panels indicate the differences between \(B_\textrm{ANN}\) and \(B_\textrm{MC}\) and between \(B^D_\textrm{ANN}(\mathrm BS)\) and \(B_\textrm{MC}\), respectively, while the left- and right-hand side panels indicate N = 105,000 (M = 5000) and N = 420,000 (M = 20,000), respectively.

Fig. 9

Comparisons between artificial neural network (ANN) prediction and Monte Carlo (MC) results of the up-and-in barrier option prices. The upper and lower panels show \(B_\textrm{ANN}\) vs. \(B_\textrm{MC}\) and \(B^D_\textrm{ANN}(\mathrm BS)\) vs. \(B_\textrm{MC}\), respectively, while the left- and right-hand side panels indicate N = 105,000 (M = 5000) and N = 420,000 (M = 20,000), respectively. Here, \(B_\textrm{MC}\), \(B_\textrm{ANN}\), and \(B^D_\textrm{ANN}(\mathrm BS)\) represent the barrier option price calculated by Monte Carlo (MC) simulation, artificial neural network (ANN) through direct mapping \(\mathcal {M}\), and ANN using BS model as quasi-process correction, respectively. The parameters are set to Case F in Table 4

Fig. 10

Frequency histograms of the differences between \(B_\textrm{ANN}\) and \(B_\textrm{MC}\) (upper) and \(B^D_\textrm{ANN}(\mathrm BS)\) and \(B_\textrm{MC}\) (lower panels). The x-axis shows the difference in barrier option prices between Monte Carlo (MC) results and those obtained by an artificial neural network (ANN) using two methods. Here, \(B_\textrm{MC}\), \(B_\textrm{ANN}\), and \(B^D_\textrm{ANN}(\mathrm BS)\) represent the barrier option price calculated by MC simulation, ANN through direct mapping \(\mathcal {M}\), and ANN using BS model as quasi-process correction, respectively. The y-axis indicates how often each difference occurs. The left- and right-hand side panels indicate N = 105,000 (M = 5000) and N = 420,000 (M = 20,000), respectively. The parameters are set to Case F in Table 4

As shown in Figs. 9 and 10, the neural network used in \(B^D_\textrm{ANN}(\mathrm BS)\) has sufficient accuracy for practical use when N = 105,000–420,000, while \(B_\textrm{ANN}\) does not converge sufficiently. This suggests that the direct method needs more training data, which in turn requires more time-consuming numerical simulation and hence significantly longer computation. Furthermore, the time required for offline learning cannot be overlooked. Therefore, even when an accurate approximate solution is not available for a derivative written on an underlying asset that follows a complex stochastic process, employing an ANN with a simple model correction, such as the Black–Scholes (BS) model, enables efficient learning and prediction.

6 Discussion

Here, we discuss the limitations of the ANN with quasi-process correction, using call option prices under the LSVM (5.14) with the SABR stochastic volatility model

$$\begin{aligned} \left\{ \begin{array}{rcl} \displaystyle \frac{\textrm{d}\bar{S}_t}{\bar{S}_t} &{}=&{} v_t \bar{S}^{\beta -1}_t \textrm{d}W^{{\bar{S}}}_t, \\ \textrm{d}v_t &{}=&{} \nu v_t \textrm{d}W^{{\bar{v}}}_t, \ \ v_0 = \alpha \end{array} \right. \end{aligned}$$
(6.1)

for the base approximation, where \(W^{{\bar{S}}}_t\) and \(W^{{\bar{v}}}_t\) are two standard Brownian motions with correlation \(\textrm{d}W^{{\bar{S}}}_t \textrm{d}W^{{\bar{v}}}_t = {\bar{\rho }} \textrm{d}t\).

For the WIC approximation, the first three functions of the SABR model are explicitly computed as

$$\begin{aligned} \Sigma ^{{\bar{S}}}_T= & {} \frac{\alpha \left[ \left\{ \alpha \beta T (\alpha (\beta -1)+2 \nu {\bar{\rho }} )+2 \right\} ^3-8 \right] }{12 \beta (\alpha (\beta -1)+2 \nu {\bar{\rho }} )}, \\ q^{{\bar{S}}}_1(T)= & {} \frac{1}{32} \alpha ^3 T^2 (\alpha \beta +\nu {\bar{\rho }} ) (\alpha \beta T (\alpha (\beta -1)+2 \nu {\bar{\rho }} )+4)^2, \\ q^{{\bar{S}}}_2(T)= & {} \frac{1}{384} \alpha ^4 T^3 \left( \alpha ^2 \beta (2 \beta -1)+3 \alpha \beta \nu {\bar{\rho }} +\nu ^2 {\bar{\rho }} ^2\right) (\alpha \beta T (\alpha (\beta -1)+2 \nu {\bar{\rho }} )+4)^3 \end{aligned}$$

To obtain appropriate SABR parameters, \(\bar{\pmb {\xi }} = \{ \alpha , \beta , \nu , \bar{\rho } \}\), we use the same correlation as in the LSVM (5.14), \(\bar{\rho } = \rho \). The remaining parameters, \(\pmb {\theta } = \{ v_0, \beta , \nu \}\), are obtained by solving the problem

$$\begin{aligned} \bar{\pmb {\theta }} = \underset{\pmb {\theta } \in \Theta }{{\text {argmin}}} \left( (\Sigma ^S_t - \Sigma ^{{\bar{S}}}_t)^2 + (q_1^S(t) - q_1^{{\bar{S}}}(t))^2 + (q_2^S(t) - q_2^{{\bar{S}}}(t))^2 \right) \end{aligned}$$
(6.2)

within the range \(\Theta \): \(0.05< v_0 < 2\), \(0.55< \beta < 0.998\), and \(0.002< \nu < 0.8\).
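The fit (6.2) is a three-equations-in-three-unknowns problem over the SABR functionals displayed above, and can be sketched with a damped Gauss-Newton iteration. The target values below are hypothetical stand-ins for the LSVM functionals \(\Sigma ^S_t\), \(q_1^S(t)\), and \(q_2^S(t)\) (here generated from known SABR parameters so the fit can be verified); the solver itself is our own illustrative choice, not the paper's.

```python
import numpy as np

def sabr_funcs(T, alpha, beta, nu, rho):
    """The three SABR functionals (Sigma, q1, q2) displayed above."""
    c = alpha * (beta - 1.0) + 2.0 * nu * rho
    sig = alpha * ((alpha * beta * T * c + 2.0) ** 3 - 8.0) / (12.0 * beta * c)
    q1 = alpha**3 * T**2 * (alpha * beta + nu * rho) \
         * (alpha * beta * T * c + 4.0) ** 2 / 32.0
    q2 = alpha**4 * T**3 \
         * (alpha**2 * beta * (2.0 * beta - 1.0)
            + 3.0 * alpha * beta * nu * rho + nu**2 * rho**2) \
         * (alpha * beta * T * c + 4.0) ** 3 / 384.0
    return np.array([sig, q1, q2])

def replicate(target, T, rho_bar, p0, iters=60, h=1e-7):
    """Damped Gauss-Newton fit of (alpha, beta, nu) to the three functionals."""
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        r = sabr_funcs(T, *p, rho_bar) - target
        if np.dot(r, r) < 1e-24:
            break
        J = np.empty((3, 3))
        for j in range(3):                      # forward-difference Jacobian
            dp = np.zeros(3); dp[j] = h
            J[:, j] = (sabr_funcs(T, *(p + dp), rho_bar)
                       - sabr_funcs(T, *p, rho_bar)) / h
        step, *_ = np.linalg.lstsq(J, -r, rcond=None)
        t = 1.0
        while t > 1e-6:                         # backtrack until residual decreases
            r_new = sabr_funcs(T, *(p + t * step), rho_bar) - target
            if np.dot(r_new, r_new) < np.dot(r, r):
                break
            t *= 0.5
        p = p + t * step
    return p

T, rho_bar = 1.0, -0.3
true_p = (0.4, 0.8, 0.3)                        # hypothetical "LSVM-implied" values
target = sabr_funcs(T, *true_p, rho_bar)
fit = replicate(target, T, rho_bar, p0=[0.45, 0.75, 0.25])
```

In practice the search would additionally be constrained to the box \(\Theta \) stated above, e.g. by clipping the iterate after each step.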

Figure 11 shows frequency histograms of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\) (upper-left panel), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\) (upper-right panel), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\) (lower-left panel), and \(\sigma ^D_\textrm{ANN}(\mathrm SABR)\) and \(\sigma _\textrm{MC}\) (lower-right panel). The parameters are generated by uniformly distributed random vectors within the range of Case E given in Table 4. Here, we generate N = 210,000 (i.e., M = 10,000) sets of \(\xi \) and remove 714 data sets (i.e., \(N'\) = 209,286) following the condition (5.17). Here, \(\sigma ^D_\textrm{ANN}(\mathrm SABR)\) represents the implied volatilities calculated by the ANN using the SABR model as quasi-process correction, and \(\sigma _\textrm{MC}\) is computed using the Euler–Maruyama scheme with W = 500,000 trials and 100 simulation time steps.

From Fig. 11, we can see that the differences between \(\sigma ^D_\textrm{ANN}(\mathrm SABR)\) and \(\sigma _\textrm{MC}\) are distributed nearer 0 than the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\). Therefore, the former approximation is generally more efficient than the latter. However, it is also evident that the approximation through the SABR model exhibits a relatively large error. This discrepancy arises because the SABR model, unlike the LSVM (5.14), cannot accommodate negative values. When attempting to replicate the LSVM using the SABR model, the latter tends to adopt values near 0 instead of negative values, inflating the probability density near 0.

Figure 12 shows the cumulative distribution functions of the LSVM model, the replicated SABR model, and the replicated DD model. Here, we use the low-bias simulation scheme for the SABR model proposed by Chen et al. (2012), which introduced an efficient algorithm to simulate the squared Bessel process with an absorbing boundary at zero, whereas the Euler–Maruyama scheme is used to compute the underlying asset price in the DD model. To compute the cumulative distribution function (CDF), we run 1,000,000 MC trials with 300 simulation steps. The parameters of the original LSVM model, \(\pmb {\xi }^M = \{ \epsilon ^*, \beta ^* \}\), the replicated DD model, \(\bar{\pmb {\xi }}^M = \{ \epsilon ^*, \beta ^* \}\) in (3.26), and the replicated SABR model, \(\bar{\pmb {\theta }}\) in (6.2), are listed in Table 5.

In this case, the discrepancy between the distributions of the original LSVM and the replicated SABR processes is relatively large, which can hamper the estimation rather than improve on directly estimating the option price with a neural network. This observation is consistent with the findings presented in Section 8.5 of Funahashi (2021b). In summary, considering the distribution shape, it is not advisable to employ the SABR model for replicating the LSVM model.

Thus, when using an NN based on the approximation of a quasi-process, it is crucial to select the probability distribution of the underlying asset carefully, whereas NNs utilizing asymptotic methods are free of these concerns because they directly use the original distribution. This represents a significant advantage of the latter approach.

Table 5 Parameters of the original LSVM model, replicated DD model, and replicated SABR model used to compute the CDFs in Fig. 12
Fig. 11

Frequency histograms of the differences between \(\sigma _\textrm{ANN}\) and \(\sigma _\textrm{MC}\) (upper-left panel), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\) and \(\sigma _\textrm{MC}\) (upper-right panel), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\) and \(\sigma _\textrm{MC}\) (lower-left panel), and \(\sigma ^D_\textrm{ANN}(\mathrm SABR)\) and \(\sigma _\textrm{MC}\) (lower-right panel). Here, \(\sigma _\textrm{MC}\), \(\sigma _\textrm{ANN}\), \(\sigma ^D_\textrm{ANN}(\mathrm WIC)\), \(\sigma ^D_\textrm{ANN}(\mathrm DD)\), and \(\sigma ^D_\textrm{ANN}(\mathrm SABR)\) represent the implied volatilities calculated by Monte Carlo (MC) simulation, an artificial neural network (ANN) through direct mapping \(\mathcal {M}\), an ANN with WIC correction, an ANN using the DD model as quasi-process correction, and an ANN using the SABR model as quasi-process correction, respectively. The x-axis shows the difference in implied volatilities between the MC results and those obtained by an ANN using the four methods. The y-axis indicates how often each difference occurs. The test data, that is, \(20\%\) of the N = 210,000 (M = 10,000) samples, are used. The parameter ranges are set to Case E in Table 4

Fig. 12

Cumulative distribution functions (CDF) of the LSVM model, the replicated SABR model, and the replicated DD model with the model parameters listed in Table 5

7 Conclusion

This study introduces two methods for efficiently learning derivative prices with neural networks. The first is to learn the difference between the derivative price and its asymptotic expansion; the second is to learn the difference between the prices of derivatives written on two different underlying asset prices, where one underlying follows the target complex stochastic process and the other follows a relatively simple stochastic process that admits a closed-form solution for the target derivative price. The former method has the advantage that it can be much more efficient than the latter, especially when an accurate approximate solution is available. The latter method is an alternative valuation method when no efficient approximate solution for the derivative value exists, provided one can freely determine the model parameters of the quasi-process that approximates the underlying asset process. Although the latter method requires more training data than the former, we demonstrate that it remains significantly more efficient than directly learning the derivative price with a neural network. Even if a relatively simple quasi-process, such as the Black-Scholes model, is employed, the learning and estimation efficiency are overwhelmingly superior to those of the direct method.
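The mechanics of the first method can be sketched as follows: learn only the residual between the target price and a closed-form base approximation \(C_\textrm{App}\), then add \(C_\textrm{App}\) back at evaluation time. In this minimal sketch, a low-degree polynomial regression stands in for the neural network, a Black-Scholes price with a perturbed volatility plays the role of the target model, and the Black-Scholes price itself plays the role of \(C_\textrm{App}\); these are illustrative assumptions, not the paper's setup.

```python
import numpy as np
from math import erf, exp, log, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def bs_call(S, K, T, sigma, r=0.0):
    # Black-Scholes call price; used both as the stand-in "target" model
    # and, with a different volatility, as the base approximation C_App
    d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T))
    d2 = d1 - sigma * sqrt(T)
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2)

K = np.linspace(0.5, 1.5, 101)
target = np.array([bs_call(1.0, k, 1.0, 0.25) for k in K])  # "true" prices
c_app = np.array([bs_call(1.0, k, 1.0, 0.20) for k in K])   # closed-form C_App

# Residual learning: fit only target - C_App, then add C_App back at
# evaluation time. The polynomial regression stands in for the ANN.
deg = 5
resid_coef = np.polyfit(K, target - c_app, deg)
hybrid = np.polyval(resid_coef, K) + c_app

hybrid_err = np.max(np.abs(hybrid - target))  # error of the corrected price
app_err = np.max(np.abs(c_app - target))      # error of C_App alone
print(hybrid_err, app_err)
```

The practical gain reported in the paper comes from the residual being small and smooth, so the network (here, the polynomial) needs far less training data than a direct fit of the price surface.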

However, as shown in Section 6 using the SABR model, the latter method may have the same or even lower approximation accuracy than direct derivative-price learning if the quasi-process of the underlying asset price is not selected appropriately. Therefore, the key to the latter method is to determine appropriate model parameters that mimic the original underlying asset process. An important contribution of this study is a unified replication strategy for determining the model parameters of quasi-processes from the original underlying asset processes. Moreover, we showed that this approach is free from contract parameters, including strikes; hence, it fits our purpose. In summary, the two methods introduced in this study provide a more efficient way to learn derivative prices with neural networks. This is especially useful for stochastic volatility models and other cases where analytic solutions do not exist or are computationally expensive to obtain.

In practice, we first apply a general approximation method, such as the singular perturbation method or an asymptotic expansion, to the price of the derivative. If an accurate approximation is available, we use the first method (ANN with asymptotic correction). Conversely, if no approximate solution exists, or its accuracy is poor, or its computation is time-consuming, we calculate the difference between the target derivative price and the corresponding price under the quasi-process obtained by the replication method, that is, the second method (ANN with quasi-process correction). The proposed methods not only reduce the amount of training data required for the neural network offline but also significantly improve the accuracy of the online estimation used in daily trading.

Although this study only examines derivative-price estimation by Monte Carlo simulation, the proposed methods can be combined with other computationally expensive numerical techniques, such as partial differential equation (PDE) solvers, finite difference methods (FDMs), numerical integration, or approximation methods. Therefore, the approach is generally effective for problems whose solutions can be computed but take too long to be used in daily trading.

Our method also works in more general settings, including multi-dimensional diffusions. We can then consider the valuation of financial products such as basket options and spread options using the approximation formulas derived in Funahashi and Kijima (2014) for \(C_\textrm{App}\). Another possible extension is the valuation of American options. Liang et al. (2021) discuss the application of deep learning methods to the valuation of early-exercisable derivatives such as American and Bermudan options, a topic frequently discussed in the literature. Since the Fourier cosine expansion (COS) method, see, for example, Fang and Oosterlee (2009, 2011), is known to approximate American option prices well, it is the leading candidate for \(C_\textrm{App}\) to obtain more accurate and stable ANN training. Moreover, building on the base approximation \(C_\textrm{App}\), we can apply the quasi-process correction with a simpler model to train the ANN to learn the prices of American and Bermudan options under a complex model. A comparison between the method of Liang et al. (2021) and ours with this approach is left for future study.