Accuracy of deep learning in calibrating HJM forward curves


We price European-style options written on forward contracts in a commodity market, which we model with an infinite-dimensional Heath–Jarrow–Morton (HJM) approach. For this purpose, we introduce a new class of state-dependent volatility operators that map the square integrable noise into the Filipović space of forward curves. For calibration, we specify a fully parametrized version of our model and train a neural network to approximate the true option price as a function of the model parameters. This neural network can then be used to calibrate the HJM parameters based on observed option prices. We conduct a numerical case study based on artificially generated option prices in a deterministic volatility setting. In this setting, we derive closed pricing formulas, allowing us to benchmark the neural network based calibration approach. We also study calibration in illiquid markets with a large bid-ask spread. The experiments reveal a high degree of accuracy in recovering the prices after calibration, even if the original meaning of the model parameters is partly lost in the approximation step.



  1. The code is implemented in TensorFlow 2.1.0 and is available at


  1. Andresen, A., Koekebakker, S., & Westgaard, S. (2010). Modeling electricity forward prices using the multivariate normal inverse Gaussian distribution. The Journal of Energy Markets, 3(3), 3.

  2. Barth, A., & Benth, F. E. (2014). The forward dynamics in energy markets—Infinite-dimensional modelling and simulation. Stochastics: An International Journal of Probability and Stochastic Processes, 86(6), 932–966.

  3. Barth, A., & Lang, A. (2012). Simulation of stochastic partial differential equations using finite element methods. Stochastics: An International Journal of Probability and Stochastic Processes, 84(2–3), 217–231.

  4. Barth, A., & Lang, A. (2012). Multilevel Monte Carlo method with applications to stochastic partial differential equations. International Journal of Computer Mathematics, 89(18), 2479–2498.

  5. Barth, A., Lang, A., & Schwab, C. (2013). Multilevel Monte Carlo method for parabolic stochastic partial differential equations. BIT Numerical Mathematics, 53(1), 3–27.

  6. Bayer, C., & Stemper, B. (2018). Deep calibration of rough stochastic volatility models. arXiv:1810.03399

  7. Bayer, C., Horvath, B., Muguruza, A., Stemper, B., & Tomas, M. (2019). On deep calibration of (rough) stochastic volatility models. arXiv:1908.08806

  8. Benth, F. E. (2015). Kriging smooth energy futures curves. Energy Risk.

  9. Benth, F. E., Benth, J. Š., & Koekebakker, S. (2008) Stochastic modelling of electricity and related markets (Vol. 11). World Scientific.

  10. Benth, F. E., & Koekebakker, S. (2008). Stochastic modeling of financial electricity contracts. Energy Economics, 30(3), 1116–1157.

  11. Benth, F. E., & Krühner, P. (2014). Representation of infinite-dimensional forward price models in commodity markets. Communications in Mathematics and Statistics, 2(1), 47–106.

  12. Benth, F. E., & Krühner, P. (2015). Derivatives pricing in energy markets: An infinite-dimensional approach. SIAM Journal on Financial Mathematics, 6(1), 825–869.

  13. Benth, F. E., & Paraschiv, F. (2018). A space-time random field model for electricity forward prices. Journal of Banking & Finance, 95, 203–216.

  14. Bühler, H., Gonon, L., Teichmann, J., & Wood, B. (2019). Deep hedging. Quantitative Finance, 19(8), 1271–1291.

  15. Carmona, R., & Nadtochiy, S. (2012). Tangent Lévy market models. Finance and Stochastics, 16(1), 63–104.

  16. Chataigner, M., Crépey, S., & Dixon, M. (2020). Deep local volatility. Risks, 8(3), 82.

  17. Clewlow, L., & Strickland, C. (2000). Energy derivatives: Pricing and risk management. London: Lacima Publications.

  18. Cuchiero, C., Khosrawi, W., & Teichmann, J. (2020). A generative adversarial network approach to calibration of local stochastic volatility models. Risks, 8(4), 101.

  19. Da Prato, G., & Zabczyk, J. (2014). Stochastic equations in infinite dimensions. Cambridge: Cambridge University Press.

  20. De Spiegeleer, J., Madan, D. B., Reyners, S., & Schoutens, W. (2018). Machine learning for quantitative finance: fast derivative pricing, hedging and fitting. Quantitative Finance, 18(10), 1635–1643.

  21. Engel, K.-J., & Nagel, R. (2006). A short course on operator semigroups. Berlin: Springer.

  22. Ferguson, R., & Green, A. (2018). Deeply learning derivatives. arXiv preprint arXiv:1809.02233

  23. Filipović, D. (2001). Consistency problems for Heath–Jarrow–Morton interest rate models. Lecture notes in mathematics (Vol. 1760). Springer.

  24. Filipović, D. (2009). Term-structure models. A graduate course. Berlin: Springer.

  25. Frestad, D. (2008). Common and unique factors influencing daily swap returns in the Nordic electricity market, 1997–2005. Energy Economics, 30(3), 1081–1097.

  26. Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. London: MIT Press.

  27. Gottschling, N. M., Antun, V., Adcock, B., & Hansen, A. C. (2020). The troublesome kernel: Why deep learning for inverse problems is typically unstable. arXiv preprint arXiv:2001.01258

  28. Heath, D., Jarrow, R., & Morton, A. (1992). Bond pricing and the term structure of interest rates: A new methodology for contingent claims valuation. Econometrica: Journal of the Econometric Society, 60(1), 77–105.

  29. Hernandez, A. (2016). Model calibration with neural networks. Available at SSRN 2812140.

  30. Higham, C. F., & Higham, D. J. (2019). Deep learning: An introduction for applied mathematicians. SIAM Review, 61(4), 860–891.

  31. Horvath, B., Muguruza, A., & Tomas, M. (2020). Deep learning volatility: A deep neural network perspective on pricing and calibration in (rough) volatility models. Quantitative Finance 1–17.

  32. Hutchinson, J. M., Lo, A. W., & Poggio, T. (1994). A nonparametric approach to pricing and hedging derivative securities via learning networks. The Journal of Finance, 49(3), 851–889.

  33. Kallsen, J., & Krühner, P. (2015). On a Heath–Jarrow–Morton approach for stock options. Finance and Stochastics, 19(3), 583–615.

  34. Kingma, D. P. & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980

  35. Koekebakker, S., & Ollmar, F. (2005). Forward curve dynamics in the Nordic electricity market. Managerial Finance, 31(6), 73–94.

  36. Kondratyev, Al. (2018). Learning curve dynamics with artificial neural networks. Available at SSRN 3041232.

  37. Kovács, M., Larsson, S., & Lindgren, F. (2010). Strong convergence of the finite element method with truncated noise for semilinear parabolic stochastic equations with additive noise. Numerical Algorithms, 53(2–3), 309–320.

  38. Nelson, C. R. & Siegel, A. F. (1987). Parsimonious modeling of yield curves. Journal of Business, 60(4), 473–489.

  39. Peszat, S., & Zabczyk, J. (2007). Stochastic partial differential equations with Lévy noise: An evolution equation approach (Vol. 113). Cambridge: Cambridge University Press.

  40. Rynne, B., & Youngson, M. A. (2013). Linear functional analysis. Berlin: Springer Science & Business Media.

  41. Tappe, S. (2012). Some refinements of existence results for SPDEs driven by Wiener processes and Poisson random measures. International Journal of Stochastic Analysis.


The authors would like to thank Vegard Antun for precious coding support and related advice, and Christian Bayer for useful discussions. The authors are also grateful to two anonymous referees for their valuable comments, which helped to improve the exposition of the paper with the goal of reaching a larger audience.

Author information



Corresponding author

Correspondence to Nils Detering.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Part of the project has been carried out during Silvia Lavagnini’s 3-month visit at UCSB, funded by the Kristine Bonnevie travel stipend 2019 from the Faculty of Mathematics and Natural Sciences (University of Oslo).


Appendix A: Proofs of the main results

We report in this section the proofs of the main results, in the order in which they appear in the paper.

A.1 Proof of Theorem 2.2

For every \(x\in \mathbb {R}_{+}\) and \(f \in \mathcal {H}_{\alpha }\), by the Cauchy–Schwarz inequality we can write that

$$\begin{aligned} \int _{\mathcal {O}} \left| \kappa _t (x,y,f) h(y) \right| \mathrm{d}y \le \left( \int _{\mathcal {O}} \kappa _t (x,y,f)^2 \mathrm{d}y \right) ^{1/2} \left( \int _{\mathcal {O}} h ( y)^2 \mathrm{d}y \right) ^{1/2} < \infty , \end{aligned}$$

where the last expression is finite since \(\kappa _t (x,\cdot ,f) \in \mathcal {H}\) for every \(x\in \mathbb {R}_+\) and every \(f \in \mathcal {H}_{\alpha }\) by Assumption 1, and since \(h\in \mathcal {H}\). Thus, \(\sigma _t(f) h\) is well defined for all \(h \in \mathcal {H}\).

We need to show that \(\sigma _t(f) h \in \mathcal {H}_{\alpha }\) for every \(f \in \mathcal {H}_{\alpha }\). We start by noticing that for every \(x\in \mathbb {R}_{+}\) the following equality holds:

$$\begin{aligned} \frac{\partial \sigma _t(f) h (x)}{\partial x} = \int _{\mathcal {O}} \frac{\partial \kappa _t (x,y,f) }{\partial x} h(y) \mathrm{d}y, \end{aligned}$$

where the differentiation under the integral sign is justified by dominated convergence, using Assumption 2 and the fact that \(\int _{\mathcal {O}} \bar{\kappa }_x (y) h(y) \mathrm{d}y < \infty\). Moreover, by Assumption 3 and the Cauchy–Schwarz inequality, we find that

$$\begin{aligned} \int _{\mathbb {R}_+} \left( \int _{\mathcal {O}} \frac{\partial \kappa _t (x,y,f) }{\partial x} h(y) \mathrm{d}y \right) ^2 \alpha (x) \mathrm{d}x \le \left\Vert h\right\Vert ^2 \int _{\mathbb {R}_+} \left\Vert \frac{\partial \kappa _t (x,\cdot ,f) }{\partial x}\right\Vert ^2 \alpha (x) \mathrm{d}x < \infty , \end{aligned}$$

which shows that \(\sigma _t(f) h\in \mathcal {H}_{\alpha }\) and that the operator \(\sigma _t(f)\) is bounded for each \(f \in \mathcal {H}_{\alpha }\).

A.2 Proof of Theorem 2.3

We first observe that for every \(h\in \mathcal {H}\) and every \(f, f_1\in \mathcal {H}_{\alpha }\), it holds that

$$\begin{aligned} \int _{\mathbb {R}_{+}}\int _{\mathbb {R}_{+}}\left| \frac{\partial \kappa _t (x,y,f) }{\partial x} f_1'(x)\alpha (x)h(y)\right| \mathrm{d}y\mathrm{d}x&= \int _{\mathbb {R}_{+}} | f_1'(x)\alpha (x) | \int _{\mathbb {R}_{+}} \left| \frac{\partial \kappa _t (x,y,f) }{\partial x} \right| \left| h(y)\right| \mathrm{d}y\mathrm{d}x \\&\le \int _{\mathbb {R}_{+}} | f_1'(x) | \alpha ^{1/2}(x) \left\Vert \frac{\partial \kappa _t (x,\cdot , f) }{\partial x} \right\Vert \alpha ^{1/2}(x) \left\Vert h\right\Vert dx \\&\le \left\Vert h\right\Vert \left\Vert f_1\right\Vert _{\alpha } \left( \int _{\mathbb {R}_{+}} \left\Vert \frac{\partial \kappa _t (x,\cdot ,f) }{\partial x} \right\Vert ^2\alpha (x) \mathrm{d}x\right) ^{1/2}, \end{aligned}$$

where we used the Cauchy–Schwarz inequality twice. By Assumption 3 this bound is finite, which allows us to apply Fubini's theorem and compute as follows:

$$\begin{aligned} \langle \sigma _t(f) h, f_1\rangle _{\alpha }&= f_1(0) \int _{\mathcal {O}}\kappa _t (0,y,f) h(y)dy +\int _{\mathbb {R}_{+}}\frac{\partial \sigma _t(f) h (x)}{\partial x} f_1'(x)\alpha (x)\mathrm{d}x\\&=f_1(0) \int _{\mathcal {O}}\kappa _t (0,y,f) h(y)\mathrm{d}y +\int _{\mathbb {R}_{+}} \int _{\mathcal {O}} \frac{\partial \kappa _t (x,y,f) }{\partial x} h(y) \mathrm{d}y f_1'(x)\alpha (x)\mathrm{d}x \\&= \int _{\mathcal {O}} \left( f_1(0)\kappa _t (0,y,f) +\int _{\mathbb {R}_+} \frac{\partial \kappa _t (x,y,f) }{\partial x} f_1'(x)\alpha (x) \mathrm{d}x \right) h(y) \mathrm{d}y\\&= \int _{\mathcal {O}}\sigma _t(f)^{*}f_1(y) h(y)\mathrm{d}y = \langle h, \sigma _t(f)^{*}f_1\rangle , \end{aligned}$$

for \(\sigma _t(f)^{*}f_1\) defined by

$$\begin{aligned} \sigma _t(f)^{*}f_1(y):= f_1(0)\kappa _t (0,y,f) +\int _{\mathbb {R}_{+}}\frac{\partial \kappa _t (x,y,f) }{\partial x} f_1'(x)\alpha (x)\mathrm{d}x = \langle \kappa _t (\cdot ,y,f), f_1\rangle _{\alpha }. \end{aligned}$$

From Rynne and Youngson (2013, Theorem 6.1), \(\sigma _t(f)^{*}\) is the unique adjoint operator of \(\sigma _t(f)\), for \(f \in \mathcal {H}_{\alpha }\).

A.3 Proof of Theorem 2.4

We start with the growth condition. For \(h\in \mathcal {H}\) and \(f_1\in \mathcal {H}_{\alpha }\), we can write that

$$\begin{aligned} \left\Vert \sigma _{t}(f_1)h\right\Vert ^2_{\alpha }&=\left( \sigma _{t}(f_1) h (0)\right) ^2 + \int _{\mathbb {R}_+} \left( \frac{\partial \sigma _t (f_1) h (x)}{\partial x} \right) ^2 \alpha (x) \mathrm{d}x \\&= \left( \int _{\mathcal {O}} \kappa _{t} (0,y,f_1) h(y) \mathrm{d}y \right) ^2 + \int _{\mathbb {R}_+} \left( \int _{\mathcal {O}} \frac{\partial \kappa _{t} (x,y, f_1) }{\partial x} h(y) \mathrm{d}y \right) ^2 \alpha (x) \mathrm{d}x \\&\le \left\Vert \kappa _{t} (0,\cdot , f_1) \right\Vert ^2 \left\Vert h\right\Vert ^2 + \int _{\mathbb {R}_+} \left\Vert \frac{\partial \kappa _{t} (x,\cdot , f_1) }{\partial x}\right\Vert ^2 \left\Vert h\right\Vert ^2 \alpha (x) \mathrm{d}x \\&\le C(t)^2 (1+ |f_1(0)|)^2 \left\Vert h\right\Vert ^2 + \int _{\mathbb {R}_+} C(t)^2 f_1'(x)^2 \left\Vert h\right\Vert ^2 \alpha (x) \mathrm{d}x\\&\le 2C(t)^2 (1+ \left\Vert f_1\right\Vert _{\alpha })^2 \left\Vert h\right\Vert ^2, \end{aligned}$$

where we have used the Cauchy–Schwarz inequality, together with the inequality \(\left|f_1(0)\right|\le \left\Vert f_1\right\Vert _{\alpha }\) and Assumption 2. With some abuse of notation, it follows that \(\left\Vert \sigma _{t}(f_1)\right\Vert _{\mathcal {L}(\mathcal {H}, \mathcal {H}_{\alpha })} \le C(t)(1+\left\Vert f_1\right\Vert _{\alpha })\) for a suitably chosen constant C(t). Similarly, from Assumption 1, it follows that

$$\begin{aligned} \left\Vert (\sigma _{t}(f_1)-\sigma _{t}(f_2))h\right\Vert ^2_{\alpha }&= \left( \int _{\mathcal {O}} \left( \kappa _{t} (0,y, f_1) - \kappa _{t} (0,y, f_2)\right) h(y) \mathrm{d}y \right) ^2 \\&\quad + \int _{\mathbb {R}_+} \left( \int _{\mathcal {O}} \left( \frac{\partial \kappa _{t} (x,y,f_1) }{\partial x} - \frac{\partial \kappa _{t} (x,y, f_2) }{\partial x} \right) h(y) \mathrm{d}y \right) ^2 \alpha (x) \mathrm{d}x \\&\le \left\Vert \kappa _{t} (0,\cdot , f_1) - \kappa _{t} (0,\cdot , f_2)\right\Vert ^2 \left\Vert h\right\Vert ^2 +\int _{\mathbb {R}_+} \left\Vert \frac{\partial \kappa _{t} (x,\cdot , f_1) }{\partial x}- \frac{\partial \kappa _{t} (x,\cdot , f_2) }{\partial x}\right\Vert ^2 \left\Vert h\right\Vert ^2 \alpha (x) \mathrm{d}x \\&\le C(t)^2 \left|f_1(0) -f_2(0)\right|^2 \left\Vert h\right\Vert ^2 + \int _{\mathbb {R}_+} C(t)^2 (f_1'(x)- f_2'(x) )^2 \left\Vert h\right\Vert ^2 \alpha (x) \mathrm{d}x\\&\le 2C(t)^2 \left\Vert f_1 -f_2\right\Vert _{\alpha }^2 \left\Vert h\right\Vert ^2, \end{aligned}$$

from which \(\left\Vert \sigma _{t}(f_1)-\sigma _{t}(f_2)\right\Vert _{\mathcal {L} (\mathcal {H}, \mathcal {H}_{\alpha })}\le C(t) \left\Vert f_1 -f_2\right\Vert _{\alpha }\) for a suitably chosen C(t), which proves the Lipschitz continuity of the volatility operator, and concludes the proof.

A.4 Proof of Proposition 2.5

For the volatility operator \(\sigma _t\) to be well defined, we need to check that the function \(\kappa _t\) introduced in Eq. (2.6) satisfies the assumptions of Theorem 2.2. We start by observing that \(\kappa _t(x,\cdot ) \in \mathcal {H}\) if and only if \(\omega \in \mathcal {H}\). Then we can calculate the derivative

$$\begin{aligned} \frac{\partial \kappa _t(x,y)}{\partial x} = a(t)e^{-bx}\left( \omega '(x-y)-b\omega (x-y) \right) , \end{aligned}$$

which, in particular, by Assumption 2 is bounded by

$$\begin{aligned} \left|\frac{\partial \kappa _t(x,y)}{\partial x}\right| \le a(t)e^{-bx} \bar{\omega }_x (y). \end{aligned}$$

For the \(\mathcal {H}\)-norm, we then have that

$$\begin{aligned} \left\Vert \frac{\partial \kappa _t(x,\cdot )}{\partial x} \right\Vert ^2 = \int _{\mathcal {O}}\left( \frac{\partial \kappa _t(x,y)}{\partial x}\right) ^2 \mathrm{d}y \le a(t)^2 e^{-2bx} C_1^2 < \infty , \end{aligned}$$

where we have used that \(\left\Vert \bar{\omega }_x\right\Vert \le C_1\), which implies that Assumption 3 in Theorem 2.2 is satisfied for \(\alpha\) such that \(\int _{\mathbb {R}_+} e^{-2bx} \alpha (x) \mathrm{d}x <\infty\). Finally, the Lipschitz condition is trivially satisfied, and the growth condition is fulfilled because a(t) is bounded.

A.5 Proof of Lemma 3.1

For w in Eq. (3.2), we get that \(w_{\ell }(v)= \frac{1}{\ell }\) and \(\mathcal {W}_{\ell }(u) = \frac{u}{\ell }\). Then

$$\begin{aligned} q_{\ell }^{w}(x, y) = \frac{1}{\ell }\left( \ell -y + x\right) \mathbb {I}_{[0,\ell ]}(y-x), \end{aligned}$$

and from Eqs. (3.4) and (3.6), we can write that

$$\begin{aligned} \mathcal {D}_{\ell }^w(g_t)(x) = g_t(x)+ \frac{1}{\ell }\int _{0}^{\infty }\left( \ell -y + x\right) \mathbb {I}_{[0,\ell ]}(y-x)g_t'(y)\mathrm{d}y = g_t(x)+ \frac{1}{\ell }\int _{x}^{x+\ell }\left( \ell -y + x\right) g_t'(y)\mathrm{d}y. \end{aligned}$$

Integration by parts gives the result.
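The integration-by-parts step reduces the right-hand side to the average \(\frac{1}{\ell }\int _x^{x+\ell } g_t(y)\mathrm{d}y\) of the curve over the delivery period, and this identity is easy to verify numerically. A minimal sketch, using a hypothetical smooth test curve \(g\) chosen purely for illustration:

```python
import numpy as np

def trap(vals, y):
    # simple trapezoidal rule (avoiding np.trapz, which was removed in NumPy 2.0)
    return float(np.sum((vals[1:] + vals[:-1]) * np.diff(y)) / 2)

def lhs(g, gprime, x, ell, n=20001):
    # g(x) + (1/ell) * int_x^{x+ell} (ell - y + x) g'(y) dy
    y = np.linspace(x, x + ell, n)
    return g(x) + trap((ell - y + x) * gprime(y), y) / ell

def rhs(g, x, ell, n=20001):
    # (1/ell) * int_x^{x+ell} g(y) dy: the average over the delivery period
    y = np.linspace(x, x + ell, n)
    return trap(g(y), y) / ell

# hypothetical smooth test curve, for illustration only
g = lambda y: np.exp(-0.5 * y) + 0.1 * y
gp = lambda y: -0.5 * np.exp(-0.5 * y) + 0.1

assert abs(lhs(g, gp, 1.0, 0.75) - rhs(g, 1.0, 0.75)) < 1e-6
```

The boundary term \((\ell +x-y)g_t(y)\) vanishes at \(y=x+\ell\) and cancels \(g_t(x)\) at \(y=x\), which is why only the average survives.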

A.6 Proof of Proposition 3.2

Let \(f := \mathcal {D}_{\ell }^{w*}\delta _{T_1-s}^*(1)\). We start by applying the covariance operator to \(h := \sigma _s(g_s)^*f:\)

$$\begin{aligned} \left( \mathcal {Q}\sigma _s(g_s)^*f\right) (x)&= \int _{\mathcal {O}}q(x,y)\sigma _s(g_s)^*f(y)\mathrm{d}y \\&= \int _{\mathcal {O}}q(x,y)\left\langle \kappa _s (\cdot ,y, g_s), f\right\rangle _{\alpha }\mathrm{d}y = \left\langle \int _{\mathcal {O}}q(x,y) \kappa _s (\cdot ,y, g_s)\mathrm{d}y, f\right\rangle _{\alpha }, \end{aligned}$$

where we used Theorem 2.3 and the linearity of the scalar product. Further, we apply \(\sigma _s(g_s)\):

$$\begin{aligned} \left( \sigma _s(g_s)\mathcal {Q}\sigma _s(g_s)^*f\right) (x)&= \int _{\mathcal {O}} \kappa _s(x,z, g_s) \left( \mathcal {Q}\sigma _s(g_s)^*f\right) (z) \mathrm{d}z \\&= \left\langle \int _{\mathcal {O}} \int _{\mathcal {O}}\kappa _s(x,z,g_s)q(z,y) \kappa _s (\cdot ,y, g_s)\mathrm{d}y \mathrm{d}z , f\right\rangle _{\alpha }= \left\langle \varPsi _s(x, \cdot ) , f\right\rangle _{\alpha }, \end{aligned}$$

for \(\varPsi _s(x, \cdot ) := \int _{\mathcal {O}} \int _{\mathcal {O}}\kappa _s(x,z, g_s)q(z,y) \kappa _s (\cdot ,y, g_s)\mathrm{d}y \mathrm{d}z\). We go now back to the definition of f:

$$\begin{aligned} \left( \sigma _s(g_s)\mathcal {Q}\sigma _s(g_s)^*\right) \left( \mathcal {D}_{\ell }^{w*}\delta _{T_1-s}^*(1)\right) (x)&= \left\langle \varPsi _s(x, \cdot ) , \mathcal {D}_{\ell }^{w*}\delta _{T_1-s}^*(1)\right\rangle _{\alpha } \\&= \left\langle \mathcal {D}_{\ell }^{w}\varPsi _s(x, \cdot ) ,\delta _{T_1-s}^*(1)\right\rangle _{\alpha } = \delta _{T_1-s}\left( \mathcal {D}_{\ell }^{w}\varPsi _s(x, \cdot )\right) =\left( \mathcal {D}_{\ell }^{w}\varPsi _s\right) (x, T_1-s). \end{aligned}$$

By Lemma 3.1, we can write that

$$\begin{aligned} \left( \mathcal {D}_{\ell }^{w}\varPsi _s\right) (x, T_1-s)&= \int _{\mathbb {R}_{+}}d_{\ell }(T_1-s, u)\varPsi _s(x,u)\mathrm{d}u \\&= \int _{\mathbb {R}_{+}}\int _{\mathcal {O}} \int _{\mathcal {O}}d_{\ell }(T_1-s, u)\kappa _s(x,z, g_s)q(z,y) \kappa _s (u,y, g_s)\mathrm{d}y \mathrm{d}z\mathrm{d}u, \end{aligned}$$

to which, finally, we apply the operator \(\delta _{T_1-s}\mathcal {D}_{\ell }^{w}\):

$$\begin{aligned}&\delta _{T_1-s}\mathcal {D}_{\ell }^{w}\left( \sigma _s(g_s)\mathcal {Q}\sigma _s(g_s)^*\right) \left( \mathcal {D}_{\ell }^{w*}\delta _{T_1-s}^*(1)\right) \\&\quad = \int _{\mathbb {R}_{+}}d_{\ell }(T_1-s, v)\left( \sigma _s(g_s)\mathcal {Q}\sigma _s(g_s)^*\right) \left( \mathcal {D}_{\ell }^{w*}\delta _{T_1-s}^*(1)\right) (v)\mathrm{d}v\\&\quad = \int _{\mathbb {R}_{+}}\int _{\mathbb {R}_{+}}\int _{\mathcal {O}} \int _{\mathcal {O}}d_{\ell }(T_1-s, v)d_{\ell }(T_1-s, u)\kappa _s(v,z, g_s)q(z,y) \kappa _s (u,y, g_s)\mathrm{d}y \mathrm{d}z\mathrm{d}u\mathrm{d}v, \end{aligned}$$

finalizing the proof.

A.7 Proof of Proposition 5.2

We consider the representation

$$\begin{aligned} \varSigma ^2_s = a^2\int _{\mathbb {R}_{+}}\int _{\mathbb {R}_{+}}e^{-bu}e^{-bv}d_{\ell }(T_1-s, u)d_{\ell }(T_1-s, v) \mathcal {A}(u,v)\mathrm{d}u \mathrm{d}v, \end{aligned}$$

where we have introduced

$$\begin{aligned} \mathcal {A}(u,v) := \int _{\mathbb {R}}\int _{\mathbb {R}}\omega (v-z)q(z,y)\omega (u-y)\mathrm{d}y \mathrm{d}z, \quad u,v\in \mathbb {R}_{+}. \end{aligned}$$

By applying integration by parts repeatedly, and using that \(\omega ''\) is null, we obtain

$$\begin{aligned} \mathcal {A}(u,v)&= \int _{\mathbb {R}}\omega (v-z)\left( \int _{\mathbb {R}}e^{-k|z-y|}\omega (u-y)\mathrm{d}y\right) \mathrm{d}z \nonumber \\&=\int _{\mathbb {R}}\omega (v-z)\left( \int _{-\infty }^ze^{-k(z-y)}\omega (u-y)\mathrm{d}y+\int _{z}^{\infty }e^{-k(y-z)}\omega (u-y)\mathrm{d}y\right) \mathrm{d}z\nonumber \\&= \frac{2}{k}\int _{\mathbb {R}}\omega (v-z)\omega (u-z)\mathrm{d}z. \end{aligned}$$

By substituting Eq. (A.2) into (A.1), we get that

$$\begin{aligned} \varSigma ^2_s&= \frac{2a^2}{k}\int _{\mathbb {R}}\int _{\mathbb {R}_{+}}\int _{\mathbb {R}_{+}}e^{-bu}e^{-bv}d_{\ell }(T_1-s, u)d_{\ell }(T_1-s, v) \omega (v-z)\omega (u-z)\mathrm{d}z\mathrm{d}u \mathrm{d}v\\&= \frac{2a^2}{k}\int _{\mathbb {R}}\left( \int _{\mathbb {R}_{+}}e^{-bu}d_{\ell }(T_1-s, u) \omega (u-z)\mathrm{d}u\right) ^2 \mathrm{d}z\\&= \frac{2a^2}{k\ell ^2}\int _{\mathbb {R}}\left( \int _{T_1-s}^{T_1-s+\ell }e^{-bu} \omega (u-z)\mathrm{d}u\right) ^2 \mathrm{d}z, \end{aligned}$$

where we used the definition of \(d_{\ell }\) in Lemma 3.1. By integration by parts, we get

$$\begin{aligned} \varSigma ^2_s&= \frac{2a^2}{k\ell ^2b^4}\int _{\mathbb {R}}\left( e^{-b(T_1-s)}\left( b\omega (T_1-s-z)+\omega '(T_1-s-z) \right) +\right. \\ {}&\quad \left. -e^{-b(T_1-s+\ell )}\left( b\omega (T_1-s+\ell -z)+\omega '(T_1-s+\ell -z) \right) \right) ^2 \mathrm{d}z\\&= \frac{2a^2}{k\ell ^2b^4}\left( e^{-2b(T_1-s)}\mathcal {B}_1(s) -2e^{-2b(T_1-s)}e^{-b(T_1-s+\ell )}\mathcal {B}_2(s)+e^{-2b(T_1-s+\ell )}\mathcal {B}_3(s)\right) , \end{aligned}$$

where we introduced

$$\begin{aligned} \mathcal {B}_1(s)&:= \int _{\mathbb {R}} \left( b\omega (T_1-s-z)+\omega '(T_1-s-z) \right) ^2\mathrm{d}z,\\ \mathcal {B}_2(s)&:= \int _{\mathbb {R}}\left( b\omega (T_1-s-z)+\omega '(T_1-s-z) \right) \left( b\omega (T_1-s+\ell -z)+\omega '(T_1-s+\ell -z) \right) \mathrm{d}z,\\ \mathcal {B}_3(s)&:=\int _{\mathbb {R}}\left( b\omega (T_1-s+\ell -z)+\omega '(T_1-s+\ell -z) \right) ^2 \mathrm{d}z. \end{aligned}$$

Using the definition of \(\omega\) in Eq. (5.3), we get that

$$\begin{aligned} \mathcal {B}_1(s)&= \int _{T_1-s-1}^{T_1-s+1} \left( b(1-|T_1-s-z|)-\mathrm {sgn}(T_1-s-z) \right) ^2\mathrm{d}z\\&= \int _{T_1-s-1}^{T_1-s} \left( b(1-T_1+s+z)-1 \right) ^2\mathrm{d}z+ \int _{T_1-s}^{T_1-s+1} \left( b(1+T_1-s-z)+1 \right) ^2\mathrm{d}z \\&= \frac{2}{3}\left( b^2+3\right) , \end{aligned}$$

where \(\mathrm {sgn}\) denotes the sign function. Similarly,

$$\begin{aligned} \mathcal {B}_2(s) = \frac{b^2}{6}\left( 3(\ell -2)\ell ^2+4\right) -3\ell +2, \quad \mathcal {B}_3(s) = \frac{2}{3}\left( b^2+3\right) . \end{aligned}$$

By substituting these findings and rearranging the terms, we get that

$$\begin{aligned} \varSigma ^2_s= & {} \frac{2a^2}{kb^4\ell ^2}\,e^{-2b(T_1-s)}\\&\cdot \left\{ \frac{2}{3}\left( b^2+3\right) \left( 1+e^{-2b\ell }\right) -2e^{-b\ell }\left( \frac{b^2}{6}\left( 3(\ell -2)\ell ^2+4 \right) -3\ell +2 \right) \right\} , \end{aligned}$$

which concludes the proof.

Appendix B: The non-injectivity issue

From the numerical experiments, we observe that the accuracy achieved in calibration is not particularly convincing, especially for the parameters governing the volatility and the covariance operator, a, b and k. Slightly better results were obtained for the Nelson–Siegel curve parameters \(\alpha _0\), \(\alpha _1\) and \(\alpha _3\), but not for \(\alpha _2\) (see Figs. 3, 6 and 8). On the other hand, the relative error of the price approximation after calibration shows a high degree of accuracy (Figs. 4, 7 and 9). We may conclude that the original meaning of the model parameters is lost in the approximation step. Indeed, as pointed out in Bayer and Stemper (2018), it is to be expected that the neural network is non-injective in the input parameters on a large part of the input domain. We briefly analyse this below.

The pricing formula (5.1), once the strike K and the time to maturity \(\tau\) are fixed, depends crucially on \(\xi\) and \(\mu (g_{t})\), derived in Proposition 5.3 and Eq. (5.6), respectively:

$$\begin{aligned} \varPi (t) = e^{-r(\tau -t)}\left\{ \xi \phi \left( \frac{\mu (g_t)-K}{\xi }\right) + \left( \mu (g_t)-K\right) \varPhi \left( \frac{\mu (g_t)-K}{\xi }\right) \right\} . \end{aligned}$$
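This Bachelier-type formula involves only the standard normal density \(\phi\) and cdf \(\varPhi\), so it can be evaluated with a few lines of code. A minimal sketch using the Python standard library, with placeholder parameter values rather than calibrated ones:

```python
from math import erf, exp, pi, sqrt

def option_price(mu, xi, K, r, tau, t):
    """e^{-r(tau-t)} * (xi*phi(d) + (mu-K)*Phi(d)) with d = (mu-K)/xi."""
    d = (mu - K) / xi
    phi = exp(-d * d / 2) / sqrt(2 * pi)   # standard normal density
    Phi = 0.5 * (1 + erf(d / sqrt(2)))     # standard normal cdf
    return exp(-r * (tau - t)) * (xi * phi + (mu - K) * Phi)

# at the money (mu == K) the price reduces to e^{-r(tau-t)} * xi / sqrt(2*pi)
p = option_price(mu=50.0, xi=2.0, K=50.0, r=0.01, tau=1.0, t=0.0)
assert abs(p - exp(-0.01) * 2.0 / sqrt(2 * pi)) < 1e-12
```

The at-the-money reduction checked by the assertion makes explicit that \(\xi\) sets the overall price scale, while \(\mu (g_t)-K\) drives the moneyness.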

However, \(\xi\) acts only as a scale, while \(\mu (g_{t})\) has more influence on the final price level, since it determines the distance from the strike price K. Let us first focus on \(\xi\):

$$\begin{aligned} \xi ^2 = \frac{a^2}{kb^5\ell ^2}\left( e^{-2b(T_1-\tau )}-e^{-2b(T_1-t)}\right) B(b^2, e^{-b\ell }), \end{aligned}$$

where \(B(b^2, e^{-b\ell })\) denotes a term depending only on \(b^2\) and \(e^{-b\ell }\). In the front coefficient, a decrease in a can be compensated, for example, by an increase in b or k, and vice versa, meaning that several combinations of values for a, b and k lead to the same overall \(\xi\). Thus, we may suspect that it is hard for the neural network to identify the right parameter vector despite reaching a good level of accuracy for the price.
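The degeneracy in the front coefficient can be made concrete: a and k enter \(\xi ^2\) only through the ratio \(a^2/k\), so the rescaling \(a \mapsto ca\), \(k \mapsto c^2 k\) leaves the price unchanged. A sketch, writing out \(B(b^2, e^{-b\ell })\) from the brace term in the proof of Proposition 5.2 (parameter values are illustrative only):

```python
from math import exp

def xi_squared(a, b, k, ell, T1, t, tau):
    # brace term from the closed-form variance in the proof of Proposition 5.2
    B = (2.0 / 3.0) * (b * b + 3.0) * (1.0 + exp(-2 * b * ell)) \
        - 2.0 * exp(-b * ell) * ((b * b / 6.0) * (3 * (ell - 2) * ell**2 + 4) - 3 * ell + 2)
    front = a * a / (k * b**5 * ell**2)
    return front * (exp(-2 * b * (T1 - tau)) - exp(-2 * b * (T1 - t))) * B

base = xi_squared(a=1.0, b=0.5, k=2.0, ell=1.0, T1=2.0, t=0.0, tau=1.0)
scaled = xi_squared(a=3.0, b=0.5, k=18.0, ell=1.0, T1=2.0, t=0.0, tau=1.0)  # a -> 3a, k -> 9k
assert abs(base - scaled) < 1e-12 * abs(base)
```

Any calibration routine, neural or classical, can at best recover \(a^2/k\) from the price alone; the individual values of a and k are unidentifiable.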

In Fig. 11, we report an example of non-injectivity with respect to the parameters a, b and k that we have observed in the grid-based learning approach. Here the neural network is not injective when all parameters except one are fixed, and it is only mildly sensitive to changes in the parameters. This also explains the difficulty in calibration.

Fig. 11

Examples of non-injectivity in the grid-based learning approach. In each image, only one of the parameters varies, while the others are fixed

Similar observations can be made for the drift:

$$\begin{aligned} \mu (g_t)= & {} \alpha _0+\frac{e^{-\alpha _3(T_1-t)}}{\alpha _3\ell } \left( \alpha _1+\alpha _2+\alpha _2\alpha _3(T_1-t) \right) +\\&-\frac{e^{-\alpha _3(T_1+\ell -t)}}{\alpha _3\ell }\left( \alpha _1+\alpha _2+\alpha _2\alpha _3(T_1+\ell -t) \right) . \end{aligned}$$

Here the role of \(\alpha _0\) is special, since it defines the starting level of the curve, and indeed \(\alpha _0\) is the parameter estimated with the best accuracy. However, \(\alpha _2\) appears first added to \(\alpha _1\) and then multiplied by \(\alpha _3\), making it hard for the neural network to disentangle its role. In the Nelson–Siegel curve in Eq. (5.5), \(\alpha _2\) determines the position of the “bump”, but the drift \(\mu (g_t)\) is obtained by integrating the curve over the delivery period of the contract. This integration smooths the curve and makes it hard to locate the “bump”, which might explain why the accuracy in estimating \(\alpha _2\) is worse than for the other Nelson–Siegel parameters.
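The smoothing effect of the delivery-period integration can be checked numerically: the closed form for \(\mu (g_t)\) is the average of the Nelson–Siegel curve over \([T_1-t, T_1+\ell -t]\). A sketch assuming the parametrization \(g_t(x) = \alpha _0 + (\alpha _1 + \alpha _2\alpha _3 x)e^{-\alpha _3 x}\), which is consistent with the closed form above but should be checked against Eq. (5.5); parameter values are illustrative:

```python
import numpy as np

def mu_closed(a0, a1, a2, a3, T1, ell, t):
    # closed-form drift as in the display above
    def F(u):  # e^{-a3*u}/(a3*ell) * (a1 + a2 + a2*a3*u)
        return np.exp(-a3 * u) / (a3 * ell) * (a1 + a2 + a2 * a3 * u)
    return a0 + F(T1 - t) - F(T1 + ell - t)

def mu_numeric(a0, a1, a2, a3, T1, ell, t, n=20001):
    # trapezoidal average of the curve over the delivery period
    x = np.linspace(T1 - t, T1 + ell - t, n)
    g = a0 + (a1 + a2 * a3 * x) * np.exp(-a3 * x)
    return float(np.sum((g[1:] + g[:-1]) * np.diff(x)) / (2 * ell))

params = dict(a0=30.0, a1=-5.0, a2=2.0, a3=0.8, T1=2.0, ell=1.0, t=0.5)
assert abs(mu_closed(**params) - mu_numeric(**params)) < 1e-6
```

Perturbing \(\alpha _2\) in this sketch shifts the integrand but is largely averaged out by the integral, mirroring the identifiability issue discussed above.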

We conclude the article with the following theorem showing that it is possible to construct ReLU neural networks which act as simple linear maps.

Theorem B.1

Let \(A \in \mathbb {R}^{p \times d}\). Then for any \(L \ge 2\) and any \(\mathbf {n}= (d, n_1, \ldots , n_{L-1}, p)\) with \(n_i \ge 2d\), \(i=1,\ldots , (L-1)\), there exists an L-layer ReLU neural network \(\mathcal {N} :\mathbb {R}^d \rightarrow \mathbb {R}^p\) with dimension \(\mathbf {n}\), which satisfies

$$\begin{aligned} \mathcal {N}(x) = Ax, \quad \text {for all } x \in \mathbb {R}^d. \end{aligned}$$


Proof We follow an approach similar to Gottschling et al. (2020, Section 8.5). Let \(\nu _i\ge 0\) be such that \(n_i = 2d+\nu _i\) for \(i=1, \dots ,(L-1)\). For \(I_d\) the identity matrix of dimension d, we define the following weights:

$$\begin{aligned}&V_1 := \begin{bmatrix} I_d&-I_d&O_1 \end{bmatrix}^{\top },\\&V_i := \begin{bmatrix} I_d&-I_d&O_i \end{bmatrix}^{\top } \begin{bmatrix} I_d&-I_d&O_{i-1} \end{bmatrix}, \quad i=2, \ldots , (L-1),\\&V_L := A \begin{bmatrix} I_d&-I_d&O_{L-1} \end{bmatrix}, \end{aligned}$$

where \(\top\) denotes transposition. Here \(O_i \in \mathbb {R}^{d\times \nu _i}\) are zero matrices that pad the dimensions so that \(V_i \in \mathbb {R}^{n_{i}\times n_{i-1}}\) for \(i=1,\ldots ,(L-1)\). Taking zero bias vectors \(v_i\), the linear maps \(H_i\) introduced in the neural network definition in Eq. (4.1) then coincide with the matrices \(V_i\).

We observe that for every \(x\in \mathbb {R}^d\), the ReLU activation function satisfies

$$\begin{aligned} x = \rho (x)-\rho (-x) = \begin{bmatrix} I_d&-I_d \end{bmatrix} \rho \left( \begin{bmatrix} I_d&-I_d \end{bmatrix}^{\top } x \right) , \end{aligned}$$

where the activation function acts componentwise. By a straightforward calculation, one can then see that the neural network defined here satisfies the equality \(\mathcal {N}(x) = Ax\) for every \(x\in \mathbb {R}^d\), which means that it acts on x as a linear map. \(\square\)

Theorem B.1 shows that one can construct a ReLU L-layer neural network that realizes a linear map. Since there are infinitely many non-injective linear maps (the zero map being a trivial example), it is therefore possible to construct infinitely many non-injective ReLU neural networks. Obviously, this does not show that a non-injective network, such as the one constructed in the proof of Theorem B.1, will also minimize the objective function used for training. It does suggest, however, that neural networks are unlikely to be injective in their input parameters.
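The construction in the proof is straightforward to reproduce numerically. A sketch in NumPy, with a random matrix A and random input chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)
d, p, L = 3, 2, 4                  # input dim, output dim, number of layers
nu = [1, 0, 2]                     # extra widths nu_i >= 0, so n_i = 2d + nu_i
A = rng.standard_normal((p, d))

def block(v):
    # [I_d  -I_d  O_v] of shape d x (2d + v)
    return np.hstack([np.eye(d), -np.eye(d), np.zeros((d, v))])

# weights V_1, ..., V_L as in the proof (zero biases)
V = [block(nu[0]).T]
for i in range(1, L - 1):
    V.append(block(nu[i]).T @ block(nu[i - 1]))
V.append(A @ block(nu[L - 2]))

def relu_net(x):
    h = x
    for W in V[:-1]:
        h = np.maximum(W @ h, 0.0)  # ReLU after every hidden layer
    return V[-1] @ h                # final layer is linear

x = rng.standard_normal(d)
assert np.allclose(relu_net(x), A @ x)
```

The hidden representation is always \([\rho (x); \rho (-x); 0]\), so the ReLU units never truncate any information, which is exactly why the network collapses to the linear map \(x \mapsto Ax\).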

Cite this article

Benth, F.E., Detering, N. & Lavagnini, S. Accuracy of deep learning in calibrating HJM forward curves. Digital Finance (2021).


Keywords

  • Heath–Jarrow–Morton approach
  • Infinite dimension
  • Energy markets
  • Option pricing
  • Neural networks
  • Model calibration

JEL classification

  • G13
  • C45
  • C63