
Abstract

American-type financial instruments are often priced with specific Monte Carlo techniques whose efficiency critically depends on the dimensionality of the problem and the available computational power. Our work proposes a novel approach for pricing Bermudan swaptions, well-known interest rate derivatives, using supervised learning algorithms. In particular, we link the price of a Bermudan swaption to its natural hedges, which include the underlying European swaptions, and other relevant financial quantities through supervised learning non-parametric regressions. We explore several algorithms, ranging from linear models to decision tree-based models and neural networks and compare their predictive performances. Our results indicate that all supervised learning algorithms are reliable and fast, with ridge regressor, neural networks, and gradient-boosted regression trees performing the best for the pricing problem. Furthermore, using feature importance techniques, we identify the most important driving factors of a Bermudan swaption price, confirming that the maximum underlying European swaption value is the dominant feature.


Notes

  1. Few data points in a large hyper-volume, i.e. most feature values are zero.

  2. We have considered only binary trees.

  3. Number of cycles through the full training set in the back-propagation algorithm.

References

  • Barraquand, J., & Martineau, D. (1995). Numerical valuation of high dimensional multivariate American securities. The Journal of Financial and Quantitative Analysis, 30(3), 383–405.

  • Becker, S., Cheridito, P., & Jentzen, A. (2020a). Deep optimal stopping. arXiv:1804.05394

  • Becker, S., Cheridito, P., & Jentzen, A. (2020b). Pricing and hedging American-style options with deep learning. Journal of Risk and Financial Management, 13, 158. https://doi.org/10.3390/jrfm13070158

  • Becker, S., Cheridito, P., Jentzen, A., & Welti, T. (2021). Solving high-dimensional optimal stopping problems using deep learning. European Journal of Applied Mathematics, 32(3), 470–514. https://doi.org/10.1017/s0956792521000073

  • Bloch, D. A. (2019). Option pricing with machine learning.

  • Breiman, L. (2001). Random forests. Machine Learning, 45, 5–32. https://doi.org/10.1023/A:1010933404324

  • Brigo, D., & Mercurio, F. (2006). Interest rate models: Theory and practice. Springer.

  • Cao, J., Chen, J., & Hull, J. (2019). A neural network approach to understanding implied volatility movements. Quantitative Finance, 20(9), 1405–1413.

  • Cao, J., Chen, J., Hull, J., & Poulos, Z. (2021). Deep learning for exotic option valuation. The Journal of Financial Data Science. https://doi.org/10.3905/jfds.2021.1.083

  • Carriere, J. F. (1996). Valuation of the early-exercise price for options using simulations and nonparametric regression. Insurance: Mathematics and Economics, 19(1), 19–30. https://doi.org/10.1016/S0167-6687(96)00004-2

  • Chen, Y., & Wan, J. W. L. (2019). Deep neural network framework based on backward stochastic differential equations for pricing and hedging American options in high dimensions. Quantitative Finance, 21(1), 45–67.

  • Dozat, T. (2016). Incorporating Nesterov momentum into Adam.

  • Egloff, D., Kohler, M., & Todorovic, N. (2007). A dynamic look-ahead Monte Carlo algorithm for pricing Bermudan options. The Annals of Applied Probability, 17(4), 1138–1171.

  • Ferguson, R., & Green, A. (2018). Deeply learning derivatives. arXiv:1809.02233 [q-fin.CP].

  • Friedman, J. H. (2002). Stochastic gradient boosting. Computational Statistics & Data Analysis, 38(4), 367–378.

  • Gaspar, R. M., Lopes, S. D., & Sequeira, B. (2020). Neural network pricing of American put options. Risks. https://doi.org/10.3390/risks8030073

  • Glasserman, P. (2003). Monte Carlo methods in financial engineering. Springer.

  • Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Journal of Machine Learning Research - Proceedings Track, 9, 249–256.

  • Goldberg, D. A., & Chen, Y. (2018). Beating the curse of dimensionality in options pricing and optimal stopping. arXiv:1807.02227 [math.PR].

  • Goudenège, L., Molent, A., & Zanette, A. (2019). Variance reduction applied to machine learning for pricing Bermudan/American options in high dimension. arXiv:1903.11275

  • Géron, A. (2017). Hands-on machine learning with Scikit-Learn and TensorFlow: Concepts, tools, and techniques to build intelligent systems. O’Reilly Media Inc.

  • Hagan, P. (2002). Adjusters: Turning good prices into great prices. Wilmott, 56–59.

  • Hastie, T., Tibshirani, R., & Friedman, J. (2001). The elements of statistical learning. Springer.

  • Hernandez, A. (2017). Model calibration: Global optimizer vs. neural network.

  • Hoencamp, J., Jain, S., & Kandhai, D. (2022). A semi-static replication approach to efficient hedging and pricing of callable IR derivatives.

  • Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural Networks, 4(2), 251–257.

  • Huge, B. N., & Savine, A. (2020). Differential machine learning.

  • Hull, J., & White, A. (1994). Numerical procedures for implementing term structure models I: Single-factor models. Journal of Derivatives, 2, 7–16.

  • Kobylanski, M., Quenez, M., & Rouy-Mironescu, E. (2011). Optimal multiple stopping time problem. The Annals of Applied Probability, 21(4), 1365–1399.

  • Kohler, M., Krzyżak, A., & Todorovic, N. (2010). Pricing of high-dimensional American options by neural networks. Mathematical Finance, 20(3), 383–410.

  • Lapeyre, B., & Lelong, J. (2020). Neural network regression for Bermudan option pricing. Monte Carlo Methods and Applications, 27(3), 227–247.

  • Lokeshwar, V., Bharadwaj, V., & Jain, S. (2022). Explainable neural network for pricing and universal static hedging of contingent claims.

  • Longstaff, F., & Schwartz, E. (1998). Valuing American options by simulation: A simple least-squares approach. Working paper, The Anderson School, UCLA.

  • Masters, D., & Luschi, C. (2018). Revisiting small batch training for deep neural networks.

  • Müller, A. C., & Guido, S. (2016). Introduction to machine learning with Python: A guide for data scientists. O’Reilly Media Inc.

  • Nesterov, Y. E. (1983). A method for solving the convex programming problem with convergence rate \(O(1/k^{2})\). Dokl. Akad. Nauk SSSR, 269, 543–547.


Acknowledgements

The authors gratefully acknowledge fruitful interactions with Prof. D. Galli at the Physics Department, Università degli Studi di Milano, and with colleagues at Intesa Sanpaolo, in particular F. Fogliani, who contributed in the early stages of this work.

Funding

The authors have not disclosed any funding.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Riccardo Aiolfi.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

Appendix A: Hull-White One Factor Model (G1++)

The G1++ model assumes that the instantaneous short-rate process evolves under the risk-neutral measure according to

$$\begin{aligned} dr(t)=[\vartheta (t)-a r(t)] d t+\sigma d W(t) \end{aligned}$$
(A1)

where a and \(\sigma \) are positive constants and \(\vartheta \) is chosen so as to exactly fit the term structure of interest rates being currently observed in the market.

For more details, see Brigo and Mercurio (2006).
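As a hedged illustration of Eq. (A1), the short rate can be simulated with a simple Euler-Maruyama scheme. The parameters and the flat \(\vartheta \) below are purely illustrative placeholders, not calibrated values from the paper:

```python
import numpy as np

# Euler-Maruyama sketch of the G1++ short rate in Eq. (A1):
#   dr(t) = [theta(t) - a r(t)] dt + sigma dW(t)
# r0, a, sigma and the constant theta are illustrative, not calibrated.
def simulate_g1pp(r0=0.01, a=0.1, sigma=0.01, theta=lambda t: 0.002,
                  T=1.0, n_steps=252, n_paths=1000, seed=0):
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    r = np.full(n_paths, r0)
    paths = [r.copy()]
    for i in range(n_steps):
        t = i * dt
        dw = rng.normal(0.0, np.sqrt(dt), n_paths)  # Brownian increments
        r = r + (theta(t) - a * r) * dt + sigma * dw
        paths.append(r.copy())
    return np.array(paths)  # shape (n_steps + 1, n_paths)

paths = simulate_g1pp()
```

In a real calibration, \(\vartheta (t)\) would be chosen to reproduce the observed discount curve rather than set to a constant.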

Appendix B: Supervised Learning Algorithms

We present a list of the supervised learning algorithms chosen in this work and their main characteristics and differences.

  • k-Nearest Neighbour (k-NN) This algorithm is arguably the simplest, yet, being non-parametric, i.e. making no assumptions about the dataset, it is widely used. The principle behind nearest-neighbour methods is to find a predefined number k of training samples closest in distance to the new point and to predict the label from them. Since there is only a data-storing phase and no training phase, it is well suited to small datasets (both in the number of features and in the number of samples); it is known not to work well on sparse data,Footnote 1 and the features must share the same scale, since absolute differences must carry the same weight. The label assigned to a query point is the mean of the labels of its nearest neighbours. The model mainly presents three important hyperparameters: the number of neighbours k, the metric used to evaluate the distance and the weights assigned to the neighbours to define their importance.

  • Linear Models Linear models are widely used in practice because they are very fast to train and to predict; they make a prediction using a linear function of the input features, i.e. the target value is expected to be a linear weighted combination of the features. Notice that linearity is a strong assumption and is not always respected, but it gives these models an easy interpretation. Training such a model means setting its parameters so that it best fits the training set. In general, linear models are very powerful with large datasets, especially if the number of features is huge (high-dimensional problems). There are many different linear models, and the difference between them lies in how the parameters are learned and how the model complexity can be controlled. We have considered the Linear Regression and two of its regularised versions, Ridge and Lasso Regression, where the regularization term is respectively the \(L^{2}\) and the \(L^{1}\) norm of the weight vector.

  • Support Vector Machine (SVM) Conceptually, an SVM uses a few significant data points (the support vectors) to define a corridor (a hyper-volume in higher dimensions) within which the greatest number of data points fall. In general, SVMs are effective in high dimensions, but they do not perform well on large datasets because of the longer training time. They have a hyperparameter that plays the same role as alpha in the linear models, limiting the importance of each support vector. SVMs are efficient also for non-linear problems thanks to a mathematical technique called the kernel trick; depending on the kernel used, additional hyperparameters are needed, but we must take into account that one of the biggest drawbacks of these algorithms is their high sensitivity to the hyperparameters.

  • Tree-based algorithms As the name suggests, these algorithms are based on simple decision trees. Like SVMs, decision trees are versatile and very powerful and, like k-NN, they are non-parametric. The goal of these algorithms is to create a model that predicts the value of a target variable by learning simple decision rules inferred from the data features. To build a tree, the algorithm searches over all possible tests (subdivisions of the training set) and finds the one that is most informative about the target variable. This recursive process yields a binary tree,Footnote 2 with each node containing a test, and is repeated until each region in the partition contains only a single target value. A prediction on a new data point is made by checking which partition of the feature space the point lies in; the output is the mean target of the training points in that leaf. One of the main qualities of decision trees is that they require very little data preparation; moreover, they are very fast at prediction and are called white-box models because they are easily interpretable. Typically, building a tree until all leaves are pure leads to models that are very complex and highly overfit to the training data, and therefore generalize poorly. The most common way to prevent overfitting is pre-pruning, which stops the creation of the tree early. Possible pre-pruning criteria include limiting the maximum depth of the tree, limiting the maximum number of leaves, and others, which make decision trees highly dependent on their numerous hyperparameters. Moreover, they have two main problems: the first is the inability to extrapolate, i.e. to make predictions outside the training range, while the second is that they are unstable with respect to small variations in the training set. This last problem is solved with decision tree ensembles: Random Forest (RF) and Gradient Boosted Regression Tree (GBRT).

    Ensembles are methods that combine multiple supervised models to create a more powerful one. They are based on the idea that the aggregation of the predictions of a group of models will often give better results than the best individual predictor. One way to obtain a group of predictors is to use the same algorithm for every model and train them on different random subsets of the training set; when sampling is performed with replacement, this method is called bagging, otherwise it is called pasting. Generally, the net result is that the ensemble has a similar bias but a lower variance than a single predictor trained on the original training set. Below we briefly describe the two ensemble methods considered.

    • A RF is an ensemble of decision trees, generally trained via the bagging method (Breiman, 2001). The idea behind random forests is that each decision tree will likely overfit on a specific part of the data, but if we build many trees that overfit in different ways, we can reduce the amount of overfitting by averaging their results. The RF gets its name from injecting randomness into the tree building in two ways: through the bagging method and by selecting a random subset of the features in each split test. In summary, bootstrap sampling means each decision tree in the RF is built on a slightly different dataset and, due to the feature selection in each node, each split in each tree operates on a different subset of features. Together, these two mechanisms ensure that all the trees in the RF are different. Essentially, the RF shares all the pros of the decision tree while making up for some of its deficiencies; it also has practically all its hyperparameters, with the addition of a new one regulating the number of trees, for which larger values are always better, because averaging more trees yields a more robust ensemble by reducing overfitting.

    • GBRT (Friedman, 2002) belongs to the more general boosting family, in which predictors are trained sequentially, each trying to correct its predecessor. By default there is no randomization in gradient-boosted decision trees; instead, strong pre-pruning is used. GBRT often uses very shallow trees, which makes the model smaller in terms of memory and predictions faster. Each tree can only provide good predictions on part of the data, so more and more trees are added to iteratively improve performance. This method shares the same hyperparameters as RF, with the addition of the learning rate but, in contrast to RF, increasing the number of predictors leads to a more complex model. The learning rate and the number of estimators are highly interconnected, as a lower rate means more trees are needed to build a model of similar complexity, so there is a trade-off between them. Like other tree-based models, GBRT works well without scaling but often does not work well on high-dimensional sparse data. Its main drawback is that it requires careful tuning of the hyperparameters and may take a long time to train.

  • Artificial Neural Networks (ANN) or Multi-Layer Perceptron (MLP) They can be understood as a large set of simpler units, called neurons, connected in some way and organized in layers. An ANN is composed of one input layer, one or more hidden layers and one final output layer. To understand the functioning of the network, consider a single neuron: the inputs and the output are numbers, and each input connection is associated with a weight. The artificial neuron computes a weighted sum of its inputs and then applies a non-linear transformation, called the activation function. In some way, ANNs can be viewed as generalizations of linear models that perform multiple stages of processing to come to a decision. The key point of ANNs is the algorithm used to train them, called back-propagation; in simple terms, it is gradient descent using an efficient technique for computing the gradients automatically. In conclusion, ANNs are typically black-box models defined by a set of weights; they take some variables as input and adjust the values of the weights so that they return the desired target. Given enough computation time, data and careful tuning of the hyperparameters, ANNs are the most powerful and scalable machine learning models. The real difficulty in implementing a suitable model lies in the enormous number of hyperparameters that regulate the complexity of the network. Both the number of hidden layers and the number of neurons in each layer can affect the performance of an ANN, but many other hyperparameters also need to be optimized for acceptable results. In general, choosing the exact network architecture for an ANN remains an art that requires extensive numerical experimentation and intuition, and is often problem-specific.
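The model families above can all be exercised through the same scikit-learn interface. The sketch below is only illustrative: the dataset is a synthetic stand-in for the swaption data, and the hyperparameter values are placeholders, not the tuned values reported in Appendix D:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import Ridge
from sklearn.svm import SVR
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic 7-feature regression task standing in for the swaption dataset.
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(600, 7))
y = X[:, 0] ** 2 + np.sin(3 * X[:, 1]) + 0.05 * rng.normal(size=600)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# One representative of each family; settings are illustrative placeholders.
models = {
    "k-NN": KNeighborsRegressor(n_neighbors=4),
    "Ridge": Ridge(alpha=0.01),
    "SVM": SVR(kernel="rbf", C=100, gamma=0.1),
    "RF": RandomForestRegressor(n_estimators=100, random_state=0),
    "GBRT": GradientBoostingRegressor(random_state=0),
    "ANN": MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=2000,
                        random_state=0),
}
rmse = {name: mean_squared_error(y_te, m.fit(X_tr, y_tr).predict(X_te)) ** 0.5
        for name, m in models.items()}
```

The shared `fit`/`predict` interface is what makes the side-by-side comparison in the paper straightforward to organise.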

Appendix C: Error Metrics

We present a list of the metrics implemented and their main characteristics and differences.

  • MAE It measures the average magnitude of the errors in a set of predictions, without considering their direction. It is the average over the n samples of the absolute differences between target \(y_{i}\) and prediction \({\hat{y}}_{i}\), where all individual differences have equal weight. In formula:

    $$\begin{aligned} MAE:= \frac{1}{n} \sum _{i=1}^{n}\left|y_{i}-{\hat{y}}_{i}\right|\end{aligned}$$
    (C2)
  • MAPE Instead of the absolute error, MAPE uses the relative error to present the result. It is defined as

    $$\begin{aligned} MAPE:= \frac{1}{n} \sum _{i=1}^{n}\left|\frac{y_{i}-{\hat{y}}_{i}}{y_{i}}\right|\end{aligned}$$
    (C3)

    MAPE is also sometimes reported as a percentage, which is the above equation multiplied by 100.

  • WAPE It normalises the total absolute error by the total absolute value of the targets, so the error is measured relative to the overall scale of the real values. In formula:

    $$\begin{aligned} WAPE:= \frac{\sum _{i=1}^{n}|y_{i}-{\hat{y}}_{i}\vert }{\sum _{i=1}^{n}|y_{i}|} \end{aligned}$$
    (C4)

    WAPE is also sometimes reported as a percentage, which is the above equation multiplied by 100.

  • RMSE It represents the square root of the second sample moment of the differences between predicted values and real values. In formula:

    $$\begin{aligned} RMSE:=\sqrt{\frac{1}{n} \sum _{i=1}^{n} \left( y_{i}-{\hat{y}}_{i}\right) ^{2}} \end{aligned}$$
    (C5)
  • RMSRE It is defined as

    $$\begin{aligned} RMSRE:= \sqrt{\frac{1}{n} \sum _{i=1}^{n} \left( \frac{y_{i}-{\hat{y}}_{i}}{y_{i}}\right) ^{2}} \end{aligned}$$
    (C6)

    RMSRE is also sometimes reported as a percentage, which is the above equation multiplied by 100.

  • RRMSE Similarly to WAPE, it takes the total squared error and normalises it by the total squared value of the targets. Taking the square root reduces the error to the same units as the quantity being predicted. In formula:

    $$\begin{aligned} RRMSE:= \sqrt{ \frac{\sum _{i=1}^{n} \left( y_{i}-{\hat{y}}_{i}\right) ^{2}}{\sum _{i=1}^{n} \left( y_{i}\right) ^{2}}} \end{aligned}$$
    (C7)

    RRMSE is also sometimes reported as a percentage, which is the above equation multiplied by 100.

Both MAE and RMSE express the average model prediction error in units of the variable of interest; they range from 0 to \(\infty \) and are indifferent to the direction of errors. They are negatively-oriented scores, meaning that lower values are better. Since the errors are squared before they are averaged, the RMSE gives a relatively high weight to large errors; this makes the RMSE more useful when large errors are particularly undesirable, and in general RMSE is higher than or equal to MAE. Note that RMSRE and RRMSE are completely analogous to MAPE and WAPE, with the absolute value replaced by the square. RMSRE and MAPE are the relative versions of RMSE and MAE respectively and are considered in this context because, for example, an error of 100 EUR out of 200 EUR is worse than an error of the same amount out of 2000 EUR. However, they have some drawbacks: they are undefined for data points where the target value is 0, and they can grow unexpectedly large if the actual values are exceptionally small. To avoid these problems, an arbitrarily small term is usually added to the denominator. Moreover, they are asymmetric, putting a heavier penalty on negative errors (when forecasts are higher than the targets) than on positive ones. To solve these problems, RRMSE and WAPE are introduced; they are particularly recommended when the number of samples is low or the values are on different scales.
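The six metrics transcribe directly into NumPy; the code below is a one-to-one rendering of Eqs. (C2)-(C7) on a tiny hypothetical example, with `y` the targets and `p` the predictions:

```python
import numpy as np

# Direct NumPy transcriptions of Eqs. (C2)-(C7).
def mae(y, p):   return np.mean(np.abs(y - p))
def mape(y, p):  return np.mean(np.abs((y - p) / y))
def wape(y, p):  return np.sum(np.abs(y - p)) / np.sum(np.abs(y))
def rmse(y, p):  return np.sqrt(np.mean((y - p) ** 2))
def rmsre(y, p): return np.sqrt(np.mean(((y - p) / y) ** 2))
def rrmse(y, p): return np.sqrt(np.sum((y - p) ** 2) / np.sum(y ** 2))

# Toy illustrative values (e.g. prices in EUR), not data from the paper.
y = np.array([100.0, 200.0, 400.0])
p = np.array([110.0, 190.0, 420.0])
```

Note that `mape` and `rmsre` fail when any target is zero, which is exactly the drawback discussed above.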

Appendix D: Hyperparameters Tuning

We report the analysis for the selection of the best hyperparameters for all the algorithms implemented.

D.1 k-Nearest Neighbour

The model mainly presents three important hyperparameters: the number of neighbours k, the metric used to evaluate the distance and the weights assigned to the neighbours to define their importance. In the construction of this algorithm, we considered the Euclidean distance and assigned uniform weights to all the neighbours, as other choices worsened the cross-validation performance considerably. The only hyperparameter left to fix is, therefore, the number of neighbours. Its value is strictly linked to the complexity of the model: increasing the number of neighbours averages the prediction over more data points, making the model less tied to the peculiarities of the training set and therefore simpler. Figure 6 shows the trend of the RMSE evaluated on the training and validation sets as a function of the number of neighbours. Each point shown is the average value, with its error, of the RMSE obtained with 5-fold cross-validation. With a single neighbour the prediction on the training set is perfect, but as more neighbours are considered the model becomes simpler: the training error increases while the validation error drops. The optimal number of neighbours is 4, as the validation error starts to increase from that value onwards and, since its gap with the training error is small, we clearly avoid overfitting the training set.

Fig. 6

RMSE trend of training (blue) and validation (orange) of k-NN as a function of the number of neighbors (k). For each value of k the mean value of the 5-fold cross-validation is reported with the respective error
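A sketch of the neighbour-count scan with 5-fold cross-validation follows; the dataset here is a synthetic stand-in for the swaption data, so the resulting optimum need not be 4:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data: 7 features, smooth target plus noise.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(400, 7))
y = X[:, 0] + 0.1 * rng.normal(size=400)

# Mean 5-fold cross-validation RMSE for each candidate k.
cv_rmse = {}
for k in range(1, 11):
    scores = cross_val_score(KNeighborsRegressor(n_neighbors=k), X, y,
                             scoring="neg_root_mean_squared_error", cv=5)
    cv_rmse[k] = -scores.mean()
best_k = min(cv_rmse, key=cv_rmse.get)
```

Plotting `cv_rmse` against k reproduces the qualitative shape of Fig. 6: the validation error first drops as the model gets simpler, then rises again.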

D.2 Linear Models

Among the linear models used, we found the Ridge to be the most promising for our problem. Considering the possibility of adding polynomial features, the algorithm has two hyperparameters: alpha, the magnitude of the regularization, and degree, the maximum degree of the polynomial features. We report in Fig. 7 a grid search on these hyperparameters to find the optimal combination; the mean of the 5-fold cross-validation RMSE is reported for each pair of values. As can be deduced from Fig. 7, the hyperparameters that return the lowest value of the evaluation metric on the validation set are \(\texttt{alpha} = 0.01\) and \(\texttt{degree} = 6\), and consequently they were chosen as the optimal parameters. Increasing the maximum degree gives a great improvement in performance, but with a degree that is too high the model starts to generalize worse: polynomials of high degree fit the training set extremely well but have poor predictive power on the validation set.

Fig. 7

Heat-map of mean 5-fold cross-validation RMSE of Ridge Regression as a function of \(\alpha \) and degree. Only the RMSE mean value on the validation set is reported
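The alpha/degree grid search can be sketched with a scikit-learn pipeline; the data and the (smaller) grid below are illustrative placeholders rather than the paper's configuration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with a genuinely polynomial target.
rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(300, 3))
y = X[:, 0] ** 3 - X[:, 1] * X[:, 2] + 0.05 * rng.normal(size=300)

# Pipeline: expand polynomial features, standardise, then Ridge-regress.
pipe = make_pipeline(PolynomialFeatures(), StandardScaler(), Ridge())
grid = GridSearchCV(
    pipe,
    {"polynomialfeatures__degree": [1, 2, 3, 4],
     "ridge__alpha": [0.01, 0.1, 1.0]},
    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X, y)
```

`grid.cv_results_` holds the per-pair mean validation RMSE, i.e. exactly the quantity shown in the Fig. 7 heat-map.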

D.3 Support Vector Machines

The hyperparameter that performs the same task as alpha in the linear models is called C; it is a regularizing parameter that limits the importance of each support vector, and the strength of the regularization is inversely proportional to C. In general, support vector machines are really effective in high dimensions, but they do not perform well on large datasets because the required training time is higher. SVMs are efficient also for non-linear problems thanks to a mathematical technique called the kernel trick, which allows us to map our data into a higher-dimensional space. In our work, we implemented the Gaussian radial basis function kernel, which considers all possible polynomials of all degrees, where the importance of the features decreases for higher degrees. This introduces an additional regularizing hyperparameter, gamma. To obtain the best hyperparameters, we performed a grid search on C and gamma; in Fig. 8 we report the average values of the RMSE obtained from 5-fold cross-validation for each pair. As can be deduced from Fig. 8, the hyperparameters that return the lowest value of the evaluation metric on the validation set are \(\texttt{C}=100\) and \(\texttt{gamma}=0.1\), and consequently they were chosen as the optimal parameters. In fact, the same metric value obtained for \(\texttt{C}=100\) is also obtained with \(\texttt{C}=1000\), but since a greater value of C means a more complex model and a greater probability of overfitting, the lower value was chosen.

Fig. 8

Heat-map of mean 5-fold cross-validation RMSE of SVM with a radial basis function kernel as a function of C and gamma. Only the RMSE mean value on the validation set is reported
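The C/gamma grid for an RBF-kernel SVR can be sketched as follows; the synthetic data and grid values are placeholders, and features are standardised first, as SVMs are scale-sensitive:

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data.
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(300, 5))
y = np.sin(2 * X[:, 0]) + 0.1 * rng.normal(size=300)

# Grid over the regularization C and the RBF width gamma.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVR(kernel="rbf")),
    {"svr__C": [1, 10, 100, 1000], "svr__gamma": [0.01, 0.1, 1.0]},
    scoring="neg_root_mean_squared_error", cv=5)
grid.fit(X, y)
best_C = grid.best_params_["svr__C"]
```

When two values of C tie, preferring the smaller one, as done in the text, keeps the model simpler.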

D.4 Decision Tree

Since decision trees work well on data with features on completely different scales, we decided not to apply any transformation to our data. As expected, however, without pruning the algorithm tends to overfit: it builds a tree 22 levels deep with 3472 leaves, exactly the number of samples in our training set. After a phase of analysis and study of the various hyperparameters, we found that the parameters that most influence the performance of our tree are the maximum reachable depth (max_depth) and the minimum number of samples required at a leaf node (min_samples_leaf). We then carried out a grid search on these hyperparameters; in Fig. 9 we report the trend of the RMSE as a function of them. Each point shown is the average value, with its error, of the RMSE obtained with 5-fold cross-validation. As can be deduced from Fig. 9, the optimal values of the hyperparameters are \(\mathtt{min\_samples\_leaf} = 3\), as it returns the lowest value of the RMSE on the validation set, and \(\mathtt{max\_depth} = 11\), because for higher values the metric remains approximately constant while the error on the training set is larger, so the possibility of overfitting is lower.

Fig. 9

RMSE trend on training (blue) and validation (orange) set of the decision tree as a function of max_depth (left) and min_samples_leaf (right). For each of the hyperparameters, the average value of RMSE obtained with 5-fold cross-validation is reported with the respective error
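The unpruned-tree diagnosis and the pre-pruning grid can both be sketched as below, on synthetic stand-in data (so the leaf count and optimal grid point differ from the paper's):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with continuous noisy targets.
rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(500, 7))
y = np.where(X[:, 0] > 0, X[:, 1], -X[:, 1]) + 0.05 * rng.normal(size=500)

# Without pruning, the tree grows until every leaf is pure: with continuous
# targets this means roughly one leaf per training sample.
unpruned = DecisionTreeRegressor(random_state=0).fit(X, y)

# Pre-pruning grid over max_depth and min_samples_leaf.
grid = GridSearchCV(DecisionTreeRegressor(random_state=0),
                    {"max_depth": [3, 5, 7, 11, None],
                     "min_samples_leaf": [1, 3, 5]},
                    scoring="neg_root_mean_squared_error", cv=5).fit(X, y)
```

Comparing `unpruned.get_n_leaves()` with the pruned winner makes the overfitting of the unconstrained tree explicit.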

D.5 Random Forest

Like decision trees, random forests work well with features on completely different scales, therefore we have not applied any transformation to our data. It is known that the most important hyperparameters are max_features, i.e. the number of features to consider when looking for the best split, and max_depth, i.e. the maximum depth of the trees. We therefore performed a grid search on these hyperparameters; in Fig. 10 we report the trend of the RMSE as a function of them. Each point shown is the average value, with its error, of the RMSE obtained with 5-fold cross-validation. As can be deduced from Fig. 10, the optimal hyperparameters are \(\mathtt{max\_depth} = 17\), because for higher values the metric remains approximately constant, and \(\mathtt{max\_features} = log2\), meaning that max_features equals the base-2 logarithm of the number of features. Once these values were set, we searched for the optimal number of trees. In Fig. 11 we report the trend of the RMSE as a function of the number of trees in the forest. A larger n_estimators is always better, but the training time increases considerably. As can be seen from Fig. 11, we chose 500 decision trees, as increasing this number further does not provide any improvement in performance.

Fig. 10

RMSE trend on training (blue) and validation (orange) set of the random forest as a function of max_depth (left) and max_features (right). For each of the hyperparameters, the average value of RMSE obtained with 5-fold cross-validation is reported with the respective error. The terms sqrt and log2 mean that max_features equals, respectively, the square root and the base-2 logarithm of the number of features; auto means that all features are used, hence no randomness in selecting features

Fig. 11

RMSE trend on training (blue) and validation (orange) set of random forest with \(\mathtt{max\_depth}= 17\) and \(\mathtt{max\_features} = log2\) as a function of the number of n_estimators. For each of the hyperparameters the average value of RMSE obtained with 5-fold cross-validation is reported with the respective error
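The final configuration selected above can be evaluated as in the following sketch; the dataset is again a synthetic stand-in, so only the mechanics (not the numbers) carry over:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data.
rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(400, 7))
y = X[:, 0] * X[:, 1] + 0.05 * rng.normal(size=400)

# The tuned configuration reported in this subsection.
rf = RandomForestRegressor(n_estimators=500, max_depth=17,
                           max_features="log2", random_state=0)
scores = cross_val_score(rf, X, y, scoring="neg_root_mean_squared_error", cv=5)
cv_rmse = -scores.mean()
```

Because the forest averages bootstrapped trees, raising `n_estimators` past the plateau only increases training time, as Fig. 11 shows.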

D.6 Gradient Boosted Regression Tree

This method shares the same hyperparameters as a random forest with the addition of the learning rate (learning_rate), which controls how strongly each tree tries to correct the mistakes of the previous trees; a higher learning rate means each tree can make stronger corrections, allowing for more complex models. In contrast to the random forest, where a higher number of predictors (n_estimators) is always better, increasing it in gradient boosting leads to a more complex model. The learning_rate and n_estimators are highly interconnected, as a lower rate means more trees are needed to build a model of similar complexity; generally, there is a trade-off between these two hyperparameters. Their main drawback is that they require careful tuning of hyperparameters and may take a long time to train. Furthermore, the number of hyperparameters to be set is high as each of them has a great influence on the performance. The first two hyperparameters studied are those considered most decisive and specifically, the learning rate and the number of trees. Figure 12 shows the heat map of the grid search on them where the average values of the RMSE obtained from 5-fold cross-validation for each pair are reported. From Fig. 12, it can be clearly seen that the learning rate has the greatest impact on performance and the pair \(\mathtt{learning\_rate}=0.1\) and \(\mathtt{n\_estimators}=1000\) has been selected as optimal hyperparameters. Like the random forest, the other two essential parameters to avoid overfitting our model are the maximum depth of the simple predictors (max_depth) and the number of features used for each split (max_features). In a totally similar way to before, we performed a grid search also on these hyperparameters in order to select the optimal ones. From Fig. 13, it can be noted that the factor that influences the performance the most is max_depth; the pair \(\mathtt{max\_depth}=5\) and \(\mathtt{max\_features}=log2\) are the optimal values. 
Although to a much smaller extent than the previous hyperparameters, we observed that the minimum number of samples required at a leaf node (min_samples_leaf) slightly influences the performance of the gradient boosted regression tree. In Fig. 14 we report the RMSE as a function of this hyperparameter; each point shown is the mean RMSE obtained with 5-fold cross-validation, with its error. As can be seen from Fig. 14, the RMSE on the validation set slightly decreases up to \(\mathtt{min\_samples\_leaf}=6\) while the training value increases; beyond this value, the validation metric increases again. For this reason, we have fixed \(\mathtt{min\_samples\_leaf}=6\).
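The grid searches with 5-fold cross-validation described in this appendix follow the standard scikit-learn pattern; a minimal sketch on synthetic stand-in data (the data, grid values, and tree counts below are illustrative, not those of the paper):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 7-feature dataset used in the paper.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 7))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

# Grid over the two most decisive hyperparameters (illustrative values).
param_grid = {"learning_rate": [0.05, 0.1], "n_estimators": [50, 100]}
search = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid,
    cv=5,                                   # 5-fold cross-validation
    scoring="neg_root_mean_squared_error",  # mean RMSE per grid point
)
search.fit(X, y)
best = search.best_params_
```

The `cv_results_` attribute of the fitted search holds the per-fold scores from which heat maps such as Figs. 12 and 13 can be drawn.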

Fig. 12

Heat-map of mean 5-fold cross-validation RMSE of GBRT as a function of learning_rate and n_estimators. Only the RMSE mean value on the validation set is reported

Fig. 13

Heat-map of mean 5-fold cross-validation RMSE of GBRT with \(\mathtt{learning\_rate}=0.1\) and \(\mathtt{n\_estimators}=1000\) as a function of max_depth and max_features. Only the RMSE mean value on the validation set is reported

Fig. 14

RMSE trend on the training (blue) and validation (orange) sets of GBRT with \(\mathtt{learning\_rate}=0.1\), \(\mathtt{n\_estimators}=1000\), \(\mathtt{max\_depth}=5\) and \(\mathtt{max\_features}=log2\) as a function of min_samples_leaf. For each value of the hyperparameter, the mean RMSE obtained with 5-fold cross-validation is reported with its error

D.7 Artificial Neural Networks

In general, the design of the input and output layers of a network is straightforward, because they must adapt to the dataset and to the problem. For multivariate regression, one output neuron per output dimension is needed; since our aim is to produce a single value, a single output neuron suffices. The number of neurons in the input layer is instead determined by the number of inputs of the problem, that is, by the number of features; in our case the input layer is therefore composed of 7 neurons, one for each feature of the dataset. These neurons have the sole task of passing the input values to the hidden layers without applying any transformation. In regression problems the activation function is usually the ReLU (or one of its variants); it is applied to all hidden neurons but not to the output one, which is thus free to output any range of values. The loss function used during training is typically the MSE. To make our training faster and more stable, we have scaled all the features and also the target by removing the mean and scaling to unit variance; centring and scaling are applied independently to each feature, with the relevant statistics computed on the samples in the training set. The back-propagation algorithm suffers from a problem called vanishing/exploding gradient, which consists of very unstable gradients. One way to alleviate it is to use the appropriate weight initialization for each activation function (Glorot & Bengio, 2010); for this reason, we select the He normal initialization in combination with the ReLU. The random initialization of the weights is fundamental because it breaks the symmetry and allows back-propagation to train a diverse team of neurons. Another fundamental element of neural networks is the batch size, i.e. the size of the groups of instances used in the back-propagation algorithm; it can have a significant impact on model performance and training time. Typically small batches are preferable, because they lead to better models in less training time (Masters & Luschi, 2018); for this reason, the batch size is set to 32. Moreover, in order to avoid overfitting, we have implemented early stopping, a regularization technique that interrupts training when the training loss keeps decreasing but no progress is made on the validation set for a predefined number of epochsFootnote 3; we set this limit to 30 epochs.
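The early-stopping rule can be sketched in a few lines of plain Python; the validation-loss history below is hypothetical, and a small patience is used for illustration (the paper sets it to 30 epochs):

```python
def train_with_early_stopping(val_losses, patience):
    """Return the epoch with the best validation loss, stopping the scan
    once no improvement has been seen for `patience` consecutive epochs."""
    best_loss, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break  # stop training, keep the best weights seen so far
    return best_epoch

# Hypothetical history: the validation loss stops improving after epoch 2.
history = [5.0, 4.0, 3.0, 3.5, 3.6, 3.7, 3.8, 3.9]
best = train_with_early_stopping(history, patience=3)
```

In frameworks such as Keras the same behaviour is obtained with the built-in `EarlyStopping` callback and its `patience` argument.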

Once we have set these parameters and have adequately prepared the data to be processed by our neural networks, we can concentrate on tuning the other hyperparameters. The four most important are the optimizer, the learning rate, the number of hidden layers, and the number of neurons per hidden layer. Training very large deep neural networks can be very slow. The techniques already implemented, together with a good initialization strategy and a good activation function, speed up the training, but another huge speed boost comes from using a faster optimizer than regular gradient descent. Our choice is the Nadam algorithm (Dozat, 2016), an adaptive optimization method combined with the Nesterov trick (Nesterov, 1983), which delivers excellent convergence in a short time. A fundamental element for the convergence of the algorithm is the choice of the learning rate. Since the optimal learning rate depends on the other hyperparameters, we fix it after choosing the optimizer. With too high a learning rate the training may diverge, while with too low a value training will eventually converge to the optimum, but only after a very long time. One way to find a good learning rate (Muller & Guido, 2016) is to train the model for a few hundred iterations, exponentially increasing the learning rate from a very small value to a very large one, then to look at the learning curve and pick a learning rate slightly lower than the one at which the curve starts shooting back up; in this way, we set its value equal to 0.01.
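The exponentially increasing learning-rate sweep used in this procedure can be generated as follows; the bounds and number of iterations are illustrative assumptions, not values taken from the paper:

```python
def lr_sweep(lr_min, lr_max, n_iters):
    """Exponentially spaced learning rates from lr_min to lr_max:
    each iteration multiplies the rate by a constant factor."""
    factor = (lr_max / lr_min) ** (1.0 / (n_iters - 1))
    return [lr_min * factor ** i for i in range(n_iters)]

# Sweep over a few hundred iterations; one then plots the training loss
# against these rates and picks a value slightly below the point where
# the loss starts shooting back up.
rates = lr_sweep(1e-5, 10.0, 500)
```
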

Theoretically, a neural network with just one hidden layer can model even the most complex functions (Hornik, 1991), provided it has enough neurons. For complex problems, however, deep networks are much more efficient than shallow ones: they can model complex functions with exponentially fewer neurons, reaching much better performance with the same amount of training data. Regarding the number of neurons in the hidden layers, it is common practice to use the same number in all of them, since this generally performs well and leaves only one hyperparameter to tune instead of one per layer. To obtain the optimal values of these two hyperparameters, we performed a grid search as in the previous cases. In Fig. 15 we report the heat-map of the mean 5-fold cross-validation RMSE of the implemented neural networks as a function of the number of hidden layers (n_hidden) and the number of neurons per hidden layer (n_neurons). From Fig. 15, increasing the number of layers appears to have a greater impact on network performance than the number of neurons per layer. The hyperparameters returning the lowest value of the evaluation metric on the validation set are \(\mathtt{n\_hidden} = 3\) and \(\mathtt{n\_neurons}= 100\). Among the configurations with similar cross-validation values, we chose this one because it has fewer weights to learn and, consequently, a lower probability of overfitting.
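The "fewer weights to learn" criterion can be made concrete by counting the trainable parameters of each candidate architecture; the helper below is an illustrative sketch for a fully connected network with 7 inputs and one output, as in the text:

```python
def n_params(n_inputs, n_hidden, n_neurons, n_outputs=1):
    """Trainable parameters (weights + biases) of a fully connected MLP
    with n_hidden hidden layers of n_neurons each."""
    sizes = [n_inputs] + [n_neurons] * n_hidden + [n_outputs]
    # Each layer contributes (fan_in * fan_out) weights plus fan_out biases.
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))

chosen = n_params(7, 3, 100)  # the selected configuration: 7-100-100-100-1
```

Among configurations with comparable cross-validation RMSE, the one with the smallest parameter count is preferred, since fewer weights lower the risk of overfitting.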

Fig. 15

Heat-map of mean 5-fold cross-validation RMSE of MLP as a function of n_hidden and n_neurons. Only the RMSE mean value on the validation set is reported

Appendix E: G1++ Parameters

We present the values of the G1++ parameters used for the creation of the dataset (Table 6).

Table 6 G1++ parameters chosen for the creation of the dataset

Appendix F: Market Data

See Tables 7, 8, 9, and 10.

Table 7 EONIA and EURIBOR 6 M zero rate yield curves as of 31st October 2019 (First part) in percentage values (continuous compounding, act/365 day-count convention)
Table 8 EONIA and EURIBOR 6 M zero rate yield curves as of 31st October 2019 (Second part) in percentage values (continuous compounding, act/365 day-count convention)
Table 9 EUR ATM swaption forward rates as of 31st October 2019
Table 10 EUR ATM European swaption straddles, forward premium, physical LCH settlement, notional 10,000 EUR

Appendix G: Bermudan Basket

See Table 11.

Table 11 Bermudan swaption selected for the dataset

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Aiolfi, R., Moreni, N., Bianchetti, M. et al. Learning Bermudans. Comput Econ (2024). https://doi.org/10.1007/s10614-023-10517-w

