## Abstract

Classical claims reserving methods act on so-called claims reserving triangles, which are aggregated insurance portfolios. A crucial assumption in classical claims reserving is that these aggregated portfolios are sufficiently homogeneous so that a coarse reserving algorithm can be applied. We start from such a coarse reserving method, in our case Mack’s chain–ladder method, and show how this approach can be refined for heterogeneity and individual claims feature information using neural networks.


## References

- 1.
Antonio K, Plat R (2014) Micro-level stochastic loss reserving for general insurance. Scand Act J 2014(7):649–669

- 2.
Arjas E (1989) The claims reserving problem in non-life insurance: some structural ideas. ASTIN Bull 19(2):139–152

- 3.
Badescu AL, Lin XS, Tang D (2016) A marked Cox model for the number of IBNR claims: theory. Insur Math Econ 69:29–37

- 4.
Badescu AL, Lin XS, Tang D (2016) A marked Cox model for the number of IBNR claims: estimation and application. Version March 14, 2016. SSRN Manuscript 2747223

- 5.
Baudry M, Robert CY (2017) Non parametric individual claim reserving in insurance. Preprint

- 6.
Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals Syst 2(4):303–314

- 7.
Gabrielli A, Wüthrich MV (2018) An individual claims history simulation machine. Risks 6(2):29

- 8.
Harej B, Gächter R, Jamal S (2017) Individual claim development with machine learning. ASTIN Report

- 9.
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366

- 10.
Isenbeck M, Rüschendorf L (1992) Completeness in location families. Prob Math Stat 13:321–343

- 11.
Jessen AH, Mikosch T, Samorodnitsky G (2011) Prediction of outstanding payments in a Poisson cluster model. Scand Act J 2011(3):214–237

- 12.
LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444

- 13.
Lopez O (2018) A censored copula model for micro-level claim reserving. HAL Id: hal-01706935

- 14.
Mack T (1993) Distribution-free calculation of the standard error of chain ladder reserve estimates. ASTIN Bull 23(2):213–225

- 15.
Montúfar G, Pascanu R, Cho K, Bengio Y (2014) On the number of linear regions of deep neural networks. Adv Neural Inf Process Syst 27:2924–2932

- 16.
Nielsen M (2017) Neural networks and deep learning. Online book, available at http://neuralnetworksanddeeplearning.com

- 17.
Norberg R (1993) Prediction of outstanding liabilities in non-life insurance. ASTIN Bull 23(1):95–115

- 18.
Norberg R (1999) Prediction of outstanding liabilities II. Model variations and extensions. ASTIN Bull 29(1):5–25

- 19.
Pigeon M, Antonio K, Denuit M (2013) Individual loss reserving with the multivariate skew normal framework. ASTIN Bull 43(3):399–428

- 20.
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536

- 21.
Schnieper R (1991) Separating true IBNR and IBNER claims. ASTIN Bull 21(1):111–127

- 22.
Verrall RJ, Wüthrich MV (2016) Understanding reporting delay in general insurance. Risks 4(3):25

- 23.
Werbos P (1982) Applications of advances in nonlinear sensitivity analysis. Syst Model Optim 1982:762–770

- 24.
Wüthrich MV (2018) Machine learning in individual claims reserving. Scand Act J 2018(6):465–480

- 25.
Zarkadoulas A (2017) Neural network algorithms for the development of individual losses. MSc thesis, University of Lausanne

## Acknowledgements

We would like to kindly thank Philipp Reinmann (AXA) and Ronald Richman (AIG), who provided very useful remarks on previous versions of this manuscript.


## Appendices

### Appendix 1: Data

We generate synthetic individual claims data using the scenario generator provided in [7]. The use and parametrization of this scenario generator are illustrated in Listing 1. On lines 2–6 we generate a claims portfolio consisting of the feature components LoB, cc, AY, AQ, age and inj_part, see lines 8–15 of Listing 1. We receive 5,003,204 individual claims. The marginal distributions of these feature components are illustrated in Fig. 5 and the two-dimensional contour plots are given in Fig. 6. The latter contour plots do not indicate collinearity between the feature components.

This portfolio of 5,003,204 individual claims is then used to generate fully developed individual claims histories. This is done on lines 17–21 of Listing 1 and provides an individual reporting delay RepDel as well as claims payments over 12 development years for each claim. In claims reserving problems, we typically do not have information about the full claims developments. If the latest observed calendar year is 2005, we only have information about the reported claims with \({\mathtt{AY}} + {\mathtt{RepDel}} \le 2005\) (upper triangles). In our synthetic data set, these are 4,970,856 reported claims; the remaining 32,348 claims are only reported after calendar year 2005 and are therefore not available at the end of 2005. The 4,970,856 reported claims provide the claims payments in the upper claims reserving triangles. These payments, aggregated per line of business \({\mathtt{LoB}} \in \{1,\ldots , 4\}\), are shown in Table 4. In our special situation of simulated claims histories we also know the total ultimate claim amounts \(C_{i,J}\); these are provided in the last columns of Table 4. The general aim is to predict these last columns based on the given upper claims development triangles illustrated in Table 4.
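The reporting condition \({\mathtt{AY}} + {\mathtt{RepDel}} \le 2005\) can be sketched as follows. This is an illustrative Python/NumPy fragment with hypothetical variable names (the paper's actual data handling is done in R with the simulation machine of [7]):

```python
import numpy as np

def split_reported(ay, rep_del, last_calendar_year=2005):
    """Split claims into reported and IBNR at the end of `last_calendar_year`.

    A claim with accident year AY and reporting delay RepDel (in years)
    is observed at the evaluation date iff AY + RepDel <= last_calendar_year.
    """
    ay = np.asarray(ay)
    rep_del = np.asarray(rep_del)
    reported = ay + rep_del <= last_calendar_year
    return reported, ~reported

# toy example with three claims (made-up values)
reported, ibnr = split_reported(ay=[1994, 2004, 2005], rep_del=[0, 3, 1])
# only the first claim is reported by the end of 2005; the other two are IBNR
```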

### Appendix 2: Neural network architecture and calibration

We come back to the choice of the network architecture and to the network calibration mentioned in Sect. 3.4. In our case, the network architecture requires the choice of the number of hidden neurons *q* (recall the universality theorems mentioned in Sect. 3.1). The first step in the neuron activation (3.2) is the scalar product

\(\varvec{x} ~\mapsto ~ \langle \varvec{w}_k, \varvec{x} \rangle = \sum_{l=1}^{d} w_{k,l}\, x_l,\)

that projects the *d*-dimensional feature \(\varvec{x}\) to a scalar for each neuron \(1\le k \le q\). All features \(\varvec{x}\) that lie in hyperplanes orthogonal to \(\varvec{w}_k\) give the same neuron activation \(z_k(\varvec{x})\) in the *k*-th neuron. This implies that each neuron provides a substantial reduction in dimension at the price of a loss of information. Therefore, we need an appropriate minimal number *q* of hidden neurons (with different \(\varvec{w}_k\)’s) so that the hidden layer can still capture the main differences between the claim types. On the other hand, *q* should not be chosen too large, because this increases computational time, makes the model prone to over-fitting, and introduces more model redundancies (if early stopping is applied in the calibration). For these reasons, one often performs a grid search to find a good hyperparameter *q*. Our experience is that a hyperparameter *q* roughly 1 to 3 times the dimension *d* of the feature space is often a good choice, provided the desired regression function is not too irregular.
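The dimension-reduction argument can be made concrete with a small numerical check. The following Python/NumPy sketch (illustrative only; the variable names are ours) verifies that shifting a feature within the hyperplane orthogonal to \(\varvec{w}_k\) leaves the scalar product, and hence the neuron activation, unchanged:

```python
import numpy as np

def neuron_preactivation(w_k, x):
    # first step of the neuron activation: project the d-dimensional
    # feature x onto the network weight direction w_k
    return float(np.dot(w_k, x))

rng = np.random.default_rng(0)
d = 6                          # feature dimension (placeholder)
w_k = rng.normal(size=d)       # weights of the k-th hidden neuron
x1 = rng.normal(size=d)        # some feature vector

# shift x1 by a vector lying in the hyperplane orthogonal to w_k;
# the projection, and hence the neuron activation, is unchanged
v = rng.normal(size=d)
v_orth = v - (np.dot(v, w_k) / np.dot(w_k, w_k)) * w_k
x2 = x1 + v_orth

z1 = neuron_preactivation(w_k, x1)
z2 = neuron_preactivation(w_k, x2)
# z1 and z2 agree up to floating point error although x1 != x2,
# i.e. one neuron cannot distinguish x1 from x2
```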

With this in mind we fit the model. We use the R interface to Keras, which is a user-friendly application programming interface (API) to TensorFlow. The corresponding code is provided in Listing 2. A first remark is that the loss functions available in Keras include the square loss function (called ‘mse’, see line 18 of Listing 2), but not a weighted square loss function. For this reason, we modify (3.1) to

\(\mathcal{L}_j^{0} = \frac{1}{\sigma_{j-1}^{2}} \sum_{i} \left( \frac{C_{i,j}(\varvec{x})}{\sqrt{C_{i,j-1}(\varvec{x})}} - f_{j-1}(\varvec{x})\, \sqrt{C_{i,j-1}(\varvec{x})} \right)^{2}. \qquad (6.1)\)
We remark that the explicit choice of \(\sigma _{j-1}^2>0\) does not influence the calibration; therefore, it can be dropped. Secondly, the loss function \(\mathcal{L}_j^0\) in (6.1) is now an un-weighted square loss function, and we can use ‘mse’ on line 18 of Listing 2. The responses are given by \(Y_{i,j}(\varvec{x})=C_{i,j}(\varvec{x})/\sqrt{C_{i,j-1}(\varvec{x})}\), and the corresponding regression function reads as

\(\varvec{x} ~\mapsto ~ f_{j-1}(\varvec{x})\, \sqrt{C_{i,j-1}(\varvec{x})} = \exp \left\{ \log f_{j-1}(\varvec{x}) + \frac{1}{2} \log C_{i,j-1}(\varvec{x}) \right\}, \qquad (6.2)\)

see (3.3), and where \(\frac{1}{2} \log C_{i,j-1}(\varvec{x})\) plays the role of an offset. The responses \(Y_{i,j}(\varvec{x})\) are defined on line 3 of Listing 2 and the offsets (on the exponential scale) are given on line 4. Lines 7–17 then define the regression function: on lines 7–10 we build network (3.3) with *q* hidden neurons, excluding the offset. On lines 11–14 we build the offset part. On lines 15–17 we merge the two parts into regression function (6.2). On line 18 we compile the model using the square loss function ‘mse’ and the optimizer ‘rmsprop’. Note that Keras offers several optimizers, all of which are variants of the gradient descent method. ‘rmsprop’ stands for ‘root mean square propagation’; it is a momentum-based improvement of the gradient descent method with integrated adaptive learning rates \(\varrho \) and momentum coefficients \(\nu \).
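The equivalence between the weighted square loss of Mack's model and the un-weighted square loss on the transformed responses \(Y_{i,j}=C_{i,j}/\sqrt{C_{i,j-1}}\) rests on the identity \(C_{i,j-1}\,(C_{i,j}/C_{i,j-1} - f)^2 = (Y_{i,j} - f\sqrt{C_{i,j-1}})^2\). The following Python/NumPy sketch (illustrative, with made-up numbers) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
C_prev = rng.uniform(100.0, 1000.0, size=n)       # C_{i,j-1}
C_next = C_prev * rng.uniform(0.9, 1.5, size=n)   # C_{i,j}
f_hat = rng.uniform(1.0, 1.3, size=n)             # fitted CL factors f_{j-1}(x_i)
sigma2 = 4.0                                      # sigma_{j-1}^2

# weighted square loss of Mack's chain-ladder model
L_weighted = np.sum(C_prev / sigma2 * (C_next / C_prev - f_hat) ** 2)

# un-weighted square loss on the transformed responses
Y = C_next / np.sqrt(C_prev)                      # Y_{i,j} = C_{i,j}/sqrt(C_{i,j-1})
mu = f_hat * np.sqrt(C_prev)                      # regression function incl. offset
L_unweighted = np.sum((Y - mu) ** 2)
# L_unweighted equals sigma2 * L_weighted, so minimizing either loss
# yields the same calibration and sigma2 can indeed be dropped
```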

Finally, the model is fitted on lines 20–21 of Listing 2. We choose mini batches of size 10,000, i.e. every gradient descent step is based on 10,000 observations. We run these gradient descent steps for 100 epochs, which means that every claim is considered 100 times during the optimization. Moreover, we choose validation_split=0.1, which means that 10% of the data is used for out-of-sample validation.
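For orientation, the number of gradient descent steps implied by these choices follows directly. The sketch below is illustrative Python; note that the 4,970,856 reported claims refer to the whole portfolio, whereas each per-period loss \(\mathcal{L}_j^0\) is fitted on a subset of the data:

```python
import math

# hyperparameters as chosen in the text
n_obs = 4_970_856        # all reported claims; each per-period fit uses fewer
batch_size = 10_000
epochs = 100
validation_split = 0.1

n_train = int(n_obs * (1 - validation_split))      # observations used for fitting
steps_per_epoch = math.ceil(n_train / batch_size)  # gradient descent steps per epoch
total_steps = steps_per_epoch * epochs
```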

In Fig. 7 we show the decrease in the loss functions \(\mathcal{L}_j^0\) during the 100 epochs of the gradient descent algorithm. The left-hand side shows the model with \(q=5\) hidden neurons and the right-hand side the model with \(q=20\) hidden neurons; the red color shows the in-sample loss and the green color the validation loss on the 10% of observations chosen for out-of-sample validation. In all cases we observe a decreasing loss, which indicates that we do not over-fit by running the gradient descent algorithm in Listing 2 for 100 epochs on mini batches of size 10,000. We could now fine-tune these calibrations with respect to the number of epochs, the mini batch sizes, the optimizer, etc. We refrain from doing so and simply run the algorithm in Listing 2 for 100 epochs on mini batches of size 10,000, for neural networks with \(q=5,10,20\) hidden neurons in the single hidden layer.

The results are presented in Table 5. The first block of Table 5 shows the results in the homogeneous case, which corresponds to Mack’s CL model. This model has one parameter (the CL factor) for each development period \(j=1,\ldots , 11\) and results in the in-sample losses given on the last line of the first block. The remaining three blocks give the neural network results for \(q=5,10,20\) hidden neurons. The first lines in these blocks provide the number of network parameters involved, i.e. the dimension \(q+1+q(d+1)\) of the network parameter \(\varvec{\alpha }\). The second lines in these blocks provide the run times for 100 epochs. These run times were obtained on a personal laptop with a 2.50 GHz CPU (4 CPUs) and 16 GB RAM. We observe that the run times increase in the number of hidden neurons *q* and decrease in the development periods *j*, because there are fewer accident years, and hence fewer observations, in later development periods. The maximal run time of 150 seconds was observed for \(q=20\) and \(j=2\); thus, if one parallelizes the optimizations over the development periods, it takes roughly 3 minutes to fit the model.
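The parameter count \(q+1+q(d+1)\) is easy to reproduce. The sketch below is illustrative Python; the feature dimension \(d=10\) is a placeholder, since the effective input dimension depends on the chosen encoding of the feature components:

```python
def n_network_params(q, d):
    """Parameter count of a single-hidden-layer network: q*(d+1) weights
    (including intercepts) for the hidden layer plus q+1 output weights."""
    return q + 1 + q * (d + 1)

# d = 10 is a hypothetical feature dimension for illustration
counts = {q: n_network_params(q, d=10) for q in (5, 10, 20)}
```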

Finally, the last lines of the blocks provide the resulting in-sample losses \(\mathcal{L}_j^0\) of the three network models. Firstly, the neural network regression models provide clearly lower losses than the homogeneous model. From this we conclude that we should refine Mack’s CL model for heterogeneity. Secondly, for periods \(j=1,2\) the model with \(q=20\) hidden neurons outperforms the other two network models. For higher development periods \(j\ge 3\) the situation is less clear, and we could also opt for a regression model with fewer hidden neurons.

### Appendix 3: Sensitivities of neural network calibrations

## About this article

### Cite this article

Wüthrich, M.V. Neural networks applied to chain–ladder reserving. *Eur. Actuar. J.* **8**, 407–436 (2018). https://doi.org/10.1007/s13385-018-0184-4


### Keywords

- Claims reserving
- Mack’s CL model
- Individual claims reserving
- Micro-level reserving
- Neural networks
- Individual claims features
- Claims covariates