Neural networks applied to chain–ladder reserving


Classical claims reserving methods act on so-called claims reserving triangles, which are aggregations of insurance portfolios. A crucial assumption in classical claims reserving is that these aggregated portfolios are sufficiently homogeneous so that a coarse reserving algorithm can be applied. We start from such a coarse reserving method, in our case Mack’s chain–ladder method, and show how this approach can be refined for heterogeneity and individual claims feature information using neural networks.




References

1. Antonio K, Plat R (2014) Micro-level stochastic loss reserving for general insurance. Scand Act J 2014(7):649–669
2. Arjas E (1989) The claims reserving problem in non-life insurance: some structural ideas. ASTIN Bull 19(2):139–152
3. Badescu AL, Lin XS, Tang D (2016) A marked Cox model for the number of IBNR claims: theory. Insur Math Econ 69:29–37
4. Badescu AL, Lin XS, Tang D (2016) A marked Cox model for the number of IBNR claims: estimation and application. Version March 14, 2016. SSRN Manuscript 2747223
5. Baudry M, Robert CY (2017) Non parametric individual claim reserving in insurance. Preprint
6. Cybenko G (1989) Approximation by superpositions of a sigmoidal function. Math Control Signals Syst 2(4):303–314
7. Gabrielli A, Wüthrich MV (2018) An individual claims history simulation machine. Risks 6(2):29
8. Harej B, Gächter R, Jamal S (2017) Individual claim development with machine learning. ASTIN Report
9. Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366
10. Isenbeck M, Rüschendorf L (1992) Completeness in location families. Prob Math Stat 13:321–343
11. Jessen AH, Mikosch T, Samorodnitsky G (2011) Prediction of outstanding payments in a Poisson cluster model. Scand Act J 2011(3):214–237
12. LeCun Y, Bengio Y, Hinton G (2015) Deep learning. Nature 521(7553):436–444
13. Lopez O (2018) A censored copula model for micro-level claim reserving. HAL Id: hal-01706935
14. Mack T (1993) Distribution-free calculation of the standard error of chain ladder reserve estimates. ASTIN Bull 23(2):213–225
15. Montúfar G, Pascanu R, Cho K, Bengio Y (2014) On the number of linear regions of deep neural networks. Adv Neural Inf Process Syst 27:2924–2932
16. Nielsen M (2017) Neural networks and deep learning. Online book, http://neuralnetworksanddeeplearning.com
17. Norberg R (1993) Prediction of outstanding liabilities in non-life insurance. ASTIN Bull 23(1):95–115
18. Norberg R (1999) Prediction of outstanding liabilities II. Model variations and extensions. ASTIN Bull 29(1):5–25
19. Pigeon M, Antonio K, Denuit M (2013) Individual loss reserving with the multivariate skew normal framework. ASTIN Bull 43(3):399–428
20. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
21. Schnieper R (1991) Separating true IBNR and IBNER claims. ASTIN Bull 21(1):111–127
22. Verrall RJ, Wüthrich MV (2016) Understanding reporting delay in general insurance. Risks 4(3):25
23. Werbos P (1982) Applications of advances in nonlinear sensitivity analysis. Syst Model Optim 1982:762–770
24. Wüthrich MV (2018) Machine learning in individual claims reserving. Scand Act J 2018(6):465–480
25. Zarkadoulas A (2017) Neural network algorithms for the development of individual losses. MSc thesis, University of Lausanne



Acknowledgements

We kindly thank Philipp Reinmann (AXA) and Ronald Richman (AIG), who provided very useful remarks on previous versions of this manuscript.

Author information



Corresponding author

Correspondence to Mario V. Wüthrich.


Appendix 1: Data

We generate synthetic individual claims data using the scenario generator provided in [7]. The use and parametrization of this scenario generator is illustrated in Listing 1. On lines 2–6 we generate a claims portfolio consisting of the feature components LoB, cc, AY, AQ, age and inj_part, see lines 8–15 of Listing 1. This yields 5,003,204 individual claims. The marginal distributions of these feature components are illustrated in Fig. 5, and the two-dimensional contour plots are given in Fig. 6. The latter contour plots do not indicate collinearity between the feature components.


This portfolio of 5,003,204 individual claims is then used to generate fully developed individual claims histories. This is done on lines 17–21 of Listing 1 and provides an individual reporting delay RepDel as well as claims payments over 12 development years for each claim. In claims reserving problems, we typically do not have information about the full claims development. If the latest observed calendar year is 2005, we only have information about the reported claims with \({\mathtt{AY}} + {\mathtt{RepDel}} \le 2005\) (upper triangles). In our synthetic data set, these are 4,970,856 reported claims; the remaining 32,348 claims are only reported after calendar year 2005 and are therefore not available at the end of 2005. The 4,970,856 reported claims provide claims payments in upper claims reserving triangles. These payments, aggregated per line of business \({\mathtt{LoB}} \in \{1,\ldots , 4\}\), are shown in Table 4. In our special situation of simulated claims histories we also know the total ultimate claim amounts \(C_{i,J}\). These are provided in the last columns of Table 4. The general aim is to predict these last columns based on the given upper claims development triangles illustrated in Table 4.
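The chain–ladder mechanics underlying this prediction task can be sketched on a toy triangle: development factors are estimated from successive column sums of cumulative payments, and each accident year is projected to ultimate by chaining these factors. A minimal sketch (the numbers below are purely illustrative, not the simulated portfolio):

```python
# Chain-ladder projection on a tiny cumulative-payments triangle.
# Rows = accident years, columns = development years; None = not yet observed.
triangle = [
    [1000, 1800, 2100, 2200],
    [1100, 1950, 2300, None],
    [1200, 2150, None, None],
    [1300, None, None, None],
]

J = len(triangle[0]) - 1

# Volume-weighted chain-ladder factors f_j = sum_i C_{i,j+1} / sum_i C_{i,j},
# using only accident years where both entries are observed.
factors = []
for j in range(J):
    num = sum(row[j + 1] for row in triangle if row[j + 1] is not None)
    den = sum(row[j] for row in triangle if row[j + 1] is not None)
    factors.append(num / den)

# Project every accident year to ultimate by chaining the remaining factors.
ultimates = []
for row in triangle:
    last = max(j for j, c in enumerate(row) if c is not None)
    c = row[last]
    for j in range(last, J):
        c *= factors[j]
    ultimates.append(c)
```

The neural network refinement discussed in the paper replaces the single factor per development period by a feature-dependent factor \(f_{j-1}(\varvec{x})\).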

Fig. 5

Marginal distributions of the feature components LoB, cc, AY, AQ, age, inj_part

Fig. 6

Two-dimensional contour plots of the portfolio distributions of the feature components LoB, cc, age, inj_part

Table 4 Cumulative payments (in 1000) of LoBs 1–4 for accident years \(1994\le i \le 2005\) and development years \(0\le j \le J=11\); the last column gives the true ultimate claims \(C_{i,J}(\cdot )\) for each accident year \({\mathtt{AY}}=i \in \{1994, \ldots , 2005\}\) and line of business \({\mathtt{LoB}}\in \{1, \ldots , 4\}\)

Appendix 2: Neural network architecture and calibration

We come back to the choice of the network architecture and to the network calibration mentioned in Sect. 3.4. In our case, the network architecture requires the choice of the number of hidden neurons q (we also recall the universality theorems mentioned in Sect. 3.1). The first step in the neuron activation (3.2) is the scalar product

$$\begin{aligned} \langle \varvec{w}_k , \varvec{x}\rangle = \sum _{l=1}^{d} w_{k,l} x_l, \end{aligned}$$

that projects the d-dimensional feature \(\varvec{x}\) to a scalar for each neuron \(1\le k \le q\). All features \(\varvec{x}\) that lie in hyperplanes orthogonal to \(\varvec{w}_k\) give the same neuron activation \(z_k(\varvec{x})\) in the k-th neuron. This implies that each neuron provides a substantial reduction in dimension at the price of a loss of information. Therefore, we need an appropriate minimal number q of hidden neurons (with different \(\varvec{w}_k\)’s) so that the hidden layer is still able to capture the main differences between the claim types. On the other hand, q should not be chosen too large, because this increases computational time, makes the model prone to over-fitting, and introduces more model redundancies (if early stopping is applied in calibration). For these reasons, one often performs a grid search to find a good hyperparameter q. Our experience is that a hyperparameter q that is roughly 1 to 3 times the dimension d of the feature space is often a good choice, provided the desired regression function is not too wildly behaved.
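The dimension-reduction argument can be made concrete: any two features lying in the same hyperplane orthogonal to \(\varvec{w}_k\) produce identical activations in that neuron, so a single neuron cannot distinguish them. A minimal sketch with a sigmoid activation and hypothetical weights (the vectors below are illustrative, not fitted values):

```python
import math

def neuron_activation(w, x, b=0.0):
    """Sigmoid activation of one hidden neuron: z(x) = phi(<w, x> + b)."""
    s = sum(wl * xl for wl, xl in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-s))

w = [1.0, -2.0, 0.5]   # hypothetical weight vector w_k
x1 = [1.0, 1.0, 2.0]   # <w, x1> = 1 - 2 + 1 = 0
x2 = [3.0, 1.5, 0.0]   # <w, x2> = 3 - 3 + 0 = 0

# Both features lie in the hyperplane {x : <w, x> = 0}, hence the neuron
# maps them to the identical activation value; their difference is lost.
```

This is exactly why several neurons with different \(\varvec{w}_k\)'s are needed to separate the claim types.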


With this in mind we fit the model. We use the R interface to Keras which is a user-friendly application programming interface (API) to TensorFlow. The corresponding code is provided in Listing 2. The first remark is that among the available loss functions in Keras there is the square loss function (called ‘mse’, see line 18 of Listing 2), but not a weighted square loss function. For this reason, we modify (3.1) to

$$\begin{aligned} \mathcal{L}_j^0~=~ \sigma ^2_{j-1}\mathcal{L}_j ~=~\sum _{i=1}^{I-j}~ \sum _{\varvec{x}:~C_{i,j-1}(\varvec{x})>0} \left( \frac{C_{i,j}(\varvec{x})}{\sqrt{C_{i,j-1}(\varvec{x})}}- f_{j-1}(\varvec{x})\sqrt{C_{i,j-1}(\varvec{x})}\right) ^2. \end{aligned}$$

We remark that the explicit choice of \(\sigma _{j-1}^2>0\) does not influence the calibration; therefore, it can be dropped. Secondly, the loss function \(\mathcal{L}_j^0\) in (6.1) is now an unweighted square loss function, and we can use ‘mse’ on line 18 of Listing 2. The responses are given by \(Y_{i,j}(\varvec{x})=C_{i,j}(\varvec{x})/\sqrt{C_{i,j-1}(\varvec{x})}\), and the corresponding regression function reads as

$$\begin{aligned} \varvec{x}~ \mapsto ~ f^0_{j-1}(\varvec{x})~=~f_{j-1}(\varvec{x})\sqrt{C_{i,j-1}(\varvec{x})} ~=~ \exp \left\{ \beta _0+ \sum _{k=1}^{q} \beta _k z_k(\varvec{x}) + \frac{1}{2} \log C_{i,j-1}(\varvec{x}) \right\} , \end{aligned}$$

see (3.3), and where \(\frac{1}{2} \log C_{i,j-1}(\varvec{x})\) plays the role of an offset. The responses \(Y_{i,j}(\varvec{x})\) are defined on line 3 of Listing 2 and the offsets (on the exponential scale) are given on line 4. Lines 7–17 then define the regression function: on lines 7–10 we build network (3.3) with q hidden neurons and excluding the offset. On lines 11–14 we build the offset part. On lines 15–17 we merge the two parts to regression function (6.2). On line 18 we compile the model using the square loss function ‘mse’ and the optimizer ‘rmsprop’. Note that Keras offers several optimizers, which are different variants of the gradient descent method. ‘rmsprop’ stands for ‘root mean square propagation’; it is a momentum-based, improved version of the gradient descent method with integrated adaptive learning rates \(\varrho \) and momentum coefficients \(\nu \).
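The identity behind this reformulation, \(\big(C_{i,j}/\sqrt{C_{i,j-1}} - f\sqrt{C_{i,j-1}}\big)^2 = C_{i,j-1}\big(C_{i,j}/C_{i,j-1}-f\big)^2\), which turns the \(C_{i,j-1}\)-weighted square loss on development ratios into an unweighted square loss on the transformed responses, can be verified numerically (toy numbers, factor f fixed):

```python
import math

C_prev = [100.0, 250.0, 400.0]   # C_{i,j-1}
C_next = [180.0, 420.0, 700.0]   # C_{i,j}
f = 1.7                          # candidate development factor f_{j-1}(x)

# Unweighted square loss on responses Y = C_j / sqrt(C_{j-1}), as in (6.1).
loss_unweighted = sum(
    (cn / math.sqrt(cp) - f * math.sqrt(cp)) ** 2
    for cp, cn in zip(C_prev, C_next)
)

# C_{j-1}-weighted square loss on the observed development ratios C_j / C_{j-1}.
loss_weighted = sum(
    cp * (cn / cp - f) ** 2
    for cp, cn in zip(C_prev, C_next)
)
```

Both expressions coincide term by term, which is why the plain ‘mse’ loss of Keras suffices once the responses are rescaled.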

Finally, the model is fitted on lines 20–21 of Listing 2. We choose mini batches of size 10,000, i.e. every gradient descent step is based on 10,000 observations. We run these gradient descent steps for 100 epochs, which means that every claim is considered 100 times during the optimization. Moreover, we choose validation_split=0.1, which means that 10% of the data is used for out-of-sample validation.
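For orientation, the bookkeeping behind mini batches, epochs and the validation split can be sketched as follows. The claim count comes from Appendix 1; the exact rounding conventions of Keras’ fit() are an assumption here, so the resulting step counts are approximate:

```python
def minibatch_epochs(n_obs, batch_size, epochs, val_split):
    """Count gradient-descent steps: the validation split is carved off
    first, and the remaining training data is cycled through once per
    epoch in mini batches (the last batch may be smaller)."""
    n_train = int(n_obs * (1.0 - val_split))
    steps_per_epoch = -(-n_train // batch_size)   # ceiling division
    return n_train, steps_per_epoch, steps_per_epoch * epochs

# With the 4,970,856 reported claims, batch size 10,000, 100 epochs
# and validation_split=0.1:
n_train, steps, total_steps = minibatch_epochs(4_970_856, 10_000, 100, 0.1)
```

So each development period requires on the order of 45,000 gradient descent steps in this setup.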

Fig. 7

Decrease in loss function during the gradient descent algorithm for \(j=1\): in-sample loss in red color and validation loss in green color for the 100 epochs; the left-hand side shows the model with \(q=5\) hidden neurons and the right-hand side the model with \(q=20\) hidden neurons

In Fig. 7 we show the decrease in the loss functions \(\mathcal{L}_j^0\) during the 100 epochs of the gradient descent algorithm. The left-hand side shows the model with \(q=5\) hidden neurons and the right-hand side the model with \(q=20\) hidden neurons; the red color shows the in-sample loss and the green color the validation loss on the 10% of observations chosen for out-of-sample validation. In all cases we observe a decrease in the loss function, which indicates that we do not over-fit by running the gradient descent algorithm in Listing 2 for 100 epochs on mini batches of size 10,000. We could now fine-tune these calibrations for the number of epochs, the mini batch sizes, the optimizer, etc. We refrain from doing so and run the algorithm in Listing 2 for 100 epochs on mini batches of size 10,000, for neural networks with \(q=5,10,20\) hidden neurons in the single hidden layer.

The results are presented in Table 5. The first block of Table 5 shows the results in the homogeneous case, which corresponds to Mack’s CL model. This model has one parameter (CL factor) for each development period \(j=1,\ldots , 11\) and results in the in-sample losses given on the last line of the first block. The remaining three blocks give the neural network results for \(q=5,10,20\) hidden neurons. The first lines in these blocks provide the number of network parameters involved, i.e. the dimension \(q+1+q(d+1)\) of the network parameters \(\varvec{\alpha }\). The second lines in these blocks provide the run times for 100 epochs. These run times were obtained on a personal laptop with CPU @ 2.50GHz (4 CPUs) and 16GB RAM. We observe that the run times increase in the number of hidden neurons q and decrease in the development periods j, because there are fewer accident years and observations in later development periods. The maximal run time of 150 seconds was observed for \(q=20\) and \(j=2\); thus, if one parallelizes the optimizations over the development periods, it takes roughly 3 minutes to fit the model.
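The parameter counts in the first lines of the blocks follow directly from the dimension formula \(q+1+q(d+1)\): each of the q hidden neurons carries a weight vector in \(\mathbb{R}^d\) plus an intercept, and the output layer adds \(\beta_0,\ldots,\beta_q\). A minimal sketch (the feature dimension d = 40 below is purely hypothetical; the actual d depends on the feature encoding and is not restated here):

```python
def n_network_params(q, d):
    """Parameter count of a single-hidden-layer network with q hidden
    neurons on d-dimensional features: q * (d + 1) hidden-layer weights
    and intercepts, plus q + 1 output-layer parameters beta_0..beta_q."""
    return q + 1 + q * (d + 1)

# Hypothetical feature dimension d = 40, for the three architectures used:
counts = {q: n_network_params(q, 40) for q in (5, 10, 20)}
```

The count grows linearly in both q and d, which is one reason moderate q already suffices here.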

Table 5 Results of the gradient descent algorithm in Listing 2 for \(q=5,10,20\) hidden neurons after 100 epochs with mini batches of 10,000 observations

Finally, the last lines of the blocks provide the resulting in-sample losses \(\mathcal{L}_j^0\) of the three network models. Firstly, the neural network regression models provide clearly lower losses than the homogeneous model. From this we conclude that we should refine Mack’s CL model for heterogeneity. Secondly, for periods \(j=1,2\) the model with \(q=20\) hidden neurons outperforms the other two network models. For higher development periods \(j\ge 3\) the situation is less clear, and we could also opt for a regression model with fewer hidden neurons.

Appendix 3: Sensitivities of neural network calibrations

See Figs. 8 and 9.

Fig. 8

Sensitivities of the estimated CL factors \(\widehat{f}_{j-1}(\varvec{x})\) in individual feature components for \(j=1,\ldots , 5\) (in blue color); the gray dotted lines show the average neural network factor \(\bar{f}^\mathrm{NN}_{j-1}\), see (3.9); the y-scales are the same on each row

Fig. 9

Sensitivities of the estimated CL factors \(\widehat{f}_{j-1}(\varvec{x})\) in individual feature components for \(j=6,\ldots , 11\) (in blue color); the gray dotted lines show the average neural network factor \(\bar{f}^\mathrm{NN}_{j-1}\), see (3.9); the y-scales are the same on each row


Cite this article

Wüthrich, M.V. Neural networks applied to chain–ladder reserving. Eur. Actuar. J. 8, 407–436 (2018).



Keywords

  • Claims reserving
  • Mack’s CL model
  • Individual claims reserving
  • Micro-level reserving
  • Neural networks
  • Individual claims features
  • Claims covariates