1 Introduction

Since 1980, innovative approaches and developments in mortality modeling have been constantly proposed. Mortality analysis has received a considerable contribution from statistical science, building solid foundations for the evolution of mortality methods. Estimating longevity is not a trivial task; accuracy depends on the particular situation or trends, and figuring out if or when a certain method will be effective is not straightforward. Indeed, new mortality models will appear in the literature but may take years before they can be fully evaluated. As stated by Booth and Tickle (2008), the accuracy of mortality estimates should be regularly tested to set the improvement evidence. Researchers appear to be more focused on technical progress of a method rather than on the accuracy of the estimation provided, focusing on minimizing the bias.

Several approaches have been used to model the mortality surface, determining how death rates change over time. Until the 1980s, relative simplicity and personal judgment were common features of mortality models (see Pollard (1987) for a detailed review on this aspect). Since then, mortality data have become more easily available, which, jointly with the development of a wider array of statistical-mathematical tools, has led to the creation of increasingly complex mortality models. According to Booth and Tickle (2008), three main paradigms for demographic modeling can be identified. The first one (explanation) is based on structural or epidemiological models from certain causes of death. A widely known example is the connection between lung cancer and tobacco smoking. Expectation, the second paradigm, relies on expert opinion and involves varying degrees of formality. Finally, making use of the regular structures typically found in age patterns and trends over time leads to the widely adopted extrapolative paradigm. This last approach includes the more complex stochastic mortality models such as the Lee–Carter (Lee and Carter 1992) and, more generally, the Generalized Age Period Cohort (GAPC) model. Although the Lee–Carter model is recognized as one of the most influential proposals for mortality modeling and forecasting, in recent years scholars have suggested additional approaches that sparked interest in the actuarial and demographic literature (Brouhns et al. 2002; Renshaw and Haberman 2006; Cairns et al. 2006, 2009).

Although models like the Lee–Carter and its variants have been widely used, becoming a benchmark for many newly proposed methodologies, they present several shortfalls. In this line, Cairns et al. (2008) tried to address the issue of what would be the best way to estimate mortality, setting out criteria that a good mortality model should satisfy. They referred to good-practice guidelines such as consistency with historical data and biologically reasonable long-term dynamics. Following this line of research, the recent longevity literature has stimulated the use of machine learning techniques in demographic research, allowing the integration of stochastic models into a data-driven approach.

Machine learning techniques can support and integrate traditional mortality models in order to significantly reduce forecasting errors, both for application and research purposes. The main contributions in this area, from Deprez et al. (2017), Levantesi and Pizzorusso (2019), and Levantesi and Nigri (2020), take advantage of machine learning algorithms to improve the fitting accuracy of canonical models. That is to say, the mortality surface produced by standard stochastic mortality models is corrected by adjusting the mortality rates estimated by the original model. Such adjustments are obtained by calibrating a machine learning estimator. As shown by these authors, machine learning better captures complex patterns that traditional models fail to identify. Other notable contributions in this area are from Bjerre (2022), who uses random forests and gradient boosting to produce more accurate mortality forecasts compared with forecasts from traditional stochastic mortality models, and from Perla and Scognamiglio (2023), who develop a neural network model requiring the coherence of the mortality forecasts within sub-groups of similar populations, stressing the similarities among countries.

The need for new tools for comparing models’ performances is evident to understand mortality evolution more accurately.

Despite the growing literature on tree-based machine learning and mortality (beyond the above-cited papers, see also the recent contribution from Wen et al. (2023)), tree-based methodologies are not commonly used to compare estimations from different mortality models. Compared to the existing literature related to mortality modeling, this paper explores an innovative approach belonging to the tree-based machine learning techniques that demographers have not yet investigated. This approach, namely Contrast trees, recently proposed by Friedman (2020) and here applied to mortality data, helps evaluate the accuracy of the mortality estimates (fitted mortality rates) given by models that are not treatable with model selection criteria based on the likelihood function. Given the wide range of applications, we believe the proposed methodology may provide a unified framework to assess and compare the goodness-of-fit to historical data of conventional mortality models. Using this methodology in evaluating mortality estimates is new, and it is the main contribution of this paper.

Instead of being concerned with mortality forecasting, the proposed methodology is a diagnostic tool that improves the literature on the mortality estimates evaluation given by the mortality models. The main ability of Contrast trees is to identify the regions in the predictor variables space that show very high values of the error rate quantified by a discrepancy measure. The regions’ width and shape change from model to model.

Moreover, in addition to evaluating the accuracy of the models, Contrast trees make it possible to improve the performance of the models through a boosting procedure that reduces the inaccuracies. We use this methodology, namely Contrast boosting, to enhance the fitting of historical mortality data. According to the demographic literature, the reliable estimation of mortality data may refer not only to the extrapolation but also to an accurate fitting of the historical mortality surface. For instance, in longevity analysis it is common to deal with subpopulations, i.e., regions or provinces, characterized by a high level of stochasticity, often due to a small number of count data at single ages. This is the case in which specific ages or years are not covered by data, making the mortality estimation challenging. Our approach is crucial to evaluate the mortality matrix estimation provided by a mortality model and to ensure estimation effectiveness by comparing different methods.

To summarize, through this new technique based on Contrast trees, we aim to find the best model that fits observed mortality rates by grasping and detecting the inaccuracies of any model and boosting its predictive power.

The remainder of this paper is organized as follows: Sect. 2 introduces the model framework, both Contrast trees and Contrast boosting. In Sect. 3, we describe the numerical implementation, also providing an overview of the mortality models, expressed in a regression framework, which we assess by the Contrast trees approach. We devote a specific sub-section to explanation and discussion of the numerical results. Section 4 concludes the paper, providing other possible practical implementations of the method in mortality assessment and the limitations of our research.

2 Materials and methods

2.1 Data source

We consider the Italian mortality data available in the Human Mortality Database (HMD) over the period 1950–2018. We refer to the male population aged 0–90, analyzing the age groups 0–29, 30–60, and 61–90 separately to provide further evidence of the differences in mortality that characterize the younger ages, the adult ages, and the older ages. Data are split into a training set and a test set, according to the common 70–30% splitting rule. We use the training set to estimate the parameters of each model. We then apply the estimated parameters to the test set and calculate the out-of-sample errors to evaluate the out-of-sample performance. The dataset is partitioned using the dissimilarity-based compound selection proposed in Willett (1999).
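As an illustration of how a dissimilarity-based 70–30% partition might work, the following Python sketch greedily selects training points so that each new point is as far as possible from those already selected. This is only a sketch under stated assumptions: the function name `maximin_split`, the Euclidean distance, the choice of the first observation as seed, and the toy age-year grid are all hypothetical and are not taken from Willett (1999) or from the implementation used in this paper.

```python
import numpy as np

def maximin_split(features, train_fraction=0.7, seed_index=0):
    """Greedy maximin selection: repeatedly add the point whose minimum
    distance to the already-selected points is largest (assumed variant)."""
    features = np.asarray(features, dtype=float)
    n = len(features)
    n_train = int(round(train_fraction * n))
    selected = [seed_index]
    # Distance of every point to its nearest selected point.
    min_dist = np.linalg.norm(features - features[seed_index], axis=1)
    min_dist[seed_index] = -np.inf  # never pick the same point twice
    while len(selected) < n_train:
        candidate = int(np.argmax(min_dist))
        selected.append(candidate)
        dist = np.linalg.norm(features - features[candidate], axis=1)
        min_dist = np.minimum(min_dist, dist)
        min_dist[candidate] = -np.inf
    train_idx = np.array(sorted(selected))
    test_idx = np.setdiff1d(np.arange(n), train_idx)
    return train_idx, test_idx

# Toy grid of (age, year) pairs; illustrative values only.
age_grid, year_grid = np.meshgrid(np.arange(10), np.arange(10), indexing="ij")
grid = np.column_stack([age_grid.ravel(), year_grid.ravel()])
train_idx, test_idx = maximin_split(grid)
```

In practice the predictors would probably be rescaled before computing distances, so that age and calendar year contribute comparably to the dissimilarity.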

2.2 Mortality rate

We calculate the central death rates \(m_{x,t}\) for each age x and year t according to the following formula:

$$\begin{aligned} m_{x,t}=\frac{D_{x,t}}{E_{x,t}} \end{aligned}$$
(1)

where \(D_{x,t}\) is the number of deaths aged x in year t, and \(E_{x,t}\) are the exposures-to-risk aged x in year t.
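A minimal Python sketch of Eq. 1 is given below. The arrays `deaths` and `exposures` hold made-up illustrative values (rows are ages, columns are calendar years), and returning NaN for cells with zero exposure is an assumption of this sketch rather than a choice documented in the paper.

```python
import numpy as np

# Hypothetical death counts D[x, t] and exposures E[x, t];
# rows are ages and columns are calendar years (illustrative values).
deaths = np.array([[12.0, 10.0], [30.0, 27.0]])
exposures = np.array([[4000.0, 4100.0], [3000.0, 2900.0]])

def central_death_rates(deaths, exposures):
    """Return m[x, t] = D[x, t] / E[x, t], with NaN where exposure is zero."""
    deaths = np.asarray(deaths, dtype=float)
    exposures = np.asarray(exposures, dtype=float)
    rates = np.full(deaths.shape, np.nan)
    np.divide(deaths, exposures, out=rates, where=exposures > 0)
    return rates

m = central_death_rates(deaths, exposures)
```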

2.3 Mortality models

In the following, we briefly describe the four models to which the Contrast trees methodology is applied. The scope is to evaluate the models’ quality of fit. The first two models belong to the family of generalized age-period-cohort (GAPC) that are expressed in a regression framework to be suitable for applying Contrast trees, which requires data organized in columns. The last two are well-known machine learning techniques also used for regression tasks.

2.3.1 Lee–Carter (LC) model

We consider the extension of the LC model (Lee and Carter 1992) proposed by Brouhns et al. (2002), who assume that the number of deaths is a Poisson random variable. The Lee–Carter model under the specification of Brouhns et al. (2002) describes the logarithm of the central death rate at age x and time t as:

$$\begin{aligned} \log {\left( m_{x,t}\right) }=\alpha _{x}+\beta _{x} \kappa _{t} \end{aligned}$$
(2)

The age-specific parameter \(\alpha _x\) provides the average age profile of mortality, the age-period term \(\beta _{x}\cdot \kappa _t\) describes the mortality trends, with \(\kappa _t\) the time index and \(\beta _{x}\) modifying the effect of \(\kappa _t\) across ages. The model is subject to the following constraints on \(\kappa _t\) and \(\beta _x\): \(\sum _{t}{\kappa _t}=0\) and \(\sum _{x}{\beta _x}=1\). This model can be reformulated into a Generalized Non-linear Model (GNM) framework, as in Villegas et al. (2018). The authors use a GNM and apply the maximum likelihood method to fit the model to historical data. Under this specification, the LC model can be seen as a non-linear regression model where mortality rates are the target variable, predicted using features (age and time) (Richman and Wüthrich 2021).
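To make the structure of Eq. 2 and its constraints concrete, the sketch below fits the age and period parameters with a rank-one singular value decomposition of the centered log rates. Note that this is the least-squares approach of the original Lee and Carter (1992) proposal, swapped in here for simplicity; it is not the Poisson maximum likelihood fit within the GNM framework described above. The function name `fit_lee_carter_svd` and the synthetic parameter values are hypothetical.

```python
import numpy as np

def fit_lee_carter_svd(log_m):
    """Fit log m[x, t] ~ alpha[x] + beta[x] * kappa[t] by a rank-1 SVD.

    Rows of log_m are ages, columns are years. The output satisfies
    sum(kappa) = 0 and sum(beta) = 1 (assuming sum(beta) is not zero).
    """
    log_m = np.asarray(log_m, dtype=float)
    alpha = log_m.mean(axis=1)              # average age profile
    centered = log_m - alpha[:, None]       # each row now sums to zero
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    beta = u[:, 0]
    kappa = s[0] * vt[0]
    scale = beta.sum()                      # rescale so that sum(beta) = 1
    return alpha, beta / scale, kappa * scale

# Synthetic surface built from known (hypothetical) parameters.
true_alpha = np.array([-5.0, -4.0, -3.0])
true_beta = np.array([0.2, 0.3, 0.5])        # sums to 1
true_kappa = np.array([3.0, 1.0, -1.0, -3.0])  # sums to 0
log_m = true_alpha[:, None] + np.outer(true_beta, true_kappa)

alpha, beta, kappa = fit_lee_carter_svd(log_m)
```

Because the rows are centered before the decomposition, the constraint on \(\kappa _t\) holds automatically, and only the rescaling of \(\beta _x\) has to be imposed explicitly.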

2.3.2 Age-period-cohort (APC)

We use the model’s version reformulated into a Generalized Linear Models (GLM) framework (Alai and Sherris 2014):

$$\begin{aligned} \log {\left( m_{x,t}\right) }=\beta _{0} +\beta _{1,x}+\beta _{2,t} +\beta _{3,t-x} \end{aligned}$$
(3)

where the regression coefficients \(\beta _{1,x}\), \(\beta _{2,t}\), \(\beta _{3,t-x}\) are the age trend, the period trend and the cohort trend, respectively (\(t-x\) represents the year of birth).
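The regression form of Eq. 3, with data organized in columns, can be sketched as follows. The example builds an intercept plus indicator columns for age, period, and cohort, and fits them by ordinary least squares on the log rates. This is a simplification assumed for illustration: it is not the GLM estimation of Alai and Sherris (2014), and the helper name `apc_design_matrix` and the synthetic surface are hypothetical. Since the APC parameters are not identifiable without constraints, only the fitted values (not the individual coefficients) are meaningful here.

```python
import numpy as np

def apc_design_matrix(ages, years):
    """Intercept plus one indicator column per age, year, and cohort level."""
    ages = np.asarray(ages)
    years = np.asarray(years)
    cohorts = years - ages                      # year of birth, t - x
    columns = [np.ones(len(ages))]
    for values in (ages, years, cohorts):
        for level in np.unique(values):
            columns.append((values == level).astype(float))
    return np.column_stack(columns)

# Long-format ("columns") data: one row per age-year cell.
age_grid, year_grid = np.meshgrid(np.arange(5), np.arange(2000, 2006), indexing="ij")
ages, years = age_grid.ravel(), year_grid.ravel()

# Synthetic log rates with additive age, period, and cohort effects.
log_m = -6.0 + 0.10 * ages - 0.02 * (years - 2000) + 0.01 * ((years - ages) % 3)

X = apc_design_matrix(ages, years)
coef, *_ = np.linalg.lstsq(X, log_m, rcond=None)  # minimum-norm solution
fitted = X @ coef
```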

2.3.3 Gradient boosting machine (GBM)

GBM is a tree-based algorithm proposed by Friedman (2001) that uses fixed-size decision trees as weak learners. The prediction is obtained by a sequential approach, where each decision tree uses the information from the previous one to improve the current fit. Given a current model fit, \(F_{m}({\textbf{x}})\), the algorithm provides a new estimate, \(F_{m+1}({\textbf{x}})=F_{m}({\textbf{x}})+ h_m({\textbf{x}})\), where \(h_m({\textbf{x}})\) is the weak learner fitted on the model residuals \(y-F_{m}({\textbf{x}})\) with y target variable.
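The sequential update \(F_{m+1}({\textbf{x}})=F_{m}({\textbf{x}})+ h_m({\textbf{x}})\) can be illustrated with a toy implementation that uses one-split trees (stumps) on a single predictor and squared-error loss. This sketch assumes no shrinkage and no subsampling, and it is not the GBM implementation used in the paper; the function names and the sine-shaped target are hypothetical.

```python
import numpy as np

def fit_stump(x, residuals):
    """Return (threshold, left_value, right_value) minimizing squared error."""
    best = None
    for threshold in np.unique(x)[:-1]:      # keeps both sides non-empty
        left = x <= threshold
        left_value = residuals[left].mean()
        right_value = residuals[~left].mean()
        sse = ((residuals[left] - left_value) ** 2).sum()
        sse += ((residuals[~left] - right_value) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left_value, right_value)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)

def gradient_boost(x, y, n_rounds=20):
    """Start from the mean and add one stump fitted to the residuals per round."""
    prediction = np.full(len(y), y.mean())    # F_0
    stumps = []
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)  # weak learner h_m on residuals
        prediction = prediction + predict_stump(stump, x)  # F_{m+1} = F_m + h_m
        stumps.append(stump)
    return prediction, stumps

x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x)
prediction, stumps = gradient_boost(x, y)
mse_before = float(np.mean((y - y.mean()) ** 2))
mse_after = float(np.mean((y - prediction) ** 2))
```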

2.3.4 eXtreme gradient boosting machine (XGBM)

XGBM is an efficient implementation of gradient boosting decision trees proposed by Chen et al. (2015), and designed to be fast to execute and highly effective. To verify if a simple data preprocessing has some meaningful effect on the quality of models, we apply XGBM to both raw and preprocessed data: the latter is obtained by centering and scaling the raw data using mean and standard deviation.

2.4 Traditional diagnostic tools

In the following, we briefly mention some traditional diagnostic tools that are often used in the literature to assess the goodness-of-fit of a mortality model.

  • Analysis of mortality residuals (or standardized mortality residuals) calculated as the difference between the crude estimate of mortality rate by age and year based on observed data and the corresponding estimated mortality rate using a specified mortality model. For example, Cairns et al. (2010) verified that they are consistent with the hypothesis of i.i.d. N(0, 1) and have zero correlation both across adjacent ages and across adjacent years.

  • Proportion of variance explained (\(R^2\)) by the model or the parameters of the model (see, e.g., Bongaarts (2005))

  • Model selection criteria that penalize the log-likelihood with the increase in number of parameters: Akaike Information Criterion (AIC), Schwarz-Bayes Criterion (SBC) [or Bayes Information Criterion (BIC)] and Likelihood-ratio test (LRT) (Li et al. 2009). Note that in this case the evaluation of the goodness-of-fit is given on the basis of the log-likelihood.

  • Qualitative model selection criteria: Cairns et al. (2008) provide a list of criteria that might be considered desirable in a mortality model, such as, e.g., ease of implementation, parsimony, and transparency. Relating to the fitting ability to the observed data, the model should be consistent with historical data, and parameter estimates should be robust relative to the range of data used. For example, Djeundje et al. (2022) consider consistency, stability, and parsimony in addition to standard goodness-of-fit indices (deviance residual, BIC, and residual patterns).

  • Checking for the absence of autocorrelation in the residuals of the model by the Portmanteau test (see, e.g., Torri (2011)).

2.5 Contrast trees

Contrast trees is an innovative approach that, leveraging tree-based machine learning techniques, allows for deeply assessing the goodness-of-fit of a model by identifying where the model performs worse. Specifically, the goal of the Contrast trees method is to uncover regions in the predictor variables space presenting very high values of the error rate quantified by a discrepancy measure (Friedman 2020). In the context of mortality modeling, the main feature that distinguishes this method from the traditional diagnostic methods mentioned above is the ability to automatically identify the regions in which a given model provides a high error for certain combinations of ages and calendar years. Furthermore, Contrast trees have the advantage of being easy to interpret and can be used as a diagnostic tool to detect the inaccuracies of every kind of model, for example, both those whose parameters estimate is based on a likelihood function and those based on machine learning algorithms. Our analysis shows how Contrast trees can be used for assessing the goodness-of-fit of different mortality models to observed data.

Suppose we have a set of predictor variables \(x = (x_1, x_2,...,x_p)\) and two outcome variables y and z for each x. We aim to find those values of x for which the respective distributions of y|x and z|x, or some statistics such as mean or quantiles, are most different. In summary, Contrast trees provide a lack-of-fit measure for the conditional distribution \(p_y(y | x)\), or some statistics.

Consider the \(M^{th}\) iteration, where the tree splits the space of the predictor variables into M disjoint regions \(\lbrace R_m\rbrace ^M_{m=1}\), each one containing a subset of the data. We denote by \(f_m^{(l)}\) and \(f_m^{(r)}\) the fractions of the observations in \(R_m\) that fall in the left and right regions of a candidate split, respectively. The quantities \(d_m^{(l)}\) and \(d_m^{(r)}\) represent the discrepancy measures associated with the left and right regions, respectively. Given a specified subset of the data \(\lbrace x_i,y_i,z_i \rbrace _{x_i\in R_m}\), a discrepancy measure between y and z values can be generally defined as:

$$\begin{aligned} d_m=D\left( \lbrace y_i\rbrace _{x_i\in R_m},\lbrace z_i\rbrace _{x_i\in R_m}\right) \end{aligned}$$
(4)

The quality of a split is quantified by the following measure:

$$\begin{aligned} Q_m(l,r)=\left( f_m^{(l)} \cdot f_m^{(r)}\right) \cdot \max \left( d_m^{(l)}, d_m^{(r)} \right) ^{\beta } \end{aligned}$$
(5)

The factor \(\left( f_m^{(l)} \cdot f_m^{(r)}\right) \) discourages highly asymmetric splits in anticipation of further splitting, while the other factor \(\max \left( d_m^{(l)}, d_m^{(r)} \right) ^{\beta }\) attempts to isolate the \(R_m^{(l)}\) and \(R_m^{(r)}\) regions with high discrepancy. The parameter \(\beta \) regulates the relative influence of the two factors but, as stated by Friedman (2020), results are insensitive to its value. We will use \(\beta =2\) in our analysis.
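The trade-off encoded in Eq. 5 can be seen in a short numerical sketch. The function `split_quality` is a hypothetical helper, and the fractions and discrepancies are made-up values chosen only to compare a balanced split with a highly asymmetric one that has the same discrepancies.

```python
def split_quality(frac_left, frac_right, disc_left, disc_right, beta=2.0):
    """Eq. 5: product of the two fractions times the larger discrepancy
    raised to the power beta."""
    return (frac_left * frac_right) * max(disc_left, disc_right) ** beta

# Same discrepancies on both candidate splits (illustrative values).
balanced = split_quality(0.50, 0.50, disc_left=0.1, disc_right=0.4)
asymmetric = split_quality(0.05, 0.95, disc_left=0.1, disc_right=0.4)
```

With these values the balanced split scores higher than the asymmetric one, since the product of the fractions is largest when the two sides contain the same number of observations.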

The choice of the discrepancy measure depends on the problem to be solved, allowing Contrast trees to be applied to a variety of problems (Friedman 2020). They are similar to loss criteria in prediction problems. The discrepancy measures that could be appropriate to represent the problem under investigation are the following:

$$\begin{aligned} d_m^{[1]}= & {} \frac{1}{N_m} \sum _{x_i \in R_m} |y_i-z_i |\end{aligned}$$
(6)
$$\begin{aligned} d_m^{[2]}= & {} \frac{1}{2N_m-1} \sum _{i =1}^{2N_m-1} \frac{\left| {\hat{F}}_y(t_{(i)})-{\hat{F}}_z(t_{(i)})\right| }{\sqrt{i\cdot (2N_m-i)}} \end{aligned}$$
(7)

where \(N_m\) is the number of observations in the region \(R_m\), \(t_{(i)}\) is the \(i^{th}\) value of t in sorted order, and \({\hat{F}}_y\) and \({\hat{F}}_z\) are the respective empirical cumulative distributions of y and z. See Friedman (2020) for further details about the tree split procedure.

In numerical applications, for the sake of simplicity, we use the discrepancy measure \(d_m^{[1]}\).
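The two discrepancy measures in Eqs. 6 and 7 can be sketched in Python as follows. For \(d_m^{[2]}\), the sketch assumes that t is the pooled sample of the y and z values in the region, so that it contains \(2N_m\) values; this reading, the function names, and the toy inputs are assumptions of the sketch and should be checked against Friedman (2020).

```python
import numpy as np

def mean_abs_discrepancy(y, z):
    """Eq. 6: average absolute difference between paired y and z values."""
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    return float(np.mean(np.abs(y - z)))

def distribution_discrepancy(y, z):
    """Eq. 7, assuming t is the pooled, sorted sample of y and z."""
    y, z = np.asarray(y, dtype=float), np.asarray(z, dtype=float)
    n = len(y)
    pooled = np.sort(np.concatenate([y, z]))
    t = pooled[:-1]                       # t_(1), ..., t_(2N-1)
    i = np.arange(1, 2 * n)
    # Empirical CDFs of y and z evaluated at each t_(i).
    cdf_y = np.searchsorted(np.sort(y), t, side="right") / n
    cdf_z = np.searchsorted(np.sort(z), t, side="right") / n
    weights = 1.0 / np.sqrt(i * (2 * n - i))
    return float(np.sum(np.abs(cdf_y - cdf_z) * weights) / (2 * n - 1))

d1 = mean_abs_discrepancy([1.0, 2.0, 3.0], [1.0, 2.0, 5.0])
d2_same = distribution_discrepancy([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
d2_shifted = distribution_discrepancy([0.0, 1.0, 2.0], [10.0, 11.0, 12.0])
```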

2.5.1 Lack-of-fit contrast curves

The results obtained by applying the Contrast trees to different models can be summarized in the lack-of-fit contrast curves, which have point coordinates

$$\begin{aligned}{}[f_m,{\bar{d}}_m] \end{aligned}$$

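where \(f_m=\frac{1}{N}\sum _{d_j\ge d_m}N_j\) is the fraction of observations falling in the regions whose discrepancy is at least \(d_m\) (\(N_j\) is the number of observations in region \(R_j\) and N is the total number of observations), and \({\bar{d}}_m=\frac{\sum _{d_j\ge d_m}d_j N_j}{\sum _{d_j\ge d_m}N_j}\) is the average discrepancy over those regions, weighted by the number of observations.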

From the above expressions, we can deduce that the lack-of-fit curves by construction are decreasing. By way of example, we show a typical pattern of this curve in Fig. 1, where the leftmost point on the abscissa-axis provides the fractions of observations that fall into the regions with the higher discrepancy, while the rightmost point corresponds to all the observations (\(f_m=1\)). Looking at the ordinate-axis, the leftmost point on each curve represents the \({\bar{d}}_m\) value of the largest discrepancy region of its corresponding tree; the rightmost point provides the \({\bar{d}}_m\) value across all regions. Points in between give a \({\bar{d}}_m\) value over the regions with the highest discrepancy that contain the corresponding fraction of observations (Friedman 2020).
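The construction of the curve coordinates can be sketched as follows, assuming the region discrepancies \(d_j\) and counts \(N_j\) are already available from a fitted tree. The function `lack_of_fit_curve` and the three-region example are hypothetical; regions with tied discrepancies would yield separate points in this sketch.

```python
import numpy as np

def lack_of_fit_curve(discrepancies, counts):
    """Return (f_m, d_bar_m) after sorting regions by decreasing discrepancy."""
    d = np.asarray(discrepancies, dtype=float)
    n = np.asarray(counts, dtype=float)
    order = np.argsort(-d)                 # highest-discrepancy region first
    d, n = d[order], n[order]
    cum_n = np.cumsum(n)
    fractions = cum_n / n.sum()            # f_m
    avg_disc = np.cumsum(d * n) / cum_n    # count-weighted average d_bar_m
    return fractions, avg_disc

# Three hypothetical regions with discrepancies and observation counts.
fractions, avg_disc = lack_of_fit_curve([0.1, 0.4, 0.2], [50, 10, 40])
```

Each point averages over progressively lower-discrepancy regions, which is why the curve cannot increase from left to right.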

Fig. 1 Example of a lack-of-fit contrast curve

2.6 Contrast boosting

To improve the models’ accuracy, Friedman (2020) proposes a contrast-boosting technique that enables models to provide more accurate predictions by means of an iterative procedure that reduces the uncovered errors. Boosting gradually modifies a starting value of z using an additive term, reducing its discrepancy with y. The resulting prediction for z is then adjusted accordingly. We now briefly describe the Estimation Contrast Boosting technique, which takes z as an estimate of a parameter of the full conditional distribution of a target variable given a set of predictor variables, \(p_y(y | x)\). The procedure starts with the estimation of a contrast tree (CT). The z values within a certain region \(R_m^{(1)}\) of the CT are then updated so that their discrepancy with y is zero, i.e., \(d_m=0\) in Eq. 4. The resulting update \(\delta ^{(1)}_m\) is then used to adjust the prediction from z to \(z^{(1)}=z+\delta ^{(1)}_m\) for \(x\in R^{(1)}_m\). The updated predictions \(z^{(1)}\) are then used as input for a second CT and updated from \(z^{(1)}\) to \(z^{(2)}=z^{(1)}+\delta ^{(2)}_m\) for \(x\in R^{(2)}_m\), and so on. The final boosted estimate is then \({\tilde{z}}(x)=z(x)+\sum _{k=1}^{K}\delta _{m}^{(k)}\), where K is the maximum number of iterations and, at each iteration k, m indexes the region \(R^{(k)}_m\) containing x. In practice, each updated value of z is contrasted with y, producing new regions \(R^{(k)}_m\) (\(1\le k\le K\)) with corresponding updates \(\delta _{m}^{(k)}\).
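The iterative update can be illustrated with a heavily simplified Python sketch. Two assumptions are made loudly here: the regions are produced by a hypothetical stand-in (`quantile_regions`, equal-count bins of a single predictor) rather than by a fitted contrast tree, and the discrepancy is taken to be the absolute difference between the means of y and z in a region, so that the update \(\delta _m^{(k)}\) that sets it to zero is the mean residual. The conTree package is not used, and this is not the procedure's reference implementation.

```python
import numpy as np

def quantile_regions(x, n_regions):
    """Hypothetical stand-in for a contrast tree: equal-count bins of x."""
    edges = np.quantile(x, np.linspace(0.0, 1.0, n_regions + 1)[1:-1])
    return np.searchsorted(edges, x, side="right")

def contrast_boost(x, y, z, region_counts=(2, 3, 5)):
    """Apply one additive update per region at each iteration k."""
    z_boosted = np.asarray(z, dtype=float).copy()
    for n_regions in region_counts:            # iteration k = 1, ..., K
        labels = quantile_regions(x, n_regions)
        for label in np.unique(labels):
            in_region = labels == label
            # delta makes mean(y) - mean(z) zero inside the region.
            delta = np.mean(y[in_region] - z_boosted[in_region])
            z_boosted[in_region] += delta
    return z_boosted

x = np.linspace(0.0, 1.0, 60)
y = x ** 2                    # "observed" values (synthetic)
z = np.zeros_like(x)          # poor initial estimate
z_boosted = contrast_boost(x, y, z)
mse_before = float(np.mean((y - z) ** 2))
mse_after = float(np.mean((y - z_boosted) ** 2))
```

In the actual procedure the regions at each iteration would come from a new contrast tree grown on the updated predictions, so they would concentrate on wherever the remaining discrepancy is largest.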

3 Results

We implement the analyses using the conTree R package developed by Friedman and Narasimhan (2020). We set to 100 the maximum tree size corresponding to the number of regions. It is worth noting that the choice of this parameter is not straightforward because it involves a trade-off between discrepancy and interpretability. The smaller the trees, the larger the regions (defined by simple rules and easy to be interpreted). The larger the trees, the higher the potential to uncover small high discrepancy regions (defined by complex rules).

The models’ performance results on the test set are summarized in the lack-of-fit contrast curves, obtained by contrasting the observed mortality data with the estimates provided by each model. These curves are shown in Figs. 2, 3, and 4 for the age groups 0–29, 30–60, and 61–90, respectively. Panels (a) of these figures refer to the lack-of-fit curves obtained without applying Contrast boosting (Base models), while panels (b) refer to the lack-of-fit curves obtained after applying Contrast boosting to the output of the models (Boosted models). For the 0–29 age group (see Fig. 2), both the APC and XGBM models have the lowest discrepancy values for each fraction of observations, providing the best fit. The average discrepancy for this age group is higher than for the 30–60 age group. The 0–29 age group is known to be characterized by high accidental mortality, the so-called “accident hump” around age 20–25, due to accidental deaths or suicides caused by increased risk-taking behavior. Mortality at age 0–29 is therefore hard to predict, and Contrast boosting is not able to actually reduce the average discrepancy. For the 30–60 age group (see Fig. 3), the APC model seems to perform best across all regions since its discrepancy values are consistently lower than those of the other models. For the XGBM models, we can observe that the model applied to preprocessed data (XGBM_preproc) performs better than the model applied to raw data in the regions with the highest average discrepancy. From the scale of the plots, we can see that Contrast boosting reduces discrepancy across almost all regions for the GBM and LC models, where the relative effect of boosting is particularly evident. For the 61–90 age group (see Fig. 4), the GBM model seems by far the worst-performing model. Although the application of Contrast boosting significantly reduces the discrepancy, the GBM continues to be less accurate than the other models. It should also be noted that the effect of Contrast boosting in high-discrepancy regions for the other models is negligible, except for the APC.

Fig. 2 Lack-of-fit contrast curves in the log scale for APC, LC, GBM, XGBM and XGBM_preproc for the 0–29 age group. Left panels: base model; Right panels: boosted model

Fig. 3 Lack-of-fit contrast curves in the log scale for APC, LC, GBM, XGBM and XGBM_preproc for the 30–60 age group. Left panels: base model; Right panels: boosted model

Fig. 4 Lack-of-fit contrast curves in the log scale for APC, LC, GBM, XGBM and XGBM_preproc for the 61–90 age group. Left panels: base model; Right panels: boosted model

Table 1 reports the values of the average discrepancy measure for both the base and the boosted models considered in the analysis. The APC and the XGBM base models provide the lowest average discrepancy values (0.000410 and 0.000417, respectively), which remain substantially unchanged after the Contrast boosting procedure. The APC model shows the lowest value of \({\bar{d}}_m\) also for the age group 30–60, in line with the dynamics of the lack-of-fit curves depicted in panel (b) of Fig. 2 and panel (a) of Fig. 4. However, the lack-of-fit curves provide more structured information than the average discrepancy, in particular regarding how and how much \({\bar{d}}_m\) varies across the input space. For example, for the age group 61–90 in the base model (Fig. 4, panel a), we can appreciate that the main difference among models (except for GBM, which is out of range) measured by the average discrepancy is caused by the high discrepancy regions (where the fraction of observations is less than about 0.20). For ages 61–90, the GBM base model shows the worst fit to the observed mortality data. Although Contrast boosting produces a strong improvement in the discrepancy measure, GBM remains the worst model in terms of discrepancy. Contrast boosting is also very effective for the GBM model in the age group 30–60, as it heavily lowers (−30%) the average discrepancy between observed and estimated values.

Table 1 Values of the average discrepancy \({\bar{d}}_m\) calculated on \(m_{x,t}\) in the test set

For a comparison with the average discrepancy, we also calculate the Root Mean Square Error (RMSE) and Mean Absolute Percentage Error (MAPE) on the base model and the boosted one. Intuitively, the three measures \({\bar{d}}_m\), RMSE, and MAPE quantify the "distance" between the estimates and the actual observations. However, the average discrepancy is an innovative measure summarizing the discrepancy over all the regions identified by the Contrast trees, while RMSE and MAPE are commonly used error measures calculated on the overall input space without distinguishing by region.
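For reference, the two overall error measures can be written as in the following sketch. The sketch assumes that MAPE is expressed in percent and that all observed rates are non-zero; the function names and the two toy rates are hypothetical.

```python
import numpy as np

def rmse(observed, estimated):
    """Root mean square error over the whole input space."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.sqrt(np.mean((observed - estimated) ** 2)))

def mape(observed, estimated):
    """Mean absolute percentage error, in percent (observed must be non-zero)."""
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    return float(np.mean(np.abs((observed - estimated) / observed)) * 100.0)

observed = [0.010, 0.020]      # toy central death rates
estimated = [0.011, 0.018]
rmse_value = rmse(observed, estimated)
mape_value = mape(observed, estimated)
```

Both functions aggregate over all age-year cells at once, in contrast with the average discrepancy, which is built from region-level quantities.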

By comparing Table 2 showing the values of RMSE and MAPE with Table 1 reporting the values of the average discrepancy, we note a greater convergence of the error measures in the boosted models rather than in the base models. This result is intuitively straightforward since the boosted models are obtained by just reducing the discrepancy measure.

Table 2 Values of the RMSE and MAPE calculated on \(m_{x,t}\) in the test set

We also calculate average discrepancy, RMSE, and MAPE on the logarithm of the central death rates (Tables 3, 4). These measures assign a relatively large weight to errors at young ages, while error measures calculated on the central death rates assign a large weight to errors at older ages. Indeed, for the age group 0–29, all the errors reported in Tables 3, 4 are significantly higher than those in Tables 1, 2. The errors calculated on the logarithm of the central death rates highlight the ability of Contrast boosting to reduce the inaccuracy of GBM and XGBM_preproc in fitting observed mortality at ages 0–29.

Table 3 Values of the average discrepancy \({\bar{d}}_m\) calculated on \(\log {\left( m_{x,t}\right) }\) in the test set
Table 4 Values of the RMSE and MAPE calculated on \(\log {\left( m_{x,t}\right) }\) in the test set

The most interesting feature of the application of Contrast trees to the field of mortality estimate is the automatic identification of the regions of the predictors’ space where a given model provides high discrepancy values for certain combinations of ages-years obtained by comparing the model estimates with the observed mortality rates. These regions can be easily detected and possibly interpreted, providing a further explanation of the model performances as well as helping to assess whether a model can be reliable or not. Figures 5 and 6 show the heatmap of all the error regions for the base model and the boosted one, respectively. Low discrepancy regions are painted in green, while high discrepancy regions are painted in red. For the sake of plot readability, we colored in purple the regions presenting a discrepancy value exceeding 3e-04, 6e-04, and 0.008 for the age groups 0–29, 30–60, and 61–90, respectively.

We can generally observe that the regions’ width and shape change from model to model. Some regions show remarkable mortality estimation errors in specific age groups, others in specific intervals of years, others in a specific range of cohorts. All the models considered show high discrepancy values in the first year of age (Fig. 5, age group 0–29, left panels), confirming the difficulty of adequately estimating the mortality of newborns. This situation remains unchanged after the application of Contrast boosting, which, in this case, does not seem to be effective (Fig. 6, age group 0–29, left panels). For the age group 30–60 in the base model (Fig. 5, central panels), the two XGBM models show high discrepancy values after age 45–46, while GBM does so in the years 2000–2018. The LC model instead shows high errors in estimating the mortality of cohorts born between 1920 and 1932. Considering the 61–90 age group (Fig. 5, right panels), we notice that the GBM model continues to fail in estimating mortality rates in the years 2000–2018, while the LC model (and also APC) fails for the cohorts born between 1920 and 1932. By comparing the results for the base models (Fig. 5) with those for the boosted ones (Fig. 6), we observe a clear effect of boosting on the GBM model for the 30–60 and 61–90 age groups and on the XGBM for the 30–60 age group.

Fig. 5 Contrast trees regions, Base model. Years 1950–2018. Regions presenting a discrepancy value exceeding 3e-04 (age 0–29), 6e-04 (age 30–60), and 0.008 (61–90) are colored in purple

Fig. 6 Contrast trees regions, Boosted model. Years 1950–2018. Regions presenting a discrepancy value exceeding 3e-04 (age 0–29), 6e-04 (age 30–60), and 0.008 (61–90) are colored in purple

4 Discussion

Evaluating, and thus eventually improving, the fit of mortality models is crucial for both demographers and actuaries. Indeed, in particular situations that are common in actuarial practice, data quality can make mortality estimation difficult. A prime example is the case of small subpopulations, where a common method such as the Lee–Carter may not guarantee reliable estimation. In mortality modeling, the objective of diagnostic checking is to ascertain whether the model fits the historical data by obeying an underlying probabilistic hypothesis. This procedure is carried out by checking residual diagnostics under a Gaussian or, more often, a Poisson assumption (see, e.g., Renshaw and Haberman (2006)). Leveraging Friedman (2020), who introduces contrast trees to estimate the full conditional probability distribution without any parametric assumptions, we propose a prominent alternative, with particular regard to the intersection of the machine learning and mortality modeling fields. In this sense, our proposal fills the gap between mortality modeling and model diagnostics, particularly for nontraditional modeling such as a machine learning framework.

Contrast trees consist of a general method based on machine learning that can be applied to any model, expressed as a regression model, to evaluate the goodness of fit and identify the worst-performing regions in the input space. The main characteristic that discriminates this method from traditional diagnostic tools is automatically identifying the regions in which a given model produces a high error for certain combinations of ages and calendar years. Well-known diagnostic tools often used in the literature to assess the goodness-of-fit of a mortality model, such as BIC and AIC, require the likelihood function, which is not available for machine learning models. Therefore, Contrast trees provide a unified approach for assessing and comparing the accuracy of traditional mortality models with machine learning algorithms.

In Contrast trees, the detection of the regions in which a model performs worst can be considered an evolution of the standard analysis of residuals, in which the detection of the highest residuals is typically assigned to graphical analyses using heatmaps and scatter plots (Cairns et al. 2009; Villegas et al. 2018), and to summary measures like RMSE and MAPE calculated on the overall input space and not by region. Conversely, the decision tree structure of Contrast trees enables quantifying the discrepancy between the estimates provided by a model and the actual observations in each region identified by Contrast trees.