1 Introduction

During the last century, populations around the world have experienced a continuous longevity growth, despite country-specific dynamics and disparities in age-at-death distribution ( Aburto and van Raalte 2018; Nigri et al. 2022). Broadly speaking, mortality has declined at all ages, with different intensities, due to the action of heterogeneous factors affecting the human lifespan. For instance, the diffusion of preventative health measures and improved medical care have positively impacted on population’s life expectancy (see, e.g. Vaupel et al. 2021 and Zarulli et al. 2021). Moreover, socioeconomic and environmental conditions have been determinants of how mortality changed over time and among populations (see, e.g. Cairns et al. 2019 and Khomenko et al. 2021). Although human lifespan increments are enjoyable evidence, as longevity increases, economic challenges emerge to ensure an adequate income and health treatments for all individuals throughout their old age. The social costs to cater for the needs of increasingly elderly populations involve a burdensome financial position for governments, annuity providers and pension schemes. These entities face the risk of paying out benefits for a much longer period, suffering financial instability conditions. In addition, such a circumstance may be exacerbated by future social advances. For instance, it is plausible that scientific and technological progresses will be able to boost medicine and lifestyles behaviours, making future improvements in life expectancy highly unpredictable (see, e.g. Keilman 2019). Indeed, population forecasts are heavily based on the current knowledge in vital processes, from which to extrapolate future trends. As a consequence, unexpected changes may occur due to random fluctuations or structural changes in the mortality trends. Therefore, uncertainty in predicting future longevity scenarios is a fact and achieving accuracy in mortality forecasts is crucial to support a long-lived society financially. To this end, both the demographic and the actuarial literature have been enriched over time by excellent works focused on stochastic mortality modelling. Models widely used to forecast mortality rates are extrapolative and draw inspiration from the pioneering Lee-Carter model (Lee and Carter 1992, hereinafter LC). The latter is a log-bilinear model embedding age-period effects and an additive Gaussian error structure. Various LC model extensions have been proposed in the literature to overcome model weaknesses and describe other relevant mortality patterns. For instance, Brouhns et al. (2002) formulated the canonical LC model in terms of Poisson regression, later extended in a Bayesian framework by Czado et al. (2005). A multi-factor age–period extension of LC was proposed by Booth et al. (2002), while Renshaw and Haberman (2003) exploited the LC model to provide mortality forecasting with age-specific enhancement. Prominent developments in stochastic mortality modelling were furnished by Renshaw and Haberman (2006), introducing cohort effects, and by Cairns et al. (2006). The latter proposed a two-factor stochastic representation for the logit of death probabilities, opening the way for further generalizations embodying multiple period and cohort effects (Cairns et al. 2009). Afterwards, Plat (2009) gathered the LC, the Renshaw and Haberman and the Cairns-Blake-Dowd models to establish a unified approach. Since a large number of mortality models have been proposed in the literature, Hunt and Blake (2014) designed a general procedure for constructing parametric mortality models able to catch different age-period-cohort effects present in the data.

Designating the best mortality model is not trivial, and the choice stems from the satisfaction of suitable criteria. However, a model may not necessarily dominate all others based on the selected criteria. For instance, a model could be better than others looking at its goodness of fit to historical data; at the same time, such a model could be less robust or provide less accurate projections than others. Furthermore, the model comparison depends on the mortality data investigated and the purposes of the analysis. Subsequently, many articles in the literature have compared different mortality models exploiting specific mortality experiences and befitting metrics. Cairns et al. (2009) offered an extensive analysis comparing empirical fits of eight stochastic mortality models on US and England & Wales males mortality data. The authors examined model performances by means of both qualitative and quantitative model selection criteria. The former refers to desirable mortality model’s features, that is, parsimony, transparency, ability to generate sample paths, incorporation of cohort effects, aptitude to provide non-trivial correlation structures between ages. Quantitative criteria allow for assessing consistency and robustness of parameter estimates with respect to the period of data observed. Dowd et al. (2010b) tailored hypothesis tests to strengthen models evaluation in terms of goodness-of-fit and to consider England & Wales males mortality data. A different perspective to benchmark mortality models was developed in Dowd et al. (2010a). The latter designed a backtesting framework to gauge (ex-post) the forecasting performances of six stochastic mortality models fitted on England & Wales male mortality data. Cairns et al. (2011) focused on the plausibility of stochastic mortality model forecasts by means of innovative qualitative criteria, namely: biological reasonableness, plausibility of predicted levels of uncertainty in forecasts at different ages, projections robustness with respect to the sample period used to fit the model. A considerable model comparison is also promoted in Haberman and Renshaw (2011), whose numerical experiment include both US and England & Wales female’s mortality experiences. Additional quantitative model comparisons have been performed in Lovász (2011), concerning Finnish and Swedish populations, and in Biffi and Clemente (2014) and Carfora et al. (2017), concerning Italian population.

In the spirit of the aforementioned literature, the present study furnishes a comparative analysis of the forecasting ability of stochastic mortality models. As suggested in Dowd et al. (2010a), a suitable mortality model should offer both proper in-sample results and plausible ex-ante forecasts, as well as generate adequate ex-post out-of-sample performances. In particular, forecasted uncertainty levels have a central role since it reveals the potential, unexpected longevity rise and its biological plausibility. Therefore, we implement an ex-post assessment of the projected prediction intervals, comparing and measuring their plausibility through uncertainty metrics not yet adopted: the Prediction Interval Coverage Probability (hereinafter PICP) and the Mean Prediction Interval Width (hereinafter MPIW). The former generally represents the probability of observing mortality outcomes over the backtesting horizon falling within the prediction intervals. At the same time, the latter expresses the average amplitude of such intervals. Both the PICP and the MPIW are usually employed for rating neural networks prediction intervals (see, e.g. Khosravi et al. 2011, but they can be efficiently utilized for any predictive model. It is straightforward to note that prediction intervals whose PICP is the highest possible value are a matter of interest; we aim for all mortality realizations within the variability range. Such a scenario may be simply achieved through a wide prediction interval, but the latter could indicate poor predictive believability. For instance, too large prediction intervals are not informative about the likely uncertainty of future mortality outcomes; furthermore, too wide intervals may be associated with low or null PICP values. The latter is the worst-case scenario as it indicates the model’s failure to capture the mortality trend. Hence, it is important to consider jointly the PICP and the MPIW, defining an associated criterion for the model comparison and selection. From a backtesting perspective, we advance a preference criterion that gives credibility to the mortality model with the highest PICP and the lower MPIW. Our proposal identifies an additional backtesting strategy whose joint application with commonly used quantitative metrics improves the model selection process. As further contribution, we illustrate how the suggested criterion performs ranking seven stochastic mortality models in an empirical analysis. To this end, we deal with the mortality experiences of five populations, namely Australia, England & Wales, Italy, Japan and the USA, and for both genders by exploiting the Human Mortality Database data (Human Mortality Database 2018). Our analysis shows that, as expected, no mortality model outperforms all the others. Despite both age- and country-specific characteristics, our proposal grants to identify models with mortality density forecasts more balanced across populations. The remainder of the paper is the following. Section 2 contains the methodological framework for investigating the stochastic mortality models. Section 3 explains the PICP and the MPIW metrics and the proposed preference criterion. Section 4 refers to the numerical experiments and related results obtained. Finally, Sect. 5 sets out concluding remarks.

2 Methodological framework

Let \({\mathcal {X}}= \{x_1,x_2, \ldots ,x_{\omega }\}\) be the age set, and \({\mathcal {T}}=\{t_1,t_2, \ldots , t_n\}\) the calendar year set. For each last birthday age \(x\in {\mathcal {X}}\) during calendar year \(t \in {\mathcal {T}}\), we introduce the following variables of interest:

  • \(D_{x,t}\), i.e. the random number of deaths;

  • \(d_{x,t}\), i.e. the observed number of deaths;

  • \(E_{x,t}^{0}\), i.e. the initial exposed to risk;

  • \(q_{x,t}\), i.e. the one-year probability of death.

In the present work we resort the Generalized Age-Period-Cohort (hereinafter GAPC) as unified mortality modeling framework, allowing for a fair comparison among stochastic mortality models (see e.g. Currie 2016; Villegas et al. 2018). In particular, we assume that:

$$\begin{aligned} D_{x,t} \sim \text {Binomial}(E^0_{x,t}, q_{x,t}), \quad \forall x \in {\mathcal {X}}, \forall t \in {\mathcal {T}}, \end{aligned}$$
(1)

coherently representing the systematic mortality component by the following predictor:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} = a_x + \sum _{l = 1}^L b_x^{(l)} \kappa _t^{(l)} + c_x\gamma _{t-x}. \end{aligned}$$
(2)

By the means of Eq. (2), one-year probabilities of death are characterized through a sigmoidal transformation of age-dependent parameters \(a_x, b_x^{(l)}\) and \(c_x\), time-dependent parameters \(\kappa _t^{(l)}\), and the cohort parameter \(\gamma _p\). Most of the stochastic mortality models in the literature are special cases extracted from the GAPC setup. For instance, consider whether the age-dependent parameter \(a_x\) or the cohort parameter \(\gamma _{t-x}\), implies different mortality representations; again, varying the order of L, mortality models are shaped according to specific period effects portrayals.

2.1 Stochastic mortality models

We choose to deal with seven stochastic mortality models, summarized in Table 1, depicting the hinges of the two main stochastic mortality models families: the Lee-Carter family and the Cairns-Blake-Dowd family. We briefly illustrated their analytical structure in the following sections.

Table 1 Stochastic mortality models considered in the present work

2.1.1 LC model

The LC model provides two age-specific effects, a period effect and it does not acknowledge for cohort influences. Thus, by the means of Eq. (2) we have \(L = 1\) and \(c_x, \gamma _{t-x}= 0\), obtaining:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} = a_x + b_x^{(1)} \kappa _t^{(1)}. \end{aligned}$$
(3)

We need to make the LC model invariant with respect to parameters’ transformations, and the following constraints are then applied:

$$\begin{aligned} \sum _{x \in {\mathcal {X}}} b^{(1)}_x = 1, \quad \sum _{t \in {\mathcal {T}}} \kappa ^{(1)}_t = 0. \end{aligned}$$

2.1.2 RH model

The RH model develops the LC structure introducing an age-cohort bi-linear term, that is:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} = a_x + b_x^{(1)} \kappa _t^{(1)} + c_x \gamma _{t-x}. \end{aligned}$$
(4)

Similarly to the LC model, identifiability problems are avoided imposing the following constraints:

$$\begin{aligned} \sum _{x \in {\mathcal {X}}} b^{(1)}_x = 1, \quad \sum _{x \in {\mathcal {X}}} c_x = 1, \quad \sum _{t \in {\mathcal {T}}} \kappa ^{(1)}_t = 0, \quad \sum _{p = t_0-x_{\omega }}^{t_n-x_0} \gamma _p = 0. \end{aligned}$$

As suggested in Haberman and Renshaw (2011), we consider the specification \(c_x = 1\) aiming to bypass robustness issues suffered by the predictor in Eq. (4).

2.1.3 APC model

The APC model was originally proposed in the fields of medicine and demography (see e.g. Clayton and Schifflers 1987) and later introduced in actuarial literature by Currie (2006). Referring to the latter, the APC model involves an age effect, a period effect and a cohort effect, that is:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} = a_x + \kappa _t^{(1)} + \gamma _{t-x}. \end{aligned}$$
(5)

Such a model requires parameters constraints defined as below:

$$\begin{aligned} \sum _{t \in {\mathcal {T}}} \kappa ^{(1)}_t = 0, \quad \sum _{p = t_0-x_{\omega }}^{t_n-x_0}\gamma _p = 0. \end{aligned}$$

2.1.4 CBD model

The CBD model assumes \(L = 2\) time-dependent parameters, while \(a_x, \ c_x,\ \gamma _{t-x}= 0\). In addition, Cairns et al. (2006) accounts for \(b_x^{(1)} = 1\) and \(\ b_x^{(2)} = (x- {\bar{x}})\), where \({\bar{x}}\) denotes the average age for the age set considered. Hence, the predictor has the following expression:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} =\kappa _t^{(1)} + (x- {\bar{x}}) \kappa _t^{(2)}, \end{aligned}$$
(6)

and it is not affected by identifiability issues.

2.1.5 M6 model

The M6 model is a CBD’s extension due to the cohort effect inclusion. The latter is weighted by \(c_{x} = 1\), and the predictor becomes:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} =\kappa _t^{(1)} + (x- {\bar{x}}) \kappa _t^{(2)}+\gamma _{t-x}. \end{aligned}$$
(7)

Contrary to the CBD model, constraints have to be introduced:

$$\begin{aligned} \sum _{p = t_0-x_{\omega }}^{t_n-x_0} \gamma _p = 0,\quad \sum _{p = t_0-x_{\omega }}^{t_n-x_0} p\gamma _p = 0. \end{aligned}$$

2.1.6 M7 model

The M7 model boosts the M6 one, favoring the introduction of a third period effect weighted by a quadratic coefficient. In particular, the M7’s predictor is defined setting \(L = 3\), \(a_x = 0, \ b_x^{(1)} = 1\), \(b_x^{(2)} = (x- {\bar{x}})\), \(b_x^{(2)} = ((x- {\bar{x}})^2-{\hat{\sigma }}_x^2)\), \(c^{(1)}_{x} = 1\), where \({\hat{\sigma }}_x\) denotes the standard deviation of ages. In analytical terms, it holds that:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} =\kappa _t^{(1)} + (x- {\bar{x}}) \kappa _t^{(2)}+((x- {\bar{x}})^2-{\hat{\sigma }}^2_x) \kappa _t^{(3)} +\gamma _{t-x}. \end{aligned}$$
(8)

Regarding identifiability problems, the following constraints are imposed:

$$\begin{aligned} \sum _{p = t_0-x_{\omega }}^{t_n-x_0} \gamma _p = 0,\quad \sum _{p = t_0-x_{\omega }}^{t_n-x_0} p \ \gamma _p = 0 ,\quad \sum _{p = t_0-x_{\omega }}^{t_n-x_0} p^2 \ \gamma _p = 0 . \end{aligned}$$

2.1.7 M8 model

The M8 model is M6 model’s variant contemplating a non-constant coefficient related to the cohort effect, that is \(c^{(1)}_{x} = (x_c- {x})\) for some constant parameter \(x_c\) to be estimated. Thus, the predictor takes the form:

$$\begin{aligned} \log \frac{q_{x,t}}{1-q_{x,t}} =\kappa _t^{(1)} + (x- {\bar{x}}) \kappa _t^{(2)}+(x_c- {x})\gamma _{t-x}, \end{aligned}$$
(9)

and the associated parameter constraint is:

$$\begin{aligned} \sum _{p = t_0-x_{\omega }}^{t_n-x_0} \gamma _p = 0. \end{aligned}$$

2.2 Model fitting

Stochastic mortality models falling within the GAPC framework are examples of generalized, or non-generalized, linear models and they can be fitted by maximum likelihood estimation (see e.g. Currie 2016). Recalling assumption (1) and assuming i.i.d. death counts for each age and calendar year, we consider the log-likelihood below:

$$\begin{aligned} \log {\mathcal {L}}(d_{x,t}, {\hat{d}}_{x,t}) = \sum _{x,t} w_{x,t} \Bigg [ d_{x,t} \log \Bigg (\frac{{\hat{d}}_{x,t}}{E_{x,t}^0}\Bigg ) + (E_{x,t}^0 - d_{x,t}) \log \Bigg ( \frac{E_{x,t}^0 - {\hat{d}}_{x,t}}{E_{x,t}^0} \Bigg ) + \left( {\begin{array}{c}E_{x,t}^0\\ d_{x,t}\end{array}}\right) \Bigg ], \end{aligned}$$
(10)

where \(w_{x,t}\) is a 0/1 weight taking value 0 if a data cell, (xt), is ignored or 1 if the cell is incorporated, and \({\hat{d}}_{x,t}\) is the estimated number of deaths, that is:

$$\begin{aligned} {\hat{d}}_{x,t}=E_{x,t}^0{\hat{q}}_{x,t} = E_{x,t}^0 \frac{\exp \left( {\hat{a}}_x + \sum _{l = 1}^L {\hat{b}}_x^{(l)} {\hat{\kappa }}_t^{(l)} + {\hat{c}}_x{\hat{\gamma }}_{t-x}\right) }{1+\exp \left( {\hat{a}}_x + \sum _{l = 1}^L {\hat{b}}_x^{(l)} {\hat{\kappa }}_t^{(l)} + {\hat{c}}_x{\hat{\gamma }}_{t-x}\right) }. \end{aligned}$$
(11)

It is straightforward to note that the parameters are estimated by solving the problem:

$$\begin{aligned} \hat{\varvec{\theta }} = \arg \max _{\varvec{\theta }} \log {\mathcal {L}}(d_{x,t}, {\hat{d}}_{x,t})\end{aligned}$$

where \(\hat{\varvec{\theta }} = \left( {\hat{a}}_x ,{\hat{b}}_x^{(l)}, {\hat{\kappa }}_t^{(l)}, {\hat{c}}_x,{\hat{\gamma }}_{t-x}\right)\).

2.3 Model forecasting

Stochastic mortality models forecasts arise from time-series techniques describing period and cohort effects, while age-specific parameters are time invariant. Let \(t_n \in {\mathcal {T}}\) be the forecasting year and let \({\mathcal {T}}^{'}=\{t_n+h, \ h=1,\ldots ,s\}\) be the forecasting horizon. Then, the h-step ahead forecast for the predictor is defined as:

$$\begin{aligned} \log \frac{q_{x,t_n+h}}{1-q_{x,t_n+h}} = {\hat{a}}_x + \sum _{l = 1}^L {\hat{b}}_x^{(l)} \kappa _{t_h+h}^{(l)} + {\hat{c}}_x\gamma _{t_n+h-x}, \quad \forall x \in {\mathcal {X}} \end{aligned}$$
(12)

Periods effects at time \(t \in {\mathcal {T}}^{'}\), namely \(\varvec{\kappa }_t = ( \kappa _t^{(1)}, \kappa _t^{(2)}, \dots \kappa _t^{(L)}) \in {\mathbb {R}}^L\), are shaped through a multivariate random walk:

$$\begin{aligned} \varvec{\kappa }_t = \varvec{\theta } + \varvec{\kappa }_{t-1} + \varvec{\epsilon }_t, \quad \quad \varvec{\epsilon }_t \sim N({\varvec{0}}, \Sigma ), \end{aligned}$$
(13)

where \(\varvec{\theta } \in {\mathbb {R}}^L\) is the drift term and \(\Sigma \in {\mathbb {R}}^{L \times L}\) is the variance-covariance matrix of \(\varvec{\epsilon }_t\). Obviously, for stochastic mortality models with \(L=1\), the random walk collapses to the univariate case. Concerning the cohort’s parameter, we follow previous studies in literature (see e.g. Haberman and Renshaw 2011; Lovász (2011)) in order to generate projections embedding cohort effects. Therefore, the latter are modeled by an univariate ARIMA process, assuming stochastic independence with respect to \(\varvec{\kappa }_t\). In Table 2 we summarize the specific ARIMA(p,d,q) considered in the present work.

Table 2 ARIMA models for the cohort effect in the different mortality models

Point mortality forecasts are obtained determining the expectation of both time-dependent and cohort-dependent parameters, that is:

$$\begin{aligned} \log \frac{{\hat{q}}_{x,t_n+h}}{1-{\hat{q}}_{x,t_n+h}} = {\hat{a}}_x + \sum _{l = 1}^L {\hat{b}}_x^{(l)} {\mathbb {E}}\left( \kappa _{t_h+h}^{(l)}\right) + {\hat{c}}_x{\mathbb {E}}\left( \gamma _{t_n+h-x}\right) , \quad \forall x \in {\mathcal {X}}, \end{aligned}$$
(14)

so that future mortality trend is outlined without considering prediction errors. However, the latter generates uncertainty in mortality forecasts, as well as the estimation error related to \(\hat{\varvec{\theta }}\). Therefore, prediction intervals are substantial in supporting mortality projection’s reliability and to perform consistent mortality/longevity risk assessments. Any stochastic mortality model have to provide prediction intervals with a certain coverage probability, namely \((1 -\alpha )\), aiming that future mortality outcomes belong to the interval defined by the lower and upper bounds. More in detail, indicating with \({\hat{q}}^{LB}\) the projected prediction interval lower bound and with \({\hat{q}}^{UB}\) the projected upper bound, a desirable model satisfies the following:

$$\begin{aligned} {\mathbb {P}}\left( {\hat{q}}_{x,t_n+h}^{LB}\le q_{x,t_n+h}\le {\hat{q}}_{x,t_n+h}^{UB}\right) \ge (1-\alpha ), \quad \alpha \in (0,1). \end{aligned}$$
(15)

In addition, prediction interval’s quality depends on the distance between \({\hat{q}}_{x,t_n+h}^{LB}\) and \({\hat{q}}_{x,t_n+h}^{UB}\) for each time point in \({\mathcal {T}}^{'}\). For our purposes, we appraise models’ out-of-sample judging the quality of the projected prediction intervals in terms of their probability coverage and their width. In doing so, we take into account both model and parameter uncertainty. The method considered to generate such uncertainty is explained in Appendix A.

3 Prediction interval based metrics and forecasts comparison

To appreciate the prediction interval’s goodness, we propose to jointly examine specific prediction interval-based metrics, namely PICP and MPIW. The former inspects the prediction interval coverage by counting how many mortality realisations are wrapped in the probabilistic range, given a confidence level \((1-\alpha )\). It is defined as

$$\begin{aligned} PICP= \frac{1}{|{\mathcal {T}}^{'}|}\sum _{t \in {\mathcal {T}}^{'}} \mathbbm {1}_{\left\{ q_{x,t} \ \in \ [{\hat{q}}_{x,t}^{LB}, \ {\hat{q}}_{x,t}^{UB}]\right\} }, \end{aligned}$$
(16)

where \(\mathbbm {1}_{\left\{ \cdot \right\} }= 1\) if \(q_{x,t} \in [{\hat{q}}_{x,t}^{LB},\ {\hat{q}}_{x,t}^{UB}]\), and \(\varvec{1}_{\left\{ \cdot \right\} }=0\) otherwise. Broadly speaking, the PICP furnishes the estimate for the probability in Eq. (15) so that, theoretically, PICP should be greater or equal to the nominal value \((1-\alpha )\). The higher the PICP value, the more likely the coverage the prediction interval provides for future mortality realizations. However, PICP values lower than the confidence level could occur due to different reasons, such as in the presence of noisy mortality data or when the model under-fitting or over-fitting crops up. Larger PICP values could be achieved simply by considering wider prediction intervals, but the latter suggests poor predictive believability, and they are not of practical usefulness. Therefore, rating prediction interval quality by PICP without considering the prediction interval width is a little choice. We must evaluate prediction interval accuracy, referring simultaneously to both PICP and MPIW. The latter represents the prediction interval mean width over the forecasting horizon, that is:

$$\begin{aligned} MPIW = \frac{1}{|{\mathcal {T}}^{'}|} \sum _{t \in {\mathcal {T}}^{'}} {\hat{q}}_{x,t}^{UB}-{\hat{q}}_{x,t}^{LB}. \end{aligned}$$
(17)

The joint use of PICP and MPIW requires formulating a criterion to compare mortality models given their projected prediction intervals. Our approach traces a backtesting exercise. We provide a plain ex-post evaluation of prediction intervals inspecting the associated PICP and MPIW for each mortality model considered. In particular, we advance a preference criterion relying on the mortality model with the highest PICP and the lower MPIW. For instance, given two stochastic mortality models, namely \(M_k\) and \(M_j\), we propose to promote the model satisfying the following criterion:

$$\begin{aligned} M_k \succ M_j \quad \text {if} \quad PICP_k> PICP_j \ \wedge \ MPIW_k \le MPIW_j. \end{aligned}$$
(18)

The criterion in (18) may be drafted also in terms of weak preference, that is:

$$\begin{aligned} M_k\succeq M_j \quad \text {if} \quad PICP_k \ge PICP_j \ \wedge \ MPIW_k\le MPIW_j, \end{aligned}$$
(19)

allowing for the indifference situation \(M_k \sim M_j\) iff \(PICP_k=PICP_j\) and \(MPIW_k=MPIW_j\). The means of criterion (18), or (19), is streamlined: we aim to select a model whose prediction intervals enclose future mortality outcomes and, at the same time, the model (and associated parameters) must not yield an excessive uncertainty. For instance, Fig. 1 shows a graphical visualization of how the proposed preference criterion works. Let us consider five stochastic mortality models and let \(M_k\), \(k=1,\ldots ,5\), the coordinates representing such models in the plane (MPIWPICP). It is straightforward to note that \(M_3\succ M_1 \succ M_2\), but criterion (18) does not allow preference judgment in comparing \(M_1\) and \(M_4\) or \(M_1\) and \(M_5\). However, for financial and actuarial purposes, the latter case should be unambiguous: we are inclined to prefer the model \(M_1\) as it provides a greater probability coverage and is more informative regarding uncertainty. Thus, we interpret the preference criterion (18) as a gold standard, facilitating model selection if there is a prominent mortality model in terms of density forecasts; otherwise, intermediate situations occur, and we need to introduce an additional rule in defining the preference criterion. In particular, we say that \(M_k \succ M_j\) if the following criterion is satisfied:

$$\begin{aligned} PICP_k \ge PICP_j &\wedge&\ MPIW_k \le MPIW_j\, \vee \nonumber \\ PICP_k \ge PICP_j &\wedge&\ MPIW_k \le MPIW_j\, \end{aligned}$$
(20)

Recalling the example in Fig. 1, the previous criterion implies the preference chain \(M_3\succ M_4 \succ M_1 \succ M_5 \succ M_2\). The model \(M_2\) represents the worst case, since it has simultaneously the lowest PICP and the higest MPIW.

Fig. 1
figure 1

An illustrative representation of mortality models in the plane (MPIWPICP)

4 Empirical experiments

To test our proposal, we implement an empirical experiment involving mortality experiences worldwide. In particular, we exploit mortality data for both genders in five countries, namely Australia (AUS), England & Wales (GBRTENW), Japan (JPN), Italy (ITA) and the USA. We indicate the population set by \({\mathcal {I}}=\{\texttt {Male},\texttt {Female}\} \times \{\texttt {AUS},\texttt {GBRTENW},\texttt {JPN},\texttt {ITA},\texttt {USA}\}\). We consider such countries representative in terms of mortality dynamics. For instance, Australia, England & Wales and Italy populations experienced a deep, non-linear shaped, reduction of mortality rates after WWII, despite they differ in terms of life expectancy acceleration. On the other hand, Japan population exhibits a linear decline in mortality, while the USA population is characterized by the well-known life expectancy stagnation (Mehta et al. 2020).

Data were downloaded from the Human Mortality Database, which concerns calendar years from 1960 to 2019 and the age set \({\mathcal {X}} = \{60,\ldots ,89\}\). The latter is specially chosen to investigate ages for which the longevity risk may arise. Such a data are split in two sub-dataset with respect to the forecasting year \(t=2000\): a training dataset composed by mortality data indexed by the calendar years set \({\mathcal {T}} = \{ 1960,\ldots ,2000\}\) and a testing dataset concerning mortality data for the calendar years set \({\mathcal {T}}^{'} = \{ 2001,\ldots ,2019\}\). The former is used to fit stochastic mortality models, while the latter act as a test for the backtesting analysis. We refer to the Bayesian Information Criterion (hereinafter BIC) to assess the mortality model’s in-sample performance. At the same time, the out-of-sample accuracy of point forecasts is evaluated by looking at the Mean Squared Error (hereinafter MSE). We inspect the latter metric as an overall performance measure for each mortality model, and it is defined as:

$$\begin{aligned} MSE_i = \frac{1}{|{\mathcal {T}}^{'}|\cdot |{\mathcal {X}}|}\sum _{t \in {\mathcal {T}}} \sum _{x \in {\mathcal {X}}} ( q_{x,t, i} - {\hat{q}}_{x,t, i} )^2, \quad \forall i \in {\mathcal {I}}. \end{aligned}$$
(21)

We also investigated the MSE distribution by age, that is:

$$\begin{aligned} MSE_{x,i} = \frac{1}{|{\mathcal {T}}^{'}|} \sum _{t \in {\mathcal {T}}} ( q_{x,t, i} - {\hat{q}}_{x,t, i} )^2, \quad \forall x \in {\mathcal {X}}, i \in {\mathcal {I}}. \end{aligned}$$
(22)

In addition, we assess the mortality density forecasts using uncertainty metrics explained in Sect. 3. In particular, we firstly scrutinize PICP and MPIW distribution by age and separately. As the final step, we apply the proposed preference criterion displaying the results for ages \(x=65,75,85\). The experiments are performed using the R software (R Core Team 2021, version 3.6.3) and the package StMoMo (Villegas et al. 2018, version 0.4.1).

4.1 Results

The present section exposes and discusses the findings of our numerical experiments. Foremost, in Table 3 we list in-sample BIC values for each mortality model and different populations. The best performance is reported in bold for each population.  We outline that RH and M7 fitting overperform the other models: the former presents the lowest BIC in 5/10 cases, while the latter has the best BIC in 4/10 cases. The M8 provides the lowest BIC only for the US female population. Thus, from a proper perspective, both periods and cohort effects seem to be prevalent in explaining the mortality surfaces considered. For completeness, we report the heatmaps of the binomial residuals for all the considered models and populations in Appendix B.

Table 3 In-sample BIC for the different populations and different stochastic models considered

Concerning the out-of-sample analysis, Table 4 shows MSE values applying Eq. (21). We observe that models selected in terms of BIC are not optimal from a forecasting perspective. For instance, the RH forecasts underperform for all populations investigated, and the M7 results are profitable only for the GBRTENW male mortality data. From a forecasting accuracy’s point of view, the models appearing fruitful are the LC and the M8. Such evidence remarks on the backtesting process’s usefulness. Indeed, a mortality model should offer both an in-sample representativeness and an accurate out-of-sample performance. Our analysis points out potential over-fitting generated by RH and M7 models so that their forecasting accuracy is poor overall. However, we also examine the MSE distribution by age for an in-depth vision.

Table 4 Out-of-sample \(MSE_i\), \(i \in {\mathcal {I}}\) for the different stochastic models considered

Referring to Fig. 2, MSE increases with age, and some models show peculiar trends for the elderly. For all female populations, excluding the US, the M7’s MSE tends to be greater for initial ages, then decreases until about age 85 and increases for the final ages. Such behaviour also occurs in both Italian and Japanese male populations. For the latter mortality data, the M6’s MSE shrinks after age 85, indicating the M6 goodness in predicting mortality for older ages. However, the M6 shows lower accuracy for other mortality experiences. Within the CBD family, the MSE related to the M8 model exhibits greater regularity with increasing age. The RH model seems to be befitting, respecting the age range 60–73, but then loses predictive accuracy at older ages. The LC forecasts produce quite increasing errors, except for US and Australian female populations. On the other hand, the APC’s MSE boasts significantly after age 80 for all populations.

Fig. 2
figure 2

Out-of-sample \(MSE_{x,i}\) distribution by age, for \(i \in {\mathcal {I}}\). \(MSE_{x,i}\) values are expressed on log-scale for simplicity of graphical display

Selecting a mortality model considering only the point predictions accuracy could be difficult and, sometimes, misleading. Therefore, we evaluate projected prediction intervals exploiting both PICP and MPIW. We begin inspecting the former as depicted in Fig. 3, separately from the latter represented in Fig. 4. As expected, mortality models within the CBD family demonstrate larger prediction intervals at old ages. These models imply more uncertainty due to the presence of multiple periods effects with respect to LC-based models. Prediction interval width influences the prediction interval coverage, making CBD-based models more appealing ex-ante to anticipate unexpected mortality outcomes. Nonetheless, the ex-post perspective about PICP by age, as in Fig. 3, delineates particular evidence. For instance, the best M6 and M7 density forecasts occur for both English & Welsh and Australian males. For all other populations, such models regain coverage probability only at higher ages, thanks to their MPIW’s exponential growth. In addition, we observe that the APC’s PICP departs from the complete coverage from age 80 onwards, despite APC’s MPIW increasing with age. The LC model presents the lowest MPIW in all populations and allows a reasonable coverage probability for Australian females, Italian females and Japanese males. Regarding PICP, the LC model with cohort effect is profitable for both Australian and English & Welsh female mortality experiences.

Moreover, RH’s MPIW increases exponentially at older ages for English & Welsh males, failing to provide a corresponding probability coverage. In our opinion, the M8 model seems to be the best compromise to achieve biological reasonableness and plausibility of predicted levels of uncertainty in forecasts at different ages. To verify our belief, we compare mortality model predictions using a statistical test about the global PICP-based performances.

Fig. 3
figure 3

\(PICP_{{i}}\) distribution by age, for \(i \in {\mathcal {I}}\). Prediction intervals are calculated at level \(95\%\), accounting for parameter uncertainty

To this end, we compute the global PICP for each mortality model and each population, that is:

$$\begin{aligned} PICP_{i}= \frac{1}{|{\mathcal {T}}^{'}|\cdot |{\mathcal {X}}|}\sum _{x \in {\mathcal {X}}}\sum _{t \in {\mathcal {T}}^{'}} \mathbbm {1}_{\left\{ q_{x,t, i} \ \in \ [{\hat{q}}_{x,t, i} ^{LB}, \ {\hat{q}}_{x,t, i} ^{UB}]\right\} }, \end{aligned}$$
(23)

and we submit the values to the Wilcoxon signed-rank test (Wilcoxon 1992): the null hypothesis assumes that differences between two distributions come from a zero-median distribution, while the alternative hypothesis states that they come from a distribution with a median greater than zero. We perform the test by analysing the PICP values \((PICP_i)_{i \in {\mathcal {I}}}\) obtained by two models in the (ten) different populations across ages and calendar years.

Fig. 4
figure 4

\(MPIW_{{i}}\) distribution by age, for \(i \in {\mathcal {I}}\). Prediction intervals are calculated at level \(95\%\), accounting for parameter uncertainty

Table 5 contains the Wilcoxon’s test results. Each table cell reports the p-value obtained by testing the PICP performances of the model in the row against those of the model in the column. We refer to the value 0.05 as the significance threshold: if the p-value is lower, we reject the null hypothesis concluding that the model in the row is superior to the model in the column. Green cells indicate, by row, which models are better, while red cells identify opposite cases. Interestingly, we observe that for both M8 and APC models alternative hypothesis is rejected in almost all cases. Therefore, these models are superior to others in terms of global PICP. In addition, the empirical evidence does not allow for rejection of the null hypothesis when the APC model is tested against M8 and vice versa, highlighting that these two models have comparable PICP performances.

Table 5 P-value of the left-side Wilcoxon rank test. The test is applied to the \((PICP_i)_{i \in {\mathcal {I}}}\) values of two models each time

Finally, we compare mortality forecasts employing jointly PICP and MPIW. In particular, we apply the preference criterion exposed in Eq. (20). Figures 5, 6 and 7 depict the (MPIWPICP) plane for each population under investigation and for ages 65, 75 and 85 respectively.

Fig. 5
figure 5

Stochastic mortality models in the \((MPIW-PICP)\) plane for age \(x = 65\)

Fig. 6
figure 6

Stochastic mortality models in the \((MPIW-PICP)\) plane for age \(x = 75\)

For \(x=65\), the M7 model produces modest performances, in particular for both Japanese genders and for the female gender of Italy, England & Wales and Australia. Similar observations hold for the M6 model. In general terms, the M7 is the best model only for English & Wales aged 65. The M8 model is more advisable, but both the RH and the APC models can also guarantee a complete prediction interval coverage without large MPIW values. Moreover, we observe dense clusters of models for many populations, such as US males, Japanese males, Italian males and females, and Australian Males and females. In these cases, mortality models provide similar density forecasts for populations aged 65, and the preference criterion’s strength emerges when clusters dissolve with increasing age. Indeed, looking at populations aged 75, models M6 and M7 are far from the cluster’s centre in many cases. On the other hand, prediction intervals stemming from models M8, RH and APC ensure coherent mortality boundaries with respect to mortality outcomes. Finally, from Fig. 7, some clusters of models disappear, e.g. for US males, and others change their composition. We can appreciate how the LC model becomes more profitable for Japanese populations, as well as for Australian female mortality. However, the M8 model plays the role of “best practice” in most cases.

Fig. 7
figure 7

Stochastic mortality models in the \((MPIW-PICP)\) plane for age \(x = 85\)

In Table 6 we summarize models’ ranking by the means of criterion (20). Generally speaking, our experiment highlights the M8 model performances are the more balanced. Secondly, also the RH model manifests appealing prediction interval-based results for many populations. From an age-based perspective, both M8 and RH models seem more proper for populations aged \(x=65\) and \(x=75\), while for higher ages, only the M8 maintains good performances. Adding country-specific considerations, both the LC and the CBD models’ performances for age \(x=85\) may be explanatory. For instance, we notice that mortality profiles for Japanese populations aged 85 are strongly linear, and the LCs’ density forecasts goodness confirms how such a model is more parsimonious than others. Thus, the LC model generates less uncertainty providing narrower prediction intervals and, at the same time, embedding mortality realizations. Similar suggestions hold about Australian populations. However, for populations aged 65, or 75, mortality experiences may be characterized by more complex influences so that mortality models also involving cohort effects are more accurate. Overall, the M6 model looks more unsatisfactory, while the M7 model owns peculiarities. The M7 is tailored for the English & Welsh male population, but it is also competitive for several elderly populations (see, e.g. Fig. 7).

Table 6 Best (B) and worst (W) mortality forecasts for ages \(x=65,75,85\)

5 Conclusions

The improvements in mortality observed after WWII have raised the need to create stochastic mortality models to measure longevity risk. The literature offers many models’ specifications to provide accurate mortality projections. However, what is the best model remains an open question. Indeed, comparative studies available in the literature show that mortality model selection depends on both the mortality experiences and the criteria considered. In addition, few studies have thoroughly tested the uncertainty concerning mortality projections. Forecasting mortality uncertainty levels have a central role since they reveal the potential, unexpected longevity rise and the related economic impact. The present work proposes a methodological framework to backtest uncertainty in mortality projections exploiting uncertainty metrics not yet adopted in mortality literature. To this end, we employ the Prediction Interval Coverage Probability Coverage and the Mean Prediction Interval Width. In particular, such prediction interval-based measures allow quantifying both the plausibility and effectiveness concerning the predicted levels of uncertainty in future mortality outcomes. In addition, we define a new model selection criterion that combines the two metrics, allowing for a plain ex-post assessment of density forecasts at different ages. Numerical experiments are performed in five countries worldwide and both genders. As expected, there does not exist a mortality model overperforming all the others. Despite both age- and country-specific characteristics, our proposal grants to identify models with mortality density forecasts more balanced across populations. The empirical application of our proposal highlights that the RH model seems the best candidate within The LC family, while the M8 model overperforms the other CBD family members. Furthermore, the M7 model may be suited for elderly populations. The latter feature is particularly fruitful for governments, as well as for pension and health benefits providers. These entities bear the cost of increasingly elderly populations, facing the risk of paying out benefits for much longer. Therefore, selecting an appropriate mortality model to forecast mortality is necessary, and our proposal aims to support such a goal. Indeed, our analysis show the possibility to select a country-tailored mortality model in terms of uncertainty.

Future research will proceed in several directions. First, we intend to exploit our criterion to test and compare multi-population models that consider the dependence structure between the mortality dynamics of different countries, see Li and Lee (2005), Kleinow (2015). Second, we plan to develop a data-driven procedure based on machine-learning-based tools to select the most suited mortality model automatically following the approach suggested in Hunt and Blake (2014), Cairns et al. (2019).