Backtesting stochastic mortality models by prediction interval-based metrics

Human lifespan increments represent one of the main current risks for governments and pension and health benefits providers. Longevity societies imply financial sustainability challenges to guarantee adequate socioeconomic conditions for all individuals for a longer period. Consequently, modelling population dynamics and projecting future longevity scenarios are vital tasks for policymakers. As an answer, the demographic and the actuarial literature have been introduced and compared to several stochastic mortality models, although few studies have thoroughly tested the uncertainty concerning mortality projections. Forecasting mortality uncertainty levels have a central role since they reveal the potential, unexpected longevity rise and the related economic impact. Therefore, the present study poses a methodological framework to backtest uncertainty in mortality projections by exploiting uncertainty metrics not yet adopted in mortality literature. Using the data from the Human Mortality Database of the male and female populations of five countries, we present some numerical applications to illustrate how the proposed criterion works. The results show that there is no mortality model overperforming the others in all cases, and the best model choice depends on the data considered.


Introduction
During the last century, populations around the world have experienced a continuous longevity growth, despite country-specific dynamics and disparities in age-at-death distribution ( Aburto and van Raalte 2018; Nigri et al. 2022). Broadly speaking, mortality has 1 3 declined at all ages, with different intensities, due to the action of heterogeneous factors affecting the human lifespan. For instance, the diffusion of preventative health measures and improved medical care have positively impacted on population's life expectancy (see, e.g. Vaupel et al. 2021 andZarulli et al. 2021). Moreover, socioeconomic and environmental conditions have been determinants of how mortality changed over time and among populations (see, e.g. Cairns et al. 2019 andKhomenko et al. 2021). Although human lifespan increments are enjoyable evidence, as longevity increases, economic challenges emerge to ensure an adequate income and health treatments for all individuals throughout their old age. The social costs to cater for the needs of increasingly elderly populations involve a burdensome financial position for governments, annuity providers and pension schemes. These entities face the risk of paying out benefits for a much longer period, suffering financial instability conditions. In addition, such a circumstance may be exacerbated by future social advances. For instance, it is plausible that scientific and technological progresses will be able to boost medicine and lifestyles behaviours, making future improvements in life expectancy highly unpredictable (see, e.g. Keilman 2019). Indeed, population forecasts are heavily based on the current knowledge in vital processes, from which to extrapolate future trends. As a consequence, unexpected changes may occur due to random fluctuations or structural changes in the mortality trends. Therefore, uncertainty in predicting future longevity scenarios is a fact and achieving accuracy in mortality forecasts is crucial to support a long-lived society financially. To this end, both the demographic and the actuarial literature have been enriched over time by excellent works focused on stochastic mortality modelling. Models widely used to forecast mortality rates are extrapolative and draw inspiration from the pioneering Lee-Carter model (Lee and Carter 1992, hereinafter LC). The latter is a log-bilinear model embedding age-period effects and an additive Gaussian error structure. Various LC model extensions have been proposed in the literature to overcome model weaknesses and describe other relevant mortality patterns. For instance, Brouhns et al. (2002) formulated the canonical LC model in terms of Poisson regression, later extended in a Bayesian framework by Czado et al. (2005). A multi-factor age-period extension of LC was proposed by Booth et al. (2002), while Renshaw and Haberman (2003) exploited the LC model to provide mortality forecasting with age-specific enhancement. Prominent developments in stochastic mortality modelling were furnished by Renshaw and Haberman (2006), introducing cohort effects, and by Cairns et al. (2006). The latter proposed a two-factor stochastic representation for the logit of death probabilities, opening the way for further generalizations embodying multiple period and cohort effects (Cairns et al. 2009). Afterwards, Plat (2009) gathered the LC, the Renshaw and Haberman and the Cairns-Blake-Dowd models to establish a unified approach. Since a large number of mortality models have been proposed in the literature, Hunt and Blake (2014) designed a general procedure for constructing parametric mortality models able to catch different ageperiod-cohort effects present in the data.
Designating the best mortality model is not trivial, and the choice stems from the satisfaction of suitable criteria. However, a model may not necessarily dominate all others based on the selected criteria. For instance, a model could be better than others looking at its goodness of fit to historical data; at the same time, such a model could be less robust or provide less accurate projections than others. Furthermore, the model comparison depends on the mortality data investigated and the purposes of the analysis. Subsequently, many articles in the literature have compared different mortality models exploiting specific mortality experiences and befitting metrics. Cairns et al. (2009) offered an extensive analysis comparing empirical fits of eight stochastic mortality models on US and England & Wales males mortality data. The authors examined model performances by means of both qualitative and quantitative model selection criteria. The former refers to desirable mortality model's features, that is, parsimony, transparency, ability to generate sample paths, incorporation of cohort effects, aptitude to provide non-trivial correlation structures between ages. Quantitative criteria allow for assessing consistency and robustness of parameter estimates with respect to the period of data observed. Dowd et al. (2010b) tailored hypothesis tests to strengthen models evaluation in terms of goodness-of-fit and to consider England & Wales males mortality data. A different perspective to benchmark mortality models was developed in Dowd et al. (2010a). The latter designed a backtesting framework to gauge (ex-post) the forecasting performances of six stochastic mortality models fitted on England & Wales male mortality data. Cairns et al. (2011) focused on the plausibility of stochastic mortality model forecasts by means of innovative qualitative criteria, namely: biological reasonableness, plausibility of predicted levels of uncertainty in forecasts at different ages, projections robustness with respect to the sample period used to fit the model. A considerable model comparison is also promoted in Haberman and Renshaw (2011), whose numerical experiment include both US and England & Wales female's mortality experiences. Additional quantitative model comparisons have been performed in Lovász (2011), concerning Finnish and Swedish populations, and in Biffi and Clemente (2014) and Carfora et al. (2017), concerning Italian population.
In the spirit of the aforementioned literature, the present study furnishes a comparative analysis of the forecasting ability of stochastic mortality models. As suggested in Dowd et al. (2010a), a suitable mortality model should offer both proper in-sample results and plausible ex-ante forecasts, as well as generate adequate ex-post out-of-sample performances. In particular, forecasted uncertainty levels have a central role since it reveals the potential, unexpected longevity rise and its biological plausibility. Therefore, we implement an ex-post assessment of the projected prediction intervals, comparing and measuring their plausibility through uncertainty metrics not yet adopted: the Prediction Interval Coverage Probability (hereinafter PICP) and the Mean Prediction Interval Width (hereinafter MPIW). The former generally represents the probability of observing mortality outcomes over the backtesting horizon falling within the prediction intervals. At the same time, the latter expresses the average amplitude of such intervals. Both the PICP and the MPIW are usually employed for rating neural networks prediction intervals (see, e.g. Khosravi et al. 2011, but they can be efficiently utilized for any predictive model. It is straightforward to note that prediction intervals whose PICP is the highest possible value are a matter of interest; we aim for all mortality realizations within the variability range. Such a scenario may be simply achieved through a wide prediction interval, but the latter could indicate poor predictive believability. For instance, too large prediction intervals are not informative about the likely uncertainty of future mortality outcomes; furthermore, too wide intervals may be associated with low or null PICP values. The latter is the worst-case scenario as it indicates the model's failure to capture the mortality trend. Hence, it is important to consider jointly the PICP and the MPIW, defining an associated criterion for the model comparison and selection. From a backtesting perspective, we advance a preference criterion that gives credibility to the mortality model with the highest PICP and the lower MPIW. Our proposal identifies an additional backtesting strategy whose joint application with commonly used quantitative metrics improves the model selection process. As further contribution, we illustrate how the suggested criterion performs ranking seven stochastic mortality models in an empirical analysis. To this end, we deal with the mortality experiences of five populations, namely Australia, England & Wales, Italy, Japan and the USA, and for both genders by exploiting the Human Mortality Database data (Human Mortality Database 2018). Our analysis shows that, as expected, no mortality model outperforms all the others. Despite both age-and country-specific characteristics, our proposal grants to identify models with mortality density forecasts more balanced across populations. The remainder of the paper is the following. Section 2 contains the methodological framework for investigating the stochastic mortality models. Section 3 explains the PICP and the MPIW metrics and the proposed preference criterion. Section 4 refers to the numerical experiments and related results obtained. Finally, Sect. 5 sets out concluding remarks.

Methodological framework
Let X = {x 1 , x 2 , … , x } be the age set, and T = {t 1 , t 2 , … , t n } the calendar year set. For each last birthday age x ∈ X during calendar year t ∈ T , we introduce the following variables of interest: • D x,t , i.e. the random number of deaths; • d x,t , i.e. the observed number of deaths; • E 0 x,t , i.e. the initial exposed to risk; • q x,t , i.e. the one-year probability of death.
In the present work we resort the Generalized Age-Period-Cohort (hereinafter GAPC) as unified mortality modeling framework, allowing for a fair comparison among stochastic mortality models (see e.g. Currie 2016; Villegas et al. 2018). In particular, we assume that: coherently representing the systematic mortality component by the following predictor: By the means of Eq. (2), one-year probabilities of death are characterized through a sigmoidal transformation of age-dependent parameters a x , b (l) x and c x , time-dependent parameters (l) t , and the cohort parameter p . Most of the stochastic mortality models in the literature are special cases extracted from the GAPC setup. For instance, consider whether the age-dependent parameter a x or the cohort parameter t−x , implies different mortality representations; again, varying the order of L, mortality models are shaped according to specific period effects portrayals.

Stochastic mortality models
We choose to deal with seven stochastic mortality models, summarized in Table 1, depicting the hinges of the two main stochastic mortality models families: the Lee-Carter family and the Cairns-Blake-Dowd family. We briefly illustrated their analytical structure in the following sections. (1)

LC model
The LC model provides two age-specific effects, a period effect and it does not acknowledge for cohort influences. Thus, by the means of Eq. (2) we have L = 1 and c x , t−x = 0 , obtaining: We need to make the LC model invariant with respect to parameters' transformations, and the following constraints are then applied:

RH model
The RH model develops the LC structure introducing an age-cohort bi-linear term, that is: Similarly to the LC model, identifiability problems are avoided imposing the following constraints: As suggested in Haberman and Renshaw (2011), we consider the specification c x = 1 aiming to bypass robustness issues suffered by the predictor in Eq. (4).

APC model
The APC model was originally proposed in the fields of medicine and demography (see e.g. Clayton and Schifflers 1987) and later introduced in actuarial literature by Currie (2006). Referring to the latter, the APC model involves an age effect, a period effect and a cohort effect, that is: Such a model requires parameters constraints defined as below:

CBD model
The CBD model assumes L = 2 time-dependent parameters, while a x , c x , t−x = 0 . In addition, Cairns et al. (2006) , where x denotes the average age for the age set considered. Hence, the predictor has the following expression: and it is not affected by identifiability issues.

M6 model
The M6 model is a CBD's extension due to the cohort effect inclusion. The latter is weighted by c x = 1 , and the predictor becomes: Contrary to the CBD model, constraints have to be introduced:

M7 model
The M7 model boosts the M6 one, favoring the introduction of a third period effect weighted by a quadratic coefficient. In particular, the M7's predictor is defined setting where ̂x denotes the standard deviation of ages. In analytical terms, it holds that: Regarding identifiability problems, the following constraints are imposed:

M8 model
The M8 model is M6 model's variant contemplating a non-constant coefficient related to the cohort effect, that is c (1) x = (x c − x) for some constant parameter x c to be estimated. Thus, the predictor takes the form: and the associated parameter constraint is:

Model fitting
Stochastic mortality models falling within the GAPC framework are examples of generalized, or non-generalized, linear models and they can be fitted by maximum likelihood estimation (see e.g. Currie 2016). Recalling assumption (1) and assuming i.i.d. death counts for each age and calendar year, we consider the log-likelihood below: where w x,t is a 0/1 weight taking value 0 if a data cell, (x, t), is ignored or 1 if the cell is incorporated, and d x,t is the estimated number of deaths, that is: It is straightforward to note that the parameters are estimated by solving the problem:

Model forecasting
Stochastic mortality models forecasts arise from time-series techniques describing period and cohort effects, while age-specific parameters are time invariant. Let t n ∈ T so that future mortality trend is outlined without considering prediction errors. However, the latter generates uncertainty in mortality forecasts, as well as the estimation error related to ̂ . Therefore, prediction intervals are substantial in supporting mortality projection's reliability and to perform consistent mortality/longevity risk assessments. Any stochastic mortality model have to provide prediction intervals with a certain coverage probability, namely (1 − ) , aiming that future mortality outcomes belong to the interval defined by the lower and upper bounds. More in detail, indicating with q LB the projected prediction interval lower bound and with q UB the projected upper bound, a desirable model satisfies the following: In addition, prediction interval's quality depends on the distance between q LB x,t n +h and q UB x,t n +h for each time point in T ′ . For our purposes, we appraise models' out-of-sample judging the quality of the projected prediction intervals in terms of their probability coverage and their width. In doing so, we take into account both model and parameter uncertainty. The method considered to generate such uncertainty is explained in Appendix A.

Prediction interval based metrics and forecasts comparison
To appreciate the prediction interval's goodness, we propose to jointly examine specific prediction interval-based metrics, namely PICP and MPIW. The former inspects the prediction interval coverage by counting how many mortality realisations are wrapped in the probabilistic range, given a confidence level (1 − ) . It is defined as , and 1 {⋅} = 0 otherwise. Broadly speaking, the PICP furnishes the estimate for the probability in Eq. (15) so that, theoretically, PICP should be greater or equal to the nominal value (1 − ) . The higher the PICP value, the more likely the coverage the prediction interval provides for future mortality realizations. However, PICP values lower than the confidence level could occur due to different reasons, such as in the presence of noisy mortality data or when the model under-fitting or over-fitting crops up. Larger PICP values could be achieved simply by considering wider prediction intervals, but the latter suggests poor predictive believability, and they are not of practical usefulness. Therefore, rating prediction interval quality by PICP without considering the prediction interval width is a little choice. We must evaluate prediction interval accuracy, referring simultaneously to both PICP and MPIW. The latter represents the prediction interval mean width over the forecasting horizon, that is: The joint use of PICP and MPIW requires formulating a criterion to compare mortality models given their projected prediction intervals. Our approach traces a backtesting exercise. We provide a plain ex-post evaluation of prediction intervals inspecting the associated PICP and MPIW for each mortality model considered. In particular, we advance a preference criterion relying on the mortality model with the highest PICP and the lower MPIW. For instance, given two stochastic mortality models, namely M k and M j , we propose to promote the model satisfying the following criterion: The criterion in (18) may be drafted also in terms of weak preference, that is: The means of criterion (18), or (19), is streamlined: we aim to select a model whose prediction intervals enclose future mortality outcomes and, at the same time, the model (and associated parameters) must not yield an excessive uncertainty. For instance, Fig. 1 shows a graphical visualization of how the proposed preference criterion works. Let us consider five stochastic mortality models and let M k , k = 1, … , 5 , the coordinates representing such models in the plane (MPIW, PICP). It is straightforward to note that M 3 ≻ M 1 ≻ M 2 , but criterion (18) does not allow preference judgment in comparing M 1 and M 4 or M 1 and M 5 . However, for financial and actuarial purposes, the latter case should be unambiguous: we are inclined to prefer the model M 1 as it provides a greater probability coverage and is more informative regarding uncertainty. Thus, we interpret the preference criterion (18) as a gold standard, facilitating model selection if there is a prominent mortality model in terms of density forecasts; otherwise, intermediate situations occur, and we need to introduce an additional rule in defining the preference criterion. In particular, we say that M k ≻ M j if the following criterion is satisfied: Recalling the example in Fig. 1, the previous criterion implies the preference chain The model M 2 represents the worst case, since it has simultaneously the lowest PICP and the higest MPIW.

Empirical experiments
To test our proposal, we implement an empirical experiment involving mortality experiences worldwide. In particular, we exploit mortality data for both genders in five countries, namely Australia (AUS), England & Wales (GBRTENW), Japan (JPN), Italy (ITA) and the USA. We indicate the population set by We consider such countries representative in terms of mortality dynamics. For instance, Australia, England & Wales and Italy populations experienced a deep, non-linear shaped, reduction of mortality rates after WWII, despite they differ in terms of life expectancy acceleration. On the other hand, Japan population exhibits a linear decline in mortality, while the USA population is characterized by the well-known life expectancy stagnation (Mehta et al. 2020).
Data were downloaded from the Human Mortality Database, which concerns calendar years from 1960 to 2019 and the age set X = {60, … , 89} . The latter is specially chosen to investigate ages for which the longevity risk may arise. Such a data are split in two sub-dataset with respect to the forecasting year t = 2000 : a training dataset composed by mortality data indexed by the calendar years set T = {1960, … , 2000} and a testing dataset concerning mortality data for the calendar years set T � = {2001, … , 2019} . The former is used to fit stochastic mortality models, while the latter act as a test for the backtesting (20) analysis. We refer to the Bayesian Information Criterion (hereinafter BIC) to assess the mortality model's in-sample performance. At the same time, the out-of-sample accuracy of point forecasts is evaluated by looking at the Mean Squared Error (hereinafter MSE). We inspect the latter metric as an overall performance measure for each mortality model, and it is defined as: We also investigated the MSE distribution by age, that is: In addition, we assess the mortality density forecasts using uncertainty metrics explained in Sect. 3. In particular, we firstly scrutinize PICP and MPIW distribution by age and separately. As the final step, we apply the proposed preference criterion displaying the results for ages

Results
The present section exposes and discusses the findings of our numerical experiments. Foremost, in Table 3 we list in-sample BIC values for each mortality model and different populations. The best performance is reported in bold for each population. We outline that RH and M7 fitting overperform the other models: the former presents the lowest BIC in 5/10 cases, while the latter has the best BIC in 4/10 cases. The M8 provides the lowest BIC only for the US female population. Thus, from a proper perspective, both periods and cohort effects seem to be prevalent in explaining the mortality surfaces considered. For completeness, we report the heatmaps of the binomial residuals for all the considered models and populations in Appendix B.
Concerning the out-of-sample analysis, Table 4 shows MSE values applying Eq. (21). We observe that models selected in terms of BIC are not optimal from a forecasting (21) Such evidence remarks on the backtesting process's usefulness. Indeed, a mortality model should offer both an in-sample representativeness and an accurate out-of-sample performance. Our analysis points out potential over-fitting generated by RH and M7 models so that their forecasting accuracy is poor overall. However, we also examine the MSE distribution by age for an in-depth vision. Referring to Fig. 2, MSE increases with age, and some models show peculiar trends for the elderly. For all female populations, excluding the US, the M7's MSE tends to be greater for initial ages, then decreases until about age 85 and increases for the final ages. Such behaviour also occurs in both Italian and Japanese male populations. For the latter mortality data, the M6's MSE shrinks after age 85, indicating the M6 goodness in predicting mortality for older ages. However, the M6 shows lower accuracy for other mortality experiences. Within the CBD family, the MSE related to the M8 model exhibits greater regularity with increasing age. The RH model seems to be befitting, respecting the age range 60-73, but then loses predictive accuracy at older ages. The LC forecasts produce quite increasing errors, except for US and Australian female populations. On the other hand, the APC's MSE boasts significantly after age 80 for all populations. Selecting a mortality model considering only the point predictions accuracy could be difficult and, sometimes, misleading. Therefore, we evaluate projected prediction intervals exploiting both PICP and MPIW. We begin inspecting the former as depicted in Fig. 3, separately from the latter represented in Fig. 4. As expected, mortality models within the CBD family demonstrate larger prediction intervals at old ages. These models imply more uncertainty due to the presence of multiple periods effects with respect to LC-based models. Prediction interval width influences the prediction interval coverage, making CBD-based models more appealing ex-ante to anticipate unexpected mortality outcomes. Nonetheless, the ex-post perspective about PICP by age, as in Fig. 3, delineates particular evidence. For instance, the best M6 and M7 density forecasts occur for both English & Welsh and Australian males. For all other populations, such models regain coverage probability only at higher ages, thanks to their MPIW's exponential   growth. In addition, we observe that the APC's PICP departs from the complete coverage from age 80 onwards, despite APC's MPIW increasing with age. The LC model presents the lowest MPIW in all populations and allows a reasonable coverage probability for Australian females, Italian females and Japanese males. Regarding PICP, the LC model with cohort effect is profitable for both Australian and English & Welsh female mortality experiences. Moreover, RH's MPIW increases exponentially at older ages for English & Welsh males, failing to provide a corresponding probability coverage. In our opinion, the M8 model seems to be the best compromise to achieve biological reasonableness and plausibility of predicted levels of uncertainty in forecasts at different ages. To verify our belief, we compare mortality model predictions using a statistical test about the global PICP-based performances.
To this end, we compute the global PICP for each mortality model and each population, that is: and we submit the values to the Wilcoxon signed-rank test (Wilcoxon 1992): the null hypothesis assumes that differences between two distributions come from a zero-median distribution, while the alternative hypothesis states that they come from a distribution with a median greater than zero. We perform the test by analysing the PICP values (PICP i ) i∈I obtained by two models in the (ten) different populations across ages and calendar years.  Table 5 contains the Wilcoxon's test results. Each table cell reports the p-value obtained by testing the PICP performances of the model in the row against those of the model in the column. We refer to the value 0.05 as the significance threshold: if the p-value is lower, we reject the null hypothesis concluding that the model in the row is superior to the model in the column. Green cells indicate, by row, which models are better, while red cells identify opposite cases. Interestingly, we observe that for both M8 and APC models alternative hypothesis is rejected in almost all cases. Therefore, these models are superior to others in terms of global PICP. In addition, the empirical evidence does not allow for rejection of the null hypothesis when the APC model is tested against M8 and vice versa, highlighting that these two models have comparable PICP performances.
Finally, we compare mortality forecasts employing jointly PICP and MPIW. In particular, we apply the preference criterion exposed in Eq. (20). Figures 5, 6 and 7 depict the (MPIW, PICP) plane for each population under investigation and for ages 65, 75 and 85 respectively.
For x = 65 , the M7 model produces modest performances, in particular for both Japanese genders and for the female gender of Italy, England & Wales and Australia. Similar  females. In these cases, mortality models provide similar density forecasts for populations aged 65, and the preference criterion's strength emerges when clusters dissolve with increasing age. Indeed, looking at populations aged 75, models M6 and M7 are far from the cluster's centre in many cases. On the other hand, prediction intervals stemming from models M8, RH and APC ensure coherent mortality boundaries with respect to mortality outcomes. Finally, from Fig. 7, some clusters of models disappear, e.g. for US males, and others change their composition. We can appreciate how the LC model becomes more profitable for Japanese populations, as well as for Australian female mortality. However, the M8 model plays the role of "best practice" in most cases. In Table 6 we summarize models' ranking by the means of criterion (20). Generally speaking, our experiment highlights the M8 model performances are the more balanced. Secondly, also the RH model manifests appealing prediction interval-based results for many populations. From an age-based perspective, both M8 and RH models seem more proper for populations aged x = 65 and x = 75 , while for higher ages, only the M8 maintains good performances. Adding country-specific considerations, both the LC and the CBD models' performances for age x = 85 may be explanatory. For instance, we notice that mortality profiles for Japanese populations aged 85 are strongly linear, and the LCs' density forecasts goodness confirms how such a model is more parsimonious than others. Thus, the LC model generates less uncertainty providing narrower prediction intervals and, at the same time, embedding mortality realizations. Similar suggestions hold about Australian populations. However, for populations aged 65, or 75, mortality experiences may be characterized by more complex influences so that mortality models also involving cohort effects are more accurate. Overall, the M6 model looks more unsatisfactory, while the M7 model owns peculiarities. The M7 is tailored for the English & Welsh male population, but it is also competitive for several elderly populations (see, e.g. Fig. 7).

Conclusions
The improvements in mortality observed after WWII have raised the need to create stochastic mortality models to measure longevity risk. The literature offers many models' specifications to provide accurate mortality projections. However, what is the best model remains an open question. Indeed, comparative studies available in the literature show that mortality model selection depends on both the mortality experiences and the criteria considered. In addition, few studies have thoroughly tested the uncertainty concerning mortality projections. Forecasting mortality uncertainty levels have a central role since they reveal the potential, unexpected longevity rise and the related economic impact. The present work proposes a methodological framework to backtest uncertainty in mortality projections exploiting uncertainty metrics not yet adopted in mortality literature. To this end, we employ the Prediction Interval Coverage Probability Coverage and the Mean Prediction Interval Width. In particular, such prediction interval-based measures allow quantifying both the plausibility and effectiveness concerning the predicted levels of uncertainty in future mortality outcomes. In addition, we define a new model selection criterion that combines the two metrics, allowing for a plain ex-post assessment of density forecasts at different ages. Numerical experiments are performed in five countries worldwide and both genders. As expected, there does not exist a mortality model overperforming all the others. Despite both age-and country-specific characteristics, our proposal grants to identify models with mortality density forecasts more balanced across populations. The empirical application of our proposal highlights that the RH model seems the best candidate within The LC family, while the M8 model overperforms the other CBD family members. Furthermore, the M7 model may be suited for elderly populations. The latter feature is particularly fruitful for governments, as well as for pension and health benefits providers. These entities bear the cost of increasingly elderly populations, facing the risk of paying out benefits for much longer. Therefore, selecting an appropriate mortality model to forecast mortality is necessary, and our proposal aims to support such a goal. Indeed, our analysis show the possibility to select a countrytailored mortality model in terms of uncertainty. Future research will proceed in several directions. First, we intend to exploit our criterion to test and compare multi-population models that consider the dependence structure between the mortality dynamics of different countries, see Li and Lee (2005), Kleinow (2015). Second, we plan to develop a data-driven procedure based on machine-learningbased tools to select the most suited mortality model automatically following the approach suggested in Hunt and Blake (2014), Cairns et al. (2019).

Appendix A: Model and parameter uncertainty
In the present work, we construct prediction intervals considering both model and parameter uncertainty. Among others, we refer the reader to Dowd et al. (2010a) for more details concerning the role of both uncertainty causes in backtesting stochastic mortality models. For our purposes, we briefly recall that model uncertainty stems from forecasting errors about the period and cohort indexes, while parameter uncertainty arises from the estimation of the parameters of the GAPC model. Consequently, accounting for both uncertainty sources allows for portraying the randomness generated by a stochastic mortality model in projecting mortality. In particular, model uncertainty is easily assessed as it derives directly from the innovation component of the time series model considered for the time and the cohort indexes, see for instance, Eq. (13) and Table 2. On the other hand, due to the

Declarations
Competing interests The authors have not disclosed any competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.