1 Introduction

1.1 Mortality modelling: motivation, background, and literature

Since the seminal work of Lee and Carter [17], several stochastic models for the estimation and projection of mortality rates have been developed over the last decades; see, e.g., the contributions of Brouhns et al. [2], Renshaw and Haberman [31], Cairns et al. [3], and Plat [28]. While these pioneering approaches analyzed single populations, models for multiple populations gained considerable importance in subsequent years, after it was found that the mortality profiles of multiple populations tended to converge (cf. Wilson [40]). Indeed, the multi-population paradigm often has advantages over modelling mortality rates for each population separately. Most notably, multi-population mortality models can capture common features of the mortality profiles of similar populations, such as neighbouring countries or populations with similar socio-economic, environmental, or biological characteristics, while simultaneously reflecting population-specific features. This motivates their use for producing coherent projections of mortality rates.

Recently, Neural Network (NN)-based approaches to mortality modelling have been proposed as an appealing alternative to classical stochastic models. Among the first examples are the work of Richman and Wüthrich [33], who analyse the Swiss population, and Hainaut [11], who uses French, UK, and US mortality rates to compare a NN approach to the Lee–Carter model with and without cohort effects. Another contribution is Nigri et al. [23], who use a deep learning algorithm based on a two-step Recurrent Neural Network (RNN) to enhance the forecasts obtainable under the Lee–Carter model. Indeed, there have been many recent developments in the use of NNs in the context of multi-population mortality modelling, such as Perla et al. [27], who consider the use of one-dimensional (period effect only) RNNs with Long Short-Term Memory (LSTM) and of Convolutional Neural Networks (CNNs) to provide direct forecasts of the mortality rates, in contrast to the two-step approach of Nigri et al. [23]. Lindholm and Palmborg [19] consider similar models with a focus on the optimal use of data for projection. Schnürch and Korn [35] extend the RNN and CNN by proposing a two-dimensional approach involving age and period. In a slightly different fashion, Scognamiglio [36] proposes a NN architecture for the joint calibration of individual Lee–Carter models, based both on the classical log-normal representation and on the Poisson Lee–Carter version of Brouhns et al. [2]. An approach that allows for coherent predictions within sub-groups of similar populations is provided in Perla and Scognamiglio [26]. Finally, Wang et al. [38] develop a framework which ‘augments’ the mortality dataset to construct an image of neighbourhood mortality data around the central death rate and use two CNN approaches for projecting mortality rates.

1.2 Applications of (multi-population) mortality models

A key application of multi-population mortality modelling approaches is the analysis of mortality levels based on socio-economic characteristics. Among others, understanding mortality is relevant to policymakers in order to propose and plan sustainable state pension reforms and budgets, or to address disparities between socio-economic groups. Understanding mortality from a statistical point of view is also relevant for private sector players such as insurance companies and pension funds, which offer mortality-linked products like annuities and pensions and design effective solutions for longevity risk management and transfer.

Considering specific populations that have already been investigated in the literature, let us mention Wen et al. [39], who compare several stochastic mortality models to fit mortality rates of small geographic areas in the UK (Lower Layer Super Output Areas), grouped in deciles of their Index of Multiple Deprivation. Cairns et al. [6] develop a multi-population mortality modelling approach for the analysis of the Danish population on the basis of the deciles of a newly created affluence index, which accounts for individual information on income and wealth.

1.3 Contributions: methodology and investigated population

This paper investigates the use of NNs to jointly model the mortality rates of multiple populations and compares the empirical results to classical stochastic mortality models. The model for the dynamics of mortality rates draws on the work of Richman and Wüthrich [32], who propose the use of Recurrent Neural Networks (RNNs) such as Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) networks. Although computationally expensive, we focus on RNNs to exploit their suitability for the time-series structure of mortality rates. Indeed, their recurrent connections allow RNNs to maintain a memory of past observations, letting information pass from past time steps to the current one. In this way, it is possible to model the information throughout the entire time series. Furthermore, RNNs can handle time series of variable length, due to their sequential processing of the data, cf. Hsu [12]. The NN approaches are then compared to well-established stochastic mortality models for multiple populations, such as the Lee–Carter extension of Li and Lee [18] and the Common Age Effect model of Kleinow [16], as well as to the single-population approach of the Plat [28] model.

In our case study, we investigate Italian mortality data grouped according to socio-economic characteristics. More precisely, we first propose a new deprivation index on the basis of five variables. This index allows separating the 106 Italian counties into nine groups of different socio-economic levels, which are then used as populations for the empirical analysis. To the best of our knowledge, this study is the first to explicitly address the use of NNs in the analysis of multiple populations on the basis of socio-economic characteristics.

The structure of the remaining paper is as follows: Sect. 2 introduces and explains the Italian mortality data used for our analysis and the creation of the new Index of Multiple Deprivation. Section 3 describes how NNs are used in our study. Section 4 briefly introduces the models used for comparison. Then, Sect. 5 shows the results of the empirical analysis and, finally, Sect. 6 concludes.

2 Data

The collected mortality data comprises the ‘number of deaths’ and ‘exposure-at-risk years’ for males and females aged 50 to 95 living in Italy. It spans the calendar years 1982–2018 and is granular to the level of the 106 Italian counties (called provinces). Our main source of data is the Italian national institute of statistics (ISTAT). The data were further processed to account for splits in some counties over the last thirty years (see Euthum [10] for details). Further elaborations were needed to calculate the central exposure to risk from the population data given for January 1st of each year. We assume that the net migration effect does not bias our findings about mortality rates. Furthermore, we collected a set of twelve indicators of socio-economic characteristics for each county, which we used to create an index of multiple deprivation. For further details regarding peculiarities and adjustments of the data, we refer to the supplementary material.

2.1 Index of multiple deprivation

For each county, we originally collected a set of twelve indicators representing different aspects of the quality of life. Out of these indicators, a subset of five was ultimately selected, reflecting a proper mix of different types of socio-economic factors. These have been aggregated into a so-called ‘Index of Multiple Deprivation’ (IMD). In this way, the 106 provinces from twenty regions could be classified into different groups based on their socio-economic indicators. The chosen variables are:

  1. Relative poverty (in percent): The percentage of households with a consumption expenditure lower than the average per-capita expenditure, as estimated by ISTAT;

  2. Primary care and residential and semi-residential facilities (measured in beds per 10,000 inhabitants): This information is indicative of the health-care expenditure of the region where the county is located, which in turn is affected by its wealth;

  3. Social services and benefits of municipalities, measured as costs in Euros per capita (2017 data): Services include, e.g., day nurseries and socio-educational services for early childhood. It is assumed that higher costs correspond to more investment in social services by municipalities and hence more benefits, from a sociological point of view, for the population;

  4. Unemployment rate (in percent, for the population aged over 15);

  5. Number of felonies committed by persons convicted by final judgement (per 1,000 inhabitants): From a statistical point of view, this variable is not significantly correlated with the other indicators used for building the index. It is assumed that a high relative number of such felonies suggests worse living conditions from a sociological point of view, and also comes along with a rather deficient economic situation.

A detailed description of these variables can be found on the ‘Statbase portal’ of ISTAT, which provides access to a large amount of data on the Italian population and is also the source of the underlying data (downloaded on \(1^{st}\) of March 2021 for the year 2018).

The variables have been chosen based on a correlation analysis between them. The selection was then validated by a ranking measure, Kendall’s tau, to check whether the ranking of the provinces changes significantly when omitting a variable from the selection. We found that Kendall’s tau does not change significantly when omitting one of the five selected variables or adding a further one.

Our index was constructed using z-scores to scale the five different variables to a comparable level and unit; the same approach is used in Osservatorio della salute [24]. This allowed us to aggregate the single z-scores per province to a total value and finally permitted a ranking of the provinces. For each province, the z-score was calculated as

$$\begin{aligned} z^j = \sum \limits _{i = 1}^{5} z_i^j = \sum \limits _{i = 1}^{5} \frac{x_i^j-m_i}{s_i}, \end{aligned}$$

where i indicates the respective socio-economic classifier, \(x_i^j\) its value, j the county, \(m_i\) the mean of this classifier over all counties, and \(s_i\) the (unbiased) sample standard deviation over all counties.

The interpretation is intuitive: the higher the value \(z^j\) of a county, the worse is its socio-economic situation, or in other words, the deprivation in that area is higher. Implicitly, we assume that the standardized value of each variable has the same impact on the z-score (which could be easily generalized by introducing weights).
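As a minimal numerical sketch of this aggregation (plain Python/NumPy, not the paper's actual pipeline; the indicator matrix is simulated placeholder data, and sign conventions for the "good" variables such as care beds are not handled here):

```python
import numpy as np

# Hypothetical indicator matrix: rows = the 106 counties, columns = the
# five socio-economic variables. Real indicators would also need a sign
# convention so that larger values always mean more deprivation.
rng = np.random.default_rng(0)
x = rng.normal(size=(106, 5))

m = x.mean(axis=0)                   # mean m_i of each classifier
s = x.std(axis=0, ddof=1)            # unbiased sample std deviation s_i
z = ((x - m) / s).sum(axis=1)        # IMD score z^j per county

ranking = np.argsort(z)              # lower z = less deprived
groups = np.array_split(ranking, 9)  # nine groups of comparable size
```

Note that the grouping in the text is based on resident population rather than county counts; `array_split` is only a stand-in for that step.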

Then, the aim is to rank the counties on the basis of their z-scores. Hence, we created nine groups of counties with homogeneous population sizes, each containing between 6 and 7 million people. This split does not account for the sex and age of the Italian population, which implicitly assumes that the counties have similarly distributed populations by age and sex. If, by using this criterion, we found that two counties had the same z-score value, we allocated them to the same group. In this way, we aim at creating groups of comparable size, where the counties therein have similar socio-economic characteristics. The use of socio-economic indicators for the calendar year 2018 only implies that the ranking and the corresponding groups do not change over time. This assumption may not reflect the socio-economic developments in Italy across the last 36 years. Nevertheless, we find it reasonable in view of the results obtained in the preliminary analysis of the mortality and socio-economic data. Indeed, the more deprived counties are located in the south of Italy, as opposed to the historically wealthier areas in the north, as also demonstrated by the time series of the unemployment rate, and this evidence is further confirmed when analysing the evolution of the raw mortality rates.

We purposely opted for a simple and interpretable approach to create the index, which might be refined in further studies. Our main focus is to elaborate on mortality models as described later, and this simple and intuitive approach provides very plausible results. In any case, we remark that the groups were created on the basis of the ranking of the counties rather than their index values.

To obtain a geographical impression of the index-based subdivision, a map is reported in Fig. 1. The color scheme is as follows: brighter colors indicate a lower index value, reflecting better socio-economic conditions based on the IMD defined above. The original map was taken from the De Agostini website and colored with standard graphic tools.

Fig. 1

Subdivision of the Italian provinces based on the new index. We observe the tendency of northern provinces having smaller index values, i.e. better living conditions

Fig. 2

Crude death rates in log-scale for ages 68 and 83, respectively, female and male populations. Group 1 (g1) is the worst group, socio-economically speaking

To obtain a first impression of the mortality rates, the crude period death rates \({\hat{m}}(x,t,i) = \frac{d(x,t,i)}{E^c(x,t,i)}\) between 1982 and 2018 are plotted in Fig. 2 for the male and female populations aged 68 and 83 years, respectively. We observe:

  • In general, mortality rates decrease over time and increase with age, which is consistent with the literature and the biological ageing process;

  • Mortality rates for females are lower than for males, as widely observed in many other national population tables;

  • The most deprived subpopulations (g1 in dark blue) appear to have higher mortality rates over the period analysed. The difference and ordering between subgroups is more pronounced for the female subpopulation, while for males, the difference in mortality rates is less evident for wealthier subgroups. This could be due to the chosen index or to the underlying population. It may also be a consequence of a significant north–south division in terms of socio-economic well-being, while people still live longer in some southern parts of the country, as is the case for regions like Sardinia and Calabria, which are famous for their exceptional longevity (see Poulain et al. [29]);

  • In the earlier years of the study, deprivation trends and differences are harder to detect. This may be because the socio-economic analysis was based on indicators for the year 2018, while different provinces evolved differently over the decades;

  • The spike in the mortality rates for males and females aged 83 in 2003 is likely due to the massive heatwave of summer 2003 (Johnson et al. [15]). Further discussion is included in Sect. 5.

Throughout the analysis, we use the mortality rates of the first 33 calendar years, 1982–2014, when training the models (training set), while those of 2015–2018 are used for predictions and forecasts (test set). We opted for this roughly 90%/10% split due to the short length of the available time series. We believe this split is a reasonable compromise between the need for sufficient data to fit the models (especially on a restricted age range of a 50–95 year old population) and the use of the latest years to assess their predictive ability.
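As a minimal sketch of the crude-rate computation and the chronological train/test split (deaths and exposures are simulated placeholders, not the ISTAT data):

```python
import numpy as np

# Crude period death rates m_hat = d / E^c, then the 1982-2014 /
# 2015-2018 split described in the text.
years = np.arange(1982, 2019)
rng = np.random.default_rng(1)
exposure = rng.uniform(1e4, 1e5, size=years.size)       # E^c(x, t, i)
deaths = rng.binomial(exposure.astype(int), 0.02)       # d(x, t, i)

m_hat = deaths / exposure            # crude death rates
log_m = np.log(m_hat)                # log-scale, as plotted in Fig. 2

train = log_m[years <= 2014]         # 33 calendar years for fitting
test = log_m[years >= 2015]          # 4 held-out years
```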

3 Multi-population NNs

In this section we introduce the NN approach adopted in this work to analyse the Italian population, which draws on the work of Richman and Wüthrich [32]. The key idea is to use age, sex, calendar year, and deprivation group (described in Sect. 2) as input, in order to obtain as output an estimate of the mortality rates. The time-series nature of the data calls for RNNs with a Long Short-Term Memory or Gated Recurrent Unit structure, as pioneered in Richman and Wüthrich [33]. The general structure of a NN for regression includes:

  • An input layer formed by several features or covariates;

  • One or more hidden layers where the inputs are processed, that is, weighted and mapped inputs are passed on from one layer to the next;

  • An output layer, which returns a fitted value of the dependent variable.

Suppose there are \(K \ge 1\) hidden layers in the network. Each layer includes \(q_k \in {\mathbb {N}}\) neurons (by convention, \(q_0\) is the dimension of the feature space, which provides the input layer). The layers \(z^{(k)}\) represent mappings:

$$\begin{aligned} z^{(k)} : {\mathbb {R}}^{q_{k-1}} \rightarrow {\mathbb {R}}^{q_k}, \quad {{\textbf {z}}} \mapsto z^{(k)}({{\textbf {z}}}) = \Big ( z_1^{(k)}({{\textbf {z}}}), \dots , z_{q_k}^{(k)}({{\textbf {z}}}) \Big )', \end{aligned}$$
(3.1)

for \(k=1,\ldots ,K\), where \(z_l^{(k-1)}\) denotes the l-th neuron of layer \(k-1\) and

$$\begin{aligned} z_j^{(k)}({{\textbf {z}}}) = \phi _j^{(k)} \bigg ( w_{j,0}^{(k)} + \sum _{l=1}^{q_{k-1}} w_{j,l}^{(k)}z_l^{(k-1)} \bigg ) =: \phi _j^{(k)} \Big \langle {{\textbf {w}}}_j^{(k)}, {{\textbf {z}}}^{(k-1)} \Big \rangle , \quad \text {for } j = 1, \dots , q_k, \end{aligned}$$
(3.2)

where \({{\textbf {w}}}_j^{(k)} = (w_{j,0}^{(k)}, \dots , w_{j,q_{k-1}}^{(k)})' \in {\mathbb {R}}^{1+q_{k-1}}\) are the weights, i.e. the parameters to be trained in the model. \(\phi _j^{(k)}\) denotes the activation function. This function determines the output of a neuron, in other words, it decides whether a neuron is activated or not. Its aim is to introduce non-linearity into the output of a neuron.
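The feed-forward mapping (3.2) can be sketched in a few lines of NumPy (an illustration, not the paper's R/keras code; all names and the tanh activation are assumptions):

```python
import numpy as np

# One hidden layer of the feed-forward mapping (3.2): intercept b plays
# the role of w_{j,0}, W collects the remaining weights w_{j,l}.
def dense_layer(z_prev, W, b, phi=np.tanh):
    return phi(b + W @ z_prev)

rng = np.random.default_rng(2)
q0, q1 = 4, 3                        # input and hidden dimensions
z0 = rng.normal(size=q0)             # input features z^{(0)}
W1 = rng.normal(size=(q1, q0))
b1 = rng.normal(size=q1)
z1 = dense_layer(z0, W1, b1)         # hidden activations in (-1, 1)
```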

3.1 Recurrent neural networks (RNN)

For RNNs, the input variables \((\tilde{{{\textbf {x}}}}_1, \dots , \tilde{{{\textbf {x}}}}_T)\) have a time-series structure with components \(\tilde{{{\textbf {x}}}}_t\in {\mathbb {R}}^{q_0}\) for \(t = 1, \dots , T\). For the k-th layer, we define the mappings as follows:

$$\begin{aligned} z^{(k)} : {\mathbb {R}}^{q_{k-1}+q_k} \rightarrow {\mathbb {R}}^{q_k}, \quad ({{\textbf {z}}}_{t}^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}) \mapsto z_t^{(k)} = z^{(k)}({{\textbf {z}}}_{t}^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}), \end{aligned}$$
(3.3)

where

$$\begin{aligned} z_t^{(k)}&= z^{(k)}({{\textbf {z}}}_{t}^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)})\\&= \Big ( \phi \Big ( \langle {{\textbf {w}}}_1^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle {{\textbf {u}}}_1^{(k)}, {{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ), \dots , \phi \Big ( \langle {{\textbf {w}}}_{q_k}^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle {{\textbf {u}}}_{q_k}^{(k)}, {{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \Big ) ^T\\&=: \phi \Big ( \Big \langle W^{(k)}, {{\textbf {z}}}_t^{(k-1)} \Big \rangle + \Big \langle U^{(k)}, {{\textbf {z}}}_{t-1}^{(k)} \Big \rangle \Big ). \end{aligned}$$

The individual neurons \(1 \le j \le q_k\) are modeled as

$$\begin{aligned} z_{j,t}^{(k)} = \phi \Big ( \langle {{\textbf {w}}}_j^{(k)}, {{\textbf {z}}}_t^{(k-1)} \rangle + \langle {{\textbf {u}}}_j^{(k)}, {{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) = \phi \bigg ( w_{j,0}^{(k)} + \sum _{l=1}^{q_{k-1}} w_{j,l}^{(k)} z_{l,t}^{(k-1)} + \sum _{l=1}^{q_k} u_{j,l}^{(k)} z_{l,t-1}^{(k)} \bigg ), \end{aligned}$$
(3.4)

where \(\phi\) is the (non-linear) activation function, which is the same for all neurons. The weights are \(W^{(k)} = ({{\textbf {w}}}_1^{(k)}, \dots , {{\textbf {w}}}_{q_k}^{(k)})^T \in {\mathbb {R}}^{q_k \times (1+q_{k-1})}\) (including an intercept, see above) and \(U^{(k)} = ({{\textbf {u}}}_1^{(k)}, \dots , {{\textbf {u}}}_{q_k}^{(k)})^T \in {\mathbb {R}}^{q_k \times q_k}\) (excluding an intercept), which are identical for all time points t, i.e. the weight matrices are homogeneous over time.

Note that \(\tilde{{{\textbf {x}}}}_t \in {\mathbb {R}}^{q_0}\) serves as the input in layer \(k=1\).
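The plain-RNN recursion (3.4) can be sketched as follows (illustrative NumPy with random weights, not the paper's implementation; note that the same W, b, U are reused at every time step):

```python
import numpy as np

# One step of the plain RNN (3.4): the hidden state depends on the
# current input x_t and the previous hidden state z_prev.
def rnn_step(x_t, z_prev, W, b, U, phi=np.tanh):
    return phi(b + W @ x_t + U @ z_prev)

rng = np.random.default_rng(3)
q0, q1, T = 5, 4, 10
W = rng.normal(size=(q1, q0))        # input weights (with intercept b)
b = rng.normal(size=q1)
U = rng.normal(size=(q1, q1))        # recurrent weights (no intercept)

z = np.zeros(q1)                     # initial hidden state
xs = rng.normal(size=(T, q0))        # input time series of length T
for x_t in xs:                       # weights homogeneous over time
    z = rnn_step(x_t, z, W, b, U)
```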

3.1.1 Long short-term memory structure (LSTM)

The LSTM-type RNN has cycles in its information transmission, which are provided by a so-called cell state process. This process stores the available memory and allows information from previous points in time to be included in later time steps when evaluating the network. In short, this leads to the main equations for this specific NN:

$$\begin{aligned} {{\textbf {z}}}_t^{(k)} := z^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}, {{\textbf {c}}}_{t-1}^{(k)}\Big ) = o_t^{(k)} \circ \phi \Big ({{\textbf {c}}}_t^{(k)} \Big ) \in {\mathbb {R}}^{q_k}, \end{aligned}$$
(3.5)

where the cell state \({{\textbf {c}}}_t^{(k)}\) is given by

$$\begin{aligned} {{\textbf {c}}}_t^{(k)}:= & {} c^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}, {{\textbf {c}}}_{t-1}^{(k)}\Big ) \nonumber \\= & {} f_t^{(k)} \circ {{\textbf {c}}}_{t-1}^{(k)} + i_t^{(k)} \circ \phi _{\text {tanh}} \Big ( \langle W_c^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle U_c^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in {\mathbb {R}}^{q_k}. \end{aligned}$$
(3.6)

Here, \(f_t^{(k)}\), \(i_t^{(k)}\), and \(o_t^{(k)}\) are the Forget gate, the Input gate, and the Output gate, respectively. These are trained to decide which information is retained and which is discarded at time t. \(\phi _{\text {tanh}}\) denotes the selected activation function. The mentioned gates have their own weight matrices and are defined as follows:

  • Forget gate (loss of memory gate):

    $$\begin{aligned} f_t^{(k)} := f^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}\Big ) = \phi _{\sigma } \Big ( \langle W_f^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle U_f^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in (0,1)^{q_k}, \end{aligned}$$
    (3.7)
  • Input gate (memory update gate):

    $$\begin{aligned} i_t^{(k)} := i^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}\Big ) = \phi _{\sigma } \Big ( \langle W_i^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle U_i^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in (0,1)^{q_k}, \end{aligned}$$
    (3.8)
  • Output gate (release of memory information rate):

    $$\begin{aligned} o_t^{(k)} := o^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}\Big ) = \phi _{\sigma } \Big ( \langle W_o^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle U_o^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in (0,1)^{q_k}. \end{aligned}$$
    (3.9)
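One LSTM step, combining equations (3.5)–(3.9), can be sketched as follows (illustrative NumPy with random weights; intercepts are omitted for brevity, and the standard Hadamard form f ∘ c + i ∘ tanh(·) of the cell-state update is assumed):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One LSTM step: three gates, each with its own weight matrices.
def lstm_step(x_t, z_prev, c_prev, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc):
    f = sigmoid(Wf @ x_t + Uf @ z_prev)   # forget gate (3.7)
    i = sigmoid(Wi @ x_t + Ui @ z_prev)   # input gate (3.8)
    o = sigmoid(Wo @ x_t + Uo @ z_prev)   # output gate (3.9)
    c = f * c_prev + i * np.tanh(Wc @ x_t + Uc @ z_prev)  # cell state (3.6)
    z = o * np.tanh(c)                    # layer output (3.5)
    return z, c

rng = np.random.default_rng(4)
q0, q1 = 5, 3
Wf, Wi, Wo, Wc = [rng.normal(size=(q1, q0)) for _ in range(4)]
Uf, Ui, Uo, Uc = [rng.normal(size=(q1, q1)) for _ in range(4)]

z, c = np.zeros(q1), np.zeros(q1)         # initial state and cell
x_t = rng.normal(size=q0)
z, c = lstm_step(x_t, z, c, Wf, Uf, Wi, Ui, Wo, Uo, Wc, Uc)
```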

3.1.2 Gated recurrent unit (GRU)

The second RNN architecture used in this work is the GRU, first introduced by Cho et al. [7]. A drawback of LSTMs is that they are rather complex and computationally expensive. GRU networks are slightly simpler and can still provide good results, as we observe in Sect. 5. The output structure is the same as for the LSTM, while the corresponding gates show some differences. In GRU architectures, only two gates are used, the so-called Reset gate and the Update gate, denoted by \(r_t^{(k)}\) and \(u_t^{(k)}\), respectively.

The gate activations are given by

$$\begin{aligned} {{\textbf {z}}}_t^{(k)}:= & {} z^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}\Big ) \nonumber \\= & {} r_t^{(k)} \circ {{\textbf {z}}}_{t-1}^{(k)} + (1-r_t^{(k)}) \circ \phi \Big ( \langle W^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + u_t^{(k)} \circ \langle U^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in {\mathbb {R}}^{q_k}, \end{aligned}$$
(3.10)

for general weight matrices of dimensions as those above. Here, no cell process \(c_t\) comes into play.

Note that if \(r_t\) approaches 1 in a component, there is no reset for this component (neuron), in the sense that the activation from the previous time step is carried over to time step t. If \(r_t\) approaches 0, the old value is reset and replaced by a new value. Depending on the value of \(u_t\), this new value incorporates information from the previous time step or not. Compared to the LSTM, the number of parameters decreases. The gates are defined as follows:

  • Reset gate:

    $$\begin{aligned} r_t^{(k)} := r^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}\Big ) = \phi _{\sigma }\Big ( \langle W_r^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle U_r^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in (0,1)^{q_k}, \end{aligned}$$
    (3.11)
  • Update gate:

    $$\begin{aligned} u_t^{(k)} := u^{(k)}\Big ({{\textbf {z}}}_t^{(k-1)}, {{\textbf {z}}}_{t-1}^{(k)}\Big ) = \phi _{\sigma }\Big ( \langle W_u^{(k)},{{\textbf {z}}}_t^{(k-1)} \rangle + \langle U_u^{(k)},{{\textbf {z}}}_{t-1}^{(k)} \rangle \Big ) \in (0,1)^{q_k}. \end{aligned}$$
    (3.12)
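One GRU step can be sketched in the same spirit (illustrative NumPy with random weights; the mixing follows the formulation (3.10) above, which differs slightly from the more common Cho et al. parameterization, and intercepts are again omitted):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One GRU step: only two gates and no separate cell state, hence fewer
# parameters than the LSTM.
def gru_step(x_t, z_prev, Wr, Ur, Wu, Uu, W, U):
    r = sigmoid(Wr @ x_t + Ur @ z_prev)            # reset gate (3.11)
    u = sigmoid(Wu @ x_t + Uu @ z_prev)            # update gate (3.12)
    z_new = np.tanh(W @ x_t + u * (U @ z_prev))    # candidate activation
    return r * z_prev + (1.0 - r) * z_new          # mixing as in (3.10)

rng = np.random.default_rng(5)
q0, q1 = 5, 3
Wr, Wu, W = (rng.normal(size=(q1, q0)) for _ in range(3))
Ur, Uu, U = (rng.normal(size=(q1, q1)) for _ in range(3))

z = gru_step(rng.normal(size=q0), np.zeros(q1), Wr, Ur, Wu, Uu, W, U)
```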

3.2 Implementation of the NN approach

The RNNs were implemented in R (R Core Team [30]). Training and forecasting were performed using the package keras (Chollet et al. [8]). The choice of the hyper-parameters, such as the number of layers, the number of neurons, and the type of activation functions, is motivated by the work of Richman and Wüthrich [33]. When experimenting with different hyper-parameters, we did not notice any substantial differences or improvements in the results. The R code for data pre-processing, similar in spirit to Richman and Wüthrich [33], can be found in the GitHub repository https://github.com/maxeuthum/Multipopulation-Mortality-Models. This repository also includes the code for fitting the models, with a detailed line-by-line description of the performed operations.

Input values comprise the 5 neighbouring ages around the central age x. Therefore, for group i we obtain:

$$\begin{aligned} {{\textbf {y}}}_{t,x} = \Big ( \text {log}(m)_{(x-2) \vee 50,t}, \text {log}(m)_{(x-1) \vee 50,t}, \text {log}(m)_{x \vee 50,t}, \text {log}(m)_{(x+1) \wedge 95,t}, \text {log}(m)_{(x+2) \wedge 95,t} \Big )^T \in {\mathbb {R}}^5, \end{aligned}$$

where, for general variables \(x, y \in {\mathbb {R}}\), \(x \vee y = \text {max}\{x,y\}\), and \(x \wedge y = \text {min}\{x,y\}\).
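The construction of the 5-age input vector with boundary clamping (the ∨/∧ operations above) can be sketched as follows (a Python illustration with random placeholder log-rates; `age_window` is a hypothetical helper name):

```python
import numpy as np

# Placeholder log-mortality rates for one group and one year, keyed by
# age over the observed range 50-95.
rng = np.random.default_rng(6)
log_m = {a: rng.normal(-4.0, 0.5) for a in range(50, 96)}

def age_window(x):
    # neighbouring ages clamped to [50, 95] via max (∨) and min (∧)
    offsets = (-2, -1, 0, 1, 2)
    return np.array([log_m[min(max(x + d, 50), 95)] for d in offsets])

y_50 = age_window(50)    # lower boundary: first entries are clamped
y_70 = age_window(70)    # interior age: plain 5-age window
```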

Furthermore, a look-back period of \(T=10\) years was chosen. This means that the previous 10 years were taken as input to predict the mortality rates for year t. This is the same look-back period as in Richman and Wüthrich [33].

The training data set \({\mathcal {T}}\) was defined as

$$\begin{aligned} {\mathcal {T}} = \{\left( {{\textbf {y}}}_{t-T,x}, \dots , {{\textbf {y}}}_{t-1,x},{{\textbf {y}}}_{t,x}\right) ; \quad 50 \le x \le 95, \quad 1982+T \le t \le 2014\}, \end{aligned}$$

where \({{\textbf {y}}}_{t,x}\) denotes the vector of log-mortality rates. Hence, we have log-mortality rates for 46 one-year age groups (50 to 95), indexed by x, over 23 calendar years (\(2014-(1982+10)+1=23\)), and nine deprivation groups. This gives \(46 \times 23 \times 9 = 9{,}522\) training samples of dimension \(10 \times 5\) as input (10 for the look-back period, and 5 for the neighbouring ages). In R, these were stored in 5 arrays of dimension \(9{,}522 \times 10\). Last, the targets \({{\textbf {y}}}_{t,x}\) were stored in an array of dimension 9,522 (for this array, of course, no look-back period or age window applies).
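The sample bookkeeping above amounts to the following short arithmetic sketch:

```python
# With a look-back of T = 10 years, targets run over the calendar years
# 1982 + T, ..., 2014, i.e. 2014 - (1982 + 10) + 1 = 23 years.
T = 10
n_ages = 95 - 50 + 1               # 46 one-year age groups
n_years = 2014 - (1982 + T) + 1    # 23 target calendar years
n_groups = 9                       # deprivation groups

n_samples = n_ages * n_years * n_groups
# each sample is a 10 x 5 block: look-back period x neighbouring ages
```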

As a final step of the training-data pre-processing, the input data were scaled using a MinMax scaler. Then, the group indicator was appended, so that the total input of the NN consists of these two parts. We remark that several choices for the input dimension are reasonable, allowing further features to be included in the model.
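MinMax scaling maps the inputs linearly to [0, 1]; a minimal sketch (random placeholder inputs, with the minimum and maximum taken over the training inputs only):

```python
import numpy as np

# Placeholder training inputs of the shape discussed above.
rng = np.random.default_rng(7)
x_train = rng.normal(-4.0, 0.5, size=(9522, 10))

lo, hi = x_train.min(), x_train.max()
x_scaled = (x_train - lo) / (hi - lo)   # MinMax scaling to [0, 1]
```

At prediction time, the same `lo` and `hi` from the training data would be reused.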

In this work, different types of RNN models have been fitted, namely LSTM and GRU networks for all deprivation groups simultaneously, separately for the three sex groups (males, females, and combined). One, two, and three layers have been specified (Figs. 3, 4).

Fig. 3

Parameters and concatenations for an example three-layered LSTM network model, female population. This is the standard R output showing the underlying construction of the model

We opted for running 500 epochs with a batch size of 100 (again based on the work of Richman and Wüthrich [33]). To prevent overfitting, the training data set has been split into a training and a validation set, with proportions of 80% and 20%, respectively. Furthermore, we implemented callbacks in order to select the calibration with the minimum validation loss among the 500 epochs.

As in other cases of NN modelling, gradient-descent algorithms were used to solve the optimization problem of the NN. Usually, these algorithms run until some stopping criterion is met, for example when the error we seek to minimize lies within a certain range. As stated in Richman and Wüthrich [33], an issue with early-stopped solutions of gradient-descent methods is that the resulting calibrations depend on the chosen seed of the algorithm. As a consequence, the results from different runs of the NN may deviate substantially: one run may lead to very good predictions, the next to rather poor ones. Hence, in our case study each model has been run 40 times, each time with different starting values, and the outcomes are averaged to avoid relying on an outlying model by sheer coincidence. For each run, the output (the log-mortality rates) is transformed by the exponential function to obtain the mortality rates, which are then averaged.
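The averaging step can be sketched as follows (random placeholder predictions standing in for the 40 trained networks; note the averaging happens in rate space, after the exponential transform):

```python
import numpy as np

# Per-run log-mortality predictions for one year/sex/group: 40 runs
# over the 46 ages (placeholder values).
rng = np.random.default_rng(8)
n_runs = 40
log_m_pred = rng.normal(-4.0, 0.05, size=(n_runs, 46))

# Transform each run to mortality rates, then average across runs.
m_pred = np.exp(log_m_pred).mean(axis=0)
```

Averaging exp-transformed rates differs from exponentiating the averaged log-rates (by Jensen's inequality, the former is at least as large).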

Fig. 4

In-sample loss for different models and populations, errors on the \(10^{-4}\) scale. F, M, and T abbreviate females, males, and total (combined), respectively

3.2.1 Forecasting

The prediction of the log-mortality rates for the years 2015–2018 is carried out by means of an iterative approach, which makes no explicit use of time-series techniques. For example, the 10-year window 2005–2014 is needed to predict mortality for the year 2015. These new predictions are combined with the observations from the years 2006–2014, which yields a further 10-year window to predict mortality for 2016, and so on. This recursive method allows for predictions of the 4 out-of-sample years. However, for longer prediction horizons this method may produce implausible predictions.
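The recursive scheme can be sketched as follows (`model_predict` is a hypothetical stand-in for the trained network, and the history is a simulated log-rate series; the point is the sliding window feeding predictions back in):

```python
import numpy as np

# Dummy one-step predictor: a small annual decline in log-mortality,
# standing in for the trained RNN.
def model_predict(window):
    return window[-1] - 0.01

history = list(np.linspace(-3.5, -4.0, 33))   # log-rates 1982-2014
T = 10                                        # look-back period

forecasts = []
for _ in range(4):                            # years 2015-2018
    window = np.array(history[-T:])           # latest 10-year window
    y_next = model_predict(window)
    forecasts.append(y_next)
    history.append(y_next)                    # feed the prediction back
```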

4 Competing stochastic mortality models

We compare the results obtained from the NN approach of this work with three well-established stochastic (multi-population) mortality models (hereafter referred to as SMMs), namely the Li and Lee [18] model, the Common Age Effect (CAE) model of Kleinow [16], and the Plat [28] model. All three models assume that the number of deaths D(x, t, i) at age x, in calendar year t, for group i is Poisson distributed with rate \(E^c(x,t,i)\cdot m(x,t,i)\), where m(x, t, i) denotes the underlying death rate and \(E^c(x,t,i)\) is the central exposure to risk:

$$\begin{aligned} D(x,t,i)&\sim \text {Poi}\big (E^c(x,t,i)\cdot m(x,t,i)\big ). \end{aligned}$$

The parameters underlying the specification are estimated by maximum likelihood, as proposed by Enchev et al. [9]. We briefly introduce these models below.

4.1 Li and Lee (LL) model

This model extends the single-population Lee and Carter [17] model to the analysis of multiple populations. The Li and Lee [18] model introduces a set of parameters which are common to the set of analysed groups, as well as group-specific parameters capturing the unexplained variance.

Model 4.1

(Li and Lee model) For age (in years) x, time period t, and group i, the Li & Lee model describes the logarithm of the ‘central mortality rate’ m(x, t, i) as

$$\begin{aligned} \log \big (m(x,t,i)\big ) = \alpha (x,i) + B(x)\,K(t) + \beta (x,i)\,\kappa (t,i), \quad t_{min} \le t \le t_{min} + T - 1. \end{aligned}$$
(4.1)

Here, m(x, t, i) can be approximated by the ratio between the death counts, denoted by D(x, t, i), and the central exposure to risk \(E^c(x,t,i)\).

\(\alpha (x,i)\), \(\beta (x,i)\), and \(\kappa (t,i)\) are group-specific parameters: \(\alpha (x,i)\) represents the average over time of the log mortality rate. The common function K(t) describes the evolution of mortality over time for all groups, and B(x) is a global age-modulating parameter, indicating how rates change by age in response to changes in the time factor K(t). \(\beta (x,i)\) and \(\kappa (t,i)\) play the same roles as B(x) and K(t), but at the group-specific level.

4.2 Common age effect (CAE) model

The CAE model of Kleinow [16] assumes that age effects are common to all populations, following the assumption that age effects may be very similar in countries sharing a similar socio-economic structure.

Model 4.2

(Kleinow model) For age x, time period t, and group i, the Kleinow model of order p assumes the following model for the logarithm of the central death rate

$$\begin{aligned} \log \big (m(x,t,i)\big ) = \alpha (x,i) + \beta ^{(1)}(x)\,\kappa ^{(1)}(t,i) + \ldots + \beta ^{(p)}(x)\,\kappa ^{(p)}(t,i). \end{aligned}$$
(4.2)

The order p corresponds to the number of age–time interaction terms included. In our analysis we set \(p=2\).

4.3 Plat model

The third model for comparison was initially proposed by Plat [28]. In this work, however, we use a simplified version with two period-specific factors, without accounting for the cohort effect.

Model 4.3

(Plat model) For age x, time period t, and group i, the Plat model (without cohort effects) models the logarithm of the central death rates as

$$\begin{aligned} \log \big (m(x,t,i)\big ) = \alpha (x,i) + \kappa ^{(1)}(t,i) + \kappa ^{(2)}(t,i)(x-\bar{x}), \end{aligned}$$
(4.3)

where \(\bar{x}\) denotes the average age of the observed age range. The first stochastic component \(\kappa ^{(1)}\) represents changes in the level of mortality for all ages, while \(\kappa ^{(2)}\) allows for changes in mortality to vary between ages.

4.3.1 Forecasting

For the models of Li & Lee, Kleinow, and Plat, we perform forecasts via a classical time-series approach: the time-dependent \(\kappa\)-processes are modelled as stochastic time series and projected using ARIMA (AutoRegressive Integrated Moving Average) models. These can be readily implemented in R, e.g. by using the function auto.arima from the package forecast (Hyndman and Khandakar [14]).
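A frequently selected special case for \(\kappa\)-processes in Lee–Carter-type models is the random walk with drift, i.e. ARIMA(0,1,0) with drift. A minimal pure-Python stand-in for this case (an illustration only, not a replacement for auto.arima's order selection) is:

```python
def rw_drift_forecast(kappa, horizon):
    """Point forecasts of a random walk with drift (ARIMA(0,1,0)+drift):
    kappa_{T+h} = kappa_T + h * drift, where the drift is estimated
    as the mean of the observed first differences."""
    diffs = [b - a for a, b in zip(kappa, kappa[1:])]
    drift = sum(diffs) / len(diffs)
    return [kappa[-1] + h * drift for h in range(1, horizon + 1)]
```

For a declining period index such as kappa = [10, 8, 6, 4], the estimated drift is -2 per year and the next three forecasts continue that trend linearly.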

5 Empirical results

All models are fitted on data spanning from 1982 to 2014. However, fitted values are only compared for the years 1992 to 2014, since the NN-based approach does not deliver values for the first ten years, see Richman and Wüthrich [33]. In what follows, \(i \in \{1, \dots , 9\}\), \(x \in \{ 50, \dots , 95\}\), and \(t \in \{1992, \dots , 2014\}\). Furthermore, we graphically inspect the models by means of the standardized residuals, as defined in Wen et al. [39], see also Table 4.

5.1 In-sample fit

We first compare the three competing stochastic mortality models introduced in Sect. 4 in terms of their explanation ratios. Then we compare these with the results obtained using the NN models of this paper.

Table 1 shows the explanation ratios for the models of Li & Lee, Kleinow, and Plat, defined for group i and model M as

$$\begin{aligned} R_i^M = 1 - \frac{\sum \limits _{x,t} \left( \log \frac{d(x,t,i)}{E^c(x,t,i)} - \log \left( {\hat{m}}(x,t,i)\right) \right) ^2}{\sum \limits _{x,t} \left( \log \frac{d(x,t,i)}{E^c(x,t,i)} - \alpha ^c(x,i) \right) ^2}, \end{aligned}$$
(5.1)

where \(\alpha ^c(x,i):= \frac{1}{T} \sum \limits _{t} \log \frac{d(x,t,i)}{E^c(x,t,i)}\) is the average log crude death rate over time. The explanation ratio is useful for analysing how much information about the crude death rates \(\frac{d(x,t,i)}{E^c(x,t,i)}\) is explained by the respective model.
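The explanation ratio (5.1) can be computed directly from the crude and fitted rates; the sketch below (a pure-Python illustration with dictionary inputs keyed by (age, year), not the authors' code) makes the two sums explicit:

```python
import math

def explanation_ratio(crude_rates, fitted_rates):
    """Explanation ratio (5.1) for one group; both arguments are
    dicts mapping (age, year) -> rate, on the same (age, year) grid."""
    ages = {a for a, _ in crude_rates}
    years = {t for _, t in crude_rates}
    # alpha^c(x): average log crude death rate over time, per age
    alpha = {a: sum(math.log(crude_rates[(a, t)]) for t in years) / len(years)
             for a in ages}
    num = sum((math.log(crude_rates[k]) - math.log(fitted_rates[k])) ** 2
              for k in crude_rates)
    den = sum((math.log(crude_rates[(a, t)]) - alpha[a]) ** 2
              for a in ages for t in years)
    return 1.0 - num / den
```

A model reproducing the crude rates exactly attains a ratio of 1; worse fits yield smaller values.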

Table 1 Explanation ratios for different models

We observe that the Li & Lee model performs best in terms of explanation ratios for eight out of nine female subgroups; the same is noted for the Kleinow model when looking at the male population. The Plat model, analysed at the level of a single population, still yields comparable explanation ratios, albeit lower than those of the Li & Lee and the CAE model of Kleinow. Furthermore, the Plat model has a lower number of parameters, which may in part explain these results. Our results are further confirmed by the analysis of the Akaike Information Criterion (AIC) (Akaike [1]), shown in Table 2, which indicates the relative quality of a statistical model with a penalty term for the number of parameters. The AIC is calculated as

$$\begin{aligned} AIC = -2 \cdot \ell ({\hat{\theta }}) + 2\cdot p, \end{aligned}$$
(5.2)

where \(\ell ({\hat{\theta }})\) indicates the log-likelihood value at (optimal) parameter \({\hat{\theta }}\) and p denotes the number of parameters of the respective model (see Table 9).
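Equation (5.2) translates into a one-line helper (illustrative only); note that lower AIC values indicate a better trade-off between fit and parsimony:

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion (5.2): -2*l(theta_hat) + 2*p.
    Lower values indicate a relatively better model."""
    return -2.0 * log_likelihood + 2.0 * n_params
```

For example, a model with log-likelihood -100 and 5 parameters yields an AIC of 210, which would be preferred over a richer model reaching log-likelihood -98 with 10 parameters (AIC 216).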

Table 2 AIC values for different models

Table 3 shows the mean squared error for model M and subpopulation i, calculated as follows:

$$\begin{aligned} \text {MSE}_i^M = \frac{1}{n\cdot T} \sum \limits _{x,t} \left( m(x,t,i) - {\hat{m}}(x,t,i)^M \right) ^2, \end{aligned}$$
(5.3)

where \({\hat{m}}(x,t,i)^M\) denotes the fitted mortality rate derived from model M. Table 4 shows the MSE for the RNN models analysed in this paper.
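For completeness, the MSE (5.3) in code form (a minimal sketch over flattened observed and fitted rates; the division by the number of cells corresponds to the factor \(1/(n\cdot T)\)):

```python
def mse(observed, fitted):
    """Mean squared error (5.3) over all (age, year) cells,
    with observed and fitted rates given as flat sequences."""
    return sum((o - f) ** 2 for o, f in zip(observed, fitted)) / len(observed)
```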

Table 3 Mean squared errors for different models and groups
Table 4 Mean squared errors for different Recurrent network models and groups

From the results in Table 3, we observe that it is not possible to draw a decisive conclusion about the best-performing model, since the evidence is mixed across groups and sexes. When the data for both males and females are combined, we observe that the Plat model performs better at the two extremes of the deprivation scale, i.e. for the least and most deprived county groups.

The two RNN models considerably improve the in-sample fit compared to the three competing stochastic models. In more detail, the two-layer LSTM outperforms the other two LSTM models over all socio-economic groups for males, females, and both sexes combined. The GRU models with three layers, however, perform even better. For female and combined mortality rates, the most deprived groups show higher in-sample errors than less deprived ones. Similar evidence can be observed for the LL, CAE, and Plat models. This may be indicative of the difficulties mortality models face in general when fitting more deprived subpopulations via a multi-population approach. It may also be an effect of the index constructed in Sect. 2 and of the underlying data. Nevertheless, we note that the difference between females and males is smaller for the RNN approach than for the competing models.

In conclusion, based on the mean squared error, the multi-population RNN models outperform the well-established stochastic mortality models when analysing county-based socio-economic subgroups of the Italian population. The mean squared errors for males are higher than those for females and for both sexes combined.

For a deeper inspection of these results, we restrict the MSE analysis to shorter age ranges for males and females, chosen such that both have the same remaining life expectancy. These ranges are based on the remaining life expectancy of Italian males at age 50 in 2017 (32.08 years) and of Italian females at age 54 in 2017 (approx. 32 years). Therefore, based on the Italian life tables from the Human Mortality Database [13], we restrict the analysis to males aged 50 to 82 and to females aged 54 to 86.

We then recalculated the deprivation-group-specific MSE for males and females over the restricted age ranges, still based on the models fitted to the original dataset (males and females aged 50 to 95). The results are shown in Table 5 (LL, CAE, and Plat models) and Table 6 (RNN models).

5.2 Mean squared errors for specific ages

Table 5 Mean squared errors for specifically tailored age ranges, females (54–86), and males (50–82), SMM models
Table 6 Mean squared errors for specifically tailored age ranges, females (54–86), and males (50–82), selected RNN models

For the selected age ranges, we observe that, for the three competing stochastic mortality models, the mean squared error is less than one tenth of the mean squared error for all ages, even though these reduced age ranges cover 72% of the modelled ages.

Again, we observe that for the female population the MSE is larger for the most deprived socio-economic groups, both for the competing models and for the RNN-based ones. Similar evidence is obtained for males, to a lesser extent. For the models fitted using female subpopulation data, we still face more difficulties for the most deprived groups compared to less deprived ones, especially for the SMM models. It is also evident that all models within one model class (SMM or RNN) have similar mean squared errors; that is, all selected models perform very well for these ages. Let us also mention that the mean squared errors in the RNN case are far smaller than in the SMM case.

Furthermore, it is interesting to observe an alignment of female and male rate fitting (which was the aim of analysing different age ranges): the SMM mean squared errors for the male population were on average approximately 3.02 times those for females. When restricting the age ranges, this factor reduces to about 1.23, which means that for our data both male and female rates are fitted almost equally well on average over age ranges with similar remaining life expectancy. For the three selected RNN models from above, the factors are 1.52 and 1.01, respectively; that is, the alignment of female and male fitting ability is almost perfect in the RNN case.

Concluding, it seems that the oldest ages (87–95) drive most of the mean squared error. In addition, one should be careful when comparing identical age ranges for females and males, given that life expectancy differs significantly between the two groups. The age at which mortality becomes harder to fit, due to greater volatility in death and exposure data, is reached earlier for males, which could explain the higher mean squared error when looking at all ages.

Another possible reason for larger errors in the male population across all models may be the difference in population sizes of the underlying deprivation groups. These are higher in the female case, since in the observed age ranges females outlive males. A larger sample size could give the statistical model more stability when estimating parameters. However, the difference is not very large.

5.2.1 Standardized residuals

We perform a graphical analysis of these results by investigating the standardized residuals. When a model fits the data reasonably well, standardized residuals should not exhibit any pattern based on years or ages. To quote Cairns et al. [4], “if the model fits the data well, then the standardized residuals should be independent of each other, meaning that the heat plot should exhibit a high degree of randomness, with no discernible patterns”.

Figure 5 plots the heatmaps of the standardized residuals of the fitted rates for group 7 under the Li and Lee model and for group 3 under the CAE model. The plots for the other groups under the three SMMs are similar. The residuals are calculated for model M as

$$\begin{aligned} Z(x,t,i)^M = \frac{d(x,t,i) - E^c(x,t,i) \cdot {\hat{m}}(x,t,i)^M}{\sqrt{E^c(x,t,i) \cdot {\hat{m}}(x,t,i)^M}}, \end{aligned}$$
(5.4)

where \({\hat{m}}(x,t,i)^M\) denotes the fitted mortality rate under model M.
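Equation (5.4) is the usual Pearson residual under the Poisson assumption; a minimal per-cell sketch (illustrative only) is:

```python
import math

def standardized_residual(deaths, exposure, fitted_rate):
    """Standardized (Pearson) residual (5.4) for one (age, year, group)
    cell: (observed - expected) / sqrt(expected), where the expected
    death count is E^c(x,t,i) * m_hat(x,t,i)."""
    expected = exposure * fitted_rate
    return (deaths - expected) / math.sqrt(expected)
```

Under a well-fitting model these residuals are approximately standard normal, so heatmaps over age and year should show no discernible pattern.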

The two plots show a diagonal line, which is indicative of a cohort effect for individuals born around 1918, the end of the First World War and the time of the Spanish flu pandemic. We discuss these points in more detail in Appendix A.2. For all other ages and years, the standardized residuals appear to lack any specific pattern (Figs. 6, 7).

For the RNN models, the standardized residuals shown in Fig. 8 (the other groups show similar evidence) seem to be spread homogeneously across ages and years. There are no visible cohort effects in the residuals, unlike in the earlier models, and the residuals seem to be much smaller. These observations suggest that the RNN models used in this work are able to capture cohort effects in the data and yield a better in-sample fit for the observed years. However, we observe that for ages 70+/80+ in 2003 the residuals are large and positive, meaning that the observed mortality is higher than expected. Conversely, for 2004 the mortality rates seem to be over-estimated by the models. A possible explanation for the under-estimation in 2003 is the massive European heat wave in summer 2003, see Johnson et al. [15], the hottest summer in Europe for centuries. ISTAT reported over 18,000 more deaths in that summer than in the year before (+11.6%). As reported in Mattone [21], 91.8% of these were aged 75 and older. This could also explain in part the large number of negative residuals at older ages in 2004. These extraordinary circumstances may explain why the models could not fully detect such a sudden increase in the number of deaths. A similar result might be observed for 2020 or later following the COVID-19 pandemic.

We further compare the residuals as a function of age (Fig. 6) and of calendar year (Fig. 7), where groups and sexes have been chosen randomly and where we exclude the data for the cohorts born in 1916, 1917, and 1920, based on the observations in Appendix A.2.

No patterns are observable, except for the yellow points indicating ages 90+. In general, the residuals for the Li and Lee model (the same holds for the other SMM-type models) are spread at much higher levels than those of the RNN models, by a factor larger than ten. However, both model types show evenly distributed residuals, with the exception of the oldest ages, which seem harder to model. Presumably, this latter observation is driven by the smaller exposure at older ages.

Fig. 5

Selected residual heatmap plots per group

Fig. 6

Selected residual plots as a function of age, per group, both sexes

Fig. 7

Selected residual plots as a function of year, per group, both sexes

Fig. 8

Randomly selected residual heatmap plots per group

5.2.2 Out-of-sample measures

We analyse the out-of-sample performance of the implemented models by using a forecast period of four years, with \(n=46\) and \(T=4\). All figures have been obtained on the basis of the procedure described in Sect. 3 for RNN and Sect. 4 for the SMMs. Results are shown in Tables 7 and 8.

Table 7 Out-of-sample mean squared errors for different models and groups

First of all, we observe that the RNN-based approaches show a better out-of-sample performance, since their MSEs are considerably lower than those of their SMM counterparts. In more detail, for females and both sexes combined, the GRU networks tend to produce a lower error on average than the LSTM networks. Overall, it seems that LSTM models produce better forecasts with two layers in the female case and one layer in the male case, whereas for GRU models this holds with two layers in the female case and three layers in the male case. Hence, we cannot draw a simple conclusion on which number of layers uniformly provides better forecasts for both sexes. However, for a specific sex and network type, the fitted values closest to those observed are generated by the model with the same number of layers across most groups.

When analysing socio-economic groups, we find that the out-of-sample performance is similar across groups, except for a few outliers. Looking at the SMMs, we note that the MSE is considerably larger when forecasting the mortality rates for group 4 under the Plat model and, for males only, under the Li & Lee model. In these cases mortality is overestimated across all ages.

We conclude the section with two remarks. First, as noted in the analysis of their in-sample performance, NN models have the advantage of capturing cohort effects, while the other models would require the inclusion of specific cohort parameters; this may affect the precision of point forecasts. Second, a downside of NN models is that they produce only point forecasts. The strength of the stochastic models is their ability to provide a prediction interval, which gives more reliability when forecasting mortality rates. To mitigate this problem, an average over 40 forecasts has been taken in the NN case to prevent outliers in the predictions. However, this solves the problem only in part.

Table 8 Out-of-sample mean squared errors for different RNNs and groups

6 Conclusions and outlook

The main contribution of the present case study can be divided into two major themes. On the one hand, a NN approach has been implemented for the analysis of multiple populations based on socio-economic characteristics. On the other hand, an index of multiple deprivation has been created that is linked to the life expectancy of the Italian population across different counties.

These overarching aims have been achieved by subdividing the underlying data of the Italian population into nine socio-economic groups. Then, various (multi-)population models have been fitted to several decades of Italian mortality data. The original hypothesis, following the work of Wen et al. [39], was to detect differences in mortality across the underlying deprivation groups. This effect was only partially observed in the Italian data using the methods we implemented. Whereas the two most deprived groups clearly exhibited higher mortality rates, the other groups could not be separated as accurately as was possible for the British population.

Comparing the individual models, it turned out that all implemented approaches showed a very good fit to the data, both in-sample and out-of-sample. In general, the implemented NN models had a slightly more accurate fit than the classical statistical models. One reason could be that the latter do not include cohort-effect parameters, which was observable in the residual plots, whereas NNs are able to capture cohort effects. As a downside, however, NN models produce only point forecasts and face robustness problems when used to extrapolate further into the future. These robustness issues could be partly overcome by averaging several outcomes. Further steps may include the analysis of prediction intervals for the NN models, as proposed in Pearce et al. [25] or in Mancini et al. [20], which was beyond the scope of this work. An advantage of the classical statistical models by Li & Lee, Kleinow, and Plat is the interpretability of their parameters, which followed reasonable patterns over time for each deprivation group.

Determining which model is best suited for a certain group can be done based on the explanation ratio, the AIC, mean squared errors, error plots, or out-of-sample performance. Generally speaking, female mortality data was best described by the Li & Lee model, whereas male data was best fitted by Kleinow’s model (based on explanation ratios and AIC). Looking at in-sample errors, NNs outperform the other models, presumably because they capture cohort effects. Within the group of NN models, three-layered GRU models tend to produce the lowest in-sample errors. In general, old-age mortality prediction was challenging, and the models’ precision decreased at higher ages.

In forecasting via ARIMA, no single model uniformly outperformed the others. When forecasting via NN models, the three-layered GRU model did not necessarily outperform the others; indeed, the two-layered version performed better when forecasting, and in some cases even the one-layered LSTM delivered very good predictions.

Concluding, the models of Li and Lee, Kleinow, and Plat, and the NNs of type LSTM and GRU all detected time trends in the underlying data, provided a satisfying fit for all deprivation groups and sexes, and allowed for forecasting with different methods, at least for short horizons.

Upcoming steps could include additional fine-tuning, for example by adding cohort effects, trying different time-series structures, adding further input features (the flexibility of RNNs allows for several features to be included, on top of or in place of socio-economic indicators), conducting more hyper-parameter tuning, or adding further models for comparison, for example the Age-Period-Cohort-Improvement (APCI) models of Richards et al. [34] or the CNNs of Wüthrich and Meier [41]. As an outlook for applications of the developed tools and models, it would be interesting to use these (with socio-economic based features) when modelling mortality for pension plans. This might allow a discussion about social (un-)fairness, and private insurance companies could address and explain the problem of adverse selection (since mortality rates within insurance portfolios are normally lower due to the selection effect). During the review process, we became aware of the recent work of Scognamiglio [37], which proposes a NN approach for creating clusters of countries with similar mortality patterns. This might be an alternative to the clustering via the index we proposed.