Introduction

Aggregated data such as health insurance claims data are becoming increasingly available for research purposes. Recently, we proposed a new method to estimate the excess mortality in chronic diseases from aggregated age-specific prevalence and incidence data [1, 2]. So far, estimates of excess mortality have only been plausible for ages 50+ and have been shown to be unstable at younger ages. For example, in the simulation study of [2], the bias increases as the age decreases (Table 1 in [2]).

The theoretical background for estimating the excess mortality stems from the illness-death model for chronic diseases [3]. In [4] we have shown that the temporal change ∂p = (∂t + ∂a) p of the age-specific prevalence p is related to the incidence rate i, the mortality rates m0 and m1 of the people without and with the disease, respectively, the general mortality m and the mortality rate ratio R = m1/m0 via the following equations:

$$ \partial p = (1 - p)\,\{ i - p \times (m_{1} - m_{0}) \} $$
(1a)
$$ = (1 - p)\,\{ i - m \times p\,(R - 1)/[ 1 + p\,(R - 1) ] \}. $$
(1b)

Eqs. (1a) and (1b) hold under two assumptions: (a) there is no remission from the chronic condition back to the healthy state and (b) the age-specific prevalence of the chronic condition in the migrating population is the same as in the resident population.

Given the age-specific prevalence p, the age-specific incidence rate i and the general mortality rate m, Eqs. (1a) and (1b) can be used to estimate the excess mortality rate ∆m = m1 − m0 and the mortality rate ratio R [1, 2]:

$$ \Delta m = \{ i - \partial p/(1 - p) \}/p, $$
(2a)
$$ R = 1 + 1/p \times \{ i\,(1 - p) - \partial p \}/\{ (1 - p)\,(m - i) + \partial p \}. $$
(2b)
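As a minimal sketch, Eqs. (2a) and (2b) can be written as two short Python functions. The function names and the numerical inputs below are hypothetical and chosen only for illustration. They are not taken from the data of [5]. The example first computes ∂p and m from known m0 and m1 via Eq. (1a) and m = p m1 + (1 − p) m0, and then checks that the two estimators recover ∆m and R.

```python
def excess_mortality(p, dp, i):
    """Eq. (2a): excess mortality rate, Delta m = m1 - m0."""
    return (i - dp / (1.0 - p)) / p


def mortality_rate_ratio(p, dp, i, m):
    """Eq. (2b): mortality rate ratio, R = m1 / m0."""
    numerator = i * (1.0 - p) - dp
    denominator = (1.0 - p) * (m - i) + dp
    return 1.0 + numerator / (p * denominator)


# Hypothetical inputs (illustration only, not data from [5]).
p, i = 0.10, 0.010        # prevalence and incidence rate
m0, m1 = 0.010, 0.020     # mortality rates without / with the disease

dp = (1.0 - p) * (i - p * (m1 - m0))   # forward direction, Eq. (1a)
m = p * m1 + (1.0 - p) * m0            # general mortality

delta_m = excess_mortality(p, dp, i)       # should recover m1 - m0
R = mortality_rate_ratio(p, dp, i, m)      # should recover m1 / m0
```

With these inputs the round trip returns ∆m = m1 − m0 and R = m1/m0 up to floating-point error, which is a simple consistency check of the formulas against each other.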

The aim of this research note is to explore the reasons why estimates of excess mortality for younger ages are biased and what can be done to extend the age range to ages below 50 years. As a testing example, we use claims data about diabetes from the German statutory health insurance based on about 70 million people collected during the period from 2009 to 2015 [5].

Main text

Methods and materials

Goffrier et al. report the age-specific prevalence p of type 2 diabetes in 2009 and 2015 [5]. The age-specific prevalence data p for men in 2009 and 2015 are modeled by a linear regression model after application of a logit transformation. Furthermore, the age-specific incidence rate i for diabetes in men halfway between 2009 and 2015, i.e., in the year 2012, is reported. The age-specific incidence rate i for 2012 is modeled by a linear regression model after a log-transformation. These data are used as input for Eqs. (2a) and (2b). For applying Eq. (2b) we also use the general mortality m in 2012 from the Federal Statistical Office of Germany.
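The two regression steps can be sketched as follows. This is an illustration under assumptions: the predictor is taken to be a straight line in age, which the text above does not specify, and the age midpoints, prevalences and incidence rates are hypothetical values, not the data of [5]. The helper names (`fit_line`, `prevalence_model`, `incidence_model`) are likewise invented for this sketch.

```python
import math


def fit_line(x, y):
    """Ordinary least squares fit y ~ a + b * x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def logit(q):
    return math.log(q / (1.0 - q))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical age-group midpoints and rates (NOT the data of [5]).
ages = [22.5, 32.5, 42.5, 52.5, 62.5]
prevalence = [0.002, 0.008, 0.030, 0.090, 0.200]
incidence = [0.0003, 0.0010, 0.0040, 0.0100, 0.0180]

# Linear regression on the logit scale (prevalence) and log scale (incidence).
a_p, b_p = fit_line(ages, [logit(q) for q in prevalence])
a_i, b_i = fit_line(ages, [math.log(r) for r in incidence])


def prevalence_model(age):
    """Back-transformed fitted prevalence, always inside (0, 1)."""
    return expit(a_p + b_p * age)


def incidence_model(age):
    """Back-transformed fitted incidence rate, always positive."""
    return math.exp(a_i + b_i * age)
```

Fitting on the transformed scales keeps the back-transformed prevalence between 0 and 1 and the incidence rate positive at every age.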

With these input data, Eqs. (2a) and (2b) allow us to estimate the age-specific excess mortality ∆m and the mortality rate ratio R. While R has a straightforward interpretation as the ratio of the mortality rate of the diabetic population compared to the non-diabetic population, the excess mortality rate ∆m is more interpretable when it is related to another mortality rate. Since m = p m1 + (1 − p) m0 ≥ m0 whenever m1 ≥ m0, we have ∆m/m ≤ ∆m/m0 = R − 1 and thus R ≥ 1 + ∆m/m ≥ ∆m/m. Hence, we decided to report the quotient ∆m/m, which is a lower bound for R.

In order to assess uncertainty in the results, we implemented a multidimensional probabilistic sensitivity analysis [6]. The key idea is to randomly sample from the distributions of input parameters (i.e., prevalence in 2009 and 2015, and incidence in 2012), and calculate the outcomes (i.e., measures of excess mortality). As the input parameters are sampled from random distributions many times, we get a sequence of outcomes, which also follows a random distribution representing the combined uncertainty in the input parameters [6]. We report empirical medians, and 2.5% and 97.5% quantiles for approximate 95% confidence intervals of the outcomes based on 5000 samples from the input distributions.
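The sampling scheme can be sketched in a few lines of Python. Several details below are assumptions made for illustration and are not stated in the text: the inputs are drawn from normal distributions on the logit scale (prevalence) and log scale (incidence) with a common hypothetical standard error `se`. The derivative ∂p is approximated by the difference quotient of the two prevalences over the six years, where `p_before` stands for the 2009 prevalence at age a − 3 and `p_after` for the 2015 prevalence at age a + 3. The prevalence in 2012 is approximated by their mean. All numerical values are hypothetical.

```python
import math
import random
import statistics


def excess_mortality(p, dp, i):
    """Eq. (2a): excess mortality rate, Delta m = m1 - m0."""
    return (i - dp / (1.0 - p)) / p


def logit(q):
    return math.log(q / (1.0 - q))


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def quantile(sorted_values, q):
    """Empirical quantile, linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lower = int(math.floor(pos))
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = pos - lower
    return sorted_values[lower] + weight * (sorted_values[upper] - sorted_values[lower])


def sensitivity_analysis(p_before, p_after, i, m, se, years=6.0, runs=5000, seed=1):
    """Return (2.5% quantile, median, 97.5% quantile) of Delta m / m."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(runs):
        # Assumed input distributions: normal on the transformed scales.
        pb = expit(rng.gauss(logit(p_before), se))
        pa = expit(rng.gauss(logit(p_after), se))
        inc = math.exp(rng.gauss(math.log(i), se))
        dp = (pa - pb) / years       # assumed difference quotient for the derivative
        p_mid = 0.5 * (pb + pa)      # assumed prevalence halfway through the period
        ratios.append(excess_mortality(p_mid, dp, inc) / m)
    ratios.sort()
    return quantile(ratios, 0.025), statistics.median(ratios), quantile(ratios, 0.975)


# Hypothetical inputs for one age group (illustration only).
low, med, high = sensitivity_analysis(p_before=0.08, p_after=0.10, i=0.012, m=0.010, se=0.02)
```

The fixed seed makes the run reproducible. The three returned values correspond to the lower bound, median and upper bound reported for each age.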

Results

Figure 1 shows the age-specific ratio ∆m/m. Below 50 years of age the excess mortality rate is more than 10 times higher than the mortality rate of the general population. The ratio peaks at a value of more than 200 at the age of about 30 years. As R ≥ ∆m/m, we see that the estimate of the excess mortality is extraordinarily high.

Fig. 1 Age-specific ratio ∆m/m of the excess mortality (∆m) and the general mortality (m). The graph shows the empirical median of ∆m/m with 95% confidence bounds (vertical bars) based on the probabilistic sensitivity analysis with 5000 simulation runs. The ratio ∆m/m is a lower bound for the mortality rate ratio R

Applying Eq. (2b) to obtain the mortality rate ratio R yields the results shown in Table 1. For ages below 55 years, the mortality rate ratios are implausibly high or become negative. As the mortality rate ratio is by definition a quotient of two positive rates, negative values are not possible. Thus, the estimates based on Eq. (2b) do not yield sensible results for lower age groups and are not reliable there.

Table 1 Mortality rate ratios (R) for different age-groups

Discussion

In this manuscript we have applied two methods to estimate indices for the excess mortality of a chronic condition from age-specific prevalence and incidence data. The first index is the difference ∆m between the mortality rate of the diseased people (m1) and the people without the disease (m0), i.e., ∆m = m1 − m0. Sometimes, the index ∆m is called attributable risk [7]. The second index is the mortality rate ratio R = m1/m0. In an example about diabetes in the German male population, it turns out that both estimates are numerically unstable for ages below 50 years. In the case of ∆m, unreasonably high values have been obtained in the diabetes data (more than 200 times the mortality of the general population). The estimated values of R can be implausible, including negative rate ratios.

The question arises whether the implausible results might be a consequence of the assumptions for Eqs. (1a) and (1b) being violated. The two assumptions are: no remission, and prevalence in migrants is the same as in residents. While remission of diabetes has indeed been observed [8], it has not been a relevant therapy option or health policy in Germany during the study period. Note that the input data [5] refer to millions of people. Little is known about the second assumption. The prevalence of diabetes in migrants from and to Germany has so far not been investigated at the population level. However, in another age-related chronic disease (dementia), we analyzed the most extreme cases (i.e., all immigrants having the chronic condition and all emigrants being free from the chronic condition, and vice versa) and the overall epidemiological measures were only negligibly affected [9]. Thus, we think that violations of the two assumptions have only very minor effects on the reported results.

Implausible results, at least in theory, may be due to changes in the distributions of relevant covariates in the input data. Examples of relevant covariates are the diagnostic criteria for diabetes, the distribution of disease duration, the distribution of body weight, the quality of glucose control and the presence of co-morbidities. Possible effects of changing covariates are not estimable by our method and we do not doubt that these exist. However, we believe that the study period (2009–2015) is too short to comprise considerable changes. Furthermore, in Germany there has not been a change in diagnostic criteria for diabetes during the study period.

In simulation studies, we found that the diagnostic accuracy of the claims data plays a crucial role for the proposed methods of estimating excess mortality. By diagnostic accuracy we mean sensitivity and specificity of the claims data compared to the gold standard of diagnosing diabetes. In principle, diagnostic accuracy may undergo secular changes, e.g., if reimbursement policy changes. For instance, the 2015 prevalence could contain more false positive diagnoses than the 2009 prevalence if physicians obtained higher reimbursement at the later point in time. We note, however, that such up-coding is fraud and is subject to penalty. The impact of changes in diagnostic accuracy is the subject of an ongoing theoretical analysis (including a comprehensive simulation study) intended for an upcoming paper.

Based on the results in this example, we see that special attention is required when interpreting the results of the two estimation techniques in lower age ranges.

Limitations

The aim of this research note was to assess the performance of two recently published estimators for the excess mortality of a chronic disease from prevalence and incidence data. While in previous publications [1, 2] reasonable results have been found for ages over 50 years, here we demonstrated problems of these estimators in younger age groups. The reasons for the problems seem to lie in the estimators themselves. For instance, if the partial derivative of the prevalence (∂p) is close to zero and the incidence rate (i) is close to the general mortality (m), i.e., i ≈ m, the denominator in Eq. (2b) is close to zero. Thus, the fraction on the right-hand side of Eq. (2b) becomes very large in magnitude, and its sign depends on whether i lies slightly below or slightly above m. This explains the highly oscillating values in Table 1. Although Eq. (2a) does not have the (cancellation) problem for i ≈ m, implausibly high values are obtained too. The reason is the factor 1/p on the right-hand side of Eq. (2a). For values of the prevalence (p) close to zero, the reciprocal 1/p becomes very large. For example, in the lowest age group (15–19 years), the fraction 1/p takes values of about 900, which explains the high estimate for ∆m in this age group. Strategies to overcome these problems are currently under development and will be the subject of a future article.
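The two sources of instability described in this section can be illustrated numerically. The following sketch uses hypothetical values chosen only to mimic a very young age group (low prevalence, ∂p = 0, incidence rate close to the general mortality). They are not taken from Table 1 or from [5]. The function names are invented for this sketch.

```python
def excess_mortality(p, dp, i):
    """Eq. (2a): excess mortality rate, Delta m = m1 - m0."""
    return (i - dp / (1.0 - p)) / p


def mortality_rate_ratio(p, dp, i, m):
    """Eq. (2b): mortality rate ratio, R = m1 / m0."""
    numerator = i * (1.0 - p) - dp
    denominator = (1.0 - p) * (m - i) + dp
    return 1.0 + numerator / (p * denominator)


# Hypothetical inputs: prevalence of 0.1%, no temporal change in prevalence.
p, dp, m = 0.001, 0.0, 0.00050

# Incidence slightly below and slightly above the general mortality m.
R_below = mortality_rate_ratio(p, dp, i=0.00049, m=m)   # tiny positive denominator
R_above = mortality_rate_ratio(p, dp, i=0.00051, m=m)   # tiny negative denominator

# Amplification by 1/p in Eq. (2a): here 1/p = 1000.
delta_m = excess_mortality(p, dp, i=0.00050)
ratio_to_general_mortality = delta_m / m

print(R_below, R_above, ratio_to_general_mortality)
```

A change in the incidence rate of only 0.00002 flips the estimate of R from a very large positive value to a very large negative one, while ∆m/m is inflated by the factor 1/p.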