Left-censored dementia incidences in estimating cohort effects

We estimate the dementia incidence hazard in Germany for the birth cohorts 1900 to 1954 from a simple random sample drawn from Germany's largest health insurance company. Followed from 2004 to 2012, 36,000 uncensored dementia incidences are observed, and a further 200,000 right-censored insurants are included. From a multiplicative hazard model we find a positive and linear trend in the dementia hazard over the cohorts. The main focus of the study is on 11,000 left-censored persons who had already suffered from the disease in 2004. After including the left-censored observations, the slope of the trend declines markedly, an instance of Simpson's paradox: left-censored persons are imbalanced between the cohorts. When left-censoring is included, the dementia hazard increases differently at different ages; we consider omitted covariates to be the reason. For the standard errors from large-sample theory, left-censoring requires an adjustment to the conditional information matrix equality.


Introduction
When studying the incidence of dementia, it is necessary to acknowledge the age of a person, and useful to study the evolution over time (cohort effect) (Doblhammer et al. 2013; Wu et al. 2016). From data of the nine-year period 2004 to 2012, we observe, for the German population born between 1900 and 1954, the ages at which dementia is diagnosed. For insurants of Germany's largest health insurer, we drew a simple random sample of 250,000 persons in 2004. An insurant with dementia incidence before the study period, i.e. prior to 2004, is left-censored. Together with the 80% of persons right-censored without dementia by 2013, double-censoring is the resulting missing-data pattern (see e.g. Ren and Gu 1997; Cai and Cheng 2004; Kim et al. 2013; Dörre and Weißbach 2017; Shen and Chen 2018).
We estimate the effect of cohort, age and sex from the Health Claims Data (HCD), with the cohorts, grouped into decades, as dummy variables. Given that our data are a random sample, covariates are random as well, and we maximize the likelihood conditional on the covariates (CMLE). In order to derive consistency and asymptotic normality for double censoring, as Ren and Gu (1997) and Cai and Cheng (2004) do, we apply results about M-estimation, however for a different model and criterion function. Effort is devoted to obtaining uniform convergence of the criterion functions with Wald's dominating condition, so that convergence of the criterion function translates into convergence of the maximizing arguments. Also, the Conditional Information Matrix Equality needs to be adjusted for left-censoring, so that standard errors for the confidence intervals can be calculated without resorting to sandwich estimation for an M-estimator.
As can be expected, for the HCD, we find that standard errors shrink when including the 11,000 left-censored insurants. The cohort effect is generally negative in the sense that the dementia hazard has increased over the decades. However, with left-censoring, the slope of that increase is smaller. Another finding is that including left-censored persons increases the incidence of dementia at younger ages and attenuates the increase in dementia with age. That dementia is slightly more likely for males than for females becomes almost irrelevant after including left-censoring.

Population and model for age-at-dementia-incidence
The population in the demographic sense are, basically, Germans born between 1900 and 1954. We will not distinguish between different demarcation frontiers of Germany. As the statistical population, we will use persons insured by one German health insurer in 2004, and use its Health Claims Data (HCD). Note that health insurance is mandatory in Germany. The first three boxes in Fig. 1 depict the selection of people from the demographic to the statistical population. Our primary variable is age at dementia incidence and, roughly speaking, we wish to perform a lifetime data regression with the two covariates 'cohort', classified according to decades, and 'gender'. As the age at dementia incidence is strictly positive, the theoretical simplicity of an additive model (see Kremer et al. 2014) is not appealing in demography, so that we model the effects as multiplicative to the hazard, as in Sect. III.1.4 of Andersen et al. (1993). An unspecified 'baseline' hazard $a_0(t)$, resulting in the semiparametric Cox-type model $a(t|z) = a_0(t)e^{\beta^\top z}$, would safeguard against model mis-specification. However, a widely acceptable weight function or smoothing parameter is out of sight in demography, whereas a Gompertz baseline is standard.

[Fig. 1: Demographic population: Germans born 1900-1954 → Selection 1st stage: persons surviving 31/12/2003 → Selection 2nd stage (statistical population): persons insured by AOK (= 25,388,191) → Selection 3rd stage: random 2.2% of persons over 50 (n = 245,888) → Deselection of (left-censored) dementia incidences before 1/1/2004 (= 10,986)]

Definition 1
The duration $Y$ in years between the person's 50th birthday (risk onset) and dementia incidence has a hazard rate, conditional on $Z = z$, $a(t|z) = a\,e^{\beta_1 t} e^{\beta_2^\top \tilde z} e^{\beta_3 z_s} = e^{\theta^\top (1, t, z^\top)^\top}$, where $t$ (also) denotes the age (since the 50th birthday) and $z := (\tilde z^\top, z_s)^\top$. We have $\theta_1 = \log a$. It is $\tilde z_1 = 1$ for a person born between 1900 and 1909, and zero otherwise. It is $\tilde z_2 = 1$, $\tilde z_3 = 1$ or $\tilde z_4 = 1$ for a person who was born in the 1910s, the 1920s or the 1940s, respectively. It is $\tilde z_5 = 1$ for a person who was born between 1950 and 1954, the latest date possible for a person to become 50 years old prior to the start of the study in 2004. (The thirties are the reference cohort.) The coding of cohorts is displayed in Table 1, and Fig. 2 displays (at the bottom) the coding for one uncensored person, i.e. with dementia incidence during the study period. The $z_s$ codes the sex (0 = male, 1 = female). We denote by $e^{\beta_1 t}$, or $\beta_1$, the age effect, and by $e^{\beta_2^\top \tilde z}$, or $\beta_2$, the cohort effect. In short, the eight parameters $\theta_k$ of the model are $\theta := (\log a, \beta_1, \beta_{21}, \ldots, \beta_{25}, \beta_3)^\top$. We assume that the distribution of $Z$ does not depend on $\theta$.
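For concreteness, the hazard of Definition 1 can be evaluated numerically. The following sketch (function and argument names are our own, not from the paper) encodes $a\,e^{\beta_1 t} e^{\beta_2^\top \tilde z} e^{\beta_3 z_s}$ with the eight-dimensional parameter $\theta$:

```python
import numpy as np

def hazard(t, z_cohort, z_sex, theta):
    """Hazard of Definition 1: a * exp(b1*t) * exp(b2'z~) * exp(b3*z_s).

    theta = (log a, b1, b21, ..., b25, b3); t is age minus 50 in years;
    z_cohort is the 5-vector of cohort dummies (1930s = reference);
    z_sex is 0 (male) or 1 (female).  All names are illustrative.
    """
    log_a, b1 = theta[0], theta[1]
    b2 = np.asarray(theta[2:7])
    b3 = theta[7]
    return np.exp(log_a + b1 * t + b2 @ np.asarray(z_cohort) + b3 * z_sex)
```

For the reference cohort (all dummies zero) and a male at $t = 0$, the hazard reduces to $a = e^{\theta_1}$.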

Health claims data, censoring and criterion function
We use HCD from the Allgemeine Ortskrankenkasse (AOK), the largest public health insurance company in Germany. The claims data contain information about outpatient and inpatient diagnoses and treatments, on a quarterly basis, for each insured person with at least one day of insurance coverage, regardless of whether they sought medical treatment or not. The data include information about sex, age, year of birth and date of exit (death or switch to another insurance company). All inpatient and outpatient diagnoses are coded in the International Statistical Classification of Diseases and Related Health Problems (ICD), revision 10, issued by the World Health Organization. For this study, the health insurance company drew a random sample of 250,000 persons with a follow-up until the end of 2012, given that the persons were insured in the first quarter of 2004. This corresponds to approximately 2.2% of the statistical population who were born before 1955 and survived the year 2003 (see Sect. 2). Dementia is defined by the ICD-10 numbers G30, G310, G3182, G231, F00, F01, F02, F03 and F051, for which we exclusively consider outpatient diagnoses with the modifier verified, and discharge secondary diagnoses from the inpatient sector. We do not distinguish according to aetiology, and combine all ICD codes into one group named dementia. The method of diagnosis validation is laid out in Doblhammer et al. (2013) and it results in n = 245,888 observations after data cleaning. Let us consider the potential obstacles when applying Definition 1 to the HCD. Consider the dummy variable that codes the cohort, e.g. $\tilde Z_4$ for the 1940s (see again Table 1). Its parameter is the probability of selecting a person born in that decade from the insurants. The variable is 'exogenous' and will not disturb our inference to the statistical population (see again Fig. 1), as we will use the conditional likelihood. Inference to the demographic population will be considered in Sect. 5.2.
The typical obstacle to statistical inference for Definition 1 is that the duration $Y$ may be subject to right- or left-censoring. Occasionally, left-censored observations are deselected in the demographic literature, as depicted in the last box of Fig. 1, and we aim in this study to assess the consequences thereof. Let us derive the censoring notation. For each person, we record the year and month of birth, but the year and month of death only if it falls in the study period between 1/1/2004 and 31/12/2012. We denote the age at the start of the study period on 1/1/2004, given in months since the 50th birthday, by $L$. Birth and death are assumed to occur in the middle of a month. For each person, we observe the age in months at the time of dementia incidence, date of death, loss to follow-up or end of study. We assume an onset of dementia risk at the age of 50 and denote the subsequent time as $Y$. We attribute a diagnosis to the middle of a quarter, and a loss to follow-up to the end of a quarter. The censoring indicator $\Delta$ is 0, i.e. $Y$ is uncensored, if the dementia incidence occurs in the study period. It is $\Delta = 1$, i.e. $Y$ is right-censored, if (i) the incidence is past the study period (i.e. after the fourth quarter of 2012 (Q4/2012)), (ii) the patient dies without having had dementia, or (iii) the patient is lost to follow-up. The censoring is $\Delta = -1$, and $Y$ is left-censored, if the incidence has been prior to the study period, i.e. before Q3/2004. It is also assumed that a dementia diagnosis in Q1/2004 or Q2/2004 is a prevalent case, i.e. a left-censored observation (see Table 2).
$$\tilde Y := \begin{cases} Y \ (\text{age at dementia diagnosis} - 50) & \text{for } L \le Y \le R \quad (\Delta := 0),\\ L \ (\text{age at beginning of study period} - 50) & \text{for } Y < L \quad (\Delta := -1),\\ R \ (\text{age at event (i)-(iii)} - 50) & \text{for } R < Y \quad (\Delta := 1). \end{cases}$$

Note that although the study period is fixed, entrance differs individually, depending on the birthday. We do not (and do not need to) model the birthday, as we can leave the joint distribution of $(L, R)$ unspecified, apart from $R > L$. We denote its parameter by $\theta_c$. As a consequence, the censoring indicator $\Delta$ is random. We will see that $\Delta$ is endogenous, that is, its distribution depends on $\theta$. As specified for $\tilde Z_4$ earlier, the entire covariate $Z$ is random due to sampling, and exogenous. The values $y$, $\delta$ and $z$ can obviously be calculated with the definitions from Sect. 2 for each person. Figure 2 displays the coding for a right- and a left-censored person (top and middle), i.e. with dementia incidence outside the study period. We need the distribution of $(\tilde Y, \Delta, Z)$ when defining the criterion function. To derive the density, we commence with left-censoring and assume $Y$ and $L$ to be independent. The age at the beginning of the study period, $L$, is also assumed to be independent of $Z$, and we denote its density and CDF by $f_L(\cdot)$ and $F_L(\cdot)$. We observe $\tilde Y := \max\{Y, L\}$. Note that the conditional density $f(\cdot|\cdot)$ and CDF $F(\cdot|\cdot)$ are those of the 'latent' $Y$. Right-censoring instead of left-censoring is similar, only with $F(y|z)$ replacing $1 - F(y|z)$, and $1 - F_R(y)$ replacing $F_L(y)$. Of course, with dementia the residual lifetime will be reduced, so that, for right-censoring cause (ii), age-at-death and age-at-dementia will not be stochastically independent. However, we consider the censoring noninformative because we are not interested in mortality but in morbidity, and it is not plausible that death has an impact on the (still conceptually existing) time until dementia.
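The case distinction above translates directly into code; a minimal sketch (variable names are illustrative) mapping the latent duration and the individual observation window to the observed pair $(\tilde y, \delta)$:

```python
def observe(y, l, r):
    """Map the latent age-at-incidence y and the individual window [l, r]
    to the observed pair (y_tilde, delta) as in the displayed cases:
    delta = 0 uncensored, -1 left-censored, +1 right-censored."""
    if y < l:
        return l, -1   # incidence before the study period: left-censored
    if y > r:
        return r, 1    # incidence past the study window: right-censored
    return y, 0        # incidence observed within the study period
```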
The conditional density under double-censoring is easily derived (see Proposition 1 in Dörre and Weißbach (2017), or Formula (3) in Kim et al. (2013)).
As the first step towards the criterion function, recall that the distribution of $Z$ does not depend on $\theta$. The parameter of the covariate is a nuisance, so 'conditioning' applies (see Kalbfleisch and Sprott 1970; Reid 1995). For the bivariate dependent variable $(\tilde Y, \Delta)$, the conditional likelihood method is appropriate for estimating the parameter vector $(\theta, \theta_c)$, with $\theta_c$ denoting the censoring nuisance parameter. Note that due to endogeneity, it is not possible to separately relate $\theta$ to $\tilde Y$ and $\theta_c$ to $\Delta$. Also note that the categorical scale of the second dependent variable, $\Delta$, is not an obstacle, as, more importantly, $\theta$ is on a continuous scale. Ultimately, we want to restrict attention to $\theta$. As the distribution of $L$ is unconnected to the parameter $\theta$, the third and fourth factors of (3) will not influence the point estimate found by maximization with respect to (wrt) $\theta$, as can be seen from the usual logarithmic transformation. The same is true for $R$. The impact of $\theta_c$ on the standard errors is studied in Sect. 4. Note already that the parameter of $\Delta$ can be obtained from the conditional model by a smooth function of $\theta$ and $\theta_c$. The conditional likelihood, denoted by $L^c(\theta, \theta_c)$, is the product over (3) (amended by right-censoring). We now collect all factors in $L^c(\theta, \theta_c)$ that contain $\theta$ and define from them our criterion function, whose exponential form is (5). Note that, due to the last summand, we need observations for all persons. We cannot expect a low-dimensional sufficient statistic, as is occasionally the case for only right-censored survival data. The maximizing argument of (5) is denoted by $\hat\theta$ and is determined numerically. An initial value must avoid negatively infinite log-conditional-likelihood values. Specifically, we start from a model with only the 35,920 uncensored observations and without $Z$, i.e. with $\beta_2 = 0$ and $\beta_3 = 0$. The closed-form estimate for $(a, \beta_1)$ is then $(0.22 \times 10^{-3}, 0.152)^\top$.
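To illustrate the maximization of (5), the following sketch fits a Gompertz hazard to simulated double-censored durations without covariates; the data-generating values, the log-parametrization and all names are our own assumptions, not the paper's implementation:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

def neg_loglik(par, y, delta):
    """Negative logarithm of the criterion: density f for an uncensored
    (delta=0), CDF F for a left-censored (delta=-1) and survival 1-F for a
    right-censored (delta=1) observation; Gompertz baseline, no covariates.
    par holds (log a, log beta_1), so both parameters stay positive."""
    a, b1 = np.exp(par)
    H = (a / b1) * (np.exp(b1 * y) - 1.0)      # cumulative hazard
    log_f = np.log(a) + b1 * y - H             # uncensored contribution
    log_F = np.log1p(-np.exp(-H))              # left-censored contribution
    return -np.sum(np.where(delta == 0, log_f,
                            np.where(delta == -1, log_F, -H)))

# simulate Gompertz durations by CDF inversion, then censor them on both sides
a0, b0, n = 0.05, 0.15, 20000
y_lat = np.log1p(-(b0 / a0) * np.log1p(-rng.uniform(size=n))) / b0
L = rng.uniform(0.0, 5.0, size=n)              # individual entry age minus 50
R = L + 9.0                                    # nine-year study window
delta = np.where(y_lat < L, -1, np.where(y_lat > R, 1, 0))
y_obs = np.clip(y_lat, L, R)

fit = minimize(neg_loglik, x0=[np.log(0.1), np.log(0.1)],
               args=(y_obs, delta), method="Nelder-Mead")
a_hat, b_hat = np.exp(fit.x)
```

With $n = 20{,}000$ observations, the estimates recover the data-generating values up to sampling error.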
Using now the covariates, the logarithm of (5) is first maximized over the uncensored observations only. Including now the censored observations, the logarithm of (5) attains, over all n = 245,888 observations, a numerical maximum of −191,444. The adequacy of the numerical maximizations was verified ex post for convergence. The resulting point estimates are given in Table 3 (as the two rows 'Definition 1', with and without the left-censored observations) and will be discussed in Sect. 5, together with standard errors derived in the next section.

Statistical inference
Let us here study the implication of left-censoring for estimator consistency and normality, the latter with consequences for the confidence intervals. Along with the nuisance parameter for censoring, $\theta_c$, the distribution of the random (multivariate) covariate also has a parameter, which we do not denote explicitly. Roughly speaking, studying the asymptotic normality of a Maximum Likelihood Estimator (MLE) for all parameters enables deducing the normality of the estimator for $\theta$, even when only maximizing (5). In more detail, as stated, we aim at disposing of the parameter of the (exogenous) covariate by conditioning, i.e. by factorizing the likelihood. As censoring is endogenous, instead of conditioning, we aim at disposing of $\theta_c$ by virtue of the fact that the derivatives wrt $\theta$ of the logarithm of (5) and of the logarithmic unconditional density of $[(\tilde Y_i, \Delta_i), Z_i]$ are equal. To see this, factorize the latter unconditional density into (3) and the marginal density of $Z_i$. After taking the logarithm and differentiating, the marginal density of $Z_i$ and the distribution of $(L_i, R_i)$ vanish, as both are assumed not to depend on $\theta$.
Note already that, for the derivation of asymptotic confidence intervals, including standard errors, arguments will be needed that prevent the use of the entire distribution, which would again include the covariate parameter and the censoring nuisance parameter.
In order to establish the asymptotic normality of the estimator, in Sect. 4.2, we will use a Taylor expansion of the score equation. An important requirement on several occasions will be the consistency of $\hat\theta$ (maximizing (5)). There are several sets of assumptions underlying such a proof (see e.g. Property 8.1 in Gouriéroux and Monfort 1995a). The main idea behind Wald's dominating conditions (6) is to ensure that the convergence of the criterion function (as a sequence in n) is uniform (as a function of $\theta$). This will in turn ensure that the maximizing argument, $\hat\theta$, converges to the true parameter $\theta_0$ for $Y$ (conditional on $Z = z$).

Wald's dominating condition
Even though we want to cover double-censored durations, we commence with uncensored observations. And, for simplicity of the argument, we start without covariates. Hence, the criterion function (5) reduces to the likelihood, and for the MLE, we verify Wald's D conditions. Of the Wald conditions, especially condition D3 is cumbersome, namely to find an integrable positive function $h(y)$ that dominates the likelihood ratio (6). Here $f(y; \theta)$ is synonymous with $f(y|z)$ in (1). The idea is to set $h(y)$ as the upper bound to the left in the first inequality of (6), wrt $\theta$, and to show integrability wrt $y$. Let us set $\Theta = [\varepsilon, 1/\varepsilon]$, for some small $\varepsilon > 0$, and even ignore the age effect up to this point, i.e. set $\beta_1 = 0$.
The proof stems from the following graphical arguments. Obviously, in the $y$-direction, the log-likelihood ratio is linear. It increases for $a < a_0$ and decreases for $a_0 < a$. In both cases, the slope decreases (in absolute terms) as $a$ approaches the true parameter $a_0$. For $a = a_0$ the function is constant. More important is the direction of the argument $a$, for which $\log LR(y, a, a_0)$ is concave (see Fig. 3 (left)). As a function of both arguments, the ratio has the shape of a pear leaf (Fig. 3 (middle)). The log likelihood ratio is concave in $a$ with local minima at the edges of the parameter space, $\{\varepsilon, 1/\varepsilon\}$. If the function were negative, these would be the only potential maxima of the absolute value. Unfortunately, this is not the case, as Lemma 4 in "Appendix A" exhibits. Hence, the function has its maximum either at $a = \varepsilon$, at $a = 1/\varepsilon$ or at $a = 1/y$. Figure 3 (right) displays the 'used-handkerchief shape' that $\log LR(y, a, a_0)$ exhibits as a function of $a$ and $y$. The function $h(y)$ is composed as the maximum over the only three candidates $\varepsilon$, $1/y$ and $1/\varepsilon$. The analytical version of the proof is in "Appendix A". Graphically, one considers the three one-dimensional functions through the three-dimensional space depicted in Fig. 3 (right) as candidates. For the first two candidates, whatever $y$, the maximum is at the same $a$, namely on the edge (on the room's left and right walls). These candidate functions are parallel. This is not true for the third function, because the maxima are at $a = 1/y$ in the parameter space. (It proceeds in a curve through the space.) Now imagine the two-dimensional vertical plane spanned by the $y$-axis and the axis of the log-likelihood (i.e. the left wall of a room you enter), and imagine a projection of the three function graphs onto that plane (as shadows on the left wall cast by a light source on the right wall); the upper hull in this picture is the graph of $h(\cdot)$. For the example $a_0 = 0.5$, Fig. 4 depicts $h(\cdot)$ and suggests that one linear edge extremum quickly dominates the other two candidates. As a consequence, the second half of condition (6) is fulfilled, as it is proportional to the expectation of an exponential distribution, being $1/a_0$ and hence finite, due to the compact support of the parameter space.

[Fig. 3: Left: log likelihood ratio (7) for y = 0.2 and a₀ = 0.5; middle: log likelihood ratio (7) for a₀ = 0.5; right: absolute value of the log likelihood ratio for a₀ = 0.5]
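The graphical argument, that the dominating function $h(\cdot)$ is the pointwise maximum of $|\log LR|$ over the three candidates $\varepsilon$, $1/y$ and $1/\varepsilon$, can be checked numerically; a sketch for the exponential case (the grid and constants are illustrative):

```python
import numpy as np

eps, a0 = 0.1, 0.5

def log_lr(y, a):
    """Exponential log likelihood ratio log f(y; a) - log f(y; a0)."""
    return np.log(a / a0) - (a - a0) * y

def h(y):
    """Dominating function: the maximum of |log LR| over the three candidate
    arguments eps, 1/eps and the stationary point 1/y (clipped to the
    parameter space [eps, 1/eps])."""
    cands = [eps, 1.0 / eps, np.clip(1.0 / y, eps, 1.0 / eps)]
    return max(abs(log_lr(y, a)) for a in cands)

# numeric check of the domination on a grid: sup_a |log LR(y, a)| <= h(y),
# since |concave function| attains its sup at the edges or the interior max
ys = np.linspace(0.05, 20, 200)
grid_a = np.linspace(eps, 1 / eps, 400)
assert all(max(abs(log_lr(y, a)) for a in grid_a) <= h(y) + 1e-9 for y in ys)
```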
The aim is now to include the left-censoring. But then, the criterion function will not be the likelihood, but only a factor thereof. Without right-censoring and covariates, (4) becomes a product of uncensored and left-censored contributions. There is an analogous criterion to Wald's D3-condition (6) for M-estimation (see Sect. 24.2.3, condition C2' in Gouriéroux and Monfort (1995b), originally Theorem 2 of Jennrich 1969). The proof of the following Lemma is given in the Appendix.
Lemma 2 Assume $Y_1, \ldots, Y_n \sim Exp(a_0)$ and $L_1, \ldots, L_n \sim F_L(\cdot)$ to be all independent and $0 < P(Y < L) < 1$. For $\tilde Y$ and $\Delta$ as in (2), there is a function $h_L(y, \delta)$ such that (8) holds.

In contrast to the method of proof in Kremer et al. (2014), the method here easily extends to double-censored observations. The sum (8) then has three summands. And in the third summand, $h_L(y, 1) = \frac{1}{\varepsilon} y$ is easily found by the linearity of $ay$ wrt $a$. The resulting integral is then again proportional to the expectation of $Y$, given that right-censoring is neither impossible nor sure. The Lemma also carries over to covariates, as we consider the conditional densities. The aim is also to include the age effect. The parameter space now has two dimensions, $\beta_1$ lying between a small and a large positive real value, and the log-likelihood ratio involves (see Definition 1 and (1)) $\log a(y) = \log a + \beta_1 y$. Now the second half of (6) does not simplify to an expectation, but, interchanging the integral with the sum, the summands are bounded either because of the density property, or because of the finite expectation, or because $e^{\beta_1 y}$ can be absorbed into the exponential function of the density to form a new Gompertz distribution's density. Taking maxima over the parameter space does not hinder this, as the parameter space is bounded and all functions are continuous in the parameters.

Standard error for $\hat\theta$
The maximum of (5) also maximizes $L^c(\theta, \theta_c)$ wrt the first argument. It even maximizes the likelihood wrt $\theta$, as we assume the distribution of $Z$ not to depend on $\theta$. Neither the censoring nuisance parameter $\theta_c$ nor the parameters of the covariate $Z$ are of concern for the point estimate. The asymptotic standard error of an MLE classically builds upon the unconditional expectation of the squared gradient of the logarithmic density for one observation, namely the Fisher information matrix. Such an expectation wrt the joint distribution of $((\tilde Y_i, \Delta_i), Z_i)$ will add $\theta_c$ and the covariate parameters into the expression. Roughly speaking, we can partition the Fisher information matrix into blocks, where the upper left block is for $\theta$, then, on the block-diagonal, a block for $\theta_c$ follows, and the lower right block is for the covariate parameters. The arguments for point estimation also allow the conclusion that the off-diagonal block matrices will all be zero. Classically, the standard errors for the parameters are deduced from the inverse of the Fisher information. Due to the Schur complement (see e.g. Section A.2.2.d in Gouriéroux and Monfort 1995b), only the upper-left block must be inverted to achieve a standard error for $\theta$. Standard errors can be estimated with the observed Fisher information.
Let us now be more specific and denote by $E_0$ the expectation wrt $(\tilde Y, \Delta)$, conditional on $Z$, and by $E_Z$ the expectation wrt the marginal distribution of $Z$. Define, with finally unconditional expectations, the matrix $I$ as the variance of the score and $J$ as the limit of the scaled Hessian.

Theorem 1 For the maximizing argument $\hat\theta$ of (5), it holds for a compact subspace for $\theta_0$ in $(\mathbb{R}^+)^2 \times \mathbb{R}^6$ that $\sqrt{n}(\hat\theta - \theta_0) \overset{d}{\to} N(0, J^{-1} I J^{-1})$.

Proof Denote by $U_k(\theta)$, $k = 1, \ldots, 8$, the coordinates of the score vector and perform a multivariate quadratic Taylor expansion thereof, evaluated at the maximizing argument and expanded around the true parameter value, for each of the eight coordinates. Here $\nabla^2_\theta U_k(\theta^*)$ denotes the Hessian matrix and $\theta^*$ is a point on the line between $\hat\theta$ and $\theta_0$. The last summand is asymptotically negligible by Slutsky's Lemma, for three reasons: (i) Its second factor, the Hessian (divided by $n$), can be shown to be bounded at $\theta_0$ by the usual arguments of continuous functions on compact support, and because $\theta^*$ will converge to $\theta_0$, as $\hat\theta$ is consistent. (ii) Its last factor converges to zero, as $\hat\theta$ is consistent. (iii) Its first factor (including $\sqrt{n}$) converges weakly. The entire last summand is dropped in the following analysis. In a more precise version of the proof, one applies Theorem 10.1 of Billingsley (1961).
The first summand $U_k(\theta_0)/\sqrt{n}$ is now a sum of iid random variables and will be asymptotically normal with mean zero and variance-covariance matrix $I$, due to the CLT, by the usual arguments. Subtract the second summand and multiply the equation by the inverse of $\nabla^2_\theta U_k(\theta^*)/n$. It becomes a first factor on the right side and can be replaced with its deterministic matrix limit, namely $J^{-1}$, by the LLN. (We refrain from verifying a sufficient assumption, such as existence of moments or differentiability of the characteristic function of the score of $[(\tilde Y_i, \Delta_i), Z_i]$.) We will also not verify that the model is identified.
For a conditional likelihood $L^c(\theta, \theta_c)$, the 'Conditional Information Matrix Equality (CIME)' follows, and for our criterion function, i.e. the logarithm of (5), a similar equation holds for $\theta$.

Lemma 3 I = J .
The proof uses elementary analysis and is given in "Appendix B". As a consequence, $J^{-1} I J^{-1} = J^{-1}$. By Theorem 1, the asymptotic standard errors can now be derived as square roots of the diagonal elements of $J^{-1}$ and consistently estimated from $\hat J$ in (9). There are only minor numerical considerations when estimating the standard errors for the eight-dimensional model of Definition 1 with a Newton algorithm. The resulting standard errors are given in the rows 'Definition 1' of Table 3 (in brackets below the point estimates).
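The estimation of standard errors from the observed information can be sketched generically: a finite-difference Hessian of the negative log criterion at the maximum, whose inverse, by Lemma 3, already equals the sandwich $J^{-1} I J^{-1}$, so no sandwich estimation is needed. The toy check below uses an uncensored exponential sample, where $SE(\hat a) = \hat a/\sqrt{n}$ is known in closed form (all names are illustrative):

```python
import numpy as np

def observed_info(nll, theta_hat, step=1e-5):
    """Central finite-difference Hessian of the negative log criterion at
    the maximizing argument; its inverse estimates the covariance J^{-1}."""
    k = len(theta_hat)
    H = np.empty((k, k))
    for i in range(k):
        for j in range(k):
            e_i, e_j = np.eye(k)[i] * step, np.eye(k)[j] * step
            H[i, j] = (nll(theta_hat + e_i + e_j) - nll(theta_hat + e_i - e_j)
                       - nll(theta_hat - e_i + e_j)
                       + nll(theta_hat - e_i - e_j)) / (4 * step * step)
    return H

# toy check with an uncensored exponential sample: SE(a_hat) = a_hat/sqrt(n)
rng = np.random.default_rng(0)
y = rng.exponential(1 / 0.5, size=4000)
nll = lambda th: -(len(y) * np.log(th[0]) - th[0] * y.sum())
a_hat = len(y) / y.sum()                       # closed-form MLE
se = np.sqrt(np.linalg.inv(observed_info(nll, np.array([a_hat])))[0, 0])
```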

For the statistical population
Let us here study the implications of left-censoring for the estimates from the HCD, especially for the confidence intervals. We now study the implication of deselection for the inference from the sample (3rd-stage selection) onto the statistical population (2nd-stage selection) (see Fig. 1). The rows 'Definition 1' in Table 3 summarize the inference drawn from the sample for $\theta$ in the statistical population of Germans born between 1900 and 1954 and insured by the AOK in 2004. The results enable an assessment of excluding or including left-censored observations (see third to fifth boxes in Fig. 1). As a first general finding, by including left-censoring, confidence intervals become narrower for almost all parameters, as is to be expected, with the exception of $a$. Overall, given the interpretation of the standard error as half of the confidence interval's half-width by Theorem 1, no small-sample argument needs to be taken into consideration, and almost all effects are statistically significant in the sequel.
Before we compare point estimates and standard errors for the model of Definition 1, we fit two smaller preliminary models for dementia incidence to the data. Both models neglect the gender effect, i.e. generally set β 3 = 0. In one model, we neglect the age effect, i e. set β 1 = 0, and in the other model the cohort effect is neglected, i.e. we set β 2 = 0. Of course we are convinced that both effects exist, but still want to build a model by forward selection of covariates.
The preliminary model with only a cohort effect has a likelihood (and (5)) with one factor for each cohort. Such a stratified analysis simply fits each cohort to a separate (one-dimensional) Exponential distribution. With or without left-censoring, the right-censored data sets do not pose any numerical obstacles. The point estimates for excluded (top) and included (bottom) left-censored persons (Table 3, first rows) generally suggest a decrease in the dementia hazard over the cohorts. The effect of including left-censoring is that the hazard rate is increased for all cohorts. Of course this is expected, because excluding right-censored observations is known to overestimate the hazard, and hence excluding left-censored observations should underestimate the hazard. The increase in hazard over the cohorts is not constant, and we will observe and soon explain this phenomenon in the model of Definition 1. For the twentieth century's first decade, the hazard itself is exp(1.138) ≈ 3 (with left-censored observations), i.e. it is three times higher than in the 1930s. In the most recent cohort of the 1950s, the hazard is exp(−0.977) ≈ 0.4, i.e. only 40% of the risk that prevails in the 1930s. We will soon see that this remarkable range can be explained (in part) with the model of Definition 1.

[Table 3: Estimates for the model of Definition 1 subject to right-censoring (top) and double-censoring (bottom); columns: a, age-effect, cohort effects 1900-1909, 1910-1919, 1920-1929, 1940-1949, 1950-1959, sex-effect; 95% confidence intervals based on the normality of Theorem 1 with standard errors from the inverse of (9)]
For the preliminary model with only an age effect, we need to maximize (5) in two parameters, a slightly larger numerical effort, because a visual inspection of (5) no longer suffices. Estimates are given in the second rows of Table 3, and the effect of $\beta_1 \approx 0.14$ (again with left-censored observations) means that with each additional year (starting at age 50), the dementia hazard (which is approximately the probability of acquiring dementia within one year) is multiplied by exp(0.137) ≈ 1.15, i.e. increases by 15%. Again, excluding left-censoring decreases the parameter $\hat a$, here by a remarkable 50%.
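The multiplicative readings quoted in this and the preceding paragraph are simple exponentials of the point estimates; for the record:

```python
import math

# multiplicative effects implied by the quoted point estimates (Table 3)
hr_1900s = math.exp(1.138)    # 1900-1909 cohort vs. the 1930s (stratified model)
hr_1950s = math.exp(-0.977)   # 1950-1954 cohort vs. the 1930s (stratified model)
per_year = math.exp(0.137)    # hazard multiplier per additional year of age
```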
The results of the Cox-type model in the third rows are similar to the next rows of Definition 1's results, only with slight increases in the standard errors, due to accounting for the increased uncertainty by the nonparametric baseline. This indicates robustness, or admissibility of the Gompertz assumption in Definition 1. The point is further verified by plotting the Breslow estimate of the cumulative nonparametric baseline hazard in the Cox-type model; one finds that (for different resolutions) the (double-)exponential increase in age fits.
Let us now come to the model of Definition 1 with estimates given in the fourth rows of Table 3. We discuss the cohort effect $\hat\beta_2$, the age effect $\hat\beta_1$ and the sex effect $\hat\beta_3$.
Let us start with the cohort effect and with a finding known as Simpson's paradox. The 'slope' of the cohort effect is reverted in comparison with the preliminary model with only the cohort effect. The sign of a linear approximation through $\beta_{21}, \ldots, \beta_{25}$ was preliminarily positive and is now negative. As an example, for a person born between 1900 and 1910, the dementia hazard is estimated as exp(−2.095) ≈ 0.12, i.e. only 12% of that of those born in the 1930s (with left-censored observations). By contrast, for a person from the 1950s, the hazard is exp(1.342) ≈ 4 and hence almost four times as high. As an interpretation: apparently, estimating the preliminary model with cohort effects only, in addition to a constant age-independent hazard which identifies the 1930s cohort, attributes the entire age effect of dementia incidence to the cohorts. Obviously, in the more recent cohorts we exclusively observe lower ages, and we wrongly attribute the age effect to those cohorts. By doing so, we greatly overestimate the decline in dementia incidence over the cohorts.
A consequence of including left-censored persons, already remarked upon in the preliminary model with only the cohort effect, is that the slope in the cohort effect declines (in absolute terms). The reason why the hazard-increasing inclusion of left-censoring is not constant across cohorts is as follows. As can be seen from the two rows in Table 3, the percentages of left-censored persons differ between the birth cohorts (i.e. within $\tilde z$). In the older, earlier cohorts, the portion is larger than in the younger, more recent cohorts. As a result, in the older cohorts, the hazard estimate increases more than in the younger cohorts, and the slope of such a cohort effect in $\beta_2$ becomes smaller. (The overall hazard increase is absorbed in the parameters $a$ and $\beta_1$.) Let us study the impact of left-censored persons in the model of Definition 1 on the age effect. First note that by including left-censored persons, $\hat a$, i.e. the hazard for a 50-year-old male person born in the 1930s, rises, as is to be expected. More interesting than the increase of $\hat a$ from 0.058 to 0.190 is the dip of $\hat\beta_1$ from 0.226 to 0.183. That means the increase in hazard is stronger for younger people compared to older ones. The impact becomes evident with the five-year incidence rates calculated from the conditional CDF (1).
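Under the Gompertz hazard $a\,e^{\beta_1 t}$ of Definition 1, the five-year incidence rate from the conditional CDF (1) is $P(Y \le t+5 \mid Y > t, z) = 1 - \exp(-(H(t+5)-H(t)))$ with the cumulative hazard $H(t) = (a/\beta_1)(e^{\beta_1 t} - 1)$. A sketch with illustrative parameter values, cohort and sex factors set to the reference:

```python
import numpy as np

def five_year_rate(t, a, b1):
    """P(Y <= t+5 | Y > t) under the Gompertz hazard a * exp(b1*t):
    1 - exp(-(H(t+5) - H(t))) with H(t) = (a/b1) * (exp(b1*t) - 1)."""
    H = lambda s: (a / b1) * (np.exp(b1 * s) - 1.0)
    return 1.0 - np.exp(-(H(t + 5.0) - H(t)))
```

For example, with the illustrative values $a = 0.01$ and $\beta_1 = 0.1$, the rate at $t = 10$ (age 60) is about 0.16, and it grows with age.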
The statistical reason we advocated for the cohort effect, namely that more persons in earlier cohorts are left-censored, so that including them increases the estimate more than for later cohorts, is not applicable here. To the contrary, the fact that more persons are left-censored in earlier cohorts now means that fewer are left-censored at lower ages. We cannot comment on the effect without a degree of speculation. Assume that (i) for an older person, say born before 1930, belonging to the data, i.e. being alive in 2004, is an indication of a health privilege. (No such indication applies to the younger persons in the data, born after 1930.) Hence, the proportion of the health-privileged among the elderly in the data must be disproportionately large. Assume further that (ii) a left-censored person has had dementia incidence early in life, and therefore cannot be regarded as health-privileged. As a consequence of both assumptions, including the left-censored persons will reduce the disproportion of the privileged, and the age effect must dip. Inference regarding this topic would require including further covariates in the model, observed or unobserved.
The consequence of including left-censored persons for the sex effect is an increase. The negative sex effect, favourable to women, becomes smaller (in absolute terms), ultimately being almost irrelevant.

Inference to demographic population
A next step is to consider the statistical population as a random sample of the demographic population (see Fig. 1). We refrain here from commenting on the effect of the 2nd stage selection of persons into the specific health insurance. However, we do consider the 1st stage selection of left-truncating those not surviving 2003. We will see that, for the morbidity analysis, left-censoring is a competing concept to left-truncation. For studying measurements such as mortality, which cause absorption, the case is different. Including measurements that are not absorbed, but contain only left-censored information, must be better than ignoring them combined with a general adjustment for left-truncation. In order to discriminate precisely between truncation due to early dementia and truncation due to prior death, we must include another state in the healthy-ill model and advance to a multi-state model, namely a healthy-ill-dead model (see Fig. 6 (any panel)). In a Markovian model, left-censoring with respect to the illness can be combined with left-truncation caused by death.
An analytic treatment of the multi-state model, however without left-censoring and left-truncation, is found e.g. in Weißbach and Walter (2010); Kim et al. (2012). For an introduction to truncation see Weißbach et al. (2013); Frank et al. (2019); Dörre and Emura (2019). Here we simulate (i) homogeneous Markov processes, i.e. without time effect (see Fig. 6 (top)), and (ii) an inhomogeneous Markov process, i.e. with a time effect, which will be a cohort effect (simplified here to two intervals, z = 0 and z = 1) (see Fig. 6 (bottom)). A homogeneous Markov process can easily be simulated by its construction from exponential waiting times and target states with a multinomial distribution (see e.g. Albert 1962). An inhomogeneous process is simulated just as easily.
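As a sketch of that construction (state names and the death intensities are illustrative, not the paper's exact set-up; only a = 1/35 is taken from the text):

```python
import random

def simulate_markov(rates, start="healthy", horizon=50.0, rng=random):
    """Simulate one path of a homogeneous Markov jump process:
    exponential waiting times, multinomially chosen target states."""
    t, state, path = 0.0, start, [(0.0, start)]
    while state in rates and rates[state]:
        total = sum(rates[state].values())           # total exit intensity
        t += rng.expovariate(total)                  # exponential waiting time
        if t > horizon:
            break
        targets, weights = zip(*rates[state].items())
        state = rng.choices(targets, weights=weights)[0]  # multinomial target
        path.append((t, state))
    return path

# Healthy-ill-dead model; illness intensity a = 1/35 as in the text,
# death intensities are hypothetical placeholders
rates = {"healthy": {"ill": 1 / 35, "dead": 0.02}, "ill": {"dead": 0.05}}
path = simulate_markov(rates)
```

For an inhomogeneous process with a single change point, one would simply switch the `rates` dictionary once the simulated clock passes the change point.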
We assume births, i.e. the 50th birthday, to be uniformly distributed on [1950, 2004]. The time-homogeneous, i.e. age- and cohort-constant, transition intensity from healthy to ill is a = 1/35. All simulation results are based on 10,000 repetitions of the same set-up. For all models, the transition intensity from healthy to ill is the hazard rate and is estimated with (3). (Fitting all parameters to the data would be based on the partial likelihood of the multiple Markov process (see Andersen et al. 1993, equation 2.7.4'), however including left-censoring is not straightforward.) The bias of any estimator γ̂ is calculated as Bias(γ̂) := γ̂ − γ and the relative bias as rBias(γ̂) := 100 · Bias(γ̂)/γ. Recall from Fig. 1 that N denotes the size of the demographic population, before 1st stage selection, i.e. left-truncation, and n the (random) sample size (after truncation and possibly after removing left-censored persons). The number of left-truncated persons is denoted by n_LT and that of left-censored persons by n_LC.
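In code, with γ̂ written as `gamma_hat` (a direct transcription of the two definitions above):

```python
def bias(gamma_hat, gamma):
    """Bias(gamma_hat) := gamma_hat - gamma."""
    return gamma_hat - gamma

def rbias(gamma_hat, gamma):
    """Relative bias in percent: 100 * Bias(gamma_hat) / gamma."""
    return 100.0 * bias(gamma_hat, gamma) / gamma
```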
First, ignoring left-truncation does not necessarily bias the illness incidence estimation, if left-censoring is taken into account (see Fig. 6 (top, left)). Table 4 shows the negative bias when ignoring both, and a negligible bias when accounting for left-censoring. Second, the fact that the statistical population is not a simple sample, i.e. one with equal selection probabilities, becomes visible when assuming cohort effects. Persons of an earlier cohort have a smaller probability of surviving 2003 than those of later cohorts. Hence it is questionable whether the cohort effect β_2 of Definition 1 is estimated correctly. In order to account for a cohort effect (see Fig. 6 (bottom)), define for cohort 1950-1977 z = 0 with hazard a(t|0) ≡ 1/35, i.e. with a = 1/35, and for cohort 1978-2003 z = 1 with hazard a(t|1) = (1/35) exp(β) and β = 0.7. Truncation due to death does introduce a bias into the estimate of the cohort effect (see Table 5 (top)). However, the cohort effect is apparently estimated consistently, at least for age-constant intensities, if left-censoring is accounted for (see Table 5 (bottom)).
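A minimal simulation sketch of the first point, assuming an exponential onset hazard a = 1/35 on the age scale beyond 50, 50th birthdays uniform on [1950, 2004], entry into observation in 2004 with 8 years of follow-up, and no mortality (all simplifications of the paper's set-up). A person already ill at entry contributes log(1 − e^{−a s}) to the log likelihood; the score equation is solved by bisection:

```python
import math, random

def simulate(n, a=1 / 35, rng=random):
    """Return durations since age 50 for events, right- and left-censored."""
    ev, rc, lc = [], [], []
    for _ in range(n):
        s = 2004.0 - rng.uniform(1950.0, 2004.0)  # age beyond 50 at study start
        t = rng.expovariate(a)                    # onset age beyond 50
        if t < s:
            lc.append(s)                          # already ill in 2004
        elif t <= s + 8.0:
            ev.append(t)                          # observed incidence
        else:
            rc.append(s + 8.0)                    # still healthy at study end
    return ev, rc, lc

def score(a, ev, rc, lc):
    """Derivative of the log likelihood including left-censored terms."""
    s = sum(1.0 / a - t for t in ev) - sum(rc)
    s += sum(x * math.exp(-a * x) / (1.0 - math.exp(-a * x)) for x in lc)
    return s

def mle(ev, rc, lc, lo=1e-6, hi=1.0):
    for _ in range(80):                           # bisection on the score
        mid = 0.5 * (lo + hi)
        if score(mid, ev, rc, lc) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

random.seed(42)
ev, rc, lc = simulate(30000)
a_hat = mle(ev, rc, lc)   # should be close to 1/35 ~ 0.0286
```

The log likelihood is concave in a, so bisection on the score is sufficient; in this set-up, keeping the left-censored contributions recovers the true hazard without any adjustment for delayed entry.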
Third, one should not have the impression that left-censoring can replace left-truncation in any situation. If the death intensity c for the transition from healthy to dead differs from the death intensity b for the transition from ill to dead (see Fig. 6 (top, right)), a (small) bias is introduced by ignoring the left-truncation phenomenon, even if one accounts for left-censoring (see Table 6).
With respect to standard errors for the inference to the demographic population, note that E_Z of Theorem 1 will still have a proper meaning; however, (9) will not necessarily estimate the asymptotic variance. One reason is that the covariate Z is not random but deterministic from the demographic population up to the 1st stage selection (see Fig. 1), so that another theory must be applied (see e.g. Bradley and Gart 1962; Weißbach and Radloff 2020).

Conclusion
The study reveals that even when including left-censored observations in a survival analysis, the asymptotic analysis of the model may use elementary means. However, bypassing lengthy calculations, as in Kremer et al. (2014) for left-censored observations only, is no longer possible. Only with right-censoring would the model be a member of the exponential family (and also a generalized linear model). Whether double-censoring can be analysed more easily in a counting process framework was not investigated, and neither were the additional conditions to (6) for consistency. Also, together with the major assumption of estimator consistency, there are further assumptions, such as the existence of J⁻¹, that need investigation in order to conclude asymptotic normality of the estimator (see e.g. Theorem 5.21 in van der Vaart 1998).
An application to issues of human morbidity in follow-up studies is appealing, because a disease typically does not 'absorb' the statistical unit. And due to the longevity of humans, many follow-up studies cannot start before the first disease incidence. In particular here, left-censoring accounts for the general weakness of the HCD that one does not follow each cohort from a given age, but rather over a given period.
From a broader perspective, estimating the duration distribution conditional only on survival may be unsatisfactory. However, note that the probability of an incidence exceeding age t, given that the lifetime surpassed age s, can be multiplied by the latter probability, so as to result in the joint distribution. In order to obtain the marginal distribution of the disease incidence, only the second argument needs to be integrated out. Hence, because the mortality distribution will typically be known quite accurately from other data sources, such as a register of deaths, knowing the conditional distribution is already a major achievement.
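Written out, with T_I denoting the incidence age and T_M the lifetime (symbols chosen here for illustration, not taken from the paper's notation), the argument is:

```latex
% joint survival function from the conditional one
P(T_I > t,\; T_M > s) \;=\; P(T_I > t \mid T_M > s)\cdot P(T_M > s),
% and the marginal distribution of the incidence follows by
% integrating the second argument out of the joint density
f_{T_I}(t) \;=\; \int_0^\infty f_{T_I,\,T_M}(t,s)\,\mathrm{d}s .
```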
An unconditional estimate, solely from the HCD, will first have to reconsider the assumption in Definition 1 that the dummy variable coding the cohort, e.g. Z_4 for the 1940s (see again Table 1), is endogenous for the demographic population. Its parameter is the probability of selecting a person born during that decade in the data process of Fig. 1. This probability depends on the probability of surviving 2003. If the probability of surviving depends on having dementia (which is widely accepted), the probability of dementia incidence (and hence the model of Definition 1 and its parameter θ) will be influential.
Proof Assume the converse. The function lr_{a_0,y}(a) has its maximum with respect to a at 1/y, so negativity is equivalent to (A.1). The function q_{a_0}(·) is convex, as its second derivative is 1/y² > 0. (Note that, being in survival analysis, we may refrain from allowing y = 0.) The first derivative of q_{a_0}(·) is zero for y = 1/a_0, this being the minimum. The value at the minimum of q_{a_0}(·) can be computed directly; it is ≥ 1 for any a_0, and hence there exists no a_0 fulfilling (A.1), which is a contradiction.
Before we continue the proof of Lemma 1, we note some useful properties: inserting ε, 1/y or 1/ε as a into log likelihood ratio (7) bounds it in absolute terms. Of course, the relevant candidate can vary for different y. We denote the potential bounds by g^abs_ε(y), g^abs_{1/y}(y) and g^abs_{1/ε}(y) and give the definitions in the following table. It also contains simplified versions thereof, which will prove useful shortly.
Using (A.2), one sees that the function g_ε(·) is linearly increasing, with a negative y-axis offset, and the function g_{1/ε}(·) is linearly decreasing, with a positive y-axis offset.
Both g^abs_ε(·) and g^abs_{1/ε}(·) are positive and linear functions, with slopes a_0 − ε and 1/ε − a_0, from some point y onwards. (A linear function in the integration of (6) will result in an expectation, and hence be finite.) Unfortunately this is not true for g^abs_{1/y}(·), but the following will simplify the integration. The function g_{1/y}(·) is convex, because d²/dy² g_{1/y}(y) = 1/y² > 0. As d/dy g_{1/y}(y) = a_0 − 1/y, it is minimal at y = 1/a_0. Because g_{1/y}(1/a_0) = 0, the function is non-negative and hence equals g^abs_{1/y}(·). Using this and max(x, y) ≤ x + y (for positive x and y), we define h(y) := max{g^abs_ε(y), g^abs_{1/ε}(y)} + g^abs_{1/y}(y). The two bounding function candidates at zero, g^abs_ε(0) and g^abs_{1/ε}(0), are below the positive −log(ε/a_0) and log(1/(ε a_0)), respectively (see (A.2)). Both candidate functions g^abs_ε(·) and g^abs_{1/ε}(·) can now be bounded from above by the linear function max{−log(ε/a_0), log(1/(ε a_0))} + max{a_0 − ε, 1/ε − a_0} y. Hence the first integral on the right in (A.4) is smaller than a finite bound. For the first equality, see Formula 4.1.49 in Abramowitz and Stegun (1970); the last but one equality follows from de l'Hôpital's rule. This ends the proof of Lemma 1. For the proof of Lemma 2, note first that the search for maxima of |L(y, δ; a)| with respect to a for a fixed (y, δ) means that we need maxima of |L(y, 0; a)| and |L(y, −1; a)|, both for a fixed y. The maxima of the first are known from the proof of Lemma 1 to lie in ε, 1/y and 1/ε. Now look at the necessary condition. In the first summand, the second factor is proportional to the density of (Y, Δ) conditional on Δ. It only needs to be divided by one minus the probability of censoring, P(Y > L), which is by assumption smaller than one. The function h_L(y, 0) can now be chosen as the h(y) of Lemma 1, and hence the summand is proportional to E(Y) = 1/a_0 < ∞. The limit of the term involving (1 − e^{−a_0 y})⁻¹ is zero, as can be seen by applying de l'Hôpital's rule twice. Note that h̃_L(y) > 0 and that it vanishes at both edges of its support ℝ_+ = (0, ∞).
It is continuous on [0, ∞], as a product of continuous functions on ℝ_+, where the first factor is continuous as a composition of continuous functions. By the mean value theorem there must be a y with h̃'_L(y) = 0, this being a maximum. Even if we do not show uniqueness here, being a continuous function on the compact [ε_y, 1/ε_y] for sufficiently small ε_y, it must attain its maximum. Without poles on [0, ∞], this must be a finite M. The density f_L(y) now bounds (A.6) with M, which ends the proof of Lemma 2.

Appendix B: Proof for Sect. 4.2
For the log likelihood it is well known that the expected squared first derivative and the expected negative second derivative are equal (the information matrix equality). Proof Denote by E_{θ̃,θ}(·|Z_i = z_i) the conditional expectation with respect to density (3) (again with right-censoring), which is similar to E_0, however with generic (θ̃, θ). Due to the chain rule for differentiation, applied to the logarithm in L_i(θ̃, θ), a corresponding identity holds. Integrating with respect to y_i and summing over δ_i results in zero on the left-hand side, because after interchanging the differentiation ∇_θ with integration and summation, the quantity to be differentiated is one, due to the density property. On the right-hand side, recall the definition of s_i(θ). Now set θ_0 for θ̃.
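The equality can be checked by Monte Carlo in the simplest special case, an exponential hazard a_0 with right-censoring at a fixed point c (a toy sketch with arbitrary values; the paper's model (3) with covariates and left-censoring is not reproduced here):

```python
import math, random

# True rate a0 and fixed right-censoring point c (toy choices)
random.seed(7)
a0, c, n = 0.5, 2.0, 200000

score_sq = neg_hess = 0.0
for _ in range(n):
    t = random.expovariate(a0)
    if t <= c:                        # event: contribution log(a) - a*t
        s = 1.0 / a0 - t              # first derivative at a0
        h = -1.0 / a0 ** 2            # second derivative at a0
    else:                             # censored at c: contribution -a*c
        s = -c
        h = 0.0
    score_sq += s * s
    neg_hess -= h

score_sq /= n                         # Monte Carlo estimate of E[s^2]
neg_hess /= n                         # Monte Carlo estimate of -E[H]
theory = (1.0 - math.exp(-a0 * c)) / a0 ** 2   # = P(T <= c) / a0^2
```

Both averages estimate the same quantity P(T ≤ c)/a_0², illustrating that, with right-censoring only, the squared score and the negative Hessian have equal expectations.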
As L_i(θ̃, θ) is twice differentiable with respect to θ, let the Hessian for observation i be denoted by H_i(θ), where ∇_θ leads to a row vector (for each coordinate of the column vector whose derivative is taken). It is obviously J = −E_Z E_0[H_i(θ_0)], and we are now ready to show −E_Z E_0[H_i(θ_0)] = I. Because of the product rule for differentiation, generalized to vector coordinates, it is

(B.3)
Again applying ∫_{ℝ_+} Σ_{δ∈{−1,0,1}} to the left-hand side and interchanging with ∇_θ results in a zero being differentiated (due to the arguments after (B.1)). When integrating the right-hand side, replace in the second summand ∇_θ f_{Y,Δ|Z}(y_i, δ_i|z_i) with the transpose of (B.2), so that the equality follows. Inserting θ_0, we have the conditional information matrix equality because, again, the derivatives of L_i and of the density are equal. Then we take E_Z on both sides and use the iterated expectation theorem. This ends the proof of Lemma 3.