Left-Truncated Health Insurance Claims Data: Theoretical Review and Empirical Application

At the beginning of 2004, we drew a sample of 0.25 million people from the inventory of the health insurer AOK and followed their health claims until 2013. Our aim is to quantify the effect of a stroke on the dementia onset probability for Germans born in the first half of the 20$^{th}$ century. People deceased before 2004 are randomly left-truncated. Filtrations, modelling the missing data, enable us to circumvent the unknown number of truncated persons by using a conditional instead of the full likelihood. Dementia onset after 2013 is a conditionally fixed right-censoring event. For each observed health history, Jacod's formula yields the conditional likelihood contribution. Asymptotic normality of the estimated intensities is derived, relative to a sample size definition that includes the truncated people. The standard error is nonetheless observable. The claims data reveal that after a stroke, with time measured in years, the intensity of dementia onset increases from 0.02 to 0.07. Using the independence of the two estimated intensities, a 95\%-confidence interval for their difference is [0.050, 0.056]. The effect halves when we extend the analysis to an age-inhomogeneous model, but does not change further when we additionally adjust for multi-morbidity.


Introduction
For Germany, Doblhammer et al. (2018) forecast an increase to 2.8 million people with dementia by 2050. One risk factor is a stroke, and we model life as a time-continuous multi-state history (see Figure 1). The model is also called the 'disability model' (see Hougaard, 2001, Figure 1.6). Note that 'healthy' is only shorthand for 'neither having dementia nor having had a stroke'. Dementia onset of a person after a stroke (or more precisely, after the first stroke) is, ceteris paribus, governed by the onset intensity λ_{S1D}. We want to compare this intensity to the dementia onset intensity of a healthy person, λ_{HD}. By the stroke effect on dementia onset we refer to the difference (or the ratio) of these two intensities.
The population we will infer to are Germans born in the years 1900 to 1954, which we for simplicity occasionally call 'the first half of the 20th century' (for disease state models see e.g. Putter et al., 2006). Ignoring truncation would lead to the 'immortal time bias' (see e.g. Hernán et al., 2016; Yadav and Lewis, 2021).
Efficient estimation in large samples is usually achieved by the maximum likelihood method. In order to render knowledge about the truncated persons obsolete, marginalisation and conditioning are necessary here. We intend both a review of the methodological arguments and an application to a case study about the effect of stroke on dementia onset in Germany. Right-censoring will also be accounted for (see Figure 2, bottom box), but already has a large literature. Explaining how the technique reduces information is easier when the state 'being alive' is not subdivided into several disease states, and is therefore done in a lifetime state model (in the terminology of Hougaard, 2001) in Appendix A. Then, with all states of Figure 1, Section 2 derives a confidence interval for the difference λ_{S1D} − λ_{HD} and a Wald-type test for the stroke effect. In Section 3, we allow the intensities to depend on age, for two reasons. On the one hand, this allows us to compare our case study with renowned international studies, where age-inhomogeneous behaviour is routinely accounted for. On the other hand, we will see that accounting for age changes the stroke effect drastically, an instructive example of Simpson's paradox. In Section 3 we additionally adjust for (vascular) multi-morbidity, in order to answer the question whether the elevated dementia risk after a stroke could have been anticipated for multi-morbid persons. Appendix A, Section 2 and Section 3.1 start with the asymptotic theory; Appendix A and Section 2 continue with a Monte Carlo study to verify that the asymptotic approximation by the normal distribution is adequate. All sections end with fitting the model of the section to the case study data.

Data: AOK HCD
Even though our aim is to sample from the German population of 2004, we are restricted to sampling from the 25 million members of Germany's largest public health insurance company, the 'Allgemeine Ortskrankenkasse' (AOK). This represents one-third of the population, presumably with, on average, slightly elevated disease rates compared with other statutory health insurance funds or private health insurance funds (see Schnee, 2008). As we are only interested in the difference between two intensities of disease onset, with and without preceding stroke, we refrain from studying selection bias in that respect. The AOK's health claims data (HCD) include information about age, year of birth and date of exit (death or switch to another insurance company). From the insurance inventory on 01/01/2004, a simple random sample of 250,000 people is drawn. We follow the health histories of these persons until the end of 2013. We exclude 4121 persons with implausible information over time on sex, birth year or region of living. In that form, the data are sufficient for the lifetime state model in Appendix A. The AOK HCD also contain information about outpatient and inpatient diagnoses for each insured person with at least one day of insurance coverage, regardless of whether or not they sought medical treatment. Recall that we mainly study here the design effect that, by sampling from the above population in 2004, people who died earlier cannot be selected, i.e. are left-truncated. There is a second design effect which we, however, will not discuss in depth. In 2004, we only sample persons at ages 50 years or older. That age is typically the earliest at which a stroke or dementia occurs. Of course, age 50 is not the earliest age at which a person may die, and hence a person who dies before age 50 is also left-truncated. We refrain from considering this second truncation reason, i.e. assume death before age 50 to be impossible, for the sake of model simplicity, and trade off two effects.
On the one hand, only ≈ 6.5% of people die before age 50 in western civilisations.¹ Hence assuming that rate to be zero will not distort the results by much. On the other hand, our data donor AOK, or more precisely its scientific research institute (WIdO), allowed sampling 250,000 people; without restricting to those over 50 years old, ≈ 50% would have been younger² and would mostly stay healthy over the 10 years of observation. Doing so would hence have increased the standard errors. Persons with dementia diagnoses already at study begin (technically in that quarter or the next) may not represent a dementia onset, but can be prevalent cases. We exclude those, and n = 236,039 persons remain observable. For those, the mean follow-up time is 7.3 years, resulting in 1.7 million person-years at risk. Some more descriptive cross-sectional statistics as of 2004, not necessarily needed in the models, are given in Table 1.

Literature review
Similar ideas of testing the effect of stroke on dementia are found in Desmond et al. (2002), Ivan et al. (2002), Reitz et al. (2008), Savva and Blossom (2010), Kuźma et al. (2018), Kim and Lee (2018) and Hbid et al. (2020). Let us compare our contribution broadly to the adjacent literature, distinguishing substantial and methodological similarities. Substantially, Vieira et al. (2013) report dementia incidences, as do Leys et al. (2005) after stroke. Death incidences, after stroke (van den Bussche et al., 2010) and with dementia (Garcia-Ptaceka et al., 2014), are of use for us because they constitute elements for one of our models and will enter the calibration of simulations. Dementia prevalence is studied in Doblhammer et al. (2018) and risk factors are presented in Mangialasche et al. (2012).
Community-based studies on the effect of stroke on dementia are Ivan et al. (2002) and Reitz et al. (2008). Cerebrovascular processes are studied in more detail by Hu and Chen (2017). Common statistical risk factors for dementia and stroke are studied in Pendlebury and Rothwell (2009). With respect to the method, our work has considerable similarity with the study 'Mortality of Diabetics in the County of Fyn' in Andersen et al. (1993), and Andersen et al. (1988, Section 4) in particular.
However, our truncation model is slightly simpler, our simple random sample of HCD is considerably larger than the data there, and we reduce the arguments to those necessary for our model. We also make considerable use of Fleming and Harrington (1991).

Contribution of an observed person to inference
Roughly speaking, we aim at maximum likelihood inference. In the case of a simple random sample, each randomly drawn person contributes its density to the likelihood, and the estimation criterion is the maximisation of the joint density as a function of the parameter, i.e. of the likelihood. We will see that people who are not observed do not contribute to our criterion function, and we now derive the contribution of each observed history. We first collect all possible state transitions in the index set I := {HS1, S1D, HD, Hd, S1d, Dd} (see Figure 1).
Furthermore, universally for all persons, we do not follow a health history further than τ years. The continuous-time history X = {X_t, t ∈ [0, τ]}, observed in full, in parts or not at all, is defined on the probability space (Ω, F, P_λ) and represents either the population or one random draw from it. We assume throughout the Markov property, so that the history is determined by the transition intensities λ_{hj}(t) := lim_{s↓0} P_λ(X_{t+s} = j | X_t = h)/s. In this section, we model the population of Germany (at that time) as age-homogeneous, i.e. assume λ_{hj}(t) ≡ λ_{hj}.
(The realistically age-inhomogeneous intensities follow in Section 3.1.) By the parameter we mean the vector λ := (λ_{hj}, hj ∈ I)'. We consider a simple random sample of n_all persons drawn from the population (see Figure 2). The generalisation, as compared to Appendix A, is theoretically less severe when we assume that all persons start in the same state, X_0 = H, at the age origin. Practically, among those not truncated by death, denoted as n in Appendix A.2, X_0 ∈ {S1, D} is known to be rare; still, exclusion is impossible because, e.g. for an observed person with X_u = D, X_0 is unknown. An option is to condition on the distribution of X_0, which leaves the criterion as a function of λ unchanged if the distribution of X_0 does not depend on λ. Sloppily, with L denoting the distribution, this is because we can decompose L_λ(X) = L_λ(X|X_0)L(X_0). (In the case of L_λ(X_0), efficiency is lost.) We pursue this option for the AOK HCD. An additional benefit is notational simplicity, because we again observe the same n < n_all persons (as in Appendix A), with histories that occur, completely or in part, during our observation period between 2003 and 2014 (see Figure 2).
For each observed history, the 'age-at-study-begin', U, is the time between the calendar dates of the 50th birthday and the study begin on 01/01/2004. In Appendix A.2 we initially simplify to a non-random age-at-study-begin, u, and the age-at-death T is reformulated as a (jump-diffusion) process N_T. Here we generalise and reformulate one history X by several processes N_{hj}(t) := Σ_{s≤t} 1{X_{s−} = h, X_s = j}, 'counting' the transitions between the states up to age t, and Y_h(t) := 1{X_{t−} = h}, indicating residence in state h at the age of t. The counting processes are collected in the vector N_X(t) := (N_{hj}(t), hj ∈ I)'. Now, as usual, statistical statements about parameters are inferred from statements about the location (the 'signal') of the random experiment. In order to define a location for a stochastic process, probabilities may be calculated on a filtration N_t := σ{N_X(s), 0 ≤ s ≤ t}. We may assume that N_X(t) is adapted to it, because we theoretically assume X_0 = H. We assume for the 'true', the population, parameter λ_0 (similar to (A1) in Appendix A): (B1) λ_{hj,0} ∈ Λ_{hj} := [ε_{hj}; 1/ε_{hj}] for some small ε_{hj} ∈ (0, 1).
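The reformulation of a history into transition counts and at-risk times can be sketched in code (not the paper's implementation; encoding a history as a list of (age, state) pairs is an assumption made here for illustration):

```python
# Sketch: reformulate one observed history X as transition counts N_hj and
# time at risk per state (the integral of Y_h), up to the horizon tau.
# A history is encoded as a sorted list of (age, state) pairs starting in 'H'.

def transition_counts_and_exposure(history, tau):
    """Return N[(h, j)] = number of h -> j transitions up to tau and
    B[h] = time spent at risk in state h up to tau."""
    N, B = {}, {}
    for (t_prev, h), (t_next, j) in zip(history, history[1:]):
        if t_prev >= tau:
            break
        B[h] = B.get(h, 0.0) + min(t_next, tau) - t_prev
        if t_next <= tau:
            N[(h, j)] = N.get((h, j), 0) + 1
    t_last, h_last = history[-1]          # exposure in the final state
    if t_last < tau:
        B[h_last] = B.get(h_last, 0.0) + tau - t_last
    return N, B

# Example: healthy until a stroke at 3.2 (years after age 50), dementia at 6.0
hist = [(0.0, "H"), (3.2, "S1"), (6.0, "D")]
N, B = transition_counts_and_exposure(hist, tau=10.0)
# N counts one H -> S1 and one S1 -> D transition;
# B holds roughly 3.2, 2.8 and 4.0 years at risk in H, S1 and D
```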
The compensator of N_X, the location concept here, has an intensity (intuitively the derivative of the compensator) with respect to N_t and P_λ of α(t) := (Y_h(t)λ_{hj}, hj ∈ I)'. (When only h appears, the first position of hj is meant. Especially h ≠ d, because death is absorbing.) Note that the notion of a 'derivative' from real analysis, being with respect to the Lebesgue measure, needs a bit of generalisation here. Starting with deterministic u, a person is not left-truncated in the case of A := {X_u ≠ d}. Different to the lifetime state model is that the history up to u is only known when X_u = H. If for instance X_u = S1, the age-at-stroke is left-censored. This is an important incentive to start observation only at u, i.e.
to define the left-truncated counting process _uN(t) := N_X(t) − N_X(t ∧ u), where t ∧ u := min(t, u). Due to the Markov property it is adapted to the filtration _uG_t := σ{_uN(s), u ≤ s ≤ t}.
Lemma 1 states the intensity of _uN with respect to the conditional probability measure P_λ^A(F) := P_λ(F | A). The proof is as in Section A.1.1. Note that P_λ^A depends on the parameter λ and on u.
The observed left-truncated and right-censored counting process is _uN^c(t) (see Fleming and Harrington, 1991, Example 1.4.2). It has an intensity _uα^c(t) with respect to P_λ^A and the observed filtration _uF^c_t := σ{_uN^c(s), u ≤ s ≤ t}. For the distinction between observable and unobservable filtrations see Section A.1.2. As _uF^c_t is the required self-exciting filtration, by Jacod's formula (see Andersen et al., 1988, Formula (4.3)), the contribution of a person (truncated or not) to the marginal likelihood, and its (natural) logarithm, follow. The product integral is explained in Appendix A.1.1. Essentially, the discrete approximation of the history X is a collection of random increments. The probability function (pf) of this collection can be written as a product of the increments' pf's.
Decreasing the grid spacing defines an integral. The double use of the integration symbol dt in the first line still differs from the line above (17) (in Appendix A.1.1), because Y(t) drops to zero after T, whereas Y_h(t) merely switches to one for a different state. The reason for the exponential function in the third line is explained shortly after (17). For the second equality, on the logarithmic scale, the logarithm of a product becomes a sum of logarithms and no new integration arises; Stieltjes integration for discontinuous g (∫ f dg = Σ f Δg) suffices. Note that _uN^c_{hj}(t) jumps, and these jumps are of height one. Further note that ∫_u^{u+10} can be replaced by ∫_0^τ, because _uY^c_h(t) already accounts for the limits, and similarly, in the product, [u, u + 10] is accounted for in _uN^c(t). Note that, because almost surely _uN^c_{hj}(u) = 0, a truncated person does not contribute to the marginal likelihood, as argued in detail with Formula (17) in Appendix A. As X is random, so must be the age-at-study-entry, U. Similar to (A2) in Appendix A.1.2, together with independent truncation, we impose as an additional assumption that not everybody is dead prior to 2004: (B2) U and X are independent, A := {X_U ≠ d} and β_{λ0} := P_{λ0}(A) > 0. The additional information provided by the stopping time U, i.e. for (X, U), and at the same time the loss of information by truncation, is reflected by including _UY^c in the filtration. Retrace the steps of Section A.1.2 to see that, similar to non-random truncation (2), conditional on the last two coordinates, by Jacod's formula the logarithmic density of (X, U, 1_A) up to τ follows, where U replaces u in the definitions of _uN^c_{hj}(t), _uY^c_h(t) and _uα^c(t) of (1). The expression uses the Doob-Meyer decomposition of N_{hj}(t), stacked to N_X. Occasionally, we will denote the second term in (4), and the corresponding subtrahend in decompositions for more advanced models, as the 'Y-term'.
The first term, the minuend, will be denoted as the 'N-term', because Y will vanish after taking derivatives, essentially due to d/dx ln(x) = 1/x and ln(ax) = ln(a) + ln(x). The observed left-truncated and right-censored versions thereof follow, with C(t) being one as long as the person is not censored, i.e. for t ≤ U + 10. That the contribution of a truncated person to the conditional likelihood is one is argued in Appendix A.1.2, Formula (19).

Point estimates and their standard errors
The truncated persons, without contribution to the conditional likelihood, are sorted to the end of the unobserved sample, a convention already found in Heckman (1976). All others contribute with (4) to the conditional likelihood (5). This requires U_i to be random, as explained in Appendix A (see Figure 3 for n_all = 4). The unique roots of the derivatives of (5), and hence the point estimates, are given by (6). One can avoid integration in the denominator of (6). Of the interesting states for h, H and S1, rewrite e.g. for h = H. Note that, similar to (22), by the simple sample assumption, among those who survive U_i (i.e. 2003), the proportion in the study period at age t in state h is asymptotically the same in the observed sample as in the entire population. By the LLN, this holds for fixed t. The limit will typically be positive, but for our parametric model we only need to assume a corresponding condition (compare (A3)). By verifying regularity conditions, we arrive at the (joint) asymptotic distribution of the estimators λ̂_{HD} and λ̂_{S1D} by standard results on martingales. It depends on m^A_h(t), the conditional prevalence of state h at age t in the population, and on β_{λ0}, the probability that a person from the sample is observed, i.e. not truncated.
Roughly speaking, the arguments of the proof, given in Appendix B, are similar to the case of the univariate parameter space in Appendix A.1.3. Luckily, the multivariate parameter space here results in a diagonal matrix of asymptotic variances, and positive definiteness follows from the positivity of the diagonal elements.
It remains to consistently estimate Σ(λ_0), in order to construct a confidence interval for the difference λ_{S1D,0} − λ_{HD,0} with the standard error. This then allows a Wald-type test for the effect of a stroke S1 on the intensity of dementia onset for the AOK HCD in Section 2.4. By Theorem 1, Var(λ̂) = Var(√n_all λ̂)/n_all ≈ Σ^{−1}(λ_0)/n_all, so that, for estimating the asymptotic variance in Theorem 1, define the estimator (8) (compare Andersen et al., 1993, Formula (6.1.11)), which is consistent by (6) and the CMT. The standard error of λ̂_{hj} is the square root thereof. Note that even though n_all, and with it the asymptotic variance, as a component of Σ(λ_0), is not observable, the standard errors are indeed observed.
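A minimal numeric sketch of the occurrence-exposure form of the point estimate (6) and of the standard error (root of (8)); the counts below are hypothetical, chosen only to match the order of magnitude of the estimates quoted later in the text:

```python
import math

# Sketch (hypothetical counts): point estimate (6) as occurrence/exposure,
# lambda_hat_hj = A_hj / B_h, and squared standard error A_hj / B_h^2 (8).

def estimate_with_se(transitions, time_at_risk):
    lam = transitions / time_at_risk               # lambda_hat_hj
    se = math.sqrt(transitions) / time_at_risk     # sqrt(A_hj / B_h^2)
    return lam, se

# toy aggregates, loosely matching the magnitudes in the text
lam_hd, se_hd = estimate_with_se(transitions=200, time_at_risk=10000.0)
# lam_hd equals 0.02, the order of lambda_hat_HD in the text
```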

Simulation of finite sample properties
Here we conduct a Monte Carlo simulation, primarily to visualise the asymptotic results on consistency, measured in (root) mean squared error, and on normality for small sample sizes, as indicated by Theorem 1. In particular, we will find that the asymptotic approximation is rather precise for our statements on the basis of the AOK HCD. Appendix A.1.4 does the same for the lifetime state model. We refrain from indicating the true parameter by a sub/superscript and drop the 0 in this section. We arrange λ as a generator (left side of (9)). Here, the small dot signals summation over the respective index j.
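The arrangement of λ as a generator can be sketched as follows (not the paper's code; λ_{HD} = 0.02 and λ_{S1D} = 0.07 are taken from the text, all remaining rate values are hypothetical placeholders):

```python
import numpy as np

# Sketch: arrange the six intensities as a generator Q over the states
# H, S1, D, d; the diagonal holds the negative row sums (the 'small dot'
# summation over the target index j).

states = ["H", "S1", "D", "d"]
rates = {("H", "S1"): 0.01, ("H", "D"): 0.02, ("H", "d"): 0.03,
         ("S1", "D"): 0.07, ("S1", "d"): 0.06, ("D", "d"): 0.1}

def generator(states, rates):
    idx = {s: k for k, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))
    for (h, j), lam in rates.items():
        Q[idx[h], idx[j]] = lam
    np.fill_diagonal(Q, -Q.sum(axis=1))   # lambda_h. = sum over j
    return Q

Q = generator(states, rates)
# every row of a generator sums to zero; the absorbing state d has a zero row
```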

Algorithm for simulating a history
We simulate a disease state history following the description in Albert (1962). As discussed in Section 2.1, we assume X_0 = H, with the second sojourn time T_2 (and j ∈ {S1, D}) having cumulative hazard function A_{j·}(t + T_1) − A_{j·}(T_1).
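The simulation of one age-homogeneous history can be sketched as competing exponential risks (a sketch, not the paper's implementation; λ_{HD} = 0.02 and λ_{S1D} = 0.07 are from the text, the other rates are hypothetical placeholders):

```python
import random

# Sketch: in state h, the sojourn time is exponential with rate
# lambda_h. = sum_j lambda_hj, and the destination j is drawn with
# probability lambda_hj / lambda_h. .

RATES = {"H": {"S1": 0.01, "D": 0.02, "d": 0.03},
         "S1": {"D": 0.07, "d": 0.06},
         "D": {"d": 0.1}}

def simulate_history(tau, rng=random.Random(1)):
    t, state, path = 0.0, "H", [(0.0, "H")]
    while state != "d":
        out = RATES[state]
        t += rng.expovariate(sum(out.values()))   # exponential sojourn
        if t >= tau:
            break                                  # administrative stop at tau
        state = rng.choices(list(out), weights=list(out.values()))[0]
        path.append((t, state))
    return path

path = simulate_history(tau=64.0)
# path starts at (0.0, 'H') and only makes transitions allowed by Figure 1
```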

Section of true parameter, sample size and birth distribution
For a true parameter λ in a realistic region of the parameter space, intensities from the literature are reconciled with results from the AOK HCD, anticipating Section 2.4.
The theoretical relation between incidences and intensities is given by P(t) = e^{tQ} (see e.g. Weißbach et al., 2009, Formula (2)), where P(t) denotes the matrix of t-year transition probabilities P(X_t = j | X_0 = h). Similar to Section A.1.4, the approximation e^Q ≈ I + Q (for Q 'small') allows us to simply substitute the one-year incidences for the intensities. For dementia onset after a stroke, Leys et al. (2005) find a one-year incidence of 7%. The value λ_{S1D} = 0.07 will be confirmed for the AOK HCD in Section 2.4 (Table 3). For dementia onset without a stroke, the AOK HCD result in λ_{HD} = 0.02. Similarly, Vieira et al. (2013) collect, though independent of whether a stroke preceded, one-year incidences of 0.008, 0.001 and 0.002 (depending on the country and age range) for individuals below age 65. The AOK HCD value of 0.02 is larger, but covers high ages as well, and we stick to 0.02.
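The quality of the approximation e^Q ≈ I + Q can be illustrated numerically (a sketch under assumed values: the intensity 0.07 for S1 → D is from the text, the competing death intensity 0.06 is a placeholder):

```python
import numpy as np

# Sketch of P(1) = e^Q versus the first-order reading I + Q. The matrix
# exponential is computed by a truncated power series, adequate for a
# generator with small entries.

def expm_series(Q, terms=30):
    P = np.eye(Q.shape[0])
    term = np.eye(Q.shape[0])
    for k in range(1, terms):
        term = term @ Q / k
        P = P + term
    return P

# sub-generator for state S1 with transitions to D (0.07) and d (0.06)
Q = np.array([[-0.13, 0.07, 0.06],
              [ 0.0,  0.0,  0.0 ],
              [ 0.0,  0.0,  0.0 ]])
P1 = expm_series(Q)                      # one-year transition matrix e^Q
# P1[0, 1] is close to 0.07, so reading the one-year incidence as the
# intensity (via I + Q) overstates it only slightly
```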
There are other parameters that are necessary for the simulation but not of primary interest for the main question; they will thus not be reported in Section 2.4.
However, we can still estimate them from the AOK HCD using (8) and (9) (right-hand side).
As sample sizes, we let n_all vary over one, five, ten and 20 (and later 100) thousand people. All are below the sample size latent in the AOK HCD. However, we will see convergence kick in, so that more computational burden is unnecessary.
For the distribution of the age-at-study-begin U , we follow Weißbach and Wied (2022) and assume the distribution of U to be uniform.
The longer the birth period, the more people are left-truncated; our population is born within 54 years (see Figure 2). However, to start with, we only use 30 years, i.e. U ∼ U[0, 30]. Combined with (9), on average 48.7% of the simulated samples are unobserved due to left-truncation.

Interpretation
The number of simulation replications is 10,000. The simulation results in Table 2 (top) confirm the consistency of λ̂. In particular, the root mean squared error drops as a function of the sample size. The simulation averages of λ̂_{HD} − λ_{HD} (and similarly for transitions S1D) reveal a generally small bias. The standard error (8) can also be suspected to be consistent (see Table 2, bottom), without a formal proof in the above section. Simulations show similar behaviour for all other λ̂_{hj} (and their standard errors). The actual level of the confidence interval is close to the nominal one. For the asymptotic normality, we use n_all = 100,000, still below the sample size behind the AOK HCD, and now run (only) 2000 simulation loops. The left and middle panels of Figure 4 confirm the asymptotic normality of λ̂_{HD} and λ̂_{S1D} (of (6)) stated in Theorem 1. The theorem also states the asymptotic independence of the two estimators, which will be important when subsequently deriving a confidence interval for the difference. The simulated correlation Cor(λ̂_{HD}, λ̂_{S1D}) = −0.02 confirms the independence.

Results for AOK HCD
As the population, we consider the 76 million Germans born between 01/01/1900 and 31/12/1954 (see Figure 2). The data, i.e. the truncated sample, were described in Section 1.1. Recall that in the disease history X, dementia at the age of t is coded as X_t = D (see the beginning of Section 2). We only recall here the number of observations, n = 236,039, and the maximally observed timespan, τ := 54 + 10 = 64 years (after a person's 50th birthday). Thus, the n persons are followed at most until the age of 114 (see Figure 9). The least possibly observed lifetime is just above 50 years, for a person turning 50 shortly before 01/01/2004 and dying shortly after that. Preliminary results for the lifetime state model, where the hazard rate of the univariate lifetime has been estimated, are in Appendix A.2, and we now expand our perspective to the history of vascular diseases.
We start from the logarithmic conditional likelihood (5) for the model introduced in Section 2.1. In the disease history model pursued here, in contrast to the lifetime state model, we specifically compare dementia onset without a preceding stroke, λ_{HD}, to onset after a stroke, λ_{S1D}. The data cover 34,012 people with an onset of dementia in the monitoring period, split into 6275 after a stroke and 27,737 not following a stroke (see Table 3). The 5864 persons with a stroke already before 2004 (see Table 1) must be combined with 19,201 persons with newly diagnosed strokes between the fourth quarter of 2004 and the end of 2013. Point estimates (6) and standard errors (roots of (8)) are given in Table 3. By the asymptotic independence, the variance of the estimated difference is (see Table 3) estimated to be 0.00093² + 0.000102² = 8.8 × 10⁻⁷. Hence, a 95%-confidence interval for the difference is [0.050, 0.056]. In the Framingham Study (Ivan et al., 2002), the adjusted RR of dementia with respect to stroke is estimated to be 2.0. The Rotterdam Study (Reitz et al., 2008) also indicates that a stroke doubles the risk of dementia (hazard ratio: HR = 2.1). A systematic review and meta-analysis reveals a pooled HR of between 1.7 and 2.2 (Kuźma et al., 2018). Another result, although without multi-states, is that of Savva and Blossom (2010), who report a hazard ratio of 2. Based on South Korean HCD, and also using multi-state methods, Kim and Lee (2018) find a 2.4-fold risk of subsequent dementia after a stroke. Our intensity ratio of 3.5 exceeds the more recent studies, presumably because those adjust for covariates. As a first covariate, we now adjust for age using age-inhomogeneous, namely piecewise constant, intensities. We will see that the effect of a stroke on dementia onset becomes markedly smaller because of Simpson's paradox: intensities increase with age, and a stroke is more likely at higher ages.
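The Wald-type interval for the difference of the two intensities can be sketched with the rounded figures quoted in the text (the interval below differs slightly from the abstract's [0.050, 0.056], which presumably rests on unrounded point estimates):

```python
import math

# Sketch: 95% confidence interval for lambda_S1D - lambda_HD. By the
# asymptotic independence of the two estimators, the variance of the
# difference is the sum of the squared standard errors.

def wald_ci_difference(est1, se1, est2, se2, z=1.96):
    diff = est1 - est2
    se = math.sqrt(se1 ** 2 + se2 ** 2)   # 0.00093^2 + 0.000102^2 ≈ 8.8e-7
    return diff - z * se, diff + z * se

lo, hi = wald_ci_difference(0.07, 0.00093, 0.02, 0.000102)
# with these rounded inputs the interval is roughly [0.048, 0.052],
# clearly excluding zero
```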

Adjustments for age and other diseases
In Section 2, the probability of suffering either event, stroke or dementia onset, was equal for all ages and independent of any other factor. Morbidity intensities vary with age, and in order to later compare our results for Germany internationally in Section 3.1.2, we derive a model in Section 3.1 that adjusts for age inhomogeneity. Also, a risk-increasing effect of stroke on the dementia hazard might not be causal in the following sense. Assume that one group has a vascular predisposition and that a stroke (mainly) indicates membership of that group.
The information about the predisposition could have been obtained earlier, and a stroke should then not trigger additional medical effort with regard to dementia prevention. In Section 3.2 we aim at stratification according to vascular predisposition.

Age-inhomogeneous intensities
We define (as in Weißbach et al., 2009; Weißbach and Walter, 2010), for a partition 0 = t_0, …, t_b = τ, X as a Markov process with piecewise constant intensities (10). We give neither the self-contained analysis of the lifetime state model of Appendix A, nor the still complete analysis of the age-homogeneous disease state model of Section 2. We restrict the display to the statement of the conditional likelihood and the derivation of the estimator. The asymptotic arguments are developed to the extent that the standard errors can be derived.

Point estimate and standard error
The same two counting processes N_{hj}(t) and Y_h(t) of Section 2.1 reformulate a history. When stacking the λ_{hj}(t) to λ(t) in the same way as the N_{hj} to N_X, N_X has a compensator, with respect to N_t, with intensity α(t) := (Y_h(t)λ_{hj}(t), hj ∈ I)'. The compensator is with respect to the probability measure P_λ, where λ := (λ_{HS1,1}, …, λ_{Dd,b})' collects the 6b parameters. Hence, with little change compared to (4), the conditional likelihood contribution is (11). Note that there are five possibilities for the intersection of [t_{l−1}, t_l) with [U, U + 10) (see Figure 5), so that the N-term and the Y-term split accordingly (with a ∨ b := max(a, b)). Again by (4) with λ, and comparable to (5), ln _UL^c(data; λ) is the sum of the contributions (11), so that (∂/∂λ_{hjl}) ln _UL^c(data|λ) = A_{hj,l}/λ_{hjl} − B_{h,l}, by interchanging differentiation and summation, with transitions A_{hj,l} and time-at-risk B_{h,l} per age group. Similar to (6), for the time interval [t_{l−1}, t_l) the point estimate is (12). For the multi-state Markov model with right-censoring (but without left-truncation), a proof of the asymptotic normality (assuming consistency) for the piecewise constant-intensity model (10) is found in Weißbach and Walter (2010). A simplified proof of consistency is found in Weißbach and Mollenhauer (2011).
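The per-interval occurrence-exposure form λ̂_{hjl} = A_{hj,l}/B_{h,l} can be sketched as follows (not the paper's code; the input spell records (state, entry age, exit age, next state or None) are a hypothetical encoding for illustration):

```python
# Sketch of the piecewise-constant point estimate: transitions A and
# time-at-risk B are tallied per age interval [t_{l-1}, t_l) of a grid.

def piecewise_estimates(spells, grid):
    """grid: ascending cut points t_0 < t_1 < ... < t_b."""
    A, B = {}, {}
    for h, start, stop, j in spells:
        for l in range(1, len(grid)):
            lo, hi = grid[l - 1], grid[l]
            overlap = max(0.0, min(stop, hi) - max(start, lo))
            if overlap > 0:
                B[(h, l)] = B.get((h, l), 0.0) + overlap
            if j is not None and lo <= stop < hi:
                A[(h, j, l)] = A.get((h, j, l), 0) + 1
    est = {(h, j, l): n / B[(h, l)] for (h, j, l), n in A.items()}
    return est, A, B

# one person: healthy on [80, 83), dementia onset at 83, 5-year grid pieces
est, A, B = piecewise_estimates([("H", 80.0, 83.0, "D")], [80.0, 85.0, 90.0])
# one H -> D transition and 3 years at risk in the first interval,
# so the estimate in that interval is 1/3
```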
It is to be expected that the proofs easily generalise to the case of left-truncation because, similarly, a Doob-Meyer decomposition of the counting process into compensator and martingale is applied and enables the martingale limit theorem.
Hence, in order to derive confidence intervals only, again the Hessian of the logarithmic conditional likelihood is needed. It is a diagonal matrix with diagonal entries (∂²/∂λ²_{hjl}) ln _UL^c(data|λ)|_{λ=λ0} = −A_{hj,l}/λ²_{hjl,0}, and hence, similar to (8), the squared standard error is (13).

Result for AOK HCD
Population and data, including the number of observations n, remain the same as in the age-homogeneous model of Section 2. Section 2.4 found an effect of stroke on dementia onset that far exceeds the findings in contemporary epidemiology.
An age-inhomogeneous dementia intensity has already been confirmed for the AOK HCD in Weißbach et al. (2021), and we now apply the piecewise constant intensities (10). Table 4 and Figure 6 exhibit point estimates, standard errors and confidence intervals due to (12), (13) and the generalisation of Theorem 1, with age intervals covering five years, i.e. with b = 12 pieces (see Table 4, column (1)). (Table 4 reports statistics (columns (2), (3), (5), (6)), point estimates with standard errors (SE) (Formulae (12) and root of (13)) (columns (4), (7)) and the intensity ratio (column (8)) for 5-year age classes; the oldest class [110, 114] has no stroke or dementia events.) For instance, in the age group with the most dementia events, namely from 80 to 85 years, the dementia intensity after stroke of λ̂_{S1D,7} = 0.117 exceeds that without stroke of λ̂_{HD,7} = 0.047 (see framed numbers in Table 4). The ratio of 2.5 is now two thirds of the ratio 0.07/0.02 = 3.5 of Table 3, and more in line with the international studies. A stroke at higher ages is anyway more likely. In more detail, a stroke generally occurs at higher ages, so that the denominator of the stroke-specific point estimator (6) starts accumulating 'time at risk' at a high age. The higher dementia intensity at those ages then results in many events in the numerator of the point estimator (6) that are not attributable to the stroke event. This defect is resolved by the age-specific ratios in (12). And the defect does not level out when calculating the relative risk, because it does not affect the healthy persons' intensity λ̂_{HD}. The formulated aim of the study is to answer the question whether stroke has an effect on dementia. Following up on the arguments in Section 2.4, consider the squared test statistics for each of the b = 12 time intervals, and add them.
This sum over the 12 differences is distributed as χ²₁₂, due to the independence between the estimated differences, which must hold in analogy to Theorem 1 also under age-inhomogeneity, as the proof of Weißbach and Walter (2010, Theorem 1) suggests. The 95% quantile of the χ²₁₂-distribution is 21.026, and the test statistic (using (13)) exceeds it, so that the test is significant.
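The global test can be sketched in a few lines (a sketch only: the 80-85 estimates 0.117 and 0.047 are from the text, while the standard error of their difference is an assumed placeholder):

```python
# Sketch of the global Wald-type test: square the standardised difference
# (lambda_hat_S1D,l - lambda_hat_HD,l) / SE_l in each age interval and add;
# under the null of no stroke effect the sum is approximately chi-squared
# with as many degrees of freedom as there are intervals.

def chi2_sum(diffs, ses):
    return sum((d / s) ** 2 for d, s in zip(diffs, ses))

diffs = [0.117 - 0.047]      # the 80-85 age group from Table 4
ses = [0.005]                # assumed SE of the difference (placeholder)
stat = chi2_sum(diffs, ses)
# this single interval alone contributes (0.070 / 0.005)^2 = 196, far beyond
# the 95% quantile 3.84 of the chi-squared distribution with one df
```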
In order to explore the role of age further, we may notice a stroke effect that decreases in age, measured in ratios. In the age group of the 55 to 60 year olds, the intensity ratio is 9.4 (see last column in Table 4). The higher the age, the smaller the intensity ratio. This coincides with the Framingham Study (Ivan et al., 2002), where the adjusted RR was higher for those younger than 80 (RR = 2.6) compared to those aged 80 or older (RR = 1.6). The b = 12 age-specific Wald-type tests for pairwise differences (suppressed here) show that there is no significant difference in the risk of dementia between persons with and without stroke for the highest age groups (90 years and older). Similarly, the systematic review by Savva and Blossom (2010) does not find an excess risk of dementia after stroke in those aged 85 years or older.
Stratification according to multi-morbidity at the time origin, i.e. at the age of 50, would require a random dichotomous classifier Z, which is not observed, as some people are already older in 2004. Moreover, multi-morbidity is age-dependent, as the acquisition of the second vascular disease could take place at any age after 50.
A time-dependent covariate $Z(t) = 1_{[\text{age at the second disease onset},\,\tau)}(t)$ is necessary: for a dichotomous covariate, the additive model $\lambda^{wo}_{hjl} + \beta_{hjl} Z(t)$, as in Kremer et al. (2014, Formula 5), and the multiplicative model $\lambda^{wo}_{hjl} e^{\beta_{hjl} Z(t)}$ (Andersen et al., 1993, Formula 7.6.2) are equivalent, and we may write the model as piece-wise constant.
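The equivalence for dichotomous $Z(t)$ is a mere reparametrisation, since both models can only take two values; a short verification:

```latex
\lambda^{wo}_{hjl} + \beta_{hjl} Z(t)
  \;=\; \lambda^{wo}_{hjl}\, e^{\tilde\beta_{hjl} Z(t)},
\qquad
\tilde\beta_{hjl} := \ln\!\Bigl(1 + \tfrac{\beta_{hjl}}{\lambda^{wo}_{hjl}}\Bigr),
```

since for $Z(t)=0$ both sides equal $\lambda^{wo}_{hjl}$, and for $Z(t)=1$ both equal $\lambda^{wo}_{hjl} + \beta_{hjl}$.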
Theory for an additive model and a fixed $z$ is derived in Kremer et al. (2014) for a lifetime state model with left-censoring. For the multiplicative model and right-censoring, Borgan (1984, Theorem 2) derives the asymptotic distribution of the estimator. The full theory for left-truncation will not be reported here; only the point estimator and standard error shall be given. Observable data require a random $Z$, as usual, and we assume that the distribution of $Z$ does not depend on $\lambda$ and condition again (after conditioning on $U$ and $X_0$) on $Z$. For ease of notation, define the age of multi-morbidity onset as $\kappa := \min\{t : z(t) = 1\}$.
We refrain from developing an age-homogeneous model and directly follow up on Section 3.1 by modelling age-inhomogeneously.

Point estimate and standard error
For each person, only one additional split of the constant intensity pieces is necessary. For (the random version of) $\kappa$ before $U$ or after $U + 10$, no further distinction is necessary (see Figure 5). The idea is that a person now contributes to the estimation of a set of parameters without multi-morbidity, $\lambda^{wo}_{hjl}$, i.e. to the transition counts and the at-risk times, until that $\kappa$. After the split, a set of parameters with multi-morbidity, $\lambda^{w}_{hjl}$, is estimated. All parameters are collected in $\lambda$. In detail, conditional on $Z = z$, the transition counts $A^{wo}_{hj,l}$ and $A^{w}_{hj,l}$ and the at-risk times $B^{wo}_{h,l}$ and $B^{w}_{h,l}$ are defined by splitting each person's history at $\kappa$. Obviously, generalising (4), as in (11), the logarithmic conditional likelihood contribution splits accordingly. Now, in extension to Section 3.1.1, for $\kappa \in [U, U + 10)$, both derivatives $\partial/\partial\lambda^{wo}_{hjl}$ and $\partial/\partial\lambda^{w}_{hjl}$ of $\ln {}_U L^c(X, U, Z|\lambda)$ contain an $N$-term. For $\kappa \geq U + 10$, only the derivative with respect to $\lambda^{wo}_{hjl}$ contains an $N$-term, and the derivative with respect to $\lambda^{w}_{hjl}$ is zero. For $\kappa < U$, the $N$-term with respect to $\lambda^{wo}_{hjl}$ is zero and, in extension to Section 3.1.1, only $\partial/\partial\lambda^{w}_{hjl} \ln {}_U L^c(X, U, Z|\lambda)$ contributes. These, summarised, as in (5), to $\ln {}_U L^c_\tau(\text{data}; \lambda) = \sum_{i=1}^n \ln {}_U L^c_\tau(X_i, U_i, Z_i|\lambda)$, and set to zero, result in the point estimators $\hat\lambda^{wo}_{hjl} = A^{wo}_{hj,l}/B^{wo}_{h,l}$ and $\hat\lambda^{w}_{hjl} = A^{w}_{hj,l}/B^{w}_{h,l}$. Their squared standard errors are, similar to (13), $A^{wo}_{hj,l}/(B^{wo}_{h,l})^2$ and $A^{w}_{hj,l}/(B^{w}_{h,l})^2$.
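The split of the at-risk time at $\kappa$ can be sketched as follows (a minimal illustration with hypothetical names; the paper's estimator additionally distinguishes states $h$ and transitions $j$):

```python
def split_at_risk_time(entry, exit_age, kappa, grid):
    """Allocate one person's at-risk time to age classes [grid[l], grid[l+1]),
    split at kappa into 'without' (B_wo) and 'with' (B_w) multi-morbidity."""
    b = len(grid) - 1
    B_wo, B_w = [0.0] * b, [0.0] * b
    for l in range(b):
        lo = max(grid[l], entry)
        hi = min(grid[l + 1], exit_age)
        if hi <= lo:
            continue  # person not at risk in this age class
        cut = min(max(kappa, lo), hi)  # clamp kappa into [lo, hi]
        B_wo[l] += cut - lo            # time before multi-morbidity onset
        B_w[l] += hi - cut             # time after multi-morbidity onset
    return B_wo, B_w

# The occurrence-exposure estimates are then transition counts over these
# times, e.g. lambda_wo = A_wo / B_wo, with squared SE A_wo / B_wo**2.
```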

Result for AOK HCD
Incorporating multi-morbidity, (14), two tables similar to the unstratified Table 4, for the two groups with and without multi-morbidity, are now given jointly in Table 5. Comparing columns 4 and 7, the dementia onset intensity is again larger when having had a stroke, as in the age-homogeneous model (of Section 2) and in the age-inhomogeneous model (of Section 3.1). Comparing the first and second rows, multi-morbidity does increase the dementia intensity, however much less than a stroke does.
The graphical analysis of the estimates (16) and confidence intervals (as in Figure 6) is displayed in Figure 7. If multi-morbidity were the predominant predictive factor, a stroke would now not increase the dementia incidence. This is not the case, as the middle panel shows. The two panels (left and middle) reveal little difference in dementia intensity between the strata (apart from larger confidence intervals, because group sizes are smaller than in Figure 6). The differences $\lambda^{wo}_{S_1Dl} - \lambda^{wo}_{HDl}$ and $\lambda^{w}_{S_1Dl} - \lambda^{w}_{HDl}$ are equal, and also equal to the unstratified difference $\lambda_{S_1Dl} - \lambda_{HDl}$ (right panel), and hereby strongly suggest that stroke is a risk factor irrespective of multi-morbidity. It is tempting to construct a $\chi^2$-test for the global hypothesis that stroke is a significant risk factor, similar to that of Section 3.1.

Figure 7: Intensities under stratification (14), as in Table 5: without multi-morbidity (left), from stroke ($S_1$) to dementia onset ($D$), $\hat\lambda^{wo}_{S_1Dl}$ (top, grey), and healthy ($H$) to dementia onset ($D$), $\hat\lambda^{wo}_{HDl}$ (bottom, black); intensities with multi-morbidity, $\hat\lambda^{w}_{S_1Dl}$ and $\hat\lambda^{w}_{HDl}$ (middle); and differences thereof, $\hat\lambda^{wo}_{S_1Dl} - \hat\lambda^{wo}_{HDl}$ (dashed) and $\hat\lambda^{w}_{S_1Dl} - \hat\lambda^{w}_{HDl}$ (dotted) (right), combined with the un-stratified difference

Conclusion
Note first that left-truncation can be circumvented by matching the starts of population and observation period, i.e. by defining the population as those born after 1954, and hence turning 50 years old from 2004 onwards (as e.g. done in Weißbach et al., 2009). However, not only will the (many) events of stroke and dementia onset for people born before 1954 then remain un-analysed, the population will also not be of current interest, because dementia is a disease of old age. More critical is that the similarity of the stroke effect from Section 3.1 with that of the related study Kim and Lee (2018) for Korea is misleading, because the latter study takes more covariates into account. Effect sizes typically decrease as a function of the number of covariates due to multicollinearity. However, integrating exogenous continuous covariates in our left-truncated event history analysis, other than the dichotomous covariate we considered, is algorithmically cumbersome (see e.g. Kim et al., 2012). Also critical is that we assume three sorts of independence. First of all, we assume it within pairs $(T_i, U_i)$, even though this is likely to be untrue in our case study, because $U$ is an affine transformation of the birth date. Typically, demographers assume that younger cohorts tend to live longer. Theory about dependent truncation is currently being developed in Tanzer et al. (2021); Uña-Álvarez and van Keilegom (2021); Rennert and Xie (2021). For dependent truncation, the truncation time distribution becomes influential, and different distributions are studied in Weißbach and Dörre (2022). Second, we can assume independence between pairs $(T_i, U_i)$, due to our data being a sample. However, for an observational study of event histories, stochastic independence must be concluded from unforeseeable birth dates. Third, and close to longitudinal independence, probably the most critical assumption of our modelling strategy is the Markov assumption.
This is especially the case as, for our application, Pendlebury and Rothwell (2009) and Corraini et al. (2017) claim that the time elapsed since stroke is a risk factor for the intensity of dementia onset. Such duration-dependence especially violates the assumption of multiplicative intensities (25) and thus requires a different strategy (see e.g. Weißbach and Schmal, 2019). Another critical point is that, after the first age axis, time-since-birth, and the second axis, time-since-observation-start, a potential third axis is the cohort trend, found by Weißbach et al. (2021) for the same data, and by Kremer et al. (2014) for another dataset. Assuming steady health progress, here for Germany, the given intensity estimates must be interpreted as intensities averaged over cohorts and are too high for today. Differences, and hence the stroke effect, could still be adequate.
A LIFETIME STATE MODEL

Consider a preliminary model in order to lay out the stochastic details more easily.
The population is unchanged from that of the multi-state model of Section 2 with several disease states: we aim at Germans born in the first half of the 20th century (see first line in Figure 2). Again, the same HCD 2004-2013 are to be used. In the disease state model of Section 2, death is only the event of truncation, but not that of interest. Collapsing the 'non-dead' states ($H$, $S_1$ and $D$) into one 'alive' state simplifies the notation. In the population model here, of interest is only the lifetime $T$ after the 50th birthday, called 'age-at-death' in the following (see Figure 8). It has hazard rate $\lambda_E(\cdot) \equiv \lambda$ (i.e. is Exponentially distributed) with CDF $F_E(\cdot)$. For a person drawn randomly from the population, the age-at-death $T$ is defined on the probability space $(\Omega, \mathcal{F}, P_\lambda)$.
(A1) For the true parameter it holds that $\lambda_0 \in \Lambda := [\varepsilon; 1/\varepsilon]$ for some small $\varepsilon \in\ ]0;1[$.

The population model is further described by a second measurement, the time elapsed for a person since the age of 50 at study begin, $U$, denoted 'age at study begin' (see Figure 8). It is an affine transformation of the birthdate. The distribution of $U$ will not be important.
There is no value in using a symbol for the number of years over which we observe, 10 (namely 2004-2013) in our case study; the number will not occur in any other meaning.

A.1 Filtration and conditional likelihood contribution
It is well-known that for a simple random sample the maximum likelihood estimator for λ is a ratio, where each person contributes a numerical one to the numerator and the time at risk, T , to the denominator.
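For instance (a sketch; function and variable names are ours), with fully observed lifetimes the occurrence-exposure ratio is:

```python
def exp_mle(lifetimes):
    """MLE of the Exponential rate: each person contributes a numerical one
    to the numerator and the time at risk T to the denominator."""
    return len(lifetimes) / sum(lifetimes)
```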
Let $t$, as in Section 2, count the years after a person's 50th birthday. That birthday is typically the earliest age at which a stroke occurs, and we continue to simply call $t$ 'age'. Of course, the age 50 is not the earliest age at which a person may die, but for simplicity we may think of it that way here. Methodologically, the definition of the origin of life, the birth, is arbitrary in this model. For the model in Section 2, we studied the effect of left-truncation due to death before 2004, and we refrain from studying the effect of truncation due to death before age 50, because death before age 50 is still rare.
Let $X_t$ indicate a person's state at the age of $t$. For a person with dementia we write $X_t = D$, irrespective of whether a stroke has preceded dementia onset at that age or not. Equivalently to $T$, the experiment can be expressed in terms of a process, $N_T(t) := 1_{\{T \leq t\}}$ (for $t \in \mathbb{R}_+$). In order to prepare for modelling left-truncation, introduce the filtration $\mathcal{N}_t := \sigma\{1_{\{T \leq s\}}, 0 \leq s \leq t\}$ and note that $N_T$ has a compensator with intensity $\alpha(t) := 1_{\{N_T(t-)=0\}}\lambda = Y(t)\lambda$, with respect to $\mathcal{N}_t$ and $P_\lambda$, and with $Y(t) := 1_{\{T \geq t\}}$.

Figure: example with $T = 100 - 50$, $T \geq u$ (or $U$) $+ 10$, $u$ (or $U$) $+ 10 = 99 - 50$. (Explanation of graphs and symbols is distributed over larger parts of the text.)

A.1.1 Non-random left-truncation and right-censoring
Informally, left-truncation means that, looking at the top path in Figure 9, the person is only observable at risk of death from 01/01/2004 on, i.e. for $T - u = 34 - 29 = 5$ years. For simplicity, we assume $u$ to be deterministic here, i.e. the experiment to be planned. This is unrealistic for our case study and will be relaxed in Section A.1.2. Intuitively, the person should now contribute to the estimator of $\lambda$, still with a numerical one in the numerator, but in the denominator no longer with $T$, only with $T - u$. (We will see that this does not strictly maximise the likelihood.) We decide to start observation at the age of $u$ and use the probability measure of $N_T$ thereafter, namely a marginal likelihood. The reduced observation in the stochastic process model corresponds to a coarser filtration. (Increasing the filtration will be necessary for random $U$.) The observable filtration is now ${}_u\mathcal{G}_t$, and the lost observation, as compared to $\mathcal{N}_t$, can be seen from completing the filtration with the trivial ${}_u\mathcal{G}_s := \{\emptyset, \Omega\}$ for $0 \leq s < u$. (If the case $T \geq u$ is considered as 'retrospectively ascertained', Weißbach and Wied (2022) include the entire path of $N_T$ into a maximum likelihood analysis, but in a manner that is less easily extended to right-censoring.) Extending ${}_u\mathcal{G}_t$ to censoring will end this section, and is needed in our case study.
We will now see how to proceed with the middle person in Figure 9. It is interesting to note that, whatever $u$, observing a person is an indication of a small $\lambda$, and not observing a person is an indication of a larger $\lambda$. Hence any left-truncated person must enter the likelihood, although we do not see it; note that we do not even know how many there are. Formally, if $N_T(u) = 1$, no stochastic development will occur after $u$, which is formalised in ${}_uN_T(t) := N_T(t) - N_T(\min(t, u))$, i.e. ${}_uN_T$ is $N_T$ if $u \leq T$ and constantly zero otherwise. Note that ${}_u\mathcal{G}_t = \sigma\{{}_uN_T(s), u \leq s \leq t\}$.
Aiming at likelihood-based estimation, a density starting from $u$, i.e. a marginal likelihood (see Gourieroux and Monfort, 1995, Definition 7.2(i)), is needed. A likelihood is, with respect to some dominating measure, the Radon-Nikodym derivative of the measure that describes the experiment. (The derivative is then evaluated at the observed data.) The measure of an experiment will usually contain the location of the experiment, and for a Bernoulli experiment, only the location is needed. Considering ${}_uN_T$ as a series of Bernoulli increments over infinitely short intervals, expectations of the increments define the intensity process. After a deterministic $u$, intuitively, the expected increase of ${}_uN_T$ over an interval of length $dt$, at age $t$, is $\lambda\, dt$ if death has not been reached, i.e. $t \leq T$, and if the person is not left-truncated, i.e. in the case of $A := \{T > u\}$. Else it is zero. We assume $P_\lambda(A) > 0$.
Formally, the intensity process of ${}_uN_T(t)$ after $u$ is ${}_u\alpha(t) := 1_{\{u < t \leq T\}}\lambda$, with respect to the dominating probability measure that conditions on $A$. For a person not truncated (see Figure 9, top) we lose information, but due to the zero intensity of a truncated person (see Figure 9, middle), the contribution to the criterion function is ineffective, and hence not observing the person still renders the criterion function observable. More formally, that the dominating measure depends on $\lambda$ can be interpreted as information loss, namely as 'remaining' in the dominating measure. For a proof that ${}_uN_T$, compensated by $\int_u^t {}_u\alpha(s)\,ds$, really is a $(P_\lambda(A), {}_u\mathcal{G}_t)$-martingale, with deterministic $u$ as a special case of the random $U$, see Proposition 4.1 in Andersen et al. (1988). We can denote a criterion function built on the latter intensity a marginal likelihood, but conditioning will follow. Right-censored is the age-at-death if it occurs after 2013, or after having left the AOK (see Figures 2 and 9, bottom). As in Andersen et al. (1993), the censored counting process ${}_uN^c_T$ has intensity ${}_u\alpha^c_T(t) := \lambda 1_{\{u \leq t \leq \min(T, u+10)\}}$ with respect to $P^A_\lambda$ and the observed filtration ${}_u\mathcal{F}^c_t$. The ${}_u\mathcal{F}^c_t$ is self-exciting (see Andersen et al., 1988, p. 4), as required (see Andersen et al., 1988, p. 23), so that the density is determined by Jacod's formula (see Andersen et al., 1988, Formula (4.3)) (see also Andersen et al., 1993, Formula (2.7.2'') (and in extension 3.2.8)). For an explanation of the second and third line, see page 24 and, respectively, Example 2.2 in Andersen et al. (1988). Note that in the first line $d\,{}_uN^c_T(t) \equiv 0$ and $1_{\{u \leq T \leq \min(t, u+10)\}} \equiv 0$. For whatever process $Z$, one defines $dZ(t) := Z(t) - Z(t - dt)$ for some 'small' $dt$ (see Fleming and Harrington, 1991, Section 1.4).
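For the one-jump process ${}_uN^c_T$ with intensity ${}_u\alpha^c_T$, Jacod's formula reduces to a simple form; as a sketch (writing the product integral informally, and consistent with the occurrence-exposure estimator of Section A.1):

```latex
\begin{align*}
{}_u L^c(T \mid \lambda)
  &= \prod_{u < t \le \tau} (\lambda \,\mathrm{d}t)^{\mathrm{d}\,{}_u N^c_T(t)}
     \exp\!\Bigl(-\int_u^{\min(T,\,u+10)} \lambda \,\mathrm{d}s\Bigr) \\
  &= \lambda^{\,1_{\{u < T \le u+10\}}}\,
     \exp\bigl\{-\lambda\,\bigl(\min(T,\,u+10)-u\bigr)\bigr\},
\end{align*}
```

so that an uncensored death contributes a factor $\lambda$, and every observed person contributes their observed time at risk in the exponent.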
Independent persons are, due to different $u_i$, not identically distributed. Each density is a Radon-Nikodym derivative with respect to a different measure, all of which even depend on the parameter, $P^{A_i}_\lambda$. Even worse, the measures are conditional on observation. That the density for a collection of independent persons is the product of the persons' densities (see e.g. Bleymüller et al. (2020, Sect. 8.2) or Feller (1971, Sect. III.1, Example (a))) relies on an equal (and parameter-independent) dominating measure (usually Lebesgue's). To achieve equal dominating measures for all persons, we will and can follow Examples IV.1.7 and VI.1.4 of Andersen et al. (1993) in using a random $U$. The sampling in our case study, of size $n_{all}$, implies that indeed $T$ and $U$ are random.

A.1.2 Random left-truncation and conditional right-censoring
The probability space for $(T, U)$ is $(\Omega, \mathcal{F}, \tilde P_\lambda)$, where the distribution of $U$ will not be important and we suppress its parameter (and indicate the difference to $P_\lambda$ with the tilde).
Left-truncation for age-at-death. Ignoring censoring for the moment, $T$ is recorded when larger than $U$, the age-at-study-begin (see again Figure 9). Consider the unobservable filtration $\mathcal{G}_t := \sigma\{1_{\{T \leq s\}}, 1_{\{U \leq s\}}, 0 \leq s \leq t\}$. For left-truncation (see Andersen et al., 1993, Example III.3.2), similar to the non-random case, the intensity of $N_T$ with respect to $\mathcal{G}_t$ and $\tilde P_\lambda$ is again $\lambda 1_{\{T \geq t\}}$ (because $U$ is independent of $T$). We observe durations $U$ and $T$ in the case of $A$, and neither measurement $U$ nor $T$, nor the person at all, when $T < U$. Now define ${}_UY(t) := Y(t)1_{\{t > U\}} = 1_{\{T \geq t > U\}}$; as $\mathcal{G}_t$ is unobservable, but due to $U$ being a $\mathcal{G}_t$-stopping time, ${}_UN_T$ has intensity $\lambda\, {}_UY(t)$ with respect to ${}_U\mathcal{F}_t$ and the probability measure $\tilde P^A_\lambda$ (see again Andersen et al., 1988, Proposition 4.1). Intuitively, ${}_UN_T$ will not yet jump prior to $U$, no longer after $T$, and in between at the hazard rate of $T$. Now, as ${}_U\mathcal{F}_t$ is self-exciting, we may apply Jacod's formula (see Andersen et al., 1988, Formula (2.1)) in order to determine a conditional version of the marginal likelihood (see Andersen et al., 1988, Formula (4.3)). And as $U$ is independent of $T$ and the hazard rate has the form $\lambda_E(\cdot) \equiv \lambda$, the conditional (marginal) likelihood (see Gourieroux and Monfort, 1995, Definition 7.2(ii)) is one for $T \leq U$, and else as given in (19) (see also Andersen et al., 1993, Formula (3.3.3), and Andersen et al., 1988, p. 24, with $\Delta\,{}_UN_T$ replaced by $d\,{}_UN_T$). Define the size of the observed sample, combining all persons surviving 31/12/2003, as $n := \sum_{i=1}^{n_{all}} 1_{A_i}$, where $i$ enumerates the persons (see Figure 2). Without loss of generality, we have sorted those not truncated to the beginning of the sample, and all other factors are one (see again the first line of (17)). Due to the assumption of a simple sample (and hence the same dominating measure), the logarithmic conditional likelihood for the data (we suppress marginalisation from now on) is the sum of logarithms of (19).
We denote by $N^{cens} := \sum_{i=1}^n 1_{\{T_i > U_i + 10\}}$ the number of right-censored and by $N^{uncens} := n - N^{cens} = \sum_{i=1}^n 1_{\{U_i < T_i \leq U_i + 10\}}$ the number of neither truncated nor censored people, and can write

$\ln {}_UL^c_\tau(\lambda) = N^{uncens} \ln \lambda - \lambda \sum_{i=1}^n \bigl(\min(T_i, U_i + 10) - U_i\bigr),$

so that the unique estimator (if in $\Lambda$, see (A1)) becomes

$\hat\lambda = N^{uncens} \Big/ \sum_{i=1}^n \bigl(\min(T_i, U_i + 10) - U_i\bigr). \qquad (21)$

For our parametric model, we additionally assume (A2) and (A3).

Theorem 2: Under Conditions (A1)-(A3), $\hat\lambda$ of (21) is consistent and $\sqrt{n_{all}}(\hat\lambda - \lambda_0)$ is asymptotically normal.

Proof: Due to the uniqueness of $\hat\lambda$, for both consistency and asymptotic normality, we need to verify Conditions (A)-(E) (see Andersen et al., 1993, Theorems VI.1.1+2). In order to map the notations, note that $n$ becomes $n_{all}$, $a_n := \sqrt{n_{all}}$, $\theta$ becomes $\lambda$ and $h$ is redundant. The $\lambda(t; \theta)$ becomes, by interchanging conditional expectations with summation, the multiplicative intensity process of ${}_UN^c_T$. By (A1), (A) is fulfilled for the intensity ${}_U\alpha^c_T(t)$, and therefore for ${}_U\alpha^c_{T\bullet}(t; \lambda)$ and the logarithm of likelihood (19). For (B), by the LLN, for fixed $t \in [0, \tau]$, the portion of those in the study period alive at age $t$ and observed converges in probability, due to the simple sample assumption. Furthermore, $n/n_{all} \stackrel{P}{\longrightarrow} \tilde P_{\lambda_0}(A) = \beta_{\lambda_0}$, so that Slutsky's Lemma and the CMT yield (B). For (D), by (A3), $\sigma(\lambda_0)$ is positive. Condition (E) consists of six conditions: the first four follow from (23); for the fifth, note (23) (and also (A1)); for the sixth, note (23) and then $H/\sqrt{n_{all}} \stackrel{P}{\to} 0$.
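A sketch of estimator (21) from the observed, i.e. non-truncated, pairs (function and variable names are ours; the Poisson-type standard error in the style of (13) is an assumption on the form of (24)):

```python
def left_truncated_mle(T, U, window=10.0):
    """lambda-hat = N_uncens / total observed time at risk, cf. (21).

    T, U: age-at-death and age-at-study-begin of the observed persons
    (T_i > U_i); deaths later than U_i + window are right-censored."""
    n_uncens = sum(1 for t, u in zip(T, U) if t <= u + window)
    at_risk = sum(min(t, u + window) - u for t, u in zip(T, U))
    lam_hat = n_uncens / at_risk
    se = n_uncens ** 0.5 / at_risk  # observable, no knowledge of n_all needed
    return lam_hat, se
```

Note that the left-truncated persons enter neither sum, mirroring that their likelihood contribution is one.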

A.1.4 Finite sample properties
Comparable to Section 2.3, we visualise consistency and asymptotic normality, stated in Theorem 2. We again find the approximations suitable for the AOK HCD data of n ≈ 250,000 observations in the next Section A.2.
For an appropriate parameter, similar to Section 2.3.2, for the Exponential distribution of $T$, note first that $P(T \leq 1) = 1 - e^{-\lambda_0} \approx \lambda_0$ for small $\lambda_0$ (by the well-known $e^x \approx 1 + x$ for small $x$). For a contemporaneous one-year death incidence (and hence intensity) in Germany, $\approx$ 0.96 mio. deaths in 2020, restricted to the over-50-year-olds (Destatis 2021b), are to be set in relation to $\approx$ 35.5 mio. people of that age.
In order to mimic the portion of uncensored observations in our case study, we simulate, independently of $T$, $U \sim \text{Exp}(0.004)$, so that (on average) from $n$ = 250,000 observations, $n^{uncens}$ = 40,000 are uncensored. For different sample sizes and 2,000 simulated studies each, Table 6 shows that the mean squared error decreases with increasing sample size, as is to be expected, and reaches an irrelevant magnitude far below the sample size of our case study. Normality of estimator (21) is assessed for the sample size of the case study.
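The simulation design can be sketched as follows ($\lambda_0$ = 0.03 is a hypothetical rate of the order suggested by the demographic calculation above; the study used 2,000 replications per sample size):

```python
import random

def simulate_estimate(n_all, lam0=0.03, lam_u=0.004, window=10.0, seed=1):
    """Draw T ~ Exp(lam0) and, independently, U ~ Exp(lam_u); persons with
    T <= U are left-truncated (never observed), deaths after U + window are
    right-censored. Returns estimator (21) from the observed persons."""
    rng = random.Random(seed)
    n_uncens, at_risk = 0, 0.0
    for _ in range(n_all):
        t = rng.expovariate(lam0)
        u = rng.expovariate(lam_u)
        if t <= u:
            continue  # truncated: contributes nothing observable
        n_uncens += t <= u + window
        at_risk += min(t, u + window) - u
    return n_uncens / at_risk
```

Averaging squared deviations from $\lambda_0$ over repeated seeds then reproduces the decreasing mean squared error reported in Table 6.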

A.2 Result for AOK HCD
The population, the sample of size $n_{all}$, and, among those, the $n$ = 236,039 people not truncated, are described in Section 2.4. The conditional likelihood contributions are (20). The person-years at risk are given in Table 7 and add up to 1.8 million years, i.e. $\hat\lambda = 0.035$. Note that the person-years at dementia risk in Table 1 are by definition smaller. Recall that neither for the point estimate (21) nor for its standard error (24) is knowledge of $n_{all}$ necessary. Theorem 2 yields the confidence interval $\hat\lambda \pm z_{1-0.025}\, \sigma^{-1}(\lambda_0)/\sqrt{n_{all}}$, with the 97.5% quantile of the standard Gaussian distribution $z_{1-0.025}$, being very narrow due to the small standard error $J_\tau^{-1/2}(\hat\lambda) = [n_{all}\,\sigma^2(\hat\lambda)]^{-1/2}$ by (24) (see again Table 7). The expected lifetime is $1/0.035 \approx 28$ (plus 50) years. That may not be appropriate, due to the assumption of a constant hazard and the impossibility of death before age 50 in this model. An age-inhomogeneous increase in the hazard is well-documented and in basic demography usually modelled with a Gompertz distribution. Piecewise constant hazards would be as in Section 3.1.

B PROOF OF THEOREM 1
Due to the uniqueness of $\hat\lambda$, for both consistency (see Andersen et al., 1993, Theorem VI.1.1) and asymptotic normality (see Andersen et al., 1993, Theorem VI.1.2), we need to verify Conditions (A)-(E) (see Andersen et al., 1993, Condition VI.1.1). Specifically, in order to map the notations, note that $n$ becomes $n_{all}$, $a_n := \sqrt{n_{all}}$, $\theta$ becomes $\lambda$, $h$ becomes $hj$, and $\lambda_h(t; \theta)$ becomes the multiplicative intensity process of ${}_UN^c_{hj\bullet}(t)$, because, by interchanging conditional expectations with summation, it is the sum of the compensators for each person's counting process.