Semiparametric likelihood inference for heterogeneous survival data under double truncation based on a Poisson birth process

We study a selective sampling scheme in which survival data are observed during a data collection period if and only if a specific failure event is experienced. Individual units belong to one of a finite number of subpopulations, which may exhibit different survival behaviour, and thus cause heterogeneity. Based on a Poisson process model for individual emergence of population units, we derive a semiparametric likelihood model, in which the birth distribution is modeled nonparametrically and the lifetime distributions parametrically, and define maximum likelihood estimators. We propose a Newton–Raphson-type optimization method to address numerical challenges caused by the high-dimensional parameter space. The finite-sample properties and computational performance of the proposed algorithms are assessed in a simulation study. Personal insolvencies are studied as a special case of double truncation and we fit the semiparametric model to a medium-sized dataset to estimate the mean age at insolvency and the birth distribution of the underlying population.


Introduction
In survival analysis, missing data are a central notion, as full observability of all units is rarely achievable. Consequently, statistical methods are designed to account for the various types of missing information in order to obtain valid inference. The situation we address in the following lies at the intersection of survival analysis and retrospective studies, where truncated data result from sampling only those units that fail within a certain data collection period.

Fig. 1 Only units that fail within the data collection period are sampled (solid circles). Units that fail before or after the data collection period (empty circles) are not sampled
Example 1 Consider personal insolvencies in Germany (see also Sect. 5). When a person becomes insolvent in Germany, individual information on the person is made available due to legal processes. Since the information is accessible for a limited amount of time through a public website, only units with their insolvency event occurring within the data collection period can be observed. We have collected data from this source from March 2018 to December 2018 and sampled data on insolvent persons born between 1920 and 2000 in all 16 federal states of Germany. Figure 1 illustrates this truncation mechanism.
This sampling design is a special case of random double truncation, as the lifetimes are truncated if they are either too short or too long. Random double truncation may occur as an unavoidable characteristic of data collection procedures or may be chosen deliberately as a cost-effective alternative to other sampling designs. In the example above, the stochastic process of births over time poses a specific component of the truncation setup, which is addressed here.
Random double truncation is relevant to many areas including medicine (see Lagakos et al. 1988; Scheike and Keiding 2006; Moreira et al. 2014), astronomy (see Efron and Petrosian 1999), economics and engineering (see Lawless and Kalbfleisch 1992; Ye and Tang 2016). In recent years, attention to this area of research has been growing (see Emura and Konno 2012; Zhang 2015; Emura et al. 2017; Dörre and Emura 2019; de Uña-Álvarez 2020). This is partly due to the fact that administrative and economically collected data are often observational and frequently subject to selective criteria or constraints which can be posed in a truncation framework.
Among the various existing models and methods for doubly truncated data that have been investigated, the nonparametric maximum likelihood estimator under random double truncation (NPMLE, see Shen 2010a; Moreira and Uña-Álvarez 2010; Emura et al. 2015) has perhaps been studied most extensively. The NPMLE is generally applicable under double truncation, but can be computationally demanding and requires caution regarding identifiability aspects (see Shen 2010a) and the existence of the estimator (see Xiao and Hudgens 2019).
Parametric models have been developed partly to address the drawbacks of the nonparametric approach (see Emura et al. 2017; Dörre and Emura 2019). In Dörre (2020), fully parametric models have been studied in a setting similar to the one studied here. Finally, semiparametric models (see Wang 1989; Shen 2010b) have been proposed in the context of double truncation.
Most existing models for doubly truncated data are primarily designed for homogeneous populations and cannot readily be extended to heterogeneous populations in an efficient manner. However, heterogeneity may be an immanent feature of survival data. For instance, in the example described above, it cannot be assumed that the underlying lifetime distributions of people in different federal states are identical. While existing approaches such as the NPMLE could be applied to each subpopulation separately, this procedure would not be efficient if the subpopulations have common structural properties.
By incorporating available information on the structure of the study population, we derive a semiparametric method for survival data under double truncation. The survival data are modeled as heterogeneous in that the population may consist of separate subpopulations with potentially different lifetime distributions.
In Sect. 2, we specify the sampling setup and parameter space, and derive a corresponding likelihood model in order to define proper maximum likelihood estimates. By reconsidering the model structure, we address asymptotic properties and discuss notions of point processes and inverse probability weighting in Sect. 3. In Sect. 4, finite-sample properties of the proposed maximum likelihood estimates and numerical optimization methods are examined by means of simulation studies. We apply the methods to study personal insolvencies in Germany in Sect. 5.

General framework
We suppose that the units under investigation experience two subsequent events, generically called 'birth' and 'failure', throughout time. The meaning of these events may differ depending on the particular context; together they constitute a survival analysis setup, where the 'lifetime' is defined as the duration between birth and failure. Furthermore, we suppose that the population of interest is composed of a finite number m of distinct subpopulations, which may exhibit different lifetime distributions.
Units emerge throughout time according to stochastic processes, yielding random subpopulation sizes N j , j = 1, . . . , m. Each unit is born at a random time b and has a random lifetime y. The corresponding failure event therefore occurs at b + y. The stochastic process describing the birth of units implicitly corresponds to a distribution G(·) of births in the population.
Due to the sampling setup, only units that fail within the data collection interval [τ_L, τ_R] are observed. Therefore, any unit fulfilling the selection criterion is sampled, and for sampled units we assume that full information is available, i.e. birth b, lifetime y and the variable x ∈ {1, . . . , m} indicating the subpopulation to which the unit belongs (see Fig. 2). Units not fulfilling this selection criterion are truncated in the sense that no information about those units is available. The sample consists of n ≤ N units, and is subdivided into random subsamples of sizes n_j, j = 1, . . . , m, according to the considered subpopulations. Note that the subpopulation sizes N_j are unobservable, whereas the subsample sizes n_j are observed. We denote the data as D_n = {(b_i, x_i, y_i) : i = 1, . . . , n}. Throughout, we make the following assumptions:

(A) For each j = 1, . . . , m, the units in the jth subpopulation emerge independently of all other subpopulations and stochastically within a common fixed time interval [s, t] according to a Poisson process with intensity function λ_j(·).
(B) The birth intensities are proportional, i.e. λ_j(b) = a_j λ_1(b) for all b ∈ [s, t] and j = 1, . . . , m, with a_1 = 1.
(C) The lifetime distributions are continuous and parametric, i.e. for each j = 1, . . . , m, there is a true parameter θ_j such that f_j(·|θ_j) is the density function of y conditional on x = j.
(D) The truncation mechanism is such that a unit (b, x, y) from the population is sampled if and only if it fails within the data collection period, i.e.

τ_L ≤ b + y ≤ τ_R. (1)

(E) The lifetimes y and births b are independent.
Assumption (E) is necessary in some derivations below. In particular, this assumption requires that the lifetime distribution in any subpopulation does not change over the course of time. In specific applications, this assumption may be questionable. For testing the (quasi-)independence assumption, several approaches have been studied; see Tsai (1990) and Martin and Betensky (2005). Recent papers which address the independence assumption within the truncation context are Chiou et al. (2018, 2019) and Moreira et al. (2021).
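To make the sampling scheme concrete, the following minimal Python sketch simulates doubly truncated data under Assumptions (A)-(E), using constant birth intensities (so births are uniform on [s, t]), exponential lifetimes and illustrative numerical values; none of these choices are prescribed by the model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative setup: births on [s, t] = [0, 8], data collection [tau_L, tau_R] = [8, 10],
# m = 2 subpopulations with exponential lifetimes.
s, t = 0.0, 8.0
tau_L, tau_R = 8.0, 10.0
Lambda_1, a_2 = 300.0, 1.5   # cumulative intensities: Lambda_2 = a_2 * Lambda_1
theta = [0.5, 0.8]           # exponential rates theta_1, theta_2

sample = []
for j, rate in enumerate(theta, start=1):
    N_j = rng.poisson(Lambda_1 * (1.0 if j == 1 else a_2))  # latent subpopulation size
    b = rng.uniform(s, t, size=N_j)                         # constant intensity -> uniform births
    y = rng.exponential(1.0 / rate, size=N_j)               # lifetimes
    keep = (tau_L <= b + y) & (b + y <= tau_R)              # selection criterion (Assumption D)
    sample += [(bi, j, yi) for bi, yi in zip(b[keep], y[keep])]

n = len(sample)  # observed sample size; the latent sizes N_j stay unknown in practice
```

Only the triples (b_i, x_i, y_i) of selected units would be available to the analyst; everything else is truncated.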

Proportional birth intensity model
The distribution function of a randomly chosen birth b generated by a Poisson process with intensity function λ(·) is equal to

G(b) = ∫_s^b λ(u) du / ∫_s^t λ(u) du, b ∈ [s, t]. (2)

By Assumption (B) and Eq. (2), it follows that there is a common birth distribution for all considered subpopulations, such that the birth distribution functions of the subpopulations fulfill G_j ≡ G for all j = 1, . . . , m.
By Assumptions (A) and (D), the observed units of any subpopulation j form a random point process with regard to the observed births b_i where x_i = j. It can be shown that the observed point process of each subpopulation j is also a Poisson process. Furthermore, the sum of all observed birth processes is a Poisson process, too, by the independence stated in Assumption (A). For any subpopulation j = 1, . . . , m, we denote the cumulative intensity of the latent birth process as Λ_j := ∫_s^t λ_j(b) db and the cumulative intensity of the observed birth process as Λ_j^obs. The probability of observing one unit, conditional on it belonging to subpopulation j with given birth time b, denoted as π_j(θ_j|b), depends on the lifetime distribution of the corresponding subpopulation, and is equal to

π_j(θ_j|b) = P(τ_L ≤ b + y ≤ τ_R | x = j, b) = F_j(τ_R − b|θ_j) − F_j(τ_L − b|θ_j),

where F_j(·|θ_j) is the distribution function corresponding to f_j(·|θ_j). The intensity function of the observed point process of subpopulation j is equal to

λ_j^obs(b) = λ_j(b) π_j(θ_j|b). (3)

Similarly, Λ_j^obs = Λ_j π_j(θ_j|G), where

π_j(θ_j|G) := ∫_s^t π_j(θ_j|b) dG(b) (4)

is the probability of observing a random unit from subpopulation j, given birth distribution function G. To see this, note that Λ_j^obs = ∫_s^t λ_j(b) π_j(θ_j|b) db. Using Eq. (2), it follows that λ_j(b) db = Λ_j dG(b), and therefore Λ_j^obs = Λ_j ∫_s^t π_j(θ_j|b) dG(b) = Λ_j π_j(θ_j|G). By h(·), we denote the density function of x, i.e. the probability of a unit from the population belonging to a specific subpopulation. It is a consequence of Assumption (B) that this density function only depends on a = (a_1, . . . , a_m), representing a multinomial experiment with

h(j|a) = a_j / Σ_{k=1}^m a_k, j = 1, . . . , m. (5)

Furthermore, the sum of all latent birth processes itself forms a Poisson process, and has cumulative intensity Λ_1 Σ_{j=1}^m a_j. Similarly, it holds that the sum of all observed birth processes is a Poisson process, too, for which the cumulative intensity is equal to Λ_1 Σ_{j=1}^m a_j π_j(θ_j|G). The probability of observing a single random unit from the population is equal to

π(G, a, θ) = Σ_{j=1}^m h(j|a) π_j(θ_j|G) = Σ_{j=1}^m a_j π_j(θ_j|G) / Σ_{j=1}^m a_j, (6)

where π_j(θ_j|G) is defined as in Eq. (4). Finally, we state the following result, which is fundamental for constructing the likelihood function.

Proposition The sum η(·) of all observed birth processes is a Poisson process with intensity function

λ^obs(b) = Σ_{j=1}^m λ_j(b) π_j(θ_j|b) = λ_1(b) Σ_{j=1}^m a_j π_j(θ_j|b).

The cumulative intensity is equal to π(G, a, θ) Λ_1 Σ_{j=1}^m a_j.

Proof It can be shown by elementary calculations that η(·) is Poisson distributed with the corresponding intensity. Assumption (A) ensures independent increments of η(·).
A detailed proof is omitted.
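The selection probabilities above are straightforward to compute numerically. The following sketch assumes exponential lifetimes and a discrete birth distribution supported on the observed births (as in the semiparametric model below); the function names are ours, not from the paper.

```python
import numpy as np

def F_exp(u, rate):
    """Exponential lifetime CDF, F(u) = 1 - exp(-rate * u) for u >= 0, else 0."""
    return np.where(u > 0, 1.0 - np.exp(-rate * np.maximum(u, 0.0)), 0.0)

def pi_j_b(rate, b, tau_L, tau_R):
    """pi_j(theta_j | b): probability that a unit born at b fails inside [tau_L, tau_R]."""
    return F_exp(tau_R - b, rate) - F_exp(tau_L - b, rate)

def pi_j_G(rate, b_obs, dG, tau_L, tau_R):
    """pi_j(theta_j | G): selection probability averaged over the discrete birth weights (Eq. (4))."""
    return np.sum(pi_j_b(rate, b_obs, tau_L, tau_R) * dG)

def pi_total(a, rates, b_obs, dG, tau_L, tau_R):
    """pi(G, a, theta) = sum_j a_j pi_j(theta_j|G) / sum_j a_j, with a_1 = 1 (Eq. (6))."""
    pj = np.array([pi_j_G(r, b_obs, dG, tau_L, tau_R) for r in rates])
    return float(np.dot(a, pj) / np.sum(a))
```

For any parametric family, only `F_exp` needs to be replaced by the corresponding distribution function F_j(·|θ_j).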

Likelihood function
Under Assumptions (A)-(E), the observed Poisson process has the density function (see Snyder and Miller 1991 and Theorem 3.1.1 in Reiss 1993)

p(D_n) = exp(−Λ_1 Σ_{j=1}^m a_j π_j(θ_j|G)) ∏_{i=1}^n λ_{x_i}(b_i) f_{x_i}(y_i|θ_{x_i}). (7)

Note that Eq. (7) does not consist of iid factors and is therefore not ideal for likelihood inference. By conditioning on the random sample size n, however, we construct a likelihood function with proper iid terms. For more technical details in a similar model, see Dörre (2020).
To estimate the birth distribution function G nonparametrically, we place positive probability weights precisely on all observed births b_1, . . . , b_n. Specifically, we set ΔG_i := G(b_i) − G(b_i−), i = 1, . . . , n, and treat ΔG_i as parameters of the unknown birth distribution. The corresponding birth distribution results as G(b) = Σ_{i: b_i ≤ b} ΔG_i. By conditioning Eq. (7) on the random sample size n, we obtain the corresponding likelihood function

L(G, a, θ|D_n) = ∏_{i=1}^n ΔG_i f_{x_i}(y_i|θ_{x_i}) h(x_i|a) / π(G, a, θ), (8)

where ΔG_i > 0, Σ_{i=1}^n ΔG_i = 1, a ∈ R^m_{>0} with a_1 = 1 and θ = (θ_1, . . . , θ_m) ∈ Θ_1 × · · · × Θ_m. We define the maximum likelihood estimator (Ĝ^(n), â, θ̂) as the maximizer of Eq. (8).
Altogether, this model entails n + m + Σ_{j=1}^m dim(Θ_j) − 2 parameters, not counting a_1 and ΔG_n due to being subject to constraints. The model is semiparametric in the sense that the number of parameters increases linearly with growing sample size. The likelihood function Eq. (8) can be rearranged so that the log-likelihood function can be written as

log L(G, a, θ|D_n) = Σ_{i=1}^n log ΔG_i + Σ_{i=1}^n log f_{x_i}(y_i|θ_{x_i}) + Σ_{i=1}^n log h(x_i|a) − n log π(G, a, θ),

where π(G, a, θ) is obtained as in Eq. (6) and, following Eq. (4), π_j(θ_j|G) = Σ_{i=1}^n π_j(θ_j|b_i) ΔG_i. Note that the likelihood function Eq. (8) represents an inverse probability weighting (IPW) approach, where the probability weights π(G, a, θ) are determined endogenously and equally for every observed unit. By conditioning on certain parts of the data, other inverse probability weighting likelihood functions are conceivable under the considered truncation scheme. For instance, for any j = 1, . . . , m, an estimator θ̂_j^c of θ_j may be defined as the maximizer of

L_j(θ_j) = ∏_{i: x_i = j} f_j(y_i|θ_j) / π_j(θ_j|b_i), (9)

and L_j is valid in the sense that θ̂_j^c → θ_j as n_j → ∞, which is not proved here. Using Eq. (8) has the advantage of leading to generally more precise estimates due to the stronger assumptions with regard to the birth distribution, while Eq. (9) has the advantage of not being vulnerable to misspecification with regard to the lifetime distribution in subpopulations other than j or the birth distribution G.
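The log-likelihood is simple to evaluate for fixed parameters. The sketch below is our hedged reading of the conditional likelihood with iid terms ΔG_i f_{x_i}(y_i|θ_{x_i}) h(x_i|a)/π(G, a, θ), specialized to exponential lifetimes for illustration; indices x_i are coded 0, . . . , m−1.

```python
import numpy as np

def log_lik(dG, a, rates, b_obs, x_obs, y_obs, tau_L, tau_R):
    """Conditional log-likelihood sketch:
    sum_i [log dG_i + log f_{x_i}(y_i) + log h(x_i|a)] - n log pi(G, a, theta)."""
    n = len(b_obs)
    # parametric lifetime densities (exponential, for illustration)
    f = rates[x_obs] * np.exp(-rates[x_obs] * y_obs)
    # multinomial subpopulation density h(j|a) = a_j / sum(a)
    h = a[x_obs] / np.sum(a)
    # overall selection probability pi(G, a, theta)
    cdf = lambda u, r: np.clip(1.0 - np.exp(-r * np.clip(u, 0.0, None)), 0.0, 1.0)
    pi_b = np.array([np.sum((cdf(tau_R - b_obs, r) - cdf(tau_L - b_obs, r)) * dG)
                     for r in rates])
    pi = np.dot(a, pi_b) / np.sum(a)
    return np.sum(np.log(dG)) + np.sum(np.log(f)) + np.sum(np.log(h)) - n * np.log(pi)
```

Note that the weights ΔG_i enter both directly and through π(G, a, θ), which is what couples the nonparametric and parametric parts during optimization.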

Birth intensity estimation
The proposed likelihood model Eq. (8) can be regarded as an inverse probability weighted approach, in which every sampled unit is weighted by the inverse probability of observing a single random unit from the population. Roughly speaking, every sampled unit corresponds to π(G, a, θ)^{−1} hypothetical units in the population, out of which one single unit fulfills the selection criterion in Eq. (1) and has thus been sampled. Similarly, a sampled unit from the jth subpopulation with birth b_i corresponds to π_j(θ_j|b_i)^{−1} hypothetical units before the selection. Based on this perspective and following Eq. (3), we define

Λ̂_j := Σ_{i: x_i = j} π_j(θ̂_j|b_i)^{−1} (10)

as an estimate of the cumulative birth intensity of the jth subpopulation. The IPW perspective also offers an alternative approach to estimating the proportionality factors a_2, . . . , a_m. Due to the proportional intensity assumption, it holds that Λ_j = a_j Λ_1 for j = 2, . . . , m. One simple way to estimate the coefficients a_2, . . . , a_m is therefore to use Eq. (10), by which

â_j^simple := Λ̂_j / Λ̂_1 (11)

is obtained. Note also that this approach is feasible whenever a consistent estimate of θ is available, and no estimate of G is required. We show in Sect. 4 that this method leads to valid, yet inefficient estimates.
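The IPW estimates in Eqs. (10)-(11) can be sketched as follows, again for exponential lifetimes; the estimated rates and the observed births below are illustrative assumptions, not values from the paper.

```python
import numpy as np

def cum_intensity_ipw(rate, b_obs_j, tau_L, tau_R):
    """Lambda_hat_j: sum of inverse selection probabilities over observed units (Eq. (10))."""
    F = lambda u: 1.0 - np.exp(-rate * np.clip(u, 0.0, None))
    pi_b = F(tau_R - b_obs_j) - F(tau_L - b_obs_j)   # pi_j(theta_hat_j | b_i)
    return np.sum(1.0 / pi_b)

# toy observed births per subpopulation (assumed, for illustration only)
rng = np.random.default_rng(4)
b1, b2 = rng.uniform(0, 8, 40), rng.uniform(0, 8, 60)

Lam1 = cum_intensity_ipw(0.5, b1, 8.0, 10.0)
Lam2 = cum_intensity_ipw(0.8, b2, 8.0, 10.0)
a2_simple = Lam2 / Lam1   # a_hat_simple_2 (Eq. (11))
```

Since every weight π_j(θ̂_j|b_i)^{−1} is at least 1, each Λ̂_j is at least as large as the corresponding subsample size n_j, reflecting that the latent subpopulation is larger than the observed one.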

Large-sample properties
Studying the asymptotic behaviour of the proposed estimators is based on the subsample sizes growing indefinitely. We note, however, that n_j → ∞ if and only if Λ_j → ∞, and thus asymptotics cannot be studied without changing the birth intensity function. As a consequence of this perspective, it is not possible to estimate Λ_j consistently in the sense that |Λ̂_j − Λ_j| converges to 0 for n → ∞. Nevertheless, Eq. (2) shows that the birth distribution G, which we estimate, is invariant with respect to scaling the birth intensity function λ, as long as its shape is not changed. Due to Assumptions (A)-(E), n_j is Poisson distributed with mean Λ_j^obs = Λ_j π_j(θ_j|G). Therefore, when Λ_j → ∞, it holds that n_j → ∞, almost surely, and the expression

n_j^{−1} Σ_{i: x_i = j} π_j(θ_j|b_i)^{−1}

converges to E(π_j(θ_j|b)^{−1}|x = j) by the law of large numbers, where E(·) is the conditional expectation with respect to the law of observed units in subpopulation j. It holds that

g_j^obs(b) = π_j(θ_j|b) g(b) / π_j(θ_j|G)

is the density function of observed random births b in subpopulation j, where g denotes the density of G. Therefore,

E(π_j(θ_j|b)^{−1}|x = j) = ∫_s^t π_j(θ_j|b)^{−1} g_j^obs(b) db = π_j(θ_j|G)^{−1},

so that E(Λ̂_j) = E(n_j) π_j(θ_j|G)^{−1} = Λ_j, since n_j is Poisson distributed with parameter Λ_j^obs. This derivation shows that Λ̂_j as defined in Eq. (10) is a reasonable estimate of Λ_j. Consequently, â_j^simple as defined in Eq. (11) is expected to be a reasonable estimate of a_j.
To study the large-sample properties of the proposed estimators based on Eq. (8), we consider the asymptotic behaviour of the log-likelihood function. Define

Q_n(G, a, θ) := n^{−1} log L(G, a, θ|D_n)

as the stochastic criterion function, which is maximized by the maximum likelihood estimator (Ĝ^(n), â, θ̂). Suppose first that the true birth distribution G is known. A conditional criterion function can be defined as

Q_n(a, θ|G) := n^{−1} Σ_{i=1}^n [log f_{x_i}(y_i|θ_{x_i}) + log h(x_i|a) − log π(G, a, θ)],

and it holds due to the law of large numbers that

Q_n(a, θ|G) → E_{a,θ}[log f_x(y|θ_x) + log h(x|a) − log π(G, a, θ)] (12)

almost surely, where E_{a,θ} denotes the expectation with respect to the distribution conditional on τ_L ≤ b + y ≤ τ_R. For known G, the density function of an observed random pair (x, y) under truncation is precisely f_x(y|θ_x)h(x|a)/π(G, a, θ), such that Eq. (12) in fact represents the Kullback–Leibler divergence with respect to (a, θ) up to a constant additive term. It is well-known that this expression is uniquely maximized at the true parameter values (a, θ) under appropriate regularity assumptions (see Lemma 5.35 in van der Vaart 1998). Now, more generally, assume that some consistent estimator of G is available, i.e. an estimator Ĝ^(n) based on the data D_n such that sup_b |Ĝ^(n)(b) − G(b)| → 0 almost surely for n → ∞. Reconsidering Eq. (4) and Eq. (6) and 0 < π_j(θ_j|G) ≤ 1 for every j = 1, . . . , m, it follows that π(Ĝ^(n), a, θ) − π(G, a, θ) → 0, n → ∞.
Therefore, |Q_n(a, θ|Ĝ^(n)) − Q_n(a, θ|G)| → 0 for any a and θ, and thus Q_n(a, θ|Ĝ^(n)) converges to the same limit as in Eq. (12). It follows that â and θ̂ are consistent under appropriate regularity assumptions, if Ĝ^(n) is a consistent estimator of G.
Assuming asymptotic normality of γ̂ = (â, θ̂) with the variance given by the Fisher information, it is possible to derive asymptotic confidence intervals in a conventional way. To do so, we abbreviate γ = (a, θ) and define the matrix A_n of all negative second partial derivatives of the log-likelihood function,

A_n := (−∂² log L(Ĝ^(n), γ|D_n) / ∂γ_j ∂γ_k)_{j,k}, evaluated at γ = γ̂.

Given that A_n is a regular matrix, the asymptotic variance matrix of γ̂ can be estimated by the inverse V_n := A_n^{−1}. The level (1 − α) confidence interval for the jth component of γ is then obtained as

γ̂_j ± z_{1−α/2} ((V_n)_{jj})^{1/2},

where z_{1−α/2} is the (1 − α/2) quantile of the standard normal distribution.
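The Wald construction can be illustrated in miniature. The sketch below uses a toy exponential model without truncation, where the observed information is available in closed form; in the full model, A_n would collect the second derivatives of log L with respect to all components of (a, θ).

```python
import numpy as np

rng = np.random.default_rng(2)
y = rng.exponential(2.0, size=500)       # toy data, true rate 0.5
theta_hat = 1.0 / np.mean(y)             # MLE of the exponential rate

# log L(theta) = n log(theta) - theta * sum(y); -d^2/dtheta^2 log L = n / theta^2
A_n = len(y) / theta_hat**2              # observed information (1x1 "matrix")
V_n = 1.0 / A_n                          # estimated asymptotic variance
z = 1.959963984540054                    # z_{0.975}
ci = (theta_hat - z * np.sqrt(V_n), theta_hat + z * np.sqrt(V_n))
```

For the semiparametric model, A_n would typically be obtained by numerical differentiation of the log-likelihood at the fitted parameters.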
Similarly, confidence intervals for aspects of the latent distribution, say the mean lifetime E(y|θ_j) in the jth subpopulation, can be defined. To do so, we first generate a large number M of iid draws θ̂_j^1, . . . , θ̂_j^M from the estimated asymptotic distribution of θ̂_j. An estimated level (1 − α) confidence interval of the mean lifetime is then obtained using the α/2 and 1 − α/2 quantiles of {E(y|θ̂_j^1), . . . , E(y|θ̂_j^M)}.
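This Monte Carlo step can be sketched as follows for an exponential lifetime model, where the mean lifetime is 1/θ_j; the values of theta_hat and V_n below are assumed to come from a previous fit and are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
theta_hat, V_n = 0.5, 4e-4               # assumed fit results (illustrative)
M, alpha = 10_000, 0.05

draws = rng.normal(theta_hat, np.sqrt(V_n), size=M)   # theta^1, ..., theta^M
mean_life = 1.0 / draws                               # E(y | theta^k) for each draw
lo, hi = np.quantile(mean_life, [alpha / 2, 1 - alpha / 2])
```

The same recipe applies to any smooth functional of θ_j, e.g. quantiles of the lifetime distribution, by transforming each draw before taking quantiles.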

Numerical implementation
To find the maximum likelihood estimate for given data D_n, we propose a Newton–Raphson-type algorithm. In this procedure, the stopping criterion is defined with respect to a distance function which indicates the average change in an iteration with respect to the different parameter types. We define a distance function d((G̃, ã, θ̃), (G, a, θ)), in which the number n of support points is incorporated into measuring the distance of two distribution functions G̃ and G. The following algorithm is proposed for determining the maximum likelihood estimate.
Step 1 Choose initial values θ_j^0, j = 1, . . . , m, and a_j^0, j = 2, . . . , m.
Step 2 For every i = 1, . . . , n, determine N_i^0 := π_{x_i}(θ_{x_i}^0|b_i)^{−1} and set ΔG_i^0 := N_i^0 / Σ_{k=1}^n N_k^0, which defines G^0.
Step 3 Perform univariate Newton–Raphson steps for maximizing log L(G^0, a, θ|D_n) for fixed G^0 with regard to all components of a and θ until convergence. Denote the result as (a^1, θ^1).
Step 4 Update the hypothetical population numbers as N_i^1 := π_{x_i}(θ_{x_i}^1|b_i)^{−1} and standardize them, ΔG_i^1 := N_i^1 / Σ_{k=1}^n N_k^1, which defines G^1.
Step 5 Perform univariate Newton–Raphson steps with regard to all parameters (G, a, θ), starting from (G^1, a^1, θ^1), until the distance d between successive iterates falls below a threshold ε.
For the initial values θ_j^0, we use the maximum likelihood estimates of θ_j in the model without truncation. The key step in this algorithm is that the initial value G^0 is determined utilizing the initial values θ_j^0 and using Eq. (10). Intuitively, for every i, N_i^0 is the hypothetical number of population units among which the single unit i has been observed. By standardizing these hypothetical population numbers as done in Step 4, the corresponding distribution function is derived. It turns out (see Sect. 4.2) that choosing ΔG^0 in this way remarkably reduces the computational runtime in comparison to alternative optimization strategies.
The damping factor 0 < ω ≤ 1 is used to avoid obtaining invalid parameter values, such as ΔG_i ≤ 0 or a_j ≤ 0 for any j ∈ {1, . . . , m}. Depending on the sample size, ω may need to be chosen close to zero initially. When γ^ν is sufficiently close to the maximum likelihood estimator, the damping factor can be dropped, i.e. ω = 1. Similarly, the Newton–Raphson method in Step 3 of the algorithm MNS may potentially yield inadmissible values a_j for numerical reasons. To avoid this issue, one may need to adjust the optimization to ensure robustness (e.g., using a damping factor).
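The model-driven initialization can be sketched compactly. The function below is our hedged reading of Steps 1-2: compute no-truncation MLEs θ_j^0, weight each observed unit by its inverse selection probability, and standardize to obtain the initial weights ΔG_i^0; exponential lifetimes and 0-based subpopulation labels are assumptions for illustration.

```python
import numpy as np

def initial_dG(b_obs, x_obs, y_obs, tau_L, tau_R):
    """Model-driven initial birth weights DG_i^0 (hedged sketch of MNS Steps 1-2)."""
    # theta_j^0: exponential MLE per subpopulation, ignoring truncation
    # (assumes labels x_obs are coded 0, ..., m-1)
    rates0 = np.array([np.sum(x_obs == j) / np.sum(y_obs[x_obs == j])
                       for j in np.unique(x_obs)])
    F = lambda u, r: 1.0 - np.exp(-r * np.clip(u, 0.0, None))
    r_i = rates0[x_obs]
    # N_i^0: hypothetical number of population units behind each observed unit
    N0 = 1.0 / (F(tau_R - b_obs, r_i) - F(tau_L - b_obs, r_i))
    return N0 / np.sum(N0)   # standardized weights DG_i^0, summing to 1
```

The resulting weights are positive and sum to one by construction, so G^0 is a valid distribution function supported on the observed births.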

Simulation study on validity
We formulate the following simulation setup to study the finite-sample properties of the proposed method (see Table 1). Population units are born in the time interval [s, t] = [0, 8]. In all considered setups, we consider m = 2 subpopulations and set a_2 = 1.5. The birth intensity function of subpopulation 1 is defined as the piecewise constant function λ_1(b) = λ_{11} I(b ≤ (t + s)/2) + λ_{12} I(b > (t + s)/2). Lifetimes are exponentially distributed according to y → f_j(y|θ_j) = θ_j exp(−θ_j y) with parameters θ_1 and θ_2 in the respective subpopulations. For each simulation setup, we generate 1000 random datasets and determine the estimates using the algorithm MNS. The damping factor is set as ω = 1 and the stopping threshold as ε = 10^{−4}.
Following Table 1, comparisons between Setups 1-2 (λ_{11} > λ_{12}) and Setups 3-4 (λ_{11} = λ_{12}) are of primary interest, as these pairs of setups concern the same birth distributions. Setups 5 and 6 differ from Setup 4 only in that τ_R is increased. The probability of observing a random unit ranges from 0.138 to 0.261 in the considered setups. Note also that although the birth distribution is in fact parametric in these simulation setups, it is estimated nonparametrically.
The simulation results (see Table 2) indicate that the proposed method yields valid estimates of the lifetime parameters θ. While the bias of the estimates is notable due to the relatively small sample sizes, the MSEs are adequately small and decrease with growing sample size (see Setup 1 vs. 2 and Setup 3 vs. 4). Comparing Setups 5 and 6 with Setup 4, the MSE is not substantially decreased despite the increased sample size (and increased selection probability). This shows that the structure of the truncation mechanism, including the definition of the data collection period, may influence the precision of estimates in a nontrivial manner, and larger samples do not necessarily correspond to higher precision. Table 3 summarises the performance of â_2 and the comparison to the simple estimate â_2^simple. It turns out that the biases of both estimates are considerable, yet of a similar magnitude. While â_2^simple is a valid estimator of a_2, â_2 is more precise in all considered setups.
Finally, we study the performance of the algorithm MNS and the estimated birth distribution. For the latter purpose, we define the distance ‖G̃ − G‖_∞ := sup_b |G̃(b) − G(b)| for distribution functions G̃ and G. The results displayed in Table 4 demonstrate that the initial values as defined in Step 2 in the algorithm MNS are suitable for optimization. On average, the first Newton–Raphson-type search in Step 3 in the algorithm MNS required about 6 to 8 iterations before reaching the stopping criterion. Considering that the number of parameters in the studied simulation setups is n + 3, requiring on average less than 40 iterations in the final Newton–Raphson-type search in Step 5 in the algorithm MNS seems quite satisfactory. Assessing the estimation of the birth distribution G, the mean distance of Ĝ^(n) to G, denoted as E(‖Ĝ^(n) − G‖_∞), is remarkably small in all considered setups and generally decreases with growing sample size, ceteris paribus. Note also that the distance of G^0 to Ĝ^(n), denoted as E(‖G^0 − Ĝ^(n)‖_∞), is relatively small, again demonstrating that the suggested choice of initial values is appropriate.

Nevertheless, the final Newton–Raphson search in Step 5 of the algorithm MNS, which optimizes with regard to all parameters, is necessary to find the maximum likelihood estimate.

Simulation study on computational performance
We further evaluate the proposed initial parameter choice in the Newton–Raphson algorithm by comparing its performance to reasonable alternatives. A local random search (LRS) is a conceptually simple approach to finding local maxima of the likelihood function (see Zhigljavsky 1991). Given numerically permissible initial values, the algorithm successively finds parameters with higher likelihood value, until a defined termination criterion is met. Reasonable termination criteria are, e.g., that no improvements are found for a defined number of steps or that a preset number of iterations has been performed. The algorithm LRS has the advantages of using computationally inexpensive iterations and of being relatively robust, in that the local step sizes can be chosen conservatively enough to avoid numerically non-permissible parameters. The algorithm LRS does not utilize the local shape of the log-likelihood function, but only requires the function's value at candidate points. See the Appendix for an implementation of the algorithm LRS.
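A generic local random search of this kind can be sketched as follows; the relative step size, patience rule and the toy objective are our illustrative choices, not the exact implementation from the Appendix.

```python
import numpy as np

def lrs(objective, gamma0, step=0.05, max_iter=2000, patience=200, seed=0):
    """Local random search: propose small random relative perturbations of the
    current parameter vector and accept a proposal only if the objective improves."""
    rng = np.random.default_rng(seed)
    gamma = np.asarray(gamma0, dtype=float)
    best = objective(gamma)
    since_improve = 0
    for _ in range(max_iter):
        # multiplicative perturbation keeps positive parameters positive
        cand = gamma * (1.0 + step * rng.uniform(-1.0, 1.0, size=gamma.shape))
        val = objective(cand)
        if val > best:
            gamma, best, since_improve = cand, val, 0
        else:
            since_improve += 1
            if since_improve >= patience:   # stop after `patience` steps without improvement
                break
    return gamma, best

# usage: maximize a toy concave objective with maximum at (1, 1)
g, v = lrs(lambda g: -np.sum((g - 1.0) ** 2), np.array([0.5, 2.0]))
```

Since only objective values are needed, the same routine applies directly to the log-likelihood in Eq. (8) with γ collecting (ΔG, a, θ), provided the constraints are respected by the proposal step.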
In addition, we consider a Newton-Raphson-type algorithm, which we call 'Naive Newton-Raphson Search' (NNS, see Appendix). This algorithm addresses a drawback of the algorithm LRS using the local shape of the log-likelihood function for updating parameter values in each iteration. The algorithm NNS differs from the algorithm MNS in that the initial birth distribution is chosen naively, i.e. without utilizing the IPW relation expressed in Eq. (10).
For each algorithm and iteration, the distance of the current log-likelihood value to the maximal value is calculated. For the algorithms NNS and MNS, 10 iterations each have been performed, which took roughly the same runtime. Since an iteration in the algorithm LRS is computationally much cheaper than for the algorithms NNS and MNS, the number of iterations of the algorithm LRS has been adjusted to account for this difference. Therefore, the runtime of all three algorithms has been approximately the same in each setup. The initial values are chosen as θ_j^0 = n_j / Σ_{i: x_i = j} y_i for each j ∈ {1, 2}, which would be the subpopulation-specific maximum likelihood estimate of θ_j if the data were not subject to truncation, and a_2^0 = 1. Figure 3 shows that the three considered algorithms form a clear order with regard to performance. Due to having the same initial values, the algorithms LRS and NNS start with the same log-likelihood value. The algorithm LRS quickly improves over the algorithm NNS, and then progresses rather slowly towards the maximum. The algorithm MNS clearly has the best performance of the three considered algorithms in the considered setups. Interestingly, the initial value of the algorithm MNS has a higher log-likelihood value than the algorithms NNS or LRS achieve within the considered runtime. Moreover, after a few steps, the algorithm MNS essentially reaches the maximum of the log-likelihood. This simulation demonstrates that a model-driven initial choice of the parameters ΔG_i can remarkably reduce the runtime of the numerical optimization.

Application
Personal insolvency occurs when a person is unable to pay their debts to other entities. In Germany, personal insolvencies are reported publicly due to legal regulations. Reports on the individual insolvencies are available through a public website for 14 days, before public access is removed again. The sample, which results from continually tracking such public reports, includes information on all persons who become insolvent during the timespan of data collection, and is thus subject to the described double truncation framework. From March 2018 to December 2018, n = 73,370 units that were born between 1920 and 2000 have been sampled (see Table 5 for basic summary statistics).
Individual information includes the date of birth b_i, the age at insolvency y_i (adjusted to biological age 18, when legal age is reached in Germany) and the federal state x_i in which the insolvency is reported. There are m = 16 federal states, and they constitute the subpopulations in this analysis. For the lifetimes in any subpopulation j ∈ {1, . . . , 16}, we assume a Gamma distribution Ga(α_j, β_j), defined by the density

f_j(y|α_j, β_j) = (β_j^{α_j} / Γ(α_j)) y^{α_j − 1} exp(−β_j y), y > 0.

In this notation, θ_j = (α_j, β_j) and θ = ((α_1, β_1), . . . , (α_16, β_16)). The mean lifetime in the jth subpopulation is equal to α_j/β_j. As stated in Assumptions (A)-(E), we implicitly assume that births occur independently across the m = 16 federal states, and that the corresponding intensity functions are proportional. We have examined this assumption by checking the observed empirical birth distributions for each state in the dataset, and while the impact of the lifetime distributions is neglected by this inspection, we found no apparent reason to discard the proportionality assumption. Note that Assumption (D) is fulfilled by the procedure by which the data have been collected. Since there is presumably a substantial percentage of people who never experience insolvency (i.e. 'immortal' units), the population of interest is implicitly defined as the population of all people in Germany who eventually experience insolvency. Therefore, the birth distribution estimated under the truncation setup does not necessarily represent the actual birth distribution known for Germany. The model altogether contains 73,417 parameters. We apply the algorithm MNS to determine the maximum likelihood estimates.
As initial values in Step 1 of the algorithm MNS, we choose a_j^0 = n_j/n_1. Reasonable initial values (α_j^0, β_j^0) were obtained by first fitting the model to a random subsample of size 2000. We choose ω = 0.4 as the damping factor. The resulting estimates are displayed in Table 6. Figure 4 summarizes the estimates of the proportion factors and the mean age at insolvency for each subpopulation. The results indicate distinct differences between the federal states, with the estimated proportion factors generally reflecting the proportions of the federal state sizes known for Germany (numerical values not displayed).
Similarly, the estimated mean age at insolvency varies across federal states, ranging from 56.4 in Saxony-Anhalt to 70.6 in Bavaria. In comparison to the observed mean ages at insolvency (see Table 5) in the sample, the estimates are considerably larger. Without accounting for the truncation mechanism, the mean ages are systematically  underestimated. We note that larger subsample sizes correlate with smaller confidence intervals for the mean biological age at insolvency in the respective subpopulations, which is not unexpected. Figure 5 indicates that the truncation mechanism substantially distorts the birth distribution, as the empirical distribution function differs significantly from the estimate. In particular, more recent births between 1985 and 2000 seem to be overrepresented in the sample in comparison to the population of interest, while people born between 1940 and 1980 are underrepresented.

Discussion
We have focused on a joint model for the birth distribution and the lifetime distribution for heterogeneous survival data under double truncation. In doing so, we have derived maximum likelihood estimates for all involved parameters.
The situation is cast as a high-dimensional problem, where the number of parameters increases linearly with sample size. The main advantage is that the model can be treated with well-known concepts for parametric models, and implemented in a standard fashion. One substantial drawback of the large number of parameters is that the optimization method for finding the maximum likelihood estimates needs to be carefully adjusted.
Concerning the implementation of the model, we have demonstrated that care in the choice of initial values in a Newton–Raphson-type algorithm is important. Using a generic choice is not ideal and can lead to impractical runtime. Instead, we propose model-driven initial values, which greatly reduce the computational runtime. In our calculations, we have found that with this approach, the model can be fitted even to medium-sized datasets such as the one considered in the application (n > 50,000, m > 10) within reasonable runtime. However, we note that further improvements on the optimization method are certainly possible and worth studying. This also concerns the choice of the damping factor ω, which has been introduced to avoid obtaining invalid parameter values within the optimization algorithm. We have not included a general guideline on how to properly choose ω, because it may depend on many variables such as the sample size, the setting of the birth interval [s, t], the data collection period [τ_L, τ_R] and the parametric families used for modeling the lifetimes.
Similarly, alternative optimization methods such as the algorithm LRS discussed in Sect. 4, though being theoretically basic, can be valuable in certain situations, particularly when initial values of the Newton-Raphson methods are too far from the optimum. Owing to the relatively robust search strategy, the algorithm LRS can be used as a preparation for more sophisticated optimization methods such as the algorithm MNS.
All three considered algorithms require initial values θ_j^0, j = 1, . . . , m, for which we suggest using the conventional maximum likelihood estimate in the absence of truncation. Though it has not been examined further in the simulation study, this choice has generally led to surprisingly satisfactory results. Alternative initial values might yield better numerical performance under certain circumstances.
We have sketched how the asymptotic behaviour of the estimates a and θ can be studied for given consistent estimator G (n) . It remains to prove that the proposed nonparametric estimator G (n) is in fact consistent.
As in the algorithm MNS, we use the maximum likelihood estimates of θ_j in the model without truncation for the initial values θ_j^0. In Step 2, the parameters q_G, q_a, q_θ ∈ (0, 1) define the maximum relative step size for each parameter and iteration. An efficient choice of these values ensures a sufficiently large probability of finding parameter values with increased log-likelihood value, and a sufficiently large average step size to enable a feasible runtime by avoiding unnecessarily many iterations. An optimal choice generally depends on the setup, in particular the sample size and the considered lifetime distributions. In the studied simulation setup, we have found that q_G = n^{−1}, q_a = 0.05 and q_θ = 0.025 is a suitable practical choice for small to moderate sample sizes.
The damping factor ω is incorporated as in the algorithm MNS. For sample sizes n ≈ 1500, a damping factor of ω = 0.25 seemed suitable for both avoiding invalid parameter values and still achieving satisfactory runtime of the algorithm. Note that in Step 2, the current parameter γ is adjusted in each univariate Newton–Raphson step.