1 Introduction

Poor sample selection is a frequent basis for objections to the inferential quality of data. Hospital controls may be a negative selection, a student sample a positive one. Sampling from soldiers is selective because a body-height threshold truncates smaller recruits. Inference from the status quo of a loan portfolio can take into account the fact that earlier loan applications with too small a score had been rejected (see Bücker et al. 2013). Here we study de-selection on the basis of age being either too low or too high. An age is the duration between two events, denoted as “birth” and “death”, and Fig. 1 (left) shows the three possible situations.

Fig. 1

Left: Three cases of the date of the 1st event (black bullet) and the date of the 2nd event (white circle): observed (solid) and truncated (dashed) durations. Right: Sets in the co-domain of \((X_i,T_i)'\) or \((\widetilde{X}_j, \widetilde{T}_j)'\) used in Lemma 1 (and proof): example \(x \ge s\). (Explanation of panels and symbols is distributed over larger parts of the text.)

We assume an Exponential distribution for the latent duration \(\widetilde{X}_j\), observed or truncated, and estimate its parameter \(\theta _0\). Our three applications will be the lifetime of a company (in Germany), the duration of a marriage (in the city of Rostock), and the waiting time, after the 50th birthday, until dementia onset (in Germany).

The parameter of an Exponential distribution is closely linked to the probability of the second event happening within one time unit, one year in all of our applications. In essence, one would like to estimate such an event probability by dividing the number of events (over a certain period) by the number of units at risk (at the beginning of the period), but this is prohibited by the missing denominator. We circumvent the missing data by working with the conditional distribution of the duration.

We distinguish, as three statistical masses, the population as all units with a first event in a period (of length G), the latent simple random sample and, after truncation, the observed data.

One can of course ask whether such simple random latent samples exist at all in practice. In survival analysis, the assumption of durations as independent identically distributed random variables can be defended, because independence and randomness are attributable to an unforeseeable staggered entry (see e.g. Weißbach and Walter 2010). Even more specifically, in labour economics, it has been validated theoretically that market friction renders the entry of an employee into a new occupation random, and hence also the duration until the next occupation.

Truncation is known to introduce a selection bias, referring to the comparison of two models: the estimate of the correct model compared to the estimate from erroneously modelling the observed data as a simple random sample (srs-design). (We will later distinguish the selection bias from the statistical bias, the latter referring to only one model, namely comparing an estimate with the true parameter.) More important for us is that truncation is suspected to increase the standard error, due to dependence in the observed data (see Adjoudj and Tatachak 2019), and we are interested in the extent to which truncation hinders statistical inference in terms of large-sample properties.

As an early reference, Cox and Hinkley (1974) in their Example 2.25 consider the size of the truncated sample as an ancillary statistic, not acknowledging the size of the latent sample, n, as a parameter. The size of the truncated sample was subsequently considered again as random in Woodroofe (1985), and conditioning was used to prove consistency. Neighbouring contemporaneous work on truncation in survival analysis, mostly semi- and non-parametric, includes Shen (2010), Moreira and de Uña-Álvarez (2010), Weißbach et al. (2013), Emura et al. (2015, 2017), Frank et al. (2019) and Dörre (2020).

Here, we derive the maximum likelihood estimator (MLE) of n and \(\theta _0\) by representing the observed data as a truncated empirical process. We derive the likelihood with standard results for empirical processes (see e.g. Reiss 1993). The size of the data m will be shown to be such a process, seen as a point measure, evaluated at a certain set S. To the best of our knowledge the model is the first example of an exponential family with the space of point measures being the sample space.

2 Model and result

Before presenting the estimator and its asymptotic distribution, the data need to be described.

2.1 Sample selection

The unit j of the latent sample carries, as a second measure besides its lifetime \(\widetilde{X}_j\) (\(\in \mathbb {R}_0^+\)), its birthdate (a calendar time). We, equivalently, measure the birthdate backwards from a specific time point (equal for all units of the latent sample) and denote it as \(\widetilde{T}_j\). We use the calendar date when our study period begins as this time point, so that \(\widetilde{T}_j\) has the interpretation of being the “age when the study begins”. We consider as population units born within a pre-defined time window going back G time units from the study beginning, so that \(\widetilde{T}_j \in [0,G]\) (see Fig. 1 (left)). Define \(S:=\mathbb {R}_0^+ \times [0,G]\), with \(0<G < \infty \), the space for one outcome, and let it generate the \(\sigma \)-field \(\mathcal {B}\). In contrast to the example of the soldiers, who are all truncated at the same height threshold, in our survival-analytic applications each unit is truncated at a different age. As illustrated in Fig. 1 (left), all units are truncated at the same time, when the study begins, but the time interval of observation truncates each unit j at a different, too low or too high, age. Because \(\widetilde{T}_j\) is the (shifted) birth date, assuming a time-homogeneous Poisson process as birth process renders the distribution of \(\widetilde{T}_j\) Uniform (see Dörre 2020, Lemma 2). Let us collect the following notation and assumptions:

  1. (A1)

    Let \(\Theta :=[ \varepsilon , 1 / \varepsilon ]\) for some “small” \(\varepsilon \in ]0,1[\).

  2. (A2)

    Let, for \(\theta _0\) an interior point of \(\Theta \), \(\widetilde{X}_j \sim Exp(\theta _0)\), i.e. with density \(f_E(\cdot /\theta _0)\) and CDF \(F_E(\cdot /\theta _0)\) of the Exponential distribution. Let \(\widetilde{T}_j \sim Uni[0,G]\), with density \(f^{\widetilde{T}}\) and CDF \(F^{\widetilde{T}}\) of the Uniform distribution.

  3. (A3)

    \(\widetilde{X}_j\) and \(\widetilde{T}_j\) are stochastically independent.

  4. (A4)

    For known constant \(s>0\), column vector \((\tilde{X}_j,\tilde{T}_j)'\) is observed if it is in

    $$\begin{aligned} D:=\{ (x,t)' | 0 < t \le x \le t + s, t \le G \}. \end{aligned}$$

Assumption (A4) formalises that a sample unit is only observed when its second event falls into the observation period (of length s). For instance, in one of the applications, we will know the age-at-insolvency, i.e. the duration until insolvency, only for those companies that filed for insolvency within the \(s=3\) years 2014–2016. The parallelogram D is depicted in Fig. 1 (right). Following up on (A4), we denote an observation by \((X_i,T_i)'\), \(i=1, \ldots , m \le n\).

The paper assumes a simple random sample for \((\widetilde{X}_j,\widetilde{T}_j)'\), \(j=1,\ldots ,n, n \in \mathbb {N}\), i.e. \(\text{ i.i.d. }\) random variables (r.v.) mapping from the probability space \((\Omega ,\mathcal {A},P)\) onto the measurable space \((S, \mathcal {B})\).
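To fix ideas computationally, the following minimal sketch (ours, not part of the original study; numpy is assumed, and the values \(\theta _0=0.05\), \(G=24\) and \(s=3\) are merely illustrative, taken from the later simulations and the insolvency application) draws a latent simple random sample under (A1)–(A3) and applies the truncation rule (A4) defining D.

```python
import numpy as np

def draw_truncated_sample(n, theta0, G, s, rng):
    """Draw n latent pairs (X ~ Exp(theta0), T ~ Uni[0,G]) and keep those in D,
    i.e. with T <= X <= T + s (Assumptions (A1)-(A4))."""
    x_tilde = rng.exponential(scale=1.0 / theta0, size=n)   # latent durations
    t_tilde = rng.uniform(0.0, G, size=n)                    # age at study begin
    observed = (t_tilde <= x_tilde) & (x_tilde <= t_tilde + s)
    return x_tilde[observed], t_tilde[observed]              # (X_i, T_i), i = 1..m

rng = np.random.default_rng(1)
x, t = draw_truncated_sample(n=10_000, theta0=0.05, G=24.0, s=3.0, rng=rng)
print(f"latent n = 10000, observed m = {len(x)}")
```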

Define now for \(\theta \in \Theta \)

$$\begin{aligned} \alpha _{\theta } := \frac{(1- e^{-\theta s})(1- e^{-G \theta })}{G \theta } \end{aligned}$$
(1)

and note that for \(\theta =\theta _0\), by Fig. 1 (right), Fubini’s theorem and the substitution rule, it is \(P\{\widetilde{T}_j \le \widetilde{X}_j \le \widetilde{T}_j +s\}\), i.e. the selection probability of the jth individual. The numerator is, due to \(\theta _0,s,G >0\), strictly positive and, as is to be expected, with a larger observation interval, i.e. increasing s, the selection becomes more likely. Additionally, for larger \(\theta _0\) (or smaller expected duration), the denominator eventually increases faster than the numerator, so that the selection becomes less likely; a shorter duration will often not reach the observation interval. Seen as a function of G, \(\alpha _{\theta _0}\) is monotonically decreasing, with almost the same interpretation.
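For illustration (a sketch with values of s, G and \(\theta \) taken from the later simulations; the function name alpha is ours), the selection probability (1) can be tabulated to confirm that it increases in s and decreases in G:

```python
import numpy as np

def alpha(theta, s, G):
    """Selection probability (1)."""
    return (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)) / (G * theta)

theta0 = 0.05
print([round(alpha(theta0, s, 24.0), 4) for s in (1.0, 2.0, 3.0, 48.0)])  # increasing in s
print([round(alpha(theta0, 3.0, G), 4) for G in (24.0, 48.0)])            # decreasing in G
```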

The selection probability will occur in the likelihood, so that for maximisation, its first derivative will be needed. The second derivative of \(\alpha _{\theta }\) (with now variable \(\theta \)) will be needed for proving the asymptotic normality and thus calculating the standard error. The proof is elementary and omitted here.

Corollary 1

With Assumptions (A1)–(A4) the first and second derivatives of (1) in \(\theta \) are:

$$\begin{aligned} \dot{\alpha }_{\theta }= & {} \frac{ \theta s e^{-\theta s} (1- e^{-G \theta }) + \theta (1- e^{-\theta s}) \, G \, e^{-G \theta } - (1- e^{-\theta s})(1- e^{-G \theta })}{G \theta ^2} \\ \ddot{\alpha }_{\theta }= & {} e^{- \theta s} \left( - \frac{2 s}{G \theta ^2} - \frac{s^2}{G \theta } - \frac{2}{G \theta ^3} \right) + e^{- G \theta } \left( -\frac{2}{\theta ^2} - \frac{G}{\theta } - \frac{2}{G \theta ^3} \right) \\&+ e^{- (G+s) \theta } \left( \frac{2 s +G}{G \theta ^2} + \frac{(G+s)s}{G \theta } + \frac{1}{\theta ^2} + \frac{G+s}{\theta } + \frac{2}{G \theta ^3} \right) - \frac{2}{G \theta ^3} \end{aligned}$$
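As a quick numerical plausibility check (our sketch, not part of the proof), the closed form of \(\dot{\alpha }_{\theta }\) can be compared with a central difference quotient of (1); \(\ddot{\alpha }_{\theta }\) can be checked analogously.

```python
import numpy as np

def alpha(theta, s, G):
    """Selection probability (1)."""
    return (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)) / (G * theta)

def alpha_dot(theta, s, G):
    """First derivative of (1), Corollary 1."""
    num = (theta * s * np.exp(-theta * s) * (1 - np.exp(-G * theta))
           + theta * G * np.exp(-G * theta) * (1 - np.exp(-theta * s))
           - (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)))
    return num / (G * theta ** 2)

s, G, h = 3.0, 24.0, 1e-6
for theta in (0.01, 0.05, 0.1, 0.5):
    fd = (alpha(theta + h, s, G) - alpha(theta - h, s, G)) / (2 * h)  # central difference
    print(theta, round(alpha_dot(theta, s, G), 8), round(fd, 8))
```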

Obviously, the distribution of \((\widetilde{X}_j,\widetilde{T}_j)'\), conditional on being observed, will become important for deriving the likelihood.

Definition 1

Let \((X_1,T_1)'\), \((X_2,T_2)'\), \((X_3,T_3)'\), ..., be independent and identically distributed with CDF

$$\begin{aligned} F^{X,T}(x,t)= P\left\{ \widetilde{X}_j \le x,\widetilde{T}_j \le t|\widetilde{T}_j \le \widetilde{X}_j \le \widetilde{T}_j + s\right\} . \end{aligned}$$

In more detail, the distributions of \((X_i,T_i)'\) and \(X_i\) will be needed, on the one hand, later, to define the precise stochastic description of the data, i.e. of the truncated sample as a truncated empirical process. On the other hand, we already need the distribution (and also moments) here, to study the consistency and asymptotic normality of the maximum likelihood estimator. The proofs of Lemma 1 and Corollary 2 are elementary (and omitted), but it is useful to define sets (see Fig. 1 (right)):

$$\begin{aligned} \begin{aligned} E_1&:= [0,x] \times [0,t] \cap D, \; E_3 := \text {triangle spanned by points} \; (0,0)',(0,t)',(t,t)', \\ E_2&:= \text {triangle spanned by points} \; (s,0)',(x,0)',(x,x-s)' \quad (\text {if } x \ge s, \text { else } \emptyset ) \end{aligned} \end{aligned}$$

Lemma 1

With Definition 1 and under Assumptions (A1)–(A4) it holds, for \((x,t)' \in D\), \(\alpha _{\theta _0} F^{X,T}(x,t) = (1- e^{- \theta _0 x}) t/G - R(x,t)\), with \(\frac{\partial ^2}{\partial x \partial t} R(x,t) =0\).

Corollary 2

With Definition 1 and under Assumptions (A1)–(A4):

  1. (i)

    For \((x,t)' \in D\) it holds \(f^{X,T}(x,t)= \frac{\theta _0}{G \alpha _{\theta _0}} e^{- \theta _0 x}\) (outside D it is zero).

  2. (ii)

    The marginal density of X for \(x \in [0,G+s]\) is

    $$\begin{aligned} f^X(x)= & {} \frac{\theta _0}{G \alpha _{\theta _0}} e^{- \theta _0 x} \left( \mathbbm {1}_{[0,s]}(x) x + \mathbbm {1}_{]s,G]}(x) s + \mathbbm {1}_{]G,G+s]}(x) (G-x+s) \right) . \end{aligned}$$
  3. (iii)

    For the expectation of \(X_i\) it holds

    $$\begin{aligned} \alpha _{\theta _0} E_{\theta _0}(X_i)&=A(s,G,\theta _0) e^{-\theta _0 s} + B(s,G,\theta _0) e^{-\theta _0 G} \\&\quad + C(s,G,\theta _0) e^{-\theta _0 (G+s)} + \frac{2}{G\theta _0^2}, \end{aligned}$$

    with \(A(s,G,\theta _0) := -\frac{s}{G\theta _0}- \frac{2}{G\theta _0^2}, B(s,G,\theta _0) := - \frac{1}{\theta _0} - \frac{2}{G \theta _0^2}\) and \(C(s,G,\theta _0) := \frac{G+s}{G \theta _0} + \frac{2}{G \theta _0^2}\).

  4. (iv)

    For the variance of \(X_i\) note that

    $$\begin{aligned} E_{\theta _0}(X^2_i)&=A^q(s,G,\theta _0) e^{-\theta _0 s} + B^q(s,G,\theta _0) e^{-\theta _0 G} \\&\quad + C^q(s,G,\theta _0) e^{-\theta _0 (G+s)} + \frac{1}{4\theta _0^3} \end{aligned}$$

    with \(A^q(s,G,\theta _0)= \frac{-s^2}{G \theta _0} - \frac{s}{6 \theta _0^2} - \frac{1}{4 \theta _0^3}\), \(B^q(s,G,\theta _0)=\frac{-G}{\theta _0} - \frac{4}{\theta _0^2} - \frac{1}{4 \theta _0^3}\) and \(C^q(s,G,\theta _0)= \frac{(G+s)^2}{G \theta _0} + \frac{G+s}{6 \theta _0^2} + \frac{1}{4 \theta _0^3}\).

We are now in the position to formulate the likelihood, maximise it and apply large sample theory.

2.2 Estimator and confidence interval

Similar to \(P\{A \cap B\}=P\{A|B\}P\{B\}\) and with detailed definitions following, we decompose the density of the observations and the random sample size, i.e. the likelihood \(\ell \), into the product of the conditional density of the data—conditional on observation—and the distribution of the observation count. If the observations—conditional on having been observed—are independent, the first factor of such product, again, is a product, namely over the conditional densities of each observation.

W.r.t. the second of these factors, note that the size of the observed sample has a Binomial distribution. We can approximate it by a Poisson distribution when—as is usually argued with the probability generating function—the selection probability \(\alpha _{\theta }\) for each of the n i.i.d. latent Bernoulli experiments is small. This is the case when the width of the observation period (of length s) is “short” relative to the population period (of length G), as will be true for our applications. The description so far motivates

$$\begin{aligned} \ell (m, \sum _{i=1}^m x_i;\theta ,n) \approx \theta ^m \exp \left( -\theta \sum _{i=1}^m x_i\right) n^m \exp (- n \alpha _{\theta }), \end{aligned}$$
(2)

where we already use the “generic” parameter \(\theta \), as will be explained at the end of Sect. 3. The conditionally independent and Exponentially distributed observed durations \(X_i\) cause the first two factors in (2). The last two factors appear in the Poisson distribution of the observed sample size with parameter \(n \alpha _{\theta }\). Details of the likelihood construction will need a formulation of the data as a truncated empirical process and will be given in Sect. 3 (and in Theorem 3). The main point is that it is not necessary to formulate the conditional independence as a further assumption; it follows from the simple random sample assumption for the \((\widetilde{X}_j,\widetilde{T}_j)'\) and Assumptions (A1)–(A4). At first reading, Sect. 3 may be omitted without lack of coherence.
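The quality of the Poisson approximation for the sample size can be inspected directly. The following sketch (ours; scipy is assumed, and the values of n, \(\theta _0\), s and G are merely illustrative) compares the exact Binomial distribution of the number of observed units with its Poisson approximation in total variation distance.

```python
import numpy as np
from scipy import stats

def alpha(theta, s, G):
    """Selection probability (1)."""
    return (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)) / (G * theta)

n, theta0, s, G = 10_000, 0.01, 3.0, 24.0
a = alpha(theta0, s, G)
k = np.arange(n + 1)
pmf_binom = stats.binom.pmf(k, n, a)        # exact distribution of the sample size
pmf_pois = stats.poisson.pmf(k, n * a)      # approximation underlying (2)
tv = 0.5 * np.abs(pmf_binom - pmf_pois).sum()
print(f"alpha = {a:.4f}, n*alpha = {n * a:.1f}, total variation distance = {tv:.4f}")
```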

As a side remark, by inspection of (2), and as is long known for randomly left-truncated durations, the likelihood does not include the (observed) \(t_i\), but it does include the (unobserved) n. Accordingly n, which has not been a parameter in the model (A1)–(A3), becomes a parameter after adding (A4).

As n is unknown in likelihood (2) (and equally in its rigorous counterpart to follow in Theorem 3), we obtain the approximate MLE for \((n,\theta _0)\) and use the \(\theta \)-coordinate of the bivariate zero as \(\hat{\theta }\). The logarithm of the likelihood has the derivative

$$\begin{aligned} \frac{\partial }{\partial \theta } \log \ell \left( m, \sum _{i=1}^m x_i; \theta ,n\right) = \frac{m}{\theta } - \sum _{i=1}^m x_i - n \dot{\alpha }_{\theta }. \end{aligned}$$
(3)

Solving the bivariate equation for \(n \in \mathbb {R}^+\) results in \(m/\alpha _{\theta }\). In order to facilitate the proofs later on, we formulate the estimation as a minimization problem, and in detail as a minimization of an average. Define

$$\begin{aligned} \psi _{\theta }({\tilde{x}}_j,{\tilde{t}}_j)&:= i_j \left( {\tilde{x}}_j - \frac{1}{\theta } + \frac{\dot{\alpha }_{\theta }}{\alpha _{\theta }} \right) = i_j \left( {\tilde{x}}_j - \frac{1}{\theta } \right. \\&\quad + \left. \frac{ \theta s e^{-\theta s} (1- e^{-G \theta }) + \theta (1- e^{-\theta s}) \, G \, e^{-G \theta } - (1- e^{-\theta s})(1- e^{-G \theta })}{\theta (1- e^{-\theta s})(1- e^{-G \theta })} \right) , \end{aligned}$$
(4)

with \(i_j\) as a realization of \(I_j := \mathbbm {1}_{[\tilde{T}_j,{\tilde{T}}_j + s]}({\tilde{X}}_j)\).

The derivative of the log-likelihood is now obviously related to (see van der Vaart 1998, Sect. 5)

$$\begin{aligned} \Psi _n(\theta ):= \frac{1}{n} \sum _{j=1}^n \psi _{\theta }(\widetilde{X}_j,\widetilde{T}_j). \end{aligned}$$
(5)

The function is not observable, but it becomes observable after multiplication by n and hence its zero, \(\hat{\theta }\), is observable.
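As a computational aside (a sketch under our own naming, not the authors' code): by (3) and (4), the observable function is \(n \Psi _n(\theta ) = \sum _{i=1}^m x_i - m/\theta + m \, \dot{\alpha }_{\theta }/\alpha _{\theta }\), so its zero can be located with a standard bracketing routine on \(\Theta =[\varepsilon , 1/\varepsilon ]\); if no sign change occurs, the respective boundary is returned, in line with the boundary convention formalised in the next paragraph.

```python
import numpy as np
from scipy.optimize import brentq

def alpha(theta, s, G):
    return (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)) / (G * theta)

def alpha_dot(theta, s, G):
    num = (theta * s * np.exp(-theta * s) * (1 - np.exp(-G * theta))
           + theta * G * np.exp(-G * theta) * (1 - np.exp(-theta * s))
           - (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)))
    return num / (G * theta ** 2)

def n_psi(theta, x_obs, s, G):
    """n * Psi_n(theta), observable: sum x_i - m/theta + m * alpha_dot/alpha."""
    m = len(x_obs)
    return x_obs.sum() - m / theta + m * alpha_dot(theta, s, G) / alpha(theta, s, G)

def theta_hat(x_obs, s, G, eps=1e-3):
    lo, hi = eps, 1.0 / eps
    if n_psi(lo, x_obs, s, G) > 0:   # criterion positive on all of Theta
        return lo
    if n_psi(hi, x_obs, s, G) < 0:   # criterion negative on all of Theta
        return hi
    return brentq(n_psi, lo, hi, args=(x_obs, s, G))

# illustrative use with synthetic truncated data (theta0 = 0.05, G = 24, s = 3)
rng = np.random.default_rng(2)
xt = rng.exponential(20.0, size=200_000); tt = rng.uniform(0.0, 24.0, size=200_000)
keep = (tt <= xt) & (xt <= tt + 3.0)
print(round(theta_hat(xt[keep], s=3.0, G=24.0), 4))   # close to 1/20 = 0.05
```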

In order to account for boundary maxima, define the MLE \(\hat{\theta }\) now as the zero of \(\Psi _n(\theta )\) if it exists in (the open) \(\Theta \), as \(\varepsilon \) if \(\Psi _n(\theta )>0\), respectively as \(1/\varepsilon \) if \(\Psi _n(\theta )<0\), both for all \(\theta \in \Theta \). The following analytical properties (with proof in Appendix A) will be needed to prove the consistency and asymptotic normality of \(\hat{\theta }\).

Lemma 2

Under the Assumptions (A1)–(A4) it is

  1. (i)

    \(\psi _{\theta }(\tilde{x}_j,\tilde{t}_j)\) twice continuously differentiable in \(\theta \) for every \(({\tilde{x}}_j, {\tilde{t}}_j)'\),

  2. (ii)

    for \(({\tilde{x}}_j, {\tilde{t}}_j)' \in D\)

    $$\begin{aligned} \dot{\psi }_{\theta }(\tilde{x}_j,\tilde{t}_j)= i_j \left( \frac{2}{\theta ^2} - \frac{s^2 e^{-\theta s}}{(1- e^{- \theta s})^2} - \frac{G^2 e^{-G \theta }}{(1- e^{- G \, \theta })^2} \right) > 0, \end{aligned}$$
    (6)
  3. (iii)

    \(E_{\theta _0}[\psi _{\theta }(\widetilde{X}_j,\widetilde{T}_j)] = \alpha _{\theta _0} E_{\theta _0}(X_i) - \frac{\alpha _{\theta _0}}{\theta } + \frac{\alpha _{\theta _0}\dot{\alpha }_{\theta }}{\alpha _{\theta }} =:\Psi (\theta )\),

  4. (iv)

    \(E_{\theta _0}[\psi _{\theta _0}(\widetilde{X}_j,\widetilde{T}_j)]=\Psi (\theta _0)=0\) and

  5. (v)

    \(\Psi _n(\hat{\theta }) {\mathop {\rightarrow }\limits ^{p}} 0\).

As a comparison, we consider the naïve approach of assuming already for the observed data \(X_1, \ldots , X_m {\mathop {\sim }\limits ^{iid}}Exp(\theta _0)\). This is all the more tempting, as a population definition then seems redundant. Theoretically, under the srs-assumption, the derivative of the log-likelihood—multiplied by minus one—has summands

$$\begin{aligned} \psi ^{srs}_{\theta }(x_i)=x_i - \frac{1}{\theta }, \end{aligned}$$
(7)

being similar to the first two summands of (4) if \(i_j=1\). An interpretation of (ii) in Lemma 2 is now the srs-design as the limit, in the sense that, if \(i_j=1\), it is \(\lim _{s \rightarrow \infty } \lim _{G \rightarrow \infty } \dot{\psi }_{\theta }(\tilde{x}_j,\tilde{t}_j) = \dot{\psi }^{srs}_{\theta }(x_i)\). Condition (v) is a tribute to boundary maxima: \(\Psi _n(\theta )\) has no zero in \(\Theta \) in case of a too high or too low “location” of \(\Psi _n\), in combination with a too small amplitude over the parameter space, meaning \(\Psi _n(1/\varepsilon ) - \Psi _n(\varepsilon )\). As \(\varepsilon \) can be chosen arbitrarily small, the amplitude depends on the limiting behaviour of \(\Psi _n\) towards the boundaries of \(\mathbb {R}^+\), on the left for \(\theta \rightarrow 0\) and on the right for \(\theta \rightarrow \infty \). Towards the left border, consider Taylor expansions for the numerator and denominator of \(\psi _{\theta }({\tilde{x}}_j, \tilde{t}_j)/i_j - {\tilde{x}}_j\) to show that the first two derivatives, using l’Hôpital’s rule for \(\theta \rightarrow 0\), are zero, but the third is not. The resulting finite limit is

$$\begin{aligned} \lim _{\theta \rightarrow 0} \left( \frac{\psi _{\theta }({\tilde{x}}_j,{\tilde{t}}_j)}{i_j} - {\tilde{x}}_j \right) = - \frac{G+s}{2}. \end{aligned}$$

Following up, note that

$$\begin{aligned} \lim _{\theta \rightarrow 0} \Psi _n(\theta ) = \frac{1}{n} \sum _{i=1}^{M} \left( X_i - \frac{G+s}{2} \right) {\mathop {\rightarrow }\limits ^{p}} \alpha _{\theta _0} \left( E_{\theta _0}(X_i) - \frac{G+s}{2} \right) \end{aligned}$$

(8)

(see Definition 2 and Proof to Lemma 2(iii)). Note further that \(\alpha _{\theta _0} E_{\theta _0}(X_i)\) is available in closed form from Corollary 2(iii), and \(\alpha _{\theta _0} > 0\) (see (1)).

Compare with \(\lim _{\theta \rightarrow 0} \psi ^{srs}_{\theta }(x_i) = - \infty \), to see that the reduced amplitude implies less information under truncation, due to the obviously reduced slope also at \(\theta _0\).

By contrast, on the right border, the limiting behaviour for \(\theta \rightarrow \infty \) is not affected by the change in design. To see when \(\psi _{1/\varepsilon }({\tilde{x}}_j,{\tilde{t}}_j)>0\), note that \(\lim _{\theta \rightarrow \infty }\left( \psi _{\theta }({\tilde{x}}_j,{\tilde{t}}_j)/i_j - {\tilde{x}}_j\right) =0\), using l’Hôpital’s rule once. For the srs-design, the limit is the same and finite, showing that a boundary maximum can occur when the observed durations are small, i.e. when \(\theta _0\) is large (compared to n). We will continue the comparison of designs in the Monte Carlo simulations and applications of Sects. 4 and 5.

Theorem 1

Under Assumptions (A1)–(A4) and for \(\theta _0 \in ]\varepsilon ,1/\varepsilon [\), it holds that \(\hat{\theta } {\mathop {\rightarrow }\limits ^{p}} \theta _0\).

Proof

Apply van der Vaart (1998), Lemma 5.10. \(]\varepsilon ,1/\varepsilon [\) is a subset of the real line, \(\Psi _n\) is a random function and \(\Psi \) a fixed function, both in \(\theta \). It is \(\Psi _n(\theta ) {\mathop {\rightarrow }\limits ^{p}} \Psi (\theta )\) for every \(\theta \), roughly speaking due to Lemma 2(iii) and the LLN. Specifically, the Poisson property for M results in \(M/n {\mathop {\rightarrow }\limits ^{p}} \alpha _{\theta _0}\). Furthermore, \(\frac{1}{n} \sum _{j=1}^n I_j \widetilde{X}_j=\frac{1}{n} \sum _{i=1}^M X_i {\mathop {\rightarrow }\limits ^{p}} \alpha _{\theta _0} E_{\theta _0}(X_i)\) is a consequence of \(M \sim Poi(n \alpha _{\theta _0})\). Together with \(E_{\theta _0}(M)= Var_{\theta _0}(M) = n \alpha _{\theta _0}\) one has

$$\begin{aligned} Var_{\theta _0}\left( \frac{1}{n} \sum _{i=1}^M X_i \right)= & {} E_{\theta _0}\left[ Var_{\theta _0}\left( \frac{1}{n} \sum _{i=1}^M X_i |M\right) \right] \\&+ Var_{\theta _0}\left[ E_{\theta _0}\left( \frac{1}{n} \sum _{i=1}^M X_i |M\right) \right] \\= & {} \frac{1}{n^2} E_{\theta _0}[M Var_{\theta _0}(X_i)] + \frac{1}{n^2} Var_{\theta _0}[M E_{\theta _0}(X_i)] \\= & {} \frac{1}{n} \alpha _{\theta _0} Var_{\theta _0}(X_i) + \frac{1}{n} [E_{\theta _0}(X_i)]^2 \alpha _{\theta _0} \quad {\mathop {\longrightarrow }\limits ^{n \rightarrow \infty }} 0,\\ \end{aligned}$$

as \(E_{\theta _0}(X_i)\) and \(Var_{\theta _0}(X_i)\) are finite by Corollary 2(iii+iv). Convergence follows in squared mean, and hence in probability.

For the next condition in Lemma 5.10, we need a short discussion about maxima at the boundary of \(\Theta \) for some—typically small—n. In these situations, there is no zero to \(\Psi _n(\theta )\). We will demonstrate that, using the boundary in these situations, the MLE is a “near zero”. That is, \(\Psi _n(\theta )\) is non-decreasing due to Lemma 2(ii) and Lemma 2(v) holds. Furthermore, \(\Psi (\theta )\) is obviously differentiable and \(\dot{\Psi }(\theta _0)>0\) with the same argument as for \(\dot{\psi }_{\theta }\) in Lemma 2(ii) for \(({\tilde{x}}_j, {\tilde{t}}_j)' \in D\), such that \(\Psi (\theta _0- \eta )< 0 < \Psi (\theta _0+ \eta )\) for every \(\eta >0\) when \(\Psi (\theta _0)=0\), which holds due to Lemma 2(iv). \(\square \)

Although being the MLE, we cannot study asymptotic normality with general results from maximum likelihood theory. This would only be possible if we had considered an estimator for the pair \((n,\theta _0)\). Nonetheless, \(\hat{\theta }\) is an M-estimator.

The main idea is to use the smoothness of \(\Psi _n(\theta )\) and apply a quadratic Taylor expansion of \(\Psi _n\) around \(\theta _0\) and evaluated at \(\hat{\theta }\), resulting in (see van der Vaart 1998, Equation (5.18))

$$\begin{aligned} \sqrt{n} (\hat{\theta } - \theta _0) = \frac{- \sqrt{n} \Psi _n(\theta _0)}{\dot{\Psi }_n(\theta _0) + \frac{1}{2}(\hat{\theta } - \theta _0)\ddot{\Psi }_n({\tilde{\theta }})}, \end{aligned}$$

with \({\tilde{\theta }}\) between \(\hat{\theta }\) and \(\theta _0\). We will need:

$$\begin{aligned} \psi _{\theta }^2(\tilde{x}_j,\tilde{t}_j) =&i_j \left( \frac{1}{\theta ^2} + {\tilde{x}}_j^2 + \frac{\dot{\alpha }_{\theta }^2}{\alpha _{\theta }^2} - \frac{2 {\tilde{x}}_j}{\theta } - \frac{2 \dot{\alpha }_{\theta }}{\theta \alpha _{\theta }} + \frac{2 {\tilde{x}}_j \dot{\alpha }_{\theta }}{\alpha _{\theta }} \right) \\ \ddot{\psi }_{\theta }(\tilde{x}_j,\tilde{t}_j) =&i_j \left( \frac{\dddot{\alpha }_{\theta }\alpha _{\theta } - \dot{\alpha }_{\theta }\ddot{\alpha }_{\theta }}{\alpha ^2_{\theta }} - \frac{2 \dot{\alpha }_{\theta } \ddot{\alpha }_{\theta } \alpha ^2_{\theta } - 2\dot{\alpha }_{\theta }^3\alpha _{\theta }}{\alpha ^4_{\theta }} -\frac{2}{\theta ^3} \right) \end{aligned}$$
(9)

Lemma 3

It is \(E_{\theta _0}[\psi ^2_{\theta _0}(\widetilde{X}_j,\widetilde{T}_j)] < \infty \) and \(\ddot{\psi }_{\theta }(\tilde{x}_j,\tilde{t}_j) \le \ddot{\psi }(\tilde{x}_j,\tilde{t}_j)\) for all \(\theta \), with the bound \(\ddot{\psi }\) being integrable.

Proof

For the first half: It is \(I_j \widetilde{X}_j^2 \le (G+s)^2 \Rightarrow E_{\theta _0}(I_j \widetilde{X}_j^2) \le \alpha _{\theta _0} (G+s)^2\), \(I_j \widetilde{X}_j \ge 0 \Rightarrow E_{\theta _0}(I_j \widetilde{X}_j) \ge 0\) and \(I_j \widetilde{X}_j \le (G+s) \Rightarrow E_{\theta _0}(I_j \widetilde{X}_j) \le \alpha _{\theta _0} (G+s)\), so that

$$\begin{aligned} E_{\theta _0}\left[ \psi _{\theta _0}^2(\widetilde{X}_j,\widetilde{T}_j)\right] \le \frac{\alpha _{\theta _0}}{\theta _0^2} + \alpha _{\theta _0} (G+s)^2 + \frac{( \dot{\alpha }_{\theta _0})^2}{\alpha _{\theta _0}} - \frac{2 \dot{\alpha }_{\theta _0}}{\theta _0} + 2 (G+s) \dot{\alpha }_{\theta _0} \end{aligned}$$

which is finite due to \(\theta _0 \in \Theta \), the finiteness and positivity of \(\alpha _{\theta _0}\) from (1) and the finiteness of \(\dot{\alpha }_{\theta _0}\) from Corollary 1. For the second half: In (9), we can replace the denominators by their positive minima (positive due to the arguments after (1)). Then, all numerators are continuous functions on the compact \(\Theta \), hence with finite maxima, which we may insert, so that \(\ddot{\psi }_{\theta }(\tilde{x}_j,\tilde{t}_j) \le i_j C=:\ddot{\psi }(\tilde{x}_j,\tilde{t}_j)\) (with \(C < \infty \)), having finite integral \(C \alpha _{\theta _0}\). \(\square \)

Theorem 2

Let \(\theta _0 \in ]\varepsilon , 1/\varepsilon [\). Then, under Assumptions (A1)–(A4), \(\sqrt{n} (\hat{\theta } - \theta _0) {\mathop {\rightarrow }\limits ^{d}} N(0, \sigma ^2)\) holds with \( \sigma ^2 := E_{\theta _0}(\psi _{\theta _0}^2(\widetilde{X}_j,\widetilde{T}_j))/ [E_{\theta _0}(\dot{\psi }_{\theta _0}(\widetilde{X}_j,\widetilde{T}_j))]^2\) (see definitions (6) and (9)).

Proof

Use the classical assumptions of Fisher (here in the formulation from van der Vaart 1998, Theorem 5.41). The main assumption of consistency is Theorem 1. Now \(\psi _{\theta }({\tilde{x}}_j, {\tilde{t}}_j)\) is twice continuously differentiable in \(\theta \) for every \(({\tilde{x}}_j,{\tilde{t}}_j)\), due to Lemma 2(i). \(E_{\theta _0}[\psi _{\theta _0}(\widetilde{X}_j, \widetilde{T}_j)]=0\) due to Lemma 2(iv) with \(E_{\theta _0}[\psi ^2_{\theta _0}(\widetilde{X}_j, \widetilde{T}_j)] < \infty \) due to Lemma 3. The existence of \(E_{\theta _0}[\dot{\psi }_{\theta _0}(\widetilde{X}_j,\widetilde{T}_j)]\) follows from (4) and positivity from Lemma 2(ii) combined with \(E_{\theta _0}(I_j)=\alpha _{\theta _0}>0\). Dominance of the second derivative by a fixed integrable function around \(\theta _0\) is due to Lemma 3. \(\square \)

For the estimation of the standard error (SE) from Theorem 2, we replace expectations by averages over the latent sample (with \(\theta _0\) replaced by \(\hat{\theta }\)),

$$\begin{aligned} \frac{\hat{\sigma }}{\sqrt{n}} := \frac{\frac{1}{\sqrt{n}} \sqrt{\sum _{j=1}^n \psi _{\theta _0}^2(\tilde{x}_j,\tilde{t}_j)}}{\frac{\sqrt{n}}{n} \sum _{j=1}^n \dot{\psi }_{\theta _0}(\tilde{x}_j,\tilde{t}_j)}, \end{aligned}$$
(10)

being observable, because the indicators reduce the sums to the m observed units.
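A sketch of how (10) can be evaluated (ours, not the authors' code; it plugs \(\hat{\theta }\) into \(\psi \) from (4) and \(\dot{\psi }\) from (6), and the sums run over the m observed durations only, because the indicators vanish otherwise):

```python
import numpy as np

def alpha(theta, s, G):
    return (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)) / (G * theta)

def alpha_dot(theta, s, G):
    num = (theta * s * np.exp(-theta * s) * (1 - np.exp(-G * theta))
           + theta * G * np.exp(-G * theta) * (1 - np.exp(-theta * s))
           - (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)))
    return num / (G * theta ** 2)

def se_hat(theta_hat, x_obs, s, G):
    """Estimated standard error (10), with theta_0 replaced by theta_hat."""
    psi = x_obs - 1.0 / theta_hat + alpha_dot(theta_hat, s, G) / alpha(theta_hat, s, G)  # (4), i_j = 1
    psi_dot = (2.0 / theta_hat ** 2
               - s ** 2 * np.exp(-theta_hat * s) / (1 - np.exp(-theta_hat * s)) ** 2
               - G ** 2 * np.exp(-G * theta_hat) / (1 - np.exp(-G * theta_hat)) ** 2)    # (6)
    m = len(x_obs)
    return np.sqrt((psi ** 2).sum()) / (m * psi_dot)
```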

3 Likelihood approximation

In order to give a precise version and derivation of the likelihood (2), we now describe the truncated sample as a stochastic process as in Kalbfleisch and Lawless (1989), specifically as a truncated empirical process, which in turn is approximated by a mixed empirical process. For the mixed process, deriving the likelihood is relatively simple.

Denote by \(\epsilon _a\) the Dirac measure concentrated at point \(a \in S\). Define the point measure \(\mu := \sum _{j=1}^n \epsilon _{(\widetilde{x}_j,\widetilde{t}_j)'}\), \(\mu : \mathcal {B} \mapsto \bar{\mathbb {N}}_0\), and denote the space of point measures on \(\mathcal {B}\) (with fixed n) by \(\mathbb {M}\). By inserting random variables, it becomes an empirical process \(N_n:= \sum _{j=1}^n \epsilon _{(\widetilde{X}_j,\widetilde{T}_j)'(\omega )}\) (\(\Omega \mapsto \mathbb {M}\)), measurable from \(\mathcal {A}\) to \(\mathcal {M}\), the \(\sigma \)-algebra for \(\mathbb {M}\). The data is now the truncated empirical process (for an illustration, see Fig. 2 (left))

$$\begin{aligned} N_{n,D}(\cdot ):= N_n(\cdot \cap D)=\sum _{j=1}^n \epsilon _{\left( {\begin{array}{c}\widetilde{X}_j\\ \widetilde{T}_j\end{array}}\right) } \left( \cdot \cap D \right) , \end{aligned}$$

for which we write \(X_1, \ldots , X_m\) in all but this section. The size of the truncated sample is \(N_{n,D}(S)\), for which we write M—and realised m—in all but this section, and is hence random and dependent on the sample size n.

Fig. 2

Left: Realisation of \(N_{n,D}\) on sequences of rectangles \([0,x] \times [0,t]\), as a function of the upper right corner \((x,t)'\). Dots mark \((\tilde{x}_j,\tilde{t}_j)'\). Right (for Sect. 5): Criterion function (5) (times n) for Application “insolvency”

In order to parametrize the data, i.e. the truncated empirical process, we write its intensity measure (only needed for sets \([0,x] \times [0,t]\)) as

$$\begin{aligned} \begin{aligned} \nu _{N_{n,D}}([0,x] \times [0,t])&:= E_{\theta _0}[N_{n,D}([0,x] \times [0,t])] \\&= n P \{(\widetilde{X}_j,\widetilde{T}_j)' \in [0,x] \times [0,t] \cap D\} \\&= n \alpha _{\theta _0} F^{X,T}(x,t), \end{aligned} \end{aligned}$$
(11)

due to Lemma 1. To see that, note that

$$\begin{aligned} \alpha _{\theta _0} F^{X,T}(x,t) = \mathcal {L}(\widetilde{X}_j,\widetilde{T}_j)(E_1) = F_E(x/\theta _0) F^{\widetilde{T}}(t) - \mathcal {L}(\widetilde{X}_j,\widetilde{T}_j)(E_2 \cup E_3). \end{aligned}$$

Here, and in the following, the measure in the co-domain of a random variable is denoted \(\mathcal {L}\), e.g. \(\mathcal {L}(\widetilde{X}_j,\widetilde{T}_j)\). Note also that, \(\nu _{N_{n,D}}\) evaluated at S, is \(n \alpha _{\theta _0}\). One can show that \(N_{n,D}\) is equal in distribution to a Binomial-mixing empirical process. However, as our data in the applications (Sect. 5) will be relatively few, because s is relatively small, we will see shortly that it is enough to approximate the data with a Poisson-mixing empirical process.

Definition 2

Assume (A1)–(A4) and let Z be Poisson-distributed with parameter \(n \alpha _{\theta _0}\) and independent thereof \((X_i,T_i)'\) of Definition 1:

$$\begin{aligned} N^{*}_n:= \sum _{i=1}^{Z} \epsilon _{\left( {\begin{array}{c}X_i\\ T_i\end{array}}\right) } \end{aligned}$$

Due to \(\nu _{n,D}(S)=n \alpha _{\theta _0}<\infty \) and \(\mathcal {L}[(X_i,T_i)']=\nu _{n,D}/(n \alpha _{\theta _0})\) (by (11)) now \(N^{*}_n\) is a Poisson process with an intensity measure (see Reiss 1993, Theorem 1.2.1(i))

$$\begin{aligned} \nu _n^*=\nu _{n,D} \quad \text {and} \quad N^{*}_n(S)=Z. \end{aligned}$$
(12)

The latter is generally true for Poisson processes (realized or not), so that Z is also observed.

The parallelogram D is “small” (in terms of \(\mathcal {L}(\widetilde{X}_j,\widetilde{T}_j)\)) relative to S, as long as the observation interval width s is relatively small compared to the width G of the population (and the typically long expected durations). Hence, \(N^{*}_n\) is “close” to \(N_{n,D}\) in Hellinger distance (see e.g. Reiss 1993, Approximation Theorem 1.4.2). We will now derive the likelihood for \(N^{*}_n\).

The likelihood is the density of \(N^{*}_n\), evaluated at the realisation, denoted as \(n^{*}_n\), i.e. with inserted z and \((x_i,t_i)'\)’s. The density of \(N^{*}_n\) has as its domain the co-domain of \(N^{*}_n\), namely \(\mathbb {M}\), so that the density of \(N^{*}_n\) is a function of the point measure \(\mu \). Furthermore, a Radon–Nikodym density requires a dominating measure and we use the density of another Poisson process. We choose the 2-dim homogeneous Poisson process on \([0,A]^2\).

Definition 3

Let \(A \in \mathbb {N}\) be a number larger than the support of \(X_i\) or \(T_i\), e.g. the next natural number larger than \(G+s\) (see Definition 1). Let \(N_0\) be a Poisson process with \(Z_0 \sim Poi_{A^2}\) and independently thereof \((X^0_i,T^0_i)' \sim Uni([0,A]^2)\), \(i=1,2,3, \ldots \).

Note that \(N_0\) has a (finite) intensity measure (see Reiss 1993, Theorem 1.2.1.(i)), where \(\lambda _{[0,A]^2}\) denotes the Lebesgue measure restricted to \([0,A]^2\),

$$\begin{aligned} \nu _0 := E[N_0(\cdot )] = A^2 \, \mathcal {L}\left[ (X^0_i,T^0_i)'\right] = \lambda _{[0,A]^2}. \end{aligned}$$

(13)

The latter is different from a geometrically intuitive volume \(A^4\). \(\mathcal {L}(N_0)\) will now serve as the dominating measure in order to derive the Radon–Nikodym density of \(\mathcal {L}(N^{*}_n)\). But for that we will need the Radon–Nikodym density of \(\nu _{n,D}\) w.r.t. \(\nu _0\), so that (see Billingsley 2012, Formula (16.11)) one seeks \(h_{\theta _0}:S \rightarrow \mathbb {R}_0^+\) such that for all \(B \in \mathcal {B}\)

$$\begin{aligned} \nu _{n,D}(B)= \int _B h_{\theta _0} \, d \nu _0. \end{aligned}$$
(14)

For \(B=[0,x]\times [0,t]\) with \(x\le A, t \le A\), due to Fubini’s theorem and the differentiability, with \(\lambda \) as the univariate Lebesgue measure,

$$\begin{aligned} \nu _{n,D}([0,x]\times [0,t])= & {} A^2 \int _0^x \int _0^t h_{\theta _0}(a_1,a_2) \lambda (d a_2) \lambda (d a_1) \nonumber \\ \Rightarrow h_{\theta _0}(x,t)= & {} \frac{1}{A^2} \frac{\partial ^2}{\partial x \partial t} \nu _{N_{n,D}}([0,x] \times [0,t]) \nonumber \\= & {} \frac{1}{A^2} \frac{\partial ^2}{\partial x \partial t} n \alpha _{\theta _0} F^{X,T}(x,t) =\frac{n \theta _0}{G \, A^2} e^{- \theta _0 x}, \end{aligned}$$
(15)

where (11) is used for the third equality, and Lemma 1, together with \(\frac{\partial ^2}{\partial x \partial t} R(x,t) = 0\) from Lemma 1, for the fourth. Of course, for \((x,t)' \not \in D\), it is \(h_{\theta _0}(x,t) = 0\).

Theorem 3

Under Assumptions (A1)–(A4) and with \(\alpha _{\theta _0}\) from (1), the model \(N^{*}_n\) of Definition 2 has likelihood w.r.t. \(\mathcal {L}(N_0)\) from Definition 3:

$$\begin{aligned} \ell (n^{*}_n;\theta _0,n)= \frac{n^{n^{*}_n(S)} \theta _0^{n^{*}_n(S)}}{G^{n^{*}_n(S)} A^{2 n^{*}_n(S)}} \exp \left( -\theta _0 \sum _{i=1}^{n^{*}_n(S)} x_i\right) \exp (A^2 - n \alpha _{\theta _0}) \end{aligned}$$
(16)

The proof is in Appendix B. The main idea is to decompose the density of the data, i.e. of \(\mathcal {L}(N_n^*)\), into the product of the density conditional on \(N_n^*(S)\), multiplied by the probability mass function of the Poisson-distributed \(N_n^*(S)\). The latter results in the very last factor of (16), which includes an exponential function in \(n \alpha _{\theta _0}\). Note that by Fisher–Neyman factorisation \((N^{*}_n(S), \sum _{i=1}^{N^{*}_n(S)} X_i)\) is a sufficient statistic.

We maximise the likelihood as a function of its second argument, the “generic” parameter \(\theta \), this being already the notation in (3). For a thorough discussion of the parameter notation, we refer the reader to the maximum likelihood estimator as posterior mode in a Bayesian analysis with uniform prior (see e.g. Robert 2001, Sect. 2.3). Finally note that, after taking the logarithm, the derivatives w.r.t. \(\theta \) and n of (16) are equal to those of its intuitive counterpart (2), with \(n^{*}_n(S)\) replaced by m (see (3)).

4 Monte Carlo simulations

Our aim in this section is twofold: first, we illustrate the vanishing bias, i.e. consistency, stated theoretically in Theorem 1. Second, the notion of a “bias”, referring to one model so far, can be extended to the “selection bias”, comparing two models. We will assess this design effect, compared to the srs-design, as motivated theoretically after Lemma 2.

We simulate \(n \in \{10^p, p=3, \ldots , 6 \}\) durations \(\widetilde{X}_j\) from Assumption (A2) with \(\theta _0 \in \{0.005, 0.01, 0.05, 0.1 \}\) according to (A1) and further \(\widetilde{T}_j\) according to (A2) with \(G \in \{24, 48 \}\), and we obey (A3). We then retain the m of the \(\tilde{x}_j\) that fulfil (A4) with \(s \in \{2,3,48\}\). For data set v, we calculate the MLE \(\hat{\theta }^{(v)}\) as the zero of (5) by means of a standard algorithm. Boundary maxima do not occur, because (8) is markedly negative for all simulation scenarios.
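For transparency, a condensed re-implementation sketch of one simulation scenario (ours, not the authors' code; R and the seed are arbitrary, and the MLE is computed as the zero of \(n\Psi _n\) from Sect. 2.2, with boundary cases returned as \(\varepsilon \) or \(1/\varepsilon \)):

```python
import numpy as np
from scipy.optimize import brentq

def alpha(theta, s, G):
    return (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)) / (G * theta)

def alpha_dot(theta, s, G):
    num = (theta * s * np.exp(-theta * s) * (1 - np.exp(-G * theta))
           + theta * G * np.exp(-G * theta) * (1 - np.exp(-theta * s))
           - (1 - np.exp(-theta * s)) * (1 - np.exp(-G * theta)))
    return num / (G * theta ** 2)

def theta_hat(x_obs, s, G, eps=1e-4):
    def n_psi(theta):
        m = len(x_obs)
        return x_obs.sum() - m / theta + m * alpha_dot(theta, s, G) / alpha(theta, s, G)
    lo, hi = eps, 1.0 / eps
    if n_psi(lo) > 0:       # boundary maximum at eps
        return lo
    if n_psi(hi) < 0:       # boundary maximum at 1/eps
        return hi
    return brentq(n_psi, lo, hi)

def one_scenario(n, theta0, G, s, R=1000, seed=0):
    rng = np.random.default_rng(seed)
    est = np.empty(R)
    for v in range(R):
        x = rng.exponential(1.0 / theta0, size=n)       # latent durations (A2)
        t = rng.uniform(0.0, G, size=n)                  # latent ages (A2)
        keep = (t <= x) & (x <= t + s)                   # truncation (A4)
        est[v] = theta_hat(x[keep], s, G)
    return est.mean() - theta0, n * est.var()            # bias and n * simulated variance

print(one_scenario(n=1000, theta0=0.05, G=24.0, s=3.0, R=200))
```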

Table 1 Simulation averages of bias, of estimated asymptotic variance \(\hat{\sigma }^2\) and of \(\widehat{VIF}\); simulated \(Var(\hat{\theta })\) (times n)

In order to illustrate, first, consistency, we assess the finite-sample bias as an average over the \(R =1000\) simulated \((\hat{\theta }^{(v)} - \theta _0)\). Table 1 (1st rows) lists the results, and it can be seen that the bias decreases to virtually zero. In order to show the decline in the mean squared error, consider the estimated standard error (10) of \(\hat{\theta }^{(v)}\). In Table 1 (2nd rows), the averages over the \(\left( \hat{\sigma }^{(v)}\right) ^2\) seem to have a finite limit for increasing n. Hence, the standard error decreases at rate \(\sqrt{n}\).

A by-product of the simulations is that they enable confirming the representation of \(\sigma ^2\) (in Theorem 2). On the one hand, \(Var(\hat{\theta })\) can be approximated by \(\frac{1}{R} \sum _{v}^R (\hat{\theta }^{(v)} - \theta _0)^2\), the simulated variance, i.e. \(\sigma ^2= Var(\sqrt{n} \hat{\theta }) =n Var(\hat{\theta })\) by n times the simulated variance (Table 1 (3rd rows)). On the other hand, in a simulation (though not in an application), \(\sigma ^2\) can be estimated as n times the square of (10) (Table 1 (2nd rows)). Both quantifications become equal for large n.

The relation of the standard error to \(\alpha _{\theta _0}\) is also interesting. The standard error decreases in \(\alpha _{\theta _0}\), obviously because \(\alpha _{\theta _0}\) is linearly related to the size of the truncated sample by \(m=n \alpha _{\theta _0}\) (see again (11)). The relation of \(\alpha _{\theta _0}\) to \(\theta _0\), s and G is already explained after its definition (1), and the respective sensitivity is presented in Table 1. There is one exception; although \(\alpha _{\theta _0}\) is decreasing in G, the simulated \(\hat{\sigma }^2\) does not increase, but instead decreases for a given n (Table 1 (left panels)). The reason can be suspected to be as in the srs-design, where the estimated standard error (17) not only increases in m at order 1/2, but also decreases in \(\sum _{i=1}^m x _i\) at order 1, the latter being much larger for a large G (at given m).

Second, for the srs-design, applying (7) results in an MLE \(\hat{\theta }_{srs}= m/\sum _{i=1}^m x _i\) with standard error \(\sigma _{srs}/\sqrt{m}= \theta _0/\sqrt{n^{*}(S)}\) (i.e. \(\sigma _{srs}^2:=\theta _0^2\)). The latter can be estimated by inserting \(\hat{\theta }_{srs}\),

$$\begin{aligned} \frac{\hat{\sigma }_{srs}}{\sqrt{m}} = \frac{\sqrt{m}}{\sum _{i=1}^m x _i}. \end{aligned}$$
(17)

The factor for “inflating” the variance from Theorem 2, denoted as Kish’s design effect, is

$$\begin{aligned} VIF:= \frac{\sigma ^2/n}{\sigma ^2_{srs}/m}. \end{aligned}$$
(18)

Illustrating the design effect with the VIF is typical for the field of sampling techniques, especially in survey sampling. (By contrast, in the field of econometrics, variance inflation typically denotes the fact that standard errors increase for coefficients in a regression when accepting more covariates.) In the simulations, the VIF remains overall at a quite moderate size, with a tendency to increase in \(\alpha _{\theta _0}\).
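In terms of computation, the estimated design effect only combines the two squared standard errors; a minimal sketch (ours), where se_trunc is the estimate (10) and the srs standard error is (17):

```python
def vif_hat(se_trunc, x_obs):
    """Estimated VIF (18): squared SE under the truncation design, see (10),
    divided by the squared SE under the srs-design, see (17)."""
    m = len(x_obs)
    se_srs = m ** 0.5 / sum(x_obs)   # (17)
    return (se_trunc / se_srs) ** 2
```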

We will continue the comparison of designs in the applications of Sect. 5 where we will see a substantial variance inflation in all three applications.

5 Three empirical applications

5.1 Populations and data

Insolvency of corporates founded 1990–2013 The population of our first application are the German companies founded after the last structural break in Germany, the re-unification, i.e. from the beginning of 1990 onwards. The first event is the foundation of the company, and the second event considered is the insolvency. We restrict attention to the \(G=24\) years until the end of 2013, after which we started observing. Let \(\widetilde{X}_j\sim Exp(\theta _0)\) denote the age-at-insolvency and \(\widetilde{T}_j\) the age at the beginning of 2014. We assume foundations to have taken place constantly (over those \(G=24\) years), i.e. \(\widetilde{T}_j \sim Uni[0,24]\). The German federal ministry of finance publishes the age of each insolvent debtor. We stop observing in 2016, i.e. \(s=3\), after having collected, as the truncated sample, \(m=55,279\) companies.

Divorce of couples married 1993–2017 In our next application, the German bureau of statistics reports divorces, with marriage lengths. Of the marriages sealed between 1993 and 2017 in the German city of Rostock, \(m=327\) were divorced during 2018. Of these, 82 lasted less than 5 years, 112 lasted 6–10, 67 lasted 11–15, 40 lasted 16–20 and 26 lasted 21–25 years, i.e. \(G=25\) and \(s=1\). This small-sample example can help to understand the dependence of the variance inflation on the data size.

Dementia onset of people born 1900–1954 Our final application is dementia incidence in Germany for the birth cohorts 1900 until 1954. The first event is the 50th birthday of a person, between 1950 and 2004, i.e. we have \(G=55\). An insurance company reported that between 2004 and 2013 (\(s=9\)), \(m=35,929\) insurants had a dementia incidence (the second event) (for more information about the data see Weißbach et al. 2021).

5.2 Comparison of estimation results

The zero of (5), i.e. the point estimate \(\hat{\theta }\), is found graphically, for instance for the first application by Fig. 2 (right). For the estimated standard error see (10). All estimates are in Table 2, which also contains the estimates under srs-design (17).

Table 2 ML estimate \(\hat{\theta }\) and estimate of standard error (SE) \(\sigma /\sqrt{n}\) (see Theorem 2) for applications, and comparison with srs-design

It is evident that ignoring truncation overestimates the hazard \(\theta _0\), by, for example, 29% in the insolvency application, and also causes a negative selection of units in the other applications. We observe that the standard error is underestimated by about 35% for all applications (equivalent to an on-average \(\widehat{VIF}=2.5\), as an estimate of (18)), presumably through ignoring the stochastic dependence between units (and thus measurements) within the truncated sample. Also, the variance inflation seems to hardly depend on the sample size.

6 Discussion

The results are encouraging, as even after truncation, asymptotic normality holds, and standard errors do not increase too much. The considerable selection bias can be accounted for easily and identification of the parameters follows from standard results on the exponential family.

However, it is somewhat unfortunate that standard consistency proofs for the Exponential family fail, because compactness of the parameter space is violated, even when re-parametrising, due to the growing sample size being a parameter itself. And a temptation to withstand is to misinterpret the data as a simple random sample, only because statistical units are selected with equal probabilities (see (1)). This is especially tempting, because if the truncated sample was simple, not knowing n would be similar to not knowing the size of the population, requiring “finite-population corrections” only in the case of relatively many observations.

In practice, the considerable effort to account for truncation can even be circumvented in rich data situations by adjusting the population definition to start at the observation interval, however thereby excluding observable units (see e.g. Weißbach et al. 2009).

Of course more advanced sampling designs exist, such as endogenous sampling, where units that have had a longer timeframe have a larger selection probability, in contrast to our model (see (1)). Also, truncation is typically analysed with counting process theory, focusing more on the role of the filtration as an information model (see e.g. Andersen et al. 1988). And with respect to robustness, the maximum likelihood method we use can be inferior to the method of moments (see e.g. Weißbach and Radloff 2020; Rothe and Wied 2020).

Nonetheless, we believe that our approach still offers some advantages, as we (i) directly recognize the second measurement, the age when observation starts, as random, (ii) model the sample size as random and (iii) distinguish explicitly between indices in the observed and the unobserved sample.

Two more minor points appear notable. First, the distance from the data to the mixed empirical process can be reduced to zero by changing from Poisson-mixing to Binomial-mixing, although little new insight can be expected, other than longer proofs. The same is true when proving the information equality for the standard error. And finally, one troublesome aspect should not be concealed. Compare the design effect with the theory of cluster samples where the VIF increases in the cluster size linearly, for given intra-cluster correlation. Considering the time as a classifier, truncation seems to introduce a very small intra-temporal correlation, because the increase in the VIF is small. However, for very small sample sizes, the VIF should then be even smaller. Non-linear behaviour of the dependence on the sample size is conceivable.