Truncating the Exponential with a Uniform Distribution

For a sample of Exponentially distributed durations, we aim at a point estimate and a confidence interval for the parameter. A duration is only observed if it has ended within a certain time interval, determined by a Uniform distribution. Hence, the data is a truncated empirical process that we can approximate by a Poisson process when only a small portion of the sample is observed, as is the case for our applications. We derive the likelihood from standard arguments for point processes, acknowledging the size of the latent sample as the second parameter, and derive the maximum likelihood estimator for both. Consistency and asymptotic normality of the estimator for the Exponential parameter are derived from standard results on M-estimation. We compare the design with a simple random sample assumption for the observed durations. Theoretically, the derivative of the log-likelihood is less steep in the truncation design for small parameter values, indicating a larger computational effort for root finding and a larger standard error. In applications from the social and economic sciences and in simulations, we indeed find a moderately increased standard error when acknowledging truncation.


Introduction
Poor sample selection is a frequent basis for objection to the inferential quality of data. Hospital controls may be negatively selective, a student sample is a positive selection. Sampling from soldiers is selective, because a body height threshold truncates smaller recruits. Inference from the status quo of a loan portfolio can take into account the fact that earlier loan applications with too low a score had been rejected (see Bücker et al., 2013). Here we study de-selection on the basis of age being either too low or too high. An age is the duration between two events, denoted as "birth" and "death".
Figure 1: Left: Three cases of the date of the 1st event (black bullet) and the date of the 2nd event (white circle): observed (solid) and truncated (dashed) durations; labels: age-at-death-event X_j, age-at-study-begin T_j, observation period (length s). Right: Sets in the co-domain of (X_i, T_i) or (X_j, T_j) used in Lemma 1 (and proof); example x ≥ s. (Explanation of panels and symbols is distributed over larger parts of text.)
We assume an Exponential distribution for the latent duration X_j, observed or truncated, and estimate its parameter θ_0. Our three applications will be the lifetime of a company (in Germany), the duration of a marriage (in the city of Rostock), and the waiting time, after the 50th birthday, until dementia onset (in Germany).
The parameter of an Exponential distribution is closely linked to the probability of the second event happening within one time unit, which is one year in all of our applications. In essence, one wants to estimate such an event probability by dividing the number of events (over a certain period) by the number of units at risk (at the beginning of the period), but this is prohibited by the lack of a denominator. We circumvent the missing data with the conditional distribution of the duration.
We distinguish, as three statistical masses, the population as all units with a first event in a period (of length G), the latent simple random sample and, after truncation, the observed data.
One can of course ask whether such simple random latent samples exist at all in practice. In survival analysis, the assumption of durations as independent identically distributed random variables can be defended, because independence and randomness are attributable to an unforeseeable staggered entry (see e.g. Weißbach and Walter, 2010). Even more specifically, in labour economics, it is validated theoretically that market friction renders the entry into a new occupation random for an employee, and hence also its duration until the new occupation.
Truncation is known to introduce a selection bias, referring to the comparison of two models: the estimate of the correct model is compared to the estimate from erroneously modelling the observed data as a simple random sample (srs-design). (We will later distinguish the selection bias from the statistical bias, the latter referring to only one model, namely comparing an estimate with the true parameter.) More important for us is that truncation is suspected to increase the standard error, as conjectured by Adjoudj and Tatachak (2019) due to dependence in the observed data, and we are interested in the extent to which truncation hinders statistical inference in terms of large sample properties.
As an early reference, Cox and Hinkley (1974) in their Example 2.25 consider the size of the truncated sample as an ancillary statistic, not acknowledging the size of the latent sample, n, as a parameter. The size of the truncated sample was subsequently considered again as random in Woodroofe (1985), and conditioning was used to prove consistency. Neighbouring contemporaneous work on truncation in survival analysis, mostly semi- and non-parametric, includes Shen (2010). Here, we derive the maximum likelihood estimator (MLE) of n and θ_0 by representing the observed data as a truncated empirical process. We derive the likelihood with standard results for empirical processes (see e.g. Reiss, 1993). The size of the data m will be shown to be such a process, seen as a point measure, evaluated at a certain set S. To the best of our knowledge, the model is the first example of an exponential family with the space of point measures being the sample space.

Model and Result
Before presenting the estimator and its asymptotic distribution, the data need to be described.

Sample Selection
The unit j of the latent sample carries two measures: its lifetime X_j (∈ R_0^+) and its birthdate (a calendar time). We, equivalently, measure the birthdate backwards from a specific time point (equal for all units of the latent sample) and denote it as T_j. We use the calendar date when our study period begins as this time point, so that T_j has the interpretation of being the "age when the study begins". We consider as population the units born within a pre-defined time window going back G time units from the study beginning, so that T_j ∈ [0, G] (see Figure 1(left)). Define S := R_0^+ × [0, G], with 0 < G < ∞, as the space for one outcome, and let it generate the σ-field B. In contrast to the example of soldiers, whose recruitment truncates all at the same height, each unit is here truncated at a different age, which fits our survival analytic applications.
As illustrated in Figure 1(left), all units are truncated at the same time, when the study begins. Differently for each unit j, the time interval of observation truncates the sample unit in cases of a too low or too high age. Because T_j is the (shifted) birth date, assuming a time-homogeneous Poisson process as the process of births renders the distribution of T_j Uniform (see Dörre, 2020, Lemma 2). Let us collect the following notation and assumptions:
(A1) Let Θ := [ε, 1/ε] for some "small" ε ∈ ]0, 1[.
(A2) Let, for θ_0 being an interior point of Θ, X_j ∼ Exp(θ_0), i.e. with density f_E(·/θ_0) and CDF F_E(·/θ_0) of the Exponential distribution. Let T_j ∼ Uni[0, G], with density f_T and CDF F_T of the Uniform distribution.
(A3) X_j and T_j are stochastically independent.
(A4) For a known constant s > 0, the column vector (X_j, T_j) is observed if it is in D := {(x, t) ∈ S : t ≤ x ≤ t + s}.
Assumption (A4) formalises that a sample unit is only observed when its second event falls into the observation period (of length s). For instance, in one of the applications, we will know the age-at-insolvency, i.e. the duration until insolvency, only for those companies that filed for insolvency within the s = 3 years 2014-2016. The parallelogram D is depicted in Figure 1(right).
The paper assumes a simple random sample for (X_j, T_j), j = 1, ..., n, n ∈ N, i.e. i.i.d. random variables (r.v.) mapping from the probability space (Ω, A, P) into the measurable space (S, B).
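The sampling design in (A1)-(A4) can be mimicked in a few lines. The following Python sketch is only an illustration under stated assumptions: the function name and the values θ_0 = 0.1, G = 24 and n = 10000 are hypothetical (only s = 3 is taken from the insolvency example), and Exp(θ_0) is read as the rate parametrisation, in line with a larger θ_0 meaning a smaller expected duration.

```python
import random


def simulate_truncated_sample(n, theta, G, s, seed=0):
    """Draw n latent pairs (X_j, T_j) and keep those with T_j <= X_j <= T_j + s.

    Hypothetical helper: theta is the rate of the Exponential distribution,
    G the width of the population window, s the length of the observation period.
    """
    rng = random.Random(seed)
    observed = []
    for _ in range(n):
        x = rng.expovariate(theta)  # latent duration X_j ~ Exp(theta)
        t = rng.uniform(0.0, G)     # age when the study begins, T_j ~ Uni[0, G]
        if t <= x <= t + s:         # second event falls into the observation period
            observed.append((x, t))
    return observed


# Illustrative parameter values (assumptions, not estimates from the applications)
n_latent = 10_000
sample = simulate_truncated_sample(n_latent, theta=0.1, G=24.0, s=3.0, seed=1)
m = len(sample)
print("observed m =", m, "share observed =", m / n_latent)
```

Only the durations of the retained pairs enter the estimation later on; the latent sample size n is not available in an application and is known here only because the data are simulated.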

Define now for θ ∈ Θ
α_θ := (1 − e^{−θG})(1 − e^{−θs}) / (θG)    (1)
and note that for θ = θ_0, by Figure 1(right), Fubini's Lemma and the substitution rule, it is α_{θ_0} = P{T_j ≤ X_j ≤ T_j + s}, i.e. the selection probability of the j-th individual. The numerator is, due to θ_0, s, G > 0, strictly positive and, as is to be expected, with a larger observation interval, i.e. increasing s, selection becomes more likely. Additionally, for sufficiently large θ_0 (i.e. small expected duration) the denominator increases faster than the numerator does, so that selection becomes less likely: a shorter duration will not reach the observation interval. Seen as a function of G, α_{θ_0} is monotonically decreasing, with almost the same interpretation.
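The selection probability can be checked numerically. The sketch below assumes the closed form α_θ = (1 − e^{−θG})(1 − e^{−θs})/(θG), which follows from (A2)-(A4) by integrating F_E(t + s) − F_E(t) against the Uniform density on [0, G]; the function names and parameter values are hypothetical.

```python
import math


def alpha(theta, G, s):
    """Assumed closed form of the selection probability P(T <= X <= T + s)."""
    return (1.0 - math.exp(-theta * G)) * (1.0 - math.exp(-theta * s)) / (theta * G)


def alpha_numeric(theta, G, s, steps=100_000):
    """Midpoint-rule approximation of (1/G) * int_0^G [F_E(t+s) - F_E(t)] dt."""
    h = G / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += math.exp(-theta * t) - math.exp(-theta * (t + s))
    return total * h / G


a_closed = alpha(0.1, 24.0, 3.0)
a_num = alpha_numeric(0.1, 24.0, 3.0)
print(a_closed, a_num)
# Monotonicity in s (increasing) and in G (decreasing), as described in the text
print(alpha(0.1, 24.0, 5.0) > a_closed, alpha(0.1, 48.0, 3.0) < a_closed)
```

With these illustrative values roughly one in ten latent units is selected, which is the "small portion" regime in which the Poisson approximation is used.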
The selection probability will occur in the likelihood, so that for maximisation, its first derivative will be needed. The second derivative of α_θ (with now variable θ) will be needed for proving the asymptotic normality and thus calculating the standard error. The proof is elementary and omitted here.
Corollary 1. With Assumptions (A1)-(A4) the first and second derivatives
(iii) For the expectation of X_i it holds
(iv) For the variance of X_i note that
We are now in the position to formulate the likelihood, maximise it and apply large sample theory.
experiments is small. This is the case when the width of the observation period (of length s) is "short", relative to the population period (of length G), as will be true for our applications. The description so far motivates

Estimator and Confidence Interval
where we already use the "generic" parameter θ, as will be explained at the end of Section 3. The conditionally independent and Exponentially distributed observed durations X_i cause the first two factors in (2). The last two factors appear in the Poisson distribution of the observed sample size with parameter nα_θ. Details of the likelihood construction require a formulation of the data as a truncated empirical process and will be given in Section 3 (and in Theorem 3). The main point is that it is not necessary to formulate the conditional independence as a further assumption; it follows from the simple random sample assumption for the (X_j, T_j) and Assumptions (A1)-(A4).
At first reading, Section 3 may be omitted without lack of coherence.
As a side remark, by inspection of (2), and long known for randomly left-truncated durations, the likelihood does not include the (observed) t_i, but it does include the (unobserved) n. Accordingly, n, which was not a parameter in the model (A1)-(A3), becomes a parameter after adding (A4).
As n is unknown in likelihood (2) (and equally in its rigorous counterpart to follow in Theorem 3), we obtain the approximate MLE for (n, θ_0) and use the θ-coordinate of the bivariate zero as θ̂. The logarithm of the likelihood has the derivative
Solving the bivariate equation for n ∈ R^+ results in m/α_θ. In order to facilitate the proofs later on, we formulate the estimation as a minimization problem, and in detail as a minimization of an average. Define
with i_j as a realization of
The derivative of the log-likelihood is now obviously related to (see van der Vaart, 1998, Sect. 5)
The function is not observable, but it becomes observable after multiplication by n and hence its zero, θ̂, is observable.
In order to account for boundary maxima, define the MLE θ̂ now as the zero of Ψ_n(θ) if it exists in (the open) Θ, as ε if Ψ_n(θ) > 0, respectively as 1/ε if Ψ_n(θ) < 0, both for all θ ∈ Θ. The following analytical properties (with proof in Appendix A) will be needed to prove the consistency and asymptotic normality of θ̂.
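As a rough sketch of the root finding, assume that after inserting n = m/α_θ the estimating equation reduces to matching the observed mean duration with the conditional mean E_θ(X_i) = 1/θ − α̇_θ/α_θ. This is one reading of the description above, combined with the assumed closed form α_θ = (1 − e^{−θG})(1 − e^{−θs})/(θG); all function names, the bisection scheme and the parameter values are hypothetical choices.

```python
import math
import random


def alpha(theta, G, s):
    """Assumed selection probability."""
    return (1.0 - math.exp(-theta * G)) * (1.0 - math.exp(-theta * s)) / (theta * G)


def cond_mean(theta, G, s):
    """Assumed E_theta(X_i) = 2/theta - G/(e^{theta G} - 1) - s/(e^{theta s} - 1)."""
    def term(c):
        y = theta * c
        # c / (e^y - 1), written with e^{-y} to avoid overflow for large y
        return c * math.exp(-y) / (-math.expm1(-y))
    return 2.0 / theta - term(G) - term(s)


def mle_truncated(xs, G, s, eps=1e-3, iters=200):
    """Zero of Psi_n on [eps, 1/eps] by bisection, with the boundary convention."""
    xbar = sum(xs) / len(xs)

    def psi_n(theta):  # non-decreasing in theta
        return xbar - cond_mean(theta, G, s)

    lo, hi = eps, 1.0 / eps
    if psi_n(lo) > 0:   # positive on all of Theta: boundary value eps
        return lo
    if psi_n(hi) < 0:   # negative on all of Theta: boundary value 1/eps
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if psi_n(mid) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


# Simulated data with illustrative (assumed) parameters
rng = random.Random(7)
theta0, G, s, n = 0.1, 24.0, 3.0, 100_000
xs = []
for _ in range(n):
    x, t = rng.expovariate(theta0), rng.uniform(0.0, G)
    if t <= x <= t + s:
        xs.append(x)

theta_hat = mle_truncated(xs, G, s)
n_hat = len(xs) / alpha(theta_hat, G, s)  # n-coordinate m / alpha_theta
print("theta_hat =", theta_hat, "n_hat =", n_hat)
```

Bisection is used here only because Ψ_n is monotone; any bracketing root finder would serve equally well.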
As a comparison, we consider the naïve approach of assuming, already for the observed data, X_1, ..., X_m iid ∼ Exp(θ_0). This is even more tempting because a population definition then seems to be redundant. Theoretically, under the srs-assumption, the derivative of the log-likelihood (multiplied by minus one) has summands similar to the first two summands of (4) if i_j = 1. An interpretation of (ii) in Lemma 2 is now the srs-design as the limit, in the sense that,
is a tribute to boundary maxima, Ψ_n(θ) has no zero in Θ in case of a too high or too low "location" of Ψ_n, in combination with a too small amplitude over the parameter space, meaning Ψ_n(1/ε) − Ψ_n(ε). As ε can be chosen arbitrarily small, the amplitude depends on the limiting behaviour of Ψ_n towards the boundaries of R^+, on the left for θ → 0 and on the right for θ → ∞. Towards the left border, consider Taylor expansions for the numerator and denominator of ψ_θ(x_j, t_j)/i_j − x_j to show that the first two derivatives, using l'Hôspital's rule for θ → 0, are zero, but the third is not. The resulting
Following up, note that (see Definition 2 and Proof to Lemma 2(iii)). Note further lim_{s→0} α_{θ_0} E_{θ_0}(X_i) = 0, from Corollary 2(iii), and lim_{s→0} α_{θ_0} = 0 (see (1)).
Compare with lim_{θ→0} ψ^srs_θ(x_i) = −∞, to see that the reduced amplitude implies less information under truncation, due to the obviously reduced slope also at θ_0.
By contrast, on the right border, the limiting behaviour for θ → ∞ is not affected by the change in design. To see when ψ_{1/ε}(x_j, t_j) > 0, note that lim_{θ→∞} ψ_θ(x_j, t_j)/i_j − x_j = 0, using l'Hôspital's rule once. For the srs-design, the limit is the same and finite, showing that a boundary maximum can occur when the observed durations are small, i.e. when θ_0 is large (compared to n).
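The limiting behaviour at both borders can be inspected numerically. The sketch assumes that, for i_j = 1, ψ_θ(x_j, t_j) − x_j equals −1/θ + α̇_θ/α_θ with α_θ = (1 − e^{−θG})(1 − e^{−θs})/(θG), and that the srs counterpart is −1/θ; under this assumption the left limit of the truncation design is the finite value −(G + s)/2. Function names and the values G = 24, s = 3 are hypothetical.

```python
import math


def score_offset_truncated(theta, G, s):
    """Assumed psi_theta - x_j for i_j = 1: -2/theta + G/(e^{theta G}-1) + s/(e^{theta s}-1)."""
    def term(c):
        y = theta * c
        # c / (e^y - 1) in an overflow-safe form
        return c * math.exp(-y) / (-math.expm1(-y))
    return -2.0 / theta + term(G) + term(s)


def score_offset_srs(theta):
    """psi^srs_theta - x_i under the simple random sample assumption."""
    return -1.0 / theta


G, s = 24.0, 3.0
for theta in (1e-4, 1e-2, 1.0, 1e3):
    print(theta, score_offset_truncated(theta, G, s), score_offset_srs(theta))

left_limit = score_offset_truncated(1e-4, G, s)   # close to -(G + s)/2
right_limit = score_offset_truncated(1e3, G, s)   # close to 0
```

The bounded left limit, compared with the divergence of −1/θ, is the reduced amplitude discussed above.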
We will continue the comparison of designs in Monte Carlo simulation and applications of Sections 4 and 5.
Proof. Apply Lemma 5.10 in van der Vaart (1998). ]ε, 1/ε[ is a subset of the real line, Ψ_n is a random function and Ψ a fixed function, both in θ. It is Ψ_n(θ) →^p Ψ(θ) for every θ, roughly speaking due to Lemma 2(iii) and the LLN. Specifically, the Poisson property for M results in M/n as E_{θ_0}(X_i) and Var_{θ_0}(X_i) are finite by Corollary 2(iii+iv). Convergence follows in squared mean, and hence in probability.
For the next condition in Lemma 5.10, we need a short discussion about maxima at the boundary of Θ for some (typically small) n. In these situations, Ψ_n(θ) has no zero. We will demonstrate that, using the boundary in these situations, the MLE is a "near zero". That is, Ψ_n(θ) is non-decreasing due to Lemma 2(ii) and Lemma 2(v) holds. Furthermore, Ψ(θ) is obviously differentiable and Ψ̇(θ_0) > 0 with the same argument as for every η > 0 when Ψ(θ_0) = 0, which holds due to Lemma 2(iv).
Although θ̂ is the MLE, we cannot study its asymptotic normality with general results from maximum likelihood theory. This would only be possible if we had considered an estimator for the pair (n, θ_0). Nonetheless, θ̂ is an
The main idea is to use the smoothness of Ψ_n(θ) and apply a quadratic Taylor expansion of Ψ_n around θ_0, evaluated at θ̂, resulting in (see van der Vaart, 1998, Equation (5.18))
with an intermediate value θ̃ between θ̂ and θ_0. We will need: for all θ and the subsequent bound integrable.
Proof. For the first half: which is finite due to θ_0 ∈ Θ, the finiteness and positivity of α_{θ_0} from (1) and the finiteness of α̇_{θ_0} from Corollary 1(i). For the second half: In (9), we can replace the denominators by their (due to the arguments after (1) (6) and (9)).
For the estimation of the standard error (SE) from Theorem 2, we replace expectations by averages over the latent sample, being observable, because indicators reduce sums up to m.

Likelihood Approximation
In order to give a precise version and derivation of the likelihood (2), we now describe the truncated sample as a stochastic process as in Kalbfleisch and Lawless (1989), especially as a truncated empirical process, which in turn is approximated by a mixed empirical process. For the mixed process, deriving the likelihood is relatively simple.
Denote by δ_a the Dirac measure concentrated at point a ∈ S. Define the point measure µ := Σ_{j=1}^n δ_{(x_j, t_j)}, µ : B → ℕ_0, and the space of point measures on B (with fixed n) by M. By inserting random variables, it becomes an empirical process N_n := Σ_{j=1}^n δ_{(X_j, T_j)}(ω) (Ω → M), measurable w.r.t. the σ-algebras A and ℳ, where ℳ denotes the σ-algebra for M. The data is now the truncated empirical process (for an illustration, see Figure 2)
for which we write X_1, ..., X_m in all but this section. The size of the truncated sample is N_{n,D}(S), for which we write M (and realised m) in all but this section, and is hence random and dependent on the sample size n.
In order to parametrize the data, i.e. the truncated empirical process, we write its intensity measure (only needed for sets [0, x] × [0, t]) as
due to Lemma 1. To see that, note that
Here, and in the following, the measure in the co-domain of a random variable is denoted L, e.g. L(X_j, T_j). Note also that ν_{N_{n,D}}, evaluated at S, is nα_{θ_0}. One can show that N_{n,D} is equal in distribution to a Binomial-mixing empirical process. However, as our data in the applications (Section 5) will be relatively few, because s is relatively small, we will see shortly that it is enough to approximate the data with a Poisson-mixing empirical process.
Figure 2 (caption fragment): (5) (times n) for Application "insolvency".
The latter is generally true for Poisson processes, (realized or not), so that Z is also observed.
The parallelogram D is "small" (in terms of L(X_j, T_j)) relative to S, as long as the observation interval width s is relatively small compared to the width G of the population (and the typically long expected durations).
Hence, N*_n is "close" to N_{n,D} in Hellinger distance (see e.g. Reiss, 1993, Approximation Theorem 1.4.2). We will now derive the likelihood for N*_n. The likelihood is the density of N*_n, evaluated at the realisation, denoted as n*_n, i.e. with inserted z and (x_i, t_i)'s. The density of N*_n has as its domain the co-domain of N*_n, M, so that the density of N*_n is a function of the point measure µ. Furthermore, a Radon-Nikodym density requires a dominating measure and we use the distribution of another Poisson process. We choose the 2-dim homogeneous Poisson process on [0, A]^2.
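To get a feeling for the quality of the Poisson approximation, one can compare the distributions of the sample size alone: Binomial(n, α) for the truncated empirical process and Poisson(nα) for the Poisson-mixing process. The sketch below is not the Approximation Theorem itself; it only evaluates the Hellinger distance between the two count distributions, using the convention H = (Σ_k (√p_k − √q_k)^2)^{1/2}. The function names and the values n = 1000, α ∈ {0.01, 0.3} are hypothetical.

```python
import math


def log_binom_pmf(k, n, p):
    """Log of the Binomial(n, p) probability mass at k."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log_pois_pmf(k, lam):
    """Log of the Poisson(lam) probability mass at k."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)


def hellinger_counts(n, p):
    """Hellinger distance between Binomial(n, p) and Poisson(n p)."""
    lam = n * p
    total = 0.0
    for k in range(0, 2 * n + 1):  # Poisson mass beyond 2n is negligible here
        b = math.exp(log_binom_pmf(k, n, p)) if k <= n else 0.0
        q = math.exp(log_pois_pmf(k, lam))
        total += (math.sqrt(b) - math.sqrt(q)) ** 2
    return math.sqrt(total)


h_small = hellinger_counts(1000, 0.01)  # small selection probability
h_large = hellinger_counts(1000, 0.30)  # large selection probability
print(h_small, h_large)
```

The distance shrinks with the selection probability, which is the sense in which a "small" D justifies the Poisson-mixing model.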
Definition 3. Let A ∈ N be a number larger than the support of X_i or T_i, e.g. the next natural number larger than G + s (see Definition 1). Let N_0 be a Poisson process with Z_0 ∼ Poi(A^2) and independently thereof (
The latter is different from a geometrically intuitive volume A^4. L(N_0) will now serve as the dominating measure in order to derive the Radon-Nikodym density of L(N*_n). But for that we will need the Radon-Nikodym density of ν_{n,D} w.r.t. ν_0, so that (see Billingsley, 2012, Formula (16.11)) one searches
For B = [0, x] × [0, t] and x ≤ A, t ≤ A
due to Fubini's theorem, with λ as the univariate Lebesgue measure, due to the differentiability, where (11) is used for the third equality, and Lemma 1 for the fourth together
Theorem 3. For Assumptions (A1)-(A4) and α_{θ_0} from (1), the model N*_n of Definition 2 has likelihood w.r.t. L(N_0) from Definition 3:
The proof is in Appendix B. The main idea is to decompose the density of the data, i.e. of L(N*_n), into the product of the density, conditional on N*_n(S), multiplied by the probability mass distribution of the Poisson distributed N*_n(S). The latter results in the very last factor of (16) including an exponential function in nα_{θ_0}. Note that by Fisher-Neyman factorization
We maximise the likelihood as a function in its second argument, the "generic" parameter θ, being already the notation in (3). For a thorough discussion about the parameter notation, we refer the reader to the maximum likelihood estimator as posterior mode in a Bayesian analysis with uniform prior (see e.g. Robert, 2001, Sect. 2.3). Finally note that, after taking the logarithm, the derivatives w.r.t. θ and n of (16) are equal to those of its intuitive counterpart (2) with n*_n(S) replaced by m (see (3)).

Monte Carlo Simulations
Our aim in this section is twofold. First, we illustrate the vanishing bias, i.e. consistency, stated theoretically by Theorem 1. Second, the notion of a "bias", referring to one model so far, can be extended to the "selection bias" comparing two models. We will assess such a design effect compared to the srs-design, as motivated theoretically after Lemma 2.
In order to illustrate consistency first, we assess the finite-sample bias as the average over the R = 1000 simulated differences (θ̂^(v) − θ_0). Table 1 (1st rows) lists the results, and it can be seen that the bias decreases to virtually zero. In order to show the decline in the mean squared error, consider the estimated standard error (10) of θ̂^(v). In Table 1 (2nd rows), averages over the (σ̂^(v))^2 seem to have a finite limit for increasing n. Hence, the standard error decreases at rate 1/√n.
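A small-scale version of such a consistency check could look as follows. This is a sketch, not the simulation design behind Table 1: it uses R = 200 instead of R = 1000 replications to keep the run time short, hypothetical parameter values (θ_0 = 0.1, G = 24, s = 3), and the estimating equation assumed from the closed form α_θ = (1 − e^{−θG})(1 − e^{−θs})/(θG).

```python
import math
import random


def cond_mean(theta, G, s):
    """Assumed E_theta(X_i) = 2/theta - G/(e^{theta G} - 1) - s/(e^{theta s} - 1)."""
    def term(c):
        y = theta * c
        return c * math.exp(-y) / (-math.expm1(-y))  # c / (e^y - 1), overflow-safe
    return 2.0 / theta - term(G) - term(s)


def mle_truncated(xs, G, s, eps=1e-3, iters=100):
    """Bisection for the zero of xbar - cond_mean(theta) on [eps, 1/eps]."""
    xbar = sum(xs) / len(xs)
    lo, hi = eps, 1.0 / eps
    if xbar - cond_mean(lo, G, s) > 0:
        return lo
    if xbar - cond_mean(hi, G, s) < 0:
        return hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if xbar - cond_mean(mid, G, s) > 0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def monte_carlo(n, theta0=0.1, G=24.0, s=3.0, R=200, seed=0):
    """Return (bias, root mean squared error) of the estimator over R replications."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(R):
        xs = []
        for _ in range(n):
            x, t = rng.expovariate(theta0), rng.uniform(0.0, G)
            if t <= x <= t + s:
                xs.append(x)
        if xs:  # skip the (very unlikely) empty truncated sample
            estimates.append(mle_truncated(xs, G, s))
    bias = sum(e - theta0 for e in estimates) / len(estimates)
    rmse = math.sqrt(sum((e - theta0) ** 2 for e in estimates) / len(estimates))
    return bias, rmse


results = {n: monte_carlo(n) for n in (500, 5000)}
for n, (bias, rmse) in results.items():
    print(n, bias, rmse)
```

Increasing the latent sample size by a factor of ten should shrink the root mean squared error by roughly √10 if the 1/√n rate holds.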
A by-product of the simulations is that they enable confirming the representation of σ^2 (in Theorem 2). On the one hand, σ^2 can be approximated via nVar(θ̂), i.e. by n times the simulated variance (Table 1 (3rd rows)). On the other hand, in a simulation (but not in an application, where n is unknown), σ^2 can be estimated as n times the square of (10) (Table 1 (2nd rows)). Both quantifications become equal for large n.
The relation of the standard error with respect to α_{θ_0} is also interesting. It decreases, obviously because α_{θ_0} is linearly related to the size of the truncated sample by m = nα_{θ_0} (see again (11)). The relation of α_{θ_0} to θ_0, s and G is already explained after its definition (1) and its respective sensitivity is presented in Table 1. There is one exception: although α_{θ_0} is decreasing in G, the simulated σ̂^2 does not increase, but instead decreases for a given n (Table 1 (left panels)). The reason can be suspected to be as in the srs-design, where the estimated standard error (17) is not only increasing in m of order 1/2, but also decreasing in Σ_{i=1}^m x_i of order 1, the latter being much larger for a large G (at given m).
Second, for the srs-design, applying (7) results in an MLE θ̂_srs = m / Σ_{i=1}^m x_i with standard error σ_srs/√m = θ_0/√(n*(S)) (i.e. σ^2_srs := θ_0^2). The latter can be estimated by inserting θ̂_srs, σ̂
The factor for "inflating" the variance from Theorem 2, denoted as Kish's design effect, is
Illustrating the design effect with the VIF is typical for the field of sampling techniques, especially in survey sampling. (By contrast, in the field of econometrics, variance inflation typically denotes the fact that standard errors increase for coefficients in a regression when accepting more covariates.) In the simulations, the VIF remains overall at a quite moderate size, with a tendency to increase in α_{θ_0}.
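For comparison, the srs-design estimate and its estimated standard error are available in closed form. The sketch below uses the two expressions stated above, θ̂_srs = m/Σ x_i and θ̂_srs/√m; the function name and the four durations are hypothetical.

```python
import math


def srs_estimate(xs):
    """MLE and estimated standard error under the simple random sample assumption."""
    m = len(xs)
    theta_srs = m / sum(xs)             # theta_hat_srs = m / sum of durations
    se_srs = theta_srs / math.sqrt(m)   # sigma_srs / sqrt(m) with theta_hat_srs inserted
    return theta_srs, se_srs


# Hypothetical observed durations (in years)
theta_srs, se_srs = srs_estimate([2.0, 5.0, 11.0, 6.0])
print(theta_srs, se_srs)
```

The estimated standard error equals √m/Σ x_i, which makes the dependence on m and on the sum of durations, discussed above, explicit.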
We will continue the comparison of designs in the applications of Section 5 where we will see a substantial variance inflation in all three applications.

Comparison of Estimation Results
The zero of (5), i.e. the point estimate θ̂, is found graphically, for instance for the first application by Figure 2(right). For the estimated standard error see (10). All estimates are in Table 2, which also contains the estimates under the srs-design (17).
It is evident that ignoring truncation overestimates the hazard θ_0 by, for example, 29% in the insolvency application, and also causes negative selection of units in the others. We observe that the standard error is underestimated by about 35% for all applications (equivalent to an average VIF = 2.5, as estimation of (18)), presumably through ignoring the stochastic dependence between units (and thus measurements) within the truncated sample. Also, variance inflation hardly seems to depend on the sample size.

Discussion
The results are encouraging, as even after truncation, asymptotic normality holds, and standard errors do not increase too much. The considerable selection bias can be accounted for easily and identification of the parameters follows from standard results on the exponential family.
However, it is somewhat unfortunate that standard consistency proofs for the Exponential family fail, because compactness of the parameter space is violated, even when re-parametrising, due to the growing sample size being a parameter itself. A temptation to withstand is to misinterpret the data as a simple random sample, only because statistical units are selected with equal probabilities (see (1)). This is especially tempting, because if the truncated sample were simple, not knowing n would be similar to not knowing the size of the population, requiring "finite-population corrections" only in the case of relatively many observations.
In practice, the considerable effort to account for truncation can even be circumvented in rich data situations by adjusting the population definition to start at the observation interval, however thereby excluding observable units (see e.g. Weißbach et al., 2009).
Of course, more advanced sampling designs exist, such as endogenous sampling, where units that have had a longer timeframe have a larger selection probability, in contrast to our model (see (1)). Also, truncation is typically analysed with counting process theory, focusing more on the role of the filtration as an information model (see e.g. Andersen et al., 1988). And with respect to robustness, the maximum likelihood method we use can be inferior to the method of moments (see e.g. Weißbach and Radloff, 2020; Rothe and Wied, 2020).
Nonetheless, we believe that our approach still offers some advantages: As

A Proof of Lemma 2
For (i), note first that by (A1), θ_0 ∈ Θ, and so is θ: For (x_j, t_j) ∉ D, ψ_θ(x_j, t_j) ≡ 0. Alternatively, due to Corollary 1, the first and second derivatives of α_θ are continuous, and therefore, so will be the third. Also α_θ, being (along with θ) the only component of a denominator in the first or second derivative of ψ_θ due to the quotient rule, is strictly positive.
For (ii): For the equality, due to Corollary 1, it is
For the positivity, we start by showing that for x > 0 or y > 0
Study its slope, g'(y) = 2e^{−2y} − (2e^{−y} − 2ye^{−y}) = 2e^{−2y} − 2e^{−y} + 2ye^{−y}, being equal to zero if and only if
The latter is only fulfilled for y = 0, due to the known inequality e^y > 1 + y for y ≠ 0, applied to −y. Now, y = 0 is not in the domain and hence, the slope of g does not change its sign. It is g(log(2)) = 0.06 and g(1) = 0.13, so that g is increasing and positive, due to lim_{y→0} g(y) = 0. Now proceed to
and similarly for G instead of s, both for i_j = 1.
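The two quoted values of g can be reproduced numerically. Since the display defining g is not available here, the sketch assumes g(y) = 1 − e^{−2y} − 2ye^{−y}, obtained by integrating the stated slope with lim_{y→0} g(y) = 0; the function names and the grid are hypothetical.

```python
import math


def g(y):
    """Assumed form of g, the antiderivative of the stated slope with g(0+) = 0."""
    return 1.0 - math.exp(-2.0 * y) - 2.0 * y * math.exp(-y)


def g_slope(y):
    """Slope as stated in the proof: 2e^{-2y} - 2e^{-y} + 2ye^{-y}."""
    return 2.0 * math.exp(-2.0 * y) - 2.0 * math.exp(-y) + 2.0 * y * math.exp(-y)


g_log2 = g(math.log(2.0))
g_one = g(1.0)
print(round(g_log2, 2), round(g_one, 2))

# The slope stays positive on a grid over (0, 10], so g is increasing there
slope_positive = all(g_slope(0.01 * k) > 0.0 for k in range(1, 1001))
print(slope_positive)
```

The positivity of the slope on the grid matches the argument via e^{−y} > 1 − y for y ≠ 0.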
For (v): The main idea of the proof is that in the event of a boundary minimum, the distance from Ψ_n(θ) to the θ-axis is smaller than to Ψ(θ), and that it will converge to the latter. Hence, after surpassing the axis, there will be a zero and Ψ_n(θ̂) = 0.

B Proof of Theorem 3
First we derive the density of L(N*_n) w.r.t. L(N_0) to be
It is easiest to start reading the line from the centre, where ι_k^{−1}(M) is short for {ι_k^{−1}(µ), µ ∈ M}. (Notation to be distinguished from sample size.) Similarly,
Hence by Lemma 3.1.1 of Reiss (1993),
In the first equality, the second condition, Z = k, results from the fact that whatever µ, it must be in M_k. For the first condition, the largest index for summation is originally Z, but can be replaced by k due to the second condition. (The order of conditions is irrelevant.) The second equality is due to the independence (see Definition 2). Similarly by Definition 3 for N_0:
Hence,
The last equality is due to (23). Now, due to Definitions 2 and 3, (13)(right),
Hence, by the display (24) of the distribution of L(N*_n), its density is, inserting (22),
for µ = Σ_{i=1}^k δ_{(X_i, T_i)}. Concluding from k to µ(S) and inserting the above displays, L(N*_n) (or more informally N*_n) has L(N_0)-density (21) (see Reiss, 1993, Theorem 3.1.1 and Example 3.1.1).