Abstract
Understanding the distribution of an event duration time is essential in many studies. The exact time to the event is often unavailable, and thus so is the full event duration. By linking relevant longitudinal measures to the event duration, we propose to estimate the duration distribution via the first-hitting-time model (e.g. Lee and Whitmore in Stat Sci 21(4):501–513, 2006). The longitudinal measures are assumed to follow a Wiener process with random drift. We apply a variant of the MCEM algorithm to compute likelihood-based estimators of the parameters in the longitudinal process model. This allows us to adapt the well-known empirical distribution function to estimate the duration distribution in the presence of missing time origin. Estimators with smooth realizations can then be obtained by conventional smoothing techniques. We establish the consistency and weak convergence of the proposed distribution estimator and present its variance estimation. We use a collection of wildland fire records from Alberta, Canada to motivate and illustrate the proposed approach. The finite-sample performance of the proposed estimator is examined by simulation. Viewing the available data as interval-censored times, we show that the proposed estimator can be more efficient than the well-established Turnbull estimator, an alternative that is often applied in such situations.
1 Introduction
The patterns of event duration times are of primary interest in many research studies. A close follow-up of each study individual who potentially experiences the event of interest is commonly unrealistic. Although the occurrence of the event may be reported, its start time is often unavailable, especially with so-called “silent” event occurrence (e.g. Balasubramanian and Lagakos 2003).
For example, understanding the distribution of the duration from the fire start time to the time when the work of suppressing the fire begins, i.e. the initial attack time, is important for predicting future fire growth and allocating suppression resources. A wildland fire is usually reported to the fire management agency by look-out towers or people in the area, and the fire manager then dispatches fire-fighting resources (e.g. Martell 2007; Morin 2014). The exact time when the fire starts is often unknown; what is recorded instead is the time when the fire is reported. As another example, many HIV/AIDS studies are concerned with the duration of HIV infection from the time of infection to the onset of an AIDS event; see, for example, Degruttola et al. (1991) and Doksum and Normand (1995). The infection is often detected considerably later than it occurs, and thus the exact HIV infection time is usually unavailable. Time to COVID-19 infection is a more recent example of this phenomenon.
A practical approach to handling data with missing time origins ignores the reporting delay and performs inference on the duration distribution using the observed portion of the event duration, that is, the duration between the report time and event termination. This naive approach yields clearly biased inference when the gap between the event onset and its report is nonignorable. Another common approach views the duration with missing time origin as interval-censored: the lower limit of the interval is the length of time that the event has been observed (which we refer to as \(L^*\)), and the upper limit is the sum of the observed duration and the longest possible reporting delay (which we refer to as \(R_{max}\)). Turnbull’s nonparametric maximum likelihood estimator (NPMLE; Turnbull 1976) can then be employed to estimate the distribution of the actual duration (which we refer to as L) using such manufactured interval-censored data. The resulting inference can be unsatisfactory, especially when the longest possible reporting delay is large relative to the observed portion of the event duration. In addition, the interval-censoring is likely informative in many situations, which invalidates Turnbull’s estimator. For example, fires can occur at varying distances from fire management resources, which makes the reporting delay S and the observed duration \(L^*\) vary together, so \(L^*\) and S may not be independent. That is, the interval \([L^*, L^*+R_{max}]\) provides information on L beyond the fact that \(L \in [L^*, L^*+R_{max}]\). This paper considers estimation of the event duration distribution with the aforementioned type of event time data under a first-hitting-time model using the event-associated longitudinal measures.
Many studies have readily available longitudinal measures associated with the event of interest. In reliability, for example, Lu and Meeker (1993) use degradation data to estimate the distribution of a failure time, taking the failure time to be the time when the degradation path hits a critical level. The concept of first-hitting-time has been widely applied. Various models have been used to formulate longitudinal measures, such as a Gamma process (e.g. Lawless and Crowder 2004), a Wiener process (e.g. Doksum and Hoyland 1992; Horrocks and Thompson 2004; Lee and Whitmore 2006; Pennell et al. 2009; Choi et al. 2014; Mulatya et al. 2016), and an inverse Gaussian process (e.g. Peng 2015).
We aim to make inference on the population of all reported fires. Most first-hitting-time models formulate the event time via a hypothetical/underlying process; a notable exception, to which our modelling is similar, is Mulatya et al. (2016). Using Brownian motion with random drift, we link the readily available longitudinal measures to the recorded times to locate the missing time origin. The dependence between the reporting delay and the observed portion of the duration is handled by conditioning on the random drift. We adapt the empirical distribution function, which requires independent and identically distributed (iid) observations, into an intuitive and easy-to-implement estimator of the distribution based on event durations with missing time origins. Conventional smoothing techniques, such as kernel methods, can be straightforwardly applied to smooth the proposed estimator. A collection of wildland fire records from Alberta, Canada is used to motivate and illustrate the proposed approach. However, potential applications of our approach are broad and not limited to wildland fire management studies.
The rest of the paper is organized as follows. Section 2 introduces a model for longitudinal measures of fire burnt area over time and proposes an estimator for the duration distribution aided by the longitudinal model using duration times with missing origins. It is straightforward to evaluate smoothed versions of the proposed estimator. We present procedures for estimating the parameters that are involved in the longitudinal model and required by the estimator for the duration distribution. We then derive the asymptotic properties of the distribution estimator and its variance estimation. Section 3 reports an analysis of wildland fire records with the proposed approach and Sect. 4 presents simulation studies conducted to examine finite-sample performance of the proposed estimator regarding its consistency, efficiency, and robustness. We also compare the performance of the proposed approach with that of the naive approach and of the Turnbull estimator. Some final remarks are given in Sect. 5.
2 Estimation of duration distribution in the presence of missing time origin
2.1 Notation and model
We formulate the aforementioned statistical problem in terms of wildland fire management. Following Parks (1964), Fig. 1 describes the development of a hypothetical wildland fire via the progression of its burnt area over time. The solid curve in the figure represents the burnt area over time of a fire that is subject to suppression after detection. The time point when suppression begins is referred to as the time of initial attack. The dashed curve shows the expected trajectory of the fire’s burnt area if it had continued to burn without any suppression or intervention. After ignition, the burnt area grows nonlinearly in time, and can be well approximated as exponential initially. Prior to initial attack, the dashed curve and the solid curve coincide. The times \(T_{S}, T_{R}\), and \(T_{F}\) in Fig. 1 are the calendar times when a fire starts, is reported, and initial attack begins, respectively.
The duration time of interest, denoted by L, is the elapsed time from the fire’s start to the time of initial attack, i.e., \(L=T_{F}-T_{S}\). We are concerned with situations where the true fire start time \(T_{S}\) is unavailable, and thus the time origin of the event duration is missing. Denote the unobservable reporting delay by \(S=T_R-T_S\). The observed portion of the duration is \(L^*=T_F-T_R=L-S\), the period between report time and initial attack time. Let A(u) be the burnt area at time u, where we assume there is no burnt area at the start time, i.e. \(A(u)=0\) when \(u=0\). Let \(B=A(S)\) and \(D=A(L)-A(S)\) denote the burnt area at the report time and the increase in area by the initial attack time, respectively.
Consider a collection of n independent wildland fires. We assume that the natural logarithm of fire i’s burnt area is \(A_i(u)=g_i(u)+\sigma _i W_i(u)\) for \(i=1,\cdots , n\), where \(g_i(u)\) is a nondecreasing function with \(g_i(0)=0\) and \(W_i(\cdot )\) is the standard Wiener process. As a fire usually grows unhindered until initial attack, we suppose \(\sigma _i\equiv \sigma \) and use a linear approximation to \(g_i(u)\) with random drift \(\nu _i=\nu e^{\delta _i}\), where the constant \(\nu \) is positive, and \(\delta _i\) is independent of \(W_i(\cdot )\) and follows a distribution \(\phi (\cdot ;\sigma _r)\) with \(E[\delta _i]=0\) and \(Var[\delta _i]=\sigma ^2_r\). This results in the model considered in this paper:
The random drift \(\nu _i\) characterizes the heterogeneity in fire growth among the individual fires. Note that \(\nu _i\) reduces to a constant drift when \(\sigma _r=0\). In the rest of this paper, we assume \(\delta _i\sim N(0,\sigma ^2_r)\) with \(\sigma _r\ge 0\).
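For intuition, the linearized log burnt-area model can be simulated directly on a time grid. The sketch below is illustrative only; the parameter values and grid are arbitrary, and \(A_i(u)=\nu e^{\delta _i}u+\sigma W_i(u)\) is the linear-drift form described above.

```python
import numpy as np

def simulate_log_area(nu, sigma, sigma_r, t_grid, rng):
    """One realization of A(u) = nu * exp(delta) * u + sigma * W(u),
    the linearized log burnt-area model with random drift."""
    delta = rng.normal(0.0, sigma_r)            # fire-specific effect, N(0, sigma_r^2)
    # independent Wiener increments over the grid spacings
    dW = rng.normal(0.0, np.sqrt(np.diff(t_grid, prepend=0.0)))
    return nu * np.exp(delta) * t_grid + sigma * np.cumsum(dW)

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 1001)
path = simulate_log_area(nu=2.0, sigma=0.5, sigma_r=0.5, t_grid=t, rng=rng)
```

Setting `sigma_r=0` recovers the constant-drift special case noted above.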
Under the Wiener process model for burnt-area (1), the reporting delay \(S_i\) (report time since the fire starts) can be viewed as a first-hitting-time: it is the time when fire i’s burnt area first reaches the threshold \(B_i\), the burnt area at the report time: \(S_{i} =\sup \{u: u>0, A_{i}(u)< B_{i}\}\), which is the same as \(\inf \{u: u>0, A_{i}(u)> B_{i}\}\) almost surely. According to Chhikara and Folks (1989), the first-hitting-time \(S_{i}\) follows the inverse Gaussian distribution given threshold \(B_i\), with the cumulative distribution function (CDF)
where \(\varPhi (\cdot )\) is the CDF of the standard normal distribution. Denote the observed data by
This paper focuses on estimation of \(F(\cdot )\), the CDF of the event duration using the observed data (3) with the following assumptions.
-
Assumption (A1). \(\{(T_{Ri},T_{Fi}, B_i, D_i,L_i):i=1,2,\cdots ,n\}\) is a collection of iid realizations of \(\big (T_{R},T_{F}, B, D,L\big )\), where \(L\sim F(\cdot )\).
-
Assumption (A2). \(L_i^*=T_{Fi}-T_{Ri}=L_i-S_i\) and \(S_i=T_{Ri}-T_{Si}\) are independent conditional on \(\nu _i\) for \(i=1,\ldots ,n\).
The assumptions may hold to a reasonable level of approximation in many practical situations. In our wildland fire application, Assumption (A2) states that the reporting delay (\(S_i\)) of fire i and the time from report to initial attack (\(L^*_i\)) are independent conditional on the fire spread rate \(\nu _i\). This is plausible since the fire agency often assesses a reported fire regarding its spread rate, and then arranges the initial attack accordingly. That is, \(L^*_i\) likely depends on \(S_i\) solely through \(\nu _i\).
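For reference, the first-hitting-time CDF \(G\) of Brownian motion with drift has a well-known closed form (Chhikara and Folks 1989). The sketch below is an illustration of that standard formula, not the paper's code; it assumes drift \(\nu >0\), diffusion \(\sigma \), and threshold \(b>0\).

```python
from math import erf, exp, sqrt

def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def hitting_time_cdf(s, b, nu, sigma):
    """CDF at s of the first time Brownian motion with drift nu and
    diffusion sigma hits the positive threshold b (inverse Gaussian law).
    Note exp(2*nu*b/sigma^2) can overflow for large b; fine for small inputs."""
    if s <= 0.0:
        return 0.0
    z = sigma * sqrt(s)
    return norm_cdf((nu * s - b) / z) + exp(2.0 * nu * b / sigma ** 2) * norm_cdf(-(nu * s + b) / z)

g = hitting_time_cdf(0.5, b=1.0, nu=2.0, sigma=0.5)  # roughly 0.57
```

Since \(\nu >0\), the threshold is hit with probability one, so the CDF tends to 1 as \(s\rightarrow \infty \).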
2.2 Procedures for estimating \(F(\cdot )\)
2.2.1 Review of the existing approaches
If all duration times \(L_i\) for \(i=1,\ldots , n\) were observed, the empirical distribution function, the nonparametric maximum likelihood estimator (NPMLE) based on the iid observations, could be applied to estimate the duration distribution: \({F}_n(t)=\sum _{i=1}^{n}I(L_i\le t)\big /n\), where \(I({\mathscr {E}})\) is the indicator function of event \({\mathscr {E}}\). Since a fire is usually reported after a delay, only \(L^*_i\), a portion of \(L_i=S_i+L^*_i\), is recorded. The aforementioned naive estimator is \(F^*_n(t)=\sum _{i=1}^{n}I(L^*_i\le t)\big /n\). It is clearly biased when \(P(S_i>0)\ne 0\).
Observe that \(L_i=L^*_i+S_i \in [L^{*}_{i}, L^{*}_{i}+R_{max}]\) with \(R_{max}\) the longest possible reporting delay as discussed in the existing literature. The current observations might then be cast as interval-censored event times. Turnbull’s self-consistent estimator (Turnbull 1976) can then be used to estimate the distribution with the interval-censored observations. Let \(0=\tau _0<\tau _1<\cdots <\tau _Q\) be the ordered unique values of \(\big \{\{L^*_{i}\}_{i=1}^{n},\{L^{*}_{i}+R_{max}\}_{i=1}^{n}\big \}\), and define \(\alpha _{iq}=I\big \{(\tau _{q-1},\tau _{q}]\subseteq (L^*_{i},L^{*}_{i}+R_{max}]\big \}\) and \(p_q=F(\tau _{q})-F(\tau _{q-1})\). Following Klein and Moeschberger (2003), the Turnbull estimator is the solution to the equations
for \(q=1,\ldots , Q\). However, the Turnbull estimator may not perform well in the situations of particular interest in this paper. Note that the Turnbull estimator is not uniquely defined over the whole positive real line but only up to an equivalence class of distributions that may differ only within the innermost intervals. Since \(R_{max}\) is large relative to \(L_i^*\) in our application, the data form a relatively small number of innermost intervals and thus often give a rather uninformative estimate. Moreover, the mechanism of interval-censoring in wildland fire studies may be informative since the observed \(L_i^{*}\) often depends on the reporting delay \(S_{i}\) through the fire spread rate \(\nu _i\). These considerations motivate us to propose an alternative estimator of the duration distribution \(F(\cdot )\) using available observations on the burnt-area process, which contain information related to the reporting delay.
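The self-consistency equations above can be solved by a simple EM-type iteration. The following sketch is illustrative, not Turnbull's original code: it builds the grid \(\tau _q\) and the indicators \(\alpha _{iq}\) from the manufactured intervals \((L^*_i, L^*_i+R_{max}]\) and repeats the update \(p_q \leftarrow n^{-1}\sum _i \alpha _{iq}p_q\big /\sum _{q'}\alpha _{iq'}p_{q'}\) until the probabilities stabilize.

```python
import numpy as np

def turnbull(L_star, R_max, tol=1e-8, max_iter=5000):
    """Self-consistent (EM) estimate of the interval probabilities p_q for
    the intervals (tau_{q-1}, tau_q] built from the censoring limits."""
    L_star = np.asarray(L_star, dtype=float)
    left, right = L_star, L_star + R_max
    tau = np.unique(np.concatenate(([0.0], left, right)))
    # alpha[i, q] = 1 if (tau[q], tau[q+1]] is contained in (left_i, right_i]
    alpha = (tau[None, :-1] >= left[:, None]) & (tau[None, 1:] <= right[:, None])
    Q = tau.size - 1
    p = np.full(Q, 1.0 / Q)
    for _ in range(max_iter):
        denom = alpha @ p                       # P(L in (L*_i, L*_i + R_max])
        p_new = (alpha * p / denom[:, None]).mean(axis=0)
        if np.max(np.abs(p_new - p)) < tol:
            p = p_new
            break
        p = p_new
    return tau, p
```

Mass within each innermost interval is not identified, which is the non-uniqueness noted above.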
2.2.2 Proposed estimator of \(F(\cdot )\)
By Model (1) and Assumption (A2), note that \(\text{ E }\big \{I(L_i\le t)|\text{ Observed-data }\big \} =\text{ P }\big (S_{i}\le t-L^*_i|{\mathbf {O}}_i\big )\) can be expressed as \( \int _{-\infty }^{\infty } G(t-L^*_i|B_i,\nu e^{\delta },\sigma ) {\phi }(\delta |{\mathbf {O}}_i;\nu ,\sigma , \sigma _r) \mathrm {d}\delta ,\) where \({\phi }(\cdot |{\mathbf {O}}_i;\nu ,\sigma ,\sigma _r)\) is the conditional distribution of \(\delta _i\) given the observed data associated with fire i as specified in (3). The consideration above suggests the following estimator, provided that the parameters \(\nu ,\sigma , \sigma _r\) are known:
We propose to replace parameters in (5) with their consistent estimators based on the available data. This results in a feasible distribution estimator,
abbreviated by \({\hat{F}}_n(t)\) in the rest of this paper. In Sect. 2.3, we present procedures for consistently estimating parameters \(\nu ,\sigma , \sigma _r\). To compute (6) numerically, one may approximate \({\hat{F}}_n(t)\) with
where \(\delta _i^{(1)}, \cdots , \delta _i^{(J)}\) are sampled independently from the estimated conditional distribution \({\phi }(\cdot |{\mathbf {O}}_i;{\hat{\nu }},{\hat{\sigma }},{\hat{\sigma }}_{r})\) for \(i=1,\ldots ,n\).
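In code, the Monte Carlo approximation (7) is a double average over fires and draws. The sketch below assumes the draws \(\delta _i^{(j)}\) (`delta_draws[i]`) are already available from the estimation step, and uses the closed-form inverse Gaussian CDF for \(G\); all names and values are illustrative.

```python
from math import erf, exp, sqrt

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def G(s, b, nu, sigma):
    """Inverse Gaussian first-hitting-time CDF at s: threshold b, drift nu."""
    if s <= 0.0:
        return 0.0
    z = sigma * sqrt(s)
    return norm_cdf((nu * s - b) / z) + exp(2.0 * nu * b / sigma ** 2) * norm_cdf(-(nu * s + b) / z)

def F_hat(t, L_star, B, delta_draws, nu_hat, sigma_hat):
    """Double average of G(t - L*_i | B_i, nu_hat * e^delta, sigma_hat)
    over fires i and posterior draws delta in delta_draws[i]."""
    total = 0.0
    n = len(L_star)
    for i in range(n):
        for d in delta_draws[i]:
            total += G(t - L_star[i], B[i], nu_hat * exp(d), sigma_hat)
    return total / (n * len(delta_draws[0]))
```

With all \(\delta \) draws set to zero this reduces to the constant-drift case, which is convenient for sanity checks.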
The proposed estimator in (5) is adapted from the empirical distribution function. Analogously, we can obtain a smoothed distribution estimator of \(F(\cdot )\) by adapting the kernel distribution estimator with all the duration observed. Recall that a kernel estimator of \(F(\cdot )\) with iid observed duration is \(F_{n,h}(t)=\sum _{i=1}^{n}K(\frac{t-L_i}{h})\big /n\), where \(K(t)=\int _{-\infty }^{t}k(u)\mathrm {d}u\) with \(k(\cdot )\) a kernel function and h being the bandwidth (e.g., Rosenblatt 1956). Its projection onto the available data space, \(\sum _{i=1}^{n}\text{ E }\big \{K\big (\frac{t-(L^*_i+S_{i})}{h}\big ) \big |{\mathbf {O}}_i\big \}\big /n\), yields the following estimator with smooth realizations, denoted by \({\hat{F}}_{n,h}(t)\):
When one deals with the situation where no random effect is involved and \(S_i\) is assumed to be uniformly distributed over \([0,R_{max}]\), the estimator \({\hat{F}}_{n,h}(t)\) reduces to the one discussed in Braun et al. (2005). Since the choice of bandwidth h is still under investigation, we focus on the estimator \({\hat{F}}_{n}(t)\) given in (6) for the rest of the paper.
2.3 Procedures for estimating parameters in Model (1): \(\varvec{\theta }=(\nu ,\sigma ,\sigma _r)\)
The log-likelihood function based on the available data is
where the contribution from fire i \(\log L_{obs}(\varvec{\theta }; {\mathbf {O}}_i)\) is \(\log \int _{0}^{\infty }\int _{-\infty }^{\infty } \big \{L_{obs,i|S,\delta }\big \}\mathrm {d}[S,\delta ]\) with \(L_{obs,i|S_i,\delta _i}= [D_i|L^*_i,\delta _i][B_i|S_{i},\delta _i]\). Here \([D_i|L^*_i,\delta _i]\) and \([B_i|S_{i},\delta _i]\) are the conditional distribution of \(D_{i}\) given \(L^*_i,\delta _i\) and the conditional distribution of \(B_{i}\) given \(S_{i}\) and \(\delta _i\), respectively. Under Model (1), \([D_i|L^*_i,\delta _i]\) and \([B_i|S_{i},\delta _i]\) are both normal, denoted by \(N(\nu e^{\delta _i} L^*_i, \sigma ^2 L^*_i)\) and \(N(\nu e^{\delta _i} S_{i}, \sigma ^2 S_{i})\), respectively.
We estimate \(\varvec{\theta }\) by maximizing \(\log L_{obs}(\varvec{\theta }\big |\text{ Observed-data})\). Denote the resulting estimator by \(\hat{\varvec{\theta }}_{n}\). One can use \(\hat{\varvec{\theta }}_{n}\) together with the collection of \({\underline{\delta }}^{(1)},\cdots , {\underline{\delta }}^{(J)}\) in the last iteration to compute (7) and (8) and to obtain \({\hat{F}}_n(\cdot )\) and \({\hat{F}}_{n,h}(\cdot )\), respectively. Here \({\underline{\delta }}^{(j)}\) for \(j=1,\ldots ,J\) are the n-dimensional vectors with the i-th components \(\delta _i^{(j)}\).
We apply the MCEM algorithm (Wei and Tanner 1990) to compute the MLE, and present details in Algorithm A below. The log-likelihood function of \(\varvec{\theta }\) based on the observed data (3) together with \({\underline{S}},{\underline{\delta }}\) is \(l_F(\varvec{\theta }|\text {Observed-data},{\underline{S}}, {\underline{\delta }}) = l_{F_1}(\nu ,\sigma |{\underline{S}},{\underline{\delta }}) + l_{F_2}(\varvec{\theta }; {\underline{S}},{\underline{\delta }})\), where
and \(l_{F_2}(\varvec{\theta }; {\underline{S}},{\underline{\delta }}) =\sum _{i=1}^{n}\log [S_i|\delta _i]+\sum _{i=1}^n\log \phi (\delta _i;\sigma _r)\).
Algorithm A For \(m=0,1,2,\cdots ,\) denote the estimate from the mth iteration by \({\varvec{\theta }}^{(m)} = ({\nu }^{(m)},{\sigma }^{(m)},{\sigma _r}^{(m)})\).
-
E-step. Approximate \(Q(\varvec{\theta },\varvec{\theta }^{(m)}) =\text{ E }\{l_F(\varvec{\theta }|\text {Observed-data},{\underline{S}}, {\underline{\delta }})|{\mathbf {O}},{\varvec{\theta }}^{(m)}\}\) as
$$\begin{aligned} \frac{1}{J}\sum _{j=1}^{J}l_F(\varvec{\theta }|\text {Observed-data}, {\underline{S}}^{(j)}, {\underline{\delta }}^{(j)})&=\frac{1}{J}\sum _{j=1}^{J}l_{F_1}(\nu ,\sigma |{\underline{S}}^{(j)}, {\underline{\delta }}^{(j)})\nonumber \\&\qquad + \frac{1}{J}\sum _{j=1}^{J} l_{F_2}(\varvec{\theta }; {\underline{S}}^{(j)},{\underline{\delta }}^{(j)}), \end{aligned}$$(10)where for \(j=1,2,\cdots ,J\), \((S_{i}^{(j)}, \delta _i^{(j)})\) is generated from the conditional distribution given the observed data with the current parameter estimate \({\varvec{\theta }}^{(m)}\),
$$\begin{aligned}{}[S,\delta |{\mathbf {O}}_i;\varvec{\theta }^{(m)}]=\frac{ L_{obs,i|S,\delta } ({\nu }^{(m)},{\sigma }^{(m)};S,\delta ) [S,\delta ;\varvec{\theta }^{(m)}]}{\int _{0}^{\infty }\int _{-\infty }^{\infty } L_{obs,i|S,\delta } ({\nu }^{(m)},{\sigma }^{(m)};S,\delta ) \mathrm {d}[S,\delta ;\varvec{\theta }^{(m)}]}. \end{aligned}$$(11) -
M-step. Maximize (10) with respect to \(\varvec{\theta }\) to obtain \({\varvec{\theta }}^{(m+1)}\).
Repeat Steps E and M until \(||{\varvec{\theta }}^{(m+1)}-{\varvec{\theta }}^{(m)}||<\epsilon \) for a pre-specified tolerance \(\epsilon \). The limit of the sequence \(\{\varvec{\theta }^{(m)}: m=1,2,\ldots \}\) is the MLE \(\hat{\varvec{\theta }}_{n}\).
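The E- and M-steps can be organized as a generic loop. The sketch below is schematic only: `sample_latent` stands in for the Metropolis-Hastings draws of \(({\underline{S}}^{(j)}, {\underline{\delta }}^{(j)})\) and `maximize_Q` for the maximization of (10); neither is the paper's implementation. The appended toy model (a normal mean with latent noise) only checks the loop structure.

```python
import numpy as np

def mcem(theta0, sample_latent, maximize_Q, J=200, eps=1e-4, max_iter=100):
    """Generic MCEM loop: the E-step draws J latent samples given the
    current theta, the M-step maximizes the Monte Carlo Q-function, and
    iteration stops when theta stabilizes (or at max_iter, since Monte
    Carlo noise may keep the updates fluctuating)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(max_iter):
        latent = [sample_latent(theta) for _ in range(J)]          # E-step
        theta_new = np.asarray(maximize_Q(latent), dtype=float)    # M-step
        if np.linalg.norm(theta_new - theta) < eps:
            return theta_new
        theta = theta_new
    return theta

# Toy check (not the paper's model): y_i = z_i + e_i with latent
# z_i ~ N(mu, 1) and e_i ~ N(0, 1); the fixed point of the iteration
# mu <- mean((y + mu) / 2) is the MLE y-bar.
rng = np.random.default_rng(0)
y = rng.normal(1.0, 2.0 ** 0.5, 200)
mu_hat = mcem(
    np.array([0.0]),
    sample_latent=lambda th: rng.normal((y + th[0]) / 2.0, 0.5 ** 0.5),
    maximize_Q=lambda latent: np.array([np.mean(latent)]),
    J=100, eps=1e-8, max_iter=200,
)
```

In practice one would also monitor the Monte Carlo error, since the stopping rule can be triggered or missed by simulation noise alone.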
We follow the Metropolis–Hastings algorithm (Metropolis et al. 1953; Hastings 1970) to generate \((S_{i}^{(j)}, \delta _i^{(j)})\) from the conditional distribution (11). The details are provided in Sect. S1.1 of the Supplementary Material. One should also note that \([S_i|\delta _i]\) in \(l_{F_2}(\varvec{\theta }; {\underline{S}},{\underline{\delta }})\) equals \([S_i|B_i,\delta _i][B_i|\delta _i]\big /[B_i|S_i,\delta _i]\), which does not have much additional information on the parameters \(\nu ,\sigma \). To ease the computational burden, one may replace (10) with the following to update \({\varvec{\theta }}^{(m)}\):
The maximizing procedure based on (12) leads to a variant of Algorithm A and results in \(\tilde{\varvec{\theta }}_{n}\), a close approximation to the MLE \(\hat{\varvec{\theta }}_{n}\). For the numerical studies presented in this paper, we choose \(J=200\), with which the algorithm converges.
We may obtain another estimator by maximizing the conditional likelihood function using only the observations on D and \(L^*\). The log conditional likelihood function is \(\log L^{c}_{obs}(\varvec{\theta }\big |\text{ Observed-data}) =\sum _{i=1}^n \log \int _{-\infty }^{\infty }[D_i|L^*_i,\delta ]\phi (\delta ;\sigma _r)d\delta \), and can be written as
The estimator obtained by maximizing (13) is likely less efficient but easier to implement. We describe the procedure to obtain the maximizer of (13) in Sect. S1.2 of the Supplementary Material. We refer to the second algorithm as Algorithm B and denote the estimators obtained from the two algorithms by \(\hat{\varvec{\theta }}_{n_A}\) and \(\hat{\varvec{\theta }}_{n_B}\) for the remainder of the paper. The estimate obtained from Algorithm B may be used as the initial estimate \(\varvec{\theta }^{(0)}\) for Algorithm A.
2.4 Asymptotic properties of \({\hat{F}}_n(t)\) and variance estimation
The proposed estimator \({\hat{F}}_n(t)\) using \(\hat{\varvec{\theta }}_{n_A}\) from Algorithm A in Sect. 2.3 has the following asymptotic properties.
Theorem 1
Under Assumptions (A1) and (A2) and Conditions (C1)-(C5) for the log-likelihood function in (9), the estimator \({\hat{F}}_n(t)\) has the following properties:
-
(i)
Uniform Consistency. \(\sup _{t\in [0,\tau ]}|{\hat{F}}_n(t)-F(t)|\overset{p}{\rightarrow }0\) as \(n\rightarrow \infty \).
-
(ii)
Weak Convergence. For \(t\in [0,\tau ]\), as \(n\rightarrow \infty \), \(\sqrt{n}({\hat{F}}_n(t)-F(t))\) converges weakly in \(\ell ^{\infty }([0,\tau ])\) to a tight, mean-zero Gaussian process \({\mathcal {G}}\) with covariance \(Cov( {\mathcal {G}}(t), {\mathcal {G}}(s))\) given by
$$\begin{aligned} {\left\{ \begin{array}{ll} \int _{0}^{\infty }\int _{0}^{\infty } M(t,l^*, b;\varvec{\theta }_0)M(s,l^*, b;\varvec{\theta }_0)h(l^*,b)dl^*db-F(t)F(s), &{} t\ne s \\ \int _{0}^{\infty }\int _{0}^{\infty } M(t,l^*, b;\varvec{\theta }_0)^2 h(l^*,b)dl^*db-F^2(t)&{}\\ \quad +\text{ E}_{\varvec{\theta }_0}[{\partial M(t,L^*_i, B_i;\varvec{\theta }_0)}\big /{\partial \varvec{\theta }}]'\varPi ^{-1}(\varvec{\theta }_0)\text{ E}_{\varvec{\theta }_0}[{\partial M(t,L^*_i, B_i;\varvec{\theta }_0)}\big /{\partial \varvec{\theta }}], &{}t=s, \end{array}\right. } \end{aligned}$$(14)where \(\varvec{\theta }_0\) is the true parameter, \(\varPi (\varvec{\theta }_0) =\text{ E }\big \{-\partial ^2 \log L_{obs}(\varvec{\theta };{\mathbf {O}}_i)\big /\partial \varvec{\theta }\varvec{\theta }^{'}\big \}\) is the same as \(\varSigma (\varvec{\theta }_0) =\text{ Var }\big \{\partial \log L_{obs}(\varvec{\theta };{\mathbf {O}}_i)\big /\partial \varvec{\theta }\big \}\) with \(\log L_{obs}(\varvec{\theta }; {\mathbf {O}}_i)\) the contribution from individual i to the log-likelihood function in (9), \(M(t,L^*_i, B_i;\varvec{\theta }) =\int _{-\infty }^{\infty } G(t-L^*_i|B_i,\nu e^{\delta },\sigma ) \phi (\delta |{\mathbf {O}}_i;\nu ,\sigma ,\sigma _{r}) \mathrm {d}\delta \), and \(h(l^*,b)\) is the joint probability density function of \(L^*_i\) and \(B_i\).
A proof of Theorem 1 is outlined in the Appendix. A consistent estimator of the covariance function in (14) results from substituting its unknown elements with the following estimators.
Note that \(\int _{0}^{\infty }\int _{0}^{\infty } M(t,l^*, b;\varvec{\theta }_0)^2 h(l^*,b)dl^*db\) can be approximated by \(n^{-1}\sum _{i=1}^{n}\big [\sum _{k=1}^{K}G(t-L^*_i|B_i,{\hat{\nu }}_n e^{\delta _i^{(k)}},{\hat{\sigma }}_n)\big /K\big ]^2\) with \(\delta _i^{(1)},\cdots , \delta _i^{(K)}\) obtained from the last iteration of Algorithm A in Sect. 2.3. We may similarly approximate \(\text{ E}_{\varvec{\theta }_0}\big [\partial M(t,L^*_i, B_i; \varvec{\theta })\big /\partial \varvec{\theta }\big ]\). Further, note that \({\widehat{\varPi }}_n(\varvec{\theta }_0)= -n^{-1}{\partial ^2\log L_{obs}(\varvec{\theta };\text {Observed-data})}\big /{\partial \varvec{\theta }\varvec{\theta }^{'}}\) converges in probability to \(\varPi (\varvec{\theta }_0)= \varSigma (\varvec{\theta }_0)\), and so does \({\widehat{\varSigma }}_n(\varvec{\theta }_0)=n^{-1}\text{ Var}_{\varvec{\theta }_0}\Big ( {\partial \log L_{obs}(\varvec{\theta };\text {Observed-data})} \Big /{\partial \varvec{\theta }}\Big )\). Thus, either \({\widehat{\varPi }}_n(\hat{\varvec{\theta }}_{n_A})\) or \({\widehat{\varSigma }}_n(\hat{\varvec{\theta }}_{n_A})\) can be used to estimate \(\varPi (\varvec{\theta }_0)=\varSigma (\varvec{\theta }_0)\).
2.5 Construction of confidence bands for \(F(\cdot )\)
Based on Theorem 1, we see that the process \(\sqrt{n}({\hat{F}}_{n}(t)-F(t))\big /\sqrt{\text{ var }(t)}\) converges weakly to the standard Gaussian process, where \(\text{ var }(t)\) is \(\text{ Cov }({\mathcal {G}}(t), {\mathcal {G}}(s))\) given in (14) with \(t=s\). Denote the consistent estimator of \(\text{ var }(t)\) obtained as described in Sect. 2.4 by \(\widehat{\text{ var }}(t)\). We employ the resampling approach of Hu and Lagakos (1999) and Zhao et al. (2008) to construct the following confidence band (CB) for the distribution \(F(\cdot )\).
The (\(1-\alpha \)) confidence band for \(F(\cdot )\) is
where the critical value \(c_\alpha \) is determined by the resampling scheme as follows. For \(t\in [0,\tau ]\), define
where \(Z_1,\cdots , Z_n \sim N(0,1)\) iid and are independent of the data. We compute \(c_\alpha \) as follows:
-
Step (i)
Generate M sets of independent realizations of \((Z_1,\cdots , Z_n)\) and, with each set, compute \({C}^{(m)}_{n}(\cdot )\) for \(m=1,\cdots ,M\).
-
Step (ii)
Choose \(c_\alpha \) as the \((1-\alpha )\) quantile of \(\sup _{t\in [0,\tau ]}|{C}^{(1)}_{n}(t)|,\ldots , \sup _{t\in [0,\tau ]}|{C}^{(M)}_{n}(t)|\).
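Computationally, the scheme is a multiplier bootstrap. In the sketch below, `psi` is an \(n\times T\) array holding estimated standardized per-fire contributions evaluated on a time grid; it is a placeholder for the ingredients of \({C}_{n}(\cdot )\), which we do not reproduce here.

```python
import numpy as np

def band_critical_value(psi, M=1000, alpha=0.05, rng=None):
    """Multiplier-bootstrap critical value: each replicate perturbs the
    rows of psi with iid N(0,1) multipliers Z_i, forms C_n(t) on the
    grid, and records sup_t |C_n(t)|; returns the (1 - alpha) quantile."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = psi.shape[0]
    sups = np.empty(M)
    for m in range(M):
        Z = rng.normal(size=n)
        C = (Z @ psi) / np.sqrt(n)      # C_n(t) evaluated on the grid
        sups[m] = np.max(np.abs(C))
    return float(np.quantile(sups, 1.0 - alpha))

psi_demo = np.random.default_rng(1).normal(size=(100, 20))  # placeholder contributions
c95 = band_critical_value(psi_demo, M=200)
```

The band then takes the form \({\hat{F}}_n(t)\pm c_\alpha \sqrt{\widehat{\text{ var }}(t)/n}\) over the grid.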
3 Analysis of Alberta Forest fire data
We now apply the proposed approach to analyze the wildland fire data that motivated this research. Alberta Agriculture and Forestry collected records of 603 lightning-caused fires that occurred in 10 wildland fire management areas of Alberta, Canada during the fire season from May to August in 2004. Each fire record contains the fire progression information: the times and the fire burnt area at the time of report and at the time of initial attack. As expected, the records do not include the exact fire start times.
Figure 2 shows the burnt area at the report times and at the initial attack times for the different regions. The 10 Alberta wildland fire management areas are classified into two groups: north and south. The north region includes Fort McMurray, High Level, Lac La Biche, Peace River, and Slave Lake; the south region includes Calgary, Edson, Grande Prairie, Rocky Mountain House, and Whitecourt. Table S1 in Sect. S2 of the Supplementary Material summarizes the burnt area for the two regions at the report times, the initial attack times, and the times when fires were extinguished. Fires in the north region tend to have larger burnt areas at the times of report and initial attack. The distributions of the burnt area are skewed, so we use the transformed version \(\log _{10} (\text {burnt area}+1)\) in the analysis.
The time of initial attack is when the first fire-fighting resource arrives at a wildland fire to prevent the fire from spreading, and to extinguish it if possible. It is believed that fires with a delayed initial attack may require a more substantial suppression effort.
Using the proposed approach, we estimate the distribution of the duration between the start of a fire and its initial attack. We consider two cases with Model (1): (i) \(\sigma _r=0\) (i.e., \(\nu _i=\nu \) for \(i=1,2,\cdots ,n\)), and (ii) \(\sigma _r\ge 0\). Table 1 presents the parameter estimates and the corresponding standard error estimates obtained by Algorithms A and B of Sect. 2.3. The standard errors are estimated using both the inverse Fisher information matrix and the sandwich variance estimator. We also provide computing times for each algorithm in Table 1. Algorithm B is computationally faster than Algorithm A, but it yields a less efficient estimator, as its estimated standard errors are larger than those of Algorithm A. The estimates of \(\sigma _r\) for the model with random drift are quite large, indicating considerable variation among the fires. This could be because fire spread rates depend on location and local weather.
We estimate the distribution of the duration by substituting the estimated model parameters into (7), and obtain the smoothed estimator based on (8). For comparison, we also evaluate the empirical distribution function based on the observed event duration, i.e. the naive estimator, and the Turnbull estimator, viewing the fire data as interval-censored with \(L_{i}\in [L^{*}_{i}, L^{*}_{i}+R_{max}]\). We set \(R_{max}= 6, 12\) or 48 hours for illustration. In fact, \(R_{max}\) could be up to 2 weeks (Wotton and Martell 2005). Figure 3 presents the estimated distributions of the times to initial attack with Algorithms A and B, together with approximate \(95\%\) pointwise confidence intervals (CIs) calculated using the estimated asymptotic variance given in (14).
Figure 3 shows that the naive distribution estimate and Turnbull’s estimates differ from the proposed estimates. Turnbull’s estimates deviate more from the proposed estimates as \(R_{max}\) increases. This is because a larger \(R_{max}\) leads to a wider interval \((L^*_i,L^*_i+R_{max})\) for \(L_i\); as a result, there are fewer disjoint innermost intervals within which the survivor function estimated by Turnbull’s method can jump. Comparing the estimates from the two algorithms, we see that Algorithm A produces a more efficient estimator. We also evaluate the kernel-smoothed estimator (8) presented in Sect. 2.2. The distribution estimates and their corresponding \(95\%\) CIs/CBs are in close agreement with the unsmoothed estimates.
Figure 4 presents scatterplots of the final burnt area versus the estimated duration times. The estimated duration is calculated as \({\tilde{L}}_i=L^*_i+{\tilde{S}}_{i}\), where \({\tilde{S}}_{i}\) is generated from the posterior distribution of the reporting delay \([S|\delta , {\mathbf {O}}_i;\varvec{\theta }^{(m)}]\) at the last iteration of the MCEM procedure in Algorithm A, for \(i=1,\cdots ,n\). We present scatterplots using three realizations of \({\tilde{L}}_i\) together with the scatterplot in Fig. 4a using the observed portion of the duration \(L^*_i\). The association between the final burnt area and the duration is noticeably stronger with the estimated duration. This suggests that the duration between fire start and initial attack may be more predictive of the final burnt area, and that accounting for the reporting delay is worthwhile when using the duration as a predictor of the final burnt area.
We applied the proposed procedure to analyze the fires from the north region and those from the south region separately. Table 2 gives the model parameter estimates, and Fig. 5 shows the estimated duration distributions. The estimate of \(\sigma _r\) for the north region is large and significantly different from zero, indicating larger variation across fires in that region; the south region has a smaller estimate of \(\sigma _r\).
4 Simulation studies
We conducted two simulation studies to examine the finite-sample performance of the proposed approach and to verify the findings from the data analysis. In the first simulation study, we generated data based on Model (1) to verify consistency and efficiency; in the second, we assessed the robustness of the approach to model misspecification.
4.1 Simulation A: Consistency and efficiency
To mimic the fire data, we simulated a study with \(n=300\) independent fires with the data of fire i for \(i=1,2,\cdots ,n\) generated as follows.
(i) Generate the burnt area process \(A_i(t), t\in [0,30]\) based on Model (1) with the parameter values \(\nu =2.0\) and \(\sigma =0.5\), and \(\delta _i\sim N(0,\sigma ^2_r)\) with \(\sigma _r=0, 0.5\), or 0.8.
(ii) Generate the size at the report time \(B_i \sim \text{ logNormal }(2.0,0.1)\), and determine the reporting delay as \(S_{i}=\max \{t|t\in [0,30], A_{i}(t)\le B_{i}\}\).
(iii) Generate \(L^*_i \sim \text{ Exp }(3.0 B_i^{-1})\), calculate the duration \(L_{i}=S_{i}+L^*_{i}\), and obtain the burnt area at the time of initial attack \(D_i=A_i(L_i)\).
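Steps (i)–(iii) can be sketched as follows. The grid width, the reading of 0.1 as the log-scale variance of the logNormal, and the reading of \(3.0 B_i^{-1}\) as the exponential rate are our assumptions; for illustration the path beyond \(S_i\) is continued with a fresh Wiener increment, ignoring the conditioning implied by \(S_i\) being the last crossing time.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_fire(nu=2.0, sigma=0.5, sigma_r=0.5, t_max=30.0, dt=0.01):
    """Generate one fire following steps (i)-(iii); dt is an illustrative
    grid width, not specified in the paper."""
    t = np.arange(0.0, t_max + dt, dt)
    delta = rng.normal(0.0, sigma_r)            # random effect in the drift
    drift = nu * np.exp(delta)
    # (i) Wiener process with drift: A(t) = drift*t + sigma*W(t)
    dW = rng.normal(0.0, np.sqrt(dt), size=len(t) - 1)
    A = np.concatenate(([0.0], np.cumsum(drift * dt + sigma * dW)))
    # (ii) size at report time; reporting delay = last time A(t) <= B
    B = rng.lognormal(mean=2.0, sigma=np.sqrt(0.1))   # 0.1 read as variance
    below = np.nonzero(A <= B)[0]
    S = t[below[-1]]
    # (iii) observed portion of the duration; burnt area at initial attack,
    # continuing the path from A(S) by an independent Wiener increment
    L_star = rng.exponential(scale=B / 3.0)     # Exp(3.0/B) read as the rate
    L = S + L_star
    D = A[below[-1]] + drift * L_star + sigma * np.sqrt(L_star) * rng.normal()
    return S, L_star, L, D
```

Repeating the call n = 300 times yields one simulated data set \(\{(B_i, S_i, L^*_i, D_i)\}\).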
Using the simulated data, we evaluated the estimator \(\tilde{\varvec{\theta }}_{n_A}\), the approximation to \(\hat{\varvec{\theta }}_{n_A}\), by the variant of Algorithm A in Sect. 2.3. We then obtained the corresponding duration distribution estimates.
Table 3 summarizes the parameter estimates based on 200 simulation repetitions. The sample means of the estimates obtained under the Wiener process model with a constant drift are close to the true parameter values for the scenario of \(\sigma _r=0\); the bias is evident when the true value of \(\sigma _r\) increases to 0.5 and 0.8. When we use a model with random drift, i.e. \(\sigma _r\ge 0\), the sample means of the estimates are close to the true parameter values for all three scenarios of \(\sigma _r\). This provides an empirical verification of the consistency of the two estimators, and suggests that, in practice, one need not assume \(\sigma _r=0\). Further, we estimated \(\varvec{\theta }\) by maximizing the conditional likelihood given in (13), which uses only observations on D and \(L^*\). The results are presented in Table S2 of Sect. S3.1 of the Supplementary Material. While the parameter estimates are similar to those obtained by Algorithm A, the sample means of the estimated standard errors are larger; we conclude that maximizing (13) may yield less efficient estimators. For both algorithms, the sample means of the estimated standard errors associated with the robust sandwich variance estimator are similar to the corresponding sample standard deviations of the estimates, which suggests that the proposed variance estimator performs sufficiently well at the simulation settings.
For each generated data set, we estimated the duration distribution by \({\hat{F}}_n(t)\) using the \(\varvec{\theta }\) estimate obtained from Algorithm A, and used \({\tilde{F}}_n(t;\nu ,\sigma ,\sigma _r)\) given in (5) with the true parameter values as a reference. The consistent variance estimator of (14) given in Appendix C was evaluated to construct confidence intervals (CIs). Assuming the drift of the Wiener process involves random effects, Fig. 6 shows the sample means of the 200 estimated distribution functions, together with the approximate conventional \(95\%\) CIs and their \(2.5\%\) and \(97.5\%\) sample quantiles. For comparison, each plot in Fig. 6 also includes the sample means of the 200 evaluations of the empirical distribution function \(F_n(\cdot )\) using the true durations, the empirical distribution function \(F^*_n(\cdot )\) using the observed durations (the naive approach), and the Turnbull estimator with \(R_{max}\) set to the third quartile and to the maximum of the reporting delays in each generated data set.
The estimates associated with the proposed approach are very close to those based on \({\tilde{F}}_n(t;\nu ,\sigma ,\sigma _r)\) using the true \(\varvec{\theta }\). At all simulation settings, both the approximate \(95\%\) CIs and the CIs using the \(2.5\%\) and \(97.5\%\) sample quantiles contain the empirical distribution function \(F_n(\cdot )\) obtained with the true durations. The naive estimates and Turnbull's estimates appear to be different from \(F_n(\cdot )\). The Turnbull estimator is highly dependent on the assumed value of \(R_{max}\); when \(R_{max}\) is much larger than \(L^*_i\), its performance deteriorates substantially. Histograms of the realizations of \(L^*_i\) presented in Fig. S1 of Sect. S3.1 of the Supplementary Material support this finding. For the scenario where the true value of \(\sigma _r\) is 0, the two values of \(R_{max}\) are relatively small and Turnbull's estimates are close to the proposed estimate, as shown in Fig. 6. When the true value of \(\sigma _r\) increases, \(R_{max}\) chosen as the maximum of the reporting delays in the generated data set becomes much greater than the maximum of \(L^*_i\), and the corresponding Turnbull's estimates depart much further from the proposed estimates. This is consistent with the outcome seen in the data analysis. Moreover, we evaluated the distribution estimator using \(\varvec{{\hat{\theta }}}_{n_B}\) obtained from Algorithm B (see Fig. S2 of Sect. S3.1 in the Supplementary Material) and the kernel-smoothed version of the proposed estimator. The behavior of these two estimates in comparison with the naive estimates and Turnbull's estimates is similar to that observed in Fig. 6.
We computed the pointwise sample mean squared errors of Turnbull's estimates, the proposed estimates, and the reference estimates based on \({\tilde{F}}_n(t;\nu ,\sigma ,\sigma _r)\). For any \(t\ge 0\), the proposed estimator has the smallest sample mean squared error, which demonstrates its relative efficiency over the naive estimator and the Turnbull estimator at all simulation settings. Figure S3 in Sect. S3.1 of the Supplementary Material presents the sample standard deviations and sample means of the estimated standard errors of the proposed distribution estimator with \(\tilde{\varvec{\theta }}_{n_A}\) by Algorithm A, together with those associated with the empirical distribution function and \({\tilde{F}}_n(\cdot ; \nu ,\sigma ,\sigma _r)\), both of which require more information than the data structure of interest provides. The plots in the figure show that the variation of the proposed estimator is comparable to that of \({\tilde{F}}_n(\cdot ; \nu ,\sigma ,\sigma _r)\) and is, in some settings, smaller than that of the empirical distribution function. This indicates that using the available information on fire growth can recover the efficiency loss due to the missing start times and, in some situations, even outperform the empirical distribution function, a nonparametric estimator of the duration distribution.
4.2 Simulation B: Robustness
We generated burnt area sample paths for a collection of simulated independent fires following the model \(A_{i}(t)=\nu _i t +\sigma _i W^*_{i}(t)\), \(i=1,2,\cdots ,n=300\), where \(\nu _i=\nu e^{\delta _i}\) with \(\delta _i \sim N(0,\sigma ^2_r)\), and \(W^*_i(\cdot )\) is a process with correlated increments. Specifically, the increments \(W^*_i(t_{k})-W^*_i(t_{k-1})\) for a partition \(t_k, k=1,2,\cdots ,K\) of the time period [0, 30] were generated from \(MN({\mathbf {0}}, \varSigma )\) with \((k,k')\) entry \(\varDelta t\, \rho ^{|k-k'|}\) for \(k\ne k'\) and \(\varDelta t\) for \(k= k'\), with \(\rho =0.2\). The observations on the variables \(B, S, L^*\), and D were generated in the same way as in Simulation A.
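The correlated-increment process \(W^*\) above can be generated as in the sketch below; the grid size K and the function name are illustrative choices of ours. Since \(\rho ^{0}=1\), the single formula \(\varDelta t\,\rho ^{|k-k'|}\) covers both the diagonal and off-diagonal entries of \(\varSigma \).

```python
import numpy as np

rng = np.random.default_rng(2)

def correlated_path(t_max=30.0, K=300, rho=0.2):
    """Sample path of W* on [0, t_max] whose K increments are
    MN(0, Sigma) with Sigma[k, k'] = dt * rho**|k - k'| (AR(1)-type
    correlation across increments, scaled by the grid width dt)."""
    dt = t_max / K
    k = np.arange(K)
    Sigma = dt * rho ** np.abs(k[:, None] - k[None, :])
    incr = rng.multivariate_normal(np.zeros(K), Sigma)
    return np.concatenate(([0.0], np.cumsum(incr)))  # W*(0) = 0
```

With \(\rho =0\) this reduces to independent increments, i.e. a standard Wiener path on the grid.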
We computed the \(\varvec{\theta }\) estimates and then the duration distribution estimates as if the data were generated from Model (1). Table S3 in Sect. S3.2 of the Supplementary Material summarizes the simulation outcomes over 200 replicates. The sample means of the \(\varvec{\theta }\) estimates under the assumed Wiener process model with random drift are close to the true parameter values. The sample means of the estimated standard errors from the robust estimator are fairly close to the sample standard deviations of the \(\varvec{\theta }\) estimates.
Figure S4 in Sect. S3.2 of the Supplementary Material presents the sample means of the estimated distribution functions from both Algorithms A and B, with the approximate \(95\%\) CIs and their \(2.5\%\) and \(97.5\%\) quantiles. In each plot, we also overlaid the sample means of the estimates from the empirical distribution function, \({\tilde{F}}_n(\cdot ; \nu ,\sigma ,\sigma _r)\), the naive estimator, and the Turnbull estimator. These plots indicate that the proposed estimator remains close to the empirical distribution function even when Model (1) does not hold.
We also explored scenarios where the burnt-area process is generated following other models, such as \(A_{i}(t)=\nu _i t^2 +\sigma _i W_{i}(t)\). The estimated duration distribution based on the proposed approach assuming Model (1) is again close to the empirical function using all the true duration times. This indicates that the proposed estimator \({\hat{F}}_n(\cdot )\) can be quite robust to model misspecification. Further investigation could lead to a systematic way of checking the validity of Model (1).
5 Final remarks
In this paper we propose procedures for estimating the distribution of event duration from observations with missing time origins. By employing the distribution of the first hitting time of a Wiener process, we link the distribution of the event duration with associated longitudinal measures. Both the simulation studies and the real data analysis show that the proposed approach performs well in predicting the times to initial attack, and demonstrate the importance of taking into account the duration between the unobserved start time and the later report time.
The proposed approach is applicable to many situations where event duration is of interest but the time origins in the duration observations are missing. Examples include predicting the length of the period from the unknown HIV infection to detection of infection by making use of longitudinal viral load measures (Doksum and Normand 1995), predicting the lifetime of trees by using longitudinal measures of their diameter at breast height (Thompson 2011), and, as suggested by a referee, estimating the onset time of a disease by utilizing longitudinal medical expenditure data such as the usage of prescription drugs and the cost of skilled nursing facilities. The idea underlying our approach could readily be applied under a different model for the longitudinal measures, e.g. Wang (2008); Wang and Xu (2010). It would be worth exploring the validity of the stochastic process for the longitudinal measures.
Several other investigations would be worthwhile. The target population in the wildland fire application of this paper includes only the fires that were reported and dispatched with initial attack resources. When a study aims to explore the whole physical development process of wildland fire, fires that were not reported should also be included in the population under consideration; the currently available wildland fire records are then length-biased. We suggest extending the proposed approach by adapting methods for estimating distributions with right-censored event times subject to length-biased sampling (e.g. Asgharian and Wolfson 2005; Huang and Qin 2011) to the situation where the origins of the duration times are missing.
Heterogeneity and correlation between fires should also be accounted for. Applying the proposed approach to the data stratified by fire region has revealed that the event duration distributions differ across regions; see Table 2 and Fig. 5, for example. The duration is likely related to fuel type and moisture content, as well as wind activity and local topography. To deal with this, as discussed in Wang (2010), we could follow Lawless and Crowder (2004) and specify the drift parameter \(\nu _i\) of Model (1) as a function of covariates. In addition, due to potential correlation between wildland fires, it would be of interest to extend the approach to account for spatio-temporal correlation. A third possibility is to follow Heitjan and Rubin (1990) to accommodate semi-continuous data, such as the rounded burnt-area records shown in Table S1 of Sect. S2 of the Supplementary Material.
More investigation is required to systematically determine J, the number of Monte Carlo samples in Algorithm A. We could incorporate automated data-driven strategies (e.g. Levine and Casella 2001; Caffo et al. 2005) into the current algorithm to choose an appropriate J at each iteration. This paper evaluates integrals by Monte Carlo integration. As suggested by a referee, it would be interesting to compare this approximation with other numerical integration approaches, such as the Gaussian quadrature rule.
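To illustrate that comparison, the normal-mixture integral \(\int G(t\mid b,\nu e^{\delta },\sigma )\,\phi (\delta ;0,\sigma _r^2)\,\mathrm {d}\delta \) can be approximated either by Monte Carlo sampling of \(\delta \) or by Gauss–Hermite quadrature. The sketch below takes the standard first-passage CDF of a Wiener process with drift (the inverse Gaussian form) as a stand-in for G; all function names, the tail-safe threshold, and the tuning constants are ours.

```python
import numpy as np
from math import erf, exp, sqrt, pi

def norm_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def hitting_cdf(s, b, mu, sigma):
    """First-passage CDF at time s of a Wiener process with drift mu > 0
    and diffusion sigma crossing level b > 0 (inverse Gaussian form)."""
    if s <= 0.0:
        return 0.0
    z = sigma * sqrt(s)
    c = -(mu * s + b) / z
    term1 = norm_cdf((mu * s - b) / z)
    if c < -8.0:
        # Tail-safe form of exp(2*mu*b/sigma^2) * Phi(c) via Mills' ratio,
        # using 2*mu*b/sigma^2 - c^2/2 = -(mu*s - b)^2 / (2*sigma^2*s).
        term2 = exp(-(mu * s - b) ** 2 / (2.0 * sigma**2 * s)) / (sqrt(2.0 * pi) * (-c))
    else:
        term2 = exp(2.0 * mu * b / sigma**2) * norm_cdf(c)
    return term1 + term2

def mixture_cdf_mc(t, b, nu, sigma, sigma_r, n=20000, seed=0):
    """Monte Carlo: average the hitting CDF over delta ~ N(0, sigma_r^2)."""
    delta = np.random.default_rng(seed).normal(0.0, sigma_r, size=n)
    return float(np.mean([hitting_cdf(t, b, nu * exp(d), sigma) for d in delta]))

def mixture_cdf_gh(t, b, nu, sigma, sigma_r, n=40):
    """Gauss-Hermite: E[f(delta)] ~ sum_j w_j f(sqrt(2)*sigma_r*x_j) / sqrt(pi)."""
    x, w = np.polynomial.hermite.hermgauss(n)
    vals = [hitting_cdf(t, b, nu * exp(sqrt(2.0) * sigma_r * xi), sigma) for xi in x]
    return float(np.dot(w, vals) / sqrt(pi))
```

For a smooth bounded integrand like this one, a few dozen Gauss–Hermite nodes typically match a Monte Carlo estimate with tens of thousands of draws, at a fraction of the cost.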
References
Asgharian M, Wolfson DB (2005) Asymptotic behavior of the unconditional NPMLE of the length-biased survivor function from right censored prevalent cohort data. Ann Stat 33(5):2109–2131
Balasubramanian R, Lagakos SW (2003) Estimation of a failure time distribution based on imperfect diagnostic tests. Biometrika 90(1):171–182
Braun J, Duchesne T, Stafford JE (2005) Local likelihood density estimation for interval censored data. Can J Stat 33(1):39–60
Caffo BS, Jank W, Jones GL (2005) Ascent-based Monte Carlo expectation-maximization. J R Stat Soc Ser B (Stat Methodol) 67(2):235–251
Chhikara RS, Folks JL (1989) The inverse Gaussian distribution: theory, methodology, and applications. Marcel Dekker Inc, New York
Choi S, Huang X, Cormier JN, Doksum KA (2014) A semiparametric inverse-Gaussian model and inference for survival data with a cured proportion. Can J Stat 42(4):635–649
Degruttola V, Lange N, Dafni U (1991) Modeling the progression of HIV infection. J Am Stat Assoc 86(415):569–577
Doksum KA, Hoyland A (1992) Models for variable-stress accelerated life testing experiments based on Wiener processes and the inverse Gaussian distribution. Technometrics 34(1):74–82
Doksum KA, Normand SLT (1995) Gaussian models for degradation processes-part I: methods for the analysis of biomarker data. Lifetime Data Anal 1(2):131–144
Hastings WK (1970) Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57(1):97–109
Heitjan DF, Rubin DB (1990) Inference from coarse data via multiple imputation with application to age heaping. J Am Stat Assoc 85(410):304–314
Horrocks J, Thompson ME (2004) Modeling event times with multiple outcomes using the Wiener process with drift. Lifetime Data Anal 10(1):29–49
Hu XJ, Lagakos SW (1999) Interim analyses using repeated confidence bands. Biometrika 86(3):517–529
Huang CY, Qin J (2011) Nonparametric estimation for length-biased and right-censored data. Biometrika 98(1):177–186
Klein J, Moeschberger M (2003) Survival analysis: statistical methods for censored and truncated data. Springer-Verlag, New York
Kosorok MR (2008) Introduction to empirical processes and semiparametric inference. Springer, Berlin
Lawless J, Crowder M (2004) Covariates and random effects in a gamma process model with application to degradation and failure. Lifetime Data Anal 10(3):213–227
Lee MLT, Whitmore GA (2006) Threshold regression for survival analysis: modeling event times by a stochastic process reaching a boundary. Stat Sci 21(4):501–513
Levine RA, Casella G (2001) Implementations of the Monte Carlo EM algorithm. J Comput Graph Stat 10(3):422–439
Lu CJ, Meeker WO (1993) Using degradation measures to estimate a time-to-failure distribution. Technometrics 35(2):161–174
Martell DL (2007) Forest fire management. In: Handbook of operations research in natural resources. Springer, pp. 489–509
Metropolis N, Rosenbluth AW, Rosenbluth MN, Teller AH, Teller E (1953) Equation of state calculations by fast computing machines. J Chem Phys 21(6):1087–1092
Morin AA (2014) A spatial analysis of forest fire survival and a marked cluster process for simulating fire load. Master’s thesis, The University of Western Ontario, London, Ontario, Canada
Mulatya CM, McLain AC, Cai B, Hardin JW, Albert PS (2016) Estimating time to event characteristics via longitudinal threshold regression models-an application to cervical dilation progression. Stat Med 35(24):4368–4379
Parks GM (1964) Development and application of a model for suppression of forest fires. Manag Sci 10(4):760–766
Peng CY (2015) Inverse Gaussian processes with random effects and explanatory variables for degradation data. Technometrics 57(1):100–111
Pennell ML, Whitmore G, Ting Lee ML (2009) Bayesian random-effects threshold regression with application to survival data with nonproportional hazards. Biostatistics 11(1):111–126
Rosenblatt M (1956) Remarks on some nonparametric estimates of a density function. Ann Math Stat 27:832–837
Serfling RJ (1980) Approximation theorems of mathematical statistics. Wiley, New Jersey
Thompson DJ (2011) Survival models for data arising from multiphase hazards, latent subgroups or subject to time-dependent treatment effects. PhD thesis, Simon Fraser University, Burnaby, BC, Canada
Turnbull BW (1976) The empirical distribution function with arbitrarily grouped, censored and truncated data. J R Stat Soc Ser B (Methodol) 38(3):290–295
Wang X (2008) A pseudo-likelihood estimation method for nonhomogeneous gamma process model with random effects. Stat Sin 18:1153–1163
Wang X (2010) Wiener processes with random effects for degradation data. J Multivar Anal 101(2):340–351
Wang X, Xu D (2010) An inverse Gaussian process model for degradation data. Technometrics 52(2):188–197
Wei GC, Tanner MA (1990) A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithms. J Am Stat Assoc 85(411):699–704
Wotton B, Martell DL (2005) A lightning fire occurrence model for Ontario. Can J For Res 35(6):1389–1401
Zhao L, Hu XJ, Lagakos SW (2008) Statistical monitoring of clinical trials with multivariate response and/or multiple arms: a flexible approach. Biostatistics 10(2):310–323
Acknowledgements
We are grateful to N. McLoughlin of Alberta Wildfire Management Branch for providing the wildland fire data. We thank G.A. Whitmore for his many constructive suggestions, and the associate editor and two referees for their helpful comments and suggestions. The research was supported by grants from the Natural Sciences and Engineering Research Council of Canada (NSERC) and a CRT grant from the Canadian Statistical Sciences Institute (CANSSI).
Funding
Funding was provided by the Natural Sciences and Engineering Research Council of Canada (Grant No. NSERC DG 116860; Grant No. NSERC DG and DAS 177430).
Appendices
Appendix: Technical details of Sect. 2.4
This appendix outlines our derivation for the asymptotic properties of the estimator \({\hat{F}}_n(t)\) in Sect. 2.4. To simplify the notation, we define \(\varvec{\theta }=(\nu ,\sigma ,\sigma _r)\) and use \(\varvec{\theta }_0\) to represent the true values of the parameters. Denote \({\tilde{F}}_n(t;{\nu }_0, {\sigma }_0,{\sigma }_{r0})\) by \({\tilde{F}}_n(t)\). We focus on the realization of the proposed estimator \({\hat{F}}_n(\cdot )\) using \(\hat{\varvec{\theta }}_{n_A}\) obtained by Algorithm A. The derivation can be adapted to handle \({\hat{F}}_n(\cdot )\) using \(\hat{\varvec{\theta }}_{n_B}\) by Algorithm B with little change.
The following regularity conditions are assumed for Model (1). We adapt the standard regularity conditions for the MLE summarized by Serfling (1980).
Condition (C1). The parameter space \(\varTheta \) is compact and the true parameter \(\varvec{\theta }_0\) is an interior point of \(\varTheta \).
Condition (C2). The first, second, and third partial derivatives of the log-likelihood function given in (9) with respect to \(\varvec{\theta }\) exist for each \(\varvec{\theta } \in \varTheta \).
Condition (C3). Differentiation and integration are interchangeable for the first, second, and third partial derivatives of the log-likelihood function with respect to \(\varvec{\theta }\).
Condition (C4). The third partial derivatives of the log-likelihood function with respect to \(\varvec{\theta }\) are dominated by a fixed integrable function \(M_{2}(\cdot )\) for every \(\varvec{\theta } \in \varTheta \).
Condition (C5). \(\varSigma (\varvec{\theta }_0)=\text{ E}_{\varvec{\theta }_0}\big [\big (\partial \log L_{obs}(\varvec{\theta };{\mathbf {O}}_i)\big /\partial \varvec{\theta }\big )\big ( \partial \log L_{obs}(\varvec{\theta };{\mathbf {O}}_i)\big /\partial \varvec{\theta }\big )^{\prime }{}\big ]\) exists and is positive definite, where \(\log L_{obs}(\varvec{\theta };{\mathbf {O}}_i)\) is the ith term in (9). It equals \(\varPi (\varvec{\theta }_0) = -\text{ E}_{\varvec{\theta }_0}\big [ \partial ^2 \log L_{obs}(\varvec{\theta };{\mathbf {O}}_i) \big / \partial \varvec{\theta }\varvec{\theta }^{'}\big ]\).
Appendix A: Proof of Theorem 1(i) Strong consistency
To establish the consistency of the proposed estimator, we first note that the MLE \(\hat{\varvec{\theta }}_{n_A}\) is consistent and asymptotically normal. That is, under (C1)–(C5), \(\hat{\varvec{\theta }}_{n_A} \overset{a.s.}{\rightarrow }{\varvec{\theta }}_0\) as \(n\rightarrow \infty \), and \(\sqrt{n}(\hat{\varvec{\theta }}_{n_A} - {\varvec{\theta }}_0) \overset{d}{\rightarrow }N(0, \varPi ^{-1}(\varvec{\theta }_0))\). Here \(\varPi (\varvec{\theta }_0) =\varSigma (\varvec{\theta }_0)\).
Let \(M(t, L^*_i,B_i;\varvec{\theta }_0)=\int _{-\infty }^{\infty } G(t-L^*_i|B_i,\nu _0e^{\delta },\sigma _0) \phi (\delta |{\mathbf {O}}_i,\nu _0,\sigma _0, \sigma _{r0}) \mathrm {d}\delta \). Define the class of functions \({\mathcal {F}}=\{M(t, L^*_i,B_i;\varvec{\theta }_0):t\in [0,\tau ]\}\). Let \({\mathbb {P}}_n\) and P denote the empirical measure and the probability measure of the n i.i.d. observations \(\{(L^*_i,B_i),i=1,2,\cdots ,n\}\). Thus, \({\mathbb {P}}_n M(t, L^*_i,B_i;\varvec{\theta }_0) =\sum _{i=1}^{n}M(t, L^*_i,B_i;\varvec{\theta }_0)\big /n\) and \(P M(t, L^*_i,B_i;\varvec{\theta }_0) =\text{ E }\big [M(t, L^*_i,B_i;\varvec{\theta }_0)\big ]\). Note that \({\tilde{F}}_n(t)\) can be written as \( {\mathbb {P}}_n M(t, L^*_i,B_i;\varvec{\theta }_0)\) and \({\hat{F}}_n(t)\) can be written as \( {\mathbb {P}}_n M(t, L^*_i,B_i;\varvec{{\hat{\theta }}}_{n_A})\). Since \(G(t-L^*_i|B_i,\nu _0 e^{\delta _i},\sigma _0) \) is monotone in t, the integral \(M(t, L^*_i,B_i;\varvec{\theta }_0)\) is also monotone in t. By Lemma 4.1 in Kosorok (2008), \({\mathcal {F}}\) is P-Donsker. The Glivenko–Cantelli property of \({\mathcal {F}}\) yields
Note that \(\text{ E }\big [M(t, L^*_i,B_i;\varvec{\theta }_0)] =\text{ E }[\text{ E }\{G(t-L^*_i|B_i,\nu _0 e^{\delta _i},\sigma _0) |{\mathbf {O}}_i\}\big ]=F(t)\). Thus \(\underset{t\in [0,\tau ]}{\sup }\Big |{\tilde{F}}_n(t)-F(t)\Big |\overset{p}{\rightarrow }0\).
We then write \({\hat{F}}_n(t)-{\tilde{F}}_n(t)\) as \(n^{-1}\sum _{i=1}^{n}\Big [M(t, L^*_i,B_i;\varvec{{\hat{\theta }}}_{n_A}) - M(t, L^*_i,B_i;\varvec{{\theta }}_0)\Big ]\), which is
For every \(t\in [0,\tau ]\) and \(\nu ,\sigma ,\sigma _r\) belonging to a compact set \(\varTheta \), \(|G(t-L^*_i|B_i,\nu e^{\delta _i},\sigma ) \phi (\delta _i|{\mathbf {O}}_i;\nu ,\sigma ,\sigma _{r})|\) is bounded. Therefore, we can interchange integration and differentiation. Then,
The derivatives \(\partial G(t-L^{*}_{i}|B_i,\nu e^{\delta _i},\sigma )\big /\partial \nu \) and \(\partial G(t-L^{*}_{i}|B_i,\nu e^{\delta _i},\sigma )\big /\partial \sigma \) are respectively
where \(A_1=\sqrt{\frac{B_i}{\sigma ^2(t-L^*_i)}}\Big (\frac{(t-L^*_i)\nu e^{\delta _i}}{B_i}-1\Big )\) and \(A_2=-\sqrt{\frac{B_i}{\sigma ^2(t-L^*_i)}}\Big (\frac{(t-L^*_i)\nu e^{\delta _i}}{B_i}+1\Big )\).
The derivatives in (16) are uniformly bounded in probability for every \(t\in [0,\tau ]\) and \(\nu ,\sigma ,\sigma _r\) belonging to a compact set \(\varTheta \). Therefore, \(n^{-1}\sum _{i=1}^{n}\partial \text{ E}_{\varvec{\theta }^*} \big [ G(t-L^*_i|B_i,\nu e^{\delta _i},\sigma )|{\mathbf {O}}_i\big ] \big /\partial \varvec{\theta }\) is uniformly bounded in probability. This, together with the fact that \(\hat{\varvec{\theta }}_{n_A} \overset{a.s.}{\rightarrow }{\varvec{\theta }}_0\), leads to \(\sup _{t\in [0,\tau ]}| {\hat{F}}_n(t)-{\tilde{F}}_n(t)|\overset{p}{\rightarrow }0\) as \(n\rightarrow \infty \). By (15), it follows that \(\sup _{t\in [0,\tau ]}| {\hat{F}}_n(t)-F(t)|\overset{p}{\rightarrow }0\).
Appendix B: Proof of Theorem 1(ii) Weak convergence
Note that \(\sqrt{n}({\hat{F}}_n(t)-F(t)) =\sqrt{n}\Big ({\mathbb {P}}_n M(t, L^*_i,B_i;\hat{\varvec{\theta }}_{n_A})-P M(t, L^*_i,B_i;\varvec{\theta }_0)\Big )\) is
Since the class of functions \({\mathcal {F}}=\{M(t, L^*_i,B_i;\varvec{\theta }_0):t\in [0,\tau ]\}\) is P-Donsker, the first term \( \sqrt{n}\Big ({\mathbb {P}}_n M(t, L^*_i,B_i;\hat{\varvec{\theta }}_{n_A})-P M(t, L^*_i,B_i;\varvec{\theta }_0)\Big )\) converges weakly to a tight, mean-zero Gaussian process in \(\ell ^{\infty }([0,\tau ])\), whose variance is
where \(h(l^*,b)\) is the density of the joint distribution of \(L^*\) and B.
Also, for the second term on the right hand side (RHS) of (17), we have
The third term on RHS of (17) can be written as \( \sqrt{n}(\hat{\varvec{\theta }}_{n_A}-\varvec{\theta }_0)\text{ E}_{\varvec{\theta }_0}\big [ \partial M(t, L^*_i,B_i;\varvec{\theta })\big /{\partial \varvec{\theta }}\big ]+o_{p}(1)\), which is
Therefore, \({\hat{F}}_n(t)\) is asymptotically linear with influence function:
Since this influence function is P-Donsker, \(\sqrt{n}({\hat{F}}_n(t)-F(t))\) converges weakly to a tight, mean-zero Gaussian process \({\mathcal {G}}\), as \(n\rightarrow \infty \).
Appendix C: Variance estimation in Theorem 1(ii)
To calculate the variance of \({\mathcal {G}}\), we write \(\text{ Var }\Big ( \sqrt{n}\big ({\hat{F}}_n(t)-F(t)\big )\Big )\) as
The second term on the RHS of (19) is given in (18), and the third term can be calculated as \(n\text{ E }\Big [\big ({\hat{F}}_n(t)-{\tilde{F}}_n(t)\big )\big ({\tilde{F}}_n(t)-F(t)\big )\Big ]\). Noting that \(|{\hat{F}}_n(t)|\le 1\), \(|{\tilde{F}}_n(t)|\le 1\), \(|{\hat{F}}_n(t)-{\tilde{F}}_n(t)|\le 1\), and \(\sup _{t\in [0,\tau ]}|{\hat{F}}_n(t)-{\tilde{F}}_n(t)|\overset{p}{\rightarrow }0\), it is straightforward to show by the bounded convergence theorem that \(\text{ E }\Big [\big ({\hat{F}}_n(t)-{\tilde{F}}_n(t)\big )\big ({\tilde{F}}_n(t)-F(t)\big )\Big ]\rightarrow 0\). Therefore, \(\text{ Cov }\Big (\sqrt{n}\big ({\hat{F}}_n(t)-{\tilde{F}}_n(t)\big ), \sqrt{n}\big ({\tilde{F}}_n(t)-F(t)\big )\Big )\rightarrow 0\).
The first term, \(\text{ Var }\Big (\sqrt{n}\big ({\hat{F}}_n(t)-{\tilde{F}}_n(t)\big )\Big )\), can be calculated as
Note that \(\sum _{i=1}^{n}\frac{\partial M(t, L^*_i,B_i;{\varvec{\theta }}) }{\partial \varvec{\theta }}\big /n|_{\varvec{\theta }=\varvec{\theta }^*}\overset{p}{\rightarrow }\text{ E}_{\varvec{\theta }_0}\big [\partial M(t, L^*_i,B_i;{\varvec{\theta }})\big /\partial \varvec{\theta }\big ]\) and the asymptotic variance of \(\hat{\varvec{\theta }}_{n_A}\) is \(\varPi ^{-1}(\varvec{\theta }_0)\). Thus \(\text{ Var }\Big (\sqrt{n}\big ({\hat{F}}_n(t)-{\tilde{F}}_n(t)\big )\Big )\) converges in probability as \(n\rightarrow \infty \) to
The covariance of the limiting Gaussian process of \( \sqrt{n}\Big ({\hat{F}}_n(t)-F(t)\Big )\) is as given in (14) in Theorem 1 (ii). Further, one can use \(\sum _{i=1}^{n}\frac{\partial M(t, L^*_i,B_i;{\varvec{\theta }}) }{\partial \varvec{\theta }}\big /n|_{\varvec{\theta }=\varvec{{\hat{\theta }}}_{n_A}}\) to estimate \(\text{ E}_{\varvec{\theta }_0}\big [\partial M(t, L^*_i,B_i;{\varvec{\theta }})\big /\partial \varvec{\theta }\big ]\) and use \(\varPi (\varvec{{\hat{\theta }}}_{n_A})\) to estimate \(\varPi (\varvec{{\theta }}_{0})\).
Xiong, Y., Braun, W.J. & Hu, X.J. Estimating duration distribution aided by auxiliary longitudinal measures in presence of missing time origin. Lifetime Data Anal 27, 388–412 (2021). https://doi.org/10.1007/s10985-021-09520-w