Abstract
A new algorithm is presented and studied in this paper for fast computation of the nonparametric maximum likelihood estimate of a U-shaped hazard function. It successfully overcomes a major difficulty: a U-shaped hazard function is only properly defined once its anti-mode is known, yet the anti-mode itself has to be found during the computation. Specifically, the new algorithm maintains the constant hazard segment, regardless of whether its length is zero or positive. The length varies naturally, according to the mass values allocated to the associated knots after each update. Being an appropriate extension of the constrained Newton method, the new algorithm also inherits its advantage of fast convergence, as demonstrated by some real-world data examples. The algorithm works not only for exact observations, but also for purely interval-censored data and for data with a mixture of exact and interval-censored observations.
Introduction
Estimation of a survival function is important in fields such as biomedical studies and reliability engineering. Parametric models are used in many applications, for their straightforward implementation and ease of analysis. To avoid the strict assumptions that are inherent in parametric models and may lead to biased inference and invalid conclusions, one can resort to nonparametric approaches. As a nonparametric maximum likelihood estimator, the renowned Kaplan–Meier estimator of a survival function has been widely used in the case of right-censored data (Kaplan and Meier 1958). There is also an increasing research interest in nonparametric maximum likelihood estimation for general interval-censored data (Wellner 1995; Schick and Yu 2000), which does not have a closed-form solution and thus requires numerical algorithms (Peto 1973; Turnbull 1974; Dümbgen et al. 2006; Wang 2008).
Sometimes in practice, vague prior knowledge may be available, such as knowledge about the shape of the distribution of survival times. It is therefore natural to make use of such knowledge to improve estimation accuracy. For example, a reliability engineer or medical practitioner may know the general trend of the true hazard over time: increasing, decreasing or U-shaped; see Grenander (1956), Proschan (1963), Hall et al. (2001) and Reboul (2005). The recent monograph of Groeneboom and Jongbloed (2014) provides excellent coverage of many shape-restricted estimation problems.
In this paper, we study the computation of the maximum likelihood estimate (MLE) of a survival function under the restriction of a U-shaped hazard function. A U-shaped hazard consists of three periods: burn-in, stable and wear-out. It arises when, e.g., subjects that are more prone to failure are “weeded out” at an early stage and the remaining ones later suffer from the “aging” effect. Nonparametric estimation of a U-shaped hazard function in the uncensored situation was pioneered by Bray et al. (1967) and was extended to the case of right-censored data by Mykytyn and Santner (1981); see also Tsai (1988) and Huang and Wellner (1995). Banerjee (2008) studied pointwise confidence intervals of such hazard functions for right-censored data using nonparametric likelihood ratios. Jankowski and Wellner (2009b) studied the estimation of a U-shaped hazard function that is only subject to the convexity restriction and showed that the MLE in this case is piecewise linear and has a local rate of convergence \(n^{2/5}\). Meyer and Habtzghi (2011) also studied the estimation of a shape-restricted hazard function via regression splines.
Since the nonparametric MLE of a U-shaped hazard function is not explicitly available, iterative algorithms have to be used. There is seemingly only one purpose-built algorithm available in the literature, proposed by Jankowski and Wellner (2009a) for finding a convex hazard function. Their algorithm (SRB) iterates between the support reduction algorithm (Groeneboom et al. 2008) and the bisection technique. The former is used to find the MLE over all convex hazard functions with a fixed minimum (anti-mode), and the latter is applied to optimize over all possible minimum values of a convex hazard function. This is a double-looping process and hence very time-consuming. The main goal of our research is to overcome the difficulty in dealing with the anti-mode of a hazard function and hence to provide a single-looping, fast algorithm. We are also interested in an algorithm that works for estimating a general U-shaped hazard function, with a focus on the convexity restriction.
Our new algorithm, an extension of the constrained Newton method (CNM) (Wang 2007) that was proposed for fitting a nonparametric mixture, contains a novel technique for dealing with the anti-mode problem. The main idea is to always maintain a constant hazard segment, even if its length is zero. The two situations, where the constant hazard segment has either zero or positive length, are treated in the same way and are thus interchangeable during the computation. Specifically, the new algorithm not only finds and adds new (potential) support points outside the constant hazard segment, by locating the local maxima of two special gradient functions, but also finds and adds one new support point inside the constant hazard segment, assigned to either the left or the right support set according to which of the two gradient functions is larger at the point. This allows the length of the constant hazard segment to decrease. The masses associated with all the support points are then updated rapidly under the U-shaped hazard restriction, and the support points with zero masses are subsequently removed. Note that removing a zero-mass support point that is an endpoint of the constant hazard segment increases its length. This overcomes the anti-mode difficulty, makes double looping unnecessary, and thus speeds up the computation significantly.
Since the SRB algorithm only deals with exact observations, our secondary goal is to provide an algorithm that also works for general interval-censored observations, and for data mixed with both exact and interval-censored observations. These two cases are quite common in survival analysis. Some theoretical properties of maximum likelihood estimation are established in this more general setting, which are needed for establishing the convergence of our new algorithm.
The rest of the paper is organized as follows. Section 2 briefly describes the nonparametric maximum likelihood estimation of a hazard function, in particular under the convexity restriction. Section 3 gives some theoretical properties of nonparametric maximum likelihood estimation, defines three gradient functions that are vital for our new algorithm and specifies the characterization conditions of the nonparametric MLE. Section 4 describes the new algorithm in detail, and its convergence to a global optimizer is theoretically established in Sect. 5. Section 6 considers a family of U-shaped hazard functions with varying smoothness. Three real-world data sets are studied in Sect. 7. The final section gives some concluding remarks.
The software developed and data sets used in this paper are available in the R package npsurv (Wang 2015).
Nonparametric maximum likelihood estimation
Let T be a random variable taking values in \([0, \infty )\), with distribution function F and density function f. It represents the time until some specified event occurs. The hazard function is given by
$$h(t) = \frac{f(t)}{S(t)},$$
where \(S(t) = 1 - F(t)\) is the survival function and \(H(t) = \int _0^{t} h(u) \, \mathrm {d}u\) the cumulative hazard function. Given h, the other functions H, F, S and f can all be readily derived. For example, one can express the density as
$$f(t) = h(t)\, \mathrm {e}^{-H(t)}.$$
We are interested in finding the MLE of h (or f, equivalently), under the restriction of a nonnegative convex h on \([0, \infty )\).
Given an independent and identically distributed sample of exact observations \(T_1, \ldots , T_n \in (0, \infty )\), the log-likelihood function of h is given by
$$\ell (h) = \sum _{i=1}^{n} \log f(T_i) = \sum _{i=1}^{n} \left\{ \log h(T_i) - H(T_i) \right\} .$$
Since h can be made arbitrarily large at the largest observation \(T_{(n)}\) without affecting the value of \(H(T_{(n)})\), one should instead maximize the modified log-likelihood given by
$${\tilde{\ell }}(h) = \sum _{i \in \mathcal{I}} \log h(T_i) - \sum _{i=1}^{n} H(T_i),$$
where \(\mathcal{I}= \{i: T_i < T_{(n)}, 1 \le i \le n\}\) and h is restricted to \([0, T_{(n)})\). The full MLE is obtained by additionally setting \(\hat{h}(T_{(n)}) = \infty \). This follows the same reasoning as Grenander (1956), who studied an increasing h.
Denote by \(\mathcal{K}\) the space of all nonnegative convex functions on \([0, T_{(n)}]\) (or \([0, L_{(n)}]\), if interval-censored observations exist). When all event times are exactly observed, Jankowski and Wellner (2009b) showed that the MLE of a convex hazard function is unique and piecewise linear. Hence, to find the MLE one only needs to consider hazard functions of the form
$$h(t) = \alpha + \sum _{j=1}^{k} \nu _j (\tau _j - t)_+ + \sum _{j=1}^{m} \mu _j (t - \eta _j)_+, \quad (1)$$
for \(\alpha \ge 0\), \(\nu _j, \mu _j \ge 0\) and \( 0< \tau _1< \cdots< \tau _k \le \eta _1< \cdots< \eta _m < L_{(n)}\). It consists of three parts: (a) h is decreasing, with slope changes at the \(\tau _j\)'s; (b) h is constant, equal to \(\alpha \), on \([\tau _k, \eta _1]\); and (c) h is increasing, with slope changes at the \(\eta _j\)'s. Typically, the MLE has a constant part of positive length only when it touches 0. The corresponding cumulative hazard function is given by
$$H(t) = \alpha t + \frac{1}{2} \sum _{j=1}^{k} \nu _j \left\{ \tau _j^2 - (\tau _j - t)_+^2 \right\} + \frac{1}{2} \sum _{j=1}^{m} \mu _j (t - \eta _j)_+^2.$$
Let \({\varvec{\nu }}= (\nu _1, \ldots , \nu _k)^\top \), \({\varvec{\mu }}= (\mu _1, \ldots , \mu _m)^\top \), \(\varvec{\tau }= (\tau _1, \ldots , \tau _k)^\top \) and \({\varvec{\eta }}= (\eta _1, \ldots , \eta _m)^\top \). In other words, a piecewise linear \(h \in \mathcal{K}\) is determined by three parameters \((h^{(0)}, h^{(1)}, h^{(2)})\): a nonnegative scalar \(h^{(0)} \equiv \alpha \), and two nonnegative measures \(h^{(1)} \equiv ({\varvec{\nu }}, \varvec{\tau })\) (i.e., \(\nu _j = \nu (\tau _j)\)) and \(h^{(2)} \equiv ({\varvec{\mu }}, {\varvec{\eta }})\) (i.e., \(\mu _j = \mu (\eta _j)\)), whose numbers of support points, k and m, need to be estimated as well.
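To make the parameterization concrete, the following minimal R sketch evaluates the hazard (1) and its cumulative hazard at given time points; haz_pl() and cumhaz_pl() are hypothetical helper names written for illustration, not functions of the npsurv package.

```r
# Evaluate the piecewise linear hazard (1) and its cumulative hazard, given
# alpha >= 0, masses nu, mu >= 0 and knots tau, eta; a sketch only.
haz_pl <- function(t, alpha, nu, tau, mu, eta) {
  dec <- vapply(t, function(s) sum(nu * pmax(tau - s, 0)), numeric(1))
  inc <- vapply(t, function(s) sum(mu * pmax(s - eta, 0)), numeric(1))
  alpha + dec + inc
}
cumhaz_pl <- function(t, alpha, nu, tau, mu, eta) {
  dec <- vapply(t, function(s) sum(nu * (tau^2 - pmax(tau - s, 0)^2)) / 2,
                numeric(1))
  inc <- vapply(t, function(s) sum(mu * pmax(s - eta, 0)^2) / 2, numeric(1))
  alpha * t + dec + inc
}
# Example: constant level 0.2, one knot on each side of the anti-mode
haz_pl(c(0.5, 2, 5), alpha = 0.2, nu = 0.3, tau = 1, mu = 0.1, eta = 3)
```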
In a more general setting, an event time \(T_i\) does not have to be exactly observed. It can be interval-censored, in which case \(T_i\) is only known to lie within an interval that is specified independently of \(T_i\). For an interval-censored observation, we denote its censoring interval by \(O_i = [L_i, R_i) \subset [0, \infty )\), while an exact observation is indicated by \(T_i= L_i = R_i\). When both exact and interval-censored observations exist, let us assume without loss of generality that the first \(n_1\) observations are exact. Hence, the log-likelihood function is given by
$$\ell (h) = \sum _{i=1}^{n_1} \left\{ \log h(T_i) - H(T_i) \right\} + \sum _{i=n_1+1}^{n} \log \left\{ \mathrm {e}^{-H(L_i)} - \mathrm {e}^{-H(R_i)} \right\} . \quad (2)$$
Similar to the case when all observations are exact, \(\ell (h)\) can be made arbitrarily large by increasing the value of h at the largest observed value, and hence a modified log-likelihood should be maximized, taking the form
$${\tilde{\ell }}(h) = \sum _{i \in \mathcal{I}} \log h(T_i) - \sum _{i=1}^{n_1} H(T_i) + \sum _{i=n_1+1}^{n} \log \left\{ \mathrm {e}^{-H(L_i)} - \mathrm {e}^{-H(R_i)} \right\} , \quad (3)$$
where \(\mathcal{I}= \{i: T_i < L_{(n)}, 1 \le i \le n_1\}\) and \(L_{(n)} = \max _{1 \le i \le n} L_i\) denotes the largest observed value. Note that if \(T_{(n_1)} < L_{(n)}\), then the modified log-likelihood is the same as the full log-likelihood (2). We can hence always work with the modified log-likelihood.
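Using haz_pl() and cumhaz_pl() from the sketch above, the modified log-likelihood (3) for mixed data can be sketched as follows, with the convention \(\exp \{-H(\infty )\} = 0\) handling right-censored intervals; again, all names are illustrative rather than part of any package.

```r
# Modified log-likelihood (3): the first n1 rows are exact (Ti = Li = Ri),
# the remaining rows are censoring intervals [Li, Ri); a sketch only.
loglik_mod <- function(L, R, n1, alpha, nu, tau, mu, eta) {
  Ln <- max(L)                                # largest observed value L_(n)
  Hfun <- function(t)                         # exp(-H(Inf)) = 0 below
    ifelse(is.finite(t), cumhaz_pl(t, alpha, nu, tau, mu, eta), Inf)
  Ti <- L[seq_len(n1)]
  ll <- sum(log(haz_pl(Ti[Ti < Ln], alpha, nu, tau, mu, eta))) - sum(Hfun(Ti))
  if (n1 < length(L)) {                       # interval-censored part
    Lc <- L[-seq_len(n1)]; Rc <- R[-seq_len(n1)]
    ll <- ll + sum(log(exp(-Hfun(Lc)) - exp(-Hfun(Rc))))
  }
  ll
}
```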
When interval-censored observations are present, we establish in the next section, along with other theoretical properties, that there must exist a piecewise linear MLE of h under the convexity restriction. Hence the linear form (1) can continue to be used, even though the piecewise linear MLE may not be unique. A fast algorithm will be proposed in Sect. 4 that finds a piecewise linear MLE, and its convergence is theoretically guaranteed by Theorem 5.
The reason for the non-uniqueness here is that interval-censored observations affect the value of \({\tilde{\ell }}\) only through the values of H at the censoring points, not through how H varies inside an interval that contains no exactly observed value or censoring point. Therefore, one can change h inside such an interval, as long as h remains nonnegative and convex and \({\tilde{\ell }}\) remains the same.
Theoretical properties
In this section, we establish some properties of the likelihood method described in Sect. 2, in particular the characterization of the nonparametric MLE under the convexity restriction and the existence of a piecewise linear MLE. These properties are then used in Sect. 5 to establish the convergence of the proposed algorithm.
Theorem 1
\(\mathcal{K}\) is convex and \({\tilde{\ell }}\) is concave on \(\mathcal{K}\).
Proof
For any \(h, g \in \mathcal{K}\) and any \(\epsilon \in (0, 1)\), let \(h_\epsilon = (1 - \epsilon ) h + \epsilon g\), which must also be a nonnegative convex function on \([0, L_{(n)}]\). Hence \(\mathcal{K}\) is convex.
The concavity of \({\tilde{\ell }}\) on \(\mathcal{K}\) is established if we can show that
$${\tilde{\ell }}(h_\epsilon ) \ge (1 - \epsilon )\, {\tilde{\ell }}(h) + \epsilon \, {\tilde{\ell }}(g). \quad (4)$$
Inequality (4) trivially holds in the special case when \({\tilde{\ell }}(h) = {\tilde{\ell }}(g) = -\infty \), since \({\tilde{\ell }}(\kappa ) \ge -\infty \) for all \(\kappa \in \mathcal{K}\).
Now let us consider the case when \({\tilde{\ell }}(h) > -\infty \) or \({\tilde{\ell }}(g) > -\infty \). We then must have \(h_\epsilon (T_i) > 0\) for \(i \in \mathcal{I}\) and \(H_\epsilon (R_i) - H_\epsilon (L_i) > 0\) for \(i \ge n_1 + 1\), where \(H_\epsilon = (1 - \epsilon ) H + \epsilon G\), with \(G(t) = \int _0^t g(u) \, \mathrm {d}u\). Then, by the concavity of the logarithm,
$$\log h_\epsilon (T_i) \ge (1 - \epsilon ) \log h(T_i) + \epsilon \log g(T_i), \quad i \in \mathcal{I},$$
and, by the concavity of \((x, y) \mapsto \log (\mathrm {e}^{-x} - \mathrm {e}^{-y})\) on \(\{x < y\}\),
$$\log \left\{ \mathrm {e}^{-H_\epsilon (L_i)} - \mathrm {e}^{-H_\epsilon (R_i)} \right\} \ge (1 - \epsilon ) \log \left\{ \mathrm {e}^{-H(L_i)} - \mathrm {e}^{-H(R_i)} \right\} + \epsilon \log \left\{ \mathrm {e}^{-G(L_i)} - \mathrm {e}^{-G(R_i)} \right\} , \quad i \ge n_1 + 1.$$
This means that \({\tilde{\ell }}\) is concave along the linear path from h to g and that inequality (4) also holds in this case. \(\square \)
Let \(\mathcal{K}(c) = \{h \in \mathcal{K}: {\tilde{\ell }}(h) \ge c > -\infty \}\), which is always assumed non-empty below for a chosen value of c. Note that \(\mathcal{K}(c)\) is also convex because of the concavity of \({\tilde{\ell }}\) on the convex set \(\mathcal{K}\).
Theorem 2
Let \(\epsilon _1, \epsilon _2 > 0\) be so chosen that intervals \((0, \epsilon _1)\) and \((L_{(n)} - \epsilon _2, L_{(n)})\) contain no exact observation or censoring point. For any fixed \(c > -\infty \) and all \(h \in \mathcal{K}(c)\), h is bounded on \([\epsilon _1, L_{(n)} - \epsilon _2]\). In addition, \(H(L_{(n)})\) is bounded.
Proof
It is more convenient here to use the modified likelihood function
$${\widetilde{\mathcal {L}}}(h) = \mathrm {e}^{{\tilde{\ell }}(h)} = \prod _{i \in \mathcal{I}} h(T_i) \prod _{i=1}^{n_1} \mathrm {e}^{-H(T_i)} \prod _{i=n_1+1}^{n} \left\{ \mathrm {e}^{-H(L_i)} - \mathrm {e}^{-H(R_i)} \right\} .$$
Since h is convex, the supremum of h on \([\epsilon _1, L_{(n)} - \epsilon _2]\) occurs at either \(\epsilon _1\) or \(L_{(n)} - \epsilon _2\).
Assume that the supremum of h is achieved at \(\epsilon _1\) and is unbounded. Then
Since the right-hand side approaches 0 as \(h(\epsilon _1) \rightarrow \infty \), we have \({\widetilde{\mathcal {L}}}(h) \rightarrow 0\), i.e., \({\tilde{\ell }}(h) \rightarrow -\infty \). This contradicts \(h \in \mathcal{K}(c)\). Hence h must be bounded at \(\epsilon _1\).
Now assume that the supremum of h is achieved at \(L_{(n)} - \epsilon _2\) and is unbounded. Because
h must also be bounded at \(L_{(n)} - \epsilon _2\), by a similar argument to the above.
Since
and \(h(T_i)\) is bounded for every \(i \in \mathcal{I}\), \(H(L_{(n)})\) must also be bounded. \(\square \)
Corollary 1
There exists an \({\hat{h}}\in \mathcal{K}\) that maximizes \({\tilde{\ell }}(h)\), and \({\tilde{\ell }}({\hat{h}}) < \infty \).
Proof
Immediately from Theorems 1 and 2. \(\square \)
Theorem 3
Let \(\epsilon _1\) and \(\epsilon _2\) be specified as in Theorem 2. Then for any \(g \in \mathcal{K}(c)\), \(c > -\infty \), there is a piecewise linear \(h \in \mathcal{K}(c)\) that has \({\tilde{\ell }}(h) = {\tilde{\ell }}(g)\), is bounded on \([0,L_{(n)}]\), and whose \(h^{(1)}\) does not change slope on \([0, \epsilon _1)\) and \(h^{(2)}\) does not change slope on \((L_{(n)} - \epsilon _2, L_{(n)}]\).
Proof
Let \(\mathcal{U}\) be the set that contains the unique elements of \(\{L_i > 0: 1 \le i \le n\} \cup \{R_i < L_{(n)}: n_1 + 1 \le i \le n\}\). First, partition \([\epsilon _1, L_{(n)} - \epsilon _2]\) into disjoint subintervals by the elements of \(\mathcal{U}\). A convex g that is not piecewise linear on these subintervals can always be replaced with a piecewise linear convex function h that has \(h(u) = g(u)\) for all \(u \in \mathcal{U}\) and the same area underneath as g. For example, one can first construct a piecewise linear minorant of g, without violating convexity, and then shift or rotate some of its linear pieces upward until the area underneath equals that of g. This ensures that \(H(u) = G(u)\) for all \(u \in \mathcal{U}\).
If the part of the convex, decreasing \(g^{(1)}\) on \([0, \min (\mathcal{U}))\) changes slope, one can first replace it with a line segment that is a minorant of \(g^{(1)}\), has value \(g^{(1)}(\epsilon _1)\) at \(\epsilon _1\) and does not violate convexity, and then rotate it upward about the point \((\epsilon _1, g^{(1)}(\epsilon _1))\) until \(H(\epsilon _1) = G(\epsilon _1)\). A similar argument gives a line segment for \(h^{(2)}\) on \((L_{(n)} - \epsilon _2, L_{(n)}]\).
Since \(h(T_i) = g(T_i)\) for \(i = 1,\ldots , n_1\) and \(H(u) = G(u)\) for all \(u \in \mathcal{U}\cup \{L_{(n)}\}\), we have \({\tilde{\ell }}(h) = {\tilde{\ell }}(g)\). By Theorem 2, \(H(L_{(n)})\) is bounded. This means that h(0) and \(h(L_{(n)})\), and hence h on \([0,L_{(n)}]\), must be bounded, which completes the proof. \(\square \)
With Theorem 3, we can therefore define
$$\mathcal{K}^{(\mathrm {PL})}_{\epsilon _1, \epsilon _2}(c) = \left\{ h \in \mathcal{K}(c): h \text { is piecewise linear and bounded on } [0, L_{(n)}],\ h^{(1)} \text { has no slope change on } [0, \epsilon _1) \text { and } h^{(2)} \text { has no slope change on } (L_{(n)} - \epsilon _2, L_{(n)}] \right\} ,$$
where \(\epsilon _1\) and \(\epsilon _2\) are as specified in Theorem 2. Given any \(g \in \mathcal{K}(c)\), there must exist an equivalent \(h \in \mathcal{K}^{(\mathrm {PL})}_{\epsilon _1, \epsilon _2}(c)\) such that \({\tilde{\ell }}(h) = {\tilde{\ell }}(g)\).
Corollary 2
For any \(h \in \mathcal{K}^{(\mathrm {PL})}_{\epsilon _1, \epsilon _2}(c)\),
$$|h| \equiv \alpha + \int \mathrm {d}h^{(1)} + \int \mathrm {d}h^{(2)}$$
is bounded.
Proof
From Theorem 2, h is bounded on \([\epsilon _1, L_{(n)} - \epsilon _2]\). Therefore, \(\int \mathrm {d}h^{(1)}\), the total change of slope of the decreasing, convex \(h^{(1)}\), must be bounded. Similarly, so must be \(\int \mathrm {d}h^{(2)}\), and hence |h|. \(\square \)
Corollary 3
There exists an \({\hat{h}}\in \mathcal{K}^{(\mathrm {PL})}_{\epsilon _1, \epsilon _2}(c)\) that maximizes \({\tilde{\ell }}(h)\).
Proof
The existence of an MLE in \(\mathcal{K}\) follows from Corollary 1. Hence, by Theorem 3, there exists an equivalent piecewise linear MLE of h. \(\square \)
Letting \(e_{1,\tau }(t) = (\tau - t)_+\) and \(e_{2,\eta }(t) = (t - \eta )_+\), we define two gradient functions as follows:
and
where
For completeness, let us define
where \(e_0 = 1\). Note that \(d_1\) and \(d_2\) are piecewise quadratic functions of \(\tau \) and \(\eta \), respectively. The directional derivative from h to g is given by
$$d(g; h) = \lim _{\epsilon \downarrow 0} \frac{{\tilde{\ell }}\{(1-\epsilon ) h + \epsilon g\} - {\tilde{\ell }}(h)}{\epsilon }.$$
Lemma 1
For any \(g, h \in \mathcal{K}\),
$${\tilde{\ell }}(g) - {\tilde{\ell }}(h) \le d(g; h).$$
Proof
This follows from the concavity of \({\tilde{\ell }}\). \(\square \)
Theorem 4
A piecewise linear \({\hat{h}}\in \mathcal{K}\) maximizes \({\tilde{\ell }}(h)\) if and only if the following conditions are satisfied:
-
(i)
\(d_0({\hat{h}}) \le 0\), if \({\hat{\alpha }}= 0\);
-
(ii)
\(d_0({\hat{h}}) = 0\), if \({\hat{\alpha }}> 0\);
-
(iii)
\(d_1(\tau ; {\hat{h}}) \le 0\), for \(\tau \in [0, \hat{\eta }_1]\);
-
(iv)
\(d_1(\tau ; {\hat{h}}) = 0\), for \(\tau \in \{\hat{\tau }_1, \ldots , \hat{\tau }_{\hat{k}}\}\);
-
(v)
\(d_2(\eta ; {\hat{h}}) \le 0\), for \(\eta \in [\hat{\tau }_{\hat{k}}, L_{(n)}]\);
-
(vi)
\(d_2(\eta ; {\hat{h}}) = 0\), for \(\eta \in \{\hat{\eta }_1, \ldots , \hat{\eta }_{\hat{m}}\}\).
Proof
Necessity can be established by contradiction: for any h that fails to satisfy any of the six conditions, \({\tilde{\ell }}\) must have an ascent direction at h, and a sufficiently small movement in that direction will increase \({\tilde{\ell }}\). Hence such an h does not maximize \({\tilde{\ell }}\).
If the six conditions are satisfied, then \(d(g; {\hat{h}}) \le 0\) for every \(g \in \mathcal{K}\). Sufficiency follows from Lemma 1. \(\square \)
Computation
In this section, we propose a new, fast algorithm that finds a piecewise linear MLE, \({\hat{h}}\). Section 4.1 briefly describes the main difficulty with computing \({\hat{h}}\) and our main idea to overcome it. In Sects. 4.2 and 4.3, we describe, respectively, how to update masses with support points fixed and how to expand and shrink the two sets of support points appropriately. The algorithm is summarized in Sect. 4.4.
Main ideas
To compute \({\hat{h}}\), one major difficulty is that it is not clear during the computation where the two nonnegative measures \({\hat{h}}^{(1)}\) and \({\hat{h}}^{(2)}\) will eventually be separated by the constant part, whose endpoints, \(\hat{\tau }_{\hat{k}}\) and \(\hat{\eta }_1\), are themselves to be found. In general, it is also impossible to pre-identify a point between \(\hat{\tau }_{\hat{k}}\) and \(\hat{\eta }_1\), since for a practical data set it is almost always the case that \(\hat{\tau }_{\hat{k}}= \hat{\eta }_1\), unless \({\hat{\alpha }}= 0\), i.e., the constant part lies on the horizontal axis. To deal with this dilemma, Jankowski and Wellner (2009a) propose a double-looping profile likelihood method. In the inner loop, the support reduction algorithm (Groeneboom et al. 2008) is used to find the MLE over all convex hazard functions with a fixed minimum. In the outer loop, the bisection algorithm is used to optimize over possible minimum values of a convex hazard function. This hybrid SRB method is time-consuming, owing to the seemingly inevitable double looping.
Our new algorithm reduces the double loop to a single loop and thus speeds up the computation significantly. The key idea is to maintain a constant hazard part of variable length, including length zero. New support points outside the constant part are provided by the local maxima of the two gradient functions (6) and (7), while inside the constant part a single new support point is provided by the greater of the maxima of the two gradient functions. This gives the constant part an opportunity to shrink. The masses at all support points are then updated, and some may turn out to be exactly zero. This gives the constant part an opportunity to expand. The computation is iterative and is guaranteed to create a sequence of estimates that converges to the MLE.
In addition, the new algorithm adds multiple new support points to the support sets in each iteration, while the SRB method only adds one at a time. This also helps speed up the computation, when \({\hat{h}}\) has many support points, which is usually true when the sample size is large.
Updating masses
Let us write \({\varvec{\pi }}= (\alpha , {\varvec{\nu }}^\top , {\varvec{\mu }}^\top )^\top \) for the vector of masses and \({\varvec{\theta }}= (\varvec{\tau }^\top , {\varvec{\eta }}^\top )^\top \) for the vector of support points. Since h is fully determined by its \({\varvec{\pi }}\) and \({\varvec{\theta }}\), with k and m always implicitly assumed known, we treat h and \(({\varvec{\pi }}, {\varvec{\theta }})\) interchangeably below and hence may write the modified log-likelihood as \({\tilde{\ell }}({\varvec{\pi }}, {\varvec{\theta }})\).
In order to update \({\varvec{\pi }}\) to \({\varvec{\pi }}'\) with \({\varvec{\theta }}\) fixed, consider the second-order Taylor series expansion of the modified log-likelihood function (3) in a neighborhood of \({\varvec{\pi }}\). Denote the gradient vector and Hessian matrix (see “Appendix 1”) with respect to \({\varvec{\pi }}\) by, respectively,
$$\mathbf {g}\equiv \mathbf {g}({\varvec{\pi }}, {\varvec{\theta }}) = \frac{\partial {\tilde{\ell }}({\varvec{\pi }}, {\varvec{\theta }})}{\partial {\varvec{\pi }}} \quad \text {and} \quad \mathbf {H}\equiv \mathbf {H}({\varvec{\pi }}, {\varvec{\theta }}) = \frac{\partial ^2 {\tilde{\ell }}({\varvec{\pi }}, {\varvec{\theta }})}{\partial {\varvec{\pi }}\, \partial {\varvec{\pi }}^\top },$$
and for a \({\varvec{\pi }}'\) near \({\varvec{\pi }}\), we have
$${\tilde{\ell }}({\varvec{\pi }}', {\varvec{\theta }}) \approx {\tilde{\ell }}({\varvec{\pi }}, {\varvec{\theta }}) + ({\varvec{\pi }}' - {\varvec{\pi }})^\top \mathbf {g}+ \frac{1}{2} ({\varvec{\pi }}' - {\varvec{\pi }})^\top \mathbf {H}({\varvec{\pi }}' - {\varvec{\pi }}).$$
Therefore, in a small neighborhood of \({\varvec{\pi }}\), the problem of maximizing \({\tilde{\ell }}({\varvec{\pi }}', {\varvec{\theta }})\) can be approximated by the following least squares problem with nonnegativity constraints:
$$\min _{{\varvec{\pi }}' \ge \mathbf {0}} \left\| \mathbf {R}{\varvec{\pi }}' - (\mathbf {R}{\varvec{\pi }}+ \mathbf {c}) \right\| ^2, \quad (10)$$
where \(\mathbf {R}\equiv \mathbf {R}({\varvec{\pi }}, {\varvec{\theta }})\) satisfies \(\mathbf {H}= -\mathbf {R}^\top \mathbf {R}\) and \(\mathbf {c}\equiv \mathbf {c}({\varvec{\pi }}, {\varvec{\theta }})\) solves \(\mathbf {R}^\top \mathbf {c}= \mathbf {g}\). For example, \(\mathbf {R}\) can be taken as the upper triangular matrix given by the Cholesky decomposition of \(-\mathbf {H}\), which is feasible since \(\mathbf {H}\) is negative semidefinite (see “Appendix 1”). Numerically, \(\mathbf {R}\) may turn out to be singular when there exist nearly identical support points. To resolve this issue, we can add a small positive value to all diagonal elements of \(-\mathbf {H}\). Problem (10) is then readily solvable by the nonnegative least squares (NNLS) algorithm of Lawson and Hanson (1974). We also find it helpful to first standardize \(-\mathbf {H}\) to have unit diagonal elements, which leads to a problem of the same form as (10).
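The update can be sketched as follows in R, assuming the gradient g and Hessian Hm of (3) with respect to \({\varvec{\pi }}\) have been computed (see “Appendix 1”); the CRAN package nnls provides a standard implementation of the Lawson–Hanson algorithm, and update_masses() is an illustrative name only.

```r
# One mass update: solve problem (10) by NNLS; a sketch under the
# assumptions stated above.
library(nnls)
update_masses <- function(pi0, g, Hm, ridge = 1e-10) {
  A  <- -Hm + diag(ridge, nrow(Hm))        # small diagonal addition
  R  <- chol(A)                            # upper triangular, A = R'R
  cc <- backsolve(R, g, transpose = TRUE)  # solves R'cc = g
  b  <- as.vector(R %*% pi0 + cc)          # right-hand side of (10)
  nnls(R, b)$x                             # constrained minimizer pi' >= 0
}
```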
To ensure a monotone increase in the modified log-likelihood, a backtracking line search is conducted. Let \({\varvec{\pi }}'\) be the solution to problem (10). The update \({\varvec{\pi }}+ \sigma ^p ({\varvec{\pi }}' - {\varvec{\pi }})\) uses the smallest \(p \in \left\{ 0, 1, 2, \ldots \right\} \) such that, for a constant \(0<\alpha <\frac{1}{2}\), the Armijo condition
$${\tilde{\ell }}\left\{ {\varvec{\pi }}+ \sigma ^p ({\varvec{\pi }}' - {\varvec{\pi }}), {\varvec{\theta }}\right\} - {\tilde{\ell }}({\varvec{\pi }}, {\varvec{\theta }}) \ge \alpha \, \sigma ^p \, ({\varvec{\pi }}' - {\varvec{\pi }})^\top \mathbf {g}$$
holds true. In our implementation, we use \(\sigma =\frac{1}{2}\) and \(\alpha = \frac{1}{3}\). As shown in Lemma 3, the smallest such p has a finite upper bound that is independent of the iteration, so the Armijo rule is always satisfiable.
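A matching backtracking sketch follows; loglik() stands for \({\tilde{\ell }}(\cdot , {\varvec{\theta }})\) as a function of the mass vector (a hypothetical closure), and the Armijo constant is named arm to avoid clashing with the hazard level \(\alpha \).

```r
# Backtracking line search with sigma = 1/2 and Armijo constant 1/3.
line_search <- function(loglik, pi0, pi1, g, sigma = 0.5, arm = 1/3,
                        max_p = 60) {
  d   <- pi1 - pi0
  ll0 <- loglik(pi0)
  for (p in 0:max_p) {
    step <- sigma^p
    if (loglik(pi0 + step * d) - ll0 >= arm * step * sum(d * g))
      return(pi0 + step * d)       # smallest p satisfying the Armijo rule
  }
  pi0                              # fallback; by Lemma 3 this is not reached
}
```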
Expanding and shrinking support sets
We need to dynamically expand and shrink the two support sets \(\varvec{\tau }\) and \({\varvec{\eta }}\). The gradient functions \(d_1(\tau ; h)\) and \(d_2(\eta ; h)\) are natural tools for locating new elements of the support sets. Since \(d_1(\tau ; h)\) and \(d_2(\eta ; h)\) are piecewise quadratic, it is simple to locate their global maxima between every two neighboring knots exactly. This is carried out separately on three intervals: (i) the decreasing part on \([0, \tau _k)\), (ii) the constant part on \([\tau _k, \eta _1]\), and (iii) the increasing part on \((\eta _1, L_{(n)}]\).
Let us define the ordered set
$$J = \left\{ 0,\ \tau _1, \ldots , \tau _k,\ \eta _1, \ldots , \eta _m,\ L_{(n)} \right\} .$$
In the first step of the new algorithm, we expand the two support sets corresponding to the decreasing and increasing parts of a convex hazard function by finding and adding one new support point between every two neighboring points in the set J. For the decreasing part \([0, \tau _k)\), we use the gradient function \(d_1(\tau ; h)\) (6); for the increasing part \((\eta _1, L_{(n)}]\), the gradient function \(d_2(\eta ; h)\) (7) is applied. For the constant hazard interval \([\tau _k, \eta _1]\), we employ both gradient functions when \(\tau _k \ne \eta _1\): the greater of the maxima of \(d_1\) and \(d_2\) on \([\tau _k, \eta _1]\) determines the new support point to be added, and to which support set it belongs. Nothing needs to be done for the constant hazard interval when \(\tau _k = \eta _1\). Of the newly found potential support points, one may retain only those with positive gradient values. The per-interval maximization is sketched below.
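Since each inter-knot piece of \(d_1\) or \(d_2\) is a quadratic, the maximization can be done exactly; the sketch below assumes the local quadratic coefficients a2, a1, a0 have been extracted from the gradient-function formulas, and max_quad() is an illustrative helper name.

```r
# Maximize q(t) = a2 t^2 + a1 t + a0 over [lo, hi] exactly; no grid needed.
max_quad <- function(a2, a1, a0, lo, hi) {
  cand <- c(lo, hi)
  if (a2 < 0) {                      # interior vertex exists only if concave
    v <- -a1 / (2 * a2)
    if (v > lo && v < hi) cand <- c(cand, v)
  }
  vals <- a2 * cand^2 + a1 * cand + a0
  list(arg = cand[which.max(vals)], val = max(vals))
}
max_quad(-1, 2, 0, 0, 3)   # vertex at t = 1, value 1
```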
Theoretically, according to Theorem 3 one should restrict the algorithm from finding new support points on \([0, \epsilon _1)\) for \(h^{(1)}\) or on \((L_{(n)} - \epsilon _2, L_{(n)}]\) for \(h^{(2)}\). This can certainly be implemented easily, by choosing, e.g., \(\epsilon _1 = \min (\mathcal{U})\) and \(\epsilon _2 = L_{(n)} - \max (\mathcal{U})\). However, since \(\epsilon _1\) and \(\epsilon _2\) can also be made arbitrarily small, we may simply replace them with 0’s in practice, which caused no problems in our numerical studies.
In the second step of the new algorithm, the mass vectors \({\varvec{\nu }}\) and \({\varvec{\mu }}\) are updated by solving problem (10), and the support sets \(\varvec{\tau }\) and \({\varvec{\eta }}\) are shrunk by removing the elements of \(\varvec{\tau }\) and \({\varvec{\eta }}\) with zero masses. The length of the constant part may thus increase or decrease after each iteration, including the situations where a positive length becomes zero and where a zero length becomes positive: if the updated mass of \(\tau _k\) or \(\eta _1\) becomes zero, the length of the constant part increases, while if a new support point inside \([\tau _k, \eta _1]\) receives a positive mass, the length of the constant segment decreases. These two steps are repeated until the final solution is found, which must be a global maximizer \(\hat{h}\).
By implementing this new idea, the double loop needed by Jankowski and Wellner (2009a) is reduced to a single loop. Thus, the convergence of our algorithm is much faster.
The algorithm
Let \({\varvec{\theta }}^+\) denote the support point vector extended by including the newly found support points, and \({\varvec{\pi }}^+\) the corresponding mass vector, enlarged from \({\varvec{\pi }}\) by adding zeros for the new support points. Let \({\varvec{\pi }}'\) and \({\varvec{\theta }}'\) be the updated mass and support point vectors after discarding from \({\varvec{\theta }}^+\) the redundant points that have zero masses in \({\varvec{\pi }}^+\). Being an extension of CNM (Wang 2007), the new algorithm for computing a convex hazard function is summarized as follows.
Algorithm 1
Choose a small \(\gamma > 0\) and set \(s = 0\). From an initial estimate \(h_0\) with finite support and \({\tilde{\ell }}({h}_0) > -\infty \), repeat the following steps.
-
Step 1 compute all global maxima of gradients between every two adjacent points in set J, as described above, hence giving \(\varvec{\tau }_{s}^*\equiv (\tau _{s1}^*,\ldots , \tau _{sk}^*)^\top \) for the decreasing part, \({\varvec{\eta }}_{s}^{*} \equiv (\eta _{s1}^*,\ldots , \eta _{sm}^*)^\top \) for the increasing part, and \(\rho _s^*\) for the constant part.
-
Step 2 set \({\varvec{\theta }}_{{s}^+}\) = \((\varvec{\tau }_{s}^\top , \varvec{\tau }_{s}^{{*}^\top }, \rho _{s}^{*}, {\varvec{\eta }}_{s}^\top , {\varvec{\eta }}_{s}^{{*}^\top })^\top \) and \({\varvec{\pi }}_{s}^+ = (\alpha _s, {\varvec{\nu }}_s^\top , \mathbf {0}^\top , 0, {\varvec{\mu }}_s^\top , \mathbf {0}^\top )^\top \). Find \({\varvec{\pi }}_{s+1}^{-}\) by solving problem (10), with \(\mathbf {R}\) replaced by \(\mathbf {R}_{s}^+ = \mathbf {R}({\varvec{\pi }}_{s}^+, {\varvec{\theta }}_{s}^+)\) and \(\mathbf {c}\) by \( \mathbf {c}_{s}^+ = \mathbf {c}({\varvec{\pi }}_{s}^+, {\varvec{\theta }}_{s}^+)\), followed by a backtracking line search.
-
Step 3 discard all support points with zero masses in \({\varvec{\pi }}_{s+1}^-\), which gives \({h}_{s+1}\) with \({\varvec{\theta }}_{s+1}\) and \({\varvec{\pi }}_{s+1}\). If \({\tilde{\ell }}({h}_{s+1}) - {\tilde{\ell }}({h}_{s}) \le \gamma \), stop; otherwise, set \(s = s + 1\).
The initial hazard function \(h_0\) can, e.g., be chosen to be a constant function, whose level can be determined approximately from the data. In the numerical studies presented below, we used \(\gamma = 10^{-6}\) for stopping the algorithm.
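In practice, the estimate can be obtained from the npsurv package (Wang 2015). The toy sketch below assumes the package's Uhaz() interface, with deg playing the role of the order p of Sect. 6; argument and component names may differ across package versions, so consult the package documentation before relying on this exact call.

```r
# Toy usage sketch of the npsurv implementation (Wang 2015); the Uhaz()
# interface and its deg argument are assumed -- verify against ?Uhaz.
library(npsurv)
set.seed(1)
x <- rweibull(200, shape = 0.8)   # toy data with a decreasing true hazard
fit <- Uhaz(x, deg = 1)           # deg = 1: convex (U-shaped) hazard
fit                               # fitted knots, masses and log-likelihood
```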
Convergence
Next we establish the convergence of the CNMU algorithm for computing the nonparametric MLE of a convex hazard function. It is guaranteed that any sequence of h created by the algorithm converges to a global maximum of \({\tilde{\ell }}\), and certainly to the unique global maximum if there is only one, as in the case that all observations are exact. Let \(\mathcal{K}_0 \equiv \mathcal{K}({\tilde{\ell }}(h_0))\) and \(\mathbf {b}= \mathbf {R}{\varvec{\pi }}+ \mathbf {c}\) [see problem (10)]. Further,
where \({\varvec{\beta }}= - \sum _{i=1}^n \left( T_i,\ \tfrac{1}{2} \{\tau _1^2 - (\tau _1 - T_i)_+^2\}, \ldots , \tfrac{1}{2} \{\tau _k^2 - (\tau _k - T_i)_+^2\},\ \tfrac{1}{2} (T_i - \eta _1)_+^2, \ldots , \tfrac{1}{2} (T_i - \eta _m)_+^2 \right) ^\top \).
Lemma 2
Let \({\varvec{\pi }}'\) be obtained by minimizing \(||{\mathbf {S}}{\varvec{\pi }}' - \mathbf {b}||\) in the algorithm. Then
$$||{\mathbf {S}}({\varvec{\pi }}' - {\varvec{\pi }})|| \le u$$
always holds for some \(u < \infty \) independent of s.
Lemma 3
The line search used in the algorithm always succeeds in a finite number of steps independent of s.
Theorem 5
Let \(\{h_s\}\) be any sequence created by the algorithm. Then \({\tilde{\ell }}(h_s) \rightarrow \sup _{h\in \mathcal{K}} {\tilde{\ell }}(h)\) monotonically as \(s \rightarrow \infty \).
The proofs are deferred to “Appendix 2”.
Generalization
Let us define the pth-order U-shaped hazard function as
$$h(t) = \alpha + \sum _{j=1}^{k} \nu _j (\tau _j - t)_+^p + \sum _{j=1}^{m} \mu _j (t - \eta _j)_+^p, \quad (12)$$
for \(p \ge 0\).
for \(p \ge 0\). Note that when \(p = 0\), it is the U-shaped hazard function studied by Bray et al. (1967), and that when \(p = 1\), it is the convex hazard function studied by Jankowski and Wellner (2009a, b) and in the previous sections. For \(p > 1\), it is a U-shaped hazard function that has a continuous first derivative and is thus smooth.
For the corresponding cumulative hazard function, we have
$$H(t) = \alpha t + \frac{1}{p+1} \sum _{j=1}^{k} \nu _j \left\{ \tau _j^{p+1} - (\tau _j - t)_+^{p+1} \right\} + \frac{1}{p+1} \sum _{j=1}^{m} \mu _j (t - \eta _j)_+^{p+1}.$$
The three gradient functions defined in the same way as (6)–(9) are given by
where \(\varDelta _i(H)\) is the same as given by (8).
Maximum likelihood computation of a pth-order U-shaped hazard function can be carried out by essentially the same CNMU algorithm, using the above formulae. There are two issues here. One is that in the special case \(p = 0\), one can make use of the faster pool-adjacent-violators (PAV) algorithm of Ayer et al. (1955) for a fixed anti-mode, as effectively used by Bray et al. (1967). Since the location of an anti-mode between two adjacent distinct observed values does not affect the hazard or the likelihood function, there are only finitely many non-equivalent potential anti-modes. One can run the PAV algorithm over both the decreasing and the increasing parts for each potential anti-mode and then take the estimate with the largest likelihood value; a generic PAV sketch is given below. The other issue is that when \(p \ne 1\), the likelihood function may have more than one local maximum. Hence, the CNMU algorithm in principle only gives a solution that satisfies the conditions specified in Theorem 4. In our numerical experience, a globally suboptimal solution has only occurred for \(p = 0\) or p close to 0, and is extremely unlikely for large p, especially for \(p > 1\). The main problem is where to place the constant hazard interval (containing the anti-mode); one can easily establish that the algorithm converges to the global maximum when this interval is held fixed. Therefore, when a small p is chosen, a user can also start the algorithm with initial hazard functions that have different constant hazard intervals and then take the best of the solutions produced.
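To make the \(p = 0\) remark concrete, the following is a minimal weighted pool-adjacent-violators sketch for the isotonic (increasing) side; pava() is a generic illustration written for this context, not the routine of Ayer et al. (1955), and the \(p = 0\) estimator would apply it with suitable weights, antitonically to the left of a candidate anti-mode and isotonically to its right.

```r
# Weighted pool-adjacent-violators for isotonic regression; a sketch.
pava <- function(y, w = rep(1, length(y))) {
  n <- length(y); val <- y; wt <- w; len <- rep(1L, n); b <- 0L
  for (i in seq_len(n)) {
    b <- b + 1L; val[b] <- y[i]; wt[b] <- w[i]; len[b] <- 1L
    while (b > 1L && val[b - 1L] > val[b]) {  # pool adjacent violators
      val[b - 1L] <- (wt[b - 1L] * val[b - 1L] + wt[b] * val[b]) /
                     (wt[b - 1L] + wt[b])
      wt[b - 1L]  <- wt[b - 1L] + wt[b]
      len[b - 1L] <- len[b - 1L] + len[b]
      b <- b - 1L
    }
  }
  rep(val[seq_len(b)], len[seq_len(b)])       # expand pooled blocks
}
pava(c(3, 1, 2, 5, 4))   # -> 2 2 2 4.5 4.5
# The antitonic (decreasing) side is rev(pava(rev(y), rev(w))).
```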
Real-world data
Three real-world data sets were studied. The first concerns New Zealand human mortality in 2000, where the data were treated as exact observations, so that the performance of the CNMU and SRB algorithms can be compared. The CNMU algorithm was implemented in \(\texttt {R}\) (R Core Team 2015). The R package \(\texttt {convexHaz}\) (Jankowski et al. 2009) has an implementation of the SRB method, which only deals with exact observations. The second data set contains interval-censored observations, and the third has both exact and right-censored observations. We will see that the CNMU algorithm handles these types of observations with ease.
In survival analysis, it is not unusual for identical event times, or ties, to occur. For computational efficiency, tied observations are better grouped together and treated as weighted observations. Since the SRB implementation in \(\texttt {convexHaz}\) does not deal with this issue, for a fair comparison we considered two versions of our CNMU algorithm, with ties grouped and not grouped, respectively; a small grouping sketch is given below. Two versions of the SRB algorithm, gridded and gridless, were used. As described in Jankowski and Wellner (2009a), a grid is needed for finding approximately the local minima of their directional derivatives, where \([0, L_{(n)}]\) is split into M intervals. The gridded version is faster but produces a less accurate estimate than the gridless one, which fine-tunes the given grid. A larger M value defines a finer grid and thus tends to give a more accurate estimate, yet at a higher computational expense. We simply used \(M=100\), the default value, to obtain the results given below. Larger M values were also attempted in our studies, but they did not appear to improve estimation accuracy much. By contrast, the CNMU algorithm needs no grid and finds all the local maxima of the gradient functions exactly. This results in a highly accurate estimate, up to the precision implied by the stopping criterion.
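For instance, the grouping can be done along these lines (a small sketch; group_ties() is an illustrative helper, not a function of either package):

```r
# Group tied observations into (value, weight) pairs; a sketch.
group_ties <- function(x) {
  tab <- table(x)
  data.frame(value = as.numeric(names(tab)), weight = as.integer(tab))
}
group_ties(c(2, 5, 2, 7, 5, 2))   # values 2, 5, 7 with weights 3, 2, 1
```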
All computations were carried out in R (R Core Team 2015), on a workstation with a clock speed of 3.70 GHz; see Wang (2015) for code and data sets.
New Zealand human mortality data
Downloaded from http://www.mortality.org/, our first data set concerns human mortality, in particular that of the Maori and non-Maori ethnic groups in New Zealand in the year 2000. As is widely known, new-born babies unfortunately have a high hazard rate of death, which decreases as they grow bigger and stronger, while old people, on the other hand, have an increasing hazard rate as they age. We treat the people who died in one year as a cross-sectional sample of the underlying population. We note that the integer-valued ages are in fact interval-censored values, since an integer age means that the death occurred within a one-year period. Because the SRB algorithm can only process exact observations, to make a comparison possible we chose to add 0.5 to each age value and treat the results as exact observations; in other words, we used midpoint imputation for the interval-censored data.
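In R, the midpoint imputation amounts to the following (the age counts below are made up for illustration, not the downloaded data):

```r
# An integer age a records a death within [a, a + 1), so a + 0.5 is the
# midpoint-imputed exact observation; counts here are hypothetical.
deaths_per_age <- c(`0` = 40, `1` = 5, `75` = 60, `90` = 30)
ages    <- rep(as.numeric(names(deaths_per_age)), deaths_per_age)
t_exact <- ages + 0.5
```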
There were 2529 and 24,134 deaths in the Maori and non-Maori ethnic groups, respectively. The top panels of Fig. 1 show the histograms for both groups, superimposed with the density estimates computed by the CNMU algorithm. The corresponding convex hazard functions are shown in the bottom panels, along with the observed hazard rates averaged at each given age. The initially decreasing and later increasing trends of the two hazard rate functions fit the data very well. One can also see that Maori people have a clearly higher hazard rate than non-Maori people at any given age under 80.
The computing results are given in Table 1. While the CNMU algorithm is easily applicable to the data of both groups, it took SRB a long time even for the smaller Maori group. With ties grouped or not, the CNMU algorithm was very fast, taking 0.29 and 2.10 s, respectively, to find the same accurate estimate of h. The gridded and gridless versions of SRB took 8 and 19 min, respectively, yet gave less accurate estimates, as evidenced by their lower modified log-likelihood values.
For the non-Maori group with ties grouped, the CNMU algorithm was as fast as it was for the Maori group, despite the drastic increase in the total number of observations. When ties were not grouped, so that all 24,134 observations were processed individually, the running time was still barely 20 s, showing how robust the algorithm is.
Angina pectoris survival data
Table 2 gives the survival information for 2418 male patients with angina pectoris; see Lee and Wang (2003), page 92. The survival times are recorded in years from the time of diagnosis and collected in one-year intervals over 16 years. Since there are not many distinct values, it is better to treat these interval-censored values as they are than to use imputation techniques; besides, right-censored observations are present anyway.
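Such data can be laid out directly as censoring intervals with frequencies as weights; a sketch with made-up counts (not those of Table 2):

```r
# One-year bins become intervals [j, j + 1); right-censored survivors get
# R = Inf; the frequencies here are illustrative only.
icdata <- data.frame(
  L = c(0, 1, 2, 15),
  R = c(1, 2, 3, Inf),
  weight = c(450, 220, 150, 5)
)
```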
The CNMU algorithm took 0.06 and 0.18 s, with ties grouped and not grouped, respectively, and 8 iterations to find the MLE under the convex hazard restriction, which has a modified log-likelihood value of \(-4817.556\). The fitted hazard function is
Figure 2 illustrates how the CNMU algorithm progresses, showing the results after the 0th, 2nd, 4th and 8th iterations. The top panels depict how the three gradient functions (6)–(9) change as the algorithm proceeds, gradually ensuring satisfaction of the characterization conditions specified in Theorem 4. The bottom panels show the corresponding estimates of the hazard function under the convexity restriction, superimposed over the empirical hazards that were obtained approximately from the MLE under no shape restriction. The latter MLE was computed by the dimension-reduced constrained Newton method (Wang 2008), which took 0.03 s and 6 iterations.
The MLE of a pth-order U-shaped hazard function as defined in (12) can easily be computed for any \(p \ge 0\) by the CNMU algorithm. Figure 3 shows three U-shaped hazard functions for \(p = 0, 1, 2\), and the corresponding density functions. It took the CNMU algorithm only 0.06 and 0.16 s, and 8 and 16 iterations, respectively, to find the zeroth- and second-order estimates. The second-order estimate is given by
which is smooth and looks consistent with the estimate under no shape restriction.
Gastric cancer survival data
Table 3 lists the survival times of 45 gastric cancer patients who were treated with both chemotherapy and radiotherapy. The data, obtained from Klein and Moeschberger (2003), page 224, give a typical right-censoring situation, with both exact and right-censored observations. In this case, the MLE under no shape restriction is the well-known Kaplan–Meier estimator, which has an explicit expression. By contrast, there is no closed-form solution for the MLE under the U-shaped hazard restriction, and an iterative algorithm has to be used to find it.
For \(p = 0, 1, 2\), it took the CNMU algorithm, respectively, 0.08, 0.10 and 0.25 s, and 8, 10, and 8 iterations to find the solutions, which are shown in Fig. 4. The fitted survival curves under the shape restrictions are increasingly smoother and all follow closely the stepwise Kaplan–Meier one. The fitted hazard function under convexity restriction is
It shows a clearly decreasing trend and becomes 0 near the end of the study, indicating a negligible hazard after about 7.5 years. The MLE of the second-order U-shaped hazard function is
It has the same decreasing trend as for \(p = 0, 1\), but it is smooth and does not become exactly zero toward the end of the study.
Concluding remarks
In this paper, we have proposed a new algorithm, named CNMU, that rapidly computes the MLE of a U-shaped hazard function. The new algorithm overcomes the difficulty in dealing with the anti-mode and improves computational efficiency significantly. It can further deal with general interval-censored observations, or with data mixed with both exact and interval-censored observations, a capability not previously available. Numerical studies show that the new algorithm is very fast and significantly outperforms the state-of-the-art SRB algorithm, which is only available for estimating a convex hazard function from exact observations.
We note that the hazard function (1) can also be written, equivalently, as
where \(\alpha _0, \nu _j \ge 0\) and \(\eta _j \in (0, L_{(n)})\), which involves only one infinite-dimensional parameter and may thus appear to be a simpler problem to solve. However, with this formulation one also needs to impose the restriction \(h(t) \ge 0\) for all \(t \in (0, L_{(n)})\), or at every support point. This is numerically difficult to deal with, especially when \({\hat{h}}\) becomes exactly 0 on an interval, as happens for the gastric cancer survival data.
Although we have focused on estimating a U-shaped hazard function, it is straightforward to modify the proposed algorithm for estimating a convex hazard function that is restricted to be increasing or decreasing only, and it is likely extensible to other types of shape-restricted hazard functions. Moreover, the new technique proposed here for the simultaneous estimation of two infinite-dimensional parameters, whose supports are not clearly separated and may overlap, should also work in other similar situations, even when there are more than two such infinite-dimensional parameters. Future investigations will consider such extensions.
References
Ayer, M., Brunk, H.D., Ewing, G.M., Reid, W.T., Silverman, E.: An empirical distribution function for sampling with incomplete information. Ann. Math. Stat. 26, 641–647 (1955)
Banerjee, M.: Estimating monotone, unimodal and U-shaped failure rates using asymptotic pivots. Stat. Sin. 18, 467–492 (2008)
Bray, T.A., Crawford, G.B., Proschan, F.: Maximum Likelihood Estimation of a U-shaped Failure Rate Function. Defense Technical Information Center, Mathematical Note 534, Boeing Research Laboratories, Seattle (1967)
Dümbgen, L., Freitag-Wolf, S., Jongbloed, G.: Estimating a unimodal distribution from interval-censored data. J. Am. Stat. Assoc. 101, 1094–1106 (2006)
Grenander, U.: On the theory of mortality measurement. II. Skand. Aktuarietidskr. 39, 125–153 (1956)
Groeneboom, P., Jongbloed, G.: Nonparametric Estimation under Shape Constraints: Estimators, Algorithms and Asymptotics. Cambridge University Press, Cambridge (2014)
Groeneboom, P., Jongbloed, G., Wellner, J.A.: The support reduction algorithm for computing non-parametric function estimates in mixture models. Scand. J. Stat. 35, 385–399 (2008)
Hall, P., Huang, L.S., Gifford, J.A., Gijbels, I.: Nonparametric estimation of hazard rate under the constraint of monotonicity. J. Comput. Graph. Stat. 10, 592–614 (2001)
Huang, J., Wellner, J.A.: Estimation of a monotone density or monotone hazard under random censoring. Scand. J. Stat. 22, 3–33 (1995)
Jankowski, H., Wang, I., McCague, H., Wellner, J.A.: R Package ConvexHaz: Nonparametric MLE/LSE of Convex Hazard (Version 0.2). http://cran.r-project.org/web/packages/convexHaz/index.html (2009)
Jankowski, H.K., Wellner, J.A.: Computation of nonparametric convex hazard estimators via profile methods. J. Nonparametric Stat. 21, 505–518 (2009a)
Jankowski, H.K., Wellner, J.A.: Nonparametric estimation of a convex bathtub-shaped hazard function. Bernoulli 15, 1010–1035 (2009b)
Kaplan, E.L., Meier, P.: Nonparametric estimation from incomplete observations. J. Am. Stat. Assoc. 53, 457–481 (1958)
Klein, J.P., Moeschberger, M.L.: Survival Analysis: Techniques for Censored and Truncated Data, 2nd edn. Springer, Berlin (2003)
Lawson, C.L., Hanson, R.J.: Solving Least Squares Problems. Prentice-Hall, Englewood Cliffs (1974)
Lee, E.T., Wang, J.W.: Statistical Methods for Survival Data Analysis, 3rd edn. Wiley, London (2003)
Meyer, M.C., Habtzghi, D.: Nonparametric estimation of density and hazard rate functions with shape restrictions. J. Nonparametric Stat. 23, 455–470 (2011)
Mykytyn, S.W., Santner, T.J.: Maximum likelihood estimation of the survival function based on censored data under hazard rate assumptions. Commun. Stat. Theory Methods 10, 1369–1387 (1981)
Peto, R.: Experimental survival curves for interval-censored data. J. R. Stat. Soc. Ser. C 22, 86–91 (1973)
Proschan, F.: Theoretical explanation of observed decreasing failure rate. Technometrics 5, 375–383 (1963)
R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna (2015)
Reboul, L.: Estimation of a function under shape restrictions: applications to reliability. Ann. Stat. 33, 1330–1356 (2005)
Schick, A., Yu, Q.: Consistency of the GMLE with mixed case interval-censored data. Scand. J. Stat. 27, 45–55 (2000)
Tsai, W.-Y.: Estimation of the survival function with increasing failure rate based on left truncated and right censored data. Biometrika 75, 319–324 (1988)
Turnbull, B.W.: Nonparametric estimation of a survivorship function with doubly censored data. J. Am. Stat. Assoc. 69, 169–173 (1974)
Wang, Y.: On fast computation of the non-parametric maximum likelihood estimate of a mixing distribution. J. R. Stat. Soc. Ser. B 69, 185–198 (2007)
Wang, Y.: Dimension-reduced nonparametric maximum likelihood computation for interval-censored data. Comput. Stat. Data Anal. 52, 2388–2402 (2008)
Wang, Y.: npsurv: Non-parametric Survival Analysis (R Package Version 0.3-4). http://cran.r-project.org/package=npsurv (2015)
Wellner, J.A.: Interval censoring case 2: alternative hypotheses. In: Koul, H., Deshpande, J.V. (eds.) Analysis of Censored Data, Proceedings of the Workshop on Analysis of Censored Data, vol. 27, pp. 271–291. University of Pune, Pune (1995)
Acknowledgements
The authors thank the editor, associate editor and two referees for their constructive suggestions, which led to many improvements in the manuscript.
Appendices
Appendix 1: Derivatives and the Hessian matrix
Consider the partial derivatives of the modified log-likelihood function (3) with respect to the masses, in the general case when there exist both exact and interval-censored data. The first partial derivatives are simply the gradient functions evaluated at the corresponding support points, i.e.,
$$\mathbf {g}= \left( d_0(h),\ d_1(\tau _1; h), \ldots , d_1(\tau _k; h),\ d_2(\eta _1; h), \ldots , d_2(\eta _m; h) \right) ^\top .$$
The Hessian matrix \(\mathbf {H}\) can be computed as \( \mathbf {H}= -{\mathbf {D}}^\top {\mathbf {D}}\), where \({\mathbf {D}}\) is an \(n \times (1 + k + m)\) matrix, with its (i, j)-th element given by, for \(i = 1, \ldots , |\mathcal{I}|\),
and, for \(i = |\mathcal{I}| + 1, \ldots , n\),
where
Appendix 2: Proofs
Proof of Lemma 2
Since \({\varvec{\pi }}' = {{\mathrm{arg\,min}}}_{{\varvec{\pi }}\ge 0} ||{\mathbf {S}}{\varvec{\pi }}- \mathbf {b}||\), it holds that \(||{\mathbf {S}}{\varvec{\pi }}' - \mathbf {b}|| \le ||{\mathbf {S}}\mathbf {0}- \mathbf {b}|| = ||\mathbf {b}||\) and hence \(||{\mathbf {S}}{\varvec{\pi }}'|| \le ||{\mathbf {S}}{\varvec{\pi }}' - \mathbf {b}|| + ||\mathbf {b}|| \le 2 ||\mathbf {b}||\). Therefore, \(||{\mathbf {S}}{\varvec{\delta }}|| \le 2 ||\mathbf {b}|| + \sqrt{n}\), where \({\varvec{\delta }}= {\varvec{\pi }}' - {\varvec{\pi }}\). The result follows because \(||\mathbf {b}||\) depends only on the values \(h(T_i)\), \(i \in \mathcal{I}\), which are bounded away from zero for all \(h \in \mathcal{K}_0\). \(\square \)
Proof of Lemma 3
Since \({\varvec{\delta }}\equiv {\varvec{\pi }}' - {\varvec{\pi }}\) maximizes
under restriction \({\varvec{\pi }}' \ge 0\), we have
Noting the Taylor series expansion
for any \(0< \alpha < \frac{1}{2}\), there is a \(\lambda > 0\) such that if \(||{\mathbf {S}}{\varvec{\delta }}|| \le \lambda \), then
thus satisfying the Armijo rule.
If \(||{\mathbf {S}}{\varvec{\delta }}|| > \lambda \), then \(||\sigma ^k {\mathbf {S}}{\varvec{\delta }}|| \le \lambda \) holds for some \(k > 0\). Because \(||{\mathbf {S}}{\varvec{\delta }}|| \le u\) from Lemma 2, we need at most
$$\left\lceil \frac{\log (\lambda / u)}{\log \sigma } \right\rceil $$
steps for Armijo's rule to be satisfied in all cases. \(\square \)
Proof of Theorem 5
Owing to its monotone increase, \({\tilde{\ell }}(h_s)\) will converge to a finite value no greater than \({\tilde{\ell }}({\hat{h}})\), where \({\hat{h}}\) maximizes \({\tilde{\ell }}(h)\). Further,
because of Armijo’s rule and the nonnegative definiteness of \({\mathbf {S}}_s^{+\top } {\mathbf {S}}_s^+\).
Consider all point-mass directions \(e \in \{\pm e_0, \pm e_{1,\tau }, \pm e_{2,\eta }\}\) from \(h_s\) that are valid in the sense that there exists an \(\epsilon > 0\) such that \(h_s + \epsilon e \in \mathcal{K}\). Denote the steepest ascent direction by
$$e_s^* = \mathop {\mathrm {arg\,max}}_{e}\ d(h_s + e;\, h_s),$$
and by \({\varvec{\delta }}_s^*\) the direction resulting from \(h_s\) to \(h_s + e_s^*\). Hence, for any \(\epsilon \in {\mathbb {R}}\) such that \(h_s + \epsilon e_s^* \in \mathcal{K}\), we have
because of the optimality of \({\varvec{\delta }}_s\).
Now, let us assume that \(d(h_s + e_s^*; h_s)\) does not approach 0 as \(s \rightarrow \infty \). There are, hence, infinitely many s such that \(d(h_s + e_s^*; h_s) \ge \tau \), for some \(\tau > 0\). For such an s and noting that
we have, with Lemma 2,
Without loss of generality, assume \(\tau \le u^2\) and let \(\epsilon = \tau / u^2\). As a result,
a positive value that is independent of s. Since this violates the Cauchy property of a convergent sequence, we must have \(d(h_s + e_s^*; h_s) \rightarrow 0\) as \(s \rightarrow \infty \). Therefore, \(d({\hat{h}}; h_s) \le d(h_s + e_s^*; h_s) (|h_s| + |{\hat{h}}|) \rightarrow 0\) from Corollary 2, and \({\tilde{\ell }}(h_s) \rightarrow {\tilde{\ell }}({\hat{h}})\) from Lemma 1. \(\square \)
Keywords
- Algorithms
- Lifetime and survival analysis
- Nonparametric methods
- Shape-restricted estimation
- Numerical optimization