1 Introduction

Variable selection, the process of identifying the covariates that are most associated with or predictive of survival time among the available covariates, is crucial in survival modeling. Deciding which covariates to include in the final model is an important task for investigators, as it affects both the accuracy and the interpretability of the model. In ordinary linear regression settings, a variety of variable selection methods are well established, such as the all-possible-subsets method, forward selection, backward elimination, and stepwise selection. These methods are typically evaluated using criteria such as Mallows’ \(C_p\) (Mallows, 1973), Akaike’s Information Criterion (AIC) (Akaike, 1974), Schwarz’s Bayesian Information Criterion (BIC) (Schwarz, 1978), the Copula Information Criterion (CIC) (Grønneberg & Hjort, 2014), and the Deviance Information Criterion (DIC) (Spiegelhalter et al., 2014). However, these methods can suffer from high variability and long computation times on large datasets, as well as from selection bias, which leads to overestimating the effects of the selected covariates.

In recent decades, shrinkage methods such as the LASSO (Tibshirani, 1996), the adaptive LASSO (Zou, 2006), and the elastic net (Zou & Hastie, 2005) have become increasingly popular in model selection. These methods use penalties to shrink the coefficients of less important variables to zero, thus reducing the risk of overfitting and improving the interpretability of the model. For survival analysis, variable selection is more challenging because of censoring. Penalties such as the LASSO and the smoothly clipped absolute deviation (SCAD) penalty (Fan & Li, 2001) have been proposed and applied for variable selection in survival analysis (Salerno & Li, 2022).

Cox’s model is the most widely used model in survival analysis. However, proportional hazards, the key assumption of Cox’s model, is violated in some cases (e.g., when modeling prognostic factors in studies with long follow-up times). While Cox’s model can be extended to accommodate non-proportional hazards, for example by incorporating time-varying regression coefficients (Hess, 1994; Hastie & Tibshirani, 1993), there is no universally endorsed or straightforward way to do so, and developing a suitable model with these approaches can be complex and challenging. When the proportional hazards assumption is not satisfied, the proportional odds (PO) model is a useful alternative. The PO model was initially presented by Bennett (1983) within a semi-parametric framework. Under the PO model, which specifies that the covariate effect is a multiplicative factor on the baseline odds function, the hazard ratio between two sets of covariate values approaches unity as time progresses instead of remaining constant. Collett (1994) applied the PO model to data on the survival times of women with breast tumors that were negatively or positively stained. Crowder et al. (1991) employed the PO model for the analysis of reliability data. Rossini and Tsiatis (1996) adapted the PO model for modeling current status data.

Regarding variable selection for the PO model, Lu and Zhang (2007) proposed fitting it by maximizing the marginal likelihood subject to LASSO and adaptive LASSO penalties, and their numerical study showed that the adaptive LASSO outperforms the LASSO. However, both approaches have weaknesses when confronted with high-dimensional data (Zou & Hastie, 2005; Zou & Zhang, 2009) or collinearity. The adaptive elastic net penalty (Zou & Zhang, 2009), which uses both \(l_2\) and weighted \(l_1\) constraints, inherits the oracle properties (Zou, 2006) of the adaptive LASSO and is better able to handle collinearity.

In this paper, we propose applying the adaptive elastic net penalty to the marginal likelihood function of the PO model to solve variable selection problems under the proportional odds assumption. We compare its performance with the LASSO, adaptive LASSO, and elastic net methods in simulation studies as well as in applications to real datasets. The results show that the proposed method tends to outperform the existing ones.

In Sect. 2, we briefly review the proportional odds model with right-censored data and its marginal likelihood function. In Sect. 3, we apply the adaptive elastic net approach to the marginal likelihood of the proportional odds model. In Sect. 4, we develop a computational algorithm and discuss how to choose the tuning parameters. In Sect. 5, we present the results of simulation studies. In Sect. 6, we apply the method to real datasets to compare its performance with other methods. Finally, a summary and discussion are given in Sect. 7.

2 Proportional odds model and its marginal likelihood function

For a survival analysis problem involving right-censored data, the dataset comprises n independent observations \((\tilde{T}_i,\theta _i)\), where \(\tilde{T}_i=\min \{T_i, C_i\}\), \(T_i\) represents the time until the occurrence of a specific event of interest, \(C_i\) denotes a censoring time, and \(\theta _i\) is the censoring indicator \({\varvec{1}}_{(T_i < C_i)}\). In addition, we observe \(\textbf{Z}_i= ({Z}_{i1},\ldots ,{Z}_{ip})'\), a p-dimensional vector of covariates for the ith observation. Our primary objective is to explore the relationship between the survival time T and the covariates \({\textbf {Z}}\). The most commonly employed model in survival analysis is the Proportional Hazards (PH) model, originally proposed by Cox (1972). However, in certain scenarios, the underlying assumptions of the PH model may not hold. In such cases, the Proportional Odds model serves as a valuable alternative (Peterson et al., 1990). The Proportional Odds model is based on the assumption that

$$\begin{aligned} \frac{1-S(t\mid {\textbf {Z}})}{S(t\mid {\textbf {Z}})}= \frac{1-S_0(t)}{S_0(t)} \exp ({\varvec{\beta }}'{} {\textbf {Z}}), \end{aligned}$$
(1)

where \(S(t\mid {\textbf {Z}})\) denotes the conditional survival function of \(T \) given \({\textbf {Z}}\) and \(S_0(t)\) is the baseline survival function with \({\textbf {Z}}={\varvec{0}}\), which is completely unspecified. \({\varvec{\beta }}=(\beta _1,\ldots ,\beta _p)'\) is the regression parameter vector.

Let \(H(t)=\log [(1-S_0(t))/S_0(t)]\). Then a regression model under the proportional odds assumption (1) can be expressed as

$$\begin{aligned} H(T)=-{\varvec{\beta }}'\textbf{Z}+\varepsilon , \end{aligned}$$
(2)

where \(\varepsilon \) follows the standard logistic distribution.

Since the partial likelihood function of \({\varvec{\beta }}\) under the PO model is unavailable, we use Lam and Leung’s (2001) method to estimate \({\varvec{\beta }}\) by maximizing its marginal likelihood function. Let \(T_{(1)}< \cdots < T_{(K)}\) represent the ordered uncensored failure times in the sample and define \(T_{(0)} =0\), \(T_{(K+1)} = \infty \). For \(0 \le k \le \textit{K}\), let \(L_k\) denote the set of labels i corresponding to those observations censored in the interval \([T_{(k)},T_{(k+1)})\). The complete ranks of the \(T_i\)’s are unknown as a result of censoring. Let \({\textbf {R}} \) denote the unobserved rank vector of the \(T_i\)’s and let \({\textbf {G}} \) denote the collection of all possible rank vectors of the \(T_i\)’s consistent with the observed data \((\tilde{T}_i,\theta _i)\) \((i =1,\ldots ,n)\). The marginal likelihood is then defined by \(L_{n,M} ({\varvec{\beta }})= P ({\textbf {R}} \in {\textbf {G}} )\), where the probability is with respect to the underlying uncensored version of the study. It can be shown that \(L_{n,M} ({\varvec{\beta }})\) can be represented as

$$\begin{aligned} L_{n,M} ({\varvec{\beta }})=\idotsint _{V_{(1)}< \cdots < V_{(K)}} \prod _{i=1}^{n} \{\lambda (V_{(k_i)}+{\varvec{\beta }}'\textbf{Z}_i)\} ^{\theta _i} e^{-\Lambda (V_{(k_i)}+{\varvec{\beta }}'\textbf{Z}_i)} \prod _{k=1}^{K}\,\textrm{d}V_{(k)}, \end{aligned}$$
(3)

where \(V_{(k)} = H(T_{(k)})\), \(k =1,\ldots ,K\), and \(\Lambda (x)\) denotes the cumulative hazard function of \(\varepsilon \), i.e., \(\Lambda (x)=\log \{1+\exp (x)\}\) and \(\lambda (x)=\textrm{d}\Lambda (x)/\textrm{d}x\).

Because there is no explicit solution to the maximization of (3), an importance sampling method is used to approximate it. Following Lu and Zhang (2007), we multiply and divide the integrand of (3) by

$$\begin{aligned} c \prod _{i=1}^{n}\{\lambda (V_{(k_i)})\}^{\theta _i} e^{-\Lambda (V_{(k_i)})}, \end{aligned}$$
(4)

where the constant \( c \) is the total number of possible rank vectors in \({\textbf {G}} \). When \(V_i\equiv H(T_i)\) (\(i =1,\ldots ,n\)) are independent and identically distributed with distribution function F(x), it can be shown that (4) is the density function of \(V_{(1)},\ldots ,V_{(K)}\) under the progressive type II censoring scheme. The marginal likelihood (3) can then be expressed as

$$\begin{aligned} L_{n,M} ({\varvec{\beta }})=E\{Q(V_{(1)},\ldots , V_{(K)};{\varvec{\beta }})\}, \end{aligned}$$
(5)

where the expectation is with respect to the density (4) and

$$\begin{aligned} Q(V_{(1)},\ldots , V_{(K)};{\varvec{\beta }})=\frac{1}{ c } \prod _{i=1}^{n} \frac{\{\lambda (V_{(k_i)}+{\varvec{\beta }}'\textbf{Z}_i)\} ^{\theta _i} e^{-\Lambda (V_{(k_i)}+{\varvec{\beta }}'\textbf{Z}_i)}}{\{\lambda (V_{(k_i)})\} ^{\theta _i} e^{-\Lambda (V_{(k_i)})}}. \end{aligned}$$
(6)

Then, (5) can be estimated by

$$\begin{aligned} \hat{L}_{n,M} ({\varvec{\beta }})=\frac{1}{B}\sum _{b=1}^{B}Q \{F^{-1}(U^{b}_{(1)}),\ldots , F^{-1}(U^{b}_{(K)});{\varvec{\beta }} \}, \end{aligned}$$
(7)

where \(F^{-1}(\cdot )\) is the inverse of \(F(\cdot )\) and \(U^{b}_{(1)} ,\ldots ,U^{b}_{(K)}\), \(b=1,\ldots ,B\), denote B independent copies of the order statistics of a uniform random sample of size n under the progressive type II censoring scheme.
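To make estimator (7) concrete, the following Python sketch computes the Monte Carlo estimate of the log marginal likelihood. All function names and the data layout are illustrative choices of ours, not part of the original method: F is taken to be the standard logistic distribution (consistent with \(\lambda \) and \(\Lambda \) above), the uniform order statistics are generated with the standard Balakrishnan–Sandhu algorithm for progressive type II censoring, subjects censored before the first failure (whose factors cancel in \(Q\)) are assumed to have been dropped, and the constant \(c\), which does not depend on \({\varvec{\beta }}\), is omitted.

```python
import numpy as np

def Lam(x):
    """Cumulative hazard of the standard logistic error: log(1 + e^x)."""
    return np.logaddexp(0.0, x)

def log_lam(x):
    """Log hazard: log[e^x / (1 + e^x)]."""
    return x - np.logaddexp(0.0, x)

def uniform_progressive_os(K, R, rng):
    """Uniform order statistics under progressive type II censoring
    (Balakrishnan-Sandhu); R[k] subjects are censored after the
    (k+1)-th observed failure, so n = K + sum(R)."""
    W = rng.uniform(size=K)
    a = np.arange(1, K + 1) + np.cumsum(np.asarray(R)[::-1])
    V = W ** (1.0 / a)
    return 1.0 - np.cumprod(V[::-1])      # U_(1) < ... < U_(K)

def log_marglik(beta, Z, k_idx, theta, K, R, B=500, seed=0):
    """Monte Carlo estimate of log L_{n,M}(beta) up to -log(c).
    k_idx[i] is the rank interval k_i >= 1 of subject i."""
    rng = np.random.default_rng(seed)
    eta = Z @ beta                        # beta' Z_i for each subject
    logs = np.empty(B)
    for b in range(B):
        U = uniform_progressive_os(K, R, rng)
        V = np.log(U / (1.0 - U))         # F^{-1}: standard logistic quantile
        Vi = V[k_idx - 1]                 # V_(k_i) for each subject
        logs[b] = np.sum(theta * (log_lam(Vi + eta) - log_lam(Vi))
                         - Lam(Vi + eta) + Lam(Vi))
    m = logs.max()                        # log-mean-exp for stability
    return m + np.log(np.mean(np.exp(logs - m)))
```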

3 Adaptive elastic net approach for proportional odds model

To address the instability of the LASSO and adaptive LASSO methods of Lu and Zhang (2007) on high-dimensional data and their limitations in handling collinearity, we employ the adaptive elastic net penalty, as outlined in the work of Zou and Zhang (2009). This approach combines an \(l_2\) penalty with a weighted \(l_1\) penalty and is applied to the estimated marginal likelihood:

$$\begin{aligned} \min _{{\varvec{\beta }}}\left( -\frac{1}{n}\hat{l}_{n,M}({\varvec{\beta }})+\lambda _1\sum _{j=1}^{p}w_j\mid \beta _j\mid +\lambda _2\sum _{j=1}^{p}\beta _j^2\right) , \end{aligned}$$
(8)

where \(\hat{l}_{n,M} ({\varvec{\beta }})=\log [\hat{L}_{n,M} ({\varvec{\beta }})]\).

The tuning parameters, denoted \(\lambda _1\) and \(\lambda _2\), control the magnitudes of the \(l_1\) (LASSO) and \(l_2\) (ridge) penalties, respectively. As in conventional regression settings, the distinction between the two penalties is that the \(l_2\) penalty tends to yield small but non-zero coefficient estimates for all variables, whereas the \(l_1\) penalty shrinks some coefficients to exactly 0 while applying comparatively little shrinkage to the others. Combining the \(l_1\) and \(l_2\) penalties gives an intermediate outcome: fewer coefficient estimates are set to 0 than in a pure LASSO setting, and more shrinkage is applied to the remaining coefficients.

The larger the tuning parameters \(\lambda _1\) and \(\lambda _2\) are, the greater the degree of penalty or shrinkage imposed upon the coefficients. In practice, it can be challenging to determine appropriate values for these parameters. We propose to use BIC to find the optimal values; the choice of tuning parameters is further discussed at the end of the next section.

\({\textbf {w}} =(w_1,\ldots ,w_p)'\) is a non-negative weight vector that adjusts the penalties applied to the individual coefficients: the larger the weight, the heavier the corresponding penalty. We therefore take small weights for important covariates and large weights for unimportant covariates. In practice, the weights are chosen adaptively from the data; for example, they can be built from any consistent estimator of \({\varvec{\beta }}\) (Zou, 2006).

Here, we denote the maximum marginal likelihood estimate (MMLE) of \({\varvec{\beta }}\) as

$$\begin{aligned} \tilde{{\varvec{\beta }}}=\arg \max _{{\varvec{\beta }}}(\hat{L}_{n,M} ({\varvec{\beta }})). \end{aligned}$$

Lam and Leung (2001) have shown that \(\tilde{{\varvec{\beta }}}\) is a consistent estimator of \({\varvec{\beta }}\). The absolute values of the elements of \(\tilde{{\varvec{\beta }}}\) reflect the relative importance of the covariates. Hence, we set \(\hat{w}_j=\frac{1}{\mid \tilde{\beta }_j\mid }\).

We define our adaptive elastic net estimate for proportional odds model as

$$\begin{aligned} \hat{{\varvec{\beta }}}=\arg \min _{{\varvec{\beta }}}\left( -\frac{1}{n}\hat{l}_{n,M}({\varvec{\beta }})+\lambda _1\sum _{j=1}^{p}\hat{w}_j\mid \beta _j\mid +\lambda _2\sum _{j=1}^{p}\beta _j^2\right) , \end{aligned}$$
(9)

where \( \hat{w}_j=\frac{1}{\mid \tilde{\beta }_j\mid } \).

If \(\tilde{\beta }_j=0\), then we assign \(\hat{\beta }_j=0\). When equal weights are used in (9), the adaptive elastic net estimate reduces to the elastic net estimate; setting \(\lambda _2=0\) reduces it to the adaptive LASSO estimate.
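As a minimal illustration of objective (9) and its special cases, the sketch below evaluates the penalized criterion; `neg_log_marglik` is an assumed callable returning \(-\frac{1}{n}\hat{l}_{n,M}({\varvec{\beta }})\), and all names are ours.

```python
import numpy as np

def aenet_objective(beta, neg_log_marglik, w, lam1, lam2):
    """Objective (9): penalized negative log marginal likelihood."""
    return (neg_log_marglik(beta)
            + lam1 * np.sum(w * np.abs(beta))   # weighted l1 (LASSO) part
            + lam2 * np.sum(beta ** 2))         # l2 (ridge) part

# Special cases:
#   w = np.ones(p)             -> elastic net
#   lam2 = 0.0                 -> adaptive LASSO
#   w = np.ones(p), lam2 = 0.0 -> ordinary LASSO
```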

4 Computational algorithm and choice of tuning parameters

In this section, we present a computational algorithm for solving the adaptive elastic net problem under the proportional odds model. The algorithm involves 3 transformations. First, leveraging a Taylor expansion, we convert the adaptive elastic net penalized likelihood problem into an adaptive elastic net penalized least squares problem. Next, we transform the adaptive elastic net penalized least squares problem into an adaptive LASSO penalized least squares problem. Finally, we transform the adaptive LASSO problem into an ordinary LASSO problem. Following these successive transformations, the ordinary LASSO problem can be easily solved using existing methods such as LARS (Efron et al., 2004).

We define the gradient vector of \(l({\varvec{\beta }})\) as \(\nabla l({\varvec{\beta }})=-\partial \hat{l}_{n,M} ({\varvec{\beta }})/\partial {\varvec{\beta }}\) and the Hessian matrix as \(\nabla ^2l({\varvec{\beta }})=-\partial ^2 \hat{l}_{n,M} ({\varvec{\beta }})/\partial {\varvec{\beta }}{\varvec{\beta }}'\). Let \({\varvec{X}}\) be \(\frac{1}{\sqrt{2n}}\) times the Cholesky factor of \(\nabla ^2 l({\varvec{\beta }})\), so that \({\varvec{X'X}}=\frac{1}{2n}\nabla ^2 l({\varvec{\beta }})\). A pseudo-response vector \({\varvec{Y}}\) is set as \({\varvec{Y}}=\frac{1}{2n}({\varvec{X}}')^{-1} (\nabla ^2l({\varvec{\beta }}){\varvec{\beta }}-\nabla l({\varvec{\beta }}))\).

Applying an argument similar to Lu and Zhang (2007), we can show that \(-\frac{1}{n} \hat{l}_{n,M} ({\varvec{\beta }})\) can be approximated, up to a constant not depending on \({\varvec{\beta }}\), by its second-order Taylor expansion \(({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})'({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})\).
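A sketch of this pseudo-data construction follows (illustrative names; `grad` and `hess` are assumed callables returning \(\nabla l({\varvec{\beta }})\) and \(\nabla ^2 l({\varvec{\beta }})\), e.g., obtained by numerical differentiation of \(\hat{l}_{n,M}\)):

```python
import numpy as np

def pseudo_data(beta, grad, hess, n):
    """Build (X, Y) so that (Y - X beta)'(Y - X beta) matches the
    second-order Taylor expansion of -(1/n) l_hat up to a constant."""
    H = hess(beta)                         # nabla^2 l(beta), assumed positive definite
    g = grad(beta)                         # nabla l(beta)
    X = np.linalg.cholesky(H / (2 * n)).T  # upper factor: X'X = H / (2n)
    Y = np.linalg.solve(X.T, (H @ beta - g) / (2 * n))  # (X')^{-1}(H beta - g)/(2n)
    return X, Y
```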

Therefore, we can solve (9) iteratively. First, we compute the maximum marginal likelihood estimate \(\tilde{{\varvec{\beta }}}\). Afterward, we compute \(\nabla l(\tilde{{\varvec{\beta }}})\), \(\nabla ^2l(\tilde{{\varvec{\beta }}})\), \({\varvec{X}}\) and \({\varvec{Y}}\) based on \(\tilde{{\varvec{\beta }}}\). Then, we update \({\varvec{\beta }}\) by minimizing

$$\begin{aligned} ({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})'({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})+\lambda _1\sum _{j=1}^{p}\hat{w}_j\mid \beta _j\mid +\lambda _2\sum _{j=1}^{p}\beta _j^2 \end{aligned}$$
(10)

until it converges.

Next, we show that this adaptive elastic net problem can be transformed into an adaptive LASSO type problem in an augmented space. Building upon the approach outlined by Zou and Hastie (2005), we construct an artificial augmented dataset \(({\varvec{X}}^A,{\varvec{Y}}^A)\), where

$$\begin{aligned}{} & {} {\varvec{X}}^A_{(n+p)\times p}=(1+\lambda _2)^{-\frac{1}{2}}\left( \begin{array}{c}{\varvec{X}} \\ \sqrt{\lambda _2}{\varvec{I}}\end{array} \right) \end{aligned}$$
(11)
$$\begin{aligned}{} & {} {\varvec{Y}}^A_{(n+p)\times 1}= \left( \begin{array}{c} {\varvec{Y}} \\ {\varvec{0}} \end{array} \right) . \end{aligned}$$
(12)

Let \(\gamma =\lambda _1/\sqrt{1+\lambda _2}\), \({\varvec{\beta }}^A=\sqrt{1+\lambda _2}{\varvec{\beta }}\). For the adaptive elastic net solution, we have

$$\begin{aligned}&\arg \min _{{\varvec{\beta }}}\left[ ({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})'({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})+\lambda _1\sum _{j=1}^{p}\hat{w}_j\mid \beta _j\mid +\lambda _2\sum _{j=1}^{p}\beta _j^2\right] \\&\quad =\arg \min _{{\varvec{\beta }}}\left[ \left( \left( {\begin{array}{c}{\varvec{Y}}\\ {\varvec{0}}\end{array}}\right) -\frac{1}{\sqrt{1+\lambda _2}}\left( {\begin{array}{c}{\varvec{X}}\\ \sqrt{\lambda _2}{\varvec{I}}\end{array}}\right) \sqrt{1+\lambda _2}\,{\varvec{\beta }}\right) '\left( \left( {\begin{array}{c}{\varvec{Y}}\\ {\varvec{0}}\end{array}}\right) -\frac{1}{\sqrt{1+\lambda _2}}\left( {\begin{array}{c}{\varvec{X}}\\ \sqrt{\lambda _2}{\varvec{I}}\end{array}}\right) \sqrt{1+\lambda _2}\,{\varvec{\beta }}\right) +\lambda _1\sum _{j=1}^{p}\hat{w}_j\mid \beta _j\mid \right] \\&\quad =\arg \min _{{\varvec{\beta }}}\left[ ({\varvec{Y}}^A-{\varvec{X}}^A\sqrt{1+\lambda _2}\,{\varvec{\beta }})'({\varvec{Y}}^A-{\varvec{X}}^A\sqrt{1+\lambda _2}\,{\varvec{\beta }})+\frac{\lambda _1}{\sqrt{1+\lambda _2}}\sum _{j=1}^{p}\hat{w}_j\mid \sqrt{1+\lambda _2}\,\beta _j\mid \right] \\&\quad =\frac{1}{\sqrt{1+\lambda _2}}\arg \min _{{\varvec{\beta }}^A}\left[ ({\varvec{Y}}^A-{\varvec{X}}^A{\varvec{\beta }}^A)'({\varvec{Y}}^A-{\varvec{X}}^A{\varvec{\beta }}^A) +\gamma \sum _{j=1}^{p}\hat{w}_j\mid \beta _j^A\mid \right] . \end{aligned}$$
(13)

At this juncture, the problem is recast as an adaptive LASSO problem with tuning parameter \(\gamma =\lambda _1/\sqrt{1+\lambda _2}\); this is a convex optimization problem and does not suffer from the multiple local minima issue (Zou, 2006).
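In code, the augmentation (11)–(12) takes only a few lines (a sketch with illustrative names):

```python
import numpy as np

def augment(X, Y, lam1, lam2):
    """Map the adaptive elastic net problem to an adaptive LASSO
    problem on augmented data, as in (11)-(13)."""
    p = X.shape[1]
    XA = np.vstack([X, np.sqrt(lam2) * np.eye(p)]) / np.sqrt(1 + lam2)
    YA = np.concatenate([Y, np.zeros(p)])
    gamma = lam1 / np.sqrt(1 + lam2)
    return XA, YA, gamma

# After solving the adaptive LASSO on (XA, YA) with parameter gamma,
# recover beta_hat = beta_A_hat / np.sqrt(1 + lam2), per (13).
```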

According to Zou (2006), when the columns of \({\varvec{X}}\) are orthonormal, the solution to the adaptive LASSO problem

$$\begin{aligned} \arg \min _{{\varvec{\beta }}}\left( ({\varvec{y}}-{\varvec{X}}{\varvec{\beta }})'({\varvec{y}}-{\varvec{X}}{\varvec{\beta }})+\lambda \sum _{j=1}^{p}\hat{w}_j\mid \beta _j\mid \right) \end{aligned}$$

is \(\hat{\beta }_j^{\textrm{alasso}}=\textrm{sign}(\hat{\beta }_j^{\textrm{ols}})(\mid \hat{\beta }_j^{\textrm{ols}}\mid -\frac{1}{2}\hat{w}_j\lambda )_{+}\), for \(j=1,\ldots ,p\), where \(\hat{\beta }_j^{\textrm{ols}}\) is the ordinary least square estimate and \(z_+\) denotes the positive part of z, i.e., \(z_+=z\) if \(z>0\) and 0 otherwise.
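This closed-form solution is a coordinate-wise soft-thresholding rule; a minimal sketch under the orthonormal-design assumption (names ours):

```python
import numpy as np

def alasso_orthonormal(beta_ols, w, lam):
    """Adaptive LASSO solution when X'X = I:
    sign(b_j) * (|b_j| - w_j * lam / 2)_+ coordinate-wise."""
    return np.sign(beta_ols) * np.maximum(np.abs(beta_ols) - 0.5 * w * lam, 0.0)
```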

Thus, an adaptive LASSO problem can be reduced to an ordinary LASSO problem by rescaling each covariate by its weight; dividing the resulting LASSO solution by the corresponding weights then yields the adaptive LASSO solution. Notably, established computational techniques are available for LASSO problems, such as least angle regression (LARS) introduced by Efron et al. (2004) and the path-wise coordinate descent algorithm proposed by Wu and Lange (2008).

Here, we use a modified shooting algorithm (Fu, 1998; Lu & Zhang, 2007) to solve the adaptive LASSO problem directly, which avoids the additional transformation and makes the computation more efficient. We define \(G({\varvec{\beta }})=\sum _{i=1}^{n}(y_i-{\varvec{\beta }}'x_i)^2\) and \(\dot{G}_j({\varvec{\beta }})=\frac{\partial G({\varvec{\beta }})}{\partial \beta _j}\), \(j=1,\ldots ,p\), and denote \({\varvec{\beta }}\) by \((\beta _j,{\varvec{\beta }}^{-j})'\), where \({\varvec{\beta }}^{-j}\) is the \((p-1)\)-dimensional vector consisting of the \(\beta _i\)’s other than \(\beta _j\).

The complete algorithm to compute the adaptive elastic net solution for the proportional odds model when \(p<n\), that is, to solve

$$\begin{aligned} \arg \min _{\beta }\left( -\frac{1}{n}\hat{l}_{n,M}({\varvec{\beta }})+\lambda _1\sum _{j=1}^{p}\hat{w}_j\mid \beta _j\mid +\lambda _2\sum _{j=1}^{p}\beta _j^2\right) . \end{aligned}$$
(14)
proceeds as follows:

1. Solve \(\tilde{{\varvec{\beta }}}\) by maximizing \(\hat{l}_{n,M}({\varvec{\beta }})\). Set \(\hat{w}_j=\frac{1}{\mid \tilde{\beta }_j\mid }\) for \(j=1,\ldots ,p\).

2. Let \(k=0\), and \(\beta _j^{(0)}=0\) for \(j=1,\ldots ,p\).

3. Compute \(\nabla l\), \(\nabla ^2l\), \({\varvec{X}}\) and \({\varvec{Y}}\) based on the current value of \({\varvec{\beta }}^{(k)}\).

4. Solve

$$\begin{aligned} {\varvec{\beta }}^{(k+1)}=\arg \min _{{\varvec{\beta }}}\left[ ({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})'({\varvec{Y}}-{\varvec{X}}{\varvec{\beta }})+\lambda _1\sum _{j=1}^{p}\hat{w}_j \mid \beta _j\mid +\lambda _2\sum _{j=1}^{p}\beta _j^2\right] . \end{aligned}$$

   (a) Let

$$\begin{aligned} {\varvec{X}}_{(n+p)\times p}^A=(1+\lambda _2)^{-\frac{1}{2}}\left( \begin{array}{c}{\varvec{X}} \\ \sqrt{\lambda _2}{\varvec{I}}\end{array}\right) ,\quad {\varvec{Y}}^A_{(n+p) \times 1}= \left( \begin{array}{c} {\varvec{Y}} \\ {\varvec{0}} \end{array} \right) \quad \textrm{and}\quad \gamma =\lambda _1/\sqrt{1+\lambda _2}. \end{aligned}$$

   (b) Solve \(\hat{{\varvec{\beta }}}^A=\arg \min _{{\varvec{\beta }}^A}[({\varvec{Y}}^A-{\varvec{X}}^A{\varvec{\beta }}^A)'({\varvec{Y}}^A-{\varvec{X}}^A{\varvec{\beta }}^A) +\gamma \sum _{j=1}^{p}\frac{\mid \beta _j^A\mid }{\mid \tilde{\beta }_j\mid }]\) (a code sketch of this step follows the list):

      (i) Start with \(\hat{{\varvec{\beta }}}_0 = \tilde{{\varvec{\beta }}}=(\tilde{\beta }_1,\ldots ,\tilde{\beta }_p)'\) and let \(\lambda _j=\frac{\gamma }{\mid \tilde{\beta }_j\mid }\) for \(j=1,\ldots ,p\).

      (ii) At step m, for each \(j=1,\ldots ,p\), let \(G_0=\dot{G}_j(0,\hat{{\varvec{\beta }}}_{m-1}^{-j})\) and set

$$\begin{aligned} \hat{\beta }_j^A = \left\{ \begin{array}{ll} \frac{\lambda _j-G_0}{2(x^j)'x^j} &{}\quad \text{ if } G_0 > \lambda _j\\ \frac{-\lambda _j-G_0}{2(x^j)'x^j} &{}\quad \text{ if } G_0 < -\lambda _j \\ 0 &{}\quad \text{ if } \mid G_0\mid \le \lambda _j.\end{array} \right. \end{aligned}$$

      (iii) Repeat (ii) until \(\hat{{\varvec{\beta }}}_m^A\) converges.

   (c) Set \({\varvec{\beta }}^{(k+1)}=\frac{1}{\sqrt{1+\lambda _2}}\hat{{\varvec{\beta }}}^A\).

5. If \(\left\| {\varvec{\beta }}^{(k+1)}-{\varvec{\beta }}^{(k)} \right\| ^2<0.0001\) (or another given small \(\varepsilon >0\)), stop; otherwise set \(k=k+1\) and go to step 3.
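The following is a minimal Python sketch of step 4(b) (names ours; `XA`, `YA`, and `gamma` come from step 4(a), `beta_tilde` is the MMLE, and \(\dot{G}_j(0,\hat{{\varvec{\beta }}}^{-j})\) is computed from the partial residual):

```python
import numpy as np

def shooting_alasso(XA, YA, gamma, beta_tilde, tol=1e-6, max_iter=500):
    """Modified shooting algorithm for the adaptive LASSO problem in
    step 4(b); lam[j] = gamma / |beta_tilde_j|.  Assumes beta_tilde
    has no exact zeros (such coordinates are fixed at 0, see Sect. 3)."""
    lam = gamma / np.abs(beta_tilde)
    beta = beta_tilde.copy()              # step (i): start at the MMLE
    col_ss = np.sum(XA ** 2, axis=0)      # (x^j)' x^j for each column
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(len(beta)):        # step (ii): cycle over coordinates
            r = YA - XA @ beta + XA[:, j] * beta[j]   # residual with beta_j = 0
            G0 = -2.0 * (XA[:, j] @ r)                # G_j'(0, beta^{-j})
            if G0 > lam[j]:
                beta[j] = (lam[j] - G0) / (2.0 * col_ss[j])
            elif G0 < -lam[j]:
                beta[j] = (-lam[j] - G0) / (2.0 * col_ss[j])
            else:
                beta[j] = 0.0
        if np.sum((beta - beta_old) ** 2) < tol:      # step (iii)
            break
    return beta
```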

For \(p \geqslant n\) cases, the MMLE of \({\varvec{\beta }}\) is not available. Following Zou and Zhang (2009), we construct the adaptive weights \(\hat{w}_j\) from elastic net estimates: we first apply the algorithm with initial weights \(\hat{w}_j^{(0)}=1\) for \(j=1,\ldots ,p\) to obtain the elastic net estimates \(\hat{\beta }_j^{\textrm{enet}}\), then set \(\hat{w}_j=(\mid \hat{\beta }_j^{\textrm{enet}}\mid +\frac{1}{n})^{-1}\) and run steps 2 through 5 to obtain the adaptive elastic net solution.
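A sketch of this weight construction (`beta_enet` is the elastic net estimate from the first pass; names ours):

```python
import numpy as np

def enet_weights(beta_enet, n):
    """Adaptive weights for p >= n: w_j = (|beta_enet_j| + 1/n)^(-1)."""
    return 1.0 / (np.abs(beta_enet) + 1.0 / n)
```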

Tuning is a very important aspect of model fitting. For the adaptive elastic net approach, we need to find optimal values of \(\lambda _1\) and \(\lambda _2\). We use the Bayesian Information Criterion (BIC) (Schwarz, 1978) to choose the best combination of \(\lambda _1\) and \(\lambda _2\). We define the BIC for the proportional odds model as

$$\begin{aligned} \textrm{BIC}=-2\hat{l}_{n,M} (\hat{{\varvec{\beta }}}^{\textrm{aenet}} \mid \lambda _1,\lambda _2 )+k\log (n), \end{aligned}$$
(15)

where k is the total number of non-zero parameters and n is the number of observations.

The typical way to deal with the two tuning parameters in an adaptive elastic net problem is to pick a relatively small grid of values for \(\lambda _2\), for example (0, 0.001, 0.01, 0.1, 1, 10). Then, for each value of \(\lambda _2\), we compute the BIC scores over a sequence of \(\lambda _1\) values. The chosen \((\lambda _1,\lambda _2)\) is the pair that gives the smallest BIC score.
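This grid search can be written compactly as follows (a sketch; `fit` and `bic` are assumed callables that return the penalized estimate for a given pair and evaluate criterion (15), and the \(\lambda _1\) grid is an illustrative choice):

```python
import numpy as np
from itertools import product

def select_tuning(fit, bic,
                  lam2_grid=(0.0, 0.001, 0.01, 0.1, 1.0, 10.0),
                  lam1_grid=np.geomspace(1e-3, 1.0, 20)):
    """Return the (lam1, lam2) pair minimizing the BIC score (15)."""
    best_score, best_pair = np.inf, None
    for lam2, lam1 in product(lam2_grid, lam1_grid):
        beta_hat = fit(lam1, lam2)            # adaptive elastic net fit
        score = bic(beta_hat, lam1, lam2)     # criterion (15)
        if score < best_score:
            best_score, best_pair = score, (lam1, lam2)
    return best_pair, best_score
```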

5 Simulation studies

We generate data from proportional odds models and apply the adaptive elastic net procedure to perform variable selection. For each model, we generate 100 simulated datasets and gauge the variable selection performance using (C0, IC0), where C0 is the number of unimportant covariates that the procedure correctly estimates as zero and IC0 is the number of important covariates that the procedure incorrectly estimates as zero. To measure prediction accuracy, we follow Tibshirani (1996) and summarize the average mean square error (MSE) \((\hat{{\varvec{\beta }}}-{\varvec{\beta }})'{\varvec{V}}(\hat{{\varvec{\beta }}}-{\varvec{\beta }})\) over the 100 runs, where \({\varvec{V}}\) is the population covariance matrix of the covariates. The BIC method is used to choose the tuning parameters. The simulation is run under no censoring, a 20% censoring rate, and a 40% censoring rate, respectively. Also, 3 sample sizes, \(n=100\), \(n=200\), and \(n=500\), are used for each model. The results are then compared with the LASSO, adaptive LASSO, and elastic net. In our implementation, we set \(\lambda _2=0\) in the adaptive elastic net to get the adaptive LASSO fit; to get the elastic net fit, we set \(w_j=1\) for \(j=1,2,\ldots ,p\). For these 3 methods, BIC is also used to select the tuning parameters. Five models with different \({\varvec{\beta }}\) and Pearson correlation coefficients \(\rho _{i,j}\) are used for our simulation studies; a sketch of the data-generating mechanism is given below, followed by the results.
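The following sketch generates one dataset via representation (2); the baseline \(S_0(t)=1/(1+t)\), which gives \(H(t)=\log t\) and hence \(T=\exp (-{\varvec{\beta }}'{\textbf {Z}}+\varepsilon )\), and the uniform censoring mechanism are illustrative assumptions of ours, as are all names:

```python
import numpy as np

def simulate_po(n, beta, rho, cens_upper=None, seed=0):
    """One dataset from the PO model: H(T) = -beta'Z + eps with eps
    standard logistic; illustrative baseline S0(t) = 1/(1+t)."""
    rng = np.random.default_rng(seed)
    p = len(beta)
    # equicorrelated standard normal covariates: corr(Z_i, Z_j) = rho
    V = rho * np.ones((p, p)) + (1.0 - rho) * np.eye(p)
    Z = rng.multivariate_normal(np.zeros(p), V, size=n)
    T = np.exp(-(Z @ beta) + rng.logistic(size=n))   # H^{-1}(x) = e^x
    if cens_upper is None:                           # no censoring
        return Z, T, np.ones(n, dtype=int)
    C = rng.uniform(0.0, cens_upper, size=n)         # tune for 20%/40% rates
    return Z, np.minimum(T, C), (T < C).astype(int)

# Model 1 below: rho = 0.2 and
beta1 = np.array([-0.8, 0, 0, -0.8, 0, 0, -0.7, 0, 0, -0.7])
Z, T_obs, status = simulate_po(n=100, beta=beta1, rho=0.2)
```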

Model 1: The design contains ten covariates: \(({Z}_1, {Z}_2,\ldots ,{Z}_{10})\). The covariates are marginally standard normal and \(\rho _{i,j}=0.2\) for \(i,j=1,2,\ldots ,10\) and \(i\ne j\). \({\varvec{\beta }}'=(-0.8,0,0,-0.8,0,0,-0.7,0,0,-0.7)\). Therefore, \({Z}_1, {Z}_4,{Z}_7\) and \({Z}_{10}\) are the important variables. This model is used to compare the performance of the adaptive elastic net and the other 3 procedures in a scenario where the important covariates all have large effects and the pairwise correlations between the covariates are weak. The simulation results are summarized in Fig. 1.

Fig. 1 Variable selection results and MSE for \(\rho =0.2\) and \({\varvec{\beta }}'=(-0.8,0,0,-0.8,0,0,-0.7,0,0,-0.7)\). The MSE is the averaged value over 100 replicates

For model 1, when the sample size is small (\(n=100\)), the two procedures with oracle properties (Zou, 2006), adaptive LASSO and adaptive elastic net, are comparable in terms of accuracy, and LASSO and elastic net are outperformed by their counterparts. Regarding mean square error, the adaptive elastic net approach is better than any of the other 3 approaches (around 24.5% smaller MSE on average compared with the next best approach); adaptive LASSO and elastic net have similar MSE, and LASSO has the largest MSE. When the sample size is increased to 200, the difference in MSE between adaptive LASSO and adaptive elastic net becomes small (around 10% on average). When the sample size is increased to 500, adaptive LASSO and adaptive elastic net perform equally well in terms of accuracy; however, adaptive elastic net is still slightly better than adaptive LASSO in terms of MSE (around 6% on average). For this model, we conclude that adaptive LASSO and adaptive elastic net are the two best approaches in terms of selection accuracy, and that the adaptive elastic net does better than the adaptive LASSO in terms of MSE when the sample size is small.

Model 2: The design contains ten covariates: \(({Z}_1,{Z}_2,\ldots ,{Z}_{10})\). The covariates are marginally standard normal and \(\rho _{i,j}=0.8\) for \(i,j=1,2,\ldots ,10\) and \(i\ne j\). \({\varvec{\beta }}'=(-0.8,0,0,-0.8,0,0,-0.7,0,0,-0.7)\). This model is used to compare the performance of the 4 procedures in a scenario where the important covariates all have large effects and the pairwise correlations between the covariates are strong. The simulation results are summarized in Fig. 2.

Fig. 2 Variable selection results and MSE for \(\rho =0.8\) and \({\varvec{\beta }}'=(-0.8,0,0,-0.8,0,0,-0.7,0,0,-0.7)\). The MSE is the averaged value over 100 replicates

For model 2, when the sample size is 100, adaptive elastic net and adaptive LASSO have the highest accuracy rates and adaptive elastic net has the smallest MSE (around 3% smaller on average than the elastic net, which has the second smallest MSE). At a high censoring rate, adaptive elastic net and adaptive LASSO tend to shrink more important variables to 0 than LASSO and elastic net do. When the sample size is increased to 500, none of the 4 procedures misses any important variables, adaptive LASSO and adaptive elastic net have comparable accuracy, and adaptive elastic net is still the best in terms of MSE (around 7% smaller on average). We conclude that the adaptive elastic net method has the best overall performance for model 2.

Model 3: The design contains ten covariates: \(({Z}_1,{Z}_2,\ldots ,{Z}_{10})\). The covariates are marginally standard normal and \(\rho _{i,j}=0.2\) for \(i,j=1,2,\ldots ,10\) and \(i\ne j\). \({\varvec{\beta }}'=(-0.3,0,0,-0.3,0,0,-0.2,0,0,-0.2)\). This model is used to compare the performance of the 4 procedures in a scenario where the important covariates all have small effects and the pairwise correlations between the covariates are low. The simulation results are summarized in Fig. 3.

Fig. 3 Variable selection results and MSE for \(\rho =0.2\) and \({\varvec{\beta }}'=(-0.3,0,0,-0.3,0,0,-0.2,0,0,-0.2)\). The MSE is the averaged value over 100 replicates

When the sample size is 100, adaptive LASSO has the best accuracy in dropping unimportant variables, but it also tends to drop important variables more often than the other 3 methods. On the contrary, elastic net keeps the most important variables, but it does the worst job in eliminating the unimportant ones. The adaptive elastic net is very close to adaptive LASSO in eliminating zero variables (about 7% lower accuracy) and is almost as good as elastic net in keeping non-zero variables. Also, adaptive elastic net is consistently the best among the 4 approaches in terms of MSE (around 24% less than elastic net, the approach with the second smallest MSE). As the sample size increases to 200, the differences in correct zeros as well as in MSE between adaptive elastic net and adaptive LASSO narrow. When the sample size reaches 500, adaptive elastic net outperforms adaptive LASSO in dropping unimportant variables, elastic net is still the best in keeping important variables, and adaptive elastic net still has the smallest MSE. Considering all 3 factors, we conclude that adaptive elastic net is the best approach for this scenario.

Model 4: The design contains ten covariates: \(({Z}_1,{Z}_2,\ldots ,{Z}_{10})\). The covariates are marginally standard normal and \(\rho _{i,j}=0.8\) for \(i,j=1,2,\ldots ,10\) and \(i\ne j\). \({\varvec{\beta }}'=(-0.3,0,0,-0.3,0,0,-0.2,0,0,-0.2)\). This model is used to compare the performance of the 4 procedures in a scenario where the important covariates all have small effects and the pairwise correlations between the covariates are strong. The simulation results are summarized in Fig. 4.

Fig. 4 Variable selection results and MSE for \(\rho =0.8\) and \({\varvec{\beta }}'=(-0.3,0,0,-0.3,0,0,-0.2,0,0,-0.2)\). The MSE is the averaged value over 100 replicates

For model 4, elastic net is still the best in retaining the non-zero variables in the model, and adaptive LASSO and adaptive elastic net are the best in eliminating zero variables from the model. For small sample sizes, adaptive elastic net has the smallest MSE (around 3% difference on average). At large sample sizes, adaptive elastic net and adaptive LASSO have the smallest MSE. For this model, we conclude that adaptive elastic net has the best overall performance when the sample size is small. When the sample size is large, elastic net is the best procedure in terms of selection accuracy and adaptive elastic net is the best procedure in terms of MSE.

Model 5: The design contains 100 marginally standard normal covariates: \(({Z}_1,{Z}_2,\ldots ,{Z}_{100})\). \({\varvec{\beta }}'=(-0.8\textbf{I}_{10},-0.7\textbf{I}_{10},-0.3\textbf{I}_{10},-0.2\textbf{I}_{10},0_{60} )\). Thus, \({Z}_1,{Z}_2,\ldots ,{Z}_{20}\) are important variables with large effects, \({Z}_{21},{Z}_{22},\ldots ,{Z}_{40}\) are important variables with small effects, and \({Z}_{41},{Z}_{42},\ldots ,{Z}_{100}\) are unimportant variables. We use this model to compare the performance of the 4 procedures in a complicated case with a large number of covariates in which the important covariates have both small and large effects. We run this model with 20% and 40% censoring rates and consider pairwise correlation coefficients of 0.5 and 0.8. The results are summarized in Figs. 5 and 6.

Fig. 5 Variable selection results and MSE for \(\rho =0.5, 0.8\) and \({\varvec{\beta }}'=(-0.8\textbf{I}_{10},-0.7\textbf{I}_{10},-0.3\textbf{I}_{10},-0.2\textbf{I}_{10},0_{60})\), censoring rate 20%. The MSE is the averaged value over 100 replicates

Fig. 6 Variable selection results and MSE for \(\rho =0.5, 0.8\) and \({\varvec{\beta }}'=(-0.8\textbf{I}_{10},-0.7\textbf{I}_{10},-0.3\textbf{I}_{10},-0.2\textbf{I}_{10},0_{60} )\), censoring rate 40%. The MSE is the averaged value over 100 replicates

For this complex model, the adaptive elastic net attains the lowest MSE in 10 of the 12 combinations of sample size, correlation coefficient, and censoring rate. Adaptive LASSO and adaptive elastic net are the two best approaches at correctly dropping zero variables, while elastic net outperforms the other 3 methods at keeping non-zero variables. If selecting unimportant variables and omitting important ones are penalized equally, then the adaptive elastic net is the most favorable approach for this complex model.

6 Application in real data

6.1 Veteran cancer data

We apply the adaptive elastic net method to the data from the Veteran’s Administration lung cancer trial (Prentice & Kalbfleisch, 2002). In this trial, 137 males with advanced inoperable lung cancer were randomized to either a standard treatment or chemotherapy. There are six covariates: treatment (1 \(=\) standard, 2 \(=\) test); cell type (1 \(=\) squamous, 2 \(=\) small cell, 3 \(=\) adeno, 4 \(=\) large); Karnofsky score; months from diagnosis; age; prior therapy (0 \(=\) no, 10 \(=\) yes).

We include all the covariates and all the patients in our analysis and compute the adaptive elastic net estimates under the proportional odds model. Maximum marginal likelihood, LASSO, adaptive LASSO, and elastic net estimates are also computed.

Table 1 Estimated coefficients for lung cancer data

Table 1 summarizes the coefficients estimated by these approaches. The maximum marginal likelihood estimates are in good agreement with those reported in Lam and Leung (2001) and Lu and Zhang (2007). The LASSO selects cell type (squamous versus large, small versus large, and adeno versus large) and Karnofsky score as important variables, while the other three methods eliminate one more variable (squamous versus large).

We use k-fold cross-validation, a standard resampling procedure for evaluating and comparing models in machine learning and survival analysis. In the cross-validation, the dataset is randomly partitioned into k blocks, each of which is used in turn as the test set in one cross-validation iteration; the remaining \(k-1\) blocks are combined and used as the training set to fit the model. The concordance index proposed by Zheng and Heagerty (2005) is used to evaluate the performance of the models in the cross-validation procedure. The concordance index is defined as

$$\begin{aligned} \text {CI}=1-\frac{1}{\mid \varepsilon \mid }\sum _{\{i:\delta _i=1\}}\sum _{t_i< t_j}\left( {\varvec{1}}_{f(x_i)<f(x_j)}+\frac{1}{2}{\varvec{1}}_{f(x_i)=f(x_j)}\right) , \end{aligned}$$
(16)

where \(\varepsilon \) is defined as the set of all pairs \((t_i,t_j)\), \(i,j=1,\ldots ,n\), for which it can be concluded that \(t_i<t_j\), \(t_i=t_j\), or \(t_i>t_j\) despite censoring, and f(x) denotes the predicted survival time of the event given covariate vector \({\varvec{x}}\).

By (16), the concordance index equals one minus the proportion of pairs whose predicted survival times are correctly ordered, among all pairs of subjects that can actually be ordered. Values of the concordance index close to 0 therefore indicate nearly perfect prediction; values close to 0.5 indicate essentially random predictions.
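A direct transcription of (16) into code might look as follows (a sketch with illustrative names; comparable pairs are taken to be those anchored at an observed event, and ties in observed times are ignored):

```python
import numpy as np

def concordance_index(time, status, pred):
    """Concordance index (16): one minus the fraction of comparable
    pairs (t_i < t_j, with i an observed event) whose predicted
    survival times pred = f(x) are correctly ordered."""
    num, n_pairs = 0.0, 0
    for i in range(len(time)):
        if status[i] != 1:                  # pairs are anchored at events
            continue
        for j in range(len(time)):
            if time[i] < time[j]:           # orderable despite censoring
                n_pairs += 1
                if pred[i] < pred[j]:
                    num += 1.0
                elif pred[i] == pred[j]:
                    num += 0.5
    return 1.0 - num / n_pairs
```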

Here, we apply threefold cross-validation; we use a small number of folds because we need a large enough number of samples in the test set to obtain a meaningful concordance index. In each iteration of the cross-validation, the 4 selection procedures are used to build the models and the concordance indices (CI) are calculated using the linear predictor \(X_{\textrm{test}}\hat{\beta }_{\textrm{train}}\). The CI for the cross-validation is the average of the CIs over the 3 iterations. We run the cross-validation procedure 3 times and average the concordance indices to reduce the variability due to the random partitioning.

The results of the cross-validation are shown in Table 2. The adaptive elastic net has the lowest mean concordance index and thus, by (16), the best performance among the 4 methods in the cross-validation.

Table 2 Mean concordance indices comparison for VC data

6.2 GSE data

We select 5 lung cancer datasets with genome-wide gene expression measurements and additional clinical information from the Gene Expression Omnibus: GSE4573 (Beer, 2006), GSE14814 (Tsao, 2010), GSE31210 (Gotoh, 2012), GSE37745 (Micke, 2013), and GSE50081 (Tsao, 2014). Table 3 and Fig. 7 show the characteristics of the datasets and the Kaplan–Meier estimates of the survival times, respectively. GSE14814, GSE50081, and GSE4573 have comparable estimated survival curves, GSE37745 shows slightly lower survival over time, and GSE31210 has a markedly higher survival rate than the other 4 datasets.

Table 3 Characteristics of lung cancer datasets
Fig. 7 Kaplan–Meier estimates of survival for the 5 GSE datasets

In addition to the gene covariates, important clinical variables, including sex, age, stage, and histology, are also used in the analysis. We remove all the incomplete observations before we apply the model selection methods. For all the 5 datasets, we evaluate the LASSO, adaptive LASSO, elastic net, and adaptive elastic net methods through threefold cross-validation. The CI for the cross-validation is the average of the CIs of the three iterations. Again, we run the cross-validation procedure 3 times and take the average of concordance indices to reduce the performance error due to randomization. The results are shown in Table 4.

Table 4 Mean concordance indices for 5 GSE datasets

We can see that although the 5 datasets come from the same field, fitting a predictive model to them is not equally difficult, as is evident from comparing the concordance indices across the 5 datasets. For datasets GSE4573, GSE14814, and GSE37745, the cross-validated concordance indices are around 0.5 for all of the model selection procedures being compared, which means the predictions are nearly random, and the adaptive elastic net does not perform better in terms of prediction than the other methods. However, for datasets GSE31210 and GSE50081, it is much easier to build a model to predict survival, and on these two datasets the adaptive elastic net does perform better in prediction than the other methods. For GSE31210, the concordance index for the adaptive elastic net is 0.04 (15.3%) smaller than that of the next best method, and for GSE50081, it is also 0.04 (11.8%) smaller than that of the next best method. Across all 5 datasets, the mean concordance index for the adaptive elastic net is 0.012 (3%) smaller than that of the next best approach.

7 Summary and discussion

In this paper, we have studied the application of the adaptive elastic net to the variable selection problem under the proportional odds model and compared its performance with the LASSO, adaptive LASSO, and elastic net. Our simulation results show that the adaptive elastic net method gives superior results in terms of variable selection accuracy and MSE in most cases. The simulations also indicate that as the censoring rate increases, all the approaches tend to have higher selection error rates and higher MSE, but the relative ranking of their performance does not change. Our proposed method is naturally more complex than the LASSO, adaptive LASSO, and elastic net; as a result, it should be expected to demand more computation time, with the percentage varying across scenarios. Nevertheless, given its superior performance in the majority of scenarios, our method remains an efficient approach. Moreover, because the adaptive elastic net has the oracle properties (Zou, 2006), the bias of its coefficient estimates tends to zero as the sample size goes to infinity. For finite samples, because of the nature of shrinkage, the adaptive elastic net estimator may exhibit noticeable bias. Therefore, in real applications, it may be a good choice to perform variable selection and estimation separately: first eliminate unimportant variables using the adaptive elastic net procedure, and then fit the model using a classical method such as maximum marginal likelihood estimation to obtain the coefficient estimates.