Variable selection in discrete survival models including heterogeneity

Lifetime Data Analysis

Abstract

Several variable selection procedures are available for continuous time-to-event data. However, if time is measured on a discrete scale, so that many ties occur, models for continuous time are inadequate. We propose penalized likelihood methods that perform efficient variable selection in discrete survival modeling while explicitly modeling the heterogeneity in the population. The method combines ridge-type and lasso-type penalties tailored to discrete survival. Its performance is examined in simulation studies and in an application to the birth of the first child.
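To make the discrete-time setting concrete, the following minimal Python sketch (not taken from the article) shows the standard person-period expansion, in which each subject contributes one binary observation per period at risk, so that a discrete hazard can be treated as a binary response. The function names `to_person_period` and `life_table_hazard`, the toy data, and the convention that a censored subject is still at risk in its last observed period are illustrative assumptions. The penalized estimation with heterogeneity proposed in the article is not implemented here.

```python
from typing import Dict, List, Tuple


def to_person_period(times: List[int], events: List[int]) -> List[Tuple[int, int, int]]:
    """Expand (time, event) pairs into person-period rows (subject, period, y).

    times[i]  : last observed discrete period for subject i (1-based)
    events[i] : 1 if the event occurred in that period, 0 if censored
    y         : 1 only in the period where the event occurs, else 0
    Assumption: a censored subject counts as at risk in its last period.
    """
    rows = []
    for subject, (t, d) in enumerate(zip(times, events)):
        for period in range(1, t + 1):
            y = 1 if (d == 1 and period == t) else 0
            rows.append((subject, period, y))
    return rows


def life_table_hazard(rows: List[Tuple[int, int, int]]) -> Dict[int, float]:
    """Empirical discrete hazard per period: events divided by number at risk."""
    at_risk: Dict[int, int] = {}
    n_events: Dict[int, int] = {}
    for _, period, y in rows:
        at_risk[period] = at_risk.get(period, 0) + 1
        n_events[period] = n_events.get(period, 0) + y
    return {p: n_events[p] / at_risk[p] for p in sorted(at_risk)}


# Hypothetical toy data: three subjects are tied at period 2.
times = [2, 2, 2, 3, 1]
events = [1, 1, 0, 1, 0]
rows = to_person_period(times, events)
hazard = life_table_hazard(rows)
```

In this representation, ties are unproblematic because each period is modeled separately. A regression model for the binary response `y`, with period-specific intercepts and covariates, would then play the role of the discrete hazard model.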

Notes

  1. On a MacBook Pro with a 2.5 GHz Intel Core i5 processor, the methods managed about 10 simulation runs per week.

  2. Multiple imputation techniques implemented in R were also used, e.g. those in the packages mi (Gelman et al. 2013) and mice (van Buuren and Groothuis-Oudshoorn 2013). As shown in the work of Abedieh, the different imputation techniques led to almost indistinguishable results, so our analysis is based on the data set obtained via the last-value-carried-forward method. For a very helpful description of the MICE technique together with illustrative examples, see van Buuren and Groothuis-Oudshoorn (2011).

References

  • Anderson DA, Aitkin M (1985) Variance component models with binary response: interviewer variability. J R Stat Soc Ser B 47:203–210

  • Androulakis E, Koukouvinos C, Vonta F (2012) Estimation and variable selection via frailty models with penalized likelihood. Stat Med 31(20):2223–2239

  • Baker M, Melino A (2000) Duration dependence and nonparametric heterogeneity: a Monte Carlo study. J Econom 96:357–393

  • Bates D, Maechler M (2010) lme4: linear mixed-effects models using S4 classes. http://CRAN.R-project.org/package=lme4, R package version 0.999999-0

  • Bradic J, Fan J, Jiang J (2011) Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Stat 39(6):3092

  • Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. J Am Stat Assoc 88:9–25

  • Breslow NE, Lin X (1995) Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika 82:81–91

  • Broström G (2009) glmmML: generalized linear models with clustering. http://CRAN.R-project.org/package=glmmML, R package version 0.81-6

  • Brown C (1975) On the use of indicator variables for studying the time-dependence of parameters in a response-time model. Biometrics 31:863–872

  • Bühlmann P, Van De Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, New York

  • Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B 34:187–220

  • Dezeure R, Bühlmann P, Meier L, Meinshausen N (2014) High-dimensional inference: confidence intervals, p values and R-Software hdi. arXiv preprint arXiv:1408.4026

  • Dierckx P (1993) Curve and surface fitting with splines. Oxford Science Publications, Oxford

  • Do Ha I, Noh M, Lee Y (2012) frailtyHL: a package for fitting frailty models with h-likelihood. R J 4(2):28–36

  • Efron B (1988) Logistic regression, survival analysis, and the Kaplan–Meier curve. J Am Stat Assoc 83:414–425

  • Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11:89–121

  • Fahrmeir L (1994) Dynamic modelling and penalized likelihood estimation for discrete time survival data. Biometrika 81:317–330

  • Fahrmeir L, Kneib T (2011) Bayesian smoothing and regression for longitudinal, spatial and event history data. Cambridge University Press, Cambridge

  • Fahrmeir L, Knorr-Held L (1997) Dynamic discrete-time duration models: estimation via Markov chain Monte Carlo. Sociol Methodol 27(1):417–452

  • Fahrmeir L, Tutz G (2001) Multivariate statistical modelling based on generalized linear models. Springer, New York

  • Fan J, Li R (2002) Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat 30(1):74–99

  • Friedman JH, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22

  • Gamst A, Donohue M, Xu R (2009) Asymptotic properties and empirical evaluation of the NPMLE in the proportional hazards mixed-effects model. Stat Sin 19(3):997

  • Gelman A, Hill J, Su Y, Yajima M, Pittau MG (2013) mi: missing data imputation and model checking. http://CRAN.R-project.org/package=mi, R package version 0.09-18.03

  • Goeman JJ (2010) \(L_1\) penalized estimation in the Cox proportional hazards model. Biom J 52:70–84

  • Goeman JJ (2011) Penalized. R package version 0.9-42

  • Groll A (2011) glmmLasso: variable selection for generalized linear mixed models by \(L_1\)-penalized estimation. http://CRAN.R-project.org/package=glmmLasso, R package version 1.2.3

  • Groll A, Tutz G (2014) Variable selection for generalized linear mixed models by \(L_1\)-penalized estimation. Stat Comput 24(2):137–154

  • Ham JC, Rea Jr SA (1987) Unemployment insurance and male unemployment duration in Canada. J Labor Econom 5(3):325–353

  • Hartzel J, Liu I, Agresti A (2001) Describing heterogeneous effects in stratified ordinal contingency tables, with applications to multi-center clinical trials. Comput Stat Data Anal 35(4):429–449

  • Heckman JJ, Singer B (1984) Econometric duration analysis. J Econom 24(1):63–132

  • Hinde J (1982) Compound Poisson regression models. In: Gilchrist R (ed) GLIM 1982 international conference on generalized linear models. Springer, New York, pp 109–121

  • Huinink J, Brüderl J, Nauck B, Walper S, Castiglioni L, Feldhaus M (2011) Panel analysis of intimate relationships and family dynamics (pairfam): conceptual framework and design. J Fam Res 23:77–101

  • Kalbfleisch J, Prentice R (2002) The statistical analysis of failure time data, 2nd edn. Wiley, New York

  • Kauermann G, Tutz G, Brüderl J (2005) The survival of newly founded firms: a case-study into varying-coefficient models. J R Stat Soc A 168:145–158

  • Laird N, Olivier D (1981) Covariance analysis of censored survival data using log-linear analysis techniques. J Am Stat Assoc 76(374):231–240

  • Lancaster T (1990) The econometric analysis of transition data. Cambridge University Press, Cambridge

  • Land KC, Nagin DS, McCall PL (2001) Discrete-time hazard regression models with hidden heterogeneity: the semiparametric mixed Poisson regression approach. Sociol Methods Res 29(3):342–373

  • Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Econom Theory 21(1):21–59

  • Lin X, Breslow NE (1996) Bias correction in generalized linear mixed models with multiple components of dispersion. J Am Stat Assoc 91:1007–1016

  • Littell R, Milliken G, Stroup W, Wolfinger R (1996) SAS system for mixed models. SAS Institute Inc., Cary

  • Liu Q, Pierce DA (1994) A note on Gauss–Hermite quadrature. Biometrika 81:624–629

  • Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R (2014) A significance test for the Lasso. Ann Stat 42(2):413

  • Möst S, Pößnecker W, Tutz G (2015) Variable selection for discrete competing risks models. Qual Quant. pp 1–22

  • Nauck B, Brüderl J, Huinink J, Walper S (2013) The German Family Panel (pairfam). GESIS Data Archive, Cologne. ZA5678 data file version 4.0.0

  • Nicoletti C, Rondinelli C (2010) The (mis) specification of discrete duration models with unobserved heterogeneity: a Monte Carlo study. J Econom 159(1):1–13

  • Park MY, Hastie T (2007) An L1 regularization-path algorithm for generalized linear models. J R Stat Soc B 69:659–677

  • Pinheiro JC, Bates DM (1995) Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat 4:12–35

  • Pinheiro JC, Bates DM (2000) Mixed-effects models in S and S-plus. Springer, New York

  • Pötscher BM, Leeb H (2009) On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. J Multivar Anal 100(9):2065–2082

  • Prentice RL, Gloeckler LA (1978) Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34:57–67

  • Rondeau V, Mazroui Y, Gonzalez JR (2012) frailtypack: an R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. J Stat Softw 47(4):1–28

  • Scheike T, Jensen T (1997) A discrete survival model with random effects: an application to time to pregnancy. Biometrics 53:318–329

  • Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464

  • Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1–13

  • Therneau T, Grambsch P (2000) Modeling survival data: extending the Cox model. Springer, New York

  • Therneau TM (2013) A package for survival analysis in S. R package version 2.37-4

  • Thompson WA (1977) On the treatment of grouped observations in life studies. Biometrics 33:463–470

  • Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288

  • Tutz G, Pritscher L (1996) Nonparametric estimation of discrete hazard functions. Lifetime Data Anal 2:291–308

  • van Buuren S, Groothuis-Oudshoorn K (2011) mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67. http://www.jstatsoft.org/v45/i03/

  • van Buuren S, Groothuis-Oudshoorn K (2013) mice: multivariate imputation by chained equations in R. http://CRAN.R-project.org/package=mice, R package version 2.18

  • Van den Berg GJ (2001) Duration models: specification, identification and multiple durations. Handbook Econom 5:3381–3460

  • Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York

  • Vermunt JK (1996) Log-linear event history analysis: a general approach with missing data, latent variables, and unobserved heterogeneity, vol 8. Tilburg University Press, Tilburg

  • Vonesh EF (1996) A note on the use of Laplace’s approximation for nonlinear mixed-effects models. Biometrika 83:447–452

  • Wolfinger R, O’Connell M (1993) Generalized linear mixed models: a pseudolikelihood approach. J Stat Comput Simul 48:233–243

  • Wood S, Scheipl F (2013) gamm4: generalized additive mixed models using mgcv and lme4. http://CRAN.R-project.org/package=gamm4, R package version 0.2-2

  • Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, London

  • Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429

Acknowledgments

This article uses data from the German Family Panel pairfam, coordinated by Josef Brüderl, Johannes Huinink, Bernhard Nauck, and Sabine Walper. Pairfam is funded as a long-term project by the German Research Foundation (DFG). We are also grateful to Jasmin Abedieh for providing the specific discrete survival data, which were constructed from the pairfam data and were part of her master's thesis.

Author information

Corresponding author

Correspondence to Andreas Groll.

Appendices

Appendix 1: Simulation results

Table 4 Results for \(\text{mse}_{0}\) for glmmLasso and alternative approaches (standard errors in brackets) with low censoring rate (\(\pi_{cens}=0.05\))
Table 5 Results for \(\text{mse}_{\boldsymbol{\gamma}}\) for glmmLasso and alternative approaches (standard errors in brackets) with low censoring rate (\(\pi_{cens}=0.05\))
Table 6 Results for average computational times (in minutes) for glmmLasso and alternative approaches with high censoring rate (\(\pi_{cens}=0.2\))

Appendix 2: Binary predictors

Table 7 Results for \(\text{mse}_{0}\) for glmmLasso and alternative approaches (standard errors in brackets) with low censoring rate (\(\pi_{cens}=0.05\))
Table 8 Results for \(\text{mse}_{\boldsymbol{\gamma}}\) for glmmLasso and alternative approaches (standard errors in brackets) with low censoring rate (\(\pi_{cens}=0.05\))
Table 9 Results for \(\text{mse}_{\sigma_b}\) for glmmLasso and alternative approaches (standard errors in brackets) with low censoring rate (\(\pi_{cens}=0.05\))
Table 10 Number of simulation runs in which the fitting procedures did not converge (n.c.), together with false positives (f.p.) and false negatives (f.n.), for glmmLasso and alternative approaches with low censoring rate (\(\pi_{cens}=0.05\))

About this article

Cite this article

Groll, A., Tutz, G. Variable selection in discrete survival models including heterogeneity. Lifetime Data Anal 23, 305–338 (2017). https://doi.org/10.1007/s10985-016-9359-y

  • DOI: https://doi.org/10.1007/s10985-016-9359-y
