Abstract
Several variable selection procedures are available for continuous time-to-event data. However, if time is measured on a discrete scale, so that many ties occur, models for continuous time are inadequate. We propose penalized likelihood methods that perform efficient variable selection in discrete survival modeling while explicitly modeling the heterogeneity in the population. The method combines ridge- and lasso-type penalties that are tailored to the case of discrete survival. Its performance is investigated in simulation studies and in an application to the birth of the first child.
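To illustrate why ties are unproblematic in discrete time, the following minimal sketch expands survival records into person-period format, in which the discrete hazard becomes the success probability of a binary response. It is not the authors' implementation: the function names, the toy data, and the convention that a censored subject contributes zeros up to and including its last observed period are assumptions made for illustration. Penalty terms such as those described above would be added to the binary log-likelihood built on these rows.

```python
def to_person_period(subjects):
    """Expand (id, time, event) records into one row per subject and period at risk.

    `time` is the last discrete period observed (1-based); `event` is 1 if the
    event occurred in that period and 0 if the subject was censored there
    (an assumed convention, other conventions exist).
    """
    rows = []
    for subject_id, time, event in subjects:
        for period in range(1, time + 1):
            # The binary response is 1 only in the final period of an observed event.
            y = 1 if (event == 1 and period == time) else 0
            rows.append((subject_id, period, y))
    return rows


def life_table_hazard(rows):
    """Unpenalized hazard estimate per period: events divided by subjects at risk."""
    at_risk, events = {}, {}
    for _, period, y in rows:
        at_risk[period] = at_risk.get(period, 0) + 1
        events[period] = events.get(period, 0) + y
    return {t: events[t] / at_risk[t] for t in sorted(at_risk)}


# Hypothetical toy data: (id, last observed period, event indicator).
toy = [(1, 3, 1), (2, 2, 0), (3, 1, 1), (4, 3, 0)]
rows = to_person_period(toy)
hazard = life_table_hazard(rows)
```

Several subjects sharing the same event period simply contribute several rows with response 1 for that period, so no tie correction is needed.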
Notes
On a MacBook Pro with a 2.5 GHz Intel Core i5 processor, the methods managed about 10 simulation runs per week.
Multiple imputation techniques implemented in the software R were also used, e.g. those in the packages mi (Gelman et al. 2013) and mice (van Buuren and Groothuis-Oudshoorn 2013). As shown in the work of Abedieh, all imputation techniques led to almost indistinguishable results, so our analysis is based on the data set obtained via the last value carried forward method. For a very helpful description of the MICE technique together with illustrative examples, see van Buuren and Groothuis-Oudshoorn (2011).
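The last value carried forward method mentioned in this note can be sketched as follows. This is a generic illustration under stated assumptions, not the code used to prepare the pairfam data: the function name `locf` and the use of `None` to mark a missing value are hypothetical choices.

```python
def locf(values):
    """Fill missing entries (None) with the most recent observed value.

    Entries before the first observed value stay None, because there is
    nothing to carry forward yet.
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled


# Hypothetical series of one covariate over six panel waves.
waves = [None, 2, None, None, 5, None]
completed = locf(waves)
```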
References
Anderson DA, Aitkin M (1985) Variance component models with binary response: interviewer variability. J R Stat Soc Ser B 47:203–210
Androulakis E, Koukouvinos C, Vonta F (2012) Estimation and variable selection via frailty models with penalized likelihood. Stat Med 31(20):2223–2239
Baker M, Melino A (2000) Duration dependence and nonparametric heterogeneity: a Monte Carlo study. J Econom 96:357–393
Bates D, Maechler M (2010) lme4: linear mixed-effects models using S4 classes. http://CRAN.R-project.org/package=lme4, R package version 0.999999-0
Bradic J, Fan J, Jiang J (2011) Regularization for Cox’s proportional hazards model with NP-dimensionality. Ann Stat 39(6):3092
Breslow NE, Clayton DG (1993) Approximate inference in generalized linear mixed models. J Am Stat Assoc 88:9–25
Breslow NE, Lin X (1995) Bias correction in generalized linear mixed models with a single component of dispersion. Biometrika 82:81–91
Broström G (2009) glmmML: generalized linear models with clustering. http://CRAN.R-project.org/package=glmmML, R package version 0.81-6
Brown C (1975) On the use of indicator variables for studying the time-dependence of parameters in a response-time model. Biometrics 31:863–872
Bühlmann P, Van De Geer S (2011) Statistics for high-dimensional data: methods, theory and applications. Springer, New York
Cox DR (1972) Regression models and life tables (with discussion). J R Stat Soc B 34:187–220
Dezeure R, Bühlmann P, Meier L, Meinshausen N (2014) High-dimensional inference: confidence intervals, p values and R-software hdi. arXiv preprint arXiv:1408.4026
Dierckx P (1993) Curve and surface fitting with splines. Oxford Science Publications, Oxford
Do Ha I, Noh M, Lee Y (2012) frailtyHL: a package for fitting frailty models with h-likelihood. R J 4(2):28–36
Efron B (1988) Logistic regression, survival analysis, and the Kaplan–Meier curve. J Am Stat Assoc 83:414–425
Eilers PHC, Marx BD (1996) Flexible smoothing with B-splines and penalties. Stat Sci 11:89–121
Fahrmeir L (1994) Dynamic modelling and penalized likelihood estimation for discrete time survival data. Biometrika 81:317–330
Fahrmeir L, Kneib T (2011) Bayesian smoothing and regression for longitudinal, spatial and event history data. Cambridge University Press, Cambridge
Fahrmeir L, Knorr-Held L (1997) Dynamic discrete-time duration models: estimation via Markov chain Monte Carlo. Sociol Methodol 27(1):417–452
Fahrmeir L, Tutz G (2001) Multivariate statistical modelling based on generalized linear models. Springer, New York
Fan J, Li R (2002) Variable selection for Cox’s proportional hazards model and frailty model. Ann Stat. pp 74–99
Friedman JH, Hastie T, Tibshirani R (2010) Regularization paths for generalized linear models via coordinate descent. J Stat Softw 33(1):1–22
Gamst A, Donohue M, Xu R (2009) Asymptotic properties and empirical evaluation of the NPMLE in the proportional hazards mixed-effects model. Stat Sin 19(3):997
Gelman A, Hill J, Su Y, Yajima M, Pittau MG (2013) mi: missing data imputation and model checking. http://CRAN.R-project.org/package=mi, R package version 0.09-18.03
Goeman JJ (2010) \(L_1\) penalized estimation in the Cox proportional hazards model. Biom J 52:70–84
Goeman JJ (2011) Penalized. R package version 0.9-42
Groll A (2011) glmmLasso: variable selection for generalized linear mixed models by \(L_1\)-penalized estimation. http://CRAN.R-project.org/package=glmmLasso, R package version 1.2.3
Groll A, Tutz G (2014) Variable selection for generalized linear mixed models by \(L_1\)-penalized estimation. Stat Comput 24(2):137–154
Ham JC, Rea Jr SA (1987) Unemployment insurance and male unemployment duration in Canada. J Labor Econom. pp 325–353
Hartzel J, Liu I, Agresti A (2001) Describing heterogeneous effects in stratified ordinal contingency tables, with applications to multi-center clinical trials. Comput Stat Data Anal 35(4):429–449
Heckman JJ, Singer B (1984) Econometric duration analysis. J Econom 24(1):63–132
Hinde J (1982) Compound Poisson regression models. In: Gilchrist R (ed) GLIM 1982 international conference on generalized linear models. Springer, New York, pp 109–121
Huinink J, Brüderl J, Nauck B, Walper S, Castiglioni L, Feldhaus M (2011) Panel analysis of intimate relationships and family dynamics (pairfam): conceptual framework and design. J Fam Res 23:77–101
Kalbfleisch J, Prentice R (2002) The statistical analysis of failure time data, 2nd edn. Wiley, New York
Kauermann G, Tutz G, Brüderl J (2005) The survival of newly founded firms: a case-study into varying-coefficient models. J R Stat Soc A 168:145–158
Laird N, Olivier D (1981) Covariance analysis of censored survival data using log-linear analysis techniques. J Am Stat Assoc 76(374):231–240
Lancaster T (1990) The econometric analysis of transition data. Cambridge University Press, Cambridge
Land KC, Nagin DS, McCall PL (2001) Discrete-time hazard regression models with hidden heterogeneity: the semiparametric mixed Poisson regression approach. Sociol Methods Res 29(3):342–373
Leeb H, Pötscher BM (2005) Model selection and inference: facts and fiction. Econom Theory 21(01):21–59
Lin X, Breslow NE (1996) Bias correction in generalized linear mixed models with multiple components of dispersion. J Am Stat Assoc 91:1007–1016
Littell R, Milliken G, Stroup W, Wolfinger R (1996) SAS system for mixed models. SAS Institute Inc., Cary
Liu Q, Pierce DA (1994) A note on Gauss–Hermite quadrature. Biometrika 81:624–629
Lockhart R, Taylor J, Tibshirani RJ, Tibshirani R (2014) A significance test for the Lasso. Ann Stat 42(2):413
Möst S, Pößnecker W, Tutz G (2015) Variable selection for discrete competing risks models. Qual Quant. pp 1–22
Nauck B, Brüderl J, Huinink J, Walper S (2013) The German family panel (pairfam). GESIS data archive, Cologne ZA5678 data file version 4.0.0
Nicoletti C, Rondinelli C (2010) The (mis)specification of discrete duration models with unobserved heterogeneity: a Monte Carlo study. J Econom 159(1):1–13
Park MY, Hastie T (2007) An \(L_1\) regularization-path algorithm for generalized linear models. J R Stat Soc B 69:659–677
Pinheiro JC, Bates DM (1995) Approximations to the log-likelihood function in the nonlinear mixed-effects model. J Comput Graph Stat 4:12–35
Pinheiro JC, Bates DM (2000) Mixed-effects models in S and S-plus. Springer, New York
Pötscher BM, Leeb H (2009) On the distribution of penalized maximum likelihood estimators: the LASSO, SCAD, and thresholding. J Multivar Anal 100(9):2065–2082
Prentice RL, Gloeckler LA (1978) Regression analysis of grouped survival data with application to breast cancer data. Biometrics 34:57–67
Rondeau V, Mazroui Y, Gonzalez JR (2012) frailtypack: an R package for the analysis of correlated survival data with frailty models using penalized likelihood estimation or parametrical estimation. J Stat Softw 47(4):1–28
Scheike T, Jensen T (1997) A discrete survival model with random effects: an application to time to pregnancy. Biometrics. pp 318–329
Schwarz G (1978) Estimating the dimension of a model. Ann Stat 6:461–464
Simon N, Friedman J, Hastie T, Tibshirani R (2011) Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw 39(5):1–13
Therneau T, Grambsch P (2000) Modeling survival data: extending the Cox model. Springer, New York
Therneau TM (2013) A package for survival analysis in S. R package version 2.37-4
Thompson WA (1977) On the treatment of grouped observations in life studies. Biometrics 33:463–470
Tibshirani R (1996) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
Tutz G, Pritscher L (1996) Nonparametric estimation of discrete hazard functions. Lifetime Data Anal 2:291–308
van Buuren S, Groothuis-Oudshoorn K (2011) Mice: multivariate imputation by chained equations in R. J Stat Softw 45(3):1–67. http://www.jstatsoft.org/v45/i03/
van Buuren S, Groothuis-Oudshoorn K (2013) Mice: multivariate imputation by chained equations in R. http://CRAN.R-project.org/package=mice, R package version 2.18
Van den Berg GJ (2001) Duration models: specification, identification and multiple durations. Handbook Econom 5:3381–3460
Venables WN, Ripley BD (2002) Modern applied statistics with S, 4th edn. Springer, New York
Vermunt JK (1996) Log-linear event history analysis: a general approach with missing data, latent variables, and unobserved heterogeneity, vol 8. Tilburg University Press, Tilburg
Vonesh EF (1996) A note on the use of Laplace’s approximation for nonlinear mixed-effects models. Biometrika 83:447–452
Wolfinger R, O’Connell M (1993) Generalized linear mixed models: a pseudo-likelihood approach. J Stat Comput Simul 48:233–243
Wood S, Scheipl F (2013) gamm4: generalized additive mixed models using mgcv and lme4. http://CRAN.R-project.org/package=gamm4, R package version 0.2-2
Wood SN (2006) Generalized additive models: an introduction with R. Chapman & Hall/CRC, London
Zou H (2006) The adaptive lasso and its oracle properties. J Am Stat Assoc 101(476):1418–1429
Acknowledgments
This article uses data from the German family panel pairfam, coordinated by Josef Brüderl, Johannes Huinink, Bernhard Nauck, and Sabine Walper. Pairfam is funded as a long-term project by the German Research Foundation (DFG). We are also grateful to Jasmin Abedieh for providing the specific discrete survival data, which were constructed from the pairfam data and were part of her master's thesis.
Appendices
Appendix 1: Simulation results
Appendix 2: Binary predictors
Cite this article
Groll, A., Tutz, G. Variable selection in discrete survival models including heterogeneity. Lifetime Data Anal 23, 305–338 (2017). https://doi.org/10.1007/s10985-016-9359-y
DOI: https://doi.org/10.1007/s10985-016-9359-y