Data-Adaptive Target Parameters

  • Alan E. Hubbard
  • Chris J. Kennedy
  • Mark J. van der Laan
Part of the Springer Series in Statistics book series (SSS)


What factors are most important in predicting coronary heart disease? Heart disease is the leading cause of death in the United States. To address this question we turn to the Framingham Heart Study, which was designed to investigate the health factors associated with coronary heart disease (CHD) at a time when cardiovascular disease was becoming increasingly prevalent. Beginning in 1948, this prospective cohort study monitored a population of 5209 men and women, ages 30–62, in Framingham, Massachusetts. Subjects received extensive medical examinations and lifestyle interviews every 2 years, providing longitudinal measurements that can be compared with outcome status. The data have been analyzed in countless observational studies and have yielded risk score equations that are widely used to assess CHD risk. Here we conduct a comparative analysis with Wilson et al. (1998) using the data-adaptive variable importance approach described in this chapter.


References

  1. L. Auret, C. Aldrich, Empirical comparison of tree ensemble variable importance measures. Chemom. Intell. Lab. Syst. 105(2), 157–170 (2011)
  2. O. Bembom, M.L. Petersen, S.-Y. Rhee, W.J. Fessel, S.E. Sinisi, R.W. Shafer, M.J. van der Laan, Biomarker discovery using targeted maximum likelihood estimation: application to the treatment of antiretroviral resistant HIV infection. Stat. Med. 28, 152–172 (2009)
  3. Y. Benjamini, Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300 (1995)
  4. D.I. Broadhurst, D.B. Kell, Statistical strategies for avoiding false discoveries in metabolomics and related experiments. Metabolomics 2(4), 171–196 (2006)
  5. T. Chen, C. Guestrin, XGBoost: a scalable tree boosting system, in Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, New York, 2016), pp. 785–794
  6. J.H. Friedman, T.J. Hastie, R.J. Tibshirani, glmnet: lasso and elastic-net regularized generalized linear models. R package (2010)
  7. A. Gelman, Y.-S. Su, M. Yajima, J. Hill, M.G. Pittau, J. Kerman, T. Zheng, arm: data analysis using regression and multilevel/hierarchical models. R package (2010)
  8. U. Grömping, Variable importance assessment in regression: linear regression versus random forest. Am. Stat. 63(4) (2009)
  9. S. Gruber, M.J. van der Laan, tmle: an R package for targeted maximum likelihood estimation. J. Stat. Softw. 51(13) (2012a)
  10. A.E. Hubbard, M.J. van der Laan, Mining with inference: data adaptive target parameters, in Handbook of Big Data, Chapman & Hall/CRC Handbooks of Modern Statistical Methods, ed. by P. Bühlmann, P. Drineas, M. Kane, M.J. van der Laan (Chapman & Hall/CRC, Boca Raton, 2016)
  11. A.E. Hubbard, I. Díaz Muñoz, A. Decker, J.B. Holcomb, M.A. Schreiber, E.M. Bulger, K.J. Brasel, E.E. Fox, D.J. del Junco, C.E. Wade et al., Time-dependent prediction and evaluation of variable importance using superlearning in high-dimensional clinical data. J. Trauma-Injury Infect. Crit. Care 75(1), S53–S60 (2013)
  12. A.E. Hubbard, S. Kherad-Pajouh, M.J. van der Laan, Statistical inference for data adaptive target parameters. Int. J. Biostat. 12(1), 3–19 (2016)
  13. J.P. Ioannidis, Why most discovered true associations are inflated. Epidemiology 19(5), 640–648 (2008)
  14. Joint National Committee, The fifth report of the Joint National Committee on Detection, Evaluation, and Treatment of High Blood Pressure (JNC V). Arch. Intern. Med. 153(2), 154–183 (1993)
  15. A. Liaw, M. Wiener, Classification and regression by randomForest. R News 2(3), 18–22 (2002)
  16. A.R. Luedtke, M.J. van der Laan, Statistical inference for the mean outcome under a possibly non-unique optimal treatment strategy. Ann. Stat. 44(2), 713–742 (2016a)
  17. A.R. Luedtke, M.J. van der Laan, Super-learning of an optimal dynamic treatment rule. Int. J. Biostat. 12(1), 305–332 (2016b)
  18. S. Milborrow, T. Hastie, R. Tibshirani, earth: multivariate adaptive regression spline models. R package version 3.2-7 (2014)
  19. T. Mildenberger, Y. Rozenholc, D. Zasada, histogram: construction of regular and irregular histograms with different options for automatic choice of bins. R package (2009)
  20. A. Peters, T. Hothorn, ipred: improved predictors. R package (2009)
  21. R. Pirracchio, M.L. Petersen, M.J. van der Laan, Improving propensity score estimators' robustness to model misspecification using super learner. Am. J. Epidemiol. 181(2), 108–119 (2014)
  22. E.C. Polley, M.J. van der Laan, SuperLearner: super learner prediction. R package (2013)
  23. E.C. Polley, E. LeDell, C. Kennedy, M.J. van der Laan, SuperLearner: super learner prediction. R package (2017)
  24. R Development Core Team, R: A Language and Environment for Statistical Computing (R Foundation for Statistical Computing, Vienna, 2016)
  25. S. Rose, Robust machine learning variable importance analyses of medical conditions for health care spending. Health Serv. Res. (2018, in press)
  26. Y. Rozenholc, T. Mildenberger, U. Gather, Combining regular and irregular histograms by penalized likelihood. Comput. Stat. Data Anal. 54(12), 3313–3323 (2010)
  27. M.J. van der Laan, Statistical inference for variable importance. Int. J. Biostat. 2(1), Article 2 (2006b)
  28. M.J. van der Laan, A.R. Luedtke, Targeted learning of an optimal dynamic treatment, and statistical inference for its mean outcome. Technical Report, Division of Biostatistics, University of California, Berkeley
  29. M.J. van der Laan, K.S. Pollard, Hybrid clustering of gene expression data with visualization and the bootstrap. J. Stat. Plann. Inference 117, 275–303 (2003)
  30. M.J. van der Laan, E.C. Polley, A.E. Hubbard, Super learner. Stat. Appl. Genet. Mol. Biol. 6(1), Article 25 (2007)
  31. M.J. van der Laan, S. Rose, Targeted Learning: Causal Inference for Observational and Experimental Data (Springer, Berlin, Heidelberg, New York, 2011)
  32. H. Wang, S. Rose, M.J. van der Laan, Finding quantitative trait loci genes with collaborative targeted maximum likelihood learning. Stat. Probab. Lett. 81(7), 792–796 (2011a)
  33. H. Wang, S. Rose, M.J. van der Laan, Finding quantitative trait loci genes, in Targeted Learning: Causal Inference for Observational and Experimental Data, ed. by M.J. van der Laan, S. Rose (Springer, Berlin, Heidelberg, New York, 2011b)
  34. H. Wang, Z. Zhang, S. Rose, M.J. van der Laan, A novel targeted learning method for quantitative trait loci mapping. Genetics 198(4), 1369–1376 (2014)
  35. P. Wilson, R.B. D'Agostino, D. Levy, A.M. Belanger, H. Silbershatz, W.B. Kannel, Prediction of coronary heart disease using risk factor categories. Circulation 97(18), 1837–1847 (1998)

Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  • Alan E. Hubbard (1), corresponding author
  • Chris J. Kennedy (1)
  • Mark J. van der Laan (2)

  1. Division of Biostatistics, University of California, Berkeley, Berkeley, USA
  2. Division of Biostatistics and Department of Statistics, University of California, Berkeley, Berkeley, USA