Statistics and Computing, Volume 26, Issue 1–2, pp 49–60

Does data splitting improve prediction?



Data splitting divides data into two parts. One part is reserved for model selection. In some applications the second part is used for model validation, but here we use it to estimate the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values, and we judge predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that, with some exceptions, a split data analysis is preferred to a full data analysis for prediction.
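The three strategies described above can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the simulation design used in the paper: the Gaussian linear model, the nested candidate models, AIC as the selection rule, the plug-in normal predictive distribution, the negative-log-density convention for the log score (lower is better), and all function names (`fit_ols`, `select_model`, `log_score`, `compare_strategies`) are choices made here for illustration.

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares fit; returns coefficients and the ML estimate of the error SD."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma = np.sqrt(np.mean(resid ** 2))
    return beta, sigma

def aic(X, y):
    """Gaussian AIC up to an additive constant (assumed selection criterion)."""
    n, p = X.shape
    _, sigma = fit_ols(X, y)
    return n * np.log(sigma ** 2) + 2 * (p + 1)

def select_model(X, y, candidates):
    """Return the candidate column subset with the smallest AIC."""
    return min(candidates, key=lambda cols: aic(X[:, cols], y))

def log_score(beta, sigma, x_new, y_new):
    """Negative log density of y_new under a plug-in normal predictive (lower is better)."""
    mu = x_new @ beta
    return 0.5 * np.log(2 * np.pi * sigma ** 2) + (y_new - mu) ** 2 / (2 * sigma ** 2)

def compare_strategies(n=40, n_rep=500, seed=0):
    """Average log score of the full, split and hybrid strategies over n_rep data sets."""
    rng = np.random.default_rng(seed)
    true_beta = np.array([1.0, 0.5, 0.0, 0.0])          # hypothetical true model
    candidates = [list(range(k)) for k in range(1, 5)]  # nested candidate models
    totals = {"full": 0.0, "split": 0.0, "hybrid": 0.0}
    half = n // 2
    for _ in range(n_rep):
        X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
        y = X @ true_beta + rng.normal(size=n)
        x_new = np.concatenate([[1.0], rng.normal(size=3)])
        y_new = x_new @ true_beta + rng.normal()
        cols_full = select_model(X, y, candidates)                # select on all data
        cols_half = select_model(X[:half], y[:half], candidates)  # select on part one
        plans = {
            "full": (cols_full, X, y),                # select and estimate on all data
            "split": (cols_half, X[half:], y[half:]), # estimate on part two only
            "hybrid": (cols_half, X, y),              # estimate on both parts
        }
        for name, (cols, X_est, y_est) in plans.items():
            beta, sigma = fit_ols(X_est[:, cols], y_est)
            totals[name] += log_score(beta, sigma, x_new[cols], y_new)
    return {name: total / n_rep for name, total in totals.items()}

if __name__ == "__main__":
    print(compare_strategies())
```

Calling `compare_strategies()` returns one average score per strategy. Which strategy scores best in this toy setting depends on the assumed sample size, signal strength and candidate set, so the output should not be read as reproducing the paper's findings.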


Cross-validation, Model assessment, Model uncertainty, Model validation, Prediction, Scoring



Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  1. Department of Mathematical Sciences, University of Bath, Bath, UK
