
Does data splitting improve prediction?

Statistics and Computing

Abstract

Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation, but we use this part for estimating the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values. We judge the predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that a split data analysis is preferred to a full data analysis for prediction, with some exceptions.
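The three strategies named in the abstract (full data, split data, and the hybrid estimator) can be illustrated with a small sketch. Everything specific here is an assumption made for illustration and is not taken from the article: the two candidate Gaussian models, selection by AIC, the plug-in predictive density, the even split, and the function names. The log score is computed as the mean log predictive density on fresh draws, so larger values are better under this sign convention.

```python
import math
import random


def fit_gaussian(ys, with_mean):
    """ML fit of N(mu, var); mu is fixed at 0 when with_mean is False."""
    mu = sum(ys) / len(ys) if with_mean else 0.0
    var = sum((y - mu) ** 2 for y in ys) / len(ys)
    return mu, max(var, 1e-12)  # guard against a zero variance


def log_density(y, mu, var):
    """Log of the N(mu, var) density at y."""
    return -0.5 * math.log(2 * math.pi * var) - (y - mu) ** 2 / (2 * var)


def aic(ys, with_mean):
    """AIC of the fitted model on ys (illustrative selection criterion)."""
    mu, var = fit_gaussian(ys, with_mean)
    loglik = sum(log_density(y, mu, var) for y in ys)
    n_params = 2 if with_mean else 1
    return 2 * n_params - 2 * loglik


def predictive(ys_select, ys_estimate):
    """Select a model on ys_select, then estimate its parameters on ys_estimate."""
    with_mean = aic(ys_select, True) < aic(ys_select, False)
    return fit_gaussian(ys_estimate, with_mean)


def log_score(params, future):
    """Mean log predictive density over future observations."""
    mu, var = params
    return sum(log_density(y, mu, var) for y in future) / len(future)


def compare_strategies(n=40, true_mean=0.3, n_future=2000, seed=1):
    """Score the three strategies on one simulated data set (hypothetical setup)."""
    rng = random.Random(seed)
    data = [rng.gauss(true_mean, 1.0) for _ in range(n)]
    future = [rng.gauss(true_mean, 1.0) for _ in range(n_future)]
    first, second = data[: n // 2], data[n // 2:]
    return {
        "full": log_score(predictive(data, data), future),     # select and estimate on all data
        "split": log_score(predictive(first, second), future),  # select on one part, estimate on the other
        "hybrid": log_score(predictive(first, data), future),   # select on one part, estimate on both
    }


if __name__ == "__main__":
    for name, score in compare_strategies().items():
        print(f"{name}: {score:.4f}")
```

A single simulated data set says little about which strategy is better; a comparison along the lines the abstract describes would average the scores over many replications and scenarios.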



Author information

Corresponding author

Correspondence to Julian J. Faraway.


About this article


Cite this article

Faraway, J.J. Does data splitting improve prediction?. Stat Comput 26, 49–60 (2016). https://doi.org/10.1007/s11222-014-9522-9


  • DOI: https://doi.org/10.1007/s11222-014-9522-9
