Does data splitting improve prediction?
Data splitting divides data into two parts. One part is reserved for model selection. In some applications, the second part is used for model validation, but we use it to estimate the parameters of the chosen model. We focus on the problem of constructing reliable predictive distributions for future observed values, and we judge predictive performance using log scoring. We compare the full data strategy with the data splitting strategy for prediction. We show how the full data score can be decomposed into model selection, parameter estimation and data reuse costs. Data splitting is preferred when data reuse costs are high. We investigate the relative performance of the strategies in four simulation scenarios. We introduce a hybrid estimator that uses one part for model selection but both parts for estimation. We argue that, with some exceptions, a split data analysis is preferred to a full data analysis for prediction.
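The three strategies described above (full data, split data, hybrid) can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions, not the procedure used in the paper: the candidate models are nested Gaussian linear models, selection is by AIC, the predictive distribution is a plug-in normal, and the log score is reported as the mean negative log predictive density (lower is better). All function names, the simulated data, and the 50/50 split are hypothetical choices made for illustration.

```python
import numpy as np

def fit_gaussian_lm(X, y):
    """Least-squares fit; returns coefficients and the ML estimate of the error s.d."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return beta, np.sqrt(np.mean(resid ** 2))

def aic(X, y):
    """AIC of a Gaussian linear model (assumed selection criterion)."""
    n, p = X.shape
    _, sigma = fit_gaussian_lm(X, y)
    loglik = -0.5 * n * (np.log(2 * np.pi * sigma ** 2) + 1)
    return -2 * loglik + 2 * (p + 1)

def select_model(X, y, candidates):
    """Pick the candidate column subset with the smallest AIC."""
    return min(candidates, key=lambda cols: aic(X[:, cols], y))

def mean_log_score(X_new, y_new, cols, beta, sigma):
    """Mean negative log density of the plug-in normal predictive (lower is better)."""
    mu = X_new[:, cols] @ beta
    return np.mean(0.5 * np.log(2 * np.pi * sigma ** 2)
                   + (y_new - mu) ** 2 / (2 * sigma ** 2))

def compare_strategies(X, y, X_new, y_new, candidates, rng):
    n = len(y)
    perm = rng.permutation(n)
    part1, part2 = perm[: n // 2], perm[n // 2:]
    scores = {}

    # Full data: select and estimate on all observations.
    cols = select_model(X, y, candidates)
    beta, sigma = fit_gaussian_lm(X[:, cols], y)
    scores["full"] = mean_log_score(X_new, y_new, cols, beta, sigma)

    # Split data: select on part 1, estimate on part 2 only.
    cols = select_model(X[part1], y[part1], candidates)
    beta, sigma = fit_gaussian_lm(X[part2][:, cols], y[part2])
    scores["split"] = mean_log_score(X_new, y_new, cols, beta, sigma)

    # Hybrid: select on part 1, estimate on both parts.
    beta, sigma = fit_gaussian_lm(X[:, cols], y)
    scores["hybrid"] = mean_log_score(X_new, y_new, cols, beta, sigma)
    return scores

rng = np.random.default_rng(0)

def simulate(m):
    # Hypothetical data: intercept plus three predictors, only the first matters.
    X = np.column_stack([np.ones(m), rng.normal(size=(m, 3))])
    y = 1.0 + 0.5 * X[:, 1] + rng.normal(size=m)
    return X, y

X, y = simulate(60)
X_new, y_new = simulate(1000)  # stand-in for future observations
candidates = [[0], [0, 1], [0, 1, 2], [0, 1, 2, 3]]
scores = compare_strategies(X, y, X_new, y_new, candidates, rng)
print(scores)
```

A single run like this says little about which strategy is better. A fair comparison would repeat the simulation many times and average the scores, and the ranking may change with the sample size, the candidate set, and the selection criterion.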
Keywords: Cross-validation, Model assessment, Model uncertainty, Model validation, Prediction, Scoring