Subsampling bias and the best-discrepancy systematic cross validation

Science China Mathematics

Abstract

Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors, and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree, and naive Bayes) show that, in general, our cross-validation procedure removes the subsampling bias of the MCCV, lowering the EPE by around 7.18% and the variance by around 26.73%. In comparison, the stratified MCCV reduces the EPE and variance of the MCCV by around 1.58% and 11.85%, respectively. Leave-one-out (LOO) lowers the EPE by around 2.50%, but its variance is much higher than that of any other cross-validation (CV) procedure. The computational time of our cross-validation procedure is just 8.64% of that of the MCCV, 8.67% of the stratified MCCV, and 16.72% of the LOO. Experiments also show that our approach is most beneficial for datasets of relatively small size and large aspect ratio, which makes it particularly pertinent to bioscience classification problems. The proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.
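The abstract does not spell out which best-discrepancy sequence the procedure uses, so the following is a minimal illustrative sketch only, assuming a Kronecker-type sequence of fractional parts {i·α mod 1} with the irrational α = e; the names bds_fold_assignment and bds_cv_splits are hypothetical, not the authors' code. Instances are assumed to be ordered beforehand, in line with the ordered systematic sampling the abstract invokes; position i is mapped to the point {i·α} in [0, 1), and that point's location determines its fold.

import numpy as np

def bds_fold_assignment(n_samples, k, alpha=np.e):
    # Fractional parts {i * alpha} form a low-discrepancy sequence on [0, 1).
    u = (np.arange(1, n_samples + 1) * alpha) % 1.0
    # A point falling in [j/k, (j+1)/k) is assigned to fold j; the minimum
    # guards against floating-point rounding pushing u*k up to exactly k.
    return np.minimum((u * k).astype(int), k - 1)

def bds_cv_splits(n_samples, k=10, alpha=np.e):
    # Yield (train, test) index pairs, one per fold, as in ordinary k-fold CV.
    folds = bds_fold_assignment(n_samples, k, alpha)
    for j in range(k):
        test = np.flatnonzero(folds == j)
        train = np.flatnonzero(folds != j)
        yield train, test

Because the points {i·α} spread almost uniformly over [0, 1), each fold receives a near-proportional, systematically spaced share of the ordered instances; this is the low-subsampling-bias property the procedure relies on, whereas a pseudo-random assignment can leave some folds unrepresentative by chance.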

Acknowledgements

The first author was supported by the Qilu Youth Scholar Project of Shandong University. The second author was supported by National Natural Science Foundation of China (Grant No. 11531008), the Ministry of Education of China (Grant No. IRT16R43) and the Taishan Scholar Project of Shandong Province. The authors thank the referees for helpful suggestions on an earlier draft of the paper.

Corresponding author

Correspondence to Jianya Liu.

About this article

Cite this article

Guo, L., Liu, J. & Lu, R. Subsampling bias and the best-discrepancy systematic cross validation. Sci. China Math. 64, 197–210 (2021). https://doi.org/10.1007/s11425-018-9561-0
