Subsampling bias and the best-discrepancy systematic cross validation

Science China Mathematics

Abstract

Statistical machine learning models should be evaluated and validated before being put to work. The conventional k-fold Monte Carlo cross-validation (MCCV) procedure uses a pseudo-random sequence to partition instances into k subsets, which usually causes subsampling bias, inflates generalization errors, and jeopardizes the reliability and effectiveness of cross-validation. Based on ordered systematic sampling theory in statistics and low-discrepancy sequence theory in number theory, we propose a new k-fold cross-validation procedure that replaces the pseudo-random sequence with a best-discrepancy sequence, which ensures low subsampling bias and leads to more precise expected-prediction-error (EPE) estimates. Experiments with 156 benchmark datasets and three classifiers (logistic regression, decision tree, and naive Bayes) show that, in general, our cross-validation procedure removes the subsampling bias of the MCCV, lowering the EPE by around 7.18% and the variance by around 26.73%. In comparison, the stratified MCCV reduces the EPE and variance of the MCCV by around 1.58% and 11.85%, respectively. Leave-one-out (LOO) lowers the EPE by around 2.50%, but its variance is much higher than that of any other cross-validation (CV) procedure. The computational time of our cross-validation procedure is just 8.64% of that of the MCCV, 8.67% of the stratified MCCV, and 16.72% of the LOO. Experiments also show that our approach is most beneficial for datasets of relatively small size and large aspect ratio, which makes it particularly pertinent to bioscience classification problems. The proposed systematic subsampling technique could be generalized to other machine learning algorithms that involve a random subsampling mechanism.
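The abstract does not spell out which best-discrepancy sequence the procedure uses, so the following is a minimal illustrative sketch only, assuming a Kronecker-type sequence of fractional parts {i·α mod 1} with the irrational α = e; the names bds_fold_assignment and bds_cv_splits are hypothetical, not the authors' code. Instances are assumed to be ordered beforehand, in line with the ordered systematic sampling the abstract invokes; position i is mapped to the point {i·α} in [0, 1), and that point's location determines its fold.

import numpy as np

def bds_fold_assignment(n_samples, k, alpha=np.e):
    # Fractional parts {i * alpha} form a low-discrepancy sequence on [0, 1).
    u = (np.arange(1, n_samples + 1) * alpha) % 1.0
    # A point falling in [j/k, (j+1)/k) is assigned to fold j; the minimum
    # guards against floating-point rounding pushing u*k up to exactly k.
    return np.minimum((u * k).astype(int), k - 1)

def bds_cv_splits(n_samples, k=10, alpha=np.e):
    # Yield (train, test) index pairs, one per fold, as in ordinary k-fold CV.
    folds = bds_fold_assignment(n_samples, k, alpha)
    for j in range(k):
        test = np.flatnonzero(folds == j)
        train = np.flatnonzero(folds != j)
        yield train, test

Because the points {i·α} spread almost uniformly over [0, 1), each fold receives a near-proportional, systematically spaced share of the ordered instances; this is the low-subsampling-bias property the procedure relies on, whereas a pseudo-random assignment can leave some folds unrepresentative by chance.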

Acknowledgements

The first author was supported by the Qilu Youth Scholar Project of Shandong University. The second author was supported by National Natural Science Foundation of China (Grant No. 11531008), the Ministry of Education of China (Grant No. IRT16R43) and the Taishan Scholar Project of Shandong Province. The authors thank the referees for helpful suggestions on an earlier draft of the paper.

Corresponding author

Correspondence to Jianya Liu.

About this article

Cite this article

Guo, L., Liu, J. & Lu, R. Subsampling bias and the best-discrepancy systematic cross validation. Sci. China Math. 64, 197–210 (2021). https://doi.org/10.1007/s11425-018-9561-0
