Analytical and Bioanalytical Chemistry, Volume 410, Issue 23, pp 5981–5992

Overoptimism in cross-validation when using partial least squares-discriminant analysis for omics data: a systematic study

  • Raquel Rodríguez-Pérez
  • Luis Fernández
  • Santiago Marco
Research Paper


Advances in analytical instrumentation make it possible to examine thousands of genes, peptides, or metabolites in parallel. However, the costly and time-consuming data acquisition process causes a generalized lack of samples. From a data analysis perspective, omics data are therefore characterized by high dimensionality and small sample counts. In many scenarios, the analytical aim is to differentiate between two conditions or classes by combining an analytical method with a tailored qualitative predictive model built from the available examples collected in a dataset. For this purpose, partial least squares-discriminant analysis (PLS-DA) is frequently employed in omics research. Recently, there has been growing concern about the uncritical use of this method, since it is prone to overfitting and may aggravate problems of false discoveries. In many applications involving a small number of subjects or samples, predictive model performance is estimated only from cross-validation (CV) results, with a strong preference for leave-one-out (LOO). The combination of PLS-DA applied to high-dimensional data under small-sample conditions with a weak validation methodology is a recipe for unreliable estimates of model performance. In this work, we present a systematic study of the impact of dataset size, dimensionality, and the CV technique used on PLS-DA overoptimism when performance is estimated by cross-validation. First, using synthetic data drawn from the same probability distribution and assigned random binary labels, we obtained a dataset whose true classification rate (CR) is 50%. As expected, our results confirm that internal validation provides overoptimistic estimates of classification accuracy (i.e., overfitting). We have characterized the bias and variance of the CR estimator as a function of the internal CV technique and the sample-to-dimensionality ratio.
In small-sample conditions, the large bias and variance of the estimator make the occurrence of extremely good CRs common. We found that overfitting peaks when the sample size in the training subset approaches the feature vector dimensionality minus one. In these conditions, the models are neither under- nor overdetermined and admit a unique solution. This effect is particularly intense for LOO and peaks highest in small-sample conditions. Beyond this point, overoptimism decreases because the abundance of noisy features produces a regularization effect that leads to less complex models. In terms of overfitting, our study ranks CV methods as follows: Bootstrap produces the most accurate estimator of the CR, followed by bootstrapped Latin partitions, random subsampling, K-fold, and finally, the very popular LOO, which provides the worst results. The simulation results are further confirmed on real datasets from mass spectrometry and microarrays.


Keywords: Metabolomics · Mass spectrometry · Microarrays · Chemometrics · Data analysis · Classification · Method validation


Authors’ contributions

RR wrote the software, analyzed the data, and prepared the figures and text. LF supervised the code of RR and provided useful insights. SM conceived the study and supervised the work. RR and SM contributed to writing the manuscript. All authors read and approved the final manuscript.

Funding information

This work was partially funded by the Spanish MINECO program, under grants TEC2011-26143 (SMART-IMS) and TEC2014-59229-R (SIGVOL). The Signal and Information Processing for Sensor Systems group is a consolidated Grup de Recerca de la Generalitat de Catalunya and has support from the Departament d’Universitats, Recerca i Societat de la Informació de la Generalitat de Catalunya (expedient 2017 SGR 1721). This work has received support from the Comissionat per a Universitats i Recerca del DIUE de la Generalitat de Catalunya and the European Social Fund (ESF). Additional financial support has been provided by the Institut de Bioenginyeria de Catalunya (IBEC). IBEC is a member of the CERCA Programme/Generalitat de Catalunya.

Compliance with ethical standards

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Availability of data and material

The microarray dataset analyzed during the current study is publicly available at

Competing interests

The authors declare that they have no competing interests.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2018

Authors and Affiliations

  1. Signal and Information Processing for Sensing Systems, Institute for Bioengineering of Catalonia, The Barcelona Institute for Science and Technology, Barcelona, Spain
  2. Department of Electronics and Biomedical Engineering, University of Barcelona, Barcelona, Spain
