Performance-Estimation Properties of Cross-Validation-Based Protocols with Simultaneous Hyper-Parameter Optimization

  • Conference paper
Artificial Intelligence: Methods and Applications (SETN 2014)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8445)

Abstract

In a typical supervised data analysis task, one needs to perform two tasks: (a) select the best combination of learning methods (e.g., for variable selection and classification) and tune their hyper-parameters (e.g., K in K-NN), a step also called model selection, and (b) provide an estimate of the performance of the final, reported model. Combining the two tasks is not trivial: when one selects the set of hyper-parameters that appears to provide the best estimated performance, that estimate is optimistic (biased, overfitted) because multiple statistical comparisons have been performed. In this paper, we confirm that simple Cross-Validation with model selection is indeed optimistic (it overestimates performance) in small-sample scenarios. In comparison, Nested Cross-Validation and the method by Tibshirani and Tibshirani provide conservative estimates, with the latter protocol being more computationally efficient. The role of stratification of samples is also examined, and stratification is shown to be beneficial.
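The three protocols named in the abstract can be illustrated with a minimal sketch. This is not the authors' code or experimental setup: the K-NN learner, the pure-noise dataset (40 samples, 5 features, balanced labels), the candidate values of K, and all function names are assumptions chosen for illustration. The Tibshirani and Tibshirani correction is written here as the mean, over folds, of the gap between the selected configuration's fold error and the best fold error, which is our reading of reference 5. On noise data every classifier has a true error of 0.5, so an estimate below 0.5 reflects selection optimism.

```python
import random
from collections import Counter, defaultdict


def mean(values):
    return sum(values) / len(values)


def stratified_folds(labels, n_folds, rng):
    """Deal indices into folds class by class so class ratios stay balanced."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    folds = [[] for _ in range(n_folds)]
    pos = 0
    for y in sorted(by_class):
        rng.shuffle(by_class[y])
        for i in by_class[y]:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds


def knn_predict(train, x, k):
    """Majority vote among the k nearest training points (squared Euclidean)."""
    nearest = sorted(train, key=lambda d: sum((a - b) ** 2 for a, b in zip(d[0], x)))[:k]
    return Counter(y for _, y in nearest).most_common(1)[0][0]


def fold_error(data, fold, k):
    """Error rate on one held-out fold after training on the rest."""
    held = set(fold)
    train = [d for i, d in enumerate(data) if i not in held]
    wrong = sum(knn_predict(train, data[i][0], k) != data[i][1] for i in fold)
    return wrong / len(fold)


def cv_select(data, ks, n_folds, rng):
    """Plain CV with model selection: returns (best K, per-fold errors by K)."""
    folds = stratified_folds([y for _, y in data], n_folds, rng)
    errs = {k: [fold_error(data, f, k) for f in folds] for k in ks}
    best = min(ks, key=lambda k: mean(errs[k]))
    return best, errs


def tt_estimate(best, errs):
    """CV error of the winner plus the Tibshirani-Tibshirani bias estimate."""
    n_folds = len(errs[best])
    bias = mean([errs[best][f] - min(e[f] for e in errs.values()) for f in range(n_folds)])
    return mean(errs[best]) + bias


def nested_cv(data, ks, n_folds, rng):
    """Outer folds estimate error; selection is redone inside each training set."""
    folds = stratified_folds([y for _, y in data], n_folds, rng)
    outer = []
    for fold in folds:
        held = set(fold)
        train = [d for i, d in enumerate(data) if i not in held]
        best, _ = cv_select(train, ks, n_folds - 1, rng)
        outer.append(fold_error(data, fold, best))
    return mean(outer)


# Hypothetical pure-noise data: features carry no information about the label.
rng = random.Random(0)
data = [([rng.gauss(0, 1) for _ in range(5)], i % 2) for i in range(40)]
ks = [1, 3, 5, 7, 9]

best_k, errs = cv_select(data, ks, 5, rng)
cvm_err = mean(errs[best_k])          # optimistic: minimum over all candidates
tt_err = tt_estimate(best_k, errs)    # reuses the same folds, no extra training
ncv_err = nested_cv(data, ks, 5, rng) # retrains inside every outer fold
print(f"best K={best_k}  CVM={cvm_err:.3f}  TT={tt_err:.3f}  NCV={ncv_err:.3f}")
```

The sketch also shows why the latter protocol is cheaper: `tt_estimate` only reuses the per-fold errors already computed during selection, whereas `nested_cv` repeats the whole selection procedure once per outer fold.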

References

  1. Anguita, D., Ghio, A., Oneto, L., Ridella, S.: In-Sample and Out-of-Sample Model Selection and Error Estimation for Support Vector Machines. IEEE Trans. Neural Networks Learn. Syst. 23, 1390–1406 (2012)

  2. Cawley, G.C., Talbot, N.L.C.: On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. J. Mach. Learn. Res. 11, 2079–2107 (2010)

  3. Jensen, D.D., Cohen, P.R.: Multiple comparisons in induction algorithms. Mach. Learn. 38, 309–338 (2000)

  4. Kohavi, R.: A Study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In: International Joint Conference on Artificial Intelligence, pp. 1137–1143 (1995)

  5. Tibshirani, R.J., Tibshirani, R.: A bias correction for the minimum error rate in cross-validation. Ann. Appl. Stat. 3, 822–829 (2009)

  6. Statnikov, A., Aliferis, C.F., Tsamardinos, I., Hardin, D., Levy, S.: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics 21, 631–643 (2005)

  7. Duda, R.O., Hart, P.E., Stork, D.G.: Pattern Classification, 2nd edn. (2000)

  8. Mitchell, T.M.: Machine Learning (1997)

  9. Bishop, C.M.: Pattern Recognition and Machine Learning (Information Science and Statistics) (2006)

  10. Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning, 2nd edn. (2009)

  11. Bengio, Y., Grandvalet, Y.: Bias in Estimating the Variance of K-Fold Cross-Validation. Statistical Modeling and Analysis for Complex Data Problems, 75–95 (2005)

  12. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn. Morgan Kaufmann Series in Data Management Systems (2005)

  13. Lagani, V., Tsamardinos, I.: Structure-based variable selection for survival data. Bioinformatics 26, 1887–1894 (2010)

  14. Statnikov, A., Tsamardinos, I., Dosbayev, Y., Aliferis, C.F.: GEMS: A system for automated cancer diagnosis and biomarker discovery from microarray gene expression data. Int. J. Med. Inform. 74, 491–503 (2005)

  15. Salzberg, S.: On Comparing Classifiers: Pitfalls to Avoid and a Recommended Approach. Data Min. Knowl. Discov. 1, 317–328 (1997)

  16. Iizuka, N., Oka, M., Yamada-Okabe, H., Nishida, M., Maeda, Y., Mori, N., Takao, T., Tamesa, T., Tangoku, A., Tabuchi, H., Hamada, K., Nakayama, H., Ishitsuka, H., Miyamoto, T., Hirabayashi, A., Uchimura, S., Hamamoto, Y.: Oligonucleotide microarray for prediction of early intrahepatic recurrence of hepatocellular carcinoma after curative resection. Lancet 361, 923–929 (2003)

  17. Kurgan, L.A., Cios, K.J., Tadeusiewicz, R., Ogiela, M., Goodenday, L.S.: Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif. Intell. Med. 23, 149–169 (2001)

  18. Bock, R.K., Chilingarian, A., Gaug, M., Hakl, F., Hengstebeck, T., Jiřina, M., Klaschka, J., Kotrč, E., Savický, P., Towers, S., Vaiciulis, A., Wittek, W.: Methods for multidimensional event classification: a case study using images from a Cherenkov gamma-ray telescope. Nucl. Instruments Methods Phys. Res. Sect. A Accel. Spectrometers, Detect. Assoc. Equip. 516, 511–528 (2004)

  19. Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V.: Quantitative structure-activity relationship models for ready biodegradability of chemicals. J. Chem. Inf. Model. 53, 867–878 (2013)

  20. Moro, S., Laureano, R.M.S.: Using Data Mining for Bank Direct Marketing: An application of the CRISP-DM methodology. In: Eur. Simul. Model. Conf., pp. 117–121 (2011)

  21. Bendall, S.C., Simonds, E.F., Qiu, P., Amir, E.D., Krutzik, P.O., Finck, R., Bruggner, R.V., Melamed, R., Trejo, A., Ornatsky, O.I., Balderas, R.S., Plevritis, S.K., Sachs, K., Pe’er, D., Tanner, S.D., Nolan, G.P.: Single-cell mass cytometry of differential immune and drug responses across a human hematopoietic continuum. Science 332, 687–696 (2011)

  22. Fawcett, T.: An introduction to ROC analysis (2006)

  23. Coppersmith, D., Hong, S.J., Hosking, J.R.M.: Partitioning Nominal Attributes in Decision Trees. Data Min. Knowl. Discov. 3, 197–217 (1999)

  24. Chang, C.-C., Lin, C.-J.: LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2, 1–39 (2011)

  25. O’Brien, R.G.: A General ANOVA Method for Robust Tests of Additive Models for Variances. J. Am. Stat. Assoc. 74, 877–880 (1979)

  26. Quenouille, M.H.: Approximate tests of correlation in time-series 3 (1949)

  27. Varma, S., Simon, R.: Bias in error estimation when using cross-validation for model selection. BMC Bioinformatics 7, 91 (2006)

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Tsamardinos, I., Rakhshani, A., Lagani, V. (2014). Performance-Estimation Properties of Cross-Validation-Based Protocols with Simultaneous Hyper-Parameter Optimization. In: Likas, A., Blekas, K., Kalles, D. (eds) Artificial Intelligence: Methods and Applications. SETN 2014. Lecture Notes in Computer Science, vol 8445. Springer, Cham. https://doi.org/10.1007/978-3-319-07064-3_1

  • DOI: https://doi.org/10.1007/978-3-319-07064-3_1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-07063-6

  • Online ISBN: 978-3-319-07064-3

  • eBook Packages: Computer Science, Computer Science (R0)
