Over-Fitting and Model Tuning

Chapter in Applied Predictive Modeling

Abstract

Many modern classification and regression models are highly adaptable; they are capable of modeling complex relationships. Each model's adaptability is typically governed by a set of tuning parameters, which allow the model to pinpoint predictive patterns and structures within the data. However, these tuning parameters can easily identify predictive patterns that are not reproducible. This is known as “over-fitting.” Models that are over-fit generally have excellent predictivity for the samples on which they were built, but poor predictivity for new samples. Without a methodological approach to building and evaluating models, the modeler will not know whether the model is over-fit until the next set of samples is predicted. In Section 4.1 we use a simple example to illustrate the problem of over-fitting. We then describe a systematic process for tuning models (Section 4.2), which is foundational to the remaining parts of the book. Core to model tuning are appropriate ways of splitting (or spending) the data, which are covered in Section 4.3. Resampling techniques (Section 4.4) are an alternative or complementary approach to data splitting. Recommendations for approaches to data splitting are provided in Section 4.7. After evaluating a number of tuning parameters via data splitting or resampling, we must choose the final tuning parameters (Section 4.6). We also discuss how to choose the optimal model across several tuned models (Section 4.8). We illustrate how to implement the recommended techniques discussed in this chapter in the Computing Section (4.9). Exercises are provided at the end of the chapter to solidify concepts.
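
To make the tuning workflow described above concrete, here is a minimal sketch of the process: hold out a test set, evaluate a grid of candidate tuning-parameter values with repeated cross-validation on the training data, pick the value with the best resampled performance, and only then look at the held-out data. The book's own Computing sections use R and the caret package; this sketch instead assumes Python with scikit-learn, and the simulated data, split proportion, and candidate grid are illustrative assumptions rather than values from the chapter.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, RepeatedStratifiedKFold,
                                     train_test_split)
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Simulated data stands in for a real training set (illustrative only)
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# "Spend" the data: hold out a test set that plays no role in tuning
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Candidate values of the tuning parameter (number of neighbors in k-NN)
param_grid = {"kneighborsclassifier__n_neighbors": list(range(1, 31, 2))}

# Repeated 10-fold cross-validation estimates performance for each candidate
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=0)
search = GridSearchCV(make_pipeline(StandardScaler(), KNeighborsClassifier()),
                      param_grid=param_grid, cv=cv, scoring="accuracy")
search.fit(X_train, y_train)

# The resampled estimate selects the tuning parameter; the held-out test set
# gives an independent check that the chosen model is not over-fit
print("chosen tuning parameter:", search.best_params_)
print("resampled accuracy:", round(search.best_score_, 3))
print("test-set accuracy: ", round(search.score(X_test, y_test), 3))
```

The resampled accuracy drives the choice of the tuning parameter; the test-set accuracy only confirms that the chosen model generalizes, which is the guard against over-fitting that the chapter develops.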

Notes

  1. This model uses a radial basis function kernel, defined in Sect. 13.4. Although not explored here, we used the analytical approach discussed later for determining the kernel parameter and fixed this value for all resampling techniques. (A minimal sketch of fixing a kernel parameter in this way appears after these notes.)

  2. The authors state that there are 95 samples of known oils. However, we count 96 in their Table 1 (pp. 33–35 of the article).

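The "analytical approach" mentioned in Note 1 is described later in the book; as a stand-in, the sketch below uses the common median heuristic to fix the RBF kernel parameter from the data before resampling, so that only the cost parameter is tuned. It assumes Python with scikit-learn (the book itself uses R and kernlab/caret), and the simulated data and cost grid are illustrative, not the chapter's oil data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import GridSearchCV, RepeatedStratifiedKFold
from sklearn.svm import SVC

# Illustrative data; the chapter's oil data is not reproduced here
X, y = make_classification(n_samples=200, n_features=10, random_state=1)

# Median heuristic (an assumption, not necessarily the book's exact method):
# fix gamma in k(x, x') = exp(-gamma * ||x - x'||^2) from pairwise distances
sq_dists = pairwise_distances(X, metric="sqeuclidean")
gamma = 1.0 / np.median(sq_dists[np.triu_indices_from(sq_dists, k=1)])

# With the kernel parameter fixed, only the cost parameter C is resampled
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=5, random_state=1)
search = GridSearchCV(SVC(kernel="rbf", gamma=gamma),
                      param_grid={"C": 2.0 ** np.arange(-2, 8)},
                      cv=cv, scoring="accuracy")
search.fit(X, y)
print("fixed gamma:", round(gamma, 4), "| chosen C:", search.best_params_["C"])
```

Fixing the kernel parameter analytically keeps the tuning grid one-dimensional, so every resampling scheme evaluates the same kernel and the comparison across resampling techniques is not confounded by different kernel settings.
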
References

  • Ambroise C, McLachlan G (2002). “Selection Bias in Gene Extraction on the Basis of Microarray Gene–Expression Data.” Proceedings of the National Academy of Sciences, 99(10), 6562–6566.

  • Boulesteix A, Strobl C (2009). “Optimal Classifier Selection and Negative Bias in Error Rate Estimation: An Empirical Study on High–Dimensional Prediction.” BMC Medical Research Methodology, 9(1), 85.

  • Breiman L, Friedman J, Olshen R, Stone C (1984). Classification and Regression Trees. Chapman and Hall, New York.

  • Brodnjak-Vončina D, Kodba Z, Novič M (2005). “Multivariate Data Analysis in Classification of Vegetable Oils Characterized by the Content of Fatty Acids.” Chemometrics and Intelligent Laboratory Systems, 75(1), 31–43.

  • Caputo B, Sim K, Furesjo F, Smola A (2002). “Appearance–Based Object Recognition Using SVMs: Which Kernel Should I Use?” In “Proceedings of NIPS Workshop on Statistical Methods for Computational Experiments in Visual Processing and Computer Vision.”

  • Clark R (1997). “OptiSim: An Extended Dissimilarity Selection Method for Finding Diverse Representative Subsets.” Journal of Chemical Information and Computer Sciences, 37(6), 1181–1188.

  • Clark T (2004). “Can Out–of–Sample Forecast Comparisons Help Prevent Overfitting?” Journal of Forecasting, 23(2), 115–139.

  • Cohen G, Hilario M, Pellegrini C, Geissbuhler A (2005). “SVM Modeling via a Hybrid Genetic Strategy. A Health Care Application.” In R Engelbrecht, AGC Lovis (eds.), “Connecting Medical Informatics and Bio–Informatics,” pp. 193–198. IOS Press.

  • Defernez M, Kemsley E (1997). “The Use and Misuse of Chemometrics for Treating Classification Problems.” TrAC Trends in Analytical Chemistry, 16(4), 216–221.

  • Dwyer D (2005). “Examples of Overfitting Encountered When Building Private Firm Default Prediction Models.” Technical report, Moody’s KMV.

  • Efron B (1983). “Estimating the Error Rate of a Prediction Rule: Improvement on Cross–Validation.” Journal of the American Statistical Association, pp. 316–331.

  • Efron B, Tibshirani R (1986). “Bootstrap Methods for Standard Errors, Confidence Intervals, and Other Measures of Statistical Accuracy.” Statistical Science, pp. 54–75.

  • Efron B, Tibshirani R (1997). “Improvements on Cross–Validation: The 632+ Bootstrap Method.” Journal of the American Statistical Association, 92(438), 548–560.

  • Eugster M, Hothorn T, Leisch F (2008). “Exploratory and Inferential Analysis of Benchmark Experiments.” Ludwig-Maximilians-Universität München, Department of Statistics, Tech. Rep, 30.

  • Golub G, Heath M, Wahba G (1979). “Generalized Cross–Validation as a Method for Choosing a Good Ridge Parameter.” Technometrics, 21(2), 215–223.

  • Gowen A, Downey G, Esquerre C, O’Donnell C (2010). “Preventing Over–Fitting in PLS Calibration Models of Near-Infrared (NIR) Spectroscopy Data Using Regression Coefficients.” Journal of Chemometrics, 25, 375–381.

  • Hawkins D (2004). “The Problem of Overfitting.” Journal of Chemical Information and Computer Sciences, 44(1), 1–12.

  • Hawkins D, Basak S, Mills D (2003). “Assessing Model Fit by Cross–Validation.” Journal of Chemical Information and Computer Sciences, 43(2), 579–586.

  • Heyman R, Slep A (2001). “The Hazards of Predicting Divorce Without Cross-validation.” Journal of Marriage and the Family, 63(2), 473.

  • Hothorn T, Leisch F, Zeileis A, Hornik K (2005). “The Design and Analysis of Benchmark Experiments.” Journal of Computational and Graphical Statistics, 14(3), 675–699.

  • Hsieh W, Tang B (1998). “Applying Neural Network Models to Prediction and Data Analysis in Meteorology and Oceanography.” Bulletin of the American Meteorological Society, 79(9), 1855–1870.

  • Kim JH (2009). “Estimating Classification Error Rate: Repeated Cross–Validation, Repeated Hold–Out and Bootstrap.” Computational Statistics & Data Analysis, 53(11), 3735–3745.

  • Kohavi R (1995). “A Study of Cross–Validation and Bootstrap for Accuracy Estimation and Model Selection.” International Joint Conference on Artificial Intelligence, 14, 1137–1145.

  • Martin J, Hirschberg D (1996). “Small Sample Statistics for Classification Error Rates I: Error Rate Measurements.” Department of Informatics and Computer Science Technical Report.

  • Martin T, Harten P, Young D, Muratov E, Golbraikh A, Zhu H, Tropsha A (2012). “Does Rational Selection of Training and Test Sets Improve the Outcome of QSAR Modeling?” Journal of Chemical Information and Modeling, 52(10), 2570–2578.

  • Mitchell M (1998). An Introduction to Genetic Algorithms. MIT Press.

  • Molinaro A (2005). “Prediction Error Estimation: A Comparison of Resampling Methods.” Bioinformatics, 21(15), 3301–3307.

  • Olsson D, Nelson L (1975). “The Nelder–Mead Simplex Procedure for Function Minimization.” Technometrics, 17(1), 45–51.

  • Simon R, Radmacher M, Dobbin K, McShane L (2003). “Pitfalls in the Use of DNA Microarray Data for Diagnostic and Prognostic Classification.” Journal of the National Cancer Institute, 95(1), 14–18.

  • Steyerberg E (2010). Clinical Prediction Models: A Practical Approach to Development, Validation, and Updating. Springer, softcover reprint of the 2009 edition.

  • Varma S, Simon R (2006). “Bias in Error Estimation When Using Cross–Validation for Model Selection.” BMC Bioinformatics, 7(1), 91.

  • Willett P (1999). “Dissimilarity–Based Algorithms for Selecting Structurally Diverse Sets of Compounds.” Journal of Computational Biology, 6(3), 447–457.

Copyright information

© 2013 Springer Science+Business Media New York

Cite this chapter

Kuhn, M., Johnson, K. (2013). Over-Fitting and Model Tuning. In: Applied Predictive Modeling. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-6849-3_4
