An Imputation Method for Estimating the Learning Curve in Classification Problems

  • Eric B. Laber
  • Kerby Shedden
  • Yang Yang
Conference paper
Part of the Abel Symposia book series (ABEL, volume 11)

Abstract

The learning curve expresses the error rate of a predictive modeling procedure, when applied to a particular population, as a function of the sample size of the training dataset. It is typically a decreasing function with a positive limiting value (bounded below by the Bayes error rate). An estimate of the learning curve can be used to assess whether a modeling procedure is expected to become substantially more accurate if additional training data were obtained. Here, we consider an imputation-based procedure for estimating learning curves. We focus on classification, although the idea is applicable to other predictive modeling settings. Simulation studies indicate that useful estimates of learning curves can be obtained for roughly a four-fold increase in the size of the training set relative to the available data, and that the proposed imputation approach outperforms an alternative estimation approach based on parameterizing the learning curve. We illustrate the method with an application that predicts the risk of disease progression for people with chronic lymphocytic leukemia.
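The paper's imputation estimator is not reproduced here, but the object being estimated can be illustrated. The sketch below (my own construction, not the authors' method) traces an empirical learning curve by Monte Carlo: repeatedly training a simple nearest-centroid classifier on simulated datasets of increasing size and recording the test error at each size. The simulated population, the classifier, and all parameter choices (dimension, class separation, replication counts) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate(n, d=5, delta=1.5):
    """Two Gaussian classes in d dimensions, separated by delta in coordinate 0."""
    y = np.tile([0, 1], n // 2 + 1)[:n]          # roughly balanced labels
    X = rng.standard_normal((n, d))
    X[:, 0] += delta * y                          # shift class 1 along the first axis
    return X, y

def nearest_centroid_error(X_tr, y_tr, X_te, y_te):
    """Train a nearest-centroid classifier; return its test error rate."""
    m0 = X_tr[y_tr == 0].mean(axis=0)
    m1 = X_tr[y_tr == 1].mean(axis=0)
    d0 = ((X_te - m0) ** 2).sum(axis=1)
    d1 = ((X_te - m1) ** 2).sum(axis=1)
    yhat = (d1 < d0).astype(int)
    return float((yhat != y_te).mean())

def learning_curve(sizes, reps=50, n_test=2000):
    """Estimate the expected error rate at each training-set size."""
    X_te, y_te = simulate(n_test)                 # one large shared test set
    curve = []
    for n in sizes:
        errs = [nearest_centroid_error(*simulate(n), X_te, y_te)
                for _ in range(reps)]
        curve.append(float(np.mean(errs)))
    return curve

sizes = [10, 40, 160, 640]
curve = learning_curve(sizes)
```

The resulting `curve` decreases toward a positive limit (here the Bayes error rate is about 0.23), exhibiting the shape the paper's estimator targets; the practical difficulty the paper addresses is that the curve must be estimated beyond the sizes for which real training data exist, where this direct subsampling approach is unavailable.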

Keywords

Chronic Lymphocytic Leukemia · Generalization Performance · Parametric Bootstrap · Positive Lymph Node Group · Expect Error Rate
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Notes

Acknowledgements

Eric Laber acknowledges support from NIH grant P01 CA142538 and DNR grant PR-W-F14AF00171. Kerby Shedden acknowledges support from NSF grant NSF-CDSE-MSS-1316731.


Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. NC State University, Raleigh, USA
  2. University of Michigan, Ann Arbor, USA
  3. LinkedIn, Mountain View, USA