An Imputation Method for Estimating the Learning Curve in Classification Problems
The learning curve expresses the error rate of a predictive modeling procedure, when applied to a particular population, as a function of the sample size of the training dataset. It is typically a decreasing function with a positive limiting value (bounded below by the Bayes error rate). An estimate of the learning curve can be used to assess whether a modeling procedure would be expected to become substantially more accurate if additional training data were obtained. Here, we consider an imputation-based procedure for estimating learning curves. We focus on classification, although the idea is applicable to other predictive modeling settings. Simulation studies indicate that useful estimates of learning curves can be obtained for roughly a four-fold increase in the size of the training set relative to the available data, and that the proposed imputation approach outperforms an alternative estimation approach based on parameterizing the learning curve. We illustrate the method with an application that predicts the risk of disease progression for people with chronic lymphocytic leukemia.
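To make the notion of a learning curve concrete, the following minimal sketch estimates one empirically by repeatedly subsampling training sets of increasing size from a fixed pool and scoring each fitted classifier on held-out data. It is not the imputation procedure described in the abstract, and it cannot extrapolate beyond the size of the available pool. The simulated two-class Gaussian data, the nearest-centroid classifier, and all function names are assumptions chosen purely for illustration.

```python
import random
import statistics


def make_data(n, rng):
    """Draw n labelled points (illustrative assumption): class 0 ~ N(0, 1), class 1 ~ N(1.5, 1)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(1.5 * y, 1.0)
        data.append((x, y))
    return data


def fit_centroids(train):
    """Fit a nearest-centroid classifier: per-class means, or None if a class is absent."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in train if y == label]
        means[label] = statistics.fmean(xs) if xs else None
    return means


def predict(means, x):
    """Predict the class whose centroid is closest; fall back to the only class seen."""
    if means[0] is None:
        return 1
    if means[1] is None:
        return 0
    return 0 if abs(x - means[0]) <= abs(x - means[1]) else 1


def error_rate(means, test):
    """Fraction of held-out points that are misclassified."""
    return sum(predict(means, x) != y for x, y in test) / len(test)


def empirical_learning_curve(pool, test, sizes, reps, rng):
    """Average held-out error over `reps` random training subsets of each size."""
    curve = {}
    for n in sizes:
        errs = []
        for _ in range(reps):
            train = rng.sample(pool, n)  # subsample without replacement
            errs.append(error_rate(fit_centroids(train), test))
        curve[n] = statistics.fmean(errs)
    return curve


rng = random.Random(0)
pool = make_data(400, rng)    # the "available" training data
test = make_data(2000, rng)   # held-out data used only for scoring
curve = empirical_learning_curve(pool, test, sizes=[5, 20, 80, 320], reps=200, rng=rng)
for n, err in curve.items():
    print(n, round(err, 3))
```

In this simulated setting the averaged error should fall as the training size grows and flatten out near the Bayes error rate of the assumed distribution, which is the qualitative shape the abstract describes.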
Keywords: Chronic Lymphocytic Leukemia; Generalization Performance; Parametric Bootstrap; Positive Lymph Node Group; Expect Error Rate
Eric Laber acknowledges support from NIH grant P01 CA142538 and DNR grant PR-W-F14AF00171. Kerby Shedden acknowledges support from NSF grant NSF-CDSE-MSS-1316731.