The Treatment of Missing Values and its Effect on Classifier Accuracy

Acuña, Edgar; Rodriguez, Caroline

doi:10.1007/978-3-642-17103-1_60

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organisation (STUDIES CLASS)

Abstract

Missing values in a dataset can affect the performance of a classifier trained on that dataset. Several methods have been proposed to treat missing data; the most frequently used one, case deletion, discards every instance that has a missing value in at least one feature. In this paper we carry out experiments with twelve datasets to evaluate the effect on the misclassification error rate of four methods for dealing with missing values: case deletion, mean imputation, median imputation, and the KNN imputation procedure. The classifiers considered were Linear Discriminant Analysis (LDA), a parametric classifier, and the KNN classifier, a nonparametric one.
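The four treatments named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `MISSING` marker, the function names, the toy `data` table, and the details of the KNN variant (Euclidean distance over the features observed in both rows, donors restricted to rows where the target feature is observed, imputation by the mean of the k nearest donors) are all assumptions made for illustration.

```python
import math
from statistics import mean, median

MISSING = None  # hypothetical marker for a missing feature value


def case_deletion(rows):
    """Keep only the instances that have no missing feature value."""
    return [row for row in rows if MISSING not in row]


def column_imputation(rows, summary):
    """Replace each missing value with summary() of the observed values in its column.

    Pass statistics.mean for mean imputation, statistics.median for median imputation.
    """
    n_features = len(rows[0])
    fill = []
    for j in range(n_features):
        observed = [row[j] for row in rows if row[j] is not MISSING]
        fill.append(summary(observed))
    return [[fill[j] if v is MISSING else v for j, v in enumerate(row)]
            for row in rows]


def _distance(a, b):
    """Euclidean distance over the features observed in both rows (an assumption)."""
    shared = [(x, y) for x, y in zip(a, b)
              if x is not MISSING and y is not MISSING]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared))


def knn_imputation(rows, k=2):
    """Fill each missing value with the mean of that feature over the k nearest donors.

    A donor is any other row in which the target feature is observed.
    Raises statistics.StatisticsError if a feature has no donor at all.
    """
    completed = []
    for i, row in enumerate(rows):
        new_row = list(row)
        for j, value in enumerate(row):
            if value is not MISSING:
                continue
            donors = [other for m, other in enumerate(rows)
                      if m != i and other[j] is not MISSING]
            donors.sort(key=lambda other: _distance(row, other))
            new_row[j] = mean(d[j] for d in donors[:k])
        completed.append(new_row)
    return completed


# Toy feature table (made up for illustration): the second row lacks feature 1.
data = [
    [1.0, 2.0],
    [2.0, MISSING],
    [3.0, 6.0],
    [10.0, 20.0],
]

deleted = case_deletion(data)                    # drops the incomplete row
mean_filled = column_imputation(data, mean)      # fills with the column mean, 28/3
median_filled = column_imputation(data, median)  # fills with the column median, 6.0
knn_filled = knn_imputation(data, k=2)           # fills with mean of 2.0 and 6.0, i.e. 4.0
```

On this toy table the outlying row `[10.0, 20.0]` pulls the mean-imputed value up to about 9.33, while median and KNN imputation give 6.0 and 4.0. Case deletion avoids any fill-in but shrinks the training sample from four instances to three.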

Author information

Authors and Affiliations

University of Puerto Rico at Mayaguez, Puerto Rico
Edgar Acuña & Caroline Rodriguez

Editor information

Editors and Affiliations

Institute of Statistics and Decision Sciences, Duke University, 27708, Durham, NC, USA
David Banks & Leanna House
Department of Mathematics, Illinois Institute of Technology, 10 West 32nd Street, 60616-3793, Chicago, IL, USA
Frederick R. McMorris
Faculty of Management, Rutgers University, 180 University Avenue, 07102-1895, Newark, NJ, USA
Phipps Arabie
Institute of Decision Theory, University of Karlsruhe, Kaiserstr. 12, 76128, Karlsruhe, Germany
Wolfgang Gaul

About this paper

Cite this paper

Acuña, E., Rodriguez, C. (2004). The Treatment of Missing Values and its Effect on Classifier Accuracy. In: Banks, D., McMorris, F.R., Arabie, P., Gaul, W. (eds) Classification, Clustering, and Data Mining Applications. Studies in Classification, Data Analysis, and Knowledge Organisation. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-17103-1_60

DOI: https://doi.org/10.1007/978-3-642-17103-1_60
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22014-5
Online ISBN: 978-3-642-17103-1
eBook Packages: Springer Book Archive
