Dropping Incomplete Records is (not so) Straightforward

Part of the Lecture Notes in Computer Science book series (LNCS, volume 13876)

Abstract

A straightforward approach to handling missing values is to drop incomplete records from the dataset. However, for many forms of missingness, this method is known to affect the center and spread of the data distribution. In this paper, we perform an extensive empirical evaluation of the effect of the drop method on the data distribution. In particular, we analyze two scenarios that are likely to occur in practice but are rarely considered in simulation studies: 1) when features are skewed rather than symmetrically distributed and 2) when multiple forms of missingness occur simultaneously in one feature. Furthermore, we investigate the implications of the drop method for classification accuracy and demonstrate that dropping incomplete records is a questionable practice, even when test cases are dropped as well.
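The effect on center and spread can be illustrated with a minimal sketch. It is not the paper's experimental setup: the log-normal feature `x`, the correlated feature `z`, the missingness probabilities (0.8 and 0.2), and the function name `drop_incomplete` are all assumptions chosen for illustration.

```python
import random
import statistics


def drop_incomplete(n=20000, seed=0):
    """Compare mean and standard deviation of a skewed feature before and after dropping."""
    rng = random.Random(seed)
    # Hypothetical right-skewed feature x (log-normal) and a correlated, fully observed feature z.
    x = [rng.lognormvariate(0.0, 0.5) for _ in range(n)]
    z = [v + rng.gauss(0.0, 0.3) for v in x]
    z_median = statistics.median(z)
    # Assumed mechanism: x is missing with probability 0.8 when z is above its median, else 0.2.
    missing = [rng.random() < (0.8 if zi > z_median else 0.2) for zi in z]
    # The drop method: keep only the records in which x is observed.
    kept = [xi for xi, m in zip(x, missing) if not m]
    return {
        "n_kept": len(kept),
        "mean_before": statistics.mean(x),
        "mean_after": statistics.mean(kept),
        "sd_before": statistics.stdev(x),
        "sd_after": statistics.stdev(kept),
    }


if __name__ == "__main__":
    for name, value in drop_incomplete().items():
        print(f"{name}: {value:.3f}")
```

Because records with large `z` (and therefore, on average, large `x`) are dropped more often, the retained records should show a lower mean and a smaller standard deviation for `x` than the complete data.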

Notes

  1. https://archive.ics.uci.edu/ml/datasets/breast+cancer+wisconsin+(diagnostic).

  2. N.B.: in the general case, this may affect training and test distribution, but it is unclear how. Homogeneity might increase, but the data might also become more scattered and hence variance might increase. Since the distribution can be affected in a wide variety of possible ways, we will simply ignore this effect; note that technically this might affect the definition of accuracy.

Acknowledgments

Many thanks to Dr. Wouter Duivesteijn and Prof. Mykola Pechenizkiy for their continuous support in all possible ways. Thank you, Hilde Weerts, for being a sparring partner.

Author information

Corresponding author

Correspondence to Victoria Taşcău.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Schouten, R.M., Taşcău, V., Ziegler, G.G., Casano, D., Ardizzone, M., Erotokritou, MA. (2023). Dropping Incomplete Records is (not so) Straightforward. In: Crémilleux, B., Hess, S., Nijssen, S. (eds) Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. Springer, Cham. https://doi.org/10.1007/978-3-031-30047-9_30

  • DOI: https://doi.org/10.1007/978-3-031-30047-9_30

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-30046-2

  • Online ISBN: 978-3-031-30047-9

  • eBook Packages: Computer Science, Computer Science (R0)
