An Approach to Identifying and Filling Data Gaps in Machine Learning Procedures

  • Conference paper
Lecture Notes in Computational Intelligence and Decision Making (ISDMCI 2021)

Abstract

This article considers methods for detecting and filling gaps in data sets at the preliminary data-processing stage of machine learning procedures. A multi-stage approach to identifying and filling gaps is proposed, together with combined statistical quality criteria. The method consists of three stages. In the first stage, the presence of gaps in the data is determined. In the second stage, the patterns in which gaps occur are investigated using three approaches: matrix analysis, graphical analysis, and correlation analysis. This stage identifies which mechanism produced the gaps: missing completely at random (MCAR), missing at random (MAR), or missing not at random (MNAR). In the third stage, a data set without gaps is produced by one of several methods: deleting the part of the data that contains gaps, replacing missing values, or predicting missing values. Methods for handling gaps in time series are considered separately. The effectiveness of the proposed approach is investigated numerically, and examples of applying the methods to various data sets are given.
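The three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the choice of Python with pandas (version 1.1 or later), the function names, the toy data, and the particular fill strategies are all assumptions made for illustration. Graphical analysis and the paper's combined quality criteria are not covered.

```python
import numpy as np
import pandas as pd


def detect_gaps(df: pd.DataFrame) -> pd.Series:
    """Stage 1: count the missing values in each column."""
    return df.isna().sum()


def missingness_patterns(df: pd.DataFrame) -> pd.Series:
    """Stage 2 (matrix-style view): count how often each row-wise
    pattern of missing (1) / observed (0) cells occurs."""
    return df.isna().astype(int).value_counts()


def missingness_correlation(df: pd.DataFrame) -> pd.DataFrame:
    """Stage 2 (correlation view): correlate the 0/1 missingness
    indicators of columns that are partly, but not entirely, missing."""
    indicators = df.isna().astype(int)
    varying = indicators.loc[:, indicators.nunique() > 1]
    return varying.corr()


def fill_gaps(df: pd.DataFrame, strategy: str = "median") -> pd.DataFrame:
    """Stage 3: return a copy of the data without gaps.

    The strategy names are hypothetical labels for this sketch:
    'drop' deletes rows with gaps, 'median' replaces gaps with the
    column median, 'interpolate' fills gaps linearly (ordered data,
    for example a time series).
    """
    if strategy == "drop":
        return df.dropna()
    if strategy == "median":
        return df.fillna(df.median(numeric_only=True))
    if strategy == "interpolate":
        return df.interpolate(method="linear", limit_direction="both")
    raise ValueError(f"unknown strategy: {strategy}")


# Toy data, invented for this example only.
df = pd.DataFrame(
    {
        "a": [1.0, np.nan, 3.0, 4.0],
        "b": [np.nan, 2.0, np.nan, 8.0],
        "c": [1.0, 2.0, 3.0, 4.0],
    }
)

gap_counts = detect_gaps(df)
patterns = missingness_patterns(df)
indicator_corr = missingness_correlation(df)
filled_median = fill_gaps(df, "median")
filled_drop = fill_gaps(df, "drop")
filled_interp = fill_gaps(df, "interpolate")
```

A strong correlation between the missingness indicators of two columns suggests that the gaps are not MCAR, but it cannot separate MAR from MNAR on its own; that distinction generally needs domain knowledge about how the data were collected.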

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Bidyuk, P., Kalinina, I., Gozhyj, A. (2022). An Approach to Identifying and Filling Data Gaps in Machine Learning Procedures. In: Babichev, S., Lytvynenko, V. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2021. Lecture Notes on Data Engineering and Communications Technologies, vol 77. Springer, Cham. https://doi.org/10.1007/978-3-030-82014-5_11
