Abstract
The article considers methods for detecting and filling gaps in data sets at the preliminary data processing stage of machine learning procedures. A multi-stage approach to identifying and filling gaps in data sets, together with combined statistical quality criteria, is proposed. The method consists of three stages. In the first stage, the presence of gaps in the data is determined. In the second stage, the patterns in which gaps occur are investigated using three approaches: matrix analysis, graphical analysis and correlation analysis. This stage identifies which mechanism produced the gaps: missing completely at random (MCAR), missing at random (MAR) or missing not at random (MNAR). In the third stage, various methods are used to produce a data set without gaps. These include deleting the part of the data that contains gaps, various replacement methods, and methods for predicting missing values. Methods for handling gaps in time series are considered separately. The effectiveness of the proposed approach is investigated numerically, and examples of applying the methods to various data sets are given.
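The three stages described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy data set, the column names and all function names below are hypothetical, and only the simplest representative of each stage is shown (gap counting, a missingness indicator matrix, listwise deletion, mean replacement, and linear interpolation for a time series). Identifying the MCAR, MAR or MNAR mechanism is not attempted here.

```python
from statistics import mean

# Hypothetical toy data set; None marks a gap.
rows = [
    {"age": 25, "income": 30.0},
    {"age": 32, "income": None},
    {"age": 47, "income": 52.0},
    {"age": None, "income": 41.0},
    {"age": 51, "income": None},
]


def count_gaps(rows):
    """Stage 1: count missing values in each column."""
    cols = rows[0].keys()
    return {c: sum(r[c] is None for r in rows) for c in cols}


def missingness_matrix(rows):
    """Stage 2 (matrix analysis): 1 where a value is missing, 0 otherwise."""
    cols = list(rows[0].keys())
    return [[int(r[c] is None) for c in cols] for r in rows]


def drop_incomplete(rows):
    """Stage 3, deletion: keep only rows that have no gaps."""
    return [r for r in rows if all(v is not None for v in r.values())]


def impute_mean(rows):
    """Stage 3, replacement: fill each gap with the mean of the observed column values."""
    cols = rows[0].keys()
    means = {c: mean(r[c] for r in rows if r[c] is not None) for c in cols}
    return [{c: (means[c] if r[c] is None else r[c]) for c in cols} for r in rows]


def interpolate_linear(series):
    """Time series: fill interior gaps by linear interpolation between the
    nearest observed neighbours; leading and trailing gaps stay as None."""
    out = list(series)
    known = [i for i, v in enumerate(out) if v is not None]
    for a, b in zip(known, known[1:]):
        for i in range(a + 1, b):
            out[i] = out[a] + (out[b] - out[a]) * (i - a) / (b - a)
    return out


if __name__ == "__main__":
    print(count_gaps(rows))
    print(missingness_matrix(rows))
    print(drop_incomplete(rows))
    print(impute_mean(rows))
    print(interpolate_linear([1.0, None, None, 4.0, None]))
```

Mean replacement and listwise deletion are chosen only because they are the shortest to write down; both are generally reasonable only when the gaps are MCAR, which is why the second stage of the approach precedes the choice of a filling method.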
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Bidyuk, P., Kalinina, I., Gozhyj, A. (2022). An Approach to Identifying and Filling Data Gaps in Machine Learning Procedures. In: Babichev, S., Lytvynenko, V. (eds) Lecture Notes in Computational Intelligence and Decision Making. ISDMCI 2021. Lecture Notes on Data Engineering and Communications Technologies, vol 77. Springer, Cham. https://doi.org/10.1007/978-3-030-82014-5_11
DOI: https://doi.org/10.1007/978-3-030-82014-5_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82013-8
Online ISBN: 978-3-030-82014-5
eBook Packages: Intelligent Technologies and Robotics (R0)