Abstract
Real-world datasets often contain missing values that arise during data collection. To build an effective prediction model, missing values in the attributes that influence crop growth must be handled appropriately. This study imputes missing data in a wheat crop yield dataset comprising climatic parameters and historical data for 370 districts in the major wheat-producing states of India. Such imputation supports crop estimation and the forecasting of production at regular intervals. Imputation techniques, which replace missing data, are categorized into statistical and machine learning-based methods. We examine the performance of popular techniques: Arithmetic Average Replacement, Median Imputation, Linear Interpolation, Average Imputation by Nearby Districts, K-Nearest Neighbour, MissForest, Regression, and MICE. We also evaluate these methods on the Bias correction and Steel industry energy consumption datasets from the UCI Machine Learning Repository. The imputed datasets were fed to multiple regression prediction models to assess how well each imputation approach performed. Among the statistical methods, Arithmetic Average Replacement gives good results (R2 = 0.83; RMSE = 0.47; MAE = 0.372; MSE = 0.229), whereas among the machine learning-based methods, MissForest (a Random Forest-based method) and MICE perform well (R2 = 0.80; MAE = 0.3825; MSE = 0.249; RMSE = 0.499). We hope these results help researchers select appropriate pre-processing strategies and improve data quality.
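As an illustration of the three simplest statistical techniques named above (arithmetic average replacement, median imputation, and linear interpolation) and of the four reported metrics, the following is a minimal sketch in plain Python. It is not the authors' implementation. The function names, the toy series, and the choice to score imputed values directly against known true values (rather than through a downstream regression model, as the study does) are assumptions made for illustration only.

```python
import math
from statistics import mean, median


def impute_constant(values, stat):
    """Replace None entries with a statistic (e.g. mean or median)
    computed from the observed entries only."""
    fill = stat([v for v in values if v is not None])
    return [fill if v is None else v for v in values]


def impute_linear(values):
    """Fill None entries by linear interpolation between the nearest
    observed neighbours; gaps at either end copy the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out


def scores(y_true, y_pred):
    """Return R2, MSE, RMSE and MAE for two equal-length sequences.
    Assumes y_true is not constant (otherwise R2 is undefined)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    r2 = 1 - (mse * n) / ss_tot
    return {"R2": r2, "MSE": mse, "RMSE": math.sqrt(mse), "MAE": mae}


# Hypothetical toy series: the true values and a copy with two entries masked.
truth = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]
observed = [10.0, None, 14.0, None, 18.0, 20.0]

candidates = {
    "mean": impute_constant(observed, mean),
    "median": impute_constant(observed, median),
    "linear": impute_linear(observed),
}
for name, filled in candidates.items():
    print(name, scores(truth, filled))
```

On this toy series the data are perfectly linear, so interpolation recovers the masked values exactly; that is a property of the example and says nothing about the relative ranking of the methods on the wheat dataset.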
Data availability
All the data analyzed in this study are included in the references of this article.
References
Khan SI, Hoque ASML (2020) SICE: an improved missing data imputation technique. J Big Data 7:37. https://doi.org/10.1186/s40537-020-00313-w
Jadhav A, Pramod D, Ramanathan K (2019) Comparison of Performance of Data Imputation Methods for Numeric Dataset. Appl Artif Intell 33:913–933. https://doi.org/10.1080/08839514.2019.1637138
Chhabra G, Vashisht V, Ranjan J (2019) A Review on Missing Data Value Estimation Using Imputation Algorithm. J Dyn Control Syst 11:312–318
Zhang Z (2015) Missing values in big data research: some basic skills. Ann Transl Med 3:21. https://doi.org/10.3978/j.issn.2305-5839.2015.12.11
Kwak SK, Kim JH (2017) Statistical data preparation: management of missing values and outliers. Korean J Anesthesiol 70(4):407–411. https://doi.org/10.4097/kjae.2017.70.4.407
Kang H (2013) The prevention and handling of the missing data. Korean J Anesthesiol 64(5):402. https://doi.org/10.4097/kjae.2013.64.5.402
Acuna E, Rodriguez C (2004) The treatment of missing values and its effect on classifier accuracy. In: Banks D, McMorris FR, Arabie P, Gaul W (eds) Classification, clustering, and data mining applications. Studies in classification, data analysis, and knowledge organisation. Springer, Berlin, Heidelberg, pp 639–647. https://doi.org/10.1007/978-3-642-17103-1_60
Turrado CC, López MDCM, Lasheras FS, Gómez BAR, Rollé JLC, Juez FJdC (2014) Missing data imputation of solar radiation data under different atmospheric conditions. Sensors 14:20382–20399. https://doi.org/10.3390/s141120382
Biessmann F, Salinas D, Schelter S, Schmidt P, Lange D (2018) “Deep” learning for missing value imputation in tables with non-numerical data. In: Proceedings of the 27th ACM international conference on information and knowledge management. CIKM, Italy, pp 2017–2025. https://doi.org/10.1145/3269206.3272005
Nikfalazar S, Yeh CH, Bedingfield S, Khorshidi HA (2020) Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowl Inf Syst 62:2419–2437
Silva HD, Perera AS (2016) Missing data imputation using evolutionary k-nearest neighbor algorithm for gene expression data. In: International Conference on Advances in ICT for Emerging Regions (ICTer). Negombo, Sri Lanka, pp 141–146. https://doi.org/10.1109/ICTER.2016.7829911
Cao J, Tunkiel AT, Arild O, Sui D (2023) Quantitative evaluation of imputation methods using bounds estimation of the coefficient of determination for data-driven models with an application to drilling logs. SPE J 28 (04):1895–1911. https://doi.org/10.2118/214323-PA
Luo Y (2022) Evaluating the state of the art in missing data imputation for clinical data. Brief Bioinform 23:1. https://doi.org/10.1093/bib/bbab489
Jinubala V, Lawrance R (2016) Analysis of Missing Data and Imputation on Agriculture Data With Predictive Mean Matching Method. Int J Sci Appl Inf Technol 5(1):01–04
Fu Y, Liao H, Lv L (2021) A Comparative Study of Various Methods for Handling Missing Data in UNSODA. Agriculture 11(8):727. https://doi.org/10.3390/agriculture11080727
Arciniegas-Alarcón S, García-Peña M, Krzanowski W (2016) Missing value imputation in multi-environment trials: reconsidering the krzanowski method. Crop Breed Appl Biotechnol 16(2):77–85. https://doi.org/10.1590/1984-70332016v16n2a13
Gedikoglu H, Parcell JL (2012) Implications of Missing Data Imputation for Agricultural Household Surveys: An Application to Technology Adoption. Agricultural & Applied Economics Association’s 2012 AAEA Annual Meeting. Seattle, Washington, pp 12–14
Lokupitiya R, Lokupitiya E, Paustian K (2006) Comparison of missing value imputation methods for crop yield data. Environmetrics 17(4):339–349. https://doi.org/10.1002/env.773
Solfanelli F, Gambelli D, Vairo D, Zanoli R (2019) Estimating missing data for organic farming by multiple imputation: the case of organic fruit yields in Italy. Org Agr 9:295–303. https://doi.org/10.1007/s13165-018-0228-8
Gorard S (2020) Handling missing data in numeric analyses. Int J Soc Res Methodol 23(6):651–660. https://doi.org/10.1080/13645579.2020.1729974
Curley C, Krause RM, Feiock R, Hawkins CV (2019) Dealing with Missing Data: A Comparative Exploration of Approaches Using the Integrated City Sustainability Database. Urban Affairs Review 55(2):591–615. https://doi.org/10.1177/1078087417726394
Poulos J, Valle R (2018) Missing Data Imputation for Supervised Learning. Appl Artif Intell 32(2):186–196. https://doi.org/10.1080/08839514.2018.1448143
Crop production statistics by directorate of economics and statistics, ministry of agriculture, and farmers welfare. https://aps.dac.gov.in/APY/Public_Report1.aspx. Accessed 5 Jan 2023
Data Access Viewer. https://power.larc.nasa.gov/data-access-viewer/. Accessed 5 Jan 2023
Demirtas H (2018) Flexible imputation of missing data. J Stat Softw 85(1):1–5
Hoque G (2021) A better way to handle missing values in your dataset: using iterative imputer (PART I). Towards Data Science. https://towardsdatascience.com/a-better-way-to-handle-missing-values-in-your-dataset-using-iterativeimputer-9e6e84857d98. Accessed 10 Jan 2023
Chen Y-C (2020) Pattern graphs: a graphical approach to nonmonotone missing data. arXiv:2004.00744. https://doi.org/10.48550/arXiv.2004.00744
Scharfstein DO, Hogan J, Herman A (2012) On the prevention and analysis of missing data in randomized clinical trials: the state of the art. J Bone Joint Surg Am 94(Suppl 1):80–84
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Warnes Z (2021) Missing value handling — missing data types. Towards Data Science. https://towardsdatascience.com/missing-value-handling-missing-data-types-a89c0d81a5bb. Accessed 10 Jan 2023
Meggiorin M, Passadore G, Bertoldo S, Sottani A, Rinaldo A (2023) Comparison of Three Imputation Methods for Groundwater Level Timeseries. Water 15(4):801. https://doi.org/10.3390/w15040801
Dantan E, Proust-Lima C, Letenneur L, Jacqmin-Gadda H (2008) Pattern mixture models and latent class models for the analysis of multivariate longitudinal data with informative dropouts. Int J Biostat. 4(1):10. https://doi.org/10.2202/1557-4679.1088
Graham JW (2012) Analysis of missing data. Missing data. Springer, New York, pp 47–69
Bici R (2023) Simple methods to handle missing data. Int J Comp Econ Econ 13(2):216–242. https://doi.org/10.1504/IJCEE.2023.129986
Little RJ, Rubin DB (2019) Statistical analysis with missing data. Wiley Series in Probability and Statistics, Hoboken. https://doi.org/10.1002/9781119482260
Wafaa H, Nzar A (2023) Missing value imputation Techniques: A Survey. UHD J Sci Technol 7:72–81. https://doi.org/10.21928/uhdjst.v7n1y2023.pp72-81
Mohammed M, Zulkafli H, Mohd A, Ali N, Baba I, Baba MM (2021) Comparison of five imputation methods in handling missing data in a continuous frequency table. AIP Conf Proc 040009:0400061–0400069. https://doi.org/10.1063/5.0053286
Donders ART, Van Der Heijden GJ, Stijnen T, Moons KG (2006) A gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091
Jahan F, Sinha NC, Rahman MM, Rahman MM, Mondal MSH, Islam MA (2019) Comparison of missing value estimation techniques in rainfall data of Bangladesh. Theor Appl Climatol 136(3):1115–1131
Dumedah G, Coulibaly P (2011) Evaluation of statistical methods for infilling missing values in high-resolution soil moisture data. J Hydrol 400(1–2):95–102
Malhotra N (1987) Analyzing marketing research data with incomplete information on the dependent variable. J Mark Res 24:74–84
Lin W-C, Tsai C-F (2020) Missing value imputation: a review and analysis of the literature (2006–2017). Artif Intell Rev 53(2):1487–1509
Zhang Y, Thorburn PJ (2022) Handling missing data in near real-time environmental monitoring: A system and a review of selected methods. Future Gener Comput Syst 128:63–72
Alexopoulos EC (2010) Introduction to multivariate regression analysis. Hippokratia 14(Suppl 1):23
Emmanuel T, Maupong T, Mpoeleng D et al (2021) A survey on missing data in machine learning. J Big Data 8:140. https://doi.org/10.1186/s40537-021-00516-9
Song Q, Shepperd M (2007) Missing data imputation techniques. Int J Bus Intell Data Min 2(3):261–291
Yu L, Liu L, Peace KE (2020) Regression multiple imputation for missing data analysis. Stat Methods Med Res 29(9):2647–2664
Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-is: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl Based Syst 117:3–15
Amirteimoori A, Kordrostami S (2010) A Euclidean distance-based measure of efficiency in data envelopment analysis. Optimization 59(7):985–996
Beretta L, Santaniello A (2016) Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak 16(3):74
Acuna E, Rodriguez C (2004) The treatment of missing values and its effect on classifier accuracy. Classification, clustering, and data mining applications. Springer, New York, pp 639–647
Jiang C, Yang Z (2015) CKNNI: An Improved KNN-Based Missing Value Handling Technique. In: Huang DS, Han K (eds) Advanced intelligent computing theories and applications. ICIC 2015. Lecture notes in computer science, vol 9227. Springer, Cham. https://doi.org/10.1007/978-3-319-22053-6_47
Sun B, Ma L, Cheng W, Wen W, Goswami P, Bai G (2017) An improved k-nearest neighbours method for traffic time series imputation. In: Chinese automation congress (CAC). IEEE 10. https://doi.org/10.1109/CAC.2017.8244105
He Y, Pi D-C (2016) Improving KNN method based on reduced relational grade for microarray missing values imputation. IAENG Int J Comput Sci 43(3):1–7
Stekhoven DJ, Buhlmann P (2012) MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118
Van Buuren S, Groothuis-Oudshoorn K (2011) Mice: Multivariate Imputation by Chained Equations in R. J Stat Softw 45(3):1–67
Tang F, Ishwaran H (2017) Random Forest missing data algorithms. Stat Analysis Data Mining 10(6):363–377
Hong S, Lynn HS (2020) Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol 20(1):1–12
Ye A (2020) MissForest: the best missing data imputation algorithm? Towards Data Science. https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3. Accessed 10 Jan 2023
Honghai F, Guoshun C, Cheng Y, Bingru Y, Yumei C (2005) A SVM regression based approach to filling in missing values. In: Khosla R, Howlett RJ, Jain LC (eds) Knowledge-based intelligent information and engineering systems. KES 2005. Lecture Notes in Computer Science, vol 3683. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11553939_83
Pelckmans K, De Brabanter J, Suykens JA, De Moor B (2005) Handling missing values in support vector machine classifiers. Neural Netw 18(5–6):684–692
Zhang Z (2016) Multiple imputation with multivariate imputation by chained equation (MICE) package. Ann Transl Med 4:2
Sathishkumar VE, Changsun S, Yongyun C (2023) Steel industry energy consumption. UCI Machine Learning Repository. https://doi.org/10.24432/C52G8C
Azur MJ, Stuart EA, Frangakis C, Leaf PJ (2011) Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res 20(1):40–49
Sattari MT, Rezazadeh-Joudi A, Kusiak A (2016) Assessment of different methods for estimation of missing data in precipitation studies. Hydrol Res. https://doi.org/10.2166/nh.2016.364
Bias correction of numerical prediction model temperature forecast (2020) UCI Machine Learning Repository. https://doi.org/10.24432/C59K76
Raymond MR (1986) Missing data in evaluation research. Eval Health Prof 9(4):395–420. https://doi.org/10.1177/016327878600900401
Tsikriktsis N (2005) A review of techniques for treating missing data in OM survey research. J Oper Manag 24(1):53–62. https://doi.org/10.1016/j.jom.2005.03.001
Bennett DA (2001) How can I deal with missing data in my study? Aust N Z J Public Health 25(5):464–469
Tabachnick BG, Fidell LS (2012) Using multivariate statistics, 6th edn. Allyn & Bacon, Needham Heights, MA
Badr W (2019) 6 Different ways to compensate for missing values in a dataset (data imputation with examples). Towards Data Science. https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779. Accessed 20 Jan 2023
Pan S, Chen S (2023) Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health. Int J Environ Res Public Health 20(2):1524. https://doi.org/10.3390/ijerph20021524
Gabr MI, Helmy YM, Elzanfaly DS (2023) Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study. Big Data Cogn Comput 7(1):55. https://doi.org/10.3390/bdcc7010055
Miao X, Wu Y, Chen L, Gao Y, Yin J (2023) An Experimental Survey of Missing Data Imputation Algorithms. IEEE Trans Knowl Data Eng 35(7):6630–6650. https://doi.org/10.1109/TKDE.2022.3186498
Author information
Contributions
Preeti Saini: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Writing—original draft, Writing—review & editing. Bharti Nagpal: Supervision, Project administration, Validation, Visualization.
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Saini, P., Nagpal, B. Analysis of missing data and comparing the accuracy of imputation methods using wheat crop data. Multimed Tools Appl 83, 40393–40414 (2024). https://doi.org/10.1007/s11042-023-17178-9
DOI: https://doi.org/10.1007/s11042-023-17178-9