
Analysis of missing data and comparing the accuracy of imputation methods using wheat crop data

Multimedia Tools and Applications

Abstract 

Real-world datasets often contain values that went missing during data collection. To build an effective prediction model, missing values in the attributes that affect crop growth must be handled appropriately. This study imputes missing data in a wheat crop yield dataset comprising climatic parameters and historical data for 370 districts in the major wheat-producing states of India. Such imputation supports crop estimation and production forecasting at regular intervals. We group the imputation techniques that replace missing data into statistical and machine-learning-based methods and compare the performance of popular techniques: Arithmetic Average Replacement, Median Imputation, Linear Interpolation, Average Imputation by Nearby Districts, K-Nearest Neighbour, MissForest, Regression, and MICE. We also evaluated these methods on the Bias Correction and Steel Industry Energy Consumption datasets from the UCI Machine Learning Repository. The imputed datasets were fed to multiple regression prediction models to evaluate the efficiency of the imputation approaches. The results show that Arithmetic Average Replacement gives good results among the statistical methods (R² = 0.83; RMSE = 0.47; MAE = 0.372; MSE = 0.229), whereas among the machine-learning-based methods, the random-forest-based MissForest and MICE performed well (R² = 0.80; RMSE = 0.499; MAE = 0.3825; MSE = 0.249). We hope these results help researchers select appropriate pre-processing strategies and improve data quality.
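To make the statistical methods and the reported metrics concrete, the following is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function names (`impute_constant`, `impute_linear`, `scores`) and the toy yield series are hypothetical, and the scores here compare imputed values directly with artificially hidden true values, which is simpler than the study's procedure of scoring downstream regression models.

```python
from math import sqrt
from statistics import mean, median


def impute_constant(values, stat=mean):
    """Replace each None with a summary statistic of the observed values."""
    observed = [v for v in values if v is not None]
    fill = stat(observed)
    return [fill if v is None else v for v in values]


def impute_linear(values):
    """Fill each None by linear interpolation between its nearest observed
    neighbours; gaps at either end copy the nearest observed value."""
    known = [i for i, v in enumerate(values) if v is not None]
    out = list(values)
    for i, v in enumerate(values):
        if v is not None:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            weight = (i - left) / (right - left)
            out[i] = values[left] + weight * (values[right] - values[left])
    return out


def scores(y_true, y_pred):
    """Return R2, RMSE, MAE and MSE for two equal-length sequences."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    ss_res = sum(e * e for e in errors)
    y_bar = mean(y_true)
    ss_tot = sum((t - y_bar) ** 2 for t in y_true)
    mse = ss_res / n
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": sqrt(mse),
        "MAE": sum(abs(e) for e in errors) / n,
        "MSE": mse,
    }


# Hypothetical yearly yields for one district; None marks a missing year.
truth = [2.0, 2.5, 3.0, 3.5, 4.0]
observed = [2.0, None, 3.0, None, 4.0]

for name, filled in [
    ("mean", impute_constant(observed, mean)),
    ("median", impute_constant(observed, median)),
    ("linear", impute_linear(observed)),
]:
    print(name, scores(truth, filled))
```

Because the toy series is exactly linear, interpolation recovers the hidden values without error, while mean and median replacement flatten them toward the centre of the series. Note also that RMSE is the square root of MSE, which is consistent with the figures in the abstract (for example, 0.499² ≈ 0.249).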




Data availability

All the data analyzed in this study are included in the references of this article.

References 

  1. Khan SI, Hoque ASML (2020) SICE: an improved missing data imputation technique. J Big Data 7:37. https://doi.org/10.1186/s40537-020-00313-w


  2. Jadhav A, Pramod D, Ramanathan K (2019) Comparison of Performance of Data Imputation Methods for Numeric Dataset. Appl Artif Intell 33:913–933. https://doi.org/10.1080/08839514.2019.1637138


  3. Chhabra G, Vashisht V, Ranjan J (2019) A Review on Missing Data Value Estimation Using Imputation Algorithm. J Dyn Control Syst 11:312–318


  4. Zhang Z (2015) Missing values in big data research: some basic skills. Ann Transl Med 3:21. https://doi.org/10.3978/j.issn.2305-5839.2015.12.11


  5. Kwak SK, Kim JH (2017) Statistical data preparation: management of missing values and outliers. Korean J Anesthesiol 70(4):407–411. https://doi.org/10.4097/kjae.2017.70.4.407


  6. Kang H (2013) The prevention and handling of the missing data. Korean J Anesthesiol 64(5):402. https://doi.org/10.4097/kjae.2013.64.5.402


  7. Acuna E, Rodriguez C (2004) The treatment of missing values and its effect on classifier accuracy. In: Banks D, McMorris FR, Arabie P, Gaul W (eds) Classification, clustering, and data mining applications. Studies in classification, data analysis, and knowledge organisation. Springer, Berlin, Heidelberg, pp 639–647. https://doi.org/10.1007/978-3-642-17103-1_60

  8. Turrado CC, López MDCM, Lasheras FS, Gómez BAR, Rollé JLC, Juez FJdC (2014) Missing data imputation of solar radiation data under different atmospheric conditions. Sensors 14:20382–20399. https://doi.org/10.3390/s141120382


  9. Biessmann F, Salinas D, Schelter S, Schmidt P, Lange D (2018) “Deep” learning for missing value imputation in tables with non-numerical data. In: Proceedings of the 27th ACM international conference on information and knowledge management. CIKM, Italy, pp 2017–2025. https://doi.org/10.1145/3269206.3272005

  10. Nikfalazar S, Yeh CH, Bedingfield S, Khorshidi HA (2020) Missing data imputation using decision trees and fuzzy clustering with iterative learning. Knowl Inf Syst 62:2419–2437


  11. Silva HD, Perera AS (2016) Missing data imputation using evolutionary k-nearest neighbor algorithm for gene expression data. In: International Conference on Advances in ICT for Emerging Regions (ICTer). Negombo, Sri Lanka, pp 141–146. https://doi.org/10.1109/ICTER.2016.7829911

  12. Cao J, Tunkiel AT, Arild O, Sui D (2023) Quantitative evaluation of imputation methods using bounds estimation of the coefficient of determination for data-driven models with an application to drilling logs. SPE J 28(04):1895–1911. https://doi.org/10.2118/214323-PA

  13. Luo Y (2022) Evaluating the state of the art in missing data imputation for clinical data. Brief Bioinform 23:1. https://doi.org/10.1093/bib/bbab489


  14. Jinubala V, Lawrance R (2016) Analysis of Missing Data and Imputation on Agriculture Data With Predictive Mean Matching Method. Int J Sci Appl Inf Technol 5(1):01–04


  15. Fu Y, Liao H, Lv L (2021) A Comparative Study of Various Methods for Handling Missing Data in UNSODA. Agriculture 11(8):727. https://doi.org/10.3390/agriculture11080727


  16. Arciniegas-Alarcón S, García-Peña M, Krzanowski W (2016) Missing value imputation in multi-environment trials: reconsidering the krzanowski method. Crop Breed Appl Biotechnol 16(2):77–85. https://doi.org/10.1590/1984-70332016v16n2a13

  17. Gedikoglu H, Parcell JL (2012) Implications of Missing Data Imputation for Agricultural Household Surveys: An Application to Technology Adoption. Agricultural & Applied Economics Association’s 2012 AAEA Annual Meeting. Seattle, Washington, pp 12–14


  18. Lokupitiya R, Lokupitiya E, Paustian K (2006) Comparison of missing value imputation methods for crop yield data. Environmetrics 17(4):339–349. https://doi.org/10.1002/env.773


  19. Solfanelli F, Gambelli D, Vairo D, Zanoli R (2019) Estimating missing data for organic farming by multiple imputation: the case of organic fruit yields in Italy. Org Agr 9:295–303. https://doi.org/10.1007/s13165-018-0228-8

  20. Gorard S (2020) Handling missing data in numeric analyses. Int J Soc Res Methodol 23(6):651–660. https://doi.org/10.1080/13645579.2020.1729974


  21. Curley C, Krause RM, Feiock R, Hawkins CV (2019) Dealing with Missing Data: A Comparative Exploration of Approaches Using the Integrated City Sustainability Database. Urban Affairs Review 55(2):591–615. https://doi.org/10.1177/1078087417726394


  22. Poulos J, Valle R (2018) Missing Data Imputation for Supervised Learning. Appl Artif Intell 32(2):186–196. https://doi.org/10.1080/08839514.2018.1448143


  23. Crop production statistics by directorate of economics and statistics, ministry of agriculture, and farmers welfare. https://aps.dac.gov.in/APY/Public_Report1.aspx. Accessed 5 Jan 2023

  24. Data Access Viewer. https://power.larc.nasa.gov/data-access-viewer/. Accessed 5 Jan 2023

  25. Demirtas H (2018) Flexible imputation of missing data. J Stat Softw 85(1):1–5


  26. Hoque G (2021) A better way to handle missing values in your dataset: using iterative imputer (PART I). Towards Data Science. https://towardsdatascience.com/a-better-way-to-handle-missing-values-in-your-dataset-using-iterativeimputer-9e6e84857d98. Accessed 10 Jan 2023

  27. Chen Y-C (2020) Pattern graphs: a graphical approach to nonmonotone missing data. arXiv:2004.00744. https://doi.org/10.48550/arXiv.2004.00744

  28. Scharfstein DO, Hogan J, Herman A (2012) On the prevention and analysis of missing data in randomized clinical trials: the state of the art. J Bone Joint Surg Am 94(Suppl 1):80–84


  29. Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592


  30. Warnes Z (2021) Missing value handling — missing data types. Towards Data Science. https://towardsdatascience.com/missing-value-handling-missing-data-types-a89c0d81a5bb. Accessed 10 Jan 2023

  31. Meggiorin M, Passadore G, Bertoldo S, Sottani A, Rinaldo A (2023) Comparison of Three Imputation Methods for Groundwater Level Timeseries. Water 15(4):801. https://doi.org/10.3390/w15040801


  32. Dantan E, Proust-Lima C, Letenneur L, Jacqmin-Gadda H (2008) Pattern mixture models and latent class models for the analysis of multivariate longitudinal data with informative dropouts. Int J Biostat 4(1):10. https://doi.org/10.2202/1557-4679.1088

  33. Graham JW (2012) Analysis of missing data. Missing data. Springer, New York, pp 47–69


  34. Bici R (2023) Simple methods to handle missing data. Int J Comp Econ Econ 13(2):216–242. https://doi.org/10.1504/IJCEE.2023.129986


  35. Little RJ, Rubin DB (2019) Statistical analysis with missing data. Wiley Series in Probability and Statistics, Hoboken. https://doi.org/10.1002/9781119482260

  36. Wafaa H, Nzar A (2023) Missing value imputation Techniques: A Survey. UHD J Sci Technol 7:72–81. https://doi.org/10.21928/uhdjst.v7n1y2023.pp72-81


  37. Mohammed M, Zulkafli H, Mohd A, Ali N, Baba I, Baba MM (2021) Comparison of five imputation methods in handling missing data in a continuous frequency table. AIP Conf Proc 040009:0400061–0400069. https://doi.org/10.1063/5.0053286


  38. Donders ART, Van Der Heijden GJ, Stijnen T, Moons KG (2006) A gentle introduction to imputation of missing values. J Clin Epidemiol 59(10):1087–1091


  39. Jahan F, Sinha NC, Rahman MM, Rahman MM, Mondal MSH, Islam MA (2019) Comparison of missing value estimation techniques in rainfall data of Bangladesh. Theor Appl Climatol 136(3):1115–1131


  40. Dumedah G, Coulibaly P (2011) Evaluation of statistical methods for infilling missing values in high-resolution soil moisture data. J Hydrol 400(1–2):95–102


  41. Malhotra N (1987) Analyzing marketing research data with incomplete information on the dependent variable. J Mark Res 24:74–84


  42. Lin W-C, Tsai C-F (2020) Missing value imputation: a review and analysis of the literature (2006–2017). Artif Intell Rev 53(2):1487–1509


  43. Zhang Y, Thorburn PJ (2022) Handling missing data in near real-time environmental monitoring: A system and a review of selected methods. Future Gener Comput Syst 128:63–72


  44. Alexopoulos EC (2010) Introduction to multivariate regression analysis. Hippokratia 14(Suppl 1):23


  45. Emmanuel T, Maupong T, Mpoeleng D et al (2021) A survey on missing data in machine learning. J Big Data 8:140. https://doi.org/10.1186/s40537-021-00516-9


  46. Song Q, Shepperd M (2007) Missing data imputation techniques. Int J Bus Intell Data Min 2(3):261–291


  47. Yu L, Liu L, Peace KE (2020) Regression multiple imputation for missing data analysis. Stat Methods Med Res 29(9):2647–2664


  48. Maillo J, Ramírez S, Triguero I, Herrera F (2017) kNN-is: an iterative Spark-based design of the k-nearest neighbors classifier for big data. Knowl Based Syst 117:3–15


  49. Amirteimoori A, Kordrostami S (2010) A Euclidean distance-based measure of efficiency in data envelopment analysis. Optimization 59(7):985–996


  50. Beretta L, Santaniello A (2016) Nearest neighbor imputation algorithms: a critical evaluation. BMC Med Inform Decis Mak 16(3):74


  51. Acuna E, Rodriguez C (2004) The treatment of missing values and its effect on classifier accuracy. Classification, clustering, and data mining applications. Springer, New York, pp 639–647


  52. Jiang C, Yang Z (2015) CKNNI: An Improved KNN-Based Missing Value Handling Technique. In: Huang DS, Han K (eds) Advanced intelligent computing theories and applications. ICIC 2015. Lecture notes in computer science, vol 9227. Springer, Cham. https://doi.org/10.1007/978-3-319-22053-6_47

  53. Sun B, Ma L, Cheng W, Wen W, Goswami P, Bai G (2017) An improved k-nearest neighbours method for traffic time series imputation. In: Chinese automation congress (CAC). IEEE 10. https://doi.org/10.1109/CAC.2017.8244105

  54. He Y, Pi D-C (2016) Improving KNN method based on reduced relational grade for microarray missing values imputation. IAENG Int J Comput Sci 43(3):1–7


  55. Stekhoven DJ, Buhlmann P (2012) MissForest–non-parametric missing value imputation for mixed-type data. Bioinformatics 28(1):112–118


  56. Van Buuren S, Groothuis-Oudshoorn K (2011) Mice: Multivariate Imputation by Chained Equations in R. J Stat Softw 45(3):1–67


  57. Tang F, Ishwaran H (2017) Random Forest missing data algorithms. Stat Analysis Data Mining 10(6):363–377


  58. Hong S, Lynn HS (2020) Accuracy of random-forest-based imputation of missing data in the presence of non-normality, non-linearity, and interaction. BMC Med Res Methodol 20(1):1–12


  59. Ye A (2020) MissForest: the best missing data imputation algorithm? Towards Data Science. https://towardsdatascience.com/missforest-the-best-missing-data-imputation-algorithm-4d01182aed3. Accessed 10 Jan 2023

  60. Honghai F, Guoshun C, Cheng Y, Bingru Y, Yumei C (2005) A SVM regression based approach to filling in missing values. In: Khosla R, Howlett RJ, Jain LC (eds) Knowledge-based intelligent information and engineering systems. KES 2005. Lecture Notes in Computer Science, vol 3683. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11553939_83

  61. Pelckmans K, De Brabanter J, Suykens JA, De Moor B (2005) Handling missing values in support vector machine classifiers. Neural Netw 18(5–6):684–692


  62. Zhang Z (2016) Multiple imputation with multivariate imputation by chained equation (MICE) package. ATM Ann Transl Med 4:2


  63. Sathishkumar VE, Changsun S, Yongyun C (2023) Steel industry energy consumption. UCI Machine Learning Repository. https://doi.org/10.24432/C52G8C

  64. Azur MJ, Stuart EA, Frangakis C, Leaf PJ (2011) Multiple imputation by chained equations: what is it and how does it work? Int J Methods Psychiatr Res 20(1):40–49


  65. Sattari MT, Rezazadeh-Joudi A, Kusiak A (2016) Assessment of different methods for estimation of missing data in precipitation studies. Hydrol Res. https://doi.org/10.2166/nh.2016.364


  66. Bias correction of numerical prediction model temperature forecast (2020) UCI Machine Learning Repository. https://doi.org/10.24432/C59K76

  67. Raymond MR (1986) Missing data in evaluation research. Eval Health Prof 9(4):395–420. https://doi.org/10.1177/016327878600900401


  68. Tsikriktsis N (2005) A review of techniques for treating missing data in OM survey research. J Oper Manag 24(1):53–62. https://doi.org/10.1016/j.jom.2005.03.001


  69. Bennett DA (2001) How can I deal with missing data in my study? Aust N Z J Public Health 25(5):464–469


  70. Tabachnick BG, Fidell LS (2012) Using multivariate statistics, 6th edn. Allyn & Bacon, Needham Heights, MA

  71. Badr W (2019) 6 Different ways to compensate for missing values in a dataset (data imputation with examples). Towards Data Science. https://towardsdatascience.com/6-different-ways-to-compensate-for-missing-values-data-imputation-with-examples-6022d9ca0779. Accessed 20 Jan 2023

  72. Pan S, Chen S (2023) Empirical Comparison of Imputation Methods for Multivariate Missing Data in Public Health. Int J Environ Res Public Health 20(2):1524. https://doi.org/10.3390/ijerph20021524


  73. Gabr MI, Helmy YM, Elzanfaly DS (2023) Effect of Missing Data Types and Imputation Methods on Supervised Classifiers: An Evaluation Study. Big Data Cogn Comput 7(1):55. https://doi.org/10.3390/bdcc7010055


  74. Miao X, Wu Y, Chen L, Gao Y, Yin J (2023) An Experimental Survey of Missing Data Imputation Algorithms. IEEE Trans Knowl Data Eng 35(7):6630–6650. https://doi.org/10.1109/TKDE.2022.3186498



Author information


Contributions

Preeti Saini: Conceptualization, Data curation, Formal analysis, Investigation, Methodology, Resources, Software, Writing—original draft, Writing—review & editing. Bharti Nagpal: Supervision, Project administration, Validation, Visualization.

Corresponding author

Correspondence to Preeti Saini.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Saini, P., Nagpal, B. Analysis of missing data and comparing the accuracy of imputation methods using wheat crop data. Multimed Tools Appl 83, 40393–40414 (2024). https://doi.org/10.1007/s11042-023-17178-9



  • DOI: https://doi.org/10.1007/s11042-023-17178-9
