Abstract
Monitoring air contaminants has become essential to exposure science, toxicology, and public health research. However, missing values are common while monitoring air contaminants, especially in resource-constrained settings such as power cuts, calibration, and sensor failure. In contaminants monitoring, evaluating existing imputation techniques for dealing with recurrent periods of missing and unobserved data are limited. The proposed study aims to perform a statistical evaluation of six univariate and four multivariate time series imputation methods. The univariate methods are based on inter-time correlation characteristics, and the multivariate approach considers muti-site to impute missing data. The present study retrieved data from 38 ground-based monitoring stations for particulate pollutants in Delhi for 4 years. For univariate methods, missing values were simulated under 0–20% (5%, 10%, 15%, and 20%), and high 40%, 60%, and 80% missing levels having long gaps. Before evaluating multivariate methods, input data underwent pre-processing steps: selecting the target station to be imputed, choosing covariates based on the spatial correlation between multiple sites, and framing a combination of target and neighbouring stations (covariates) under 20%, 40%, 60%, and 80%. Next, the particulate pollutants data of 1480 days is provided as input to four multivariate techniques. Finally, the performance of each algorithm was evaluated using error metrics. The results show that the long interval time series data and spatial correlation of multiple stations significantly improved outcomes for univariate and multivariate time series methods. The univariate Kalman_arima performs well for long-missing gaps and all missing levels (except for 60–80%), yielding low error and high R2 and d values. In contrast, multivariate MIPCA performed better than Kalman-arima for all target stations with the highest missing percentage.
Similar content being viewed by others
Data availability
Air pollutants data is freely available on the website of Central Pollution Control Board, India. Any question/ inquires pertaining to the type of data or the attributes of datasets, can be directed to the corresponding author, upon reasonable request.
Code availability
NA
References
Abayomi K, Gelman A, Levy M (2008) Diagnostics for multivariate imputations. J R Stat Soc Ser C Appl Stat 57(3):273–291
Agbailu AO, Seno A, Clement OO (2020) Kalman filter algorithm versus other methods of estimating missing values: time series evidence. Studies 4(2):1–9
Allison P (2015) Imputation by predictive mean matching: promise & peril. Statistical Horizons
Allison PD (2001) Missing data. Sage publications
Aslan S (2010) Comparison of missing value imputation methods for meteorological time series data. MS thesis, Middle East Technical University
Audigier V, Husson F, Josse J (2016) Multiple imputation for continuous variables using a Bayesian principal component analysis. J Stat Comput Simul 86(11):2140–2156
Benavides IF, Santacruz M, Romero-Leiton JP, Barreto C, Selvaraj JJ (2022) Assessing methods for multiple imputation of systematic missing data in marine fisheries time series with a new validation algorithm. Aquac Fish J
Breiman L (2001) Random forests. Mach Learn 45:5–32
Budhiraja B, Gawuc L, Agrawal G (2019) Seasonality of surface urban heat island in Delhi city region measured by local climate zones and conventional indicators. IEEE J Sel Top Appl Earth Obs Remote Sens 12(12):5223–5232
Canales RA (2004) The cumulative and aggregate simulation of exposure framework. Stanford University
Chan M (2015) Achieving a cleaner, more sustainable, and healthier future. The Lancet 386(10006):e27–e28
Chatterji A (2021) Air pollution in delhi: filling the policy gaps. Massach Undergr J Econ 17
Cho B, Dayrit T, Gao Y, Wang Z, Hong T, Sim A, Wu K (2020) Effective missing value imputation methods for building monitoring data. In: 2020 IEEE International Conference on Big Data (Big Data). IEEE
Cohen J, Cohen P, West SG, Aiken LS (2013) Applied multiple regression/correlation analysis for the behavioral sciences. Routledge
Crawley MJ (2012) The R book. John Wiley & Sons
Doove LL, Van Buuren S, Dusseldorp E (2014) Recursive partitioning for missing data imputation in the presence of interaction effects. Comput Stat Data Anal 72:92–104
Dray S, Josse J (2015) Principal component analysis with missing values: a comparative survey of methods. Plant Ecol 216(5):657–667
Eekhout I, de Boer RM, Twisk JW, de Vet HC, Heymans MW (2012) Missing data: a systematic review of how they are reported and handled. Epidemiology 23(5):729–732
Gaffert P, Meinfelder F, Bosch V (2018) Towards multiple-imputation-proper predictive mean matching. JSM:1026–1039
Ghazali SM, Shaadan N, Idrus Z (2020) Missing data exploration in air quality data set using R-package data visualisation tools. Bull Electr Eng Inform 9(2):755–763
Gómez-Carracedo MP, Andrade J, López-Mahía P, Muniategui S, Prada D (2014) A practical comparison of single and multiple imputation methods to handle complex missing data in air quality datasets. Chemometr Intell Lab Syst 134:23–33
Hadeed SJ, O’Rourke MK, Burgess JL, Harris RB, Canales RA (2020) Imputation methods for addressing missing data in short-term monitoring of air pollutants. Sci Total Environ 730:139140
Han H, Sun M, Han H, Wu X, Qiao J (2023) Univariate imputation method for recovering missing data in wastewater treatment process. Chin J Chem Eng 53:201–210
Harvey AC (1990) Forecasting, structural time series models and the Kalman filter
Huisman M (2009) Imputation of missing network data: some simple procedures. J Soc Struct 10(1):1–29
Iodice D’Enza A, Markos A, Palumbo F (2022) Chunk-wise regularised PCA-based imputation of missing data. Stat Methods Appt 31(2):365–386
John C, Ekpenyong EJ, Nworu CC (2019) Imputation of missing values in economic and financial time series data using five principal component analysis approaches. CBN J Appl Stat (JAS) 10(1):3
Josse J, Husson F (2009) Gestion des données manquantes en analyse en composantes principales. Journal de la société française de statistique 150(2):28–51
Josse J, Husson F (2011) Multiple imputation in principal component analysis. Adv Data Anal Classif 5(3):231–246
Josse J, Husson F (2016) missMDA: a package for handling missing values in multivariate data analysis. J Stat Softw 70:1–31
Junger W, De Leon AP (2015) Imputation of missing data in time series for air pollutants. Atmos Environ 102:96–104
Junior JRB, do Carmo Nicoletti M, Zhao L (2016) An embedded imputation method via attribute-based decision graphs. Expert Syst Appl 57:159–177
Junninen H, Niska H, Tuppurainen K, Ruuskanen J, Kolehmainen M (2004) Methods for imputation of missing values in air quality data sets. Atmos Environ 38(18):2895–2907
Kalman RE (1960) A new approach to linear filtering and prediction problems. Trans ASME J Basic Eng 82:35–45
Kleinke K (2018) Multiple imputation by predictive mean matching when sample size is small. Methodology: Euro J Res Methods Behav Res Methods 14(1):3
Kumar P (2022) A critical evaluation of air quality index models (1960–2021). Environ Monit Assess 194(4):1–45
Legates DR, McCabe GJ Jr (1999) Evaluating the use of “goodness-of-fit” measures in hydrologic and hydroclimatic model validation. Water Resour Res 35(1):233–241
Li KH, Le ND, Sun L, Zidek JV (1999) Spatial–temporal models for ambient hourly PM10 in Vancouver. Environmetrics: the official journal of the Int Environ Sci 10(3):321–338
Little RJA, Rubin DB (2019) Statistical analysis with missing data, vol 793. John Wiley & Sons
Little RJA, Rubin DB (2002) Single imputation methods. Statistical analysis with missing data. p 59–74. https://doi.org/10.1002/9781119013563.ch4
Liu X, Wang X, Zou L, Xia J, Pang W (2020) Spatial imputation for air pollutants data sets via low rank matrix completion algorithm. Environ Int 139:105713
Lloret J, Lleonart J, Solé I (2000) Time series modelling of landings in Northwest Mediterranean Sea. ICES Mar Sci Symp 57(1):171–184
Marshall A, Altman DG, Holder RL (2010a) Comparison of imputation methods for handling missing covariate data when fitting a Cox proportional hazards model: a resampling study. BMC Med Res Methodol 10(1):1–10
Marshall A, Altman DG, Royston P, Holder RL (2010b) Comparison of techniques for handling missing covariate data within prognostic modelling studies: a simulation study. BMC Med Res Methodol 10(1):1–16
Miettinen OS (2012) Theoretical epidemiology: principles of occurrence research in medicine. Theoretical epidemiology: principles of occurrence research in medicine:359–359
Molenberghs G, Kenward M (2007) Missing data in clinical studies. John Wiley & Sons
Moriasi DN, Arnold JG, Van Liew MW, Bingner RL, Harmel RD, Veith TL (2007) Model evaluation guidelines for systematic quantification of accuracy in watershed simulations. Trans ASABE, Appl 50(3):885–900
Moritz S, Bartz-Beielstein T (2017) ImputeTS: time series missing value imputation in R. R J 9(1):207
Moritz S, Sardá A, Bartz-Beielstein T, Zaefferer M, Stork J (2015) Comparison of different methods for univariate time series imputation in R. arXiv preprint arXiv:1510.03924
Norazian MN, Shukri YA, Azam RN, Al Bakri AMM (2008) Estimation of missing values in air pollution data using single imputation techniques. SciAsia 34(3):341–345
Plaia A, Bondi A (2006) Single imputation method of missing values in environmental pollution data sets. Atmos Environ 40(38):7316–7330
Quinteros ME, Lu S, Blazquez C, Cárdenas-R JP, Ossa X, Delgado-Saborit J-M, Harrison RM, Ruiz-Rudolph P (2019) Use of data imputation tools to reconstruct incomplete air quality datasets: a case-study in Temuco, Chile. Atmos Environ 200:40–49
Ramli MN, Yahaya A, Ramli N, Yusof N, Abdullah M (2013) Roles of imputation methods for filling the missing values: a review. Adv Environ Biol 7(12 S2):3861–3870
Raymond MR (1986) Missing data in evaluation research. Eval Health Prof 9(4):395–420
Rubin DB (1976) Inference and missing data. Biometrika 63(3):581–592
Schafer JL (1997) Analysis of incomplete multivariate data. CRC press
Schenker N, Taylor JM (1996) Partially parametric techniques for multiple imputation. Comput Stat Data Anal 22(4):425–446
Siddique J, Belin TR (2008) Multiple imputation using an iterative hot-deck with distance-based donor selection. Stat Med 27(1):83–102
Siddique J, Harel O (2009) MIDAS: a SAS macro for multiple imputation using distance-aided selection of donors. J Stat Softw 29:1–18
Stekhoven DJ, Stekhoven MDJ (2013) Package ‘missForest’. R package version 1
Sukatis FF, Noor NM, Zakaria NA, Ul-Saufie AZ, Annas S (2019) Estimation of missing values in air pollution dataset by using various imputation methods. Int J Conserv Sci 10(4):791–804
Tsikriktsis N (2005) A review of techniques for treating missing data in OM survey research. J Oper Manag 24(1):53–62
Van Buuren S (2018) Flexible imputation of missing data. CRC press
Van Buuren S, Groothuis-Oudshoorn K (2011) Mice: multivariate imputation by chained equations in R. J Stat Softw 45:1–67
Wardana I, Gardner JW, Fahmy SA (2022) Estimation of missing air pollutant data using a spatiotemporal convolutional autoencoder. Neural Comput Appl:1–26
Weerakody PB, Wong KW, Wang G, Ela W (2021) A review of irregular time series data handling with gated recurrent neural networks. Neurocomputing 441:161–178
Welch G (2006) An Introduction to the Kalman Filter. Univ. of North Carolina http://www.cs.unc.edu/~welch/media/pdf/kalman_intro.pdf. Accessed 10 Oct 2022
Wijesekara W, Liyanage L (2020) Comparison of imputation methods for missing values in air pollution data: case study on Sydney air quality index. In: Future of Information and Communication Conference. Springer
Willmott CJ (1981) On the validation of models. Phys Geogr 2:184–194
Willmott CJ, Matsuura K (2005) Advantages of the mean absolute error (MAE) over the root mean square error (RMSE) in assessing average model performance. Climate Res 30(1):79–82
Willmott CJ, Matsuura K (2006) On the use of dimensioned measures of error to evaluate the performance of spatial interpolators. Int J Geogr Inf Sci 20(1):89–102
World Health Organization (2016) Ambient air pollution: A global assessment of exposure and burden of disease
Zeileis A, Grothendieck G (2005) zoo: S3 Infrastructure for Regular and Irregular Time Series. J Stat Softw 14(6):1–27
Acknowledgements
The authors are thankful to CSIR-CSIO Chandigarh for providing the necessary infrastructure and support to carry out our work and especially thanks to the Central Pollution Control Board for making air pollutants data openly available to their official website.
Author information
Authors and Affiliations
Contributions
All authors have contributed substantially to the following: Priti K performed writing—original draft, conceptualization, data curation, formal analysis, and methodology. Kaushlesh Singh Shakya performed data curation, formal analysis, and visualization. Prashant kumar: supervision, validation, and review and editing.
Corresponding author
Ethics declarations
Ethics approval
NA
Consent to participate
NA
Consent for publication
NA
Conflict of interest
The authors declare no competing interests.
Additional information
Responsible Editor: Marcus Schulz
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary information
ESM 1
(DOCX 68.8 kb)
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
K, P., Shakya, K.S. & Kumar, P. Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage. Environ Sci Pollut Res 30, 75469–75488 (2023). https://doi.org/10.1007/s11356-023-27659-x
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11356-023-27659-x