Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

  • Research Article
Environmental Science and Pollution Research

Abstract

Monitoring air contaminants has become essential to exposure science, toxicology, and public health research. However, missing values are common in air contaminant monitoring, especially in resource-constrained settings, where they arise from power cuts, calibration, and sensor failure. Evaluations of existing imputation techniques for handling recurrent periods of missing and unobserved monitoring data are limited. The present study performs a statistical evaluation of six univariate and four multivariate time series imputation methods. The univariate methods rely on inter-time correlation characteristics, whereas the multivariate methods use data from multiple sites to impute missing values. Data on particulate pollutants were retrieved from 38 ground-based monitoring stations in Delhi over 4 years. For the univariate methods, missing values were simulated at low missing levels (5%, 10%, 15%, and 20%) and at high missing levels (40%, 60%, and 80%) with long gaps. Before the multivariate methods were evaluated, the input data underwent pre-processing: selecting the target station to be imputed, choosing covariates based on the spatial correlation between multiple sites, and framing combinations of target and neighbouring stations (covariates) under 20%, 40%, 60%, and 80% missing levels. Next, particulate pollutant data for 1480 days were provided as input to the four multivariate techniques. Finally, the performance of each algorithm was evaluated using error metrics. The results show that long-interval time series data and the spatial correlation of multiple stations significantly improved outcomes for the univariate and multivariate time series methods. The univariate Kalman_arima performs well for long missing gaps and at all missing levels except 60–80%, yielding low error and high R² and d values. In contrast, multivariate MIPCA performed better than Kalman_arima for all target stations with the highest missing percentage.
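
The evaluation loop described in the abstract (simulate long gaps at a chosen missing level, impute, then score the imputed values against the held-out truth) can be illustrated with a minimal sketch. This is not the authors' code: the series is synthetic rather than Central Pollution Control Board data, linear interpolation stands in for the Kalman_arima and MIPCA methods, and the function names, the 30-day gap length, and the seasonal signal are all hypothetical. The sketch also assumes that R² is the squared Pearson correlation and that d is Willmott's index of agreement.

```python
import numpy as np


def simulate_long_gaps(series, missing_frac, gap_len, rng):
    """Copy `series` and blank out contiguous blocks until exactly
    round(missing_frac * n) values are NaN. The first and last points
    are always kept so that interpolation stays well defined."""
    out = series.astype(float).copy()
    n = out.size
    target = int(round(missing_frac * n))
    while int(np.isnan(out).sum()) < target:
        remaining = target - int(np.isnan(out).sum())
        start = int(rng.integers(1, n - 1))          # start in [1, n-2]
        stop = min(start + min(gap_len, remaining), n - 1)
        out[start:stop] = np.nan                      # may overlap older gaps
    return out


def impute_linear(series):
    """Stand-in imputer (assumption): plain linear interpolation."""
    out = series.copy()
    missing = np.isnan(out)
    idx = np.arange(out.size)
    out[missing] = np.interp(idx[missing], idx[~missing], out[~missing])
    return out


def error_metrics(observed, imputed):
    """RMSE, MAE, squared Pearson correlation, and Willmott's d."""
    err = imputed - observed
    obs_mean = observed.mean()
    denom = np.sum((np.abs(imputed - obs_mean) + np.abs(observed - obs_mean)) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(np.corrcoef(observed, imputed)[0, 1] ** 2),
        "d": float(1.0 - np.sum(err ** 2) / denom),
    }


# Synthetic daily series of 1480 points: seasonal cycle plus noise (illustrative only).
rng = np.random.default_rng(0)
days = np.arange(1480)
truth = 120 + 60 * np.cos(2 * np.pi * days / 365.25) + rng.normal(0, 15, days.size)

results = {}
for frac in (0.2, 0.4, 0.6, 0.8):
    gappy = simulate_long_gaps(truth, frac, gap_len=30, rng=rng)
    mask = np.isnan(gappy)
    filled = impute_linear(gappy)
    # Score only the positions that were artificially removed.
    results[frac] = error_metrics(truth[mask], filled[mask])
    print(frac, results[frac])
```

Scoring only the masked positions keeps the metrics from being inflated by values the imputer never had to estimate. Replacing `impute_linear` with another method leaves the rest of the loop unchanged, which is what makes a side-by-side comparison across missing levels straightforward.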


Data availability

Air pollutant data are freely available on the website of the Central Pollution Control Board, India. Questions or inquiries about the type of data or the attributes of the datasets can be directed to the corresponding author upon reasonable request.

Code availability

NA

Acknowledgements

The authors thank CSIR-CSIO Chandigarh for providing the infrastructure and support needed to carry out this work, and especially thank the Central Pollution Control Board for making air pollutant data openly available on its official website.

Author information

Contributions

All authors contributed substantially to this work. Priti K: writing (original draft), conceptualization, data curation, formal analysis, and methodology. Kaushlesh Singh Shakya: data curation, formal analysis, and visualization. Prashant Kumar: supervision, validation, and review and editing.

Corresponding author

Correspondence to Prashant Kumar.

Ethics declarations

Ethics approval

NA

Consent to participate

NA

Consent for publication

NA

Conflict of interest

The authors declare no competing interests.

Additional information

Responsible Editor: Marcus Schulz

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

ESM 1

(DOCX 68.8 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

K, P., Shakya, K.S. & Kumar, P. Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage. Environ Sci Pollut Res 30, 75469–75488 (2023). https://doi.org/10.1007/s11356-023-27659-x

  • DOI: https://doi.org/10.1007/s11356-023-27659-x
