Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage

K, Priti; Shakya, Kaushlesh Singh; Kumar, Prashant

doi:10.1007/s11356-023-27659-x

Research Article
Published: 23 May 2023

Volume 30, pages 75469–75488, (2023)
Environmental Science and Pollution Research

Abstract

Monitoring air contaminants has become essential to exposure science, toxicology, and public health research. However, missing values are common in air contaminant monitoring, especially in resource-constrained settings, owing to events such as power cuts, calibration, and sensor failure. Evaluations of existing imputation techniques for handling recurrent periods of missing and unobserved data in contaminant monitoring are limited. The present study performs a statistical evaluation of six univariate and four multivariate time series imputation methods. The univariate methods are based on inter-time correlation characteristics, whereas the multivariate methods use data from multiple sites to impute missing values. The study retrieved data from 38 ground-based monitoring stations for particulate pollutants in Delhi for 4 years. For the univariate methods, missing values were simulated at low missing levels in the 0–20% range (5%, 10%, 15%, and 20%) and at high missing levels (40%, 60%, and 80%) with long gaps. Before the multivariate methods were evaluated, the input data underwent pre-processing steps: selecting the target station to be imputed, choosing covariates based on the spatial correlation between multiple sites, and framing combinations of target and neighbouring stations (covariates) under 20%, 40%, 60%, and 80% missing levels. Next, 1480 days of particulate pollutant data were provided as input to the four multivariate techniques. Finally, the performance of each algorithm was evaluated using error metrics. The results show that long-interval time series data and the spatial correlation of multiple stations significantly improved outcomes for the univariate and multivariate time series methods. The univariate Kalman-ARIMA method performs well for long missing gaps and at all missing levels (except 60–80%), yielding low error and high R² and d values. In contrast, multivariate MIPCA performed better than Kalman-ARIMA for all target stations at the highest missing percentage.
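The evaluation design described in the abstract (mask known values in long gaps, impute, then score the imputed values against the held-out truth) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the series is synthetic rather than the Delhi station data, the 30-day gap length and all function names are hypothetical, linear interpolation stands in as a simple baseline for the Kalman-ARIMA and MIPCA methods, R² is taken as the squared Pearson correlation, and d is assumed to be Willmott's index of agreement.

```python
import numpy as np


def simulate_long_gaps(series, missing_frac, gap_len, rng):
    """Return a copy of `series` with contiguous NaN gaps covering `missing_frac`.

    The first and last points are never masked, so a baseline interpolator
    always has observed anchors on both sides.
    """
    out = series.astype(float)
    n = out.size
    target = int(round(missing_frac * n))
    while int(np.isnan(out).sum()) < target:
        remaining = target - int(np.isnan(out).sum())
        length = min(gap_len, remaining)
        start = rng.integers(1, n - length)  # gap stays within indices 1..n-2
        out[start:start + length] = np.nan
    return out


def interpolate_linear(series):
    """Baseline imputation (an assumption, not a method from the study)."""
    out = series.copy()
    idx = np.arange(out.size)
    missing = np.isnan(out)
    out[missing] = np.interp(idx[missing], idx[~missing], out[~missing])
    return out


def error_metrics(observed, predicted):
    """RMSE, MAE, squared Pearson correlation, and Willmott's d."""
    err = predicted - observed
    obs_mean = observed.mean()
    denom = np.sum((np.abs(predicted - obs_mean) + np.abs(observed - obs_mean)) ** 2)
    return {
        "rmse": float(np.sqrt(np.mean(err ** 2))),
        "mae": float(np.mean(np.abs(err))),
        "r2": float(np.corrcoef(observed, predicted)[0, 1] ** 2),
        "d": float(1.0 - np.sum(err ** 2) / denom),
    }


# Synthetic daily series: seasonal cycle plus noise (illustrative only).
rng = np.random.default_rng(0)
days = np.arange(1480)
truth = 100 + 40 * np.sin(2 * np.pi * days / 365) + rng.normal(0, 10, days.size)

for frac in (0.2, 0.4, 0.6, 0.8):
    gappy = simulate_long_gaps(truth, frac, gap_len=30, rng=rng)
    mask = np.isnan(gappy)
    filled = interpolate_linear(gappy)
    scores = error_metrics(truth[mask], filled[mask])
    print(f"{frac:.0%} missing:", {k: round(v, 3) for k, v in scores.items()})
```

Scoring only the masked positions keeps the metrics focused on imputed values; including observed points would inflate every score, since those points are reproduced exactly.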

Data availability

Air pollutant data are freely available on the website of the Central Pollution Control Board, India. Any questions or inquiries pertaining to the type of data or the attributes of the datasets can be directed to the corresponding author upon reasonable request.

Code availability

NA

Acknowledgements

The authors are thankful to CSIR-CSIO Chandigarh for providing the necessary infrastructure and support to carry out this work, and especially to the Central Pollution Control Board for making air pollutant data openly available on its official website.

Author information

Authors and Affiliations

Academy of Scientific & Innovative Research (AcSIR), Ghaziabad, 201002, India
Priti K, Kaushlesh Singh Shakya & Prashant Kumar
CSIR-Central Scientific Instruments Organisation, Sector 30-C, Chandigarh, 160030, India
Priti K, Kaushlesh Singh Shakya & Prashant Kumar

Contributions

All authors have contributed substantially to the following: Priti K: writing (original draft), conceptualization, data curation, formal analysis, and methodology. Kaushlesh Singh Shakya: data curation, formal analysis, and visualization. Prashant Kumar: supervision, validation, and review and editing.

Corresponding author

Correspondence to Prashant Kumar.

Ethics declarations

Ethics approval

NA

Consent to participate

NA

Consent for publication

NA

Conflict of interest

The authors declare no competing interests.

Additional information

Responsible Editor: Marcus Schulz

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

ESM 1

(DOCX 68.8 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

K, P., Shakya, K.S. & Kumar, P. Selection of statistical technique for imputation of single site-univariate and multisite–multivariate methods for particulate pollutants time series data with long gaps and high missing percentage. Environ Sci Pollut Res 30, 75469–75488 (2023). https://doi.org/10.1007/s11356-023-27659-x

Received: 02 January 2023
Accepted: 11 May 2023
Published: 23 May 2023
Issue Date: June 2023
DOI: https://doi.org/10.1007/s11356-023-27659-x
