EDA and a Tailored Data Imputation Algorithm for Daily Ozone Concentrations

  • Ronald Gualán
  • Víctor Saquicela
  • Long Tran-Thanh
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 884)


Air pollution is a critical environmental problem with detrimental effects on human health. It affects all regions of the world, especially low-income cities, where critical levels have been reached. Air pollution has a direct impact on public health, climate change, and the worldwide economy. Effective actions to mitigate air pollution, e.g. research and decision making, require the availability of high-resolution observations. This has motivated the emergence of new low-cost sensor technologies, which have the potential to provide high-resolution data thanks to their accessible prices. However, since low-cost sensors are built with relatively low-cost materials, they tend to be unreliable: their measurements are prone to errors, gaps, bias and noise. All these problems need to be solved before the data can be used to support research or decision making. In this paper, we address the problem of data imputation on a daily air pollution data set with relatively small gaps. Our main contributions are: (1) an air pollution data set composed of several air pollution concentrations including criteria gases and thirteen meteorological covariates; and (2) a custom algorithm for data imputation of daily ozone concentrations based on a trend surface and a Gaussian process. Data visualization techniques were used extensively throughout this work, as they are useful tools for understanding the multi-dimensionality of point-referenced sensor data.
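The abstract does not spell out the imputation algorithm, so the following is only a minimal sketch of the general idea it names: fit a trend surface to the observed stations, then model the residuals with a Gaussian process. It is not the authors' implementation. The function names, the first-order (planar) trend, the squared-exponential kernel, and the fixed hyperparameters `length_scale`, `signal_var` and `noise_var` are all assumptions made for illustration, and the sketch treats a single day of readings across stations.

```python
import numpy as np


def trend_design(coords):
    """Design matrix of a first-order trend surface: columns [1, x, y]."""
    return np.column_stack([np.ones(len(coords)), coords])


def rbf_kernel(a, b, length_scale, signal_var):
    """Squared-exponential covariance between two sets of 2-D locations."""
    sq_dist = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return signal_var * np.exp(-0.5 * sq_dist / length_scale**2)


def impute_trend_gp(coords, values, length_scale=1.0, signal_var=1.0,
                    noise_var=1e-2):
    """Fill NaN entries of `values` observed at station locations `coords`.

    Hypothetical helper: trend surface by least squares, plus the posterior
    mean of a zero-mean Gaussian process fitted to the trend residuals.
    """
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    missing = np.isnan(values)
    filled = values.copy()
    if not missing.any():
        return filled

    obs_xy, mis_xy = coords[~missing], coords[missing]
    obs_z = values[~missing]

    # Step 1: least-squares trend surface fitted to the observed stations.
    beta, *_ = np.linalg.lstsq(trend_design(obs_xy), obs_z, rcond=None)
    resid = obs_z - trend_design(obs_xy) @ beta

    # Step 2: GP posterior mean of the residuals at the missing stations.
    k_obs = rbf_kernel(obs_xy, obs_xy, length_scale, signal_var)
    k_obs += noise_var * np.eye(len(obs_xy))
    k_star = rbf_kernel(mis_xy, obs_xy, length_scale, signal_var)
    gp_mean = k_star @ np.linalg.solve(k_obs, resid)

    # Step 3: imputed value = trend prediction + predicted residual.
    filled[missing] = trend_design(mis_xy) @ beta + gp_mean
    return filled
```

In practice the kernel hyperparameters would be estimated from the data rather than fixed, and a spatio-temporal version would also use neighbouring days and the meteorological covariates; the sketch leaves both out to keep the two-stage structure visible.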


Air pollution; Sensor data; Data imputation; Gaussian process



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Ronald Gualán
    • 1
    • 2
  • Víctor Saquicela
    • 1
  • Long Tran-Thanh
    • 2
  1. Department of Computer Science, University of Cuenca, Cuenca, Ecuador
  2. School of Electronics and Computer Science, University of Southampton, Southampton, UK
