Characterizing Air-Quality Data Through Unsupervised Analytics Methods

  • Elena Daraio
  • Evelina Di Corso
  • Tania Cerquitelli
  • Silvia Chiusano
Conference paper
Part of the Communications in Computer and Information Science book series (CCIS, volume 909)


Several cities have built on-the-ground air quality monitoring stations to measure daily concentrations of air pollutants, such as \(\textit{PM}_{10}\) and \(\textit{NO}_{2}\). Identifying the causes of air pollution can help governments decide how to mitigate it and how to prioritize recommendations. This paper presents PANDA, a two-level methodology based on unsupervised analytics methods for discovering interesting insights from air-quality data. First, PANDA discovers groups of pollutants that occurred with similar concentrations. Then, each cluster is locally characterized through three forms of human-readable knowledge that expose correlations between air pollution and meteorological conditions at different abstraction levels. As a case study, PANDA has been validated on real pollutant measurements collected in a major Italian city. Preliminary experimental results show that PANDA is effective in discovering cohesive and well-separated groups of similar pollutant concentrations, along with different forms of interpretable correlations between air pollution and weather data.
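
The two-level idea described above can be illustrated with a minimal Python sketch. The abstract does not name the clustering algorithm or the characterization techniques PANDA uses, so plain k-means and a per-cluster mean of wind speed are assumptions chosen for brevity. The `kmeans` helper, the variable names, and all data values are hypothetical and are not taken from the paper.

```python
from statistics import mean


def kmeans(points, k, iters=50):
    """Plain Lloyd's k-means (an assumed stand-in, not necessarily PANDA's method).

    The first k points seed the centroids, which keeps the run deterministic.
    """
    centroids = [list(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, new_labels) if lab == c]
            if members:
                centroids[c] = [mean(dim) for dim in zip(*members)]
        if new_labels == labels:  # assignments stable, so the run has converged
            break
        labels = new_labels
    return labels, centroids


# Invented daily records: (PM10 in µg/m³, NO2 in µg/m³, wind speed in m/s).
days = [
    (20.0, 25.0, 4.5),
    (80.0, 70.0, 1.0),
    (22.0, 30.0, 5.0),
    (85.0, 75.0, 0.8),
    (18.0, 28.0, 4.0),
    (90.0, 65.0, 1.2),
]

# Level 1: group days by pollutant concentrations only.
pollutants = [(pm10, no2) for pm10, no2, _ in days]
labels, centroids = kmeans(pollutants, k=2)

# Level 2: describe each cluster with a weather summary (here, mean wind speed).
wind_by_cluster = {}
for (_, _, wind), label in zip(days, labels):
    wind_by_cluster.setdefault(label, []).append(wind)
profile = {cluster: mean(winds) for cluster, winds in wind_by_cluster.items()}

print(labels)
print(profile)
```

On real measurements the pollutant columns would normally be rescaled before clustering, since distance-based methods are sensitive to differing value ranges; the sketch skips that step because both invented columns share the same unit.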


Data mining, Data exploration, Pollutant data, Meteorological data, Sensor data



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Elena Daraio (1)
  • Evelina Di Corso (1)
  • Tania Cerquitelli (1)
  • Silvia Chiusano (2)
  1. Dipartimento di Automatica e Informatica, Politecnico di Torino, Turin, Italy
  2. Dipartimento Interateneo di Scienze, Progetto e Politiche del Territorio, Politecnico di Torino, Turin, Italy
