Analysis and prediction of \(\mathrm{PM}_{10}\) concentration levels in Tunisia using statistical learning approaches

Abstract

Over the past years, the health impact of airborne particulate matter \(\mathrm{PM}_{10}\) has become a highly topical subject. Consequently, considerable research effort in the environmental sciences is devoted to modeling and predicting ambient \(\mathrm{PM}_{10}\) concentrations. In this paper, we address the statistical classification of the daily mean \(\mathrm{PM}_{10}\) concentration in Tunisia according to the authorities' regulation. We consider two monitoring stations: a large industrial station and a traffic station. The main goal of this work is to determine the relevant predictors of \(\mathrm{PM}_{10}\) concentration within a nonlinear multiclass framework. To this end, we use two popular statistical learning methods: support vector machines (SVM) and random forests (RF). The results obtained on the real datasets show that RF outperform SVM for variable selection, even when the number of observations is small relative to the number of explanatory variables. We also show that the previous day's \(\mathrm{PM}_{10}\) concentration is the most relevant predictor of the present-day value. Moreover, we find that more distant lagged values of \(\mathrm{PM}_{10}\) concentration may be crucial for accurate prediction.
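As an illustration of the setting described above, the following minimal Python sketch shows how a daily mean \(\mathrm{PM}_{10}\) series might be turned into a multiclass problem with lagged concentrations as candidate predictors. It is not the authors' implementation: the class boundaries in `THRESHOLDS`, the helper names `pm10_class` and `lagged_dataset`, and the sample values are all assumptions made for the example, since the abstract gives neither the regulatory thresholds nor the data.

```python
from bisect import bisect_right

# Hypothetical class boundaries in µg/m³. The regulatory thresholds used in
# the paper are not stated in the abstract, so these are placeholders.
THRESHOLDS = [50.0, 80.0]


def pm10_class(daily_mean, thresholds=THRESHOLDS):
    """Map a daily mean concentration to an ordinal class index (0 = lowest).

    A value equal to a threshold is assigned to the upper class.
    """
    return bisect_right(thresholds, daily_mean)


def lagged_dataset(series, max_lag):
    """Build (X, y) from a daily series.

    X[i] holds the max_lag previous daily means, most recent first
    (lag 1, lag 2, ...); y[i] is the class of the current day.
    """
    X, y = [], []
    for t in range(max_lag, len(series)):
        X.append([series[t - k] for k in range(1, max_lag + 1)])
        y.append(pm10_class(series[t]))
    return X, y


# Made-up daily means, for illustration only.
series = [42.0, 55.0, 61.0, 90.0, 48.0, 79.5]
X, y = lagged_dataset(series, max_lag=2)
```

The resulting `X` and `y` have the shape expected by common SVM or random forest classifiers, whose variable-importance measures could then be used to rank the lags. In practice, meteorological covariates would presumably be added as further columns of `X`.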

Acknowledgments

The authors are extremely grateful to the editor and the anonymous reviewers for their insightful comments and valuable suggestions, which contributed much to improving the manuscript.

Author information

Corresponding author

Correspondence to Anis Ben Ishak.

Additional information

Handling Editor: Pierre Dutilleul.

About this article

Cite this article

Ben Ishak, A., Moslah, Z. & Trabelsi, A. Analysis and prediction of \(\mathrm{PM}_{10}\) concentration levels in Tunisia using statistical learning approaches. Environ Ecol Stat 23, 469–490 (2016). https://doi.org/10.1007/s10651-016-0349-8

  • DOI: https://doi.org/10.1007/s10651-016-0349-8

Keywords

  • Nonlinear multiclass classification
  • Particulate matter \(\mathrm{PM}_{10}\)
  • Random forests
  • Stepwise algorithm
  • Support vector machines
  • Variable selection