Abstract
Over the past years, the health impact of airborne particulate matter \(\mathrm{PM}_{10}\) has become a highly topical subject. Consequently, much research effort in the environmental sciences goes into modeling and predicting ambient \(\mathrm{PM}_{10}\) concentrations. In this paper, we are interested in the statistical classification of the daily mean \(\mathrm{PM}_{10}\) concentration in Tunisia according to the regulation set by the authorities. We consider two monitoring stations: a large industrial station and a traffic station. The main goal of this work is to determine the pertinent predictors of \(\mathrm{PM}_{10}\) concentration within a nonlinear multiclass framework. To this end, we use two popular statistical learning methods: support vector machines (SVM) and random forests (RF). The results obtained on the real datasets show that RF outperforms SVM for variable selection, even when the number of observations is small compared to the number of explanatory variables. We also show that the \(\mathrm{PM}_{10}\) concentration measured on the previous day is the most relevant predictor of the present-day value. Moreover, we find that values of \(\mathrm{PM}_{10}\) concentration at longer lags may be crucial for accurate prediction.
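As an illustration of the setting described in the abstract, the following minimal Python sketch shows one way to turn a daily mean \(\mathrm{PM}_{10}\) series into a multiclass problem with lagged predictors. It is not the authors' code: the class boundaries in `THRESHOLDS`, the helper names `pm10_class` and `lagged_dataset`, and the sample values are all hypothetical, since the abstract does not state the regulatory limits or the data.

```python
from bisect import bisect_left

# Hypothetical class boundaries in µg/m³. The regulatory limits used in the
# paper are not given in the abstract, so these values are placeholders.
THRESHOLDS = [50.0, 80.0]


def pm10_class(value, thresholds=THRESHOLDS):
    """Return the 0-based class index of a daily mean PM10 value.

    A value equal to a boundary falls in the lower class (an assumption).
    """
    return bisect_left(thresholds, value)


def lagged_dataset(series, max_lag):
    """Pair each day's class label with the PM10 values of the previous days.

    The row for day t holds [series[t-1], ..., series[t-max_lag]], and its
    label is the class of series[t].
    """
    rows, labels = [], []
    for t in range(max_lag, len(series)):
        rows.append([series[t - k] for k in range(1, max_lag + 1)])
        labels.append(pm10_class(series[t]))
    return rows, labels


# Made-up daily means, for illustration only.
daily_pm10 = [42.0, 55.0, 61.0, 90.0, 78.0, 47.0]
X, y = lagged_dataset(daily_pm10, max_lag=2)
```

The rows in `X` and the labels in `y` could then be passed to any multiclass classifier, such as an SVM or a random forest, whose variable importance scores would indicate which lags matter most.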
Acknowledgments
The authors are extremely grateful to the editor and the anonymous reviewers for their insightful comments and valuable suggestions, which contributed much to improving the manuscript.
Additional information
Handling Editor: Pierre Dutilleul.
Cite this article
Ben Ishak, A., Moslah, Z. & Trabelsi, A. Analysis and prediction of \(\mathrm{PM}_{10}\) concentration levels in Tunisia using statistical learning approaches. Environ Ecol Stat 23, 469–490 (2016). https://doi.org/10.1007/s10651-016-0349-8
DOI: https://doi.org/10.1007/s10651-016-0349-8
Keywords
- Nonlinear multiclass classification
- Particulate matter \(\mathrm{PM}_{10}\)
- Random forests
- Stepwise algorithm
- Support vector machines
- Variable selection