Revisiting urban air quality forecasting: a regression approach

Karatzas, Kostas; Katsifarakis, Nikos; Orlowski, Cezary; Sarzyński, Arkadiusz

doi:10.1007/s40595-018-0113-0

Regular Paper
Open access
Published: 24 May 2018

Volume 5, pages 177–184, (2018)

Vietnam Journal of Computer Science

Kostas Karatzas ORCID: orcid.org/0000-0002-1033-5985¹,
Nikos Katsifarakis¹,
Cezary Orlowski² &
…
Arkadiusz Sarzyński³

Abstract

We address air quality (AQ) forecasting as a regression problem employing computational intelligence (CI) methods for the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece. Linear Regression as well as Artificial Neural Network models are developed, accompanied by Random Forest models, for five locations per study area and for a dataset of limited feature dimensionality. An ensemble approach is also used for generating and testing AQ forecasting models. Results indicate good model performance with a correlation coefficient between forecasts and measurements for the daily mean $\hbox {PM}_{10}$ concentration one day in advance reaching 0.765 for one of the TMA locations and 0.64 for one of the GMA locations. Overall results suggest that the specific modelling approach can support the provision of air quality forecasts on the basis of limited feature space dimensionality and by employing simple linear regression models.

1 Introduction

In a recently published paper [1] we underlined the importance of air quality (AQ) forecasting in urban environmental management as well as in contemporary smart city development [2, 3]. In the current paper we revisit and extend our initial approach, focusing on urban AQ forecasting from the regression point of view and incorporating an ensemble modelling approach. In doing so, we take into account that environmental management plays an important role in smart city information systems [4] and that air pollution abatement is one of its main targets [5]. AQ forecasting is among the main pillars of AQ management [6] and is realized with the aid of appropriate AQ models. Such models establish a time-varying relationship between the concentration of air pollutants at a specific time and location ${{\varvec{c}}}( {t,{{\varvec{x}}}})$ and other parameters ${{\varvec{p}}}({t,{{\varvec{x}}}})$ affecting the urban atmospheric environment. This relationship may be expressed with the aid of the following general function:

$$\begin{aligned} {{\varvec{c}}}( t,{{\varvec{x}}} )=f({{\varvec{p}}}( t,{{\varvec{x}}} ) ) \end{aligned}$$

(1)

Here t represents time and ${{\varvec{x}}}$ is the location vector corresponding to physical space. In this case the vector ${{\varvec{c}}}(t,{{\varvec{x}}})$ refers to concentration values of air pollutants like Nitrogen Dioxide ($\hbox {NO}_{2}$), Carbon Monoxide (CO), Ozone ($\hbox {O}_{3}$) and Particulate Matter (PM), while ${{\varvec{p}}}( {t,{{\varvec{x}}}} )$ includes parameters like wind speed, wind direction, air temperature, solar radiation, air pollutant emissions, air pollutant concentrations, land use type, land surface height, etc. The nature of function f is dictated by the model type employed. If f reconstructs the physical and chemical relationships between the parameters ${{\varvec{p}}}( {t,{{\varvec{x}}}} )$ and values ${{\varvec{c}}}(t,{{\varvec{x}}})$, where ${{\varvec{x}}}$ addresses the whole area of interest in a 3-D gridded manner, then models are said to follow an analytic-deterministic approach [7]. If f is a statistical or data-mining oriented function, then models are said to follow a data-driven approach (as reported in [8] and in references therein). In the latter case, ${{\varvec{x}}}$ refers to specific locations within the studied area, which usually correspond to AQ measuring station locations. Thus, ${{\varvec{x}}}$ does not vary with time and is dropped, leading to an equation of the form:

$$\begin{aligned} {{\varvec{c}}}( t )=f( {{{\varvec{p}}}( t )} ) \end{aligned}$$

(2)

The objective of this paper is to suggest CI-based, ensemble-oriented models that are able to extract as much information as possible from atmospheric quality data of low dimensionality, and thus to contribute to the scientific area of urban AQ forecasting. For this reason we employ a variety of CI methods and we suggest and test ensemble functions f in Eqs. (1) and (2). The geographic areas of interest are the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece, and the parameter of interest is the daily concentration of Particulate Matter with a mean aerodynamic diameter of $10~\upmu \hbox {m}$ ($\hbox {PM}_{10}$), approx. $1/5^{\mathrm{th}}$ of the diameter of a human hair. This pollutant is able to penetrate into the bronchial part of the human lung system [9] and is one of the most important pollutants in the GMA [10] as well as in the TMA [11]. Air pollutant concentrations are addressed as numerical values. AQ forecasting follows a twofold approach:

a) Each AQ monitoring station is treated individually, i.e. AQ models are developed and tested per station location. Thus, the forecasting of the parameter of interest is performed as a regression problem.
b) Regression models are created based on ensemble modelling principles, and are evaluated via their ability to forecast AQ levels at different locations (i.e. at each monitoring station).

The mean daily concentration level of $\hbox {PM}_{10}$ one day in advance is the target of the forecasting models under development. This choice corresponds to the requirement, posed by relevant legislation, that citizens as well as decision makers be informed about the expected $\hbox {PM}_{10}$ levels for the next day, which should not exceed $50~\upmu \hbox {g/m}^{3}$ on more than 35 days per year according to the European Regulations [9, 12] and the World Health Organization guidelines [13]. Combustion processes, traffic and natural sources directly emit $\hbox {PM}_{10}$, while in some regions the mechanical degradation of the road surface and of winter tires also contributes to its production. $\hbox {PM}_{10}$ particles are part of the inhalable fraction of PM and have adverse effects on human health [9].

The research question posed in the current paper moves one step ahead of our previously published results [1] and addresses (a) the ability of a low dimensionality feature space (small number of input parameters) to support effective data-driven models for $\hbox {PM}_{10}$ forecasting and (b) the modelling approach to be used in terms of algorithms and their setup (single vs. ensemble oriented models). In addition, we make use of an ensemble approach based on an ANN model of simple architecture which can be applied to multiple geographic areas, thus simplifying the ensemble approach suggested by [14] and [15], while maintaining a performance comparable to the one reported by similar studies [16], and therefore providing a novel approach to the problem at hand.

In the rest of the paper we first present the materials of our study (Chapter 2), followed by the computational methods (Chapter 3). We then present and discuss the results in Chapter 4 and draw our conclusions in Chapter 5.

2 Materials: area of study and data made available

The areas of study as well as the AQ problem addressed have been the focus of multiple studies performed in the past.

In the case of Gdańsk, ANNs have been employed for AQ forecasting in [17]. The same data set has been used for $\hbox {PM}_{10}$ forecasting in [18] as well as for the adaptation of an AQ forecasting model developed for Gdańsk to the Thessaloniki area [19].

The air pollution of Thessaloniki has been studied and modeled with the aid of ANNs [20], with special emphasis on $\hbox {PM}_{10}$ [21]. The similarity of the GMA and the TMA in terms of population and the presence of a sea front suggests that there might also be a similarity in the way that $\hbox {PM}_{10}$ oriented air pollution can be modeled in both areas. Moreover, the need to construct data-driven models which use a small number of input parameters suggested that a generalized, ensemble-based approach should be employed for the AQ modeling in both areas of interest, these being the novelty points of the research results at hand.

2.1 The two areas of interest

The city of Gdańsk is located on the Baltic coast in the south-west of the bay of Gdańsk, in the northern part of Poland. It is the capital of a tri-city metropolitan area merging with Gdynia (known for its shipyards) and Sopot (a recreational resort); taking suburban communities into account, the GMA has more than 1,000,000 residents. The economy in Gdańsk is dominated by shipbuilding, petrochemicals and chemical industries, which are all concentrated quite close to the city center. The majority of air pollutant emissions originate from the industrial sector, the port activities and the city traffic [22], while the most important pollutants are $\hbox {PM}_{10}$, $\hbox {NO}_{2}$ and $\hbox {SO}_{2}$ (http://www.airqualitynow.eu).

The city of Thessaloniki faces an oval harbor bay and stands on a rising ground at the heart of a long gulf which is formed by the peninsula of Chalcidice. Various municipalities surround the city while an industrial zone is located in the north-west of its outskirts. The TMA is the second largest urban agglomeration in Greece accounting for more than 1,000,000 inhabitants, with a considerable accumulation of urban traffic as well as industrial activities. The TMA is characterized by high pollution levels especially related to $\hbox {PM}_{10}$ while $\hbox {O}_{3}$ appears to be high in suburban locations of the area and $\hbox {NO}_{2}$ levels are still high in dense urban areas in association with traffic [11].

2.2 The atmospheric quality data

In both the GMA and the TMA a number of AQ monitoring stations operate (9 and 17 respectively), which routinely record concentration values of basic pollutants as well as the variation of meteorological parameters. As not all pollutants are recorded at all stations, and in order to focus on the pollutant of interest ($\hbox {PM}_{10}$), we decided to select five stations from each area of interest (included in Table 1) that were able to provide $\hbox {PM}_{10}$ concentrations as well as meteorological data, in order to come up with data sets that are identical in terms of the parameters they include. In order to deal with the non-negligible frequency of missing data, we selected data from the year 2013 which contained only daily $\hbox {PM}_{10}$ concentrations as well as information for air temperature and relative humidity.

Table 1 The Air Quality monitoring stations used for the current study in GMA and TMA

As a result and for each station, the same atmospheric parameters were used for the modelling and forecasting process: the model input or feature vector ${{\varvec{x}}}$ included five parameters, namely $\hbox {PM}_{10}$ concentration of the current day as well as temperature and relative humidity of the current day, complemented by the day and the month of the year. The target parameter to be forecasted ${{\varvec{y}}}$ was the $\hbox {PM}_{10}$ concentration of the next day. A summary of the basic statistical characteristics of the parameters involved in our study is included in Table 2.
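The pairing of a five-element feature vector with the next day's $\hbox {PM}_{10}$ value can be sketched as follows. This is only an illustration: the function name `build_dataset`, the input values, the reading of "day" as day of the month, and the assumption of a gap-free sequence of consecutive days are all hypothetical and not taken from the study's data.

```python
import numpy as np

def build_dataset(pm10, temp, rh, day, month):
    """Pair each day's five features with the PM10 value of the following day."""
    pm10 = np.asarray(pm10, dtype=float)
    # Columns: PM10(t), temperature(t), relative humidity(t), day, month.
    X = np.column_stack([pm10, temp, rh, day, month]).astype(float)[:-1]
    # Target: PM10(t+1); the last day has no target and is dropped from X.
    y = pm10[1:]
    return X, y

# Hypothetical four-day record (values invented for illustration only).
pm10 = [30.0, 42.0, 38.0, 55.0]
temp = [5.0, 6.5, 4.0, 3.0]
rh = [80.0, 75.0, 85.0, 90.0]
day = [1, 2, 3, 4]
month = [1, 1, 1, 1]

X, y = build_dataset(pm10, temp, rh, day, month)
# X has shape (3, 5); y holds the next-day values 42, 38 and 55.
```

With real station records, days with missing values would have to be handled before this shift is applied, since a gap breaks the "next day" pairing.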

Table 2 Basic statistics for the AQ and meteorological parameters available for each station at GMA and TMA

3 Computational methods

The forecasting of the numerical value of $\hbox {PM}_{10}$ concentration levels for the next day was the goal set for the development of relevant forecasting models. For this reason, we made use of the available datasets for each AQ monitoring station to develop individual (per station) AQ forecasting models.

3.1 Algorithms for single station model creation

The algorithms applied were selected based on computational experiments employing various CI methods, which were conducted with the aid of Matlab (www.mathworks.com) as well as of the WEKA computational environment [23]. On this basis, we chose the following three algorithms as the basis for AQ model development:

(i) Linear Regression (LR). Here the relationship between the forecasted parameter and the input parameters is described by an equation of the form:
$$\begin{aligned} {{\varvec{y}}}={{\varvec{x}}}\cdot {\varvec{\beta }} +{\varvec{\varepsilon }} \end{aligned}$$
(3)

where ${{\varvec{x}}}$ is the input vector, ${\varvec{\beta }}$ is the slope vector and ${\varvec{\varepsilon }}$ the error vector. The slope vector is commonly calculated via the least squares method, thus:

$$\begin{aligned} {\hat{{\varvec{\beta }}}} =({{\varvec{x}}}'\cdot {{\varvec{x}}})^{-1}\cdot {{\varvec{x}}}'\cdot {{\varvec{y}}} \end{aligned}$$

(4)
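A minimal sketch of the least squares estimate of Eq. (4) is given below, assuming NumPy. Two choices here are assumptions rather than details from the study: a constant column is appended so that the fit includes an intercept, and `np.linalg.lstsq` is used instead of forming the inverse in Eq. (4) explicitly, which solves the same problem in a numerically safer way. The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_lr(X, y):
    """Least squares slope vector for y = [X, 1] @ beta."""
    X_aug = np.column_stack([X, np.ones(len(X))])  # assumed intercept column
    beta, *_ = np.linalg.lstsq(X_aug, y, rcond=None)
    return beta

def predict_lr(X, beta):
    """Apply a fitted slope vector to new inputs."""
    return np.column_stack([X, np.ones(len(X))]) @ beta

# Synthetic, noise-free example: the generating coefficients are recovered.
rng = np.random.default_rng(0)
X = rng.random((20, 2))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 3.0
beta = fit_lr(X, y)  # close to [2, -1, 3]
```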

(ii) Artificial Neural Networks (ANNs). In ANNs the input vector ${{\varvec{x}}}$ for each neuron k is weighted with the aid of a weighting vector ${{\varvec{w}}}_k $, and the result is summed (taking into account any bias) and then fed into a transfer function f to produce the overall output vector ${{\varvec{y}}}_k $:
$$\begin{aligned} {{\varvec{y}}}_k =f( {{{\varvec{w}}}_k^T \cdot {{\varvec{x}}}} ) \end{aligned}$$
(5)

The training of the ANN aims at reducing the error ${{\varvec{e}}}_k$ between the model output ${{\varvec{y}}}_k$ and the actual (real) value observed ${{\varvec{d}}}_k$, which here is the $\hbox {PM}_{10}$ concentration of the next day for each station.

$$\begin{aligned} {{\varvec{e}}}_k =\Vert {{\varvec{y}}}_k -{{\varvec{d}}}_k\Vert \end{aligned}$$

(6)

This error reduction is based on a number of methods, all of which aim at recalculating the initial weights so that the overall network error is minimized. In the case of the gradient descent method (which is the simplest of all but nevertheless representative of the way that the weights are recalculated), the relationship between the updated and the initial weighting vector for all neurons k of the ANN is given by:

$$\begin{aligned} {{\varvec{w}}}( {t+1} )={{\varvec{w}}}( t )-a( t ){{\varvec{g}}}( t ) \end{aligned}$$

(7)

Here t and $t+1$ index the weights before and after the update, $a( t )$ is the learning rate, and the gradient term ${{\varvec{g}}}( t )$ is given by:

$$\begin{aligned} {{\varvec{g}}}( t )={{\varvec{J}}}^{T}( t )\cdot {{\varvec{e}}}( t ) \end{aligned}$$

(8)

where ${{\varvec{J}}}^{T}$ is the (transposed) Jacobian and ${{\varvec{e}}}( t )$ is the overall error vector [1].

In this specific case a MultiLayer Perceptron Network with a feed-forward architecture and a back propagation training method was used, with an input layer consisting of 5 nodes (i.e. all the input parameters per station), an output layer consisting of only one node (the $\hbox {PM}_{10}$ concentration of the next day) and a hidden layer with 10 nodes. The sigmoid function is employed as the transfer function while the gradient descent algorithm is used for minimizing the error function.
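The 5-10-1 architecture and the weight update of Eq. (7) can be illustrated with a small NumPy sketch. Several details are assumptions made only for this illustration: a linear output node, a fixed learning rate, full-batch updates on a mean squared error loss, inputs scaled to [0, 1], and synthetic data. None of these is confirmed by the description above, and the function names are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(params, X):
    """Return hidden activations and network output for inputs X."""
    W1, b1, W2, b2 = params
    H = sigmoid(X @ W1 + b1)      # hidden layer, sigmoid transfer function
    return H, H @ W2 + b2         # assumed linear output node

def train_mlp(X, y, hidden=10, lr=0.1, epochs=2000, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    params = [rng.normal(0.0, 0.5, (d, hidden)), np.zeros(hidden),
              rng.normal(0.0, 0.5, hidden), 0.0]
    losses = []
    for _ in range(epochs):
        W2 = params[2]
        H, y_hat = forward(params, X)
        err = y_hat - y                       # error vector e(t)
        losses.append(float(np.mean(err ** 2)))
        # Back propagation: gradients of 0.5 * mean(err**2).
        dZ = np.outer(err, W2) * H * (1.0 - H) / n
        grads = [X.T @ dZ, dZ.sum(axis=0), H.T @ err / n, float(err.mean())]
        # Gradient descent step, w(t+1) = w(t) - a * g(t).
        params = [p - lr * g for p, g in zip(params, grads)]
    return params, losses

# Synthetic data with five inputs scaled to [0, 1] (illustration only).
rng = np.random.default_rng(1)
X = rng.random((60, 5))
y = X @ np.array([0.5, -0.2, 0.1, 0.0, 0.3]) + 0.2
params, losses = train_mlp(X, y)
```

The recorded `losses` make it easy to check that the training error decreases over the epochs, which is the behaviour the update rule is meant to produce.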

(iii) Random Forests (RF), an ensemble method originating from the Decision Tree family of algorithms [24] that has shown high capacity to effectively model atmospheric parameters of interest [1]. The method creates N subsets of the input vector ${{\varvec{x}}}$ using random selection with replacement, each subset containing 2/3 of the initial data, while the remaining data are used to estimate error and variable importance. Then, for each subset, a decision tree is created with an arbitrary number of nodes, where for each node the splitting is based on a (randomly) selected subset of L attributes that optimize a target function (best split criterion). In our case $L={\text {int}}[ {\log _2 ( {{\text {Number of attributes}}} )+1} ]$. Each of the aforementioned random trees had an unlimited number of levels and nodes. The predictions of the individual trees are averaged, and thus the ensemble-based overall prediction of the RF (here the $\hbox {PM}_{10}$ concentration of the next day) is generated. The procedure follows pseudocode based on http://dataaspirant.com/.

The prediction is then made on the basis of the ensemble of results from all the trees generated (by voting in classification problems, or by averaging in the regression setting used here).
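The bagging, random attribute selection and averaging steps described for RF can be sketched in a few lines of NumPy. This is a simplified illustration under stated assumptions, not the WEKA implementation used in the study: each tree is grown on a bootstrap sample of the same size as the data (which leaves roughly one third of the records out of each tree), splits minimise the sum of squared errors, and all function names and the synthetic data are hypothetical.

```python
import numpy as np

def build_tree(X, y, n_try, rng):
    """Grow one unpruned regression tree; each split tries n_try random attributes."""
    if len(y) < 2 or np.all(y == y[0]):
        return float(y.mean())                      # leaf: mean target value
    best = None
    for j in rng.choice(X.shape[1], size=n_try, replace=False):
        for thr in np.unique(X[:, j])[:-1]:         # candidate thresholds
            left = X[:, j] <= thr
            sse = (((y[left] - y[left].mean()) ** 2).sum()
                   + ((y[~left] - y[~left].mean()) ** 2).sum())
            if best is None or sse < best[0]:
                best = (sse, j, thr, left)
    if best is None:                                # no attribute can split
        return float(y.mean())
    _, j, thr, left = best
    return (j, thr,
            build_tree(X[left], y[left], n_try, rng),
            build_tree(X[~left], y[~left], n_try, rng))

def predict_tree(tree, x):
    while isinstance(tree, tuple):                  # descend until a leaf
        j, thr, low, high = tree
        tree = low if x[j] <= thr else high
    return tree

def fit_forest(X, y, n_trees=25, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    n_try = int(np.log2(d) + 1)                     # L = int[log2(attributes) + 1]
    forest = []
    for _ in range(n_trees):
        idx = rng.integers(0, n, size=n)            # sampling with replacement
        forest.append(build_tree(X[idx], y[idx], n_try, rng))
    return forest

def predict_forest(forest, X):
    """Average the predictions of all trees (regression)."""
    return np.array([np.mean([predict_tree(t, x) for t in forest]) for x in X])

# Synthetic data with five attributes (illustration only).
rng = np.random.default_rng(2)
X = rng.random((80, 5))
y = 3.0 * X[:, 0] + rng.normal(0.0, 0.1, 80)
forest = fit_forest(X, y)
pred = predict_forest(forest, X)
```

For five attributes the rule above gives L = 3, so every split considers three randomly chosen attributes.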

3.2 Ensemble models

In addition to the above approach, we investigated the possibility to develop ensemble-based models to be common for all monitoring stations. More specifically:

1. A single ensemble model was created for each one of the two areas of interest, and then applied to all individual AQ monitoring stations for the same area (local ensemble).
2. The ensemble created in one of the geographic areas was applied to each one of the AQ monitoring stations of the other geographic area (foreign ensemble).
3. Both local and foreign ensembles are combined to generate a cross ensemble model, which is then applied to each one of the AQ monitoring stations for both geographic areas of interest.

The aforementioned approach was materialized for both LR and ANN models as follows:

1. Local ensemble: In the case of LR, the parameters of the slope vector ${\varvec{\beta }}$ of the ensemble model were calculated as weighted mean values of the parameters of each one of the individual LR models, and the local ensemble model was then applied to all stations. In the case of the ANN models, the weights of the individual models were used for the calculation of the weighted mean value of the weights of the local ensemble model. In both cases, the weighted means were calculated on the basis of the correlation coefficients of each one of the models participating in the ensemble, as resulting from their application to the monitoring station for which they were developed.
2. Foreign ensemble: the calculation was done exactly as in the case of the local ensemble, yet making use of the foreign individual model slope vectors (for LR) and weights (for ANN) instead of the local individual model characteristics.
3. Cross ensemble: the parameters of the local and the foreign ensemble models were averaged in order to calculate the parameters of the cross ensemble models.
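The three ensemble constructions above can be sketched as follows. The exact weighting formula is not spelled out in the text, so this sketch assumes that each model's weight is its correlation coefficient divided by the sum of the correlation coefficients in the area. The function names and all numbers are hypothetical; for ANN models the same functions would be applied to the flattened weight arrays.

```python
import numpy as np

def weighted_ensemble(params, r):
    """Correlation-weighted mean of per-station parameter vectors."""
    params = np.asarray(params, dtype=float)   # shape (stations, n_params)
    r = np.asarray(r, dtype=float)             # one r value per station model
    return (r[:, None] * params).sum(axis=0) / r.sum()

def cross_ensemble(local, foreign):
    """Plain average of the local and the foreign ensemble parameters."""
    return 0.5 * (np.asarray(local, dtype=float) + np.asarray(foreign, dtype=float))

# Invented slope vectors for two stations in each of two areas.
area_a = weighted_ensemble([[1.0, 0.0], [3.0, 2.0]], r=[0.25, 0.75])
area_b = weighted_ensemble([[0.5, 0.5], [0.5, 0.5]], r=[0.6, 0.4])

local_for_a = area_a                 # applied to the stations of area A
foreign_for_a = area_b               # area B ensemble applied to area A stations
cross = cross_ensemble(local_for_a, foreign_for_a)
# area_a is [2.5, 1.5], area_b is [0.5, 0.5], cross is [1.5, 1.0]
```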

3.3 Model validation

Table 3 Correlation coefficient (r), Mean absolute error (MAE) and Root mean square error (RMSE) for three models per monitoring station concerning the forecast of the mean daily PM10 concentration one day in advance

In order to validate the results of the $\hbox {PM}_{10}$ predictions, it is important to make use of as much of the available data as possible for the training as well as for the testing phase. For this reason we followed a 10-fold cross validation procedure [25] for each one of the individual models developed: we randomly divided the initial dataset into 10 equal subsets. Then 9 of these subsets were used for training the model, while the 10th one was used for testing. This process was repeated 10 times, each time leaving a different subset out of the training phase and using it for the test phase. The overall model results are the mean values of the statistical indices of the 10 models developed. Concerning the ensemble models, these were defined on the basis of the (pre-existing) individual models per algorithm used, and therefore no additional model validation was used.
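The 10-fold procedure described above can be sketched as below. The sketch assumes NumPy, uses a least squares model as a stand-in for any of the three algorithms, and averages only the correlation coefficient across folds (other indices would be averaged in the same way). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def fit_ls(X, y):
    """Stand-in model: least squares with an assumed intercept column."""
    A = np.column_stack([X, np.ones(len(X))])
    return np.linalg.lstsq(A, y, rcond=None)[0]

def predict_ls(beta, X):
    return np.column_stack([X, np.ones(len(X))]) @ beta

def cross_validate(X, y, fit, predict, k=10, seed=0):
    """Mean Pearson r over k folds, each fold used once as the test set."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)   # random split
    r_per_fold = []
    for i, test_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        model = fit(X[train_idx], y[train_idx])
        y_hat = predict(model, X[test_idx])
        r_per_fold.append(np.corrcoef(y_hat, y[test_idx])[0, 1])
    return float(np.mean(r_per_fold)), folds

# Synthetic data (illustration only).
rng = np.random.default_rng(3)
X = rng.random((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0.0, 0.05, 100)
mean_r, folds = cross_validate(X, y, fit_ls, predict_ls)
```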

Model results were evaluated based on the following statistical indices:

(a) Pearson’s correlation coefficient r, which describes the degree of linear relationship between forecasted and real $\hbox {PM}_{10}$ concentration values.
(b) Mean Absolute Error (MAE), which is a measure of the mean absolute distance between forecasted and real values.
(c) Root Mean Squared Error (RMSE), which is the square root of the Mean Square Error and expresses the standard deviation of the differences between forecasted and actual values.
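The three indices can be computed as in the following sketch, assuming NumPy. The function name `evaluate` and the four forecast/observation pairs are invented for illustration.

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Pearson r, MAE and RMSE between observed and forecasted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true
    return {
        "r": float(np.corrcoef(y_true, y_pred)[0, 1]),
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),   # square root of the MSE
    }

# Errors are 2, -2, 3, -1, so MAE = 2.0 and RMSE = sqrt(4.5).
scores = evaluate([10.0, 20.0, 30.0, 40.0], [12.0, 18.0, 33.0, 39.0])
```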

4 Results and discussion

Based on the model calculations performed as described in Chapter 3, the Pearson’s correlation coefficient r accompanied by the Mean Absolute Error and the Root Mean Squared Error were calculated for the three models developed and for each one of the ten AQ monitoring stations for which data were available (Table 3).

Results suggest that the algorithm leading to the best (highest) correlation coefficient between forecasted and monitored values is LR, with an r ranging from 0.406 for station AM3 up to 0.641 for station AM4 for the GMA. Concerning the TMA, LR is again the best algorithm in terms of the highest correlation coefficient achieved, with an r value ranging from 0.72 for the Eptapyrgiou station up to 0.742 for the Malakopi station. The RF algorithm can be ranked 2nd, achieving correlation coefficients very close to the ones obtained with LR (and surpassing it for the Eptapyrgiou station), while in some cases leading to the best (lowest) MAE (as in the AM3, Martiou and Eptapyrgiou stations) and to the best (lowest) RMSE (as in the AM3 and Eptapyrgiou stations). LR is a simple linear algorithm, generally considered weak in depicting nonlinear phenomena like the ones involved in AQ problems, and usually performing more poorly than algorithms like ANNs or RF [1]. Its success in our case has to do with the limited number of atmospheric quality parameters available in all studied areas and stations (low number of features), which leads to the (possible) exclusion of nonlinear dependencies from the available dataset and makes persistence the main mechanism affecting the forecast of $\hbox {PM}_{10}$ levels one day in advance [26].

Table 4 Results of the local, foreign and cross ensemble models for the ANN and LR algorithms in all stations of the GMA and TMA

In the case of the ensemble approach used (local, foreign and cross ensembles), the results of the two algorithms employed (LR and ANN) are presented in Table 4. The optimum ensemble approach is selected on the basis of the highest correlation coefficient achieved, taken in parallel with the lowest possible error metric values (MAE and RMSE). On this basis the local ensemble achieves the best results, followed by the cross ensemble and leaving the foreign ensemble last. The result may be attributed to the ability of the local ensemble to better represent the dependencies between the modelled parameter (mean daily $\hbox {PM}_{10}$ concentration for the next day) and the parameters of the feature space (input parameters). In terms of algorithms employed, LR is always better in comparison to ANNs. Concerning the areas of study, r values range from 0.505 (station AM2) up to 0.64 (station AM4) for the GMA, and from 0.710 (station Egnatia) up to 0.765 (station Malakopi) for the TMA. The value range of the correlation coefficient achieved for the TMA corresponds to a value range of the coefficient of determination (which is the correlation coefficient squared) between 0.504 (for Egnatia) and 0.585 (for Malakopi). These values are better than those reported for the TMA, albeit for two different stations, by [27] and [28].
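For reference, the conversion from the correlation coefficient to the coefficient of determination quoted for the two TMA stations is simply:

$$\begin{aligned} r=0.710 \;\Rightarrow \; r^{2}=0.710^{2}\approx 0.504, \qquad r=0.765 \;\Rightarrow \; r^{2}=0.765^{2}\approx 0.585 \end{aligned}$$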

By comparing ensembles with the local models, it is evident that in the case of LR-based models, the local ensemble provides better performance than the local models for all GMA stations with the exception of AM4, while in the case of the TMA local ensembles outperform local models for three out of five stations (Lagkada, Eptapyrgiou and Malakopi). In the case of ANN models, the local ensemble and the local models perform almost equally in terms of correlation coefficient values achieved.

5 Conclusions

In this paper, we address the problem of air quality forecasting for two different geographical areas of interest, the GMA and the TMA, by employing a regression approach, making use of a feature space of limited dimension, and targeting the forecast of the mean daily $\hbox {PM}_{10}$ concentration of the next day. We initially develop location-specific models by employing ANNs, LR and RF, achieving correlation coefficients between 0.406 and 0.641 for the GMA stations, and between 0.693 and 0.742 for the TMA stations. The best performance was provided by the LR models, followed by the RF and the ANN models. In addition, we developed and tested three types of ensemble models per area, namely the local, the foreign and the cross ensemble models. Their application proved the local ensemble models to be superior for both the ANN and LR algorithms. These results indicate that even when the feature space is of limited dimensionality, the best individual model outperforms the common model for all the monitoring stations, making use of the ensemble principle, and employing the recalculation of weights in a simple LR model. This suggests that city authorities may develop effective AQ models by targeting their investment in AQ monitoring to the parameters of interest, a vast feature space not being necessary for the success of the modelling approach.

In terms of geographic area of interest, models for the GMA show a lower overall performance than TMA models, regardless of the algorithm employed. Taking into account that in both areas the same features were made available and used for the development of the relevant models, this result indicates the importance of additional feature space parameters (reflecting atmospheric mechanisms) in order to further improve modelling performance. Concerning the choice of algorithms for the development of AQ models, the superiority of LR-based models in our study supports the finding that in the case of feature spaces of low dimension, the basic mechanisms which influence the quality of the atmospheric environment are persistence and linear dependencies. This result is of use for those wishing to apply AQ models in the frame of an urban environmental management system while having a low-dimension feature space available for model deployment.

References

1. Karatzas, K., Katsifarakis, N., Orlowski, C., Sarzyński, A.: Urban air quality forecasting: a regression and a classification approach. In: Nguyen, N.T. et al. (eds.): Intelligent information and database systems, $9^{\text{th}}$ Asian Conference on Intelligent Information and Database Systems, Part II, Lecture Notes in Artificial Intelligence vol. 10192, pp. 1–10 (2017). https://doi.org/10.1007/978-3-319-54430-4_52
2. Riffat, S., Powell, R., Aydin, D.: Future cities and environmental sustainability. Future Cities Environ. 2, 1 (2016). https://doi.org/10.1186/s40984-016-0014-2
3. Webel, S.: Forecasting Software that’s a Breath of Fresh Air. Pictures of the Future Siemens Magazine (2016). http://www.siemens.com/innovation/en/home/pictures-of-the-future/infrastructure-and-finance/smart-cities-air-pollution-forecasting-models.html. Accessed 18 Aug 2017
4. Dawe, S., Paradice, D.: A systems approach to smart city infrastructure: a small city perspective. In: Proceedings of the Thirty Seventh International Conference on Information Systems, Dublin. http://iot-smartcities.lero.ie/wp-content/uploads/2016/12/A-Systems-Approach-to-Smart-City-Infrastructure-A-Small-City-Perspective.pdf. Accessed 18 Aug 2017
5. Marinov, M.B., Topalov, I., Gieva, E., Nikolov, G.: Air quality monitoring in urban environments. In: 39th International Spring Seminar on Electronics Technology (ISSE), Pilsen, pp. 443–448 (2016). https://doi.org/10.1109/ISSE.2016.7563237
6. Bukoski, B., Taylor, E.M.: Air quality forecasting. Air quality management 129–138 (2014)
7. Kukkonen, J., Olsson, T., Schultz, D.M., Baklanov, A., Klein, T., Miranda, A.I., Monteiro, A., Hirtl, M., Tarvainen, V., Boy, M., Peuch, V.-H., Poupkou, A., Kioutsioukis, I., Finardi, S., Sofiev, M., Sokhi, R., Lehtinen, K.E.J., Karatzas, K., San José, R., Astitha, M., Kallos, G., Schaap, M., Reimer, E., Jakobs, H., Eben, K.: A review of operational, regional-scale, chemical weather forecasting models in Europe. Atmos. Chem. Phys. 12, 1–87 (2012)
8. Karatzas, K., Kaltsatos, S.: Air pollution modelling with the aid of computational intelligence methods in Thessaloniki, Greece. Simul. Model. Pract. Theory 15(10), 1310–1319 (2007)
9. EEA, 2016: Air quality in Europe—2016 report, European Environment Agency, https://doi.org/10.2800/80982. https://www.eea.europa.eu//publications/air-quality-in-europe-2016. Accessed 18 Aug 2017
10. Juda-Rezler, K., Trapp, W., Reizer, M.: Modelling the impact of climate changes on particulate matter levels over Poland. In: Steyn, D.G., Rao, S.T. (eds.) Air pollution modeling and its application XX, pp. 499–450 (2010)
11. Moussiopoulos, N., Vlachokostas, C., Tsilingiridis, G., Douros, I., Hourdakis, E., Naneris, C., Sidiropoulos, C.: Air quality status in Greater Thessaloniki Area and the emission reductions needed for attaining the EU air quality legislation. Sci. Total Environ. 407(4), 1268–1285 (2009)
12. Andrews, A.: The clean air handbook, a practical guideline to EU air quality law, https://www.clientearth.org/reports/20140515-clientearth-air-pollution-clean-air-handbook.pdf. Accessed 18 Aug 2017
13. WHO: Air Quality Guidelines, global update 2005, ISBN 92 890 2192 6 via http://www.euro.who.int. Accessed 18 Aug 2017
14. Siwek, K., Osowski, S.: Improving the accuracy of prediction of PM10 pollution by the wavelet transformation and an ensemble of neural predictors. Eng. Appl. Artif. Intel. 25(6), 1246–1258 (2012)
15. Zhou, Q., Jiang, H., Wang, J., Zhou, J.: A hybrid model for PM2.5 forecasting based on ensemble empirical mode decomposition and a general regression neural network. Sci. Total Environ. 496, 264–274 (2014)
16. Biancofiore, F., Busilacchio, M., Verdecchia, M., Tomassetti, B., Aruffo, E., Bianco, S., Di Tommaso, S., Colangeli, C., Rosatelli, G., Carlo, P.: Recursive neural network model for analysis and forecast of PM10 and PM2.5. Atmos. Pollut. Res. 8(4), 652–659 (2017)
17. Khokhlov, V.N., Glushkov, A.V., Loboda, N.S., Bunyakova, Y.Y.: Short-range forecast of atmospheric pollutants using non-linear prediction method. Atmos. Environ. 42(31), 7284–7292 (2008)
18. Orłowski, C., Sarzyński, A.: A model for forecasting PM10 levels with the use of artificial neural networks. In: Information Systems Architecture and Technology—the use of IT Technologies to Support Organizational Management in Risky Environment, Wrocław (2014)
19. Orłowski, C., Sarzyński, A., Karatzas, K., Katsifarakis, N., Nazarko, J.: Adaptation of an ANN-based air quality forecasting model to a new application area. In: Król, D., Nguyen, N., Shirai, K. (eds.) Advanced Topics in Intelligent Information and Database Systems, pp. 479–488 (2017)
20. Karatzas, K., Kaltsatos, S.: Air pollution modelling with the aid of computational intelligence methods in Thessaloniki, Greece. Simul. Model. Pract. Theory 15(10), 1310–1319 (2007)
21. Voukantsis, D., Karatzas, K., Kukkonen, J., Räsänen, T., Karppinen, A., Kolehmainen, M.: Intercomparison of air quality data using principal component analysis, and forecasting of PM10 and PM2.5 concentrations using artificial neural networks, in Thessaloniki and Helsinki. Sci. Total Environ. 409, 1266–1276 (2011)
22. Szczepaniak, K., Astel, A., Bode, P., Sârbu, C., Biziuk, M., Raińska, E., Gos, K.: Assessment of atmospheric inorganic pollution in the urban region of Gdańsk. J. Radioanal. Nuclear Chem. 270(1), 35–42 (2006)
23. Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.: The WEKA data mining software: an update. SIGKDD Explorations 11(1), 10–18 (2009)
24. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
25. Kohavi, R.: A study of cross-validation and bootstrap for accuracy estimation and model selection. Proc. Fourteenth Int. Joint Conf. Artif. Intel. 2(12), 1137–1143 (1995)
26. EPA: Guidelines for developing an air quality (ozone and PM2.5) forecasting program, U.S. Environmental Protection Agency report EPA-456/R-03-002, https://www3.epa.gov/airnow/aq_forecasting_guidance-1016.pdf. Accessed 18 Aug 2017
27. Voukantsis, D., Niska, H., Karatzas, K., Riga, M., Damialis, A., Vokou, D.: Forecasting daily pollen concentrations using data-driven modeling methods in Thessaloniki, Greece. Atmos. Environ. 44(39), 5101–5111 (2010)
28. Tzima, F., Mitkas, P., Voukantsis, D., Karatzas, K.: Sparse episode identification in environmental datasets: the case of air quality assessment. Expert Syst. with Appl. 38(5), 5019–5027 (2011)

Author information

Authors and Affiliations

Department of Mechanical Engineering, Environmental Informatics Research Group, Aristotle University, Thessaloniki, Greece
Kostas Karatzas & Nikos Katsifarakis
Institute of Management and Finance, WSB University in Gdańsk, Gdańsk, Poland
Cezary Orlowski
Department of Applied Business Informatics, Faculty of Management and Economics, Gdańsk University of Technology, Gdańsk, Poland
Arkadiusz Sarzyński

Corresponding author

Correspondence to Kostas Karatzas.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.

About this article

Cite this article

Karatzas, K., Katsifarakis, N., Orlowski, C. et al. Revisiting urban air quality forecasting: a regression approach. Vietnam J Comput Sci 5, 177–184 (2018). https://doi.org/10.1007/s40595-018-0113-0

Received: 23 August 2017
Accepted: 11 May 2018
Published: 24 May 2018
Issue Date: May 2018
DOI: https://doi.org/10.1007/s40595-018-0113-0
