1 Introduction

In a recently published paper [1] we underlined the importance of air quality (AQ) forecasting in urban environmental management as well as in contemporary smart city development [2, 3]. In the current paper we revisit and extend our initial approach, focusing on urban AQ forecasting from the regression point of view and incorporating an ensemble modelling approach. In doing so, we take into account that environmental management plays an important role in smart city information systems [4] and that air pollution abatement is one of its main targets [5]. AQ forecasting is among the main pillars of AQ management [6] and is implemented with the aid of appropriate AQ models. Such models establish a time-varying relationship between the concentration of air pollutants at a specific time and location \({{\varvec{c}}}( {t,{{\varvec{x}}}})\) and other parameters \({{\varvec{p}}}({t,{{\varvec{x}}}})\) affecting the urban atmospheric environment. This relationship may be expressed with the following general function:

$$\begin{aligned} {{\varvec{c}}}( t,{{\varvec{x}}} )=f({{\varvec{p}}}( t,{{\varvec{x}}} ) ) \end{aligned}$$
(1)

Here t represents time and \({{\varvec{x}}}\) is the location vector corresponding to physical space. In this case the vector \({{\varvec{c}}}(t,{{\varvec{x}}})\) refers to concentration values of air pollutants like Nitrogen Dioxide (\(\hbox {NO}_{2})\), Carbon Monoxide (CO), Ozone (\(\hbox {O}_{3})\) and Particulate Matter (PM), while \({{\varvec{p}}}( {t,{{\varvec{x}}}} )\) includes parameters like wind speed, wind direction, air temperature, solar radiation, air pollutant emissions, air pollutant concentrations, land use type, land surface height, etc. The nature of function f is dictated by the model type employed: if f reconstructs the physical and chemical relationships between the parameters \({{\varvec{p}}}( {t,{{\varvec{x}}}} )\) and values \({{\varvec{c}}}(t,{{\varvec{x}}})\), where \({{\varvec{x}}}\) addresses the whole area of interest in a 3-D gridded manner, then models are said to follow an analytic-deterministic approach [7], while if f is a statistical or data-mining oriented function, then models are said to follow a data-driven approach (as reported in [8] and in references therein). In the latter case, \({{\varvec{x}}}\) refers to specific locations within the studied area, which usually correspond to AQ measuring stations. Thus, \({{\varvec{x}}}\) does not vary with time and can be dropped, leading to an equation of the form:

$$\begin{aligned} {{\varvec{c}}}( t )=f( {{{\varvec{p}}}( t )} ) \end{aligned}$$
(2)

The objective of this paper is to suggest ensemble-oriented models based on Computational Intelligence (CI) that are able to extract as much information as possible from atmospheric quality data of low dimensionality, and thus to contribute to the scientific area of urban AQ forecasting. For this reason we employ a variety of CI methods, and we suggest and test ensemble forms of the function f in Eqs. (1) and (2). The geographic areas of interest are the Gdańsk Metropolitan Area (GMA) in Poland and the Thessaloniki Metropolitan Area (TMA) in Greece, and the parameter of interest is the daily concentration of Particulate Matter with an aerodynamic diameter of up to \(10~\upmu \hbox {m}\) (\(\hbox {PM}_{10})\), approx. \(1/5^{\mathrm{th}}\) of the diameter of a human hair. This pollutant is able to penetrate into the bronchial part of the human lung system [9] and is one of the most important pollutants in the GMA [10] as well as in the TMA [11]. Air pollutant concentrations are addressed as numerical values. AQ forecasting follows a twofold approach:

  a) Each AQ monitoring station is treated individually, i.e. AQ models are developed and tested per station location. Thus, the forecasting of the parameter of interest is performed as a regression problem.

  b) Regression models are created based on ensemble modelling principles, and are evaluated via their ability to forecast AQ levels at different locations (i.e. at each monitoring station).

The target of the forecasting models under development is the mean daily concentration of \(\hbox {PM}_{10}\) one day in advance. This choice reflects the requirement, posed by relevant legislation, that citizens as well as decision makers be informed about the \(\hbox {PM}_{10}\) levels expected for the next day, which should not exceed \(50~\upmu \hbox {g/m}^{3}\) on more than 35 days per year according to the European Regulations [9, 12] and the World Health Organization guidelines [13]. Combustion processes, traffic and natural sources directly emit \(\hbox {PM}_{10}\), while in some regions the mechanical degradation of the road surface and of winter tires also contributes to its production. \(\hbox {PM}_{10}\) particles are part of the inhalable fraction of PM and have adverse effects on human health [9].
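The exceedance rule quoted above can be checked mechanically. The following is a minimal sketch, assuming a list holding one year of daily mean \(\hbox {PM}_{10}\) values in \(\upmu \hbox {g/m}^{3}\); the function and constant names are hypothetical and the example series is made up for illustration.

```python
DAILY_LIMIT_UG_M3 = 50.0   # daily mean PM10 limit value quoted in the text
MAX_EXCEEDANCE_DAYS = 35   # allowed number of exceedance days per year


def count_exceedance_days(daily_means):
    """Count days whose mean PM10 concentration is above the daily limit."""
    return sum(1 for value in daily_means if value > DAILY_LIMIT_UG_M3)


def complies_with_limit(daily_means):
    """Return True if the yearly series stays within the allowed exceedances."""
    return count_exceedance_days(daily_means) <= MAX_EXCEEDANCE_DAYS


if __name__ == "__main__":
    # Made-up series (not measured data): 40 polluted days, 325 clean days.
    series = [62.0] * 40 + [28.0] * 325
    print(count_exceedance_days(series), complies_with_limit(series))
```

A next-day forecast can be passed through the same threshold to flag an expected exceedance before it occurs.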

The research question posed in the current paper moves one step ahead of our previously published results [1] and addresses (a) the ability of a low dimensionality feature space (small number of input parameters) to support effective data-driven models for \(\hbox {PM}_{10}\) forecasting and (b) the modelling approach to be used in terms of algorithms and their setup (single vs. ensemble oriented models). In addition, we make use of an ensemble approach based on an ANN model of simple architecture which can be applied to multiple geographic areas, thus simplifying the ensemble approach suggested by [14] and [15], while maintaining a performance comparable to the one reported by similar studies [16], and therefore providing a novel approach to the problem at hand.

In the rest of the paper we first present the materials of our study (Section 2), followed by the computational methods (Section 3). We then present and discuss the results in Section 4, and we draw our conclusions in Section 5.

2 Materials: area of study and data made available

The areas of study as well as the AQ problem addressed have been the focus of multiple studies performed in the past.

In the case of Gdańsk, ANNs have been employed for AQ forecasting in [17]. The same data set has been used for \(\hbox {PM}_{10}\) forecasting in [18] as well as for the adaptation of an AQ forecasting model developed for Gdańsk to the Thessaloniki area [19].

The air pollution of Thessaloniki has been studied and modeled with the aid of ANNs [20], with special emphasis on \(\hbox {PM}_{10}\) [21]. The similarity between the GMA and the TMA in terms of population and the existence of a sea front suggests that there might also be a similarity in the way that \(\hbox {PM}_{10}\) oriented air pollution can be modeled in both areas. Moreover, the need for data-driven models which use a small number of input parameters suggested that a generalized, ensemble-based approach should be employed for the AQ modeling in both areas of interest, these being the novelty points of the research results at hand.

2.1 The two areas of interest

The city of Gdańsk is located on the Baltic coast in the south-west of the bay of Gdańsk, in the northern part of Poland. It is the capital of a tri-city metropolitan area that also includes Gdynia (known for its shipyards) and Sopot (a recreational resort); together with the suburban communities, the GMA has more than 1,000,000 residents. The economy in Gdańsk is dominated by shipbuilding, petrochemicals and chemical industries, which are all concentrated quite close to the city center. The majority of air pollutant emissions originate from the industrial sector, the port activities and the city traffic [22], while the most important pollutants are \(\hbox {PM}_{10}\), \(\hbox {NO}_{2}\) and \(\hbox {SO}_{2}\) (http://www.airqualitynow.eu).

The city of Thessaloniki faces an oval harbor bay and stands on a rising ground at the heart of a long gulf which is formed by the peninsula of Chalcidice. Various municipalities surround the city while an industrial zone is located in the north-west of its outskirts. The TMA is the second largest urban agglomeration in Greece accounting for more than 1,000,000 inhabitants, with a considerable accumulation of urban traffic as well as industrial activities. The TMA is characterized by high pollution levels especially related to \(\hbox {PM}_{10}\) while \(\hbox {O}_{3}\) appears to be high in suburban locations of the area and \(\hbox {NO}_{2}\) levels are still high in dense urban areas in association with traffic [11].

2.2 The atmospheric quality data

In both the GMA and the TMA a number of AQ monitoring stations operate (9 and 17 respectively), which routinely record concentration values of basic pollutants as well as the variation of meteorological parameters. As not all pollutants are recorded at all stations, and in order to focus on the pollutant of interest (\(\hbox {PM}_{10})\), we decided to select five stations from each area of interest (included in Table 1) that were able to provide \(\hbox {PM}_{10}\) concentrations as well as meteorological data, in order to come up with data sets that are identical in terms of the parameters they include. In order to deal with the non-negligible frequency of missing data, we selected data from the year 2013 which contained only daily \(\hbox {PM}_{10}\) concentrations as well as information for air temperature and relative humidity.

Table 1 The Air Quality monitoring stations used for the current study in GMA and TMA

As a result and for each station, the same atmospheric parameters were used for the modelling and forecasting process: the model input or feature vector \({{\varvec{x}}}\) included five parameters, namely \(\hbox {PM}_{10}\) concentration of the current day as well as temperature and relative humidity of the current day, complemented by the day and the month of the year. The target parameter to be forecasted \({{\varvec{y}}}\) was the \(\hbox {PM}_{10}\) concentration of the next day. A summary of the basic statistical characteristics of the parameters involved in our study is included in Table 2.
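The construction of the five-parameter feature vector and its next-day target can be sketched as follows. This is an illustration under stated assumptions rather than the procedure actually used: the function name `build_samples` is hypothetical, missing values are assumed to be marked with `None`, and "day" is read as the day of the month.

```python
from datetime import date, timedelta


def build_samples(dates, pm10, temperature, humidity):
    """Pair each day's five features with the PM10 value of the following day.

    Days with a missing value (None), or without an observation on the next
    calendar day, are skipped.
    """
    features, targets = [], []
    for i in range(len(dates) - 1):
        today = (pm10[i], temperature[i], humidity[i])
        tomorrow = pm10[i + 1]
        is_next_day = dates[i + 1] - dates[i] == timedelta(days=1)
        if None in today or tomorrow is None or not is_next_day:
            continue
        # Assumption: "day" means day of the month (1-31).
        features.append([pm10[i], temperature[i], humidity[i],
                         dates[i].day, dates[i].month])
        targets.append(tomorrow)
    return features, targets


if __name__ == "__main__":
    # Made-up values for three consecutive days of 2013.
    days = [date(2013, 1, 1) + timedelta(days=k) for k in range(3)]
    x, y = build_samples(days, [30.0, 40.0, 35.0], [1.0, 2.0, 0.5],
                         [80.0, 75.0, 90.0])
    print(x, y)
```

Skipping incomplete days keeps every station's data set identical in structure, at the cost of losing some samples.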

Table 2 Basic statistics for the AQ and meteorological parameters available for each station at GMA and TMA

3 Computational methods

The forecasting of the numerical value of \(\hbox {PM}_{10}\) concentration levels for the next day was the goal set for the development of relevant forecasting models. For this reason, we made use of the available datasets for each AQ monitoring station to develop individual (per station) AQ forecasting models.

3.1 Algorithms for single station model creation

The algorithms applied were selected based on computational experiments employing various CI methods, which were conducted with the aid of Matlab (www.mathworks.com) as well as of the WEKA computational environment [23]. On this basis, we chose the following three algorithms as the basis for AQ model development:

(i) Linear Regression (LR). Here the relationship between the forecasted parameter and the input parameters is described by an equation of the form:

    $$\begin{aligned} {{\varvec{y}}}={{\varvec{x}}}\cdot {\varvec{\beta }} +{\varvec{\varepsilon }} \end{aligned}$$
    (3)

where \({{\varvec{x}}}\) is the input vector, \({\varvec{\beta }}\) is the slope vector and \({\varvec{\varepsilon }}\) the error vector. The slope vector is commonly calculated via the least squares method, thus:

$$\begin{aligned} {\hat{{\varvec{\beta }}}} =({{\varvec{x}}}'\cdot {{\varvec{x}}})^{-1}\cdot {{\varvec{x}}}'\cdot {{\varvec{y}}} \end{aligned}$$
(4)
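
Equation (4) can be evaluated directly by solving the normal equations. The sketch below uses only the Python standard library; the function names are hypothetical, and the intercept column is an assumption added for illustration, since Eq. (3) does not write an intercept explicitly.

```python
def solve_linear_system(a, b):
    """Solve a @ z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are left untouched.
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("x'x is singular; the features are collinear")
        m[col], m[pivot] = m[pivot], m[col]
        for row in range(col + 1, n):
            factor = m[row][col] / m[col][col]
            for k in range(col, n + 1):
                m[row][k] -= factor * m[col][k]
    # Back substitution on the upper-triangular system.
    z = [0.0] * n
    for row in range(n - 1, -1, -1):
        tail = sum(m[row][k] * z[k] for k in range(row + 1, n))
        z[row] = (m[row][n] - tail) / m[row][row]
    return z


def fit_least_squares(x, y):
    """Return beta_hat = (x'x)^-1 x'y, with the intercept as first entry."""
    rows = [[1.0] + list(r) for r in x]   # assumption: constant column added
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
           for i in range(p)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(p)]
    return solve_linear_system(xtx, xty)


def predict(beta, row):
    """Apply Eq. (3) without the error term to one feature row."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], row))
```

Solving the linear system avoids forming the inverse explicitly, which is numerically preferable and gives the same estimate whenever \({{\varvec{x}}}'\cdot {{\varvec{x}}}\) is invertible.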

(ii) Artificial Neural Networks (ANNs). In ANNs the input vector \({{\varvec{x}}}\) for each neuron k is weighted with the aid of a weighting vector \({{\varvec{w}}}_k \), and the result is summed (taking into account any bias) and then fed into a transfer function f to produce the overall output vector \({{\varvec{y}}}_k \):

    $$\begin{aligned} {{\varvec{y}}}_k =f( {{{\varvec{w}}}_k^T \cdot {{\varvec{x}}}} ) \end{aligned}$$
    (5)

The training of the ANN aims at reducing the error \({{\varvec{e}}}_k\) between the model output \({{\varvec{y}}}_k\) and the actual (real) value observed \({{\varvec{d}}}_k\), which here is the \(\hbox {PM}_{10}\) concentration of the next day for each station.

$$\begin{aligned} {{\varvec{e}}}_k =\Vert {{\varvec{y}}}_k -{{\varvec{d}}}_k\Vert \end{aligned}$$
(6)

This error reduction is based on a number of methods, all of which aim at recalculating the initial weights so that the overall network error is minimized. In the case of the gradient descent method (which is the simplest of all but nevertheless representative of the way that the weights are recalculated), the relationship between the updated and the initial weighting vector for all neurons k of the ANN is given by:

$$\begin{aligned} {{\varvec{w}}}( {t+1} )={{\varvec{w}}}( t )-a( t ){{\varvec{g}}}( t ) \end{aligned}$$
(7)

Here t and \(t+1\) index the weights before and after the update, \(a(t)\) is the learning rate, and the gradient term \({{\varvec{g}}}(t)\) is given by:

$$\begin{aligned} {{\varvec{g}}}( t )={{\varvec{J}}}^{T}( t )\cdot {{\varvec{e}}}( t ) \end{aligned}$$
(8)

where \({{\varvec{J}}}^{T}\) is the (transposed) Jacobian and \({{\varvec{e}}}( t )\) is the overall error vector [1].

In this specific case a MultiLayer Perceptron Network with a feed-forward architecture and a back propagation training method was used, with an input layer consisting of 5 nodes (i.e. all the input parameters per station), an output layer consisting of only one node (the \(\hbox {PM}_{10}\) concentration of the next day) and a hidden layer with 10 nodes. The sigmoid function is employed as the transfer function while the gradient descent algorithm is used for minimizing the error function.

figure a
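The 5-10-1 architecture and the update rule of Eq. (7) can be illustrated with a small standard-library sketch. It is not the Matlab or WEKA implementation used in the study: the class name `TinyMLP`, the linear output unit, the per-sample updates, the weight initialization range and the fixed learning rate are all assumptions made for illustration, and inputs are assumed to be scaled to a comparable range beforehand.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class TinyMLP:
    """Feed-forward net: n_in inputs -> n_hidden sigmoid units -> 1 output."""

    def __init__(self, n_in=5, n_hidden=10, seed=0):
        rng = random.Random(seed)
        # Each hidden row holds n_in weights followed by one bias term.
        self.w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]

    def _hidden(self, x):
        # zip stops at len(x), so the trailing bias is added separately.
        return [sigmoid(sum(w * v for w, v in zip(row, x)) + row[-1])
                for row in self.w_hidden]

    def predict(self, x):
        h = self._hidden(x)
        return sum(w * v for w, v in zip(self.w_out, h)) + self.w_out[-1]

    def train_step(self, x, d, rate=0.1):
        """One update w(t+1) = w(t) - a(t) g(t) for a single sample."""
        h = self._hidden(x)
        y = sum(w * v for w, v in zip(self.w_out, h)) + self.w_out[-1]
        err = y - d                      # derivative of 0.5 * (y - d)^2
        # Back-propagate through the output weights before changing them.
        delta_h = [err * self.w_out[j] * h[j] * (1.0 - h[j])
                   for j in range(len(h))]
        for j, hj in enumerate(h):
            self.w_out[j] -= rate * err * hj
        self.w_out[-1] -= rate * err
        for j, row in enumerate(self.w_hidden):
            for i, xi in enumerate(x):
                row[i] -= rate * delta_h[j] * xi
            row[-1] -= rate * delta_h[j]
        return 0.5 * err * err


def train(net, samples, epochs=200, rate=0.1):
    """Run per-sample gradient descent; return the mean loss of each epoch."""
    history = []
    for _ in range(epochs):
        total = sum(net.train_step(x, d, rate) for x, d in samples)
        history.append(total / len(samples))
    return history
```

Here `samples` is a list of `(features, target)` pairs; watching the returned loss history is a simple way to see whether the chosen learning rate converges.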

(iii) Random Forests (RF), an ensemble method originating from the Decision Tree family of algorithms [24] that has shown high capacity to effectively model atmospheric parameters of interest [1]. The method creates N subsets of the input vector \({{\varvec{x}}}\) using random selection with replacement, each subset containing 2/3 of the initial data, while the remaining data are used to estimate error and variable importance. Then, for each subset, a decision tree with an arbitrary number of nodes is created, where at each node the split is chosen from a (randomly) selected subset of L attributes so as to optimize a target function (best split criterion). In our case \(L={\text {int}}[ {\log _2 ( {{\text {Number of attributes}}} )+1} ]\). Each of the aforementioned random trees had an unlimited number of levels and nodes. The predictions of the individual trees are averaged, and thus the ensemble-based overall prediction of the RF (here the \(\hbox {PM}_{10}\) concentration of the next day) is generated. A pseudocode for this method based on http://dataaspirant.com/ is presented below:

Figure b: pseudocode of the Random Forests method

The ensemble prediction is then obtained by combining the outputs of all generated trees: by voting in classification tasks and, as in the regression problem addressed here, by averaging.
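The Random Forests steps described above (bootstrap sampling, L random attributes per split, unpruned trees, averaged predictions) can be sketched in plain Python. This is an illustrative toy, not the implementation used in the study: the function names are hypothetical, the split criterion is assumed to be the sum of squared errors, and each bootstrap sample draws as many rows as the data set holds, which leaves roughly 63% distinct rows per tree, close to the 2/3 quoted above.

```python
import math
import random


def n_split_features(n_attributes):
    """L = int(log2(number of attributes) + 1), as stated in the text."""
    return int(math.log2(n_attributes) + 1)


def mean(values):
    return sum(values) / len(values)


def sse(values):
    """Sum of squared deviations from the mean (assumed split criterion)."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values)


def grow_tree(rows, targets, rng):
    """Grow an unpruned regression tree; each node tries L random features."""
    if len(set(targets)) == 1:
        return targets[0]                      # pure leaf
    n_attr = len(rows[0])
    best = None
    for f in rng.sample(range(n_attr), n_split_features(n_attr)):
        # Dropping the largest value keeps both sides of a split non-empty.
        for threshold in sorted({r[f] for r in rows})[:-1]:
            left = [t for r, t in zip(rows, targets) if r[f] <= threshold]
            right = [t for r, t in zip(rows, targets) if r[f] > threshold]
            score = sse(left) + sse(right)
            if best is None or score < best[0]:
                best = (score, f, threshold)
    if best is None:                           # sampled features are constant
        return mean(targets)
    _, f, threshold = best
    go_left = [r[f] <= threshold for r in rows]
    left = grow_tree([r for r, g in zip(rows, go_left) if g],
                     [t for t, g in zip(targets, go_left) if g], rng)
    right = grow_tree([r for r, g in zip(rows, go_left) if not g],
                      [t for t, g in zip(targets, go_left) if not g], rng)
    return (f, threshold, left, right)


def tree_predict(node, row):
    """Walk from the root to a leaf; internal nodes are 4-tuples."""
    while isinstance(node, tuple):
        f, threshold, left, right = node
        node = left if row[f] <= threshold else right
    return node


def grow_forest(rows, targets, n_trees=25, seed=0):
    """Fit each tree on a bootstrap sample drawn with replacement."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        picks = [rng.randrange(n) for _ in range(n)]
        forest.append(grow_tree([rows[i] for i in picks],
                                [targets[i] for i in picks], rng))
    return forest


def forest_predict(forest, row):
    """Regression forests average the tree outputs instead of voting."""
    return mean([tree_predict(tree, row) for tree in forest])
```

With the five input parameters used in this study, `n_split_features(5)` evaluates to 3, so each node considers three randomly chosen attributes.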

3.2 Ensemble models

In addition to the above approach, we investigated the possibility of developing ensemble-based models common to all monitoring stations. More specifically:

  1. A single ensemble model was created for each one of the two areas of interest, and then applied to all individual AQ monitoring stations of the same area (local ensemble).

  2. The ensemble created in one of the geographic areas was applied to each one of the AQ monitoring stations of the other geographic area (foreign ensemble).

  3. Both local and foreign ensembles were combined to generate a cross ensemble model, which was then applied to each one of the AQ monitoring stations of both geographic areas of interest.

The aforementioned approach was implemented for both LR and ANN models as follows:

  1. Local ensemble: In the case of LR, the parameters of the slope vector \({\varvec{\beta }}\) of the ensemble model were calculated as weighted mean values of the parameters of each one of the individual LR models, and the local ensemble model was then applied to all stations. In the case of the ANN models, the weights of the individual models were used for the calculation of the weighted mean value of the weights of the local ensemble model. In both cases, the weighted means were calculated on the basis of the correlation coefficients of each one of the models participating in the ensemble, as resulting from their application to the monitoring station for which they were developed.

  2. Foreign ensemble: the calculation was done exactly as in the case of the local ensemble, yet making use of the foreign individual model slope vectors (for LR) and weights (for ANN) instead of the local individual model characteristics.

  3. Cross ensemble: the parameters of the local and the foreign ensemble models were averaged in order to calculate the parameters of the cross ensemble models.
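The three ensemble constructions amount to simple arithmetic on parameter vectors. The sketch below assumes each station model is represented as a flat list of parameters (LR slopes, or ANN weights of identical architecture) and that the correlation coefficients are used directly as weights; the function names are hypothetical.

```python
def weighted_mean_vector(vectors, weights):
    """Element-wise weighted mean of equally long parameter vectors."""
    total = sum(weights)
    return [sum(w * vec[i] for vec, w in zip(vectors, weights)) / total
            for i in range(len(vectors[0]))]


def build_ensembles(own_params, own_r, other_params, other_r):
    """Return (local, foreign, cross) parameter vectors for one area.

    own_params / other_params: one parameter vector per station model
    own_r / other_r: correlation coefficient of each model at the station
                     for which it was developed
    """
    local = weighted_mean_vector(own_params, own_r)
    foreign = weighted_mean_vector(other_params, other_r)
    # The cross ensemble is the plain average of the two ensembles.
    cross = [(a + b) / 2.0 for a, b in zip(local, foreign)]
    return local, foreign, cross
```

For the GMA, for example, `own_params` would hold the five GMA station models and `other_params` the five TMA station models; swapping the arguments gives the ensembles for the TMA.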

3.3 Model validation

Table 3 Correlation coefficient (r), Mean absolute error (MAE) and Root mean square error (RMSE) for three models per monitoring station concerning the forecast of the mean daily PM10 concentration one day in advance

In order to validate the results of the \(\hbox {PM}_{10}\) predictions, it is important to make use of as much of the available data as possible for the training as well as for the testing phase. For this reason we followed a 10-fold cross validation procedure [25] for each one of the individual models developed: we randomly divided the initial dataset into 10 equal subsets. Then 9 out of these subsets were used for training the model, while the 10th one was used for testing. This process was repeated 10 times, each time leaving a different subset out of the training phase and using it for the test phase. The overall model results are the mean values of the statistical indices of the 10 models developed. Concerning the ensemble models, these were defined on the basis of the (pre-existing) individual models per algorithm used, and therefore no additional model validation was performed.
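The 10-fold procedure can be sketched as follows. The helper names are hypothetical, and `fit` and `score` are placeholder callables standing in for any of the algorithms and statistical indices used in this paper.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, targets, fit, score, k=10, seed=0):
    """Train on k-1 folds, test on the held-out fold, average the scores."""
    folds = k_fold_indices(len(targets), k, seed)
    scores = []
    for held_out in folds:
        test_set = set(held_out)
        train_idx = [i for i in range(len(targets)) if i not in test_set]
        model = fit([features[i] for i in train_idx],
                    [targets[i] for i in train_idx])
        scores.append(score(model,
                            [features[i] for i in held_out],
                            [targets[i] for i in held_out]))
    return sum(scores) / len(scores)
```

When the number of samples is not divisible by 10, the folds differ in size by at most one sample.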

Model results were evaluated based on the following statistical indices:

  (a) Pearson’s correlation coefficient r, which describes the degree of linear relationship between forecasted and real \(\hbox {PM}_{10}\) concentration values.

  (b) Mean Absolute Error (MAE), which is a measure of the mean absolute distance between forecasted and real values.

  (c) Root Mean Squared Error (RMSE), which is the square root of the Mean Square Error and expresses the standard deviation of the differences between forecasted and actual values.
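For reference, the three indices can be computed as in the following sketch, which follows their standard definitions (RMSE being the square root of the mean squared error); the function names are hypothetical.

```python
import math


def pearson_r(forecast, observed):
    """Pearson's correlation coefficient between two equally long series."""
    n = len(observed)
    mf, mo = sum(forecast) / n, sum(observed) / n
    cov = sum((f - mf) * (o - mo) for f, o in zip(forecast, observed))
    var_f = sum((f - mf) ** 2 for f in forecast)
    var_o = sum((o - mo) ** 2 for o in observed)
    return cov / math.sqrt(var_f * var_o)


def mae(forecast, observed):
    """Mean absolute distance between forecasted and observed values."""
    return sum(abs(f - o) for f, o in zip(forecast, observed)) / len(observed)


def rmse(forecast, observed):
    """Square root of the mean squared forecast error."""
    mse = sum((f - o) ** 2 for f, o in zip(forecast, observed)) / len(observed)
    return math.sqrt(mse)
```

Because RMSE squares the errors before averaging, it is never smaller than MAE and reacts more strongly to occasional large misses, such as pollution episodes.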

4 Results and discussion

Based on the model calculations performed as described in Section 3, Pearson’s correlation coefficient r, the Mean Absolute Error and the Root Mean Squared Error were calculated for the three models developed and for each one of the ten AQ monitoring stations for which data were available (Table 3).

Results suggest that the algorithm leading to the best (highest) correlation coefficient between forecasted and monitored values is LR, with an r ranging from 0.406 for station AM3 up to 0.641 for station AM4 for the GMA. Concerning the TMA, LR is again the best algorithm in terms of the highest correlation coefficient achieved, with an r value ranging from 0.72 for the Eptapyrgiou station up to 0.742 for the Malakopi station. The RF algorithm can be ranked 2nd, achieving correlation coefficients very close to the ones obtained with LR (and surpassing it for the Eptapyrgiou station), while in some cases leading to the best (lowest) MAE (as in the AM3, Martiou and Eptapyrgiou stations) and to the best (lowest) RMSE (as in the AM3 and Eptapyrgiou stations). LR is a simple linear algorithm generally considered weak in capturing nonlinear phenomena like the ones involved in AQ problems, and it usually performs more poorly than algorithms like ANNs or RF [1]. The success of this algorithm in our case has to do with the limited number of atmospheric quality parameters available in all studied areas and stations (low number of features), which (possibly) excludes nonlinear dependencies from the available dataset and makes persistence the main mechanism affecting the forecast of \(\hbox {PM}_{10}\) levels one day in advance [26].

Table 4 Results of the local, foreign and cross ensemble models for the ANN and LR algorithms in all stations of the GMA and TMA

In the case of the ensemble approach used (local, foreign and cross ensembles), the results of the two algorithms employed (LR and ANN) are presented in Table 4. The optimum ensemble approach is selected on the basis of the highest correlation coefficient achieved, taken together with the lowest error metric values (MAE and RMSE). On this basis the local ensemble achieves the best results, followed by the cross ensemble, with the foreign ensemble last. The result may be attributed to the ability of the local ensemble to better represent the dependencies between the modelled parameter (mean daily \(\hbox {PM}_{10}\) concentration for the next day) and the parameters of the feature space (input parameters). In terms of algorithms employed, LR is always better in comparison to ANNs. Concerning the areas of study, r values range from 0.505 (station AM2) up to 0.64 (station AM4) for the GMA, and from 0.710 (station Egnatia) up to 0.765 (station Malakopi) for the TMA. The value range of the correlation coefficient achieved for the TMA corresponds to a coefficient of determination (the correlation coefficient squared) between 0.504 (for Egnatia) and 0.585 (for Malakopi); these values are better than those previously reported for the TMA, albeit for two different stations, by [27] and [28].

By comparing ensembles with the local models, it is evident that in the case of LR-based models, the local ensemble provides better performance than the local models for all GMA stations with the exception of AM4, while in the case of the TMA local ensembles outperform local models for three out of five stations (Lagkada, Eptapyrgiou and Malakopi). In the case of ANN models, the local ensemble and the local models perform almost equally in terms of correlation coefficient values achieved.

5 Conclusions

In this paper, we address the problem of air quality forecasting for two different geographical areas of interest, the GMA and the TMA, by employing a regression approach, making use of a limited dimension feature space, and targeting the forecast of the mean daily \(\hbox {PM}_{10}\) concentration of the next day. We initially develop location specific models by employing ANNs, LR and RF, and achieve correlation coefficients between 0.406 and 0.641 for the GMA stations, and between 0.693 and 0.742 for the TMA stations. The best performance was provided by the LR models, followed by the RF and the ANN models. In addition, we developed and tested three types of ensemble models per area, namely the local, the foreign and the cross ensemble models. Their application proved the local ensemble models to be superior for both the ANN and the LR algorithms. These results indicate that even when the feature space is of limited dimensionality, the best individual model outperforms the common model for all the monitoring stations, making use of the ensemble principle, and employing the recalculation of weights in a simple LR model. This suggests that city authorities may develop effective AQ models by targeting their investment in AQ monitoring to the parameters of interest, a vast feature space not being necessary for the success of the modelling approach.

In terms of geographic area of interest, models for the GMA show a lower overall performance in comparison to TMA models, regardless of the algorithm employed. Taking into account that in both areas the same features were made available and used for the development of the relevant models, this result indicates the importance of additional feature space parameters (reflecting atmospheric mechanisms) in order to further improve modelling performance. Regarding the choice of algorithms for the development of AQ models, the superiority of LR-based models in our study supports the finding that in the case of feature spaces of low dimension, the basic mechanisms which influence the quality of the atmospheric environment are persistence and linear dependencies. This result is of use for those wishing to apply AQ models in the frame of an urban environmental management system, having a low-dimension feature space available for model deployment.