Abstract
The main goal of this work is to obtain reliable predictions of the concentrations of pollutants related to maritime traffic (SO2, PM10, NO2, NOX, and NO) in the Bay of Algeciras, located in Andalusia, in the south of Spain. Specifically, the objective is to predict future air quality levels of the principal maritime traffic-related pollutants in the Bay of Algeciras as a function of the rest of the pollutants, the meteorological variables, and vessel data. To this end, three scenarios were analysed for comparison, namely Alcornocales Park and the cities of La Línea and Algeciras. A database of hourly records of air pollution immissions and meteorological measurements in the Bay of Algeciras region and a database of maritime traffic in the port of Algeciras during the years 2017 to 2019 were used. A resampling procedure based on five-fold cross-validation was designed to assure the generalisation capabilities of the tested models. It was used to compute the pollutant predictions with different classification models and with artificial neural networks using different numbers of hidden layers and units. This procedure enabled appropriate and reliable multiple comparisons among the tested models and facilitated the selection of a set of top-performing prediction models. The models were compared using several classification quality indexes: sensitivity, specificity, accuracy, and precision. The distance (d1) to the perfect classifier (1, 1, 1, 1) was also used as a discriminant feature, which allowed for the selection of the best models. Concerning the number of variables, an analysis was conducted to identify the most relevant ones for each pollutant. This approach aimed to obtain models with fewer inputs, facilitating the design of an optimised monitoring network. These more compact models proved to be the optimal choice in many cases. The sensitivities obtained with the best models were 0.98 for SO2, 0.97 for PM10, 0.82 for NO2 and NOX, and 0.83 for NO.
These results demonstrate the potential of the models to forecast air pollution in a port city or a complex scenario and to be used by citizens and authorities to prevent exposure to pollutants and to make decisions concerning air quality.
1 Introduction
Air pollution is a real threat in today's world according to the World Health Organization (WHO). The European Directive 2008/50/EC regulates several key atmospheric pollutants, including particulate matter (PM), nitrogen dioxide (NO2), sulfur dioxide (SO2), ozone (O3), and carbon monoxide (CO). Vessel-related atmospheric pollutants encompass sulfur dioxide (SO2), nitrogen oxides (NOx), and particulate matter (PM). Exposure to hazardous air pollutant emissions can lead to a range of human health problems, including respiratory disorders, cardiovascular disease, and increased risk of stroke. Manisalidis et al. (2020) presented an overview of the effects of air pollution on human health. A large body of scientific work has demonstrated that particulate matter directly affects human health by reducing air quality (Adeyemi et al. 2022). Air pollution in urban areas is a complex mixture of toxic components that have unhealthy effects on residents, especially sensitive populations such as children and people with cardiac and respiratory diseases (Kolehmainen et al. 2001). From an environmental point of view, studying the prediction of air pollutant levels or concentrations (immissions) is crucial for the protection of human health and the environment. This research aims to provide valuable insights into the factors influencing the distribution, temporal variations, and potential exposure risks associated with ambient pollutants. Accurate prediction models can be developed to forecast pollution levels, identify pollution hotspots and assess compliance with regulatory standards. These predictive models play a vital role in urban planning, industrial siting and the formulation of effective emission control strategies.
By proactively predicting and mitigating high pollution episodes, air pollution forecasting research contributes to protecting public health, reducing environmental impact and promoting sustainable communities (Pope and Dockery 2006; Stieb et al. 2009; Kloog et al. 2013). A review of models to forecast air pollution health outcomes is presented by Oliveri et al. (2017), where several large cities were compared with regard to different pollutants. In addition, Savouré et al. (2021), Subramaniam et al. (2022), and Traina et al. (2022) applied artificial intelligence to forecast air pollution in relation to human health.
Different studies identify the air pollutants related to vessel traffic (Miola and Ciuffo 2011; Moreno-Gutiérrez et al. 2015; Ekmekçioğlu et al. 2020) and estimate the amount of pollution associated with ships in port areas (Lu et al. 2006; Liu et al. 2014; Fameli et al. 2020). These pollutants are sulphur dioxide (SO2), nitrogen oxides (NOx) and particulate matter (PM). Marine pollution is regulated by the International Maritime Organisation (IMO) through the Marine Pollution Protocol (MARPOL). Decarbonisation, that is, the reduction of greenhouse gas emissions, is the main purpose of the IMO. An energy efficiency index is applied to vessels to indicate their classification (A, B, C, D, E) (MARPOL, Annex VI). The aim of the IMO is to achieve zero emissions by 2050 (IMO 2021). The air pollutants responsible for acid rain are sulphur dioxide (SO2) and nitrogen oxides (NOx), which react in the atmosphere with water, oxygen, and other chemicals to form sulphuric acid and nitric acid. NO2 is primarily responsible for the formation of smog and acid rain in urban areas, causing both acute and chronic effects (Menezes and Popowicz 2022). These pollutants are emitted from the combustion of fossil fuels in industrial processes, power generation, and transport. The main pollutants associated with port activity are presented in Yang et al. (2022), Yeh et al. (2022), and Mueller et al. (2023).
In recent decades, artificial neural networks (ANNs) have been widely applied in the field of air quality forecasting (Kukkonen et al. 2003; Fernando et al. 2012; Hu et al. 2021; Muruganandam and Arumugam 2023). Numerous studies have used artificial intelligence (AI) and machine learning techniques to monitor air quality (Bai et al. 2018; Mclean et al. 2019; Baklanov and Zhang 2020; Liu et al. 2021; Masood and Ahmad 2021). Bai et al. (2018) analysed the three classical methods for forecasting air pollution (statistical, artificial intelligence, and numerical prediction methods). There is also literature on forecasting air quality in urban areas using different statistical methods (Mavroidis et al. 2007; Ilacqua et al. 2007; Lu et al. 2014). Considering meteorological aspects, Mavroidis et al. (2007) suggested a successful methodology for assessing the impact of different emission reduction scenarios on the attainment of air quality standards for CO and NO2 in the Athens area. Furthermore, Ribeiro and Gonçalves (2022) classified NO2 in Portugal as a binary objective using a benchmark model. In Durão et al. (2016), classification and regression tree techniques were successfully used to predict ozone in Sines (Portugal). For NO2, Prati et al. (2015) provided an insight into the relevance of a spatial analysis of data that provides knowledge on how ship emissions affect the air in a port city. To forecast air quality in urban areas, Lu et al. (2014) proposed different semi-parametric regression models. Particulate matter (PM) sources in three European cities (Athens, Basle, and Helsinki) are described and analysed using structural equation modelling in parallel with traditional principal components (Ilacqua et al. 2007). Similar machine learning techniques are used by Lakra and Avishek (2022) to forecast fog, which is also related to meteorological factors.
Other techniques are also used to construct air quality models. In García-Nieto et al. (2015), air quality in Oviedo (Spain) was modelled using multivariate adaptive regression splines (MARS) and, subsequently, support vector regression (SVR) and a multilayer perceptron (MLP) were used to forecast PM10 concentrations in the same city by García-Nieto et al. (2018). In addition, meteorological variables are considered by Luna et al. (2019), where low-cost electrochemical sensors are used to quantify air pollution exposure, and ANNs are used for the prediction and control of CO2 and SO2 concentrations. The most relevant information extracted from this study was that pollution prediction is sensitive to humidity, wind speed, and temperature. Therefore, ANNs could be used to predict and impute missing values or to re-evaluate doubtful values. A method for predicting SO2 emissions in several cities is shown by Ju et al. (2023), which is of great help for accurate control of this pollutant. Applied to megacities, He et al. (2014) provided an ANN-based method, in particular a multilayer perceptron (MLP), that predicts fine particles, suggesting that particulate matter concentrations are generated by traffic and controlled by weather conditions.
Air quality assessment, from an operational point of view, requires the characterisation of atmospheric quality (Corani and Scanagatta 2016; Méndez et al. 2023). The aim of this work is to predict future values of the levels of each pollutant. Machine learning methods based on classification models have been used for this purpose, and a comprehensive comparison of classification models was developed. The classifiers tested were trees, support vector machines (SVMs), artificial neural networks (ANNs), ensembles, K-nearest neighbours (KNNs), discriminant analysis, and naïve Bayes. Most of them have already been successfully used by the authors in different papers (Turias et al. 2008; Ruiz-Aguilar et al. 2020; Song and Fu 2020; González-Enrique et al. 2021; Moscoso-López et al. 2022). Regarding local studies, the impact of ship propulsion systems on air pollution in the Strait of Gibraltar in 2017 is presented in Durán-Grados et al. (2022). This study is based on an inventory of ships crossing the Strait and calling at the ports of Algeciras, Tarifa, and Ceuta. In Martín et al. (2008), air pollution was modelled with classification techniques in the Bay of Algeciras (Spain). Additionally, Rodríguez-García et al. (2022) conducted an extensive analysis of statistics, risk, and trends in the area of the Bay of Algeciras from 2010 to 2015. Furthermore, due to the large number of inputs used to build the models, the problem of the curse of dimensionality (Bishop 2006) could arise. Therefore, a feature selection stage was applied using the Minimum Redundancy Maximum Relevance (mRMR) method, which has been successfully tested by the authors previously in air pollution forecasting problems (González-Enrique et al. 2021).
The main motivation of this manuscript is to provide citizens with reliable information on air pollution forecasts. This challenge is addressed through a data-driven approach using historical data and machine learning techniques, which will be explained in more detail in the next sections. Improving the air quality in populated cities is another of the main motivations for this study, which is carried out in the Bay of Algeciras (southern Spain), where the most important port in Spain and the fourth in Europe in terms of cargo traffic is located. Maritime traffic in Algeciras has experienced a massive increase in the last ten years, and this increase in the number of vessels in the port may affect the air quality in the area and in the nearest city (Algeciras). Since there have been few studies on air pollution in this strategic area of port activities, this research can make a specific contribution.
Another main contribution of this work is the use of a classification-based machine learning scheme to predict the next level of a pollutant, including an analysis of the most relevant variables (using mRMR) for each of the pollutants and sites studied. In addition, many different classification methods were used and compared. This research has allowed us to develop a procedure for predicting future pollution levels, on an hourly basis for nitrogen oxides (NO2, NOx, and NO) and on a daily basis for SO2 and PM10. The results obtained are suitable for the design of an air pollution forecasting system that can be used by citizens or institutions to support decision making.
The rest of this article is organised as follows: Sect. 2 describes the database, the site, the case study and the regulations to be applied, Sect. 3 presents the methodology including the classification models tested in the study together with the feature selection process and the experimental procedure used to achieve the objectives, Sect. 4 presents and discusses the results and, finally, Sect. 5 draws the main conclusions.
2 Materials
Environmental studies are important in this area because it hosts the Port of Algeciras, which has handled more than 100 million tonnes of goods per year since 2017, and because it has special meteorological and orographic conditions (the Strait of Gibraltar). It is also a highly industrialised region in which the Port of Algeciras coexists with numerous industries (a refinery, several chemical and thermal power plants, a stainless steel factory, etc.), which, together with several motorways and Gibraltar airport, contribute to a very complex air pollution scenario. Maritime traffic in Algeciras has increased dramatically over the last decade. It is reasonable to expect that the increase in the number of vessels in the Port of Algeciras could affect the air quality in the area.
In order to develop this study, the main pollutants related to port activities were selected, as shown in Yang et al. (2022), Yeh et al. (2022), and Mueller et al. (2023). Immission data of SO2, NO2, NOX, NO, and PM10 concentrations and meteorological data (relative humidity, solar radiation, temperature, atmospheric pressure, wind speed, wind direction, and rainfall) were provided through the Andalusian Government's monitoring network, and the vessel gross tonnage (GT) database was provided by the Algeciras Bay Port Authority, all for the years 2017 to 2019. Similar studies, such as López-Aparicio et al. (2017), analysed all these pollutants in a Nordic port and concluded that the main emission contributions come from berthed vessels and manoeuvres.
The Andalusian Government’s system of sensors in the Bay of Algeciras includes a total of sixteen air pollutant monitoring stations and five specialised meteorological sensors (W1-5) distributed throughout the bay (see Fig. 1), which recorded hourly values of each pollutant and meteorological variable over a three-year period, from 1st January 2017 to 31st December 2019 (see Table 1). The meteorological sensors W3, W4, and W5 are located in the chimney of a refinery at different heights: 10 m, 15 m, and 60 m. The data analysed were recorded at stations in the towns of Algeciras and La Línea and in the Alcornocales Park, in order to compare three distant locations. Algeciras and La Línea are important because of their coastal location and the proximity of the large port of Algeciras, with massive truck traffic, whereas Alcornocales Park is an unspoilt area far from anthropogenic activity. In addition, La Línea and Algeciras are two cities located opposite each other, so studying both can shed more light on air pollution immissions. Algeciras is the most populated city in the bay, with 122,982 inhabitants in 2021, and La Línea is the second most populated city, with 63,365 inhabitants. The entire database consists of 131 variables. In each experiment, the output variable is the concentration of one pollutant at one of the monitoring stations, predicted from the rest of the study variables described in Table 1 (pollutant concentrations in the rest of the monitoring stations, meteorological information and vessel data).
This study has been developed in three stages: preprocessing of the data, classification, and feature selection to reduce the number of variables. Among the wide range of feature selection methods, the mRMR method was used in this work to rank the variables considered as inputs. Feature selection, one of the fundamental problems in pattern recognition and machine learning, involves identifying subsets of features that are relevant to the target variable, a criterion usually referred to as maximum relevance. These subsets often contain features that are relevant but redundant, and mRMR attempts to address this problem by eliminating the redundant ones. In this paper, the ten most relevant features were selected as inputs to the different models to test whether the results differ significantly from those obtained when all variables are used.
3 Methodology
The main objective of this work is to predict the future air quality levels of the main maritime pollutants in the Bay of Algeciras as a function of other pollutants, meteorological variables, and vessel data. In order to achieve this objective, the time series were considered according to the limits marked in the European Directive 2008/50/EC (Table 2), and the outputs were transformed into disjoint quartiles (Q1–Q4).
The predictions are calculated using pollutant concentrations in each station (Algeciras, Alcornocales, and La Línea) as outputs and the rest of the variables as inputs (pollutants in other stations, meteorological parameters, and the vessel data). Different classification techniques are compared together with ANN models in order to find improvements and the best model. The performance of the tested models is calculated using hourly and daily mean data time series.
Equation 1 shows mathematically the prediction approach, where \(t\) is the time and \(t+1\) is one step ahead to be predicted. In the case of hourly data, the next 1 h-mean period concentration value is predicted and in the case of daily data, the next day mean concentration value is predicted. Inputs \(\widetilde{x}\left(t\right)\) consist of all other pollutants measured at the monitoring stations together with meteorological variables and vessel time series. The scheme of the process is shown in Fig. 2.
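Equation 1 is not reproduced in this text. A form consistent with the description above is sketched here, where \(f\) (the fitted classifier) and \(\widehat{y}\) (the predicted quartile class of the output pollutant) are symbols assumed for illustration:

\[
\widehat{y}\left(t+1\right) = f\left(\widetilde{x}\left(t\right)\right), \qquad \widehat{y}\left(t+1\right) \in \left\{Q1, Q2, Q3, Q4\right\}
\]

Here one time step corresponds to one hour for hourly data and to one day for daily data.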
Three stages were developed. The first stage is the preprocessing of the data. It comprised the imputation of missing values, using an algorithm previously proposed by the authors (González-Enrique et al. 2019a, 2019b; Rodríguez-García et al. 2022), and the standardisation of the database. The vessel data, given as incoming and outgoing vessels in the bay, were also transformed into hourly data. Once the databases were transformed and unified, the data consisted of a single database of 26,280 hourly records × 131 variables (130 inputs and 1 output). Each row is a record of hourly data for the three years from 2017 to 2019. The database was normalised and the output was divided into disjoint quartiles. The second stage, classification, is described in Sect. 3.1. The third stage is a feature selection procedure using the mRMR approach proposed by Peng et al. (2005), which is a feature selection algorithm that ranks a set of features according to their relevance to the target variable. It also penalises redundant features. The best features are those with the best trade-off between maximum relevance to the target variable and minimum redundancy with the remaining features.
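The two preprocessing transformations mentioned above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, z-score scaling is assumed for the standardisation, and the quartile cut points are taken from the empirical distribution of the series.

```python
import statistics

def quartile_labels(values):
    """Label each value 1..4 according to the quartile it falls in."""
    # statistics.quantiles with n=4 returns the three cut points
    # (first quartile, median, third quartile).
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    labels = []
    for v in values:
        if v <= q1:
            labels.append(1)
        elif v <= q2:
            labels.append(2)
        elif v <= q3:
            labels.append(3)
        else:
            labels.append(4)
    return labels

def standardise(values):
    """Z-score scaling (assumed): zero mean and unit standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]
```

With this labelling, the four classes are disjoint and each holds roughly 25% of the records, apart from ties at the cut points.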
3.1 Classification
In this stage, 29 classification models (Table 3) were tested to select the best classifier. Classification is a type of supervised machine learning where an algorithm learns to classify new observations from labelled data samples. In this work, the database is labelled in quartiles, as shown in Table 3. The different classification schemes are briefly explained below.
3.1.1 Trees
Decision trees are hierarchical, non-parametric supervised learning algorithms consisting of a root node, branches, internal nodes, and leaf nodes. They can be used for both classification and regression tasks (Breiman et al. 1984). Three types of trees were used, depending on the maximum number of splits (100, 20, 4). With a maximum of 100 splits, many leaves are available to make fine distinctions between classes. With a maximum of 4 splits, only coarse distinctions between classes can be made.
3.1.2 Discriminant analysis
Discriminant analysis is a statistical transformation technique that produces a function capable of classifying phenomena (Fisher 1936). The objective is to maximise the between-group variance and minimise the within-group variance through linear (or quadratic) combinations of the features. The procedure consists of finding the eigenvalues and eigenvectors of a matrix formed as the quotient of the between-class scatter matrix and the within-class scatter matrix. For linear discriminant analysis, the model has the same covariance matrix for each class; only the means vary. For quadratic discriminant analysis, both the means and the covariances of each class vary.
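In the standard formulation of Fisher's linear discriminant, the objective described above can be written as follows. The symbols \(S_B\) (between-class scatter matrix), \(S_W\) (within-class scatter matrix), and \(w\) (projection direction) are introduced here for illustration and do not appear in the original text:

\[
J\left(w\right) = \frac{w^{T} S_{B}\, w}{w^{T} S_{W}\, w}, \qquad S_{W}^{-1} S_{B}\, w = \lambda\, w
\]

The directions \(w\) that maximise \(J\) are the eigenvectors of \(S_{W}^{-1} S_{B}\) associated with the largest eigenvalues \(\lambda\).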
3.1.3 Naïve Bayes
Naïve Bayes models assume that observations follow a multivariate distribution given class membership, but that the predictors or features that make up the observation are conditionally independent given the class. This framework can accommodate a full set of features, so that an observation is a set of multinomial counts (Mitchell 1997). The normal (Gaussian) distribution is appropriate for predictors that are normally distributed within each class. In that case, the naïve Bayes classifier estimates a separate normal distribution for each class by calculating the mean and standard deviation of the training data in that class. The kernel distribution is suitable for predictors that have a continuous distribution. It does not require a strong assumption such as normality, and it can be used in cases where the distribution of a predictor may be skewed or have multiple peaks or modes.
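The Gaussian variant described above can be sketched in a few lines. This is an illustrative sketch with hypothetical function names, not the implementation used in the study: it estimates a prior, a mean, and a standard deviation per class and feature, and predicts the class with the highest log-posterior.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate, per class, the prior and the per-feature mean and std."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor on the std avoids division by zero for constant features.
        stds = [max(math.sqrt(sum((v - m) ** 2 for v in col) / n), 1e-9)
                for col, m in zip(columns, means)]
        model[label] = (n / len(X), means, stds)
    return model

def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one observation."""
    def log_pdf(v, m, s):
        return -0.5 * math.log(2 * math.pi * s * s) - (v - m) ** 2 / (2 * s * s)
    scores = {}
    for label, (prior, means, stds) in model.items():
        # Independence assumption: per-feature log-likelihoods are summed.
        scores[label] = math.log(prior) + sum(
            log_pdf(v, m, s) for v, m, s in zip(row, means, stds))
    return max(scores, key=scores.get)
```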
3.1.4 Support Vector Machines (SVMs)
The goal of an SVM is to find the hyperplane that best separates two classes of data points, that is, the one with the widest margin between the two classes. The algorithm can only find such a hyperplane in problems that allow linear separation; in most practical problems, the algorithm maximises a soft margin by allowing a small number of misclassifications. The support vectors are a subset of the training observations that identify the location of the separating hyperplane. SVMs can use a kernel function to transform the features. Kernel functions map the data into a different, usually higher dimensional space, with the expectation that it will be easier to separate the classes after this transformation (Vapnik and Chervonenkis 1971; Cortes and Vapnik 1995). The types tested are Linear SVM (which makes a simple linear separation between classes), Quadratic SVM, Cubic SVM, and three categories of Gaussian SVM (fine, with kernel scale set to \(\sqrt{P}/4\); medium, with kernel scale set to \(\sqrt{P}\); and coarse, with kernel scale set to \(\sqrt{P}\cdot 4\), where P is the number of predictors).
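To make the three Gaussian kernel scales concrete, the sketch below computes them for a given number of predictors \(P\) and evaluates a Gaussian kernel. The function names are hypothetical, and the convention that the kernel is \(\exp(-\lVert x - z\rVert^2 / s^2)\) for scale \(s\) is an assumption made for illustration; the original text does not state the exact form.

```python
import math

def gaussian_kernel_scales(num_predictors):
    """Fine, medium, and coarse kernel scales as defined in the text."""
    root = math.sqrt(num_predictors)
    return {"fine": root / 4, "medium": root, "coarse": root * 4}

def gaussian_kernel(x, z, scale):
    """Gaussian kernel with the assumed convention exp(-||x - z||^2 / scale^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / scale ** 2)
```

A smaller scale makes the kernel decay faster with distance, which yields a more flexible decision boundary; a larger scale yields a smoother one.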
3.1.5 KNN
The k-nearest neighbour algorithm, also known as KNN or k-NN, is a non-parametric supervised learning classifier that uses proximity to make classifications or predictions about the grouping of an individual data point. While it can be used for regression or classification problems, it is generally used as a classification algorithm, based on the assumption that similar points will be found close together. Usually, the number k is an odd number (1, 3, 5, …) (Silverman and Jones 1989). The types of KNN tested were Fine KNN (the number of neighbours is set to 1), Medium KNN (the number of neighbours is set to 10), Coarse KNN (the number of neighbours is set to 100), Cosine KNN, using a cosine distance metric (the number of neighbours is set to 10), Cubic KNN, using a cubic distance metric (the number of neighbours is set to 10), and Weighted KNN, using a distance weight (the number of neighbours is set to 10).
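A minimal sketch of the voting rule follows, assuming Euclidean distance. The function name is hypothetical, and the inverse squared distance used for the weighted variant is an assumption; the text only says that a distance weight is used.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=10, weighted=False):
    """Classify `query` by a (optionally distance-weighted) vote of its k nearest points."""
    # Sort training points by Euclidean distance to the query.
    neighbours = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y))
    votes = Counter()
    for distance, label in neighbours[:k]:
        # Weighted vote: closer neighbours count more (assumed 1 / d^2 weighting).
        votes[label] += 1.0 / (distance ** 2 + 1e-12) if weighted else 1.0
    return votes.most_common(1)[0][0]
```

In this sketch, `k=1` corresponds to the fine variant, `k=10` to the medium variant, and `k=100` to the coarse variant.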
3.1.6 Ensemble learning
Classification ensemble learning uses multiple learning algorithms to obtain a better predictive model, which is a weighted combination of several classification models. In general, the combination of several classification models increases the predictive power. The types of ensembles tested were: Subspace with discriminant learners, Subspace with nearest neighbour learners, and RUSBoost, Random Forest Bag, and AdaBoost, with decision tree learners (Breiman 1996, 2001; Hastie et al. 2008; Freund 2009).
3.1.7 Artificial neural networks (ANNs)
ANNs were also included in the second stage. A feedforward fully connected ANN can fit multidimensional mapping problems arbitrarily well, given consistent data and enough neurons in its hidden layer (Hornik et al. 1989). The authors have successfully used ANNs in similar prediction problems (González-Enrique et al. 2019b; Ruiz-Aguilar et al. 2020; Moscoso-López et al. 2022). ANNs were trained with the backpropagation algorithm (Rumelhart et al. 1986) using the Levenberg–Marquardt optimisation procedure. Finally, the obtained results were statistically analysed and compared using a resampling procedure in order to select the model with the best generalisation capabilities. ANN models with different numbers of hidden units were compared to determine the effect of adding non-linear processing capabilities on model performance. Each model is a feedforward fully connected neural network with a different number of fully connected layers and hidden units. A ReLU activation function was used in each model. The rectified linear activation function, or ReLU, is a piecewise linear (and therefore non-linear) function that outputs the input directly if it is positive and outputs zero otherwise (Glorot et al. 2011). It has been among the most commonly used activation functions in neural networks (Ramachandran et al. 2017). The types of ANNs tested were: one hidden layer with 10, 25, and 100 neurons; two hidden layers with 10 \(\times\) 10 neurons; and three hidden layers with 10 \(\times\) 10 \(\times\) 10 neurons.
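The forward pass of such a network can be sketched as below. The function names and the representation of a layer as a `(weights, biases)` pair are assumptions made for illustration; training (for example with Levenberg–Marquardt) is not covered by this sketch.

```python
def relu(values):
    """ReLU: pass positive inputs through unchanged, map the rest to zero."""
    return [max(0.0, v) for v in values]

def dense(values, weights, biases):
    """Fully connected layer: one weight row and one bias per output unit."""
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]

def forward(values, layers):
    """Apply ReLU after every hidden layer and return raw scores at the output."""
    for weights, biases in layers[:-1]:
        values = relu(dense(values, weights, biases))
    weights, biases = layers[-1]
    return dense(values, weights, biases)
```

For a four-class problem, the output layer would have four units, and the predicted quartile would be the index of the largest score.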
3.2 Feature selection
The third stage is a feature selection procedure. The Minimum Redundancy Maximum Relevance (mRMR) approach (Peng et al. 2005) is a feature selection algorithm that ranks a set of features according to their relevance to the target variable. It also penalises redundant features. The best features are those with the highest trade-off between maximum relevance with the target variable and minimum redundancy with the remaining features.
Among the wide range of feature selection methods, the mRMR method has been used in this work to rank the variables considered as inputs. This method has been successfully used by the authors in other studies related to air pollution (González-Enrique et al. 2021). Feature selection, one of the fundamental problems in pattern recognition and machine learning, involves identifying subsets of features that are relevant to the target variable, a criterion usually referred to as maximum relevance. These subsets often contain features that are relevant but redundant, and mRMR attempts to address this problem by eliminating the redundant ones. In this paper, the ten most relevant features were selected as inputs to the different models to test whether the results differ significantly from those obtained when all variables are used.
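The ranking idea can be illustrated with a greedy sketch of mRMR in its difference form (relevance minus mean redundancy), using mutual information between discrete variables. This is a simplified illustration, not the authors' implementation: the function names are hypothetical, and continuous inputs are assumed to have been discretised beforehand.

```python
import math
from collections import Counter

def mutual_information(a, b):
    """Mutual information (in nats) between two equal-length discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum((c / n) * math.log(c * n / (count_a[x] * count_b[y]))
               for (x, y), c in count_ab.items())

def mrmr_rank(features, target, k):
    """Greedily pick k feature names; `features` maps a name to a discrete sequence."""
    relevance = {name: mutual_information(col, target)
                 for name, col in features.items()}
    selected, remaining = [], set(features)

    def score(name):
        if not selected:
            return relevance[name]
        # Redundancy: mean mutual information with the already selected features.
        redundancy = sum(mutual_information(features[name], features[s])
                         for s in selected) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        selected.append(best)
        remaining.remove(best)
    return selected
```

In this sketch a feature that duplicates an already selected one receives a score near zero, so an informative and non-redundant feature is preferred over it.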
3.3 Experimental procedure
A resampling procedure was used to reduce the prediction error on the test set and to reduce the effects of overfitting. The strategy randomly divided the database into three parts (training 70%, validation 10%, and test 20%), and the performance results were collected only for the test set in order to estimate the generalisation error of each model using unseen data, as the authors have successfully done in other papers (Turias et al. 2008; González-Enrique et al. 2019a; Ruiz-Aguilar et al. 2020; Moscoso-López et al. 2022). In this research, all of the simulations were developed and tested in MATLAB.
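A repeated random partition of this kind can be sketched as follows. The function names, the use of a seed per repetition, and the default of 20 repetitions (taken from the experimental description) are illustrative assumptions; the original experiments were run in MATLAB.

```python
import random

def split_indices(n, seed, train=0.7, val=0.1):
    """Randomly split range(n) into training, validation, and test index lists."""
    rng = random.Random(seed)
    indices = list(range(n))
    rng.shuffle(indices)
    n_train = int(round(n * train))
    n_val = int(round(n * val))
    # Whatever remains after training and validation forms the test set.
    return (indices[:n_train],
            indices[n_train:n_train + n_val],
            indices[n_train + n_val:])

def repeated_splits(n, repetitions=20):
    """One independent random partition per repetition."""
    return [split_indices(n, seed) for seed in range(repetitions)]
```

Quality indexes would then be computed on the test indices of each repetition and averaged over the repetitions.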
The whole system can be seen as a mapping from a set of input features to an output variable. The mathematical form of the mapping is determined by the data (training set). The system must, of course, be capable of making good predictions on unseen data. In order to measure this generalisation ability, cross-validation is used with another set of samples (test set). We adopted five-fold cross-validation to select the best model based on the generalisation performance of each model. The available data were divided into three different groups (training, validation, and test sets). The parameters of each model were estimated using one of the groups (the training set). The validation set is used for early stopping and to avoid overfitting. Finally, the test set is used to compute the classification quality indexes (sensitivity, specificity, accuracy, and precision), simulating the real performance of the model. This process is repeated 20 times and the results are averaged over these runs. To visualise the results obtained with a classification model, the confusion matrix is used (Ting 2010). Each row (i) of the matrix (C) corresponds to a predicted class and each column (j) to a real class, so that C(i,j) is the number of samples predicted as class i whose real class is j. In this case, four classes are considered, one for each of the quartiles of the output. Once an air pollutant has been considered, its values are divided into four quartiles, each containing 25% of the total distribution. The confusion matrix is calculated and then the quality indexes of sensitivity, specificity, accuracy, and precision are also calculated. The Euclidean distance (d1) to a perfect classifier in terms of the quality indexes (sensitivity = 1, specificity = 1, accuracy = 1, precision = 1) is also calculated (expressed by Eq. 2).
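Equation 2 is not reproduced in this text. From the definition just given, the Euclidean distance to the perfect classifier (1, 1, 1, 1) takes the following form:

\[
d_{1} = \sqrt{\left(1-\mathrm{sensitivity}\right)^{2} + \left(1-\mathrm{specificity}\right)^{2} + \left(1-\mathrm{accuracy}\right)^{2} + \left(1-\mathrm{precision}\right)^{2}}
\]

A perfect classifier gives \(d_{1} = 0\), and the largest possible value, when all four indexes are zero, is \(d_{1} = 2\).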
In this case, the confusion matrix has a 4 × 4 dimension because the data are divided into four disjoint quartiles (classes). In order to obtain individual classification results for each quartile, the matrix was sequentially transformed, quartile by quartile, into an equivalent 2 × 2 confusion matrix (Table 4), which was used to calculate the well-known classification measures mentioned above (sensitivity, specificity, accuracy, and precision, see Eqs. 3–6). The lowest d1 distance indicates the best classification model for each quartile. Quartiles are the statistical values that divide the dataset into four equal parts or quarters, each containing 25% of the data, resulting in lower, lower-middle, upper-middle, and upper divisions.
True-positive (TP) and true-negative (TN) results are correctly classified, while false-negative (FN) and false-positive (FP) results are two types of errors calculated according to the literature (Ting 2010).
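The reduction of the 4 × 4 matrix to a 2 × 2 matrix for one quartile, and the indexes derived from it, can be sketched as follows. The function names are hypothetical; the index formulas are the standard definitions in terms of TP, TN, FP, and FN, and the row and column convention (rows are predicted classes, columns are real classes) follows the description in the text.

```python
import math

def one_vs_rest(C, k):
    """Collapse a square confusion matrix into (TP, FP, FN, TN) for class index k.
    Rows are predicted classes and columns are real classes."""
    total = sum(sum(row) for row in C)
    tp = C[k][k]
    fp = sum(C[k]) - tp                  # predicted k, real class differs
    fn = sum(row[k] for row in C) - tp   # real class k, predicted otherwise
    tn = total - tp - fp - fn
    return tp, fp, fn, tn

def quality_indexes(tp, fp, fn, tn):
    """Return (sensitivity, specificity, accuracy, precision)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)
    return sensitivity, specificity, accuracy, precision

def d1(indexes):
    """Euclidean distance from the four indexes to the perfect classifier (1, 1, 1, 1)."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in indexes))
```

Calling `one_vs_rest` once per quartile (k = 0 to 3) gives the per-quartile results, and the model with the lowest `d1` would be preferred for that quartile.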
All the calculations are performed separately for each air pollutant (SO2, PM10, NO2, NOX, and NO) as output at each of the three locations (Algeciras, Alcornocales, and La Línea), using either all variables or only the ten most relevant variables. This gives a total of 30 scenarios, each repeated 20 times following the resampling procedure explained above. The time series of SO2 and PM10 concentrations are calculated as daily averages and those of NO2, NOX, and NO on an hourly basis. The results of these experiments are presented in the next section.
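The count of 30 scenarios follows from 5 pollutants × 3 locations × 2 input sets. A short sketch of the enumeration, with hypothetical names for the constants and the function, makes the experimental grid explicit:

```python
from itertools import product

POLLUTANTS = ["SO2", "PM10", "NO2", "NOX", "NO"]
SITES = ["Algeciras", "Alcornocales", "La Línea"]
INPUT_SETS = ["all variables", "ten most relevant variables"]
DAILY_POLLUTANTS = {"SO2", "PM10"}  # the remaining pollutants are hourly

def scenarios():
    """Enumerate every pollutant, site, and input-set combination."""
    return [
        {
            "pollutant": pollutant,
            "site": site,
            "inputs": inputs,
            "resolution": "daily" if pollutant in DAILY_POLLUTANTS else "hourly",
        }
        for pollutant, site, inputs in product(POLLUTANTS, SITES, INPUT_SETS)
    ]
```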
4 Results
Simulations and prediction experiments were computed for five pollutants directly related to maritime traffic: SO2, PM10, NO2, NOX, and NO. The models were tested at three different locations: the cities of Algeciras and La Línea, and a third location some distance away in the remote area of the Alcornocales Park. As indicated in Table 2, the averages were calculated hourly or daily to comply with the European Directive 2008/50/EC. Figures 3 and 4 show the time series, in µg/m3, of the pollutants analysed on an hourly or daily basis according to the Directive, together with their upper assessment thresholds. These graphs show average concentrations. It is worth noting that in 2017 the average SO2 concentrations in La Línea, where a refinery is located, are very high compared to the rest of the years, which seems to be due to the installation of a desulphurisation unit in this refinery in 2018. Considering particulate matter, the lowest concentrations are found at the Alcornocales station and the highest at Algeciras, although overall concentrations are very similar in Algeciras and La Línea. On the other hand, NO2 (and nitrogen oxides in general) clearly shows very high average concentrations in Algeciras compared to La Línea and Alcornocales, which are quite similar. This difference could be an indication of the high presence of diesel engines in Algeciras, which is consistent with the heavy truck traffic in and out of the port, the ships berthed in the Port of Algeciras and the higher traffic density, since it is the most densely populated city in the Bay.
Since the pollutant thresholds are defined in the regulations in terms of hourly and daily values, and in order to better understand the behaviour of each pollutant, weekly average graphs of each air pollutant at the different stations have been calculated (Fig. 5). SO2 shows a higher concentration in La Línea, probably because the prevailing winds carry the pollution from ships and the surrounding industries towards La Línea (westerly situations). In easterly situations, SO2 seems to move towards Los Alcornocales, the remote area 30 km from the bay, which paradoxically has a higher concentration than Algeciras. For PM10, concentrations decrease during the night and, from the early hours of the morning, when anthropogenic activity begins, increase until late in the day. Weekends differ little from the rest of the week. For nitrogen oxides, there is a daily decrease in the early hours of the morning, then an increase to a maximum around midday, followed by a downward trend that slows around mid-afternoon. This pattern coincides with the pace of human activity and therefore of traffic, especially vehicle traffic. It is more pronounced in the cities of La Línea and Algeciras; at the remote Los Alcornocales station there is only a slight increase at midday. In terms of daily averages, maximum values are observed on Tuesdays and Fridays, with a significant decrease at weekends. NO2 is higher in Algeciras than in La Línea, probably due to road traffic. Nitrogen oxides have two peaks per day, which suggests that they are related to human activity, and especially to diesel engines, whereas particulate matter and SO2 have only one peak per day.
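One way to build weekly profiles of this kind is to average the hourly records by weekday and hour. The following is a minimal sketch in Python; the function name `weekly_profile` and the synthetic records are assumptions for illustration and do not reproduce the measured data behind Fig. 5.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def weekly_profile(records):
    """Average concentration for each (weekday, hour) slot; Monday is weekday 0."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in records:
        slot = (timestamp.weekday(), timestamp.hour)
        sums[slot] += value
        counts[slot] += 1
    return {slot: sums[slot] / counts[slot] for slot in sums}


# Synthetic two-week hourly series (illustrative only): the value equals the hour.
start = datetime(2017, 1, 2)  # a Monday
records = [(start + timedelta(hours=h), float(h % 24)) for h in range(24 * 14)]
profile = weekly_profile(records)
print(len(profile), "weekly slots; Monday 12:00 mean =", profile[(0, 12)])
```

The resulting dictionary has one mean per hour of the week, so weekday and weekend patterns can be compared slot by slot.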
As explained above, one-step-ahead prediction models have been developed to predict the next value of a time series of concentration quartiles, so that the predictions can be contrasted with exceedances of the thresholds set in the Directive. Several classification models, including ANNs, were tested and their performance compared using the resampling procedure explained in Sect. 3. In each case, two experiments were run: one using all variables as inputs and another using only the ten most relevant variables. The results shown in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 are always calculated on test sets (unseen data). In general, the results are quite adequate, with high values for the classification quality indexes. Values of around 90% indicate that the prediction of the next daily or hourly mean is accurate and reliable. The results are reported separately for each quartile to give a more detailed picture of the prediction.
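The one-step-ahead classification target can be illustrated with a short Python sketch. It assumes that the quartile boundaries are taken from the empirical distribution of the series itself, which may differ from the exact definition used in the study; the helper names and the toy concentrations are hypothetical.

```python
from bisect import bisect_left
from statistics import quantiles


def quartile_labels(series):
    """Map each concentration to 1..4 according to the series' own quartiles."""
    cuts = quantiles(series, n=4)  # three cut points separating Q1..Q4
    # A value equal to a cut point is assigned to the lower quartile.
    return [bisect_left(cuts, x) + 1 for x in series]


def one_step_ahead_pairs(inputs, labels):
    """Pair the input vector at time t with the quartile label at t + 1."""
    return [(inputs[t], labels[t + 1]) for t in range(len(labels) - 1)]


# Toy concentrations (illustrative values, not measurements from the study).
series = [4.0, 7.5, 3.2, 9.1, 12.4, 6.3, 5.0, 10.8]
labels = quartile_labels(series)
pairs = one_step_ahead_pairs([[x] for x in series], labels)
print(labels)
print(pairs[0])
```

In a full model the input vector at time t would also contain the other pollutants, the meteorological variables, and the vessel data, rather than the single value used here.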
In Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14, the best prediction model for each air pollutant, location, and quartile (the combination with the smallest distance d1) is shaded and underlined. Tables 5, 7, 9, 11, and 13 present the results of the models using all variables, and Tables 6, 8, 10, 12, and 14 those of the models using only the ten most relevant variables; between each pair of tables, the better counterpart (same model, location, quality index, and quartile) is shown in bold. The distance d1 was used to compare models at the same location and to select the best model in each case.
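The selection criterion can be sketched in Python as follows. The sketch assumes that d1 is the Euclidean distance between the vector (sensitivity, specificity, accuracy, precision) and the perfect classifier (1, 1, 1, 1); the function names and the confusion-matrix counts are hypothetical and are not taken from the tables.

```python
from math import sqrt


def quality_indexes(tp, fp, tn, fn):
    """Sensitivity, specificity, accuracy, and precision for one quartile (one-vs-rest)."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    precision = tp / (tp + fp)
    return sensitivity, specificity, accuracy, precision


def distance_to_perfect(indexes):
    """Assumed Euclidean distance from the indexes to the perfect classifier (1, 1, 1, 1)."""
    return sqrt(sum((1.0 - value) ** 2 for value in indexes))


# Hypothetical confusion counts for two candidate models on the same test set.
model_a = quality_indexes(tp=90, fp=10, tn=280, fn=20)
model_b = quality_indexes(tp=70, fp=30, tn=260, fn=40)
candidates = {"model_a": model_a, "model_b": model_b}
best = min(candidates, key=lambda name: distance_to_perfect(candidates[name]))
print(best, round(distance_to_perfect(candidates[best]), 3))
```

A single scalar such as d1 makes it possible to rank models that trade sensitivity against specificity, at the cost of weighting the four indexes equally.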
Comparing Table 5 (models using all variables) with Table 6 (models using only the relevant variables) for SO2, it can be seen that the ANN models significantly improve their prediction performance in all cases when only the relevant variables are used, although the tree classifiers generally predict better than the ANNs, as their distance d1 is smaller. For this pollutant, tree-based classifiers (single trees or ensembles) are the best predictors in almost all cases. For SO2 in Algeciras, quartiles Q1 and Q2 are best predicted by the tree classifiers using only the ten most relevant variables, quartile Q3 is best predicted by tree classifiers using all 130 variables, and Q4 is best predicted by ANN models using the ten most relevant variables. The best performing quartile (with the lowest d1) in Algeciras is Q1, with sensitivity, specificity, accuracy, and precision above 0.90. For SO2 in Alcornocales, quartiles Q1 and Q2 are best predicted by tree-type classifiers using only the ten most relevant variables, and quartiles Q3 and Q4 by ensemble classifiers using all variables. The best results for Alcornocales are obtained for Q1, Q3, and Q4, all with values up to 0.97. For SO2 in La Línea, quartiles Q1 and Q4 are best predicted by tree classifiers using the relevant variables, and Q2 and Q3 by ensemble classifiers using all variables. In La Línea, Q4 gives the best results, with quality indexes above 0.98.
For PM10, Tables 7 and 8 show that the use of relevant variables improves the results of the classification models in almost all cases. The best models for each quartile, shaded and underlined regardless of whether they use all variables or only the relevant ones, turn out to be the trees, with slightly better results than the ANN models. The best PM10 models reach very high quality indexes. For instance, in Algeciras, Q1 is the best-predicted quartile, with tree classifiers using the relevant variables obtaining quality indexes above 0.95. In Alcornocales, Q4 is the best-predicted quartile, also with tree classifiers using the ten most relevant variables, with quality indexes above 0.97. In La Línea, the best-predicted quartile is Q4, with quality indexes up to 0.96.
The results for NO2 are shown in Tables 9 and 10. In this case, the relevant variables give better results only for Alcornocales and the quartile Q1 of La Línea. Tree-type models are also the best predictors for Alcornocales and La Línea, especially when all the variables are used, and only for quartiles Q3 and Q4 of Alcornocales do neural network models perform better when the relevant variables are used. In the case of Algeciras, all quartiles are predicted equally well by SVM classifiers using all variables. For NO2, the best results are obtained in the case of quartile Q1 in Algeciras with all quality indexes above 0.82, Q1 in Alcornocales with ensemble models using all variables and quality indexes above 0.82, and Q1 in La Línea with quality indexes above 0.82 with ensemble tree classifiers using the top ten variables.
In the case of NOX (Tables 11 and 12), reasonably equivalent behaviour is observed between models using all variables and models using only the relevant variables. With fewer but more relevant variables, a large number of models improve their overall performance. In La Línea, the best models are ANNs using the relevant variables for quartiles Q2 to Q4. In Algeciras, ANNs perform similarly well for quartiles Q2 and Q4, and in Alcornocales for Q1. The rest of the best models use all variables and correspond to SVM and ensembles. The values are somewhat lower than for other pollutants, with sensitivities above 80% and specificities above 93%. In the case of NO in Alcornocales (Tables 13, 14), no values were obtained for Q2 because most of the NO data available for this station are so low that they fall mostly in Q1, apart from some exceedance peaks in quartiles Q3 and Q4. For NO, ANNs using the relevant variables seem to be the models that best predict the quartiles. In fact, the best result is obtained for quartile Q1 in Alcornocales, with more than 94% precision.
Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 show that ensemble boosted trees and tree classifiers produce better results than ANN models in most cases, but ANNs improve considerably when the number of variables is reduced to the ten most relevant. Tables 15, 16, 17, 18, and 19 list the ten most relevant variables, ranked with the mRMR method, for each pollutant (SO2, PM10, NO2, NOX, and NO) and monitoring station (Algeciras, La Línea, and Alcornocales). Using only these variables, prediction results similar to those of the models using all the variables (Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14) were obtained. Therefore, a more efficient monitoring system could be designed around these ten variables: with fewer variables to measure, store, and transmit, the sensor network would save money and time and would be more energy sustainable, with a lower carbon footprint. In Tables 15, 16, 17, 18, and 19, the meteorological variables for each pollutant are marked in yellow, and the other relevant pollutants, different from those analysed and repeated in at least two stations, are marked in other colours. As expected, each pollutant's own time series (SO2(t), PM10(t), NO2(t), NOX(t), and NO(t)) always appears among the relevant variables. For instance, Table 15 shows that the most relevant meteorological variables for SO2 are wind direction (WD) and rainfall (RF), and that, as expected, O3 and nitrogen oxides are the most relevant air pollutants. Table 16 shows the relevant variables for PM10: the most relevant meteorological variables are wind speed (WS), rainfall (RF), and relative pressure (RP), and the most relevant pollutants are nitrogen oxides.
Similarly, Table 17 indicates that, for NO2, the most relevant meteorological variables are related to wind (wind direction (WD) and wind speed (WS)) and rainfall (RF), and the most relevant pollutants are particulate matter (PM10 and PM2.5), O3, and SO2. Table 18, for NOX, shows the same relevant meteorological variables as for NO2 plus relative humidity (RH), and the same relevant pollutants except SO2. For NO, Table 19 shows that the relevant meteorological variables are related to wind (wind direction (WD) and wind speed (WS)), solar radiation (SR), and rainfall (RF).
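The idea behind an mRMR ranking (Peng et al. 2005) can be illustrated with a small greedy implementation in Python. This is only a sketch under stated assumptions: it uses the difference form of the criterion (relevance minus mean redundancy) on already discretised variables, which may not match the variant or the implementation used in the study, and the feature names and values are toy data.

```python
from collections import Counter
from math import log


def mutual_information(x, y):
    """Mutual information (in nats) between two discrete sequences of equal length."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum((c / n) * log(c * n / (px[a] * py[b])) for (a, b), c in pxy.items())


def mrmr(features, target, k):
    """Greedy mRMR ranking: maximise relevance minus mean redundancy at each step."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected = []
    while len(selected) < min(k, len(features)):
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(mutual_information(features[name], features[s]) for s in selected)
            return relevance[name] - redundancy / len(selected)

        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected


# Toy discretised data: the target class is fully determined by "a" and "b".
features = {
    "a": [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # redundant duplicate of "a"
    "b": [0, 0, 1, 1, 0, 0, 1, 1],
    "noise": [0, 1, 0, 1, 0, 1, 0, 1],   # independent of the target
}
target = [0, 0, 1, 1, 2, 2, 3, 3]
print(mrmr(features, target, k=2))
```

In this toy case the duplicate of an already selected variable is penalised for redundancy, so the ranking prefers the complementary variable, which is the behaviour that motivates using mRMR to obtain compact input sets.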
The best models for each pollutant and location for the fourth quartile are shown in italics. Results are given for all quartiles, but we assume that the fourth quartile is the most important for prediction as it represents the most dangerous concentration levels.
5 Conclusions
In this work, an experimental procedure using a resampling strategy with five-fold cross-validation allowed the statistical comparison of the different classification models tested. The proposed approach is based on classification modelling, since the desired output is the next level (quartile at t + 1) of an air pollutant as a function of the other variables at a given time t. Two approaches have been used, one with hourly mean data for nitrogen oxides (NO2, NOX, and NO) and another one with daily mean data (for SO2 and PM10), due to the thresholds established in the European Directive 2008/50/EC, in order to obtain more reliable information in the study area. The approaches were developed in three different and separate locations: the main city of Algeciras, the city of La Línea, and the unspoilt remote area of Alcornocales, in order to contrast them and obtain more details on the behaviour of the air pollutants.
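The resampling strategy summarised above can be sketched as a repeated five-fold cross-validation in Python. The sketch assumes random shuffling of the sample indices and 20 repetitions per scenario; the function name `repeated_kfold` is hypothetical, and the way the original procedure handles the temporal ordering of the records is not reproduced here.

```python
import random


def repeated_kfold(n_samples, k=5, repetitions=20, seed=0):
    """Yield (repetition, fold, train_idx, test_idx) for shuffled k-fold splits."""
    rng = random.Random(seed)
    for rep in range(repetitions):
        order = list(range(n_samples))
        rng.shuffle(order)
        folds = [order[i::k] for i in range(k)]  # k disjoint folds covering all samples
        for f, test_idx in enumerate(folds):
            train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
            yield rep, f, train_idx, test_idx


splits = list(repeated_kfold(n_samples=50))
print(len(splits), "train/test splits")
```

Each sample is used for testing exactly once per repetition, so the quality indexes averaged over the splits are always computed on unseen data.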
The main conclusions of this study are as follows:
- Classification models can provide very good air quality predictions, with quality indexes of around 90% in most cases.
- The use of the ten most relevant variables improves the results in most cases.
- Ensemble boosted trees, SVM, trees, and ANN classifiers tend to be the best prediction models.
- The results obtained with ANNs are always improved by reducing the number of variables to the ten most relevant ones.
- Variable selection methods can be used to rank the importance of the input variables.
- By selecting fewer variables, it is possible to design a more energy sustainable system with a lower carbon footprint.
- All forecasts can be useful to citizens, institutions, businesses in the port area, and the cities surrounding the port.
- There is a background level (averages that are constantly repeated) that does not provide useful or accurate information about the ships. The data therefore suggest that more sensors are needed close to the dock area where the ships are located in order to deduce the direct effect of pollutants coming from the ships.
The logistical activity of a port has an impact on air quality. It is therefore necessary to implement predictive models that provide reliable forecasts and help citizens, companies, and institutions make decisions and drive policy changes to ensure a healthier and cleaner environment for present and future generations.
References
Adeyemi A, Molnar P, Boman J, Wichmann J (2022) Particulate matter (PM2.5) characterization, air quality level and origin of air masses in an urban background in Pretoria. Arch Environ Contam Toxicol 83(1):77–94. https://doi.org/10.1007/s00244-022-00937-4
Bai L, Wang J, Ma X, Lu H (2018) Air pollution forecasts: an overview. Int J Environ Res Public Health 15(4):780. https://doi.org/10.3390/ijerph15040780
Baklanov A, Zhang Y (2020) Advances in air quality modeling and forecasting. Global Transitions 2:261–270. https://doi.org/10.1016/j.glt.2020.11.001
Bishop CM (2006) Pattern Recognition and Machine Learning. Springer, Berlin
Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Routledge, p 368. ISBN 978-0-412-04841-8. https://doi.org/10.1201/9781315139470
Breiman L (1996) Bagging predictors. Mach Learn 26:123–140
Breiman L (2001) Random forests. Mach Learn 45:5–32
Corani G, Scanagatta M (2016) Air pollution prediction via multi-label classification. Environ Model Softw 80:259–264
Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20:273–297. https://doi.org/10.1007/BF00994018
Durán-Grados V, Rodríguez-Moreno R, Calderay-Cayetano F, Amado-Sánchez Y, Pájaro-Velázquez E, Nunes RAO, Alvim-Ferraz M, Sousa S, Moreno-Gutiérrez J (2022) The influence of emissions from maritime transport on air quality in the Strait of Gibraltar (Spain). Sustainability 14(19):12507. https://doi.org/10.3390/su141912507
Durão RM, Mendes MT, Pereira JM (2016) Forecasting O3 levels in industrial area surroundings up to 24 h in advance, combining classification trees and MLP models. Atmos Pollut Res 7(6):961–970
Ekmekçioğlu AS, Levent K, Ünlügençoğlu K, Çelebi UB (2020) Assessment of shipping emission factors through monitoring and modelling studies. Sci Total Environ 743:140742. https://doi.org/10.1016/j.scitotenv.2020.140742
EU (2008) Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on ambient air quality and cleaner air for Europe
Fameli KM, Kotrikla AM, Psanis C, Biskos G, Polydoropoulou A (2020) Estimation of the emissions by transport in two port cities of the northeastern Mediterranean, Greece. Environ Pollut 257:113598. https://doi.org/10.1016/j.envpol.2019.113598
Fernando HJSF, Mammarella MC, Grandoni G, Fedele P, Marco RD, Dimitrova R, Hyde P (2012) Forecasting PM10 in metropolitan areas: efficacy of neural networks. Environ Pollut 163:62–67. https://doi.org/10.1016/J.ENVPOL.2011.12.018
Fisher RA (1936) The use of multiple measurements in taxonomic problems. Ann Eugen 7(2):179–188. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x
Freund Y (2009) A more robust boosting algorithm. Vol. 1. https://doi.org/10.48550/arXiv.0905.2138
García-Nieto PJ, Álvarez Antón JC, Vilán Vilán JA, García-Gonzalo E (2015) Air quality modeling in the Oviedo urban area (NW Spain) by using multivariate adaptive regression splines. Environ Sci Pollut Res 22:6642–6659. https://doi.org/10.1007/s11356-014-3800-0
García-Nieto PJ, Sánchez Lasheras F, García-Gonzalo E, de Cos Juez FJ (2018) PM10 concentration forecasting in the metropolitan area of Oviedo (Northern Spain) using models based on SVM, MLP, VARMA and ARIMA: a case study. Sci Total Environ 621:753–761. https://doi.org/10.1016/j.scitotenv.2017.11.291
Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Proceedings of the 14th International Conference on Artificial Intelligence and Statistics (AISTATS). Journal of Machine Learning Research
González-Enrique J, Turias IJ, Ruiz-Aguilar JJ, Moscoso-López JA, Franco L (2019a) Spatial and meteorological relevance in NO2 estimations: a case study in the Bay of Algeciras (Spain). Stoch Environ Res Risk Assess 33(3):801–815. https://doi.org/10.1007/s00477-018-01644-0
González-Enrique J, Turias IJ, Ruiz-Aguilar JJ, Moscoso-López JA, Jerez-Aragonés J, Franco L (2019b) Estimation of NO2 concentration values in a monitoring sensor network using a fusion approach. Fresenius Environ Bull 28:681–686
González-Enrique J, Ruiz-Aguilar JJ, Moscoso-López JA, Urda D, Turias IJ (2021) A comparison of ranking filter methods applied to the estimation of NO2 concentrations in the Bay of Algeciras (Spain). Stochastic Environ Res Risk Assessment 35(10):1999–2019. https://doi.org/10.1007/s00477-021-01992-4
Hastie T, Tibshirani R, Friedman J (2008) The elements of statistical learning. Data mining, inference, and prediction, 2nd edn. Springer, New York
He H-d, Lu W-Z, Xue Yu (2014) Prediction of particulate matter at street level using artificial neural networks coupling with chaotic particle swarm optimization algorithm. Build Environ 78:111–117. https://doi.org/10.1016/j.buildenv.2014.04.011
Hornik K, Stinchcombe M, White H (1989) Multilayer feedforward networks are universal approximators. Neural Netw 2(5):359–366. https://doi.org/10.1016/0893-6080(89)90020-8
Hu L, Yan G, Duan Z, Chen C (2021) Intelligent modeling strategies for forecasting air quality time series: a review. Appl Soft Comput 102:106957. https://doi.org/10.1016/j.asoc.2020.106957
Ilacqua V, Hänninen O, Saarela K, Katsouyanni K, Künzli N, Jantunen M (2007) Source apportionment of population representative samples of PM2.5 in three European cities using structural equation modelling. Sci Total Environ 384(1–3):77–92. https://doi.org/10.1016/j.scitotenv.2007.06.020
IMO (International Maritime Organization) (2021) The International Convention for the Prevention of Pollution from Ships (MARPOL), annex VI. London
Ju T, Lei M, Guo G, Xi J, Zhang Y, Xu Y, Lou Q (2023) A new prediction method of industrial atmospheric pollutant emission intensity based on pollutant emission standard quantification. Front Environ Sci Eng. https://doi.org/10.1007/s11783-023-1608-1
Kloog I, Ridgway B, Koutrakis P, Coull BA, Schwartz JD (2013) Long-and short-term exposure to PM2.5 and mortality: using novel exposure models. Epidemiology 24(4):555–561
Kolehmainen M, Martikainen H, Ruuskanen J (2001) Neural networks and periodic components used in air quality forecasting. Atmos Environ 35(5):815–825. https://doi.org/10.1016/S1352-2310(00)00385-X
Kukkonen J, Partanen L, Karppinen A, Ruuskanen J, Junninen H, Kolehmainen M, Niska H, Dorling S, Chatterton T, Foxall R, Gavin C (2003) Extensive evaluation of neural networks models for the prediction of NO2 and PM10 concentrations, compared with a deterministic modelling system and measurement in central Helsinki. Atmos Environ 37:4539–4550. https://doi.org/10.1016/S1352-2310(03)00583-1
Lakra K, Avishek K (2022) A review on factors influencing fog formation, classification, forecasting, detection and impacts. Rendiconti Lincei-Scienze Fisiche e Naturali 33(2, SI):319–353
Liu H, Yan G, Duan Z, Chen C (2021) Intelligent modeling strategies for forecasting air quality time series: a review. Appl Soft Comput J 102:106957. https://doi.org/10.1016/j.asoc.2020.106957
Liu TK, Sheu HY, Tsai JY (2014) Sulfur dioxide emission estimates from merchant vessels in a Port area and related control strategies. Aerosol Air Quality Res 14(1):413–421. https://doi.org/10.4209/aaqr.2013.02.0061
López-Aparicio S, Tønnesen D, Thanh TH, Neilson H (2017) Shipping emissions in a Nordic port: assessment of mitigation strategies. Transp Res Part D Transp Environ 53:205–216. https://doi.org/10.1016/j.trd.2017.04.021
Lu G, Brook JR, Rami Alfarra M, Anlauf K, Richard Leaitch W, Sharma S, Wang D, Worsnop DR, Phinney L (2006) Identification and characterization of inland ship plumes over Vancouver, BC. Atmos Environ 40(15):2767–2782. https://doi.org/10.1016/j.atmosenv.2005.12.054
Lu H, Zhang Y, Wang X, He L (2014) A semiparametric statistical approach for forecasting SO2 and NOX concentrations. Environ Sci Pollut Res 21(13):7985–7995. https://doi.org/10.1007/s11356-014-2748-4
Luna A, Talavera A, Navarro H, Cano L (2019) Monitoring of air quality with low-cost electrochemical sensors and the use of artificial neural networks for the atmospheric pollutants concentration levels prediction. Commun Computer Inf Sci 898:137–150. https://doi.org/10.1007/978-3-030-11680-4_15
Masood A, Ahmad K (2021) A review on emerging artificial intelligence (AI) techniques for air pollution forecasting: fundamentals, application and performance. J Cleaner Prod 322:129072. https://doi.org/10.1016/j.jclepro.2021.129072
Manisalidis I, Stavropoulou E, Stavropoulos A, Bezirtzoglou E (2020) Environmental and health impacts of air pollution: a review. Front Public Health. https://doi.org/10.3389/fpubh.2020.00014
MARPOL (Marine Pollution). Annex VI the International Convention for the Prevention of Pollution from Ships.
Martín ML, Turias IJ, González FJ, Galindo PL, Trujillo FJ, Puntonet CG, Gorriz JM (2008) Prediction of CO maximum ground level concentrations in the Bay of Algeciras, Spain using artificial neural networks. Chemosphere 70(7):1190–1195. https://doi.org/10.1016/j.chemosphere.2007.08.039
Mavroidis I, Gavriil I, Chaloulakou A (2007) Statistical modelling of CO and NO2 concentrations in the Athens area. Evaluation of emission abatement policies. Environ Sci Pollut Res 14(2):130–136. https://doi.org/10.1065/espr2006.04.299
Mclean S, Kaiser J, Ben Richard B (2019) A review of artificial neural network models for ambient air pollution prediction. Environ Model Softw 119:285–304. https://doi.org/10.1016/j.envsoft.2019.06.014
Méndez M, Merayo MG, Núñez M (2023) Machine learning algorithms to forecast air quality: a survey. Artif Intell Rev. https://doi.org/10.1007/s10462-023-10424-4
Menezes F, Popowicz GM (2022) Acid Rain and Flue Gas: Quantum Chemical Hydrolysis of NO2. ChemPhysChem. https://doi.org/10.1002/cphc.202200395
Miola A, Ciuffo B (2011) Estimating air emissions from ships: Meta-analysis of modelling approaches and available data sources. Atmos Environ 45(13):2242–2251. https://doi.org/10.1016/j.atmosenv.2011.01.046
Mitchell T (1997) Machine learning. International Student Edition. McGraw‐Hill, Maidenhead. ISBN: 0‐07‐115467‐1, 414
Moreno-Gutiérrez J, Calderay F, Saborido N, Boile M, Rodríguez R, Durán-Grados V (2015) Methodologies for estimating shipping emissions and energy consumption: a comparative analysis of current methods. Energy 86:603–616. https://doi.org/10.1016/j.energy.2015.04.083
Moscoso-López JA, González-Enrique J, Urda D, Ruiz-Aguilar JJ, Turias IJ (2022) Hourly pollutants forecasting using a deep learning approach to obtain the AQI. Logic J IGPL. https://doi.org/10.1093/jigpal/jzac035
Mueller M, Westerby M, Nieuwenhuijsen M (2023) Health impact assessments of shipping and port-sourced air pollution on a global scale: A scoping literature review. Environ Res 216:114460. https://doi.org/10.1016/j.envres.2022.114460
Muruganandam NS, Arumugam U (2023) Dynamic ensemble multivariate time series forecasting model for PM2.5. Comput Syst Sci Eng 44(2):979–989. https://doi.org/10.32604/csse.2023.024943
Oliveri G, Heibati B, Kloog I, Fiore M, Ferrante M (2017) A review of AirQ Models and their applications for forecasting the air pollution health outcomes. Environ Sci Pollut Res 24:6426–6445. https://doi.org/10.1007/s11356-016-8180-1
Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27:1226–1238. https://doi.org/10.1109/TPAMI.2005.159
Pope CA, Dockery DW (2006) Health effects of fine particulate air pollution: lines that connect. J Air Waste Manag Assoc 56:709–742. https://doi.org/10.1080/10473289.2006.10464485
Prati MV, Costagliola MA, Quaranta F, Murena F (2015) Assessment of ambient air quality in the port of Naples. J Air Waste Manag Assoc 65(8):970–979. https://doi.org/10.1080/10962247.2015.1050129
Ramachandran P, Zoph B, Le QV (2017) Searching for activation functions. https://doi.org/10.48550/arXiv.1710.05941
Rodríguez-García MI, González-Enrique J, Moscoso-López JA, Ruiz-Aguilar JJ, Turias IJ (2022) Air pollution relevance analysis in the bay of Algeciras (Spain). Int J Environ Sci Technol. https://doi.org/10.1007/s13762-022-04466-4
Ribeiro VM, Gonçalves R (2022) Classification and prediction of nitrogen dioxide in a Portuguese air quality critical zone. Atmosphere 13(10). In: 2nd international conference on cybernetics and intelligent system (ICORIS).
Ruiz-Aguilar JJ, Turias I, González-Enrique J, Urda D, Elizondo D (2020) A permutation entropy-based EMD–ANN forecasting ensemble approach for wind speed prediction. Neural Comput Appl. https://doi.org/10.1007/s00521-020-05141-w
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning internal representation by error propagation. Parallel distributed processing: explorations in the microstructures of cognition, vol 1. MIT Press, Cambridge
Savouré M, Lequy E, Bousquet J, Chen J, de Hoogh K, Goldberg M, Vienneau D, Zins M, Nadif R, Jacquemin B (2021) Long-term exposures to PM2.5, black carbon and NO2 and prevalence of current rhinitis in French adults: the Constances Cohort. Environ Int 157:106839. https://doi.org/10.1016/j.envint.2021.106839
Silverman BW, Jones MC (1989) E. Fix and J.L. Hodges (1951): an important contribution to nonparametric discriminant analysis and density estimation: commentary on Fix and Hodges (1951). Int Stat Rev 57(3):233–238. https://doi.org/10.2307/1403796
Song C, Fu X (2020) Research on different weight combination in air quality forecasting models. J Cleaner Prod 261:121169
Stieb DM, Burnett RT, Smith-Doiron M, Brion O, Shin HH, Economou V, Dales RE (2009) A new multipollutant, no-threshold air quality health index based on short-term associations observed in daily time-series analyses. J Air Waste Manag Assoc 59(3):299–307
Subramaniam S, Raju N, Ganesan A, Rajavel N, Maheswari Chenniappan M, Prakash C, Pramanik A, Basak AK, Dixit S (2022) Artificial intelligence technologies for forecasting air pollution and human health: a narrative review. Sustainability 14(16):9951. https://doi.org/10.3390/su14169951
Ting KM (2010) Confusion matrix. Encycl Mach Learn Data Min. https://doi.org/10.1007/978-1-4899-7687-1_50
Traina G, Bolzacchini E, Bonini M, Contini D, Mantecca P, Caimmi SME, Licari A (2022) Role of air pollutants mediated oxidative stress in respiratory diseases. Pediatr Allergy Immunol 22:38–40
Turias IJ, González FJ, Martin ML, Galindo PL (2008) Prediction models of CO, SPM and SO2 concentrations in the Campo de Gibraltar Region, Spain: a multiple comparison strategy. Environ Monit Assess 143(1–3):131–146. https://doi.org/10.1007/s10661-007-9963-0
Vapnik VN, Chervonenkis A (1971) Theory of uniform convergence of frequencies of events to their probabilities and problems of search for an optimal solution from empirical data. Avtomat i Telemekh 2:42–53
Yang L, Zhang Q, Lv Z, Zhang Y, Yang Z, Fu F, Lv J, Wu L, Mao H (2022) Efficiency of DECA on ship emission and urban air quality: a case study of China port. J Cleaner Prod 362:132. https://doi.org/10.1016/j.jclepro.2022.132556
Yeh CK, Lin C, Shen HC, Cheruiyot NK, Nguyen DH, Chang CC (2022) Real-time energy consumption and air pollution emission during the transpacific crossing of a container ship. Sci Rep 12:1. https://doi.org/10.1038/s41598-022-19605-7
Funding
Funding for open access publishing: Universidad de Cádiz/CBUA. This work is part of the research project RTI2018-098160-B-I00, supported by MICINN 'Programa Estatal de I+D+i Orientada a los Retos de la Sociedad', and is partially financed by national funds through FCT (Fundação para a Ciência e a Tecnologia) under the project UIDB/00006/2020. This research is also supported by the 'Plan Propio de la Universidad de Cádiz'. Data used in this work have been kindly provided by the Andalusian Regional Government.
Author information
Contributions
Conceptualization, M.I.R.-G., C.R., and I.J.T.; data curation, M.I.R.-G., J.G.-E., and J.J.R.-A.; formal analysis, M.I.R.-G., J.G.-E., and I.J.T.; funding acquisition, I.J.T.; investigation, M.I.R.-G. and J.G.-E.; methodology, M.I.R.-G., C.R., and I.J.T.; project administration, C.R. and I.J.T.; software, M.I.R.-G., J.G.-E., and I.J.T.; supervision, C.R., J.J.R.-A., and I.J.T.; validation, M.I.R.-G., C.R., and I.J.T.; visualization, M.I.R.-G. and C.R.; writing (original draft), M.I.R.-G., C.R., and I.J.T.; writing (review and editing), M.I.R.-G., C.R., and I.J.T. All authors have read and agreed to the published version of the manuscript.
Ethics declarations
Conflict of interest
The authors have no relevant conflicts of interest to declare regarding the content of this article.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Rodríguez-García, M.I., Ribeiro Rodrigues, M.C., González-Enrique, J. et al. Forecasting air pollutants using classification models: a case study in the Bay of Algeciras (Spain). Stoch Environ Res Risk Assess 37, 4359–4383 (2023). https://doi.org/10.1007/s00477-023-02512-2
DOI: https://doi.org/10.1007/s00477-023-02512-2