Forecasting air pollutants using classification models: a case study in the Bay of Algeciras (Spain)

The main goal of this work is to obtain reliable predictions of pollutant concentrations related to maritime traffic (SO2, PM10, NO2, NOX, and NO) in the Bay of Algeciras, located in Andalusia, in the south of Spain. Furthermore, the objective is to predict future air quality levels of the principal maritime traffic-related pollutants in the Bay of Algeciras as a function of the rest of the pollutants, the meteorological variables, and vessel data. To this end, three scenarios were analysed for comparison, namely Alcornocales Park and the cities of La Línea and Algeciras. A database of hourly records of air pollution immissions and meteorological measurements in the Bay of Algeciras region and a database of maritime traffic in the port of Algeciras during the years 2017 to 2019 were used. A resampling procedure based on five-fold cross-validation was designed to assure the generalisation capabilities of the tested models. It was used to compute the pollutant predictions with different classification models and with artificial neural networks using different numbers of hidden layers and units. This procedure enabled appropriate and reliable multiple comparisons among the tested models and facilitated the selection of a set of top-performing prediction models. The models were compared using several classification quality indexes: sensitivity, specificity, accuracy, and precision. The distance (d1) to the perfect classifier (1, 1, 1, 1) was also used as a discriminant feature, which allowed for the selection of the best models. Concerning the number of variables, an analysis was conducted to identify the most relevant ones for each pollutant. This approach aimed to obtain models with fewer inputs, facilitating the design of an optimised monitoring network. These more compact models proved to be the optimal choice in many cases. The sensitivities obtained with the best models were 0.98 for SO2, 0.97 for PM10, 0.82 for NO2 and NOX, and 0.83 for NO.
These results demonstrate the potential of the models to forecast air pollution in a port city or a complex scenario and to be used by citizens and authorities to prevent exposure to pollutants and to make decisions concerning air quality.


Introduction
Air pollution is a real threat in today's world according to the World Health Organization (WHO). The European Directive 2008/50/EC regulates several key atmospheric pollutants, including particulate matter (PM), nitrogen dioxide (NO2), sulphur dioxide (SO2), ozone (O3), and carbon monoxide (CO). Vessel-related atmospheric pollutants encompass sulphur dioxide (SO2), nitrogen oxides (NOX), and particulate matter (PM). Exposure to hazardous air pollutant emissions can lead to a range of human health problems, including respiratory disorders, cardiovascular disease, and increased risk of stroke. Manisalidis et al. (2020) gave an overview of the effects of air pollution on human health. A large number of scientific works have demonstrated that particulate matter directly affects human health by reducing air quality (Adeyemi et al. 2022). Air pollution in urban areas is a complex mixture of toxic components that have unhealthy effects on residents, especially sensitive populations such as children and people with cardiac and respiratory diseases (Kolehmainen et al. 2001). From an environmental point of view, studying the prediction of air pollutant levels or concentrations (immissions) is crucial for the protection of human health and the environment. This research aims to provide valuable insights into the factors influencing the distribution, temporal variations, and potential exposure risks associated with ambient pollutants. Accurate prediction models can be developed to forecast pollution levels, identify pollution hotspots, and assess compliance with regulatory standards. These predictive models play a vital role in urban planning, industrial siting, and the formulation of effective emission control strategies. By proactively predicting and mitigating high pollution episodes, air pollution forecasting research contributes to protecting public health, reducing environmental impact, and promoting sustainable communities (Pope and Dockery 2006; Stieb et al. 2009; Kloog et al. 2013). A review of models to forecast air pollution health outcomes is presented by Oliveri et al. (2017), where several large cities were compared with regard to different pollutants. In addition, Savouré et al. (2021), Subramaniam et al. (2022), and Traina et al. (2022) applied artificial intelligence to forecast air pollution in relation to human health.
Several studies describe the air pollutants related to vessel traffic (Miola and Ciuffo 2011; Moreno-Gutiérrez et al. 2015; Ekmekçioğlu et al. 2020) and estimate the amount of pollution associated with ships in port areas (Lu et al. 2006; Liu et al. 2014; Fameli et al. 2020). These pollutants are sulphur dioxide (SO2), nitrogen oxides (NOX), and particulate matter (PM). Marine pollution is regulated by the International Maritime Organisation (IMO) through the Marine Pollution Protocol (MARPOL). The main purpose of the IMO in this area is decarbonisation and the reduction of greenhouse gas emissions. An energy efficiency index is applied to vessels to indicate their classification (A, B, C, D, E) (MARPOL, Annex VI). The aim of the IMO is to achieve zero emissions by 2050 (IMO 2021). The air pollutants responsible for acid rain are sulphur dioxide (SO2) and nitrogen oxides (NOX), which react in the atmosphere with water, oxygen, and other chemicals to form sulphuric acid and nitric acid. NO2 is primarily responsible for the formation of smog and acid rain in urban areas, causing both acute and chronic effects (Menezes and Popowicz 2022). These pollutants are emitted from the combustion of fossil fuels in industrial processes, power generation, and transport. The main pollutants associated with port activity are presented in Yang et al. (2022), Yeh et al. (2022), and Mueller et al. (2023).
In recent decades, artificial neural networks (ANNs) have been widely applied to air quality forecasting (Kukkonen et al. 2003; Fernando et al. 2012; Hu et al. 2021; Muruganandam and Arumugam 2023). Numerous studies have used artificial intelligence (AI) and machine learning techniques to monitor air quality (Bai et al. 2018; Mclean et al. 2019; Baklanov and Zhang 2020; Liu et al. 2021; Masood and Ahmad 2021). Bai et al. (2018) analysed the three classical methods for forecasting air pollution (statistical, artificial intelligence, and numerical prediction methods). There is also literature on forecasting air quality in urban areas with different statistical methods (Mavroidis et al. 2007; Ilacqua et al. 2007; Lu et al. 2014). Considering meteorological aspects, Mavroidis et al. (2007) suggested a successful methodology for assessing the impact of different emission reduction scenarios on the attainment of air quality standards for CO and NO2 in the Athens area. Furthermore, Ribeiro and Gonçalves (2022) classified NO2 in Portugal as a binary objective using a benchmark model. In Durão et al. (2016), classification and regression tree techniques were successfully used to predict ozone in Sines (Portugal). For NO2, Prati et al. (2015) provided an insight into the relevance of a spatial analysis of data that provides knowledge on how ship emissions affect the air in a port city. To forecast air quality in urban areas, Lu et al. (2014) proposed different semi-parametric regression models. Particulate matter (PM) sources in three European cities (Athens, Basle, and Helsinki) are described and analysed using structural equation modelling in parallel with traditional principal components (Ilacqua et al. 2007). Similar machine learning techniques are used by Lakra and Avishek (2022) to forecast fog, which is also related to meteorological factors. Other techniques are used to construct air quality models. In García-Nieto et al. (2015), air quality in Oviedo (Spain) was modelled using multivariate adaptive regression splines (MARS); subsequently, support vector regression (SVR) and multilayer perceptron (MLP) models were used to forecast PM10 concentrations in the same city by García-Nieto et al. (2018). In addition, meteorological variables are considered by Luna et al. (2019), where low-cost electrochemical sensors are used to quantify air pollution exposure, prediction, and control of CO2 and SO2 concentrations using ANNs. The most relevant information extracted from this study was that pollution prediction is sensitive to humidity, wind speed, and temperature. Therefore, ANNs could be used to predict and impute missing values or to re-evaluate doubtful values. A method for predicting SO2 emissions in several cities is shown by Ju et al. (2023), which is of great help for accurate control of this pollutant. Applied to megacities, He et al. (2014) provided an ANN-based method, in particular a multilayer perceptron (MLP), that predicts fine particles, suggesting that particulate matter concentrations are generated by traffic and controlled by weather conditions.
Air quality assessment, from an operational point of view, requires the characterisation of atmospheric quality (Corani and Scanagatta 2016; Méndez et al. 2023). The aim of this work is to predict future values of the levels of each pollutant. Machine learning methods based on classification models have been used for this purpose. A comprehensive comparison of classification models was developed. The classifiers tested were trees, support vector machines (SVMs), artificial neural networks (ANNs), ensembles, K-nearest neighbours (KNNs), discriminant analysis, and naïve Bayes. Most of them have already been successfully used by the authors in different papers (Turias et al. 2008; Ruiz-Aguilar et al. 2020; Song and Fu 2020; González-Enrique et al. 2021; Moscoso-López et al. 2022). Regarding local studies, the impact of ship propulsion systems on air pollution in the Strait of Gibraltar in 2017 is presented in Durán-Grados et al. (2022). That study is based on an inventory of ships crossing the Strait and calling at the ports of Algeciras, Tarifa, and Ceuta. In Martín et al. (2008), air pollution was modelled with classification techniques in the Bay of Algeciras (Spain). Additionally, Rodríguez-García et al. (2022) conducted an extensive analysis of statistics, risk, and trends in the area of the Bay of Algeciras from 2010 to 2015. Furthermore, due to the large number of inputs used to build the models, the problem of the curse of dimensionality (Bishop 2006) could arise. Therefore, a feature selection stage was applied using the Minimum Redundancy Maximum Relevance (mRMR) method, which has been successfully tested by the authors previously in air pollution forecasting problems (González-Enrique et al. 2021).
The main motivation of this manuscript is to provide citizens with reliable information on air pollution forecasts. This challenge is addressed through a data-driven approach using historical data and machine learning techniques, which will be explained in more detail in the next sections. Improving the air quality in populated cities is another of the main motivations for this study, which is carried out in the Bay of Algeciras (southern Spain), where the most important port in Spain and the fourth in Europe in terms of cargo traffic is located. Maritime traffic in Algeciras has increased massively in the last ten years, and this increase in the number of vessels in the port may affect the air quality in the area and in the nearest city (Algeciras). Since there have been few studies on air pollution in this strategic area of port activities, this research can make a specific contribution.
Another main contribution of this work is the use of a classification-based machine learning scheme to predict the next level of a pollutant, including an analysis of the most relevant variables (using mRMR) for each of the pollutants and sites studied. In addition, many different classification methods were used and compared. This research has allowed us to develop a procedure for predicting future pollution levels, both on an hourly basis for nitrogen oxides (NO2, NOX, and NO) and on a daily basis for SO2 and PM10. The results obtained are suitable for the design of an air pollution forecasting system that can be used by citizens or institutions to support decision making.
The rest of this article is organised as follows: Sect. 2 describes the database, the site, the case study and the regulations to be applied, Sect. 3 presents the methodology including the classification models tested in the study together with the feature selection process and the experimental procedure used to achieve the objectives, Sect. 4 presents and discusses the results and, finally, Sect. 5 draws the main conclusions.

Materials
Environmental studies are important in this area for several reasons. The Port of Algeciras, which has handled more than 100 million tonnes of goods per year since 2017, lies in the Strait of Gibraltar, an area with special meteorological and orographic conditions. The region is also highly industrialised: the port coexists with numerous industries (a refinery, several chemical and thermal power plants, a stainless steel factory, etc.), which, together with several motorways and Gibraltar airport, contribute to a very complex air pollution scenario. Maritime traffic in Algeciras has increased dramatically over the last decade. It is logical to think that the increase in the number of vessels in the Port of Algeciras could affect the air quality in the area.
In order to develop this study, the main pollutants related to port activities were selected as shown in Yang et al. (2022), Yeh et al. (2022), and Mueller et al. (2023). Immission data of SO2, NO2, NOX, NO, and PM10 concentrations and meteorological data (relative humidity, solar radiation, temperature, atmospheric pressure, wind speed, wind direction, and rainfall) were provided through the Andalusian Government's monitoring network, and the vessel gross tonnage (GT) database was provided by the Algeciras Bay Port Authority, all for the years 2017 to 2019. Similar studies, such as López-Aparicio et al. (2017), analysed all these pollutants in a Nordic port and concluded that the main emission contributions come from berthed vessels and manoeuvres.
The Andalusian Government's system of sensors in the Bay of Algeciras includes a total of sixteen air pollutant monitoring stations and five specialised meteorological sensors (W1 to W5) distributed throughout the bay (see Fig. 1), which recorded hourly data of each pollutant and meteorological values over a three-year period, from 1st January 2017 to 31st December 2019 (see Table 1). The meteorological sensors W3, W4, and W5 are located in the chimney of a refinery at different heights: 10 m, 15 m, and 60 m. The data analysed were recorded at stations in the towns of Algeciras and La Línea and in the Alcornocales Park, in order to compare three distant locations. Algeciras and La Línea are important because of their coastal location and the large port of Algeciras, with massive truck traffic, whereas Alcornocales Park is an unspoilt area far from anthropogenic activity. In addition, La Línea and Algeciras are two cities located opposite each other, so studying both can shed more light on air pollution immissions. Algeciras is the most populated city in the bay with 122,982 inhabitants in 2021 and La Línea is the second most populated city with 63,365 inhabitants. The entire database consists of 131 variables. In each experiment, the output variable is the concentration of one pollutant at one of the monitoring stations, predicted from the rest of the study variables described in Table 1 (pollutant concentrations in the rest of the monitoring stations, meteorological information, and vessel data).
This study has been developed in three stages: preprocessing of the data, a classification stage, and a feature selection stage to reduce the number of variables. Among the wide range of feature selection methods, the mRMR method was used in this work to rank the variables considered as inputs. Feature selection, one of the fundamental problems in pattern recognition and machine learning, involves identifying subsets of data that are relevant to the parameters used, usually referred to as maximum relevance. These subsets often contain material that is relevant but redundant, and mRMR attempts to address this problem by eliminating these redundant subsets. In this paper, the ten most relevant features were selected as inputs to the different models to test whether the results differ significantly from those obtained when all variables are used.

Methodology
The main objective of this work is to predict the future air quality levels of the main maritime pollutants in the Bay of Algeciras as a function of other pollutants, meteorological variables, and vessel data. In order to achieve this objective, the time series were considered according to the limits marked in the European Directive 2008/50/EC (Table 2), and the outputs were transformed into disjoint quartiles (Q1-Q4).
The predictions are calculated using pollutant concentrations at each station (Algeciras, Alcornocales, and La Línea) as outputs and the rest of the variables as inputs (pollutants at other stations, meteorological parameters, and the vessel data). Different classification techniques are compared together with ANN models in order to find improvements and the best model. The performance of the tested models is calculated using hourly and daily mean data time series.
Equation 1 shows the prediction approach mathematically, where t is the time and t + 1 is the one-step-ahead time to be predicted. In the case of hourly data, the next 1 h mean concentration value is predicted, and in the case of daily data, the next daily mean concentration value is predicted. The inputs $\tilde{x}(t)$ consist of all other pollutants measured at the monitoring stations together with the meteorological variables and the vessel time series. The scheme of the process is shown in Fig. 2.
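The one-step-ahead framing described above can be illustrated with a minimal sketch. It pairs the input vector at time t with the target at time t + 1. The function name `one_step_ahead_pairs` and the toy values are assumptions made for illustration and do not come from the study.

```python
from typing import List, Sequence, Tuple


def one_step_ahead_pairs(
    inputs: Sequence[Sequence[float]],
    target: Sequence[float],
) -> List[Tuple[Sequence[float], float]]:
    """Pair the input vector at time t with the target value at time t + 1."""
    if len(inputs) != len(target):
        raise ValueError("inputs and target must have the same length")
    # The last record has no successor, so it cannot be used as an input.
    return [(inputs[t], target[t + 1]) for t in range(len(target) - 1)]


# Toy series (made-up values): two input variables and one target pollutant.
x = [[0.1, 5.0], [0.2, 4.0], [0.3, 3.0], [0.4, 2.0]]
y = [10.0, 12.0, 11.0, 15.0]
pairs = one_step_ahead_pairs(x, y)
```

With hourly data each step is one hour, and with daily means each step is one day. In the classification setting of the paper the target would be the quartile label rather than the raw concentration.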
Three stages were developed. The first stage is the preprocessing of the data. First, missing values were imputed using an algorithm previously proposed by the authors (González-Enrique et al. 2019a, 2019b; Rodríguez-García et al. 2022). Second, the database was standardised. The vessel data, given as incoming and outgoing vessels in the bay, were also transformed into hourly data. Once the databases are transformed and unified, the data consist of 26,280 hourly records × 131 variables (130 inputs and 1 output) in a single database. Each row is a record of hourly data for the three years from 2017 to 2019. The database has been normalised and the output has been divided into disjoint quartiles. The second stage, classification, is described in Sect. 3.1. The third stage is a feature selection procedure using the mRMR approach proposed by Peng et al. (2005), which is a feature selection algorithm that ranks a set of features according to their relevance to the target variable. It also penalises redundant features. The best features are those with the highest trade-off between maximum relevance with the target variable and minimum redundancy with the remaining features.
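As a rough illustration of the preprocessing outputs mentioned above, the following sketch standardises a series and labels each value with its quartile. The z-score form of standardisation and the inclusive quantile method are assumptions, since the text does not specify them, and the helper names are hypothetical.

```python
import statistics
from bisect import bisect_left
from typing import List, Sequence


def standardise(values: Sequence[float]) -> List[float]:
    """Z-score standardisation (assumed form): zero mean, unit variance."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def quartile_labels(values: Sequence[float]) -> List[int]:
    """Label each value 1..4 (Q1..Q4) according to the quartile it falls into."""
    cuts = statistics.quantiles(values, n=4, method="inclusive")
    # bisect_left counts how many cut points lie strictly below the value.
    return [bisect_left(cuts, v) + 1 for v in values]


concentrations = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]  # made-up values
labels = quartile_labels(concentrations)
z = standardise(concentrations)
```

Because the classes are defined by quartiles of the output, each class holds roughly 25% of the records, which keeps the four-class problem balanced.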

Classification
In this stage, 29 classification models (Table 3) were tested to select the best classifier. Classification is a type of supervised machine learning where an algorithm learns to classify new observations from labelled data samples. In this work, the database is labelled in quartiles, as shown in Table 3. The different classification schemes are briefly explained below.

Trees
Decision trees are a hierarchical, non-parametric supervised learning algorithm consisting of a root node, branches, internal nodes, and leaf nodes. They predict the outcome of a sequence of decisions and can be used for both classification and regression tasks (Breiman et al. 1984). Three types of trees were used depending on the maximum number of splits (100, 20, 4). With a maximum of 100 splits, many leaves are available to make many fine distinctions between classes. With a maximum of 4 splits, only a few leaves are available and the distinctions between classes are coarse.

Discriminant analysis
Discriminant analysis is a statistical transformation technique that produces a function capable of classifying phenomena (Fisher 1936).The objective is to maximise the between-group variance and minimise the within-group variance through these linear (or quadratic) combinations.
The procedure is to find the eigenvalues and eigenvectors of the quotient of the interclass distance matrix and the intraclass distance matrix. For linear discriminant analysis, the model has the same covariance matrix for each class; only the means vary. For quadratic discriminant analysis, both the means and the covariances of each class vary.

Naïve Bayes
Naïve Bayes models assume that observations have a multivariate distribution given class membership, but that the predictors or features that make up the observation are independent. This framework can accommodate a full set of features, so that an observation is a set of multinomial counts (Mitchell 1997). The normal (Gaussian) distribution is appropriate for predictors that have normal distributions in each class. The naïve Bayes classifier estimates a separate normal distribution for each class by calculating the mean and standard deviation of the training data in that class. The kernel distribution is suitable for predictors that have a continuous distribution. It does not require a strong assumption such as a normal distribution, and it can be used in cases where the distribution of a predictor may be skewed or have multiple peaks or modes.
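The Gaussian variant described above can be sketched in a few lines of standard-library Python. This is only an illustration of the per-class mean and standard deviation estimate, not the implementation used in the study. The function names and the toy data are assumptions.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y):
    """Estimate a prior plus per-feature mean and std for every class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        columns = list(zip(*rows))
        means = [sum(c) / len(c) for c in columns]
        # Population std, floored so a constant feature cannot divide by zero.
        stds = [
            max(math.sqrt(sum((v - m) ** 2 for v in c) / len(c)), 1e-9)
            for c, m in zip(columns, means)
        ]
        model[label] = (len(rows) / len(y), means, stds)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log-posterior for one observation."""
    def log_posterior(label):
        prior, means, stds = model[label]
        # Independence assumption: per-feature log-likelihoods simply add up.
        log_lik = sum(
            -math.log(s * math.sqrt(2 * math.pi)) - 0.5 * ((v - m) / s) ** 2
            for v, m, s in zip(row, means, stds)
        )
        return math.log(prior) + log_lik
    return max(model, key=log_posterior)


# Made-up training data with two quartile labels (1 and 4).
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [5.0, 6.0], [5.2, 5.8], [4.8, 6.2]]
y = [1, 1, 1, 4, 4, 4]
nb_model = fit_gaussian_nb(X, y)
```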

Support Vector Machines (SVMs)
The goal of an SVM is to find the hyperplane that best separates two different classes of data points with the widest margin between the two classes. The algorithm can only find this hyperplane in problems that allow linear separation; in most practical problems, the algorithm maximises a soft margin by allowing a small number of misclassifications. The support vectors are the subset of the training observations that identify the location of the separating hyperplane. SVMs can use a kernel function to transform the features. Kernel functions map the data into a different, usually higher dimensional space, with the expectation that it will be easier to separate the classes after this transformation (Vapnik and Chervonenkis 1971; Cortes and Vapnik 1995). The types tested are Linear SVM (makes a simple linear separation between classes), Quadratic SVM, Cubic SVM, and three categories of Gaussian SVM (fine, with kernel scale set to $\sqrt{P}/4$; medium, with kernel scale set to $\sqrt{P}$; and coarse, with kernel scale set to $\sqrt{P} \cdot 4$, where P is the number of predictors).
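To make the three Gaussian kernel scales concrete, the sketch below evaluates them for P = 130 predictors (all inputs) and P = 10 (the reduced set). The kernel form, in which both points are divided by the kernel scale before the squared distance is taken, is an assumption about how the scale enters the kernel. It is not stated in the text, and the function names are hypothetical.

```python
import math


def gaussian_kernel_scales(n_predictors):
    """Fine, medium, and coarse kernel scales as functions of P."""
    root = math.sqrt(n_predictors)
    return {"fine": root / 4, "medium": root, "coarse": root * 4}


def gaussian_kernel(x, z, scale):
    """Gaussian kernel on points divided by the kernel scale (assumed form)."""
    sq_dist = sum(((a - b) / scale) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist)


scales_all = gaussian_kernel_scales(130)   # roughly 2.85, 11.40, 45.61
scales_top10 = gaussian_kernel_scales(10)  # roughly 0.79, 3.16, 12.65
```

A smaller scale makes the kernel decay faster with distance, so the fine Gaussian SVM can draw more detailed class boundaries than the coarse one.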

KNN
The k-nearest neighbour algorithm, also known as KNN or k-NN, is a non-parametric supervised learning classifier that uses proximity to make classifications or predictions about the grouping of a single data point. While it can be used for regression or classification problems, it is generally used as a classification algorithm, based on the assumption that similar points will be found close together. Usually, the number k is an odd number (1, 3, 5, …) (Silverman and Jones 1989). The types of KNN tested were Fine KNN (the number of neighbours is set to 1), Medium KNN (the number of neighbours is set to 10), Coarse KNN (the number of neighbours is set to 100), Cosine KNN, using a cosine distance metric (the number of neighbours is set to 10), Cubic KNN, using a cubic distance metric (the number of neighbours is set to 10), and Weighted KNN, using a distance weight (the number of neighbours is set to 10).
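The proximity-based vote can be sketched as follows. This minimal version uses the Euclidean distance and an unweighted majority vote. The function name and the toy data are assumptions for illustration, and the cosine, cubic, and weighted variants listed above are not covered.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, query, k=10):
    """Classify `query` by majority vote among its k nearest training points."""
    # Sort training points by Euclidean distance to the query.
    distances = sorted(
        (math.dist(row, query), label) for row, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Made-up training data with two quartile labels (1 and 4).
train_X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]]
train_y = [1, 1, 1, 4, 4, 4]
```

Setting `k=1` corresponds to the fine variant, while larger values of `k` smooth the decision boundary.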

Ensemble learning
Classification ensemble learning uses multiple learning algorithms to obtain a better predictive model, which is a weighted combination of several classification models. In general, the combination of several classification models increases the predictive power. The types of ensembles tested were: Subspace with discriminant learners; Subspace with nearest neighbour learners; and RUSBoost, Random Forest Bag, and AdaBoost with decision tree learners (Breiman 1996, 2001; Hastie et al. 2008; Freund 2009).

Artificial neural networks (ANNs)
ANNs were also included in the second stage. A feedforward fully connected ANN can fit multidimensional mapping problems arbitrarily well, given consistent data and enough neurons in its hidden layer (Hornik et al. 1989). The authors have successfully used ANNs in similar prediction problems (González-Enrique et al. 2019b; Ruiz-Aguilar et al. 2020; Moscoso-López et al. 2022). ANNs were trained with the backpropagation algorithm (Rumelhart et al. 1986) using the Levenberg-Marquardt optimisation procedure. Finally, the obtained results were statistically analysed and compared using a resampling procedure in order to select the model with the best generalisation capabilities. ANN models with different numbers of hidden units were compared to determine the effect of adding nonlinear processing capabilities on model performance. Each model is a feedforward fully connected neural network with a different number of fully connected layers and hidden units. A ReLU activation function was used in each model. The rectified linear activation function, or ReLU, is a piecewise linear (and therefore non-linear) function that outputs the input directly if it is positive and outputs zero otherwise (Glorot et al. 2011). It has been the most commonly used activation function in neural networks since 2017 (Ramachandran et al. 2017). The types of ANNs tested were: one hidden layer with 10, 25, or 100 neurons; two hidden layers with 10 × 10 neurons; and three hidden layers with 10 × 10 × 10 neurons.
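The ReLU activation and the layered structure can be illustrated with a forward pass through a network with two hidden layers of 10 units. This sketch uses random, untrained weights and assumes 10 inputs and 4 output scores (one per quartile). It does not implement the Levenberg-Marquardt training used in the study, and all names are hypothetical.

```python
import random


def relu(v):
    """Rectified linear unit: pass positive inputs through, output zero otherwise."""
    return v if v > 0 else 0.0


def dense(inputs, weights, biases, activate=True):
    """One fully connected layer, optionally followed by ReLU."""
    outputs = []
    for w_row, b in zip(weights, biases):
        s = sum(w * x for w, x in zip(w_row, inputs)) + b
        outputs.append(relu(s) if activate else s)
    return outputs


def init_layer(n_in, n_out, rng):
    """Random weights in [-0.5, 0.5] and zero biases (illustrative only)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def forward(x, layers):
    """ReLU on the hidden layers; the last layer is left linear (class scores)."""
    for i, (w, b) in enumerate(layers):
        x = dense(x, w, b, activate=(i < len(layers) - 1))
    return x


rng = random.Random(0)
sizes = [10, 10, 10, 4]  # assumed: 10 inputs, 10 x 10 hidden units, 4 classes
layers = [init_layer(a, b, rng) for a, b in zip(sizes, sizes[1:])]
scores = forward([0.1] * 10, layers)
predicted_quartile = scores.index(max(scores)) + 1
```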

Feature selection
The third stage is a feature selection procedure. The Minimum Redundancy Maximum Relevance (mRMR) approach (Peng et al. 2005) is a feature selection algorithm that ranks a set of features according to their relevance to the target variable. It also penalises redundant features. The best features are those with the highest trade-off between maximum relevance with the target variable and minimum redundancy with the remaining features.
Among the wide range of feature selection methods, the mRMR method has been used in this work to rank the variables considered as inputs. This method has been successfully used by the authors in other studies related to air pollution (González-Enrique et al. 2021). Feature selection, one of the fundamental problems in pattern recognition and machine learning, involves identifying subsets of data that are relevant to the parameters used, usually referred to as maximum relevance. These subsets often contain material that is relevant but redundant, and mRMR attempts to address this problem by eliminating these redundant subsets. In this paper, the ten most relevant features were selected as inputs to the different models to test whether the results differ significantly from those obtained when all variables are used.
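The relevance and redundancy trade-off can be sketched with a greedy ranking on discrete variables. The sketch assumes the difference form of the criterion (mutual information with the target minus mean mutual information with the features already selected) and assumes that continuous inputs have been discretised beforehand. The function names and the toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (pa[x] * pb[y])) for (x, y), c in pab.items()
    )


def mrmr_rank(features, target, k):
    """Greedily pick k feature names by relevance minus mean redundancy."""
    relevance = {
        name: mutual_information(col, target) for name, col in features.items()
    }
    selected, remaining = [], set(features)
    while remaining and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy
        # Sorting makes ties resolve deterministically by feature name.
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up discrete data: "a_copy" duplicates "a", "b" is weaker but not redundant.
target = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "a": [0, 0, 0, 1, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 1, 1, 1, 1, 1],
    "b": [0, 1, 0, 0, 1, 0, 1, 1],
}
top_two = mrmr_rank(features, target, 2)
```

In this toy case the duplicate of the first selected feature is penalised for redundancy, so the weaker but complementary feature is ranked second.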

Experimental procedure
A resampling procedure was used to reduce the prediction error on the test set and to reduce the effects of overfitting. The strategy randomly divided the database into three parts (training 70%, validation 10%, and test sets 20%) and the performance results were collected only for the test set in order to estimate the generalisation error of each model using unseen data, as the authors have successfully implemented in other papers (Turias et al. 2008; González-Enrique et al. 2019a; Ruiz-Aguilar et al. 2020; Moscoso-López et al. 2022). In this research, all of the simulations were developed and tested in Matlab© software.
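A minimal sketch of the random 70/10/20 partition, repeated 20 times, could look as follows. The study itself was run in Matlab, so this Python version only illustrates the idea. The seed and the function name are assumptions.

```python
import random


def split_indices(n_records, rng, train=0.7, validation=0.1):
    """Randomly partition record indices into training, validation, and test sets."""
    indices = list(range(n_records))
    rng.shuffle(indices)
    n_train = round(train * n_records)
    n_val = round(validation * n_records)
    return (
        indices[:n_train],
        indices[n_train:n_train + n_val],
        indices[n_train + n_val:],  # the remainder (about 20%) is the test set
    )


rng = random.Random(42)  # arbitrary seed, for reproducibility of the sketch
splits = [split_indices(26280, rng) for _ in range(20)]
train_idx, val_idx, test_idx = splits[0]
```

Only the test indices would be used to compute the quality indexes, and the results of the 20 repetitions would then be averaged.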
The whole system can be seen as a mapping from a set of input features to an output variable. The mathematical form of the mapping is determined by the data (training set). The system must also be capable of making good predictions on unseen data. In order to measure this generalisation ability, cross-validation is used with another set of samples (test set). We adopted five-fold cross-validation to select the best model based on the generalisation performance of each model. The available data were divided into three different groups (training, validation, and test sets). The parameters of each model were estimated using one of the groups (the training set). The validation set is used for early stopping and to avoid overfitting. Finally, the test set is used to compute the classification quality indexes (sensitivity, specificity, accuracy, and precision), simulating the real performance of the model. This process is repeated 20 times and the results are averaged over these runs. To visualise the results obtained with a classification model, the confusion matrix is used (Ting 2010). Each row (i) of the matrix (C) corresponds to a predicted class and each column (j) to a real class, so that C(i,j) is the number of records predicted as class i whose real class is j. In this case, four classes are considered, one for each of the quartiles of the output. Once an air pollutant has been considered, its values are divided into four quartiles, each containing 25% of the total distribution. The confusion matrix is calculated and then the quality indexes of sensitivity, specificity, accuracy, and precision are also calculated. The Euclidean distance (d1) to a perfect classifier in terms of the quality indexes (sensitivity = 1, specificity = 1, accuracy = 1, precision = 1) is also calculated (expressed by Eq. 2).
In this case, the confusion matrix has dimension 4 × 4 because the data are divided into four disjoint quartiles (classes). In order to obtain individual classification results for each quartile, the matrix was sequentially transformed, quartile by quartile, into an equivalent 2 × 2 confusion matrix (Table 4), which was used to calculate the well-known classification measures mentioned above (sensitivity, specificity, accuracy, and precision, see Eqs. 3-6). The lowest d1 distance is chosen to indicate the best classification model for each quartile. Quartiles are the statistical values that divide the dataset into four equal parts or quarters, each containing 25% of the data, resulting in lower, lower-middle, upper-middle, and upper divisions.
True-positive (TP) and true-negative (TN) results are correctly classified, while false-negative (FN) and false-positive (FP) results are two types of errors calculated according to the literature (Ting 2010).
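The reduction of the 4 × 4 confusion matrix to a 2 × 2 matrix per quartile, and the resulting quality indexes and d1 distance, can be sketched as follows. The sketch assumes the standard definitions of the four indexes and the row = predicted, column = real convention described above. The confusion matrix values are made up and the function names are hypothetical.

```python
import math


def one_vs_rest(confusion, k):
    """Collapse a multi-class confusion matrix into TP, FP, FN, TN for class k."""
    total = sum(sum(row) for row in confusion)
    tp = confusion[k][k]
    fp = sum(confusion[k]) - tp                 # predicted k, real class differs
    fn = sum(row[k] for row in confusion) - tp  # real class k, predicted otherwise
    tn = total - tp - fp - fn
    return tp, fp, fn, tn


def quality_indexes(tp, fp, fn, tn):
    """Sensitivity, specificity, accuracy, and precision (standard definitions)."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),
    }


def distance_to_perfect(indexes):
    """Euclidean distance d1 to the perfect classifier (1, 1, 1, 1)."""
    return math.sqrt(sum((1.0 - v) ** 2 for v in indexes.values()))


# Made-up 4 x 4 confusion matrix: rows are predicted quartiles, columns real ones.
C = [
    [40, 8, 2, 0],
    [7, 35, 6, 2],
    [3, 6, 36, 8],
    [0, 1, 6, 40],
]
q1_indexes = quality_indexes(*one_vs_rest(C, 0))
q1_d1 = distance_to_perfect(q1_indexes)
```

A lower d1 means that all four indexes are jointly closer to 1, which is why it can serve as a single criterion for ranking models within each quartile.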
All the calculations are performed separately for each of the air pollutants (SO2, PM10, NO2, NOX, and NO) as outputs at the three locations (Algeciras, Alcornocales, and La Línea), using either all variables or only the ten most relevant variables. This gives a total of 30 scenarios, each repeated 20 times following the resampling procedure explained above. The time series of SO2 and PM10 concentrations are calculated as daily averages and those of NO2, NOX, and NO on an hourly basis. The results of these experiments are presented in the next section.
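The count of 30 scenarios follows from 5 pollutants × 3 locations × 2 feature sets, which the short sketch below enumerates. The dictionary layout and the labels are assumptions chosen for illustration.

```python
from itertools import product

pollutants = ["SO2", "PM10", "NO2", "NOX", "NO"]
locations = ["Algeciras", "Alcornocales", "La Línea"]
feature_sets = ["all variables", "ten most relevant variables"]
daily_pollutants = {"SO2", "PM10"}  # predicted from daily means; the rest hourly

scenarios = [
    {
        "pollutant": pollutant,
        "location": location,
        "features": features,
        "resolution": "daily" if pollutant in daily_pollutants else "hourly",
    }
    for pollutant, location, features in product(pollutants, locations, feature_sets)
]

n_repetitions = 20
total_runs = len(scenarios) * n_repetitions
```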

Results
Simulations and prediction experiments were computed for five pollutants directly related to maritime traffic: SO2, PM10, NO2, NOX, and NO. The models were tested at three different locations: in the cities of Algeciras and La Línea, and at a third location at a certain distance in the remote area of the Alcornocales Park. As explained above in Table 2, the averages were calculated hourly or daily to comply with the European Directive 2008/50/EC.
Figures 3 and 4 show the time series of the analysed pollutants, measured in µg/m³ on an hourly or daily basis according to the Directive, together with their upper assessment thresholds. These graphs show average concentrations, and it is worth noting that in 2017 the average SO2 concentrations in La Línea, where a refinery is located, are very high compared to the following years, which seems to be due to the installation of a desulphurisation unit at this refinery in 2018. Considering particulate matter, the lowest concentrations are found at the Alcornocales station and the highest at Algeciras, although overall concentrations are very similar in Algeciras and La Línea. On the other hand, NO2 (and nitrogen oxides in general) clearly shows very high average concentrations in Algeciras compared to La Línea and Alcornocales, which are quite similar to each other. This increase could be an indication of the high presence of diesel engines in Algeciras, which is consistent with the heavy truck traffic in and out of the port, the ships berthed in the Port of Algeciras, and the higher traffic density, since it is the most densely populated city in the Bay.
Since the pollutant thresholds are defined in the regulations in terms of hourly and daily values, and in order to better understand the behaviour of each pollutant, weekly average graphs of each air pollutant at the different stations have been calculated (Fig. 5). SO2 shows a higher concentration in La Línea, probably because the prevailing winds carry the pollution from ships and the surrounding industries towards La Línea (westerly situations). In easterly situations, SO2 seems to move towards Los Alcornocales, the remote area 30 km from the bay, which paradoxically has a higher concentration than Algeciras. In the case of the PM10 averages, concentrations decrease during the night and, from the early hours of the morning, when anthropogenic activity begins, the values increase until late in the day. At weekends there is not much difference compared to the rest of the week. In the case of nitrogen oxides, there is a daily decrease in the early hours of the morning, then an increase to a maximum around midday, and then a downward trend with a slowdown around mid-afternoon, which coincides with the pace of human activity and therefore traffic, especially vehicle traffic. The trend is higher in the cities of La Línea and Algeciras. At the remote Los Alcornocales station, there is only a slight increase at midday. In terms of daily averages, maximum values are observed on Tuesdays and Fridays, with a significant decrease at weekends. It should be noted that NO2 is higher in Algeciras than in La Línea, probably due to road traffic. Nitrogen oxides have two peaks per day, which suggests that they are related to human activity, and especially to diesel engines, whereas particulate matter and SO2 have only one peak per day.
As explained above, one-step-ahead prediction models have been developed with the aim of predicting the next value of a time series of quartile concentrations, so that the predictions can be contrasted with exceedances of the thresholds set in the Directive. Several classification models, including ANNs, were tested and compared for their performance using the resampling procedure explained in Sect. 3. In each case, two experiments were calculated, one using all variables as inputs and another one with only the ten most relevant variables. It should be noted that the results shown in Tables 5-14 are always calculated for test sets (unseen data). In general, the obtained results are quite adequate, with high values for the classification quality indexes. Results of around 90% indicate that the prediction for the next timestamp (the next daily or hourly mean) is quite accurate and very reliable. The results are reported separately for each quartile to give a more detailed picture of the prediction.
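The construction of the one-step-ahead classification target can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the quartile boundaries are taken from the sample percentiles of the whole series, classes are coded 0 to 3 for Q1 to Q4, and the helper names `quartile_labels` and `one_step_ahead` are hypothetical.

```python
import numpy as np

def quartile_labels(x):
    """Map each value to its quartile class 0..3 (Q1..Q4)."""
    x = np.asarray(x, dtype=float)
    q1, q2, q3 = np.percentile(x, [25, 50, 75])
    # right=True: x <= q1 -> 0, q1 < x <= q2 -> 1, q2 < x <= q3 -> 2, else 3
    return np.digitize(x, [q1, q2, q3], right=True)

def one_step_ahead(features, series):
    """Pair the inputs at time t with the quartile of the series at t + 1."""
    labels = quartile_labels(series)
    X = np.asarray(features)[:-1]  # inputs at t (last row has no target)
    y = labels[1:]                 # quartile class at t + 1
    return X, y

# Toy series of eight values; here the only input is the series itself at t.
series = np.arange(1.0, 9.0)
X, y = one_step_ahead(series.reshape(-1, 1), series)
```

In the study the input matrix would also contain the other pollutants, the meteorological variables, and the vessel data at time t; the toy example uses a single column only to keep the sketch short.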
In Tables 5-14, the best prediction model for each air pollutant, location, and quartile is underlined (the combination with the smallest distance d1). The best counterpart model (same model, location, quality index, and quartile) is shown in bold between Tables 5, 7, 9, 11, and 13, which present the results of the models using all variables, and Tables 6, 8, 10, 12, and 14, which present the results of the models using only the ten most relevant variables.
Comparing Table 5 with Table 6 for the pollutant SO2, that is, the prediction models using all variables with the models using only the relevant variables, it can be seen that in all cases the ANN models significantly improve their prediction performance when using only relevant variables, although the tree classifiers predict better than the ANNs, as their distance d1 is the smallest. For this pollutant, tree classifiers are the best predictors in all cases. For SO2 in Algeciras, quartiles Q1 and Q2 are best predicted by the tree classifiers using only the ten most relevant variables. However, quartile Q3 is best predicted by tree classifiers using all 130 variables, and Q4 is best predicted by ANN models using the ten most relevant variables. The best performing quartile (with the lowest d1) in Algeciras is Q1, with sensitivity, specificity, accuracy, and precision above 0.90. For SO2 in Alcornocales, better predictions are obtained in quartiles Q1 and Q2 with tree-type classifiers using only the ten most relevant variables. In the case of quartiles Q3 and Q4, better predictions are obtained with ensemble classifiers using all variables. The best results for Alcornocales are obtained for Q1, Q3, and Q4, all with values up to 0.97. For SO2 in La Línea, better predictions were obtained for quartiles Q1 and Q4 using relevant-variable models with tree classifiers, while Q2 and Q3 were better predicted with ensemble classifiers using all variables. In La Línea, Q4 is the quartile with the best results, with quality indexes above 0.98.
For the PM10 pollutant, Tables 7 and 8 show that the use of relevant variables improves the results of the classification models in almost all cases. The best models for each quartile are those underlined, regardless of whether they include all variables or only the relevant ones, and turn out to be the trees, with slightly better results than the ANN models. The results obtained with the best PM10 models have the highest quality indexes. For instance, in Algeciras, Q1 is the best predicted quartile, with tree classifiers using relevant variables and quality indexes above 0.95. In Alcornocales, quartile Q4 is the best predicted, also with tree classifiers using the ten most relevant variables, with quality indexes above 0.97. In La Línea, the best predicted quartile is Q4, with quality indexes up to 0.96.
The results for NO2 are shown in Tables 9 and 10. In this case, the relevant variables give better results only for Alcornocales and for quartile Q1 of La Línea. Tree-type models are also the best predictors for Alcornocales and La Línea, especially when all the variables are used, and only for quartiles Q3 and Q4 of Alcornocales do neural network models perform better when the relevant variables are used. In the case of Algeciras, all quartiles are predicted equally well by SVM classifiers using all variables. For NO2, the best results are obtained for quartile Q1 in Algeciras, with all quality indexes above 0.82; for Q1 in Alcornocales, with ensemble models using all variables and quality indexes above 0.82; and for Q1 in La Línea, with ensemble tree classifiers using the top ten variables and quality indexes above 0.82.
In the case of NOX, reasonably equivalent behaviour is observed between models using all variables and models using only the relevant variables. By using fewer but more relevant variables, a large number of models improve their overall performance. In La Línea, the best models are ANNs using the relevant variables for quartiles Q2-Q4. In Algeciras, the performance of the ANNs is similar for quartiles Q2 and Q4, and in Alcornocales for Q1. The rest of the best models use all variables and correspond to SVM and ensembles. The values are somewhat lower than for other pollutants, reaching sensitivities above 80% and specificities above 93%.
In the case of NO in Alcornocales (Tables 13, 14), no values have been obtained for Q2 because most of the data available in the database for this pollutant are at such low values that they correspond for the most part to Q1, except for some peaks of exceedances found in the Q3 and Q4 quartiles. For NO, ANNs seem to be the models that best predict the quartiles using the relevant variables. In fact, the best result is obtained for the Q1 quartile, with more than 94% precision for Alcornocales.
Tables 5-14 show that ensemble boosted trees and tree classifiers produce better results than ANN models in most cases, but when the number of variables is reduced to the best ten, the ANNs improve considerably. Tables 15, 16, 17, 18, and 19 have also been included, showing the ten most relevant variables selected with the mRMR method for each pollutant (SO2, PM10, NO2, NOX, and NO) and monitoring station (Algeciras, La Línea, and Alcornocales). Using these variables, prediction results similar to those of the models using all the variables were obtained. Therefore, using only these top ten variables, a more efficient monitoring system could be designed, saving economic and time resources in the sensor network by measuring, storing, and transmitting fewer variables, and thus yielding a more energy-sustainable system with a lower carbon footprint. In these tables, the meteorological variables for each pollutant are marked in yellow, and the other relevant pollutants, different from those analysed and repeated in at least two stations, are marked in other colours. In Tables 15, 16, 17, 18, and 19, each pollutant's own time series (SO2(t), PM10(t), NO2(t), NOX(t), and NO(t)) is expected to always appear, and this is indeed the case. For instance, Table 15 shows the most relevant meteorological variables for SO2, namely wind direction (WD) and rainfall (RF).
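To make the idea behind mRMR (minimum redundancy, maximum relevance) ranking concrete, the following is a simplified greedy sketch. It is not the implementation used in this study: standard mRMR scores variables with mutual information, whereas this sketch swaps in the absolute Pearson correlation as a proxy, and the function name `mrmr_rank` and the synthetic data are hypothetical.

```python
import numpy as np

def mrmr_rank(X, y, n_select=10):
    """Greedy mRMR-style ranking: relevance minus mean redundancy.

    Assumption: absolute Pearson correlation stands in for the mutual
    information used by standard mRMR. Columns must not be constant.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n_features = X.shape[1]
    corr = np.abs(np.corrcoef(np.column_stack([X, y]), rowvar=False))
    relevance = corr[:n_features, -1]            # |corr(feature, target)|
    redundancy = corr[:n_features, :n_features]  # |corr(feature, feature)|
    selected = [int(np.argmax(relevance))]       # start with most relevant
    while len(selected) < min(n_select, n_features):
        remaining = [j for j in range(n_features) if j not in selected]
        # Penalise candidates that duplicate already selected variables.
        scores = [relevance[j] - redundancy[j, selected].mean()
                  for j in remaining]
        selected.append(remaining[int(np.argmax(scores))])
    return selected

# Synthetic check: column 1 nearly duplicates column 0, column 3 is noise.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
X_demo = np.column_stack([a, a + 0.01 * rng.normal(size=500),
                          b, rng.normal(size=500)])
y_demo = a + 0.5 * b
ranking = mrmr_rank(X_demo, y_demo, n_select=3)
```

In the synthetic example the near-duplicate column is highly relevant but also highly redundant, so the ranking places the independent informative column `b` ahead of it. This is the behaviour that lets a ten-variable subset retain most of the predictive information.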
For SO2, O3 and nitrogen oxides are the most relevant air pollutants, as expected. Table 16 shows the relevant variables for the PM10 pollutant, indicating that the most relevant meteorological variables are wind speed (WS), rainfall (RF), and relative pressure (RP), and the most relevant pollutants are nitrogen oxides. Similarly, Table 17 for the pollutant NO2 indicates that the most relevant meteorological variables are related to wind (wind direction (WD) and wind speed (WS)) and rainfall (RF), and the most relevant pollutants are particulate matter (PM10 and PM2.5), O3, and SO2. Table 18, for each NOX case, shows the same relevant meteorological variables as for NO2 plus relative humidity (RH), and the same relevant pollutants except SO2. In the case of the NO pollutant, Table 19 shows that the relevant meteorological variables are related to the wind (wind direction (WD) and wind speed (WS)), solar radiation (SR), and rainfall (RF).
The best models for each pollutant and location for the fourth quartile are shown in italics. Results are given for all quartiles, but we assume that the fourth quartile is the most important for prediction as it represents the most dangerous concentration levels.

Conclusions
In this work, an experimental procedure using a resampling strategy with five-fold cross-validation allowed the statistical comparison of the different classification models tested. The proposed approach is based on classification modelling, since the desired output is the next level (quartile at t + 1) of an air pollutant as a function of the other variables at a given time t. Two approaches have been used, one with hourly mean data for nitrogen oxides (NO2, NOX, and NO) and another one with daily mean data (for SO2 and PM10), due to the thresholds established in the European Directive 2008/50/EC, in order to obtain more reliable information in the study area. The approaches were developed in three different and separate locations: the main city of Algeciras, the city of La Línea, and the unspoilt remote area of Alcornocales, in order to contrast them and obtain more details on the behaviour of the air pollutants.
The main conclusions of this study are as follows:
• The classification models can be adequately used to provide very good air quality prediction results, with quality indexes up to 90% in most cases.
• In general, the use of the ten relevant variables improves the results in most cases.
• Ensemble boosted trees, SVM, trees, and ANN classifiers tend to be the best prediction models in most cases.
• The results obtained with ANNs are always improved by reducing the number of variables to the ten relevant ones.
• Variable selection models can be used to rank the variables by importance.
• By selecting fewer variables, it is possible to design a more energy-sustainable system with a lower carbon footprint.
• All forecasts can be useful to the citizens, institutions, and businesses in the port area and in the cities surrounding the port.
• There is background radiation (averages that are constantly repeated) that does not provide useful or accurate information from the ships. The conclusion that can be drawn from the data is that more sensors are needed close to the dock area where the ships are located in order to deduce the direct effect of pollutants coming directly from the ships.
The logistical activity of a port has an impact on air quality. Therefore, it is necessary to implement predictive models to provide reliable forecasts that help citizens, companies, and institutions to make decisions and drive policy changes to ensure a healthier and cleaner environment for present and future generations.

Fig. 1
Fig. 1 Location of the area of study. Spain, Andalusia and the Bay of Algeciras in the Strait of Gibraltar. The three studied monitoring stations in the cities of Algeciras and La Línea and Alcornocales Park, and the rest of the sensors over the Bay

Fig. 3
Fig. 3 Daily mean time series of SO2 and PM10 from 2017 to 2019 with the Directive 2008/50/EC limit thresholds

Table 1
Monitoring station codes. Meteorological variable codes.

Table 2
Simulation scenarios and Directive 2008/50/EC limit values for pollutants of the study

Fig. 2
Fig. 2 Methodology scheme. The output data was transformed into quartiles (Q1-Q4). The inputs and output at the timestamp t are the predictors of the quartile at timestamp t + 1

Table 5
Best prediction model results for daily SO2 (t + 1) concentrations using all variables at t. In bold: the best model; in underline: with all/relevant variables; in italics: the best model for Q4

Table 6
Best prediction model results for daily SO2 (t + 1) concentrations using top ten relevant features at t

Table 7
Best prediction model results for daily PM10 (t + 1) concentrations using all variables at t

Table 8
Best prediction model results for daily PM10 (t + 1) concentrations using top ten relevant features at t

Table 9
Best prediction model results for hourly NO2 (t + 1) concentrations using all variables at t

Table 10
Best prediction model results for hourly NO2 (t + 1) concentrations using top ten relevant features at t

Table 11
Best prediction model results for hourly NOX (t + 1) concentrations using all variables at t

Table 12
Best prediction model results for hourly NOX (t + 1) concentrations using top ten relevant features at t

Table 13
Best prediction model results for hourly NO (t + 1) concentrations using all variables at t

Table 12
In bold: the best model; in underline: with all/relevant variables; in italics: the best model for Q 4

Table 14
Best prediction model results for hourly NO (t + 1) concentrations using top ten relevant features at t. In bold: the best model; in underline: with all/relevant variables; in italics: the best model for Q4

Table 15
The ten most relevant variables for each SO2 (t + 1) level prediction

Table 16
The ten most relevant variables for each PM10 (t + 1) level prediction. Columns: PM10 Algeciras daily concentrations; PM10 Alcornocales daily concentrations; PM10 La Línea daily concentrations

Table 17
The ten most relevant variables for each NO2 (t + 1) level prediction

Table 18
The ten most relevant variables for each NOX (t + 1) level prediction