1 Introduction

Air pollution is a major threat in today's world, according to the World Health Organization (WHO). The European Directive 2008/50/EC regulates several key atmospheric pollutants, including particulate matter (PM), nitrogen dioxide (NO2), sulphur dioxide (SO2), ozone (O3), and carbon monoxide (CO). Vessel-related atmospheric pollutants comprise sulphur dioxide (SO2), nitrogen oxides (NOx), and particulate matter (PM). Exposure to hazardous air pollutant emissions can lead to a range of human health problems, including respiratory disorders, cardiovascular disease, and an increased risk of stroke. Manisalidis et al. (2020) provided an overview of the effects of air pollution on human health. A large body of scientific work has demonstrated that particulate matter directly affects human health by reducing air quality (Adeyemi et al. 2022). Air pollution in urban areas is a complex mixture of toxic components that has unhealthy effects on residents, especially sensitive populations such as children and people with cardiac and respiratory diseases (Kolehmainen et al. 2001). From an environmental point of view, studying the prediction of air pollutant levels or concentrations (immissions) is crucial for the protection of human health and the environment. This research aims to provide valuable insights into the factors influencing the distribution, temporal variations, and potential exposure risks associated with ambient pollutants. Accurate prediction models can be developed to forecast pollution levels, identify pollution hotspots, and assess compliance with regulatory standards. These predictive models play a vital role in urban planning, industrial siting, and the formulation of effective emission control strategies.
By proactively predicting and mitigating high pollution episodes, air pollution forecasting research contributes to protecting public health, reducing environmental impact, and promoting sustainable communities (Pope and Dockery 2006; Stieb et al. 2009; Kloog et al. 2013). A review of models to forecast air pollution health outcomes is presented by Oliveri et al. (2017), in which several large cities were compared with respect to different pollutants. In addition, artificial intelligence has been applied to forecast air pollution related to human health in Savouré et al. (2021), Subramaniam et al. (2022), and Traina et al. (2022).

Different studies describe the air pollutants related to vessel traffic (Miola and Ciuffo 2011; Moreno-Gutiérrez et al. 2015; Ekmekçioğlu et al. 2020) and estimate the amount of pollution associated with ships in port areas (Lu et al. 2006; Liu et al. 2014; Fameli et al. 2020). These pollutants are sulphur dioxide (SO2), nitrogen oxides (NOx), and particulate matter (PM). Marine pollution is regulated by the International Maritime Organisation (IMO) through the International Convention for the Prevention of Pollution from Ships (MARPOL). Decarbonisation and the reduction of greenhouse gas emissions are the main goals of the IMO in this regard. An energy efficiency index is applied to vessels to indicate their rating (A, B, C, D, E) (MARPOL, Annex VI), and the IMO aims to achieve zero emissions by 2050 (IMO 2021). The air pollutants responsible for acid rain are sulphur dioxide (SO2) and nitrogen oxides (NOx), which react in the atmosphere with water, oxygen, and other chemicals to form sulphuric and nitric acid. NO2 is primarily responsible for the formation of smog and acid rain in urban areas, causing both acute and chronic effects (Menezes and Popowicz 2022). These pollutants are emitted from the combustion of fossil fuels in industrial processes, power generation, and transport. The main pollutants associated with port activity are presented in Yang et al. (2022), Yeh et al. (2022), and Mueller et al. (2023).

In recent decades, artificial neural networks (ANNs) have been applied to air quality forecasting in a wide range of studies (Kukkonen et al. 2003; Fernando et al. 2012; Hu et al. 2021; Muruganandam and Arumugam 2023). Numerous studies have used artificial intelligence (AI) and machine learning techniques to monitor air quality (Bai et al. 2018; Mclean et al. 2019; Baklanov and Zhang 2020; Liu et al. 2021; Masood and Ahmad 2021). Bai et al. (2018) analysed the three classical approaches to forecasting air pollution (statistical, artificial intelligence, and numerical prediction methods). There is also literature on air quality in urban areas using different statistical methods to forecast air quality (Mavroidis et al. 2007; Ilacqua et al. 2007; Lu et al. 2014). Considering meteorological aspects, Mavroidis et al. (2007) suggested a successful methodology for assessing the impact of different emission reduction scenarios on the attainment of air quality standards for CO and NO2 in the Athens area. Furthermore, Ribeiro and Gonçalves (2022) classified NO2 in Portugal as a binary objective using a benchmark model. In Durão et al. (2016), classification and regression tree techniques were successfully used to predict ozone in Sines (Portugal). For NO2, Prati et al. (2015) provided insight into the relevance of spatial data analysis for understanding how ship emissions affect the air in a port city. To forecast air quality in urban areas, Lu et al. (2014) proposed different semi-parametric regression models. Particulate matter (PM) sources in three European cities (Athens, Basle, and Helsinki) were described and analysed using structural equation modelling in parallel with traditional principal components (Ilacqua et al. 2007). Similar machine learning techniques are used by Lakra and Avishek (2022) to forecast fog, which is also related to meteorological factors.
Other techniques have been used to construct air quality models. In García-Nieto et al. (2015), air quality in Oviedo (Spain) was modelled using multivariate adaptive regression splines (MARS); subsequently, support vector regression (SVR) and multilayer perceptron (MLP) models were used to forecast PM10 concentrations in the same city (García-Nieto et al. 2018). In addition, meteorological variables were considered by Luna et al. (2019), where low-cost electrochemical sensors were used to quantify air pollution exposure and ANNs were applied to predict and control CO2 and SO2 concentrations. The most relevant finding of that study was that pollution prediction is sensitive to humidity, wind speed, and temperature. Therefore, ANNs can be used to predict and impute missing values or to re-evaluate doubtful ones. A method for predicting SO2 emissions in several cities is presented by Ju et al. (2023), which is of great help for the accurate control of this pollutant. Applied to megacities, He et al. (2014) provided an ANN-based method, in particular a multilayer perceptron (MLP), that predicts fine particles, suggesting that particulate matter concentrations are generated by traffic and controlled by weather conditions.

Air quality assessment, from an operational point of view, requires the characterisation of atmospheric quality (Corani and Scanagatta 2016; Méndez et al. 2023). The aim of this work is to predict future values of the levels of each pollutant. Machine learning methods based on classification models have been used for this purpose, and a comprehensive comparison of classification models was carried out. The classifiers tested were trees, support vector machines (SVMs), artificial neural networks (ANNs), ensembles, K-nearest neighbours (KNNs), discriminant analysis, and naïve Bayes. Most of them have already been successfully used by the authors in different papers (Turias et al. 2008; Ruiz-Aguilar et al. 2020; Song and Fu 2020; González-Enrique et al. 2021; Moscoso-López et al. 2022). Regarding local studies, the impact of ship propulsion systems on air pollution in the Strait of Gibraltar in 2017 is presented in Durán-Grados et al. (2022); this study is based on an inventory of ships crossing the Strait and calling at the ports of Algeciras, Tarifa, and Ceuta. In Martín et al. (2008), air pollution in the Bay of Algeciras (Spain) was modelled with classification techniques. Additionally, Rodríguez-García et al. (2022) conducted an extensive analysis of statistics, risks, and trends in the Bay of Algeciras area from 2010 to 2015. Furthermore, due to the large number of inputs used to build the models, the curse of dimensionality (Bishop 2006) could arise. Therefore, a feature selection stage was applied using the Minimum Redundancy Maximum Relevance (mRMR) method, which the authors have previously tested successfully in air pollution forecasting problems (González-Enrique et al. 2021).

The main motivation of this manuscript is to provide citizens with reliable information on air pollution forecasts. This challenge is addressed through a data-driven approach using historical data and machine learning techniques, which is explained in more detail in the following sections. Improving air quality in populated cities is another main motivation for this study, which is carried out in the Bay of Algeciras (southern Spain), home to the most important port in Spain and the fourth in Europe in terms of cargo traffic. Maritime traffic in Algeciras has increased massively in the last ten years, and this increase in the number of vessels in the port may affect air quality in the area and in the nearest city (Algeciras). Since there have been few studies on air pollution in this strategic port area, this research can make a specific contribution.

Another main contribution of this work is the use of a classification-based machine learning scheme to predict the next level of a pollutant, including an analysis of the most relevant variables (using mRMR) for each of the pollutants and sites studied. In addition, many different classification methods were used and compared. This research has allowed us to develop a procedure for predicting future pollution levels, on an hourly basis for nitrogen oxides (NO2, NOx, and NO) and on a daily basis for SO2 and PM10. The results obtained are suitable for the design of an air pollution forecasting system that can be used by citizens or institutions to support decision making.

The rest of this article is organised as follows: Sect. 2 describes the database, the site, the case study, and the regulations to be applied; Sect. 3 presents the methodology, including the classification models tested in the study, together with the feature selection process and the experimental procedure used to achieve the objectives; Sect. 4 presents and discusses the results; and, finally, Sect. 5 draws the main conclusions.

2 Materials

The importance of environmental studies in this area stems from the fact that the Port of Algeciras, which has handled more than 100 million tonnes of goods per year since 2017, is located in an area with special meteorological and orographic conditions, the Strait of Gibraltar. It also lies in a highly industrialised region where the port coexists with numerous industries (a refinery, several chemical and thermal power plants, a stainless steel factory, etc.), together with several motorways and Gibraltar airport, all of which contribute to a very complex air pollution scenario. Maritime traffic in Algeciras has increased dramatically over the last decade, and it is reasonable to expect that the increase in the number of vessels in the Port of Algeciras could affect the air quality in the area.

In order to develop this study, the main pollutants related to port activities were selected, as shown in Yang et al. (2022), Yeh et al. (2022), and Mueller et al. (2023). Immission data of SO2, NO2, NOX, NO, and PM10 concentrations and meteorological data (relative humidity, solar radiation, temperature, atmospheric pressure, wind speed, wind direction, and rainfall) were provided through the Andalusian Government's monitoring network, and the vessel gross tonnage (GT) database was provided by the Algeciras Bay Port Authority, all for the years 2017 to 2019. Similar studies, such as López-Aparicio et al. (2017), analysed all these pollutants in a Nordic port and concluded that the main emission contributions come from berthed vessels and manoeuvres.

The Andalusian Government's sensor network in the Bay of Algeciras includes a total of sixteen air pollutant monitoring stations and five specialised meteorological sensors (W1–W5) distributed throughout the bay (see Fig. 1), which record hourly values of each pollutant and meteorological variable over a three-year period, from 1st January 2017 to 31st December 2019 (see Table 1). The meteorological sensors W3, W4, and W5 are located in the chimney of a refinery at different heights (10 m, 15 m, and 60 m). The data analysed were recorded at stations in the towns of Algeciras and La Línea and in the Alcornocales Park, in order to compare three distant locations. Algeciras and La Línea are important because they are coastal areas close to the huge port of Algeciras, with massive truck traffic, whereas Alcornocales Park is an unspoilt area far from anthropogenic activity. In addition, La Línea and Algeciras are two cities located opposite each other, so studying both can shed more light on air pollution immissions. Algeciras is the most populated city in the bay, with 122,982 inhabitants in 2021, and La Línea is the second most populated, with 63,365 inhabitants. The entire database consists of 131 variables. In each experiment, the output variable is the concentration of a pollutant at one of the monitoring stations, predicted from the rest of the study variables described in Table 1 (pollutant concentrations at the other monitoring stations, meteorological information, and vessel data).

Fig. 1
figure 1

Location of the study area: Spain, Andalusia, and the Bay of Algeciras in the Strait of Gibraltar. The three studied monitoring stations in the cities of Algeciras and La Línea and in the Alcornocales Park, and the rest of the sensors over the bay

Table 1 Monitoring station codes, meteorological variable codes, and pollutant variables

This study was developed in three stages: preprocessing of the data, a classification stage, and a feature selection stage to reduce the number of variables. Among the wide range of feature selection methods, the mRMR method was used in this work to rank the variables considered as inputs; the method is described in Sect. 3.2. The ten most relevant features were selected as inputs to the different models in order to test whether there are significant differences with respect to using all variables.

3 Methodology

The main objective of this work is to predict the future air quality levels of the main maritime pollutants in the Bay of Algeciras as a function of other pollutants, meteorological variables, and vessel data. In order to achieve this objective, the time series were considered according to the limits marked in the European Directive 2008/50/EC (Table 2), and the outputs were transformed into disjoint quartiles (Q1–Q4).

Table 2 Simulation scenarios and Directive 2008/50/EC limit values for pollutants of the study

The predictions are calculated using the pollutant concentrations at each station (Algeciras, Alcornocales, and La Línea) as outputs and the rest of the variables as inputs (pollutants at the other stations, meteorological parameters, and vessel data). Different classification techniques, including ANN models, are compared in order to find the best model. The performance of the tested models is calculated using hourly and daily mean data time series.

$${\widehat{y}}_{q}\left(t+1\right)={f}_{classification}(\widetilde{x}\left(t\right),y(t))$$
(1)

Equation 1 expresses the prediction approach mathematically, where \(t\) is the current time step and \(t+1\) is the one-step-ahead instant to be predicted. In the case of hourly data, the next hourly mean concentration value is predicted, and in the case of daily data, the next daily mean concentration value is predicted. The inputs \(\widetilde{x}\left(t\right)\) consist of all the other pollutants measured at the monitoring stations, together with the meteorological variables and vessel time series. The scheme of the process is shown in Fig. 2.

Fig. 2
figure 2

Methodology scheme. The output data was transformed into quartiles (Q1–Q4). The inputs and output at the timestamp t are the predictors of the quartile at timestamp t + 1
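For illustration only, the one-step-ahead setup of Eq. 1 can be sketched in Python as follows (the study itself was implemented in Matlab; the function and array names are hypothetical). Each input row pairs the exogenous variables \(\widetilde{x}(t)\) and the current output \(y(t)\) with the target \(y(t+1)\):

```python
import numpy as np

def make_one_step_dataset(X, y):
    """Pair the exogenous inputs x(t) and the current output y(t)
    with the next-step target y(t+1), as in Eq. 1."""
    features = np.column_stack([X[:-1], y[:-1]])  # (x(t), y(t))
    target = y[1:]                                # y(t+1)
    return features, target

# toy example: 5 time steps, 3 exogenous input variables
X = np.arange(15).reshape(5, 3)
y = np.array([0, 1, 2, 3, 0])
F, t = make_one_step_dataset(X, y)
```

The resulting pairs (F, t) can then be fed to any of the classifiers described in Sect. 3.1.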

Three stages were developed. The first stage is data preprocessing. On the one hand, missing values were imputed using an algorithm previously proposed and successfully applied by the authors (González-Enrique et al. 2019a, 2019b; Rodríguez-García et al. 2022); on the other hand, the database was standardised. The vessel data, given as incoming and outgoing vessels in the bay, were also transformed into hourly data. Once the databases were transformed and unified, the data consist of 26,280 hourly records × 131 variables (130 inputs and 1 output) in a single database, where each row is a record of hourly data for the three years from 2017 to 2019. The database was normalised and the output was divided into disjoint quartiles. The second stage, classification, is described in Sect. 3.1, and the third stage is a feature selection procedure using the mRMR approach (Peng et al. 2005), described in Sect. 3.2.
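The quartile transformation of the output series can be sketched as follows (an illustrative Python version, not the Matlab code actually used; the function name is hypothetical):

```python
import numpy as np

def to_quartiles(y):
    """Map a continuous concentration series to disjoint quartile
    classes Q1-Q4 (labels 1..4), each holding ~25% of the data."""
    q = np.quantile(y, [0.25, 0.50, 0.75])  # the three quartile cut points
    return np.digitize(y, q) + 1            # class labels 1..4

y = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
labels = to_quartiles(y)
```

The resulting labels are the four classes used as targets in the classification stage.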

3.1 Classification

In this stage, 29 classification models (Table 3) were tested to select the best classifier. Classification is a type of supervised machine learning in which an algorithm learns to classify new observations from labelled data samples. In this work, the output is labelled in quartiles, and the classification models tested are listed in Table 3. The different classification schemes are briefly explained below.

Table 3 Classification models

3.1.1 Trees

A decision tree is a hierarchical, non-parametric supervised learning model consisting of a root node, branches, internal nodes, and leaf nodes, which can be used for both classification and regression tasks (Breiman et al. 1984). Three types of trees were used depending on the maximum number of splits (100, 20, 4). A maximum of 100 splits produces many leaves and allows many fine distinctions between classes, whereas a maximum of 4 splits produces few leaves and only coarse distinctions.
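As an illustrative sketch (using scikit-learn and the classic Iris data rather than the study's own database), the three split limits can be reproduced with `max_leaf_nodes`, since a binary tree with k + 1 leaves performs k splits; the exact correspondence with Matlab's "maximum number of splits" setting is an assumption here:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

trees = {}
for n_splits in (100, 20, 4):
    # a binary tree with (n_splits + 1) leaves performs n_splits splits
    trees[n_splits] = DecisionTreeClassifier(
        max_leaf_nodes=n_splits + 1, random_state=0).fit(X, y)
```

The coarse tree (4 splits) trades accuracy for interpretability, while the fine tree (100 splits) can model many small class distinctions.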

3.1.2 Discriminant analysis

Discriminant analysis is a statistical technique that produces a function capable of classifying phenomena (Fisher 1936). The objective is to maximise the between-group variance and minimise the within-group variance through linear (or quadratic) combinations of the predictors. The procedure finds the eigenvalues and eigenvectors of the quotient of the between-class and within-class scatter matrices. In linear discriminant analysis, the model assumes the same covariance matrix for each class and only the means vary; in quadratic discriminant analysis, both the means and the covariances of each class vary.
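A minimal sketch of the two variants (again with scikit-learn and Iris data for illustration, not the study's own implementation):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import (LinearDiscriminantAnalysis,
                                           QuadraticDiscriminantAnalysis)

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis().fit(X, y)     # shared covariance matrix
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class covariances
```

QDA yields curved decision boundaries at the cost of estimating one covariance matrix per class.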

3.1.3 Naïve Bayes

Naïve Bayes models assume that, given the class, the predictors or features that make up an observation are independent. This framework can accommodate a full set of features, so that an observation can be treated, for example, as a set of multinomial counts (Mitchell 1997). The normal (Gaussian) distribution is appropriate for predictors that have normal distributions in each class; the naïve Bayes classifier then estimates a separate normal distribution for each class by calculating the mean and standard deviation of the training data in that class. The kernel distribution is suitable for predictors that have a continuous distribution. It does not require a strong assumption such as normality, and it can be used in cases where the distribution of a predictor is skewed or has multiple peaks or modes.

3.1.4 Support Vector Machines (SVMs)

The goal of an SVM is to find the hyperplane that best separates two classes of data points with the widest margin between them. The algorithm can only find such a hyperplane in linearly separable problems; in most practical problems, it maximises a soft margin, allowing a small number of misclassifications. The support vectors are the subset of training observations that determine the location of the separating hyperplane. SVMs can use a kernel function to transform the features: kernel functions map the data into a different, usually higher-dimensional space, with the expectation that the classes become easier to separate after this transformation (Vapnik and Chervonenkis 1971; Cortes and Vapnik 1995). The types tested were linear SVM (which makes a simple linear separation between classes), quadratic SVM, cubic SVM, and three categories of Gaussian SVM (fine, with the kernel scale set to \(\sqrt{P}/4\); medium, with the kernel scale set to \(\sqrt{P}\); and coarse, with the kernel scale set to \(\sqrt{P}\cdot 4\), where P is the number of predictors).
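The three Gaussian kernel scales can be sketched with scikit-learn as follows. This assumes the Gaussian kernel is defined as exp(-||x-z||²/s²) for kernel scale s (Matlab's convention), so that scikit-learn's `gamma` corresponds to 1/s²; the Iris data stand in for the study's database:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)
P = X.shape[1]  # number of predictors

# fine / medium / coarse Gaussian kernel scales, as in the text
scales = {"fine": np.sqrt(P) / 4, "medium": np.sqrt(P), "coarse": np.sqrt(P) * 4}
models = {}
for name, s in scales.items():
    # assuming kernel exp(-||x-z||^2 / s^2), gamma = 1 / s^2
    models[name] = SVC(kernel="rbf", gamma=1.0 / s**2).fit(X, y)
```

Smaller kernel scales (larger gamma) make finer, more local distinctions between classes.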

3.1.5 KNN

The k-nearest neighbour algorithm, also known as KNN or k-NN, is a non-parametric supervised learning classifier that uses proximity to classify or predict the group of an individual data point. While it can be used for regression or classification problems, it is generally used as a classification algorithm, based on the assumption that similar points are found close together. Usually, the number k is odd (1, 3, 5, …) (Silverman and Jones 1989). The KNN classifiers tested were: fine KNN (number of neighbours set to 1), medium KNN (10 neighbours), coarse KNN (100 neighbours), cosine KNN, using a cosine distance metric (10 neighbours), cubic KNN, using a cubic distance metric (10 neighbours), and weighted KNN, using distance weighting (10 neighbours).
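An illustrative mapping of these six variants onto scikit-learn (the correspondence with the Matlab presets is an assumption; cubic distance is taken as the Minkowski metric with p = 3):

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
variants = {
    "fine":     KNeighborsClassifier(n_neighbors=1),
    "medium":   KNeighborsClassifier(n_neighbors=10),
    "coarse":   KNeighborsClassifier(n_neighbors=100),
    "cosine":   KNeighborsClassifier(n_neighbors=10, metric="cosine"),
    "cubic":    KNeighborsClassifier(n_neighbors=10, p=3),  # Minkowski, p=3
    "weighted": KNeighborsClassifier(n_neighbors=10, weights="distance"),
}
for model in variants.values():
    model.fit(X, y)
```

The fine variant (k = 1) memorises the training set, while the coarse variant (k = 100) averages over large neighbourhoods and gives much smoother decision boundaries.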

3.1.6 Ensemble learning

Ensemble learning for classification uses multiple learning algorithms to obtain a better predictive model, formed as a weighted combination of several classification models. In general, combining several classification models increases the predictive power. The types of ensembles tested were: subspace ensembles with discriminant learners, subspace ensembles with nearest-neighbour learners, and RUSBoost, bagged (random forest), and AdaBoost ensembles with decision tree learners (Breiman 1996, 2001; Hastie et al. 2008; Freund 2009).
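Three of the tree-based ensembles have direct scikit-learn analogues, sketched below for illustration (subspace ensembles and RUSBoost have no built-in scikit-learn equivalent and are omitted; the Iris data are a stand-in):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
ensembles = {
    # bootstrap-aggregated trees (Breiman 1996)
    "bagged_trees": BaggingClassifier(DecisionTreeClassifier(),
                                      n_estimators=30, random_state=0),
    # random forest: bagging plus random feature subsets (Breiman 2001)
    "random_forest": RandomForestClassifier(n_estimators=30, random_state=0),
    # boosting of shallow trees (Freund 2009)
    "adaboost": AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),
                                   n_estimators=30, random_state=0),
}
for model in ensembles.values():
    model.fit(X, y)
```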

3.1.7 Artificial neural networks (ANNs)

ANNs were also included in the second stage. A feedforward fully connected ANN can approximate multidimensional mappings arbitrarily well, given consistent data and enough neurons in its hidden layer (Hornik et al. 1989). The authors have successfully used ANNs in similar prediction problems (González-Enrique et al. 2019b; Ruiz-Aguilar et al. 2020; Moscoso-López et al. 2022). The ANNs were trained with the backpropagation algorithm (Rumelhart et al. 1986) using the Levenberg–Marquardt optimisation procedure. Finally, the results obtained were statistically analysed and compared using a resampling procedure in order to select the model with the best generalisation capabilities. ANN models with different numbers of hidden units were compared to determine the effect of adding non-linear processing capacity on model performance. Each model is a feedforward fully connected neural network with a different number of fully connected layers and hidden units. A ReLU activation function was used in each model. The rectified linear activation function, or ReLU, is a piecewise linear function that outputs the input directly if it is positive and zero otherwise (Glorot et al. 2011). It has been the most commonly used activation function in neural networks since 2017 (Ramachandran et al. 2017). The types of ANNs tested were: one hidden layer with 10, 25, or 100 neurons; two hidden layers with 10 × 10 neurons; and three hidden layers with 10 × 10 × 10 neurons.

3.2 Feature selection

The third stage is a feature selection procedure. The Minimum Redundancy Maximum Relevance (mRMR) approach (Peng et al. 2005) is a feature selection algorithm that ranks a set of features according to their relevance to the target variable. It also penalises redundant features. The best features are those with the highest trade-off between maximum relevance with the target variable and minimum redundancy with the remaining features.

Among the wide range of feature selection methods, the mRMR method was used in this work to rank the variables considered as inputs. This method has been successfully used by the authors in other studies related to air pollution (González-Enrique et al. 2021). Feature selection, one of the fundamental problems in pattern recognition and machine learning, involves identifying the subset of features that is most relevant to the target, usually referred to as maximum relevance. Such subsets often contain features that are relevant but mutually redundant, and mRMR addresses this problem by penalising redundancy. In this paper, the ten most relevant features were selected as inputs to the different models in order to test whether there are significant differences with respect to using all variables in the models.
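A greedy mRMR ranking can be sketched as follows. This is an illustrative mutual-information version using the difference criterion (Peng et al. also propose a quotient form), not the authors' actual implementation; the toy data and function name are hypothetical:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif, mutual_info_regression

def mrmr_rank(X, y, n_select=10):
    """Greedy mRMR: at each step select the feature maximising
    relevance(f, y) minus the mean redundancy with the already
    selected features (difference criterion)."""
    relevance = mutual_info_classif(X, y, random_state=0)
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_select:
        scores = []
        for f in remaining:
            red = np.mean([mutual_info_regression(X[:, [f]], X[:, s],
                                                  random_state=0)[0]
                           for s in selected]) if selected else 0.0
            scores.append(relevance[f] - red)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# toy check: feature 1 duplicates feature 0, features 2-4 are noise
rng = np.random.RandomState(0)
X = rng.randn(300, 5)
X[:, 1] = X[:, 0] + 0.01 * rng.randn(300)
y = (X[:, 0] > 0).astype(int)
ranking = mrmr_rank(X, y, n_select=3)
```

In the toy check, one of the two duplicated features is selected first for its relevance, but its near-copy is then heavily penalised for redundancy and excluded.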

3.3 Experimental procedure

A resampling procedure was used to estimate the prediction error on a test set and to reduce the effects of overfitting. The strategy randomly divided the database into three parts (training 70%, validation 10%, and test 20%), and the performance results were collected only for the test set in order to estimate the generalisation error of each model on unseen data, as the authors have successfully done in other papers (Turias et al. 2008; González-Enrique et al. 2019a; Ruiz-Aguilar et al. 2020; Moscoso-López et al. 2022). In this research, all the simulations were developed and tested in Matlab© software.
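The resampling protocol can be sketched in Python as follows (the study itself used Matlab; here the 70/10/20 split is obtained with two successive hold-out splits, and the function name is hypothetical):

```python
import numpy as np
from sklearn.model_selection import train_test_split

def resampled_splits(X, y, n_runs=20, seed=0):
    """Random 70/10/20 train/validation/test splits, repeated n_runs
    times; performance is averaged over the test sets only."""
    rng = np.random.RandomState(seed)
    for _ in range(n_runs):
        X_tr, X_tmp, y_tr, y_tmp = train_test_split(
            X, y, test_size=0.30, random_state=rng)
        # split the remaining 30% into validation (10%) and test (20%)
        X_val, X_te, y_val, y_te = train_test_split(
            X_tmp, y_tmp, test_size=2 / 3, random_state=rng)
        yield (X_tr, y_tr), (X_val, y_val), (X_te, y_te)

X = np.arange(100).reshape(100, 1)
y = np.zeros(100)
first = next(resampled_splits(X, y))
```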

The whole system can be seen as a mapping from a set of input features to an output variable, whose mathematical form is determined by the training data. Of course, the system must be capable of making good predictions on unseen data. In order to measure this generalisation ability, an additional set of samples (the test set) is used. We adopted a repeated resampling scheme to select the best model based on the generalisation performance of each one. The available data were divided into three different groups (training, validation, and test sets). The parameters of each model were estimated using the training set. The validation set was used for early stopping and to avoid overfitting. Finally, the test set was used to compute the classification quality indexes (sensitivity, specificity, accuracy, and precision), simulating the real performance of the model. This process was repeated 20 times and the results were averaged over these runs. To visualise the results obtained with a classification model, the confusion matrix is used (Ting 2010): each row (i) of the matrix C represents the number of predicted values for each class and each column (j) represents the number of real values for each class (C(i,j)). In this case, four classes are considered, one for each quartile of the output. Once an air pollutant has been selected, its values are divided into four quartiles, each containing 25% of the total distribution. The confusion matrix is calculated, and then the quality indexes of sensitivity, specificity, accuracy, and precision are derived from it. The Euclidean distance (d1) to a perfect classifier in terms of the quality indexes (sensitivity = 1, specificity = 1, accuracy = 1, precision = 1) is also calculated (Eq. 2).

$${d}_{1}=\sqrt{{\left(1-sensitivity\right)}^{2}+{\left(1-specificity\right)}^{2}+{\left(1-accuracy\right)}^{2}+{\left(1-precision\right)}^{2}}$$
(2)

In this case, the confusion matrix has a 4 × 4 dimension because the data are divided into four disjoint quartiles (classes). In order to obtain individual classification results for each quartile, the matrix was sequentially transformed, quartile by quartile, into an equivalent 2 × 2 confusion matrix (Table 4), which was used to calculate the well-known classification measures mentioned above (sensitivity, specificity, accuracy, and precision; see Eqs. 3–6). The model with the lowest d1 distance is chosen as the best classification model for each quartile. Quartiles are the statistical values that divide the dataset into four equal parts, each containing 25% of the data, resulting in lower, lower-middle, middle-upper, and upper divisions.

Table 4 Equivalent multi-class confusion matrix
$$Accuracy=\frac{TP+TN}{TP+TN +FP+ FN}$$
(3)
$$Precision=\frac{TP}{TP+FP}$$
(4)
$$Sensitivity=\frac{TP}{TP+FN}$$
(5)
$$Specificity=\frac{TN}{TN+FP}$$
(6)

True-positive (TP) and true-negative (TN) results are correctly classified, while false-negative (FN) and false-positive (FP) results are two types of errors calculated according to the literature (Ting 2010).
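The per-quartile quality indexes (Eqs. 3–6) and the d1 distance (Eq. 2) can be computed from the 4 × 4 confusion matrix as sketched below (an illustrative Python version following the row = predicted, column = actual convention stated above; the function name is hypothetical):

```python
import numpy as np

def quartile_metrics(C, q):
    """Collapse a 4x4 confusion matrix C (rows = predicted,
    columns = actual) into a 2x2 matrix for quartile q and
    compute the four quality indexes and d1 (Eq. 2)."""
    TP = C[q, q]
    FP = C[q, :].sum() - TP   # predicted q, actually another class
    FN = C[:, q].sum() - TP   # actually q, predicted otherwise
    TN = C.sum() - TP - FP - FN
    sens = TP / (TP + FN)
    spec = TN / (TN + FP)
    acc = (TP + TN) / C.sum()
    prec = TP / (TP + FP)
    d1 = np.sqrt((1 - sens)**2 + (1 - spec)**2
                 + (1 - acc)**2 + (1 - prec)**2)
    return sens, spec, acc, prec, d1

C = 10 * np.eye(4)  # a perfect classifier over the four quartiles
sens, spec, acc, prec, d1 = quartile_metrics(C, q=0)
```

For a perfect classifier all four indexes equal 1 and d1 equals 0, which is the reference point used to rank the models.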

All the calculations were performed separately for each air pollutant (SO2, PM10, NO2, NOX, and NO) as output at the three locations (Algeciras, Alcornocales, and La Línea), using either all the variables or only the ten most relevant ones, giving a total of 30 scenarios, each repeated 20 times following the resampling procedure explained above. The time series of SO2 and PM10 concentrations are computed as daily averages, and those of NO2, NOX, and NO on an hourly basis. The results of these experiments are presented in the next section.

4 Results

Simulations and prediction experiments were computed for five pollutants directly related to maritime traffic: SO2, PM10, NO2, NOX, and NO. The models were tested at three different locations: in the cities of Algeciras and La Línea, and at a third site some distance away in the remote area of the Alcornocales Park. As indicated in Table 2, the averages were calculated on an hourly or daily basis to comply with the European Directive 2008/50/EC. Figures 3 and 4 show the time series of the pollutants analysed, measured in µg/m3, with their upper assessment thresholds on an hourly or daily basis according to the Directive. These graphs show average concentrations, and it is worth noting that in 2017 the average SO2 concentrations in La Línea, where a refinery is located, were very high compared to the following years, which seems to be due to the installation of a desulphurisation unit in this refinery in 2018. Considering particulate matter, the lowest concentrations are found at the Alcornocales station and the highest at Algeciras, although overall concentrations are very similar in Algeciras and La Línea. On the other hand, NO2 (and nitrogen oxides in general) clearly shows much higher average concentrations in Algeciras than in La Línea and Alcornocales, which are quite similar to each other. This difference could indicate a high presence of diesel engines in Algeciras, which is consistent with the heavy truck traffic in and out of the port, the ships berthed in the Port of Algeciras, and the higher traffic density, since it is the most densely populated city in the bay.

Fig. 3

Daily mean time series of SO2 and PM10 from 2017 to 2019 with the Directive 2008/50/EC limit thresholds

Fig. 4

Hourly mean time series of NO2, NOX and NO from 2017 to 2019 with the Directive 2008/50/EC limit thresholds

Since the pollutant thresholds are defined in the regulations in terms of hourly and daily values, and in order to better understand the behaviour of each pollutant, weekly average diagrams of each air pollutant at the different stations have been calculated (Fig. 5). SO2 shows a higher concentration in La Línea, probably because the prevailing winds carry the pollution from ships and the surrounding industries towards La Línea (westerly situations), while in easterly situations SO2 seems to move towards Los Alcornocales, the remote area 30 km from the bay, which paradoxically shows a higher concentration than Algeciras. In the case of the PM10 averages, concentrations decrease during the night and, from the early hours of the morning, when anthropogenic activity begins, the values increase until late in the day. At weekends there is not much difference compared to the rest of the week. In the case of nitrogen oxides, there is a daily decrease in the early hours of the morning, then an increase to a maximum around midday, followed by a downward trend with a slowdown around mid-afternoon, which coincides with the pace of human activity and therefore traffic, especially vehicle traffic. The trend is more pronounced in the cities of La Línea and Algeciras; at the remote Los Alcornocales station there is only a slight increase at midday. In terms of daily averages, maximum values are observed on Tuesdays and Fridays, with a significant decrease at weekends. It should be noted that NO2 is higher in Algeciras than in La Línea, probably due to road traffic. Nitrogen oxides show two peaks per day, which suggests that they are related to human activity, and especially to diesel engines, whereas particulate matter and SO2 show only one peak per day.
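Assuming each station's measurements are available as a timestamp-indexed series, the weekly diagrams of Fig. 5 reduce to simple grouped averages. The sketch below uses synthetic data and pandas; both are hypothetical choices, not the authors' actual data or toolchain:

```python
import numpy as np
import pandas as pd

# Hypothetical hourly series for one pollutant at one station
# (the real data come from the monitoring network, 2017-2019).
rng = np.random.default_rng(0)
idx = pd.date_range("2017-01-01", "2019-12-31 23:00", freq="h")
s = pd.Series(rng.gamma(2.0, 10.0, len(idx)), index=idx, name="NO2")

# Hourly mean week diagram: average concentration for each
# (day-of-week, hour-of-day) slot across the whole period.
weekly_hourly = s.groupby([s.index.dayofweek, s.index.hour]).mean()

# Daily mean week diagram: first resample to daily means, then
# average by day of week (0 = Monday ... 6 = Sunday).
daily = s.resample("D").mean()
weekly_daily = daily.groupby(daily.index.dayofweek).mean()

print(weekly_hourly.shape)  # (168,) -> 7 days x 24 hours
print(weekly_daily.index.tolist())  # [0, 1, 2, 3, 4, 5, 6]
```

Plotting `weekly_hourly` reveals the kind of diurnal double peak described above for nitrogen oxides when real traffic-driven data are used.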

Fig. 5

Hourly and daily mean week diagrams for pollutants from 2017 to 2019

As explained above, one-step-ahead prediction models have been developed with the aim of predicting the next value of a time series of quartile concentrations, in order to contrast it with exceedances of the thresholds set in the Directive. Several classification models, including ANNs, were tested and their performance compared using the resampling procedure explained in Sect. 3. In each case, two experiments were conducted: one using all variables as inputs and another using only the ten most relevant variables. It should be noted that the results shown in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 are always calculated on the test sets (unseen data). In general, the results obtained are quite adequate, with high values of the classification quality indexes. Values of around 90% indicate that the prediction of the next daily/hourly mean one step ahead is quite accurate and represents a very reliable forecast. The results are reported separately for each quartile to give a more detailed picture of the prediction.
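The construction of the classification target can be sketched as follows. The series, names, and lag structure below are illustrative assumptions; the real targets are built from the monitoring data described in Sect. 3:

```python
import numpy as np
import pandas as pd

# Hypothetical daily SO2 series standing in for the real data.
rng = np.random.default_rng(1)
so2 = pd.Series(rng.gamma(2.0, 5.0, 1000), name="SO2")

# One-step-ahead classification target: the quartile (Q1..Q4)
# of the NEXT value, predicted from the variables at time t.
quartile = pd.qcut(so2, q=4, labels=["Q1", "Q2", "Q3", "Q4"])
y = quartile.shift(-1)          # quartile level at t + 1
X = so2.to_frame().iloc[:-1]    # predictors at t (here just the series itself)
y = y.iloc[:-1]                 # drop the last row, which has no t + 1 label

print(y.value_counts().sort_index())
```

Any classifier (tree, ensemble, SVM, ANN) can then be fitted on `(X, y)`; exceedances of the Directive thresholds concentrate in the upper quartile, Q4.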

Table 5 Best prediction model results for daily SO2 (t + 1) concentrations using all variables at t
Table 6 Best prediction model results for daily SO2 (t + 1) concentrations using top ten relevant features at t
Table 7 Best prediction model results for daily PM10 (t + 1) concentrations using all variables at t
Table 8 Best prediction model results for daily PM10 (t + 1) concentrations using top ten relevant features at t
Table 9 Best prediction model results for hourly NO2 (t + 1) concentrations using all variables at t
Table 10 Best prediction model results for hourly NO2 (t + 1) concentrations using top ten relevant features at t
Table 11 Best prediction model results for hourly NOX (t + 1) concentrations using all variables at t
Table 12 Best prediction model for hourly NOX (t + 1) concentrations using top ten relevant features at t
Table 13 Best prediction model results for hourly NO (t + 1) concentrations using all variables at t
Table 14 Best prediction model results for hourly NO (t + 1) concentrations using top ten relevant features at t

In Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14, the best prediction model for each air pollutant, location, and quartile is shaded and underlined (the combination with the smallest distance d1). The best counterpart model (same model, location, quality index, and quartile) is shown in bold when comparing Tables 5, 7, 9, 11, and 13, which present the results of the models using all variables, with Tables 6, 8, 10, 12, and 14, which use only the ten most relevant variables. The distance d1 was used to compare models at the same location and to select the best model in each case.
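Model selection by d1 can be sketched in a few lines. The Euclidean form below, measuring the distance of the quality-index vector (sensitivity, specificity, accuracy, precision) from the ideal point (1, 1, 1, 1), is an assumption on our part; the exact definition of d1 is given in Sect. 3, and the index values here are made up for illustration:

```python
import math

def d1(sensitivity, specificity, accuracy, precision):
    """Distance of a model's quality-index vector from the ideal
    classifier (1, 1, 1, 1); smaller is better.  The Euclidean
    form is assumed here -- the exact definition is in Sect. 3."""
    return math.sqrt((1 - sensitivity) ** 2 + (1 - specificity) ** 2
                     + (1 - accuracy) ** 2 + (1 - precision) ** 2)

# Model selection: the candidate with the smallest d1 wins.
# Index values below are hypothetical, not taken from the tables.
candidates = {
    "tree_top10": (0.92, 0.95, 0.94, 0.91),
    "ann_all":    (0.88, 0.93, 0.91, 0.87),
}
best = min(candidates, key=lambda name: d1(*candidates[name]))
print(best)  # tree_top10
```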

Comparing Table 5 (prediction models using all variables) with Table 6 (models using only the relevant variables) for the pollutant SO2, it can be seen that in all cases the ANN models significantly improve their prediction performance when only the relevant variables are used, although the tree classifiers predict better than the ANNs, as their distance d1 is the smallest. For this pollutant, tree classifiers are the best predictors in all cases. For SO2 in Algeciras, quartiles Q1 and Q2 are best predicted by the tree classifiers using only the ten most relevant variables, whereas quartile Q3 is best predicted by tree classifiers using all 130 variables and Q4 by ANN models using the ten most relevant variables. The best performing quartile (with the lowest d1) in Algeciras is Q1, with sensitivity, specificity, accuracy, and precision above 0.90. For SO2 in Alcornocales, better predictions are obtained in quartiles Q1 and Q2 with tree-type classifiers using only the ten most relevant variables. In the case of quartiles Q3 and Q4, better predictions are obtained with ensemble classifiers using all variables. The best results for Alcornocales are obtained for Q1, Q3, and Q4, all with values up to 0.97. For SO2 in La Línea, better predictions were obtained for quartiles Q1 and Q4 using relevant-variable models with tree classifiers, while Q2 and Q3 were better predicted with ensemble classifiers using all variables. In La Línea, Q4 shows the best results, with quality indexes above 0.98.

For the PM10 pollutant, Tables 7 and 8 show that the use of relevant variables improves the results of the classification models in almost all cases. The best models for each quartile, shaded and underlined regardless of whether they include all variables or only the relevant ones, turn out to be the tree models, with slightly better results than the ANNs. The best models for PM10 achieve the highest quality indexes. For instance, in Algeciras, Q1 is best predicted with tree classifiers using relevant variables, obtaining quality indexes above 0.95. In Alcornocales, quartile Q4 is also best predicted with tree classifiers using the top ten relevant variables, with quality indexes above 0.97. In La Línea, the best predicted quartile is Q4, with quality indexes up to 0.96.

The results for NO2 are shown in Tables 9 and 10. In this case, the relevant variables give better results only for Alcornocales and for quartile Q1 of La Línea. Tree-type models are also the best predictors for Alcornocales and La Línea, especially when all the variables are used; only for quartiles Q3 and Q4 of Alcornocales do neural network models perform better when the relevant variables are used. In the case of Algeciras, all quartiles are predicted equally well by SVM classifiers using all variables. For NO2, the best results are obtained for quartile Q1 in Algeciras with all quality indexes above 0.82, for Q1 in Alcornocales with ensemble models using all variables and quality indexes above 0.82, and for Q1 in La Línea with quality indexes above 0.82 with ensemble tree classifiers using the top ten variables.

In the case of NOX, reasonably equivalent behaviour is observed between models using all variables and models using only the relevant variables. By using fewer but more relevant variables, a large number of models improve their overall performance. In La Línea, the best models are ANNs using the relevant variables for quartiles Q2-Q4. In Algeciras, the performance of the ANNs is similar for quartiles Q2 and Q4, and in Alcornocales for Q1. The rest of the best models use all variables and correspond to SVMs and ensembles. The values are somewhat lower than for the other pollutants, with sensitivities above 80% and specificities above 93%. In the case of NO in Alcornocales (Tables 13, 14), no values have been obtained for Q2 because most of the data available in the database for this pollutant take such low values that they fall for the most part into Q1, except for some peaks of exceedances found in quartiles Q3 and Q4. For NO, ANNs seem to be the models that best predict the quartiles using the relevant variables. In fact, the best result is obtained for quartile Q1, with more than 94% precision for Alcornocales.

Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 show that ensemble boosted trees and tree classifiers produce better results than ANN models in most cases, but when the number of variables is reduced to the best ten, the ANNs improve considerably. Tables 15, 16, 17, 18, and 19 have also been included, highlighting the most relevant variables selected for each prediction model by the mRMR method. In these tables, only the ten most relevant variables are shown. Using these variables, prediction results similar to those shown in Tables 5, 6, 7, 8, 9, 10, 11, 12, 13 and 14 for the models using all the variables were obtained. Therefore, using only these top ten variables, a more efficient monitoring system could be designed, saving economic and time resources in the sensor network by measuring, storing, and transmitting fewer variables, and thus designing a more energy-sustainable system with a lower carbon footprint. Tables 15, 16, 17, 18, and 19 show the ten most relevant variables for each pollutant (SO2, PM10, NO2, NOX, and NO) and monitoring station (Algeciras, La Línea, and Alcornocales). In these tables, the meteorological variables for each pollutant are marked in yellow, and the remaining relevant pollutants, different from those analysed and repeated in at least two stations, are marked in other colours. In Tables 15, 16, 17, 18, and 19, each pollutant's own time series (SO2(t), PM10(t), NO2(t), NOX(t), and NO(t)) is expected always to appear, and this is indeed the case. For instance, Table 15 shows that the most relevant meteorological variables for SO2 are wind direction (WD) and rainfall (RF). For SO2, the most relevant air pollutants are O3 and nitrogen oxides, as expected. Table 16 shows the relevant variables for PM10, indicating that the most relevant meteorological variables are wind speed (WS), rainfall (RF), and relative pressure (RP), and the most relevant pollutants are nitrogen oxides.
Similarly, Table 17 for NO2 indicates that the most relevant meteorological variables are those related to wind (wind direction (WD) and wind speed (WS)) and rainfall (RF), and the most relevant pollutants are particulate matter (PM10 and PM2.5), O3, and SO2. Table 18 shows, for each NOX case, the same relevant meteorological variables as for NO2 plus relative humidity (RH), and the same relevant pollutants except SO2. In the case of NO, Table 19 shows that the relevant meteorological variables are those related to wind (wind direction (WD) and wind speed (WS)), solar radiation (SR), and rainfall (RF).
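The greedy logic behind an mRMR-style selection can be sketched as follows. This is not the paper's implementation: absolute Pearson correlation is used here as a simple stand-in for the mutual information of the actual mRMR method, and the data are synthetic:

```python
import numpy as np

def mrmr_rank(X, y, k=10):
    """Greedy mRMR-style ranking: at each step pick the feature with
    the highest relevance to the target minus its mean redundancy
    with the already-selected features.  Absolute correlation is a
    proxy for the mutual information of the real mRMR method."""
    n_features = X.shape[1]
    relevance = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(n_features)])
    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        scores = []
        for j in remaining:
            if selected:
                redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                      for s in selected])
            else:
                redundancy = 0.0
            scores.append(relevance[j] - redundancy)
        best = remaining[int(np.argmax(scores))]
        selected.append(best)
        remaining.remove(best)
    return selected

# Toy example: feature 0 drives the target; feature 1 is a near copy
# of feature 0, so redundancy should push it down the ranking.
rng = np.random.default_rng(2)
X = rng.normal(size=(500, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=500)   # redundant near-duplicate
y = X[:, 0] + 0.1 * rng.normal(size=500)
top = mrmr_rank(X, y, k=3)
print(top)
```

The redundancy penalty is what makes the selected top-ten lists in Tables 15, 16, 17, 18, and 19 compact: a near-duplicate of an already-selected variable adds relevance but little new information.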

The best models for each pollutant and location for the fourth quartile are shown in italics. Results are given for all quartiles, but we assume that the fourth quartile is the most important for prediction as it represents the most dangerous concentration levels.

Table 15 The ten most relevant variables for each SO2 (t + 1) level prediction
Table 16 The ten most relevant variables for each PM10 (t + 1) level prediction
Table 17 The ten most relevant variables for each NO2 (t + 1) level prediction
Table 18 The ten most relevant variables for each NOX (t + 1) level prediction
Table 19 The ten most relevant variables for each NO (t + 1) level prediction

5 Conclusions

In this work, an experimental procedure using a resampling strategy with five-fold cross-validation allowed the statistical comparison of the different classification models tested. The proposed approach is based on classification modelling, since the desired output is the next level (quartile at t + 1) of an air pollutant as a function of the other variables at a given time t. Two approaches have been used, one with hourly mean data for nitrogen oxides (NO2, NOX, and NO) and another one with daily mean data (for SO2 and PM10), due to the thresholds established in the European Directive 2008/50/EC, in order to obtain more reliable information in the study area. The approaches were developed in three different and separate locations: the main city of Algeciras, the city of La Línea, and the unspoilt remote area of Alcornocales, in order to contrast them and obtain more details on the behaviour of the air pollutants.

The main conclusions of this study are as follows:

  • The classification models can be used to provide very good air quality prediction results, with quality indexes of around 90% in most cases.

  • The use of the ten most relevant variables improves the results in most cases.

  • Ensemble boosted trees, SVM, trees, and ANNs classifiers tend to be the best prediction models in most cases.

  • The results obtained with ANNs are always improved by reducing the number of variables to the ten relevant ones.

  • Variable selection methods can be used to rank the variables by their importance.

  • By selecting fewer variables, it is possible to design a more energy sustainable system with a lower carbon footprint.

  • All forecasts can be useful to citizens, institutions, businesses in the port area, and the cities surrounding the port.

  • There is a background level (averages that are constantly repeated) that does not provide useful or accurate information about the ships' contribution. The conclusion that can be drawn from the data is that more sensors are needed close to the dock area where the ships are berthed in order to isolate the direct effect of the pollutants coming from the ships.

The logistical activity of a port has an impact on air quality. It is therefore necessary to implement predictive models that provide reliable forecasts to help citizens, companies, and institutions make decisions and drive policy changes, ensuring a healthier and cleaner environment for present and future generations.