1 Introduction

Ambient air pollution is one of the most serious problems faced by developing countries such as India. In 2018, India was ranked 177th among 180 countries in the Environmental Performance Index (EPI) [1]. The study area selected is the industrial area of Ambalamugal in the South Indian state of Kerala, which is ranked 24th amongst the critically polluted areas (CPA) in India [2]. The major industries located in the area comprise a petroleum refinery, a phosphatic fertiliser unit, a petrochemical plant and a carbon black manufacturing unit. In the urban environment, various types of pollutants are released into the atmosphere at varying concentrations and at different heights, and meteorological parameters determine the transportation, dilution, dispersion, transformation, deposition and absorption of these pollutants [3].

The main air pollutants in the atmosphere are carbon monoxide (CO), nitrogen oxides (NOx), particulate matter (PM), sulphur dioxide (SO2) and ozone (O3) [4]. Exposure to air pollutants increases the risk of contracting respiratory diseases, such as asthma, respiratory infections and chronic obstructive pulmonary disease, in children and adults alike [5]. In the area of study, sulphur dioxide (SO2) and sulphur trioxide (SO3) are the major oxides of sulphur responsible for pollution, along with sulphate-containing compounds (SO₄²⁻). They are mainly produced by the combustion of sulphur-containing fuel and by refining, manufacturing, municipal incineration and metal extraction processes. SO2 can cause acid rain and corrosion, and can damage human, plant and animal health. These pollutants also pose a heightened danger because of their synergistic effects, as they can interact when present in the same medium [6].

It is important that adequate tools are available to forecast the concentrations of these harmful pollutants well in advance. Air pollution models currently in use are of two types: (i) deterministic or mechanistic models and (ii) statistical or data-driven models [7, 8]. The former represent the various pollutant transport mechanisms and chemical reactions mathematically; hence, they tend to be laborious, exhaustive and computationally expensive. The latter can identify the relationships between the input and output variables without having to evaluate all transport mechanisms or chemical transformations of the pollutants. The Artificial Neural Network (ANN) is a statistical model, and its capability of learning from, and predicting with, parameters that have nonlinear relationships with each other is utilized in this work [9]. An ANN model is designed to replicate the basic function of a biological neural network and is used for solving complex nonlinear problems. It can recognize nonlinear relationships between variables and complex patterns in data sets that may not be adequately described by simple mathematical equations [6], and it can model relationships without making any prior assumptions regarding the data distributions [10]. If sufficiently trained, statistical models like ANN are found to be more appropriate than deterministic models in determining dependencies between pollutant concentration and predictor parameters [11].

ANN models (Fig. 1) are similar to biological nervous systems, which consist of simple but numerous nerve cells called “neurons” that are interconnected to form networks. Generally, an ANN model has three layers. The first is called the input layer, and the number of nodes in it equals the number of input parameters. The final layer is called the output layer, and its number of nodes equals the number of outputs. The hidden layer connects the input and output layers. Input data are introduced to the network through the input layer, and the output layer provides the response of the network [12]. Neurons have the ability to work in parallel and to learn. Hence, a neural network need not be explicitly programmed; it can learn from training samples or by means of reinforcement. The resulting network has a high degree of tolerance to noisy data. ANN models are designed to mimic these capabilities of biological neural systems.

Fig. 1 Artificial neuron model
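As a minimal sketch of the computation performed by a single artificial neuron and by a layer of such neurons, the Python snippet below forms the weighted sum of the inputs plus a bias (threshold) and passes it through an activation function. The function names `neuron_output` and `layer_output`, and all weights and input values, are illustrative assumptions and are not taken from the study.

```python
import math


def neuron_output(inputs, weights, bias, activation=math.tanh):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(net)


def layer_output(inputs, weight_rows, biases, activation=math.tanh):
    """Outputs of one layer: one neuron per row of weights."""
    return [neuron_output(inputs, w, b, activation)
            for w, b in zip(weight_rows, biases)]


# Hypothetical numbers, for illustration only: 3 inputs, 2 hidden neurons, 1 output.
x = [0.5, -0.2, 0.1]
hidden = layer_output(x, [[0.4, 0.3, -0.5], [-0.6, 0.1, 0.2]], [0.0, 0.1])
# Linear activation at the output node.
y = neuron_output(hidden, [1.0, -1.0], 0.0, activation=lambda v: v)
```

The same two functions extend to any layer sizes, since the number of rows in `weight_rows` sets the number of neurons in the layer.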

The activation functions of the input and output layers can be either linear or nonlinear, while the hidden layer can have only nonlinear activation functions [13]. Sigmoid functions are widely used because a simple relationship exists between the value of the function and the value of its derivative at any point, which reduces the computational complexity of the network [14]. The hyperbolic tangent and logarithmic sigmoid functions are defined in Eqs. (1) and (2), respectively [15]:

$$f(x) = \frac{{e^{x} - e^{ - x} }}{{e^{x} + e^{ - x} }}$$
(1)
$$f(x) = \frac{1}{{1 + e^{ - x} }}$$
(2)
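The simple relationship between each sigmoid function and its derivative can be illustrated with a short Python sketch. The hyperbolic tangent follows Eq. (1). The second function assumes the logistic form 1/(1 + e^(−x)), which is what MATLAB's `logsig` transfer function computes; this is an assumption about which log-sigmoid the text intends. The function names are illustrative.

```python
import math


def tanh_sigmoid(x):
    """Hyperbolic tangent sigmoid, Eq. (1); output lies in (-1, 1)."""
    return math.tanh(x)


def log_sigmoid(x):
    """Logistic sigmoid 1 / (1 + e^-x) (assumed form); output lies in (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)  # avoids overflow of exp(-x) for large negative x
    return e / (1.0 + e)


def tanh_sigmoid_derivative(x):
    """f'(x) = 1 - f(x)^2: only the function value is needed."""
    f = tanh_sigmoid(x)
    return 1.0 - f * f


def log_sigmoid_derivative(x):
    """f'(x) = f(x) * (1 - f(x)): only the function value is needed."""
    f = log_sigmoid(x)
    return f * (1.0 - f)
```

Because each derivative is expressed through the function value already computed in the forward pass, backpropagation needs no extra exponential evaluations.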

The most common network used for prediction modelling is the multilayer perceptron feed-forward network, in which information flows in one direction from input to output nodes [16]. The aim of this study was to develop an optimized feed-forward ANN model employing the backpropagation algorithm, suitable for predicting air quality parameters for the South Indian industrial area of Ambalamugal. A multilayer feed-forward ANN model is characterized by one input layer, at least one hidden layer and one output layer, each composed of neurons [13]. ANN models with one hidden layer trained with the Levenberg–Marquardt backpropagation algorithm are commonly used for prediction [12, 15, 17]. The backpropagation algorithm feeds the error produced by the network back to the nodes to modify the connection weights and thresholds [12]. For this study, which was intended to predict air quality parameters, time-series modelling was utilized. Time-series modelling analyses data already available from the past and extrapolates future values based on this historical data [18]. It employs a type of dynamic filtering, i.e. past values of one or more time series are used to predict future values. Such networks, referred to as dynamic neural networks, use tapped delay lines for nonlinear filtering and prediction. ANN models have been effectively used to determine nonlinear solutions to prediction problems and can approximate any quantifiable function if they are designed appropriately and trained with suitable historical data. An ANN can predict values even in the presence of noisy data, and it learns the patterns and generalizations in the available data.

Provided that seasonal variations are duly considered and emissions are present throughout the year, a full year of data is sufficient for developing statistical models [19]. The data sets have to be partitioned into training, validation and testing data sets such that every subset represents the entire data set [15]. The data sets used in the modelling have to be cleaned, randomized and normalized. Removal of unusual values, errors and outliers is necessary before modelling, as the network learns from the input data sets. If presented with an erroneous data set, the network will learn accordingly and ultimately produce erroneous results. The data sets have to be normalized to ensure that all the input values fall into a comparable range. If not normalized, inputs of higher numerical magnitude tend to mask inputs of numerically smaller values [16].

As the number of input parameters increases, the complexity of the developed ANN model also increases, which results in degraded model performance. Hence it is imperative that input optimization techniques are applied to narrow the parameters down to an optimum set by eliminating those variables that have the least effect on the model’s efficiency [15]. Forward selection starts with no predictor variable in the model and successively adds those variables that have the highest correlation with the target output. Backward elimination begins with all predictor input variables and successively eliminates those parameters that provide the least increase in the squared error [15].
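Forward selection as described above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the procedure used in the study: candidates are ranked by the absolute Pearson correlation with the target, and each is kept only if a user-supplied error function decreases. The names `pearson_r`, `forward_selection` and `score_fn` are hypothetical; in practice `score_fn` would train a network on the given inputs and return its validation error.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_a * var_b)


def forward_selection(columns, target, score_fn):
    """Greedy forward selection.

    columns  : dict mapping input name -> list of values
    target   : list of target values
    score_fn : callable(list of names) -> error, lower is better (assumed)
    """
    ranked = sorted(columns,
                    key=lambda name: abs(pearson_r(columns[name], target)),
                    reverse=True)
    selected, best = [], float("inf")
    for name in ranked:
        trial = selected + [name]
        err = score_fn(trial)
        if err < best:  # keep the variable only if the error improves
            selected, best = trial, err
    return selected, best
```

Passing the error function as an argument keeps the selection logic independent of how the model is trained.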

The number of hidden layers, the number of neurons constituting the hidden layers, the transfer functions and the training algorithm also affect the performance of ANN models [13]. Training is the process of determining connection weights and thresholds to minimize the difference between the actual and predicted outputs [12]. Mean squared error (MSE) and coefficient of determination (R2) are among the parameters used for evaluating the developed networks. The closer the coefficient of determination is to 1 and the mean squared error is to 0, the better the network’s performance is considered to be [6]. This work also optimizes the input parameters based on the performance of the ANN model using forward selection and backward elimination. The optimized models are compared using the model performance evaluation criteria. The effects of meteorological and pollutant parameters on the prediction capability of the ANN model are also compared.

There is a crucial need to monitor the concentrations of various pollutants and to understand how meteorological and pollutant parameters contribute to the prediction of ambient air quality parameters. Even though studies on the prediction of ambient air quality parameters are available, air quality models for monitoring pollutant concentrations are not widely used in the developing world. In a critically polluted area like Ambalamugal in Kerala, where this study was conducted, continuous ambient air quality monitoring facilities are available, but additional tools such as ANN models are needed to predict the concentrations of air quality parameters when the equipment is taken out of service for maintenance or calibration. During the severe floods that affected the state of Kerala on 8 August 2018 due to unusually high rainfall during the monsoon season, the monitoring equipment was found to be non-functional. It is imperative that auxiliary techniques that can predict the air quality parameters are developed, in addition to the direct measurement devices installed in these areas. Additionally, the effects of pollutant and meteorological parameters on the prediction capability of ANN-based modelling need to be compared. This study aims to determine the capability of ANN models in predicting air quality parameters in an industrial area, which is demonstrated by assessing how well the concentration of sulphur dioxide (SO2) can be predicted using various input parameters. Owing to site specificity, ANN models are constructed individually for each location. A neural network constructed and trained for a particular monitoring location will not be effective in predicting the pollutant concentrations at a different location, as the predictor variables vary between sites [11, 17]. Hence the applicability of the developed model is limited to the site under study.

2 Materials and methods

Artificial Neural Networks are statistical models used for training and prediction of outputs that have nonlinear relationships with their corresponding inputs. The topology, the ANN parameters and the learning algorithm were selected so as to predict the outputs with the required level of performance.

2.1 Study area

The study area selected was Ambalamugal, an industrial suburb in the South Indian state of Kerala. It is located 14 km east of Ernakulam in Kerala, at an altitude of 12 m above sea level.

2.2 Software

The Neural Network Toolbox of MATLAB R2018a (The MathWorks Inc. USA) was used for constructing the air quality prediction model. The Neural Network Toolbox allowed selection of various parameters for configuring the desired network architecture.

2.3 Data pre-processing

The prediction model was developed using data sets collected over a period of 2 years, from September 2016 to September 2018. Pollutant data were collected from the three continuous ambient air quality monitoring stations maintained by the petroleum refinery located in the study area. The facility is capable of measuring six air quality parameters, namely sulphur dioxide (SO2), nitrogen oxides (NOx), ammonia (NH3), carbon monoxide (CO), particulate matter of size less than 10 µm in diameter (PM10) and particulate matter of size less than 2.5 µm in diameter (PM2.5). Meteorological data were collected from the India Meteorological Department (IMD). The meteorological parameters considered were surface temperature (T), rainfall (RF), relative humidity (RH), wind direction (WD) and wind velocity (WV). The time parameters considered were month of the year (MY), day of the month (DM) and day of the week (DW), encoded as 1 to 12 for January to December, 1 to 31 for the day of the month, and 1 to 7 for the day of the week (1 = Sunday to 7 = Saturday).

The collected data were pre-processed by cleaning, normalizing and randomizing the input data sets. Any unclear data, anomalies, outliers or errors were removed from the data sets. The original inputs were normalized for convergence of the model using Eq. (3), and normalized outputs were converted back to original data sets using Eq. (4), given below:

$$X_{\text{norm}} = \frac{{X - X_{\mathrm{min} } }}{{X_{\mathrm{max} } - X_{\mathrm{min} } }} \times (r_{\mathrm{max} } - r_{\mathrm{min} } ) + r_{\mathrm{min} }$$
(3)
$$X = \left( {\left( {\frac{{X_{\text{norm}} - r_{\mathrm{min} } }}{{r_{\mathrm{max} } - r_{\mathrm{min} } }}} \right) \times (X_{\mathrm{max} } - X_{\mathrm{min} } )} \right) + X_{\mathrm{min} }$$
(4)

where Xnorm was the normalized value, X was the original value, Xmin and Xmax were the minimum and maximum values of X, and rmin and rmax were −0.9 and 0.9, respectively.
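Equations (3) and (4) translate directly into code. The Python sketch below is an illustration with hypothetical function names and sample values, not the MATLAB routine used in the study; it scales a column of values to the range [−0.9, 0.9] and then maps the normalized values back to the original scale.

```python
def normalize(values, r_min=-0.9, r_max=0.9):
    """Min-max scaling of Eq. (3); returns the scaled list plus X_min and X_max."""
    x_min, x_max = min(values), max(values)
    if x_max == x_min:
        raise ValueError("cannot normalize a constant column")
    scale = (r_max - r_min) / (x_max - x_min)
    return [(x - x_min) * scale + r_min for x in values], x_min, x_max


def denormalize(normed, x_min, x_max, r_min=-0.9, r_max=0.9):
    """Inverse mapping of Eq. (4), back to the original units."""
    scale = (x_max - x_min) / (r_max - r_min)
    return [(v - r_min) * scale + x_min for v in normed]


# Hypothetical concentrations, for illustration only.
sample = [12.0, 20.0, 44.0]
scaled, lo, hi = normalize(sample)
restored = denormalize(scaled, lo, hi)
```

The minimum and maximum found on the training data need to be stored, since the same two values are required to convert network outputs back to concentrations.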

2.4 ANN model

Of the total data sets collected, 70% were used for training the constructed neural network, 15% for validation and the rest for testing. In this study, a feed-forward time-series neural network was used. The model comprised three distinct layers: an input layer, a hidden layer and an output layer. Generally, one hidden layer is considered for the construction of ANN models used for prediction problems [14]. The number of nodes in the input layer equalled the number of input parameters, which was 14 when all parameters were used. The number of neurons in the hidden layer was decided by trial and error, observing the performance of the network with varying numbers of neurons. Up to (2N + 1) neurons were considered, where N was the number of input parameters employed in each model [16], and the number of neurons that provided the best performance was taken as the final value. The number of nodes in the output layer equalled the number of outputs to be obtained from the network model. The network was trained using the Levenberg–Marquardt backpropagation algorithm, which is based on the error-correction gradient descent learning method. Time-series prediction models were used in the study, which extrapolated the output data from the historical input data already available. The multilayer perceptron model used input data whose seasonal variations were retained and predicted future data as output. It performed the functional mapping given in Eq. (5).

$$y_{t} = f\left( {y_{t - 1} , y_{t - 2} , \ldots , y_{t - n} } \right)$$
(5)

where yt was the estimated output, and yt−1, yt−2, …, yt−n were the training pattern, which consisted of a fixed number (n) of lagged observations of the series [18]. The number of delays used for time-series forecasting was determined by trial and error, by observing the performance of the network with varying numbers of delays while keeping all other parameters constant. The number of delays that resulted in the best-performing model was finalised as the delay value.
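The training patterns implied by Eq. (5) can be built with a simple sliding window. The Python sketch below is an illustration only: the function name `make_lagged_patterns` is hypothetical, and it covers just the autoregressive mapping of Eq. (5), not the additional pollutant, meteorological and time inputs used in the study.

```python
def make_lagged_patterns(series, n_delays):
    """Build (inputs, targets) for Eq. (5).

    inputs[k]  = [y_{t-1}, y_{t-2}, ..., y_{t-n}]
    targets[k] = y_t
    """
    if n_delays < 1 or n_delays >= len(series):
        raise ValueError("n_delays must be between 1 and len(series) - 1")
    inputs, targets = [], []
    for t in range(n_delays, len(series)):
        inputs.append([series[t - lag] for lag in range(1, n_delays + 1)])
        targets.append(series[t])
    return inputs, targets


# Toy series, for illustration only, with two delays.
patterns, outputs = make_lagged_patterns([1, 2, 3, 4, 5], n_delays=2)
# patterns == [[2, 1], [3, 2], [4, 3]] and outputs == [3, 4, 5]
```

A series of length m with n delays yields m − n patterns, so larger delay values leave fewer samples for training.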

The study was based on the pollutant, meteorological and time parameters. Accordingly, three different types of architecture were considered. Model A was constructed using all 14 input parameters. Model B was constructed using only SO2 and the meteorological and time parameters; the other pollutant parameters were excluded. Model C was constructed using only the pollutant and time parameters; the meteorological parameters were excluded.

The typical architecture of ANN model A, consisting of the 14 input parameters (six pollutant + five meteorological + three time parameters) and one output, is provided in Fig. 2.

Fig. 2 MLP architecture for air quality prediction using all parameters

The typical architecture of ANN model B, consisting of the nine input parameters (five meteorological + three time parameters + SO2) and one output, is provided in Fig. 3.

Fig. 3 MLP architecture for air quality prediction using meteorological and time parameters + SO2

The typical architecture of ANN model C, consisting of the nine input parameters (six pollutant + three time parameters) and one output, is provided in Fig. 4.

Fig. 4 MLP architecture for air quality prediction using pollutant and time parameters

A flowchart of the general methodology followed in the construction and execution of the ANN model is provided in Fig. 5.

Fig. 5 Flowchart of ANN modelling method

2.5 Input parameter optimization

Model A was constructed using all 14 input parameters; Model B using only the meteorological and time parameters and SO2; and Model C using only the pollutant and time parameters. Within each type, three different models were considered in order to understand the relevance of air quality, time and meteorological parameters to the performance of the ANN model. Model 1 was constructed using all the input parameters of the original data set. Model 2 was constructed by choosing input parameters with the forward selection technique. Model 3 was constructed by choosing input parameters with the backward elimination technique. The performance of each network was then evaluated to determine which input optimization technique provided satisfactory results, and the network was updated based on the optimized inputs. The list of the ANN models considered in this study is provided in Table 1.

Table 1 List of ANN models considered in the study
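Backward elimination can be sketched as a loop around any error measure. The Python snippet below is an illustration under stated assumptions rather than the study's implementation: `score_fn` is a hypothetical callable that would train a network on the given inputs and return its MSE, and the stopping rule (stop once every removal increases the error) is one common choice.

```python
def backward_elimination(names, score_fn):
    """Greedy backward elimination.

    names    : list of candidate input names
    score_fn : callable(list of names) -> error, lower is better (assumed)

    Starting from all inputs, repeatedly drop the input whose removal
    increases the error least; stop when every removal makes it worse.
    """
    selected = list(names)
    best = score_fn(selected)
    while len(selected) > 1:
        trials = [(score_fn([n for n in selected if n != drop]), drop)
                  for drop in selected]
        err, drop = min(trials)
        if err > best:
            break
        selected.remove(drop)
        best = err
    return selected, best


# Toy error function, for illustration only: each input has a made-up
# "usefulness", and every extra input adds a small complexity penalty.
usefulness = {"SO2": 0.5, "NH3": 0.2, "RF": 0.0, "WD": 0.0}


def toy_error(subset):
    return 1.0 - sum(usefulness[n] for n in subset) + 0.05 * len(subset)


kept, error = backward_elimination(list(usefulness), toy_error)
```

With this toy error function the two inputs of zero usefulness are removed and the loop stops once removing either remaining input would raise the error.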

2.6 Model performance evaluation criteria

The performance of the constructed network is judged by how closely the network-predicted outputs agree with the actual output values. The model performance evaluation criteria used for the ANN model were mean squared error (MSE) and coefficient of determination (R2). The equations for calculating MSE and R2 are given in Eqs. (6) and (7), respectively:

$${\text{MSE}} = \sum\limits_{i = 1}^{n} {\frac{{\left( {x_{i} - y_{i} } \right)^{2} }}{n}}$$
(6)
$$R^{2} = 1 - \frac{{\sum \left( {x_{i} - y_{i} } \right)^{2} }}{{\sum y_{i}^{2} - \frac{{\left( {\sum y_{i} } \right)^{2} }}{n}}}$$
(7)

where xi denoted the actual output, yi denoted the predicted outputs and n was the total number of observations [4, 20].
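A short Python sketch of the two evaluation criteria follows. The MSE matches Eq. (6). For R2, the denominator here is the total sum of squares of the actual values about their mean, which is the conventional definition; Eq. (7) writes that term with yi, so this should be read as an assumption rather than the authors' exact computation. The function names are illustrative.

```python
def mean_squared_error(actual, predicted):
    """Eq. (6): mean of the squared differences between actual and predicted."""
    n = len(actual)
    return sum((x - y) ** 2 for x, y in zip(actual, predicted)) / n


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot (conventional form)."""
    n = len(actual)
    ss_res = sum((x - y) ** 2 for x, y in zip(actual, predicted))
    mean_actual = sum(actual) / n
    ss_tot = sum((x - mean_actual) ** 2 for x in actual)
    if ss_tot == 0:
        raise ValueError("R2 is undefined when the actual values are constant")
    return 1.0 - ss_res / ss_tot
```

A perfect prediction gives an MSE of 0 and an R2 of 1, while predicting the mean of the actual values for every observation gives an R2 of 0.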

3 Results and discussion

The number of hidden neurons was determined by considering up to (2N + 1) neurons, where N is the number of input parameters employed in the model. As there were 14 input parameters, the ANN model was trained with the number of hidden neurons varied from 1 to 29 (2 × 14 + 1 = 29), keeping all other conditions the same. Comparing the results for 1 to 29 neurons, the best performance was obtained with 15 hidden neurons, which resulted in an MSE of 0.002 and a coefficient of determination of 0.987. Figure 6 shows the variation of MSE and R2 with the number of hidden neurons.

Fig. 6 Variation of mean squared error (MSE) and coefficient of determination (R2) with number of neurons
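The trial-and-error search over hidden-layer sizes amounts to evaluating each candidate and keeping the one with the lowest error. The Python sketch below illustrates this; `evaluate` is a hypothetical callable that would train the network with k hidden neurons and return its MSE, and the toy error curve used in the example is made up, not the study's results.

```python
def hidden_neuron_candidates(n_inputs):
    """Candidate hidden-layer sizes 1 .. 2N + 1 for N input parameters."""
    return range(1, 2 * n_inputs + 2)


def select_best(candidates, evaluate):
    """Return (best_k, best_mse) over the candidates.

    evaluate : callable(k) -> MSE of a network trained with setting k (assumed)
    """
    best_k, best_mse = None, float("inf")
    for k in candidates:
        mse = evaluate(k)
        if mse < best_mse:
            best_k, best_mse = k, mse
    return best_k, best_mse


# With 14 inputs the candidates run from 1 to 29.
sizes = list(hidden_neuron_candidates(14))

# Made-up error curve with its minimum at 15, for illustration only.
best_size, best_error = select_best(sizes, lambda k: (k - 15) ** 2 * 1e-4 + 0.002)
```

The same `select_best` helper applies to any single setting that is tuned while the others are held fixed, such as the number of delays.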

The number of delays was decided by trial and error. All other parameters within the network were kept the same, and the performance of the networks with varying numbers of delays was evaluated. Comparing the results, the best performance was obtained with 2 delays. Incidentally, this was the default number of delays in the MATLAB software’s time-series application. Further increases in the number of delays were not found to improve the performance of the ANN models. The model resulted in an MSE of 0.016 and a coefficient of determination of 0.863. Figure 7 shows the variation of MSE and R2 with the number of delays.

Fig. 7 Variation of mean squared error (MSE) and coefficient of determination (R2) with number of delays

Model A was developed using 9184 input data sets (14 parameters × 656 observations) collected over a period of 2 years, from September 2016 to September 2018. The data sets were partitioned such that 70% of the data (6440 data sets) was used for training, 15% (1372 data sets) for validation and 15% (1372 data sets) for testing. Similarly, Model B was developed using 5904 input data sets (9 parameters × 656 observations). Model C was developed using 9 input parameters, consisting of pollutant and time variables; it was also developed using 5904 input data sets (9 parameters × 656 observations).
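The partition sizes quoted above can be reproduced with a few lines of Python. This is a sketch of one rounding scheme that happens to match the reported counts (validation and test sizes rounded from 15%, training taking the remainder); whether the MATLAB toolbox rounds in exactly this way is an assumption. The function names and the random seed are illustrative.

```python
import random


def partition_counts(n_obs, val_frac=0.15, test_frac=0.15):
    """Number of observations for training, validation and testing."""
    n_val = round(n_obs * val_frac)
    n_test = round(n_obs * test_frac)
    n_train = n_obs - n_val - n_test  # training takes the remainder
    return n_train, n_val, n_test


def split_indices(n_obs, seed=0):
    """Randomly assign observation indices to the three subsets."""
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    n_train, n_val, _ = partition_counts(n_obs)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


n_train, n_val, n_test = partition_counts(656)
# Multiplying by the 14 inputs of Model A gives the counts quoted in the text.
counts_model_a = tuple(14 * n for n in (n_train, n_val, n_test))
# counts_model_a == (6440, 1372, 1372)
```

For the nine-input models the same observation split applies, and only the multiplier changes.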

Table 2 provides the details of the performance evaluation of the different models developed using input optimization techniques. Models that combined ANN modelling with an input optimization technique performed better than the conventional ANN model, which considered all inputs without any optimization: the error decreased and the correlation increased. Of the two techniques, backward elimination was in general found to provide better results than forward selection.

Table 2 Performance evaluation of ANN models

Among the Model ‘A’ ANN models, Model A4, which was constructed using input parameters chosen by input optimization techniques, provided better results than the remaining models. In Model ‘B’, the ANN model combined with backward elimination provided the best results, and Model ‘C’ followed the same pattern. Of all the 10 models evaluated, the best result was exhibited by Model C3, which was constructed using pollutant and time parameters only. According to this model, five parameters, namely month of the year, day of the week and the concentrations of SO2, NH3 and PM2.5, contributed most to the prediction capability of the model.

Figure 8 shows the regression plots of Models A1, A2, A3 and A4.

Fig. 8 Regression plot of model A networks

Figure 9 depicts the variation of actual and predicted SO2 concentrations using the best model, C3.

Fig. 9 Variation of actual and predicted SO2 concentrations using best model C3

The predicted SO2 concentrations were also compared with the National Ambient Air Quality Standards values [21], and all were below the allowable limit of 80 µg/m3 (24 h TWA value). The maximum predicted SO2 value was only 55% of the maximum allowable concentration in ambient air.

4 Conclusions

The current experimental research utilized feed-forward backpropagation artificial neural network modelling to predict air quality parameters based on 6 pollutant, 5 meteorological and 3 time parameters. The prediction capability of the model was demonstrated by predicting the concentration of sulphur dioxide in an industrial area using time-series ANN modelling. Different ANN models were developed using parameter selection techniques to determine the input parameters that affected the prediction capability of the ANN model. Similarly, different ANN models were constructed and evaluated to study the effects of pollutant and meteorological variables.

  • Of all the models, the conventional model with all parameters exhibited reduced performance, which implies that input optimization and subsequent parameter selection resulted in improved performance. The other models exhibited an average reduction in MSE of 26% and an average improvement in correlation of 4%.

  • Comparison of the best-performing models of types A, B and C showed that the best performance was exhibited by the ANN model having pollutant and time parameters only. The minimum MSE obtained in any model was 0.0115 and the maximum R2 value was 0.8979.

  • The best result was provided by models incorporating pollutant variables, since the concentration of the pollutant SO2 was being predicted in the study. Additionally, any model trained and evaluated showed a minimum reduction in MSE of 9% and an improvement in correlation of 3%.

  • When compared with the National Ambient Air Quality Standards values, the predicted concentrations of SO2 were found to fall well below the allowable limit, with the maximum predicted value reaching only 55% of the maximum allowable concentration in ambient air. The rate of gaseous SO2 emissions into the ambient air and the control measures in place could be monitored by similar comparisons and were found to be in agreement.

  • The results revealed that ANN models can be effectively utilized for predicting the concentration of pollutants. As far as the prediction capacity of the ANN models for primary air pollutants was concerned, the models exhibited very promising results. However, the need to include additional parameters when predicting secondary air pollutants, for which other reaction mechanisms must also be considered, remains to be studied.

  • Comparing the results, it was found that, in general, all the models showed MSE values close to zero and R2 values close to unity. Input optimization and the choice of whether to include pollutant or meteorological parameters both helped in fine-tuning the predicted results and reduced the disparity between the actual and predicted outputs.

Further studies could also be conducted to understand the mechanism by which the ANN evolves relationships between input and output variables. Hybrid ANN models can be effectively utilized to predict the concentration of SO2 and similar air quality parameters in an industrial area.