1 Introduction

A framework for the development of a computationally-efficient surrogate model for air pollution dispersion is studied in this paper.

Surrogate modelling (Keane et al. 2008; Koziel and Leifsson 2013; Jiang et al. 2020) is an engineering method that is used when we cannot compute the response of the model of the system of interest easily enough. It is a method that helps to alleviate computationally demanding computations necessary for tasks like design optimisation, design-space exploration, and various computationally intensive analyses. Speeding up computations can be done by constructing an approximation model known as a surrogate model, meta model, emulation model, or emulator. The surrogate model is usually developed by selecting a suitable data-driven black-box model. Such a model is obtained from the input-output response of the original mathematical model. It is constructed based on appropriately selected input values that excite the original mathematical model in the region of interest as illustrated in Fig. 1.

Fig. 1
figure 1

The principle of surrogate modelling

Surrogate models have been used in different scientific fields for different tasks, see (Alizadeh et al. 2020) for a review. The methodology is used also in atmospheric sciences for dispersion modelling, e.g. (Carnevale et al. 2012; Bowman and Woods 2016; Gunawardena et al. 2021; Le et al. 2019; Francom et al. 2019; Girard et al. 2020). The relevance of dispersion modelling is non-disputable and addressed in numerous studies dealing with different environments, situations and methods, e.g. (Ravina et al. 2021).

The purposes of developing surrogate models range from uncertainty quantification (Francom et al. 2019) to spatial-deposition prediction (Gunawardena et al. 2021). The core of all applications is to replace a computationally demanding model with a faster surrogate one. Details of some recent investigations are as follows.

To investigate the relative impact of a collection of uncertain model inputs and their interactions on the outputs of the atmospheric dispersion for the Fukushima nuclear catastrophe, Girard et al. (2016) applied Sobol’s global sensitivity analysis using Gaussian-process emulation. At 64 measurement locations, the emulators’ ability to forecast time- and space-aggregated gamma dose rates, as well as time-integrated gamma radiation rates, were assessed.

Le et al. (2019) describe the use of surrogate models for the prediction of integrated statistical measures, e.g. Root Mean Square Error, based on inputs describing meteorological forecasts and a source term. The authors have shown an application of surrogate modelling for replacing an Eulerian model for the Fukushima event.

Pal et al. (2019) developed a surrogate model of physics-based radiation model using deep neural networks to reduce the computational cost.

For 2D dose prediction in a Brazilian nuclear power plant, Desterro et al. (2020) utilised a Deep Rectifier Neural Network. The method was developed for immediate prediction up-to 1 h after the accident and considered five model inputs (wind velocity, wind direction, position x, position y and time after the accident started). The data samples for the investigation were generated by the simulation model and, consequently, the obtained model can be considered as a surrogate.

Gunawardena et al. (2021) proposed a data-driven surrogate model to predict the spatial deposition of radioactive materials from a nuclear power plant for a single radiological release over a wide area for a particular period of 48 h. A grid of linear regression and logistic models is used for the surrogate model with categorical variables from an NWP system as inputs.

Carnevale et al. (2012) used a grid of neural networks for the static mapping between precursor emissions in each cell and its neighbouring cells and PM10 pollutant in a domain of interest.

Bowman and Woods (2016) proposed a surrogate model of atmospheric dispersion for a Gaussian puff model. The surrogate model is composed of basis functions whose coefficients are modelled with a Gaussian-process model. They compare different modelling methods with the uncertainty quantification of coefficients. The surrogate model is a static model and does not take dynamics into account. The static model means that the surrogate model output at every time instant does not encounter the past or derivations of input and/or output variables.

Uncertainty quantification of spline coefficients is also the purpose of (Francom et al. 2019), which efficiently uses Bayesian adaptive splines for surrogate modelling to which they use categorical inputs, i.e. discrete values from a finite set of choices from a numerical weather prediction (NWP) system. The purpose of the developed surrogate model is the characterisation of atmospheric release. In particular, the authors model the weights of empirical orthogonal functions in space and time for plume emulation using the adaptive splines.

Mallet et al. (2018) built a meta-model that replicates the key aspects of the air quality model ADMS-Urban to simulate \(\hbox {NO}_2\) and PM10 emissions on an urban scale with street resolution and continuous emissions from emission sources. The original model lacks a temporal dimension in inputs and outputs and is static with low-dimensional inputs and high-dimensional outputs. The principal component analysis is used to condense the model’s outputs before multiple linear regression and Kriging interpolate them.

For anticipating atmospheric dispersion of methane (\(\hbox {CH}_4\)) in 2D over complicated terrain, Lauret et al. (2016) developed cellular automata paired with an artificial-neural-network model. A dynamic model based on wind field data processed with Computational-Fluid-Dynamics simulation is utilised to supply input data for the artificial neural network.

A surrogate model for the Computational-Fluid-Dynamics wind model is developed by Moonen and Allegrini (2015). The surrogate model was a Gaussian-process model applied in a case study representing an urban area.

A surrogate model of a Computational-Fluid-Dynamics model of pollution dispersion (Mendil et al. 2021) and recently (Mendil et al. 2022) is a deep-neural-network model that mimics the accidental release of a radioactive pollutant from multiple sources for up to 2 h ahead in an urban area.

There are other uses of surrogate modelling that prove the utility of the methodology. Each of the described cases is a bit different in the model’s purpose or different in the methods used, but with a common goal of reducing computational burden with the use of some approximation. Nevertheless, the listed surrogate models only utilise the information about the present and not about the past. In this sense, they are not dynamic models. One possible reason is that using information from previous time instants considerably increases the dimension of input space.

1.1 Problem statement

The problem at stake, different from those published, deals with modelling a dispersion of a pollutant with a single source of pollution over complex terrain. In Europe, major accidents involving dangerous chemicals are prevented and controlled through the Seveso Directive (European Commission 2020). Seveso-type industrial facilities and nuclear power plants have the potential for accidents with serious consequences, especially when the release occurs over complex terrain. When such an accident occurs, the consequences for the environment and the human population must be quickly forecasted. Air pollution dispersion in a complex environment is commonly modelled and simulated with the Lagrangian particle dispersion model (Girard et al. 2020), where the cost for its forecast accuracy is a heavy computational burden. The dispersion model is a dynamic model, which adds to its complexity.

The problem we tackle is to develop a surrogate model that will provide reliable forecasts of air pollution dispersion relatively quickly so that authorities will have enough time and information to act upon it. The surrogate model also should take into account the dynamics of the air pollution dispersion. The primary goal is to improve the computational speed of prediction to use the model for potential experimentation and long-range forecasting, i.e. forecasting for an extensive period of time, as long as information on weather variables or average weather variables is available. If the surrogate model is available for potentially important Seveso objects in advance, then in the event of a disaster, emergency services could use meteorological forecasts and calculate the dispersion forecasts quickly. The investigation is intended for the ground layer because that is where most people are exposed. The investigation can be done also for higher layers with the same methods and corresponding data. The ground layer is usually also the most complex layer due to the influence of terrain and land use.

1.2 Contribution

We propose two methods for modelling an input-output dynamic surrogate model of continuous point-source originating pollution over complex terrain based on meteorological variables obtained from an NWP system or other sources of weather information.

The contribution of the investigation is as follows:

  • A method for the modelling of a surrogate model of air pollution dispersion based on meteorological variables as inputs and 2D representation of relative pollution concentration at the output as a grid of independent dynamic models (GIM) for each output cell.

  • A method for the modelling of a surrogate model of air pollution dispersion based on meteorological variables as inputs and 2D representation of relative pollution concentration at the output as a reduced grid of dynamic models with interpolation of outputs (RGI).

  • A fast and applicable case-study demonstration of the listed methods on a simulated Seveso-type point release of a pollutant over complex terrain.

In addition to developing the surrogate model, the contributed methods also can be used for modelling a data-driven dispersion model if the necessary training data is available. Nevertheless, this is not the emphasis of the paper.

The emphasis of the paper is on developing a surrogate model that reduces the computational burden of model prediction and forecasting with acceptable accuracy. Note that this paper is not about the particular data-driven machine learning method nor is its focus exclusively on the point-accuracy of dispersion predictions. Different methods can be used in the proposed framework and the speed of computation is the focus of the study. The proposed framework can be used effectively for developing surrogate models for computationally demanding experimentations.

The structure of the paper is as follows. The following section describes the air pollution Lagrangian particle dispersion model at the selected location with complex terrain. Section 3 describes decision-tree models and the Gaussian-process-grid model that are used for solving the fast-dispersion prediction problem. Results are discussed in Sect. 4, and conclusions are gathered in Sect. 5.

2 Mathematical model and numerical simulation

The case study for demonstrating the development of a surrogate model for pollution dispersion is the Šoštanj thermal power plant. The pollution dispersion of this power plant frequently attracts attention and was also used for early modelling with artificial neural networks (Božnar et al. 1993).

A constant and continuous sulphur-dioxide (\(\hbox {SO}_2\)) pollution source emission of unit value was presumed, and which can be, when necessary, proportionally sized for real-life situations. The location of the Šoštanj power plant is at the edge of the Velenje Basin in Slovenia (Fig. 2). It is surrounded by the Alps to the north and north-west. The basin consists of narrow valleys with rivers flowing along them and, as such, represents highly complex terrain. Winds are stronger on elevated levels and weaker in the basin. A temperature inversion in winter and other circumstances additionally complicate the situation.

Fig. 2
figure 2

Relief view of the position of Šoštanj power plant (46\(^\circ\)22′26.56″ N, 15\(^\circ\)3′5.15″ E)

Pollution dispersion in such a complex terrain was successfully modelled by an air pollution Lagrangian particle dispersion model (Mlakar et al. 2015), which represents a suitable method to deal with the complexity of the terrain. The Lagrangian particle dispersion model was combined with a corresponding meteorological preprocessor able to reconstruct a three-dimensional diagnostic nondivergent wind field. In particular, the SPRAY Lagrangian particle dispersion model (Castelli et al. 2018), the MINERVE diagnostic mass consistent wind field model (Finardi et al. 1998) and the SURFPRO meteorological preprocessor (Finardi et al. 1997) were used for the dispersion modelling (Mlakar et al. 2015). Inputs in the Lagrangian particle dispersion model were weather variables that can be collected at different weather stations or weather-forecast variables, the digital model of the terrain heights, and land cover data of the region. Realistic information regarding land use was used in the Lagrangian particle dispersion model. The weather-forecast variables are valuable, especially in the case of major accidents involving dangerous chemicals, where we would like to forecast the direction and values of pollution dispersion.

For our demonstration, we did not use all the weather variables as described in (Mlakar et al. 2015), but have made a study for the simplified weather situation as follows. The temperature profile, wind velocity and direction were provided only at two altitudes, 10 m and 500 m, at the location of the thermal power plant in Šoštanj. The weather was presumed sunny with a clear sky. The output variable was the relative concentration of \(\hbox {SO}_2\), but any other chemical of interest could be used in the study. The relative concentration (s/\(\hbox {m}^3\)) is a ratio between absolute pollutant concentration in \(\mu\)g/\(\hbox {m}^3\) and the rate of emission (kg/s) (Mlakar et al. 2019). The relative concentration enables rescaling the results to any other form of pollutant emission. The region of interest was \(15 \times 15\) km described with \(100 \times 100\) square cells of \(150 \times 150\) m each. Consequently, we dealt with the situation in 10,000 cells.

What we were striving for was the computational aspect of the described model. Nevertheless, the matching between the Lagrangian particle air pollution dispersion model output and the real situation in the field played a noticeable role in our study, but we were aware that it can be improved a bit further with additional training data and modelling effort. The software was run on an i9 desktop computer with MS Windows operating system. The model at the designated computer calculated every half-hour response at approximately a few tens of seconds. While this was an expected and acceptable computational performance, it was too slow to serve for numerical experimentation, which would be necessary for the real-time forecasting of accidents and especially not for long-range forecast studies because the time of calculation increases linearly with the length of the forecast horizon.

We took the developed Lagrangian particle dispersion model combined with the corresponding meteorological preprocessor as a benchmark and tried to improve the computational performance at an acceptable loss of accuracy with a surrogate model.

3 Methods

3.1 Surrogate modelling

3.1.1 The modelling procedure

The brief procedure for developing a surrogate model in our case is as follows (Fig. 3).

  1. 1.

    The development of a Lagrangian particle dispersion model with the accuracy suitable for the model’s purpose.

  2. 2.

    Dataset generation with the Lagrangian particle dispersion model for surrogate modelling.

  3. 3.

    The selection of a model-development method for the surrogate model.

  4. 4.

    The selection of a surrogate-model structure (regressors, regression method, etc.).

  5. 5.

    Data-driven modelling of a large number of independent models, each for a cell of interest, or modelling of a smaller number of independent models and interpolation of their responses to the field of cells.

  6. 6.

    The prediction of the obtained surrogate model with data not used for modelling.

Fig. 3
figure 3

Flow chart for the development of a surrogate model

3.1.2 Constraints and assumptions

One of the most important steps in surrogate modelling is the generation of input samples that are intelligently distributed within the entire input space. This is usually done with optimal experimental design or with active learning (Breiman et al. 2017). In the case of atmospheric phenomena, the input variables utilised for modelling are commonly weather variables. This was also in our case. Values at the inputs were not obtained with a designed experiment and its implementation, but we used weather forecasts from a numerical weather simulator. The reason for this choice is that realistic combinations of values of input variables cannot fill all the subspaces in the input-variables space because all possible combinations are not natural. Therefore, lots of combinations never occur. Consequently, we use data from available weather sources and not optimal experimental design or active learning. This particular selection of input data means that the number of data is not optimal and a large amount of data is necessary to encompass the relevant information.

The dispersion provided by the Lagrangian particle dispersion module has a resolution of 10,000 cells, with each cell measuring \(150 \times 150\) m. This means that we deal with a system of 10,000 outputs, which introduces an identification problem with the excessive number of outputs. Such a problem can be tackled with dimension-reduction methods, e.g. (Girard et al. 2020). We decided to approach the problem differently and offer an alternative solution. The idea is to divide the model into a large number of submodels, as it was utilised in (Gunawardena et al. 2021; Carnevale et al. 2012), under the assumption that outputs do not influence each other. Nevertheless, the results described in the continuation showed that this working assumption provides some applicable results. Further research on alternative models and some output reduction methods applicable to this problem is envisaged for the future.

3.1.3 Performance metrics

Modelling performance was evaluated with two cost functions. The first is selected to evaluate time-dependent predictions of every submodel in the entire system in comparison with the original system. This evaluation is done with the standardised mean-squared error—SMSE (Le et al. 2019; Rasmussen and Williams 2006):

$$\begin{aligned} \mathrm {SMSE}=\frac{1}{N}\frac{\left\| \mathbf {y}-E(\hat{\mathbf {y}})\right\| ^2}{\sigma _\mathbf {y}^2}, \end{aligned}$$
(1)

where

  • \(\mathbf {y}\)–the vector of observations,

  • \(E(\hat{\mathbf {y}})\)–the mean value of estimations \(\hat{\mathbf {y}}\),

  • \(\sigma _\mathbf {y}^2\)–the variance of observations,

  • N–the number of observations.

SMSE is a frequently used standardised measure for the accuracy of predictions’ mean values with values between 0 and 1, where the value 0 is the result of a perfect model.

Pearson correlation coefficient \(\mathrm {R}\) and the coefficient of determination \(\mathrm {R}^2\) are also given for comparison. Pearson correlation coefficient is defined as

$$\begin{aligned} \mathrm {R} = \frac{\mathrm {cov}(\mathbf {y},E(\hat{\mathbf {y}}))}{\sigma _\mathbf {y}\sigma _{\hat{\mathbf {y}}}}, \end{aligned}$$
(2)

where \(\mathrm {cov}\) is covariance, and \(\sigma _{\hat{\mathbf {y}}}\) is the standard deviation of the mean values of estimations. The value is between − 1 and 1, and the more positive value is better.

The coefficient of determination is defined as

$$\begin{aligned} \mathrm {R}^2 = 1 - \frac{\left\| \mathbf {y}-E(\hat{\mathbf {y}})\right\| ^2}{N\sigma _\mathbf {y}^2}=1-\mathrm {SMSE}. \end{aligned}$$
(3)

The value is between 0 and 1, and a bigger value is better.

The presentation of pollution dispersion at ground level in our case is a two-dimensional field—an image of the dispersion. Consequently, the second selected cost function, the statistical coefficient of the space analysis, is the figure of merit in space—FMS (Mosca et al. 1998) also known as the Jaccard similarity coefficient

$$\begin{aligned} \mathrm {FMS}=\frac{A_1\bigcap A_2}{A_1\bigcup A_2}, \end{aligned}$$
(4)

where \(A_1\) and \(A_2\) represent the measured and predicted areas respectively. The FMS is calculated at each time instant, with a fixed-threshold concentration level that distinguishes two categories of concentration values. Therefore it does not validate concentration levels, but the coverage of pollution. Values of the FMS close to 1 correspond to good model performance. Low FMS does not necessarily correspond to a bad model performance due to the shift of pollution plumes. Therefore, the FMS value should be evaluated together with a graphical representation of the measured \(A_1\) and modelled areas \(A_2\).

Ensembles of decision trees and a Gaussian-process grid were used for the development of surrogate models for the selected case study.

3.2 Decision trees

The literature provides many different algorithms for learning decision trees, which can be classification or regression trees (Breiman et al. 2017). While classification trees have a categorical output that indicates belonging to a finite set of outputs, regression trees provide numeric responses. A binary model tree, whether it is of a classification or regression type, consists of a split node with a threshold test of a particular variable \(x_i\in \mathcal {X}\), where \(\mathcal {X}\) is a set of regressors. Given identification data \(D = {(\mathbf {x}, \mathbf {y})\mid x\in \mathbb {R}^p , y\in \mathbb {R}^r}\), where \(\mathbf {x}\) is the vector of regressors (inputs) and y is the output, a model tree partitions the input space \(\mathbb {R}^r\) into several partitions, called leaves. The split node creates a binary partition of the input space and has left and right offspring nodes. Such tree-like structures are frequently used for surrogate modelling (Alizadeh et al. 2020). They are robust, have the internal regressors’ selection mechanism, are computationally efficient, are interpretable to a certain extent, etc. The disadvantage is that piece-wise continuous estimates, as in the case of regression trees, may create certain inaccuracy, particularly for small trees.

The accuracy of regression trees can be improved with the use of tree ensembles (Mendes-Moreira et al. 2012; Aleksovski et al. 2016). Tree ensembles are created from several uncorrelated regression-tree models. A combination of the imperfect predictions obtained from each model tree should improve the prediction accuracy over a single tree and thus provide a more accurate model. Different principles exist to create ensembles. One of them is the bagging principle (Breiman 1996; Breiman et al. 2017). In bagging, bootstrap replicates are created, i.e. random samples with replacement of the training dataset D that have an equal number of data points as the training set. Each of the replicates \(D_i\) is used to build one model tree. The learning procedure starts by creating n bootstrap replicates of the training data D. Using each of the n data samples, a collection of model trees is built: \(m_1, m_2 , \ldots , m_n\). Denoting the predictions of the n single-output model trees of the ensemble with \(\hat{f}_i(x)\), the overall prediction from the model tree ensemble is the average of the base model predictions for one output variable:

$$\begin{aligned} \hat{y}(\mathbf {x}) = \frac{1}{n}\sum _{i=1}^n\hat{f}_i(\mathbf {x}). \end{aligned}$$
(5)

The rationale for selecting ensembles of bagged regression trees for modelling surrogate models was twofold: (1) learning of regression trees is faster than for other tested models and (2) ensembles improved the accuracy of predictions. Other selections might be viable as well.

3.3 Gaussian-process grid

Gaussian process modelling, also known as kriging, is a further method used in surrogate modelling (Alizadeh et al. 2020). Gaussian process models (Rasmussen and Williams 2006; Kocijan 2016) describe the input-output mapping of data \(f(\mathbf {x})\) from regression vector \(\mathbf {x}\) with a Gaussian process (GP). GP is a stochastic process containing random variables \(f(\mathbf {x}_i)\) with a normal probability distribution,

$$\begin{aligned} p(f (\mathbf {x}_1),\ldots , f (\mathbf {x}_N)\mid \mathbf {x}_1,\ldots , \mathbf {x}_N) ) = \mathcal {N} (\mathbf {m}, \mathbf {K}). \end{aligned}$$
(6)

The vectors \(\mathbf {x}_i\) are regressor vectors, f is the GP, \(\mathbf {m}\) is the mean vector and \(\mathbf {K}\) is the covariance matrix of the Gaussian distribution \(\mathcal N\). In GP modelling, we describe the GP with a mean function and a covariance function,

$$\begin{aligned} \mathbf {m}_i = m(\mathbf {x}_i),~~~\mathbf {K}_{ij} = C(\mathbf {x}_i, \mathbf {x}_j), \end{aligned}$$
(7)

where \(m(\mathbf {x}_i)\) is the mean function and \(C(\mathbf {x}_i, \mathbf {x}_j)\) is the covariance function. GP models are flexible function approximators, which can be used to represent complex structures with covariance kernels (Kocijan 2016).

When dealing with the grid structure of output data, grid-based covariance approximations can be utilised (Wilson and Nickisch 2015). This is the way to decrease computational costs in training and prediction. In the case of multidimensional inputs on a Cartesian grid, \(\mathbf {x}\in \mathcal {X}_1\times \cdots \times \mathcal {X}_p\), and a product kernel across the grid dimensions, \(C(\mathbf {x}_i, \mathbf {x}_j) = \prod _{k=1}^pC(\mathbf {x}_{ik}, \mathbf {x}_{jk})\), then the \(n\times n\) covariance matrix \(\mathbf {K}\) can be expressed as a Kronecker product \(\mathbf {K} = \mathbf {K}_1 \otimes \cdots \otimes \mathbf {K}_p\). The product of grid sizes in all the dimensions \(N = \prod _{i=1}^pn_p\) is a product of the number of points \(n_p\) per grid dimension. The method takes advantage of the computational properties of a grid-based covariance matrix via the structured kernel interpolation. The reader is referred to (Wilson and Nickisch 2015) for more details.

4 Results and discussion

4.1 Grid of independent models—GIM

4.1.1 Data

Data for training, validation and testing were obtained with the simulation of air pollution Lagrangian particle dispersion model for the Šoštanj thermal power plant. The data sequence contained three years of data (July 2018–July 2021) with a 30-min sampling time. The data sequence contained \(\hbox {SO}_2\) concentration as a system’s output signal and 7 meteorological variables as a system’s input signals, which were temperature, wind velocity and wind direction at heights 10 m and 500 m, the global solar IR radiation and the ground projection of the dispersion forecast for each time instant.

This set of more than 52,500 data pieces was divided into training, validation and test sets. Since we need as much data as possible for training to get as much information as possible, we ended dividing data into 51 subsets where one was immediately set aside as the test dataset (June 2021–July 2021), while the rest of the data was used for training and validation (July 2018–May 2021). The test data incorporated samples at the end of the completed dataset.

The data that was not used for the test was normalised, namely centred to have a mean of 0 and scaled to have a standard deviation of 1. The obtained mean values and variances were then also used for the normalisation of the test data.

4.1.2 Structure

Dispersion of air pollution is a dynamic system. It nonlinearly depends on meteorological and other environmental variables and not just on their present time values, but also the variable’s history. Therefore the surrogate models have to be dynamic models as well. A static model would be just an approximation of a dynamic one in one time instant. The purpose of the entire system of models is forecasting for extended time horizons, therefore long-range forecasting and the entire system of models has a large number of outputs. The scheme of GIM is shown in Fig. 4.

Fig. 4
figure 4

The scheme of GIM

Dynamics is incorporated in a model with regressors, which are the delayed values of outputs and inputs in the form of infinite-impulse-response or finite-impulse-response models (Nelles 2002).

We selected a finite-impulse-response (FIR) model structure for our submodels because an excessive number of outputs would make their feedback very inconvenient. These models have delayed values of inputs only (no delayed outputs) for regressors and the number of delays corresponds to the time in which impulse-excited response fades out. Since air pollution dispersion is a nonlinear process, we used a nonlinear-FIR (NFIR) model, in particular an ensemble model of bagged regression trees. Regression trees were selected due to the speed of training and their accuracy was improved using ensembles. Other data-driven modelling methods were tested, but all have performed worse than the selected one regarding the speed of training.

The next step in the structure selection is the selection of regressors. All seven available input signals were used as inputs, while outputs were values of \(\hbox {SO}_2\) at each cell. The number of input delays was selected with 4-fold cross-validation on the dataset of about 21,000 data points. The reason for not taking a larger amount of data is that we wanted to keep a reasonable computational time. The training and validation sets were large enough to make the obtained results instrumental also for larger datasets even though the values of cost functions for cross-validation are worse than in the case when a larger dataset is used.

The results of 4-fold cross-validation for delay-selection are given in “Appendix A”. Note that we have validated the same amount of delays on all inputs, without eliminating particular uninformative regressors due to simplicity. Therefore, if the maximal delay is, e.g. 3, this means that we have 21 regressors, 7 with delay 1, 7 with delay 2, and 7 with delay 3 for the prediction of the value at the moment of delay 0, i.e. at present moment. The cross-validation study was pursued for the delay interval between 2 and 6, where the minimum was located.

The best SMSE and FMS results were obtained with the delay of 4 samples, which corresponds to delays up to 2 h. This can be interpreted that a transient of 2 h encompasses the most information about the pollution dispersion.

The number of observations per tree leaf(or partition) was selected next. Using the already selected model parameters, a 4-fold cross-validation study was done to determine the optimal number of observations. The obtained results are given in “Appendix A”. Averages of the SMSE measure are shown in Fig. 5. SMSE was used for the evaluation because we are evaluating the prediction ability of each independent submodel in the GIM model.

Fig. 5
figure 5

The dependency of the predictions’ SMSE from the number of observations per tree leaf

It is clear from Fig. 5 that the prediction-quality index is the lowest when 10 observations per tree leaf are used.

The next selection concerns the structure of ensembles, in particular the number of regression models within the ensemble. Using the already selected model parameters, a 4-fold cross-validation study was done to determine the optimal number of models. The obtained results are given in “Appendix A” and the average SMSE measure is shown in Fig. 6. SMSE was used for the evaluation because we are again evaluating the prediction ability of each independent submodel in the GIM model.

Fig. 6
figure 6

The dependency of the predictions’ SMSE from the number of models in the ensemble

It is clear from Fig. 6 that the quality of the model increases with the increasing number of models in ensembles. However, the computation burden also increases with the number of models. Consequently, we selected the corresponding number of models, where the change of SMSE is the biggest. This can be seen in Fig. 6 as a ‘knee’ of depicted function. Having in mind the computational burden, 60 models were selected as an acceptable number of models in ensembles.

The final structure was therefore as follows:

  • NFIR model structure,

  • 7 signals as inputs, each delayed up to 4 time steps, i.e. 2 h, which results in 28 regressors,

  • ensembles of regression trees composed of 60 models with 10 observations per leaf.

4.1.3 Test results

The complete set of data excluding the test data was used for training. The Statistics and Machine Learning Toolbox of Matlab was used for the training and testing of the obtained models.

Two examples of the obtained images for two different weather situations contained in the test data are given in Figs. 7 and 8. The complete set of test-data responses can be seen in the video (Online Resources 1 and 2—(Kocijan et al. 2022)). The visual matching of the predictions of the grid of independent submodels and the original Lagrangian particle dispersion model is relatively good. However, it depends on the purpose of the surrogate model which level of accuracy and details are important.

While the direct comparison of the computation time was not possible because the Lagrangian particle dispersion model was run on a designated computer (Intel Core i9 10900 @ 5.60 GHz, 32 GB RAM) and the surrogate model was run on another computer (Intel Core i7 8700HQ CPU @ 3.70GHz, 32 GB RAM), the comparison can be done only qualitatively. The purpose of the surrogate model was to make long-range predictions. For the prediction of approximately 1000 data samples, the original dispersion model at the designated computer took about 35,000 s, while the surrogate model took about 300 s. The computing load increased with the number of predictions linearly. This very rough comparison provides an indicator that the prediction with the surrogate model is much faster than with the Lagrangian particle dispersion model.

Fig. 7
figure 7

An example of a weather situation with a strong wind from test data. Original-model response in the left figure, GIM surrogate-model response in the right figure. The scale is identical for both figures, values over the maximum value of scale are drawn in magenta colour, the maximum value in the left figure is \(7.530\cdot 10^{-7}\) s/\(\hbox {m}^3\) and in the right figure is \(4.190\cdot 10^{-7}\) s/\(\hbox {m}^3\)

Fig. 8
figure 8

An example of a weather situation with a weak wind from test data. Original-model response in the left figure, GIM surrogate-model response in the right figure. The scale is identical for both figures, values over the maximum value of scale are drawn in magenta colour, the maximum value in the left figure is \(4.310\cdot 10^{-7}\) s/\(\hbox {m}^3\) and in the right figure is \(3.830\cdot 10^{-7}\) s/\(\hbox {m}^3\)

Regarding the accuracy of the model, we can compare the SMSE of the predictions on the test set for each of the submodels, that is for each of the 10,000 cells. These are graphically shown in Fig. 9 as an image. Differences in the quality of predictions can be observed in the figure, which is mainly an indicator of the different information content of inputs and outputs concerning cells of the two-dimensional representation. The average SMSE over the independent models is 0.5167, while \(\hbox {R}^2\) = 0.4833 and R = 0.6952.

Fig. 9
figure 9

The planar distribution of model-responses SMSE values on the test data, where the value 0 is the result of a perfect model

How the predictions of dispersion in each time instant cover the original model’s predictions of dispersion is given with FMS values at each time instant of the test data sequence. The FMS values are graphically shown in Fig. 10. The average FMS over time is 0.595.

Fig. 10
figure 10

FMS of the surrogate model for each time instant on the test data

Even though the GIM accelerates the prediction in comparison with the Lagrangian particle dispersion model, the following section describes an alternative in which a lower number of independent submodels is calculated and predictions in each cell are not independent anymore. This can be achieved with the RGI.

4.2 Reduced grid with interpolation of outputs—RGI

RGI is intended as the structure for the acceleration of computation in the comparison with GIM. The same data was used for modelling and testing and with the same division as in the case of GIM.

4.2.1 Structure

The entire model is composed of two parts: GIM, but not for the complete grid, and the Gaussian-process grid for filling up the gaps in the grid of predictions. The scheme of RGI is shown in Fig. 11.

Fig. 11
figure 11

The scheme of RGI

The structure of independent cell models was kept as it was for the complete GIM. We selected every third cell to be modelled, which reduced the number of cells to be modelled on \(34\times 34 =1156\) cells. This is a considerable reduction. In our case, the cells to be modelled with independent models are distributed uniformly, but this is not necessary.

Predictions of available independent cell models were inputs into the GP grid. The GP grid uses covariance functions for the calculation of the grid-based covariance matrix. The individual covariance functions in our case were two squared exponential covariance functions with isotropic distance measures, each of them as

$$\begin{aligned} C(\mathbf{x}_i, \mathbf{x}_j) = \sigma _f^2 \exp \left( -\frac{r^2}{2l^2}\right) . \end{aligned}$$
(8)

The hyperparameter \(\sigma _f^2\) represents the scaling factor of the possible variations of the function or the vertical scaling factor and the hyperparameter l is called the horizontal scaling factor and determines the relative weight on distance for the input variable \(\mathbf {x}\). The variable r is the input-distance measure and is \(r=\mid \mathbf{x}_i -\mathbf{x}_j\mid\), where \(\mathbf{x}\) is a regression vector for the GP grid. In RGI’s case, regressors of the GP grid are the cell’s indices.

The Kronecker covariance matrix (Wilson and Nickisch 2015) is calculated from two covariance matrices, each for one input dimension.

Hyperparameters \(\sigma _f^2\) and l were optimised for the best grid performance.

4.2.2 Test results

The predictions for two different weather situations are provided in Figs. 12 and 13.

Fig. 12
figure 12

An example of a weather situation with a strong wind from test data. Original-model response in the left figure, RGI surrogate-model response in the right figure. The scale is identical for both figures, values over the maximum value of scale are drawn in magenta colour, the maximum value in the left figure is \(7.530\cdot 10^{-7}\) s/\(\hbox {m}^3\) and in the right figure is \(3.070\cdot 10^{-7}\) s/\(\hbox {m}^3\)

Fig. 13
figure 13

An example of a weather situation with a weak wind from test data. Original-model response in the left figure, RGI surrogate-model response in the right figure. The scale is identical for both figures, values over the maximum value of scale are drawn in magenta colour, the maximum value in the left figure is \(4.310\cdot 10^{-7}\) s/\(\hbox {m}^3\) and in the right figure is \(2.980\cdot 10^{-7}\) s/\(\hbox {m}^3\)

SMSEs of the predictions on the test set for all models in the GIM model combined into an image is shown in Fig. 14 and FMS values at each time instant of the test data sequence are graphically shown in Fig. 15. The average SMSE over predictions on the test data and all cells is 0.5283, while \(\hbox {R}^2\) = 0.4717 and R = 0.6868 and the average FMS over time is 0.599, which are both around 1 % difference in comparison with the GIM and consequently provides similar graphs.

Fig. 14
figure 14

The planar distribution of model-responses SMSE values on the test data, where the value 0 is the result of a perfect model

Fig. 15
figure 15

FMS of the surrogate model for each time instant on the test data

It is clear from Figs. 12, 13, 14 and 15 that the accuracy of RGI predictions is close, but not equal to those of GIM. Especially, maximum values are not that well predicted. However, the computation of predictions for the test data, approximately 1000 data samples, takes about one-third of the time with the combined model. The complete set of responses can be seen in the video (Online Resources 1 and 2—(Kocijan et al. 2022)).

4.3 Comparison of responses

A different aspect to the matching between the responses of the original model, GIM model and RGI model is provided with some additional processing. Figure 16 depicts the average relative concentrations over the domain calculated for the period of time of the test dataset. The comparison of responses of the original dispersion system and its surrogate model shows a good match in regards to both location and value. A very good match of maximum values of averages can also be observed. The response of the GIM surrogate model matches some of the details of the original system better than the response of the RGI surrogate model.

Fig. 16
figure 16

Average relative concentrations over domain calculated for the validation dataset period June 9, 2021–June 30, 2021; the scale is identical for all figures; values over the maximum value of scale are drawn in magenta colour; the maximum value in the left figure is \(1.004\cdot 10^{-7}\) s/\(\hbox {m}^3\), in the middle figure is \(0.8470\cdot 10^{-7}\) s/\(\hbox {m}^3\) and in the right figure is \(0.7273\cdot 10^{-7}\) s/\(\hbox {m}^3\)

Figure 17 shows maximum relative concentrations over the domain calculated for the period of time of the test dataset. The comparison of responses of the original dispersion system and its surrogate model shows, like in Fig. 16, a good match location-wise as well as value-wise. The same is with the maximum values of maximum relative concentrations. The match of maximum values of surrogate models is a bit worse than the match of average relative concentrations. Nevertheless, these deviations are expected, because the prediction of maximum values is a difficult task also in the original system itself.

Fig. 17
figure 17

Maximal relative concentrations over domain calculated for the validation dataset period June 9, 2021–June 30, 2021; scale is identical for all figures; values over the maximum value of scale are drawn in magenta colour; the maximum value in the left figure is \(1.098\cdot 10^{-5}\) s/\(\hbox {m}^3\), in the middle figure is \(0.4450\cdot 10^{-5}\) s/\(\hbox {m}^3\) and in the right figure is \(0.3430\cdot 10^{-5}\) s/\(\hbox {m}^3\)

Figure 18 shows the 95th percentile of relative concentrations over the domain calculated for the time period of the test dataset. The match of responses and maximums of the 95th percentile of relative concentrations among models are good. The match is not as good as the other two comparisons, but still acceptable for our purpose. The response of the GIM surrogate model again matches some of the details of the original system better than the response of the RGI surrogate model.

Fig. 18
figure 18

The 95th percentile of relative concentrations over domain calculated for the validation dataset period June 9, 2021–June 30, 2021; scale is identical for all figures; values over the maximum value of scale are drawn in magenta colour; the maximum value in the left figure is \(4.415\cdot 10^{-7}\) s/\(\hbox {m}^3\), in the middle figure is \(3.200\cdot 10^{-7}\) s/\(\hbox {m}^3\) and in the right figure is \(2.560\cdot 10^{-7}\) s/\(\hbox {m}^3\)

The 95th percentile of relative concentrations represents the top 5 % response values. These are predicted well for the Seveso-type event in space and time. The comparison of maximum values in Fig. 17 confirms a good match based on previous evaluation investigations (Grašič et al. 2011) in the field of pollution dispersion. This all confirms that the developed surrogate models fulfilled the purpose of replacing the original dispersion system in the Seveso-type simulations at a lower computational cost.

The acceptable accuracy of a surrogate model depends on the purpose of the model. The modeller has to put enough effort into the modelling to achieve the acceptable accuracy. Moreover, if more data or data with better information content is used, a better model can be obtained.

One has to keep in mind that the training of models takes a considerably longer time than the prediction, but the time is still reasonable. The computational load of the model training increases nonlinearly with the number of training data, but the increase in computational load for surrogate-model predictions is linear with the number of test data and not considerable in comparison with the prediction time of the original Lagrangian particles model.

5 Conclusion

The objective of our investigation was to develop a surrogate model that will replace the air pollution Lagrangian particle dispersion model for computationally intensive applications like computer experimentations at a considerably lower computational time. We proposed two methods, i.e. GIM and RGI, that considerably raise the potential for numerical experimentation. The obtained surrogate models can be used for computer experimentation like long-range predictions, simulations, parameter optimisation, etc., which would be very time-consuming with the air pollution Lagrangian particle dispersion model.

The accuracy of the surrogate model depends on the amount of training data used and its information content. The computational load for surrogate model development increases with the number of training data. However, the increase in computational load for predictions is linear for the proposed models and not considerable in comparison with the original model.

Other studies using surrogate models have not tackled the problem of air pollution dispersion in the same way as the present one. Two alternative methods that represent a solution of surrogate modelling for air-pollution dispersion based on dynamic models were demonstrated. While the idea of using a grid of models has been used before in somewhat different contexts, the use of dynamic GIM and RGI is novel for the dispersion-modelling problem of interest. Moreover, the investigation demonstrated the utility of surrogate modelling in the modelling of air pollution dispersion over complex terrain.

Further studies exploring more complex models, using different model structures, output-reduction methods and different kinds of dispersion problems are envisaged in future research.