1 Introduction

In recent years, China's rapid urbanization and industrialization have brought rapid economic development, but air pollution in most Chinese cities has become increasingly serious (Huang and Hu 2018). Air pollution directly degrades environmental quality and seriously harms people's physical and mental health. PM2.5 is one of the main air pollutants and is mainly composed of highly reactive toxic and harmful substances. A large number of clinical cases and related studies have shown that the occurrence of various respiratory and cardiovascular diseases is correlated with high concentrations of PM2.5. PM2.5 comes not only from natural sources such as wind-blown sand and dust and forest fires, but also, and mainly, from human energy combustion and industrial production (Samal et al. 2021). As China's economy continues to develop, air pollution is in some sense unavoidable. However, this does not mean that pollution emissions cannot be effectively prevented and controlled.

In order to strengthen air pollution control and improve air quality, China has amended the Air Pollution Control Law (Tao et al. 2013). In addition, accurate prediction of PM2.5 concentrations is listed as one of the key objectives of air pollution prevention in China's Action Plan for the Prevention and Control of Air Pollution, which was proposed in 2018. Accurate PM2.5 concentration prediction can help people understand future changes in air quality, so that they can prepare protective measures in advance to protect their health, such as wearing anti-haze masks (Zhu et al. 2018). It can also help researchers to develop response strategies in advance to prevent further deterioration of air quality (Liu et al. 2018). Therefore, the accurate prediction and early warning of PM2.5 concentration has become a hot issue in the field of air pollution management research (Wu et al. 2016).

Existing prediction models for air pollutant concentrations can be broadly classified into four types: chemical transport models (CMT), statistical models, artificial intelligence (AI) techniques, and hybrid models. CMT makes deterministic predictions based on the sources and transport of chemical substances (Xu et al. 2021; Shin et al. 2021). However, the prediction accuracy of CMT depends on an accurate description of the physical–chemical processes of pollutants and on the quality of emission data (Konovalov et al. 2009). As a result, CMT is more time consuming and complex than statistical models, and its accuracy is not stable (Han et al. 2008). Common statistical models include multiple linear regression and the autoregressive integrated moving average (ARIMA) model. Donnelly et al. (2015) constructed a real-time air quality prediction model using multiple linear regression. García et al. (2018) built an ARIMA model to predict daily PM10 concentrations in northern Spain with good prediction accuracy. Zhang et al. (2018) used ARIMA to analyze the trend of PM2.5 concentrations and found a significant positive correlation with the changes in PM10, SO2 and NO2 concentrations. Although statistical models can produce valid predictions, they rest on a set of statistical assumptions, which limits their ability to capture nonlinear features in time series (Li et al. 2021a, b, c). To overcome these limitations, AI models have been applied to time series forecasting.

Data-driven AI techniques have excellent nonlinear fitting ability and robustness, so researchers have applied them widely to air pollutant prediction (Ren et al. 2021). Common AI models include artificial neural networks (ANN) (Ogliari et al. 2021; Zhang et al. 1998), generalized regression neural networks (GRNN) (Li et al. 2013), and recursive neural networks (Biancofiore et al. 2017). Feng et al. (2015) used air mass trajectory analysis to improve the accuracy of ANN predictions of daily average PM2.5 concentrations. Combining ANN with effective training algorithms makes it possible to extract latent nonlinear relationships between variables, and it has been demonstrated that a fast and economical air pollution warning system can be built with neural networks (Bo et al. 2021). Biancofiore et al. (2017) used measured meteorological parameters as input variables of recursive neural networks and predicted PM10 concentrations one to three days ahead. Yan et al. (2021) used GRNN to predict PM2.5 concentration levels in three urban clusters in China, and the results showed that GRNN could predict them accurately. Time series prediction, however, relies on data collected over a period of time: if only the latest PM2.5 concentration data are used for prediction, the information in past data is lost. Unlike traditional neural networks, which ignore the long-term dependence of time series, recurrent neural networks (RNN) are able to retain a memory of recent information, which gives them excellent performance in processing time series data (Wang et al. 2021). The long short-term memory neural network (LSTM), a variant of RNN, has long-term memory capability and alleviates the long-term dependence and gradient explosion problems of RNN (Ahmed et al. 2021). Bai et al. (2019a, b) used LSTM to forecast PM2.5 concentrations at two Beijing meteorological stations. The results demonstrate that LSTM can effectively capture complex features in nonlinear sequences. Although AI models have some advantages in air pollution prediction, single AI models still suffer from problems such as unstable prediction results and a tendency to overfit.

In order to further improve the performance and stability of prediction models, researchers have developed various hybrid models by integrating different techniques and methods. Among them, hybrid models based on the idea of decomposition and integration can effectively deal with nonlinear and nonstationary time series and have excellent prediction performance, which has made them one of the hot spots in time series forecasting. In this approach, the nonlinear time series is first decomposed into several smoother subseries. Then, a suitable prediction model is constructed according to the characteristics of each decomposed subseries, and finally the individual predictions are integrated. The decomposition and integration method can effectively improve the prediction accuracy and stability for nonlinear and nonstationary time series.

As an important module of decomposition-integration models, time series decomposition methods can extract more meaningful information and reduce the difficulty of prediction (Sun et al. 2022). Wavelet analysis is considered to be an effective algorithm for decomposing time series (Kisi and Alizamir 2018). Huang and Wang (2018) used db6 wavelets and wavelet neural networks (WNN) to forecast four energy market prices and showed experimentally that the hybrid model has higher accuracy. Nourani and Farboudfamm (2019) combined sym3 wavelet decomposition of rainfall time series with LSSVM and ANN models. However, when performing wavelet analysis the researcher has to select a suitable wavelet basis function subjectively, without a specific theoretical basis. Empirical mode decomposition (EMD) is a data-driven adaptive method capable of decomposing nonlinear and nonstationary signals. Huang et al. (2012) constructed a decomposition integration framework based on EMD and gated recurrent unit neural network (GRU) for PM2.5 prediction. Because it lacks a complete theoretical basis, the EMD algorithm suffers from problems such as modal mixing and endpoint effects (Li et al. 2021a, b, c). To overcome the drawbacks of EMD, Wu and Huang (2009) proposed ensemble empirical mode decomposition (EEMD). Bai et al. (2019a, b) applied EEMD to PM2.5 concentration prediction and improved the prediction accuracy. Although increasing the number of EEMD ensemble members can reduce the reconstruction error, the reconstructed components still contain residual noise of some magnitude. To extract more effective features, Guo et al. (2020) applied a combination of complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN) and LSTM to chaotic sequence prediction. Lin et al. (2021) used the CEEMDAN-LSTM model to forecast the Chinese stock index, which proved to be the best among developed and emerging stock markets. CEEMDAN achieves almost zero reconstruction error by adaptively adding and attenuating the white noise, which allows it to extract more effective information. However, multi-scale decomposition of a nonlinear time series with a signal decomposition algorithm yields a whole set of subseries. Previous studies usually predict each subseries separately, but this increases the computational complexity of the model and may cause prediction errors to accumulate. The similar trends and complexity shared by different subseries are often overlooked, and the effective treatment of decomposed subseries still needs further research.

The hybrid model based on the decomposition integration framework has been successfully applied to the prediction of various pollutant concentrations. However, PM2.5 concentration variation is affected by complex meteorological and environmental factors (Hu et al. 2013), such as topography, vegetation, wind speed and temperature (Zhu et al. 2017). Training data consisting only of the historical time series of PM2.5 concentrations cannot provide enough valid information, which limits the prediction accuracy and generalization performance of the model (Wang et al. 2015). Related studies have shown that meteorological factors and pollutants have a strong influence on PM2.5 fluctuations. Yoo et al. (2014) analyzed Korean PM2.5 data between 2000 and 2012 and found a significant negative correlation between atmospheric precipitation and PM2.5. Ma et al. (2021) found that factors such as precipitation, temperature, and wind direction can affect the concentration and dispersion range of PM2.5. Bai et al. (2019a, b) concluded that meteorological data had a significant seasonal effect on PM2.5 and used Kendall correlation analysis to extract the relationship between meteorological factors and PM2.5. Gu et al. (2020) collected meteorological and pollutant data and divided PM2.5 concentration data into environmental factors, temporal factors, and selected samples to construct a new superposition selective integration support predictor to achieve effective prediction of PM2.5. However, different influencing factors may be redundant or similar, and introducing all of them directly into the prediction model may cause errors to accumulate and reduce prediction accuracy (Feng et al. 2021). Therefore, selecting appropriate influencing factors for prediction remains challenging and needs further research.

From the above literature review, it can be found that although the hybrid model has excellent predictive performance and robustness, it still has some drawbacks. First, the similar complexity between decomposed subsequences is often ignored by researchers. Modeling each subseries separately not only increases the computational complexity of the model, but also may lead to the accumulation of errors making the prediction accuracy lower. Second, many previous studies often use historical time series data of PM2.5 for modeling, ignoring the influence of complex influencing factors on PM2.5 fluctuation trends, which limits the prediction performance of the models. Third, deep learning models, as the main prediction models, are very sensitive to the selection of their hyperparameters. Different subsequences have different data characteristics, and choosing appropriate hyperparameters to model them can effectively improve the accuracy and stability of prediction. However, in previous studies, the selection of hyperparameters relied on empirical selection or repeated debugging, which made it difficult to determine the optimal hyperparameters. Fourth, after obtaining the prediction results for each subsequence, the present integration methods are mainly limited to linear integration, i.e., the predicted values are accumulated to obtain the final prediction results. Due to the errors in the prediction process, the linear integration method is not applicable to all cases and may lead to a decrease in prediction accuracy and stability. The nonlinear integration method can explore the intrinsic features among subseries to further improve the prediction accuracy.

Based on the above considerations, this study proposes a multi-factor multi-scale and intelligent optimization based two-stage deep learning hybrid framework for air pollutant forecasting and early warning, including CEEMDAN, fuzzy entropy (FE), the max-relevance and min-redundancy (mRMR), Gray Wolf Optimization algorithm (GWO) and LSTM. First, the PM2.5 concentration sequence is decomposed into several subseries using CEEMDAN to reduce the complexity of the sequence and make it smoother. Then, each subsequence is reconstructed into several new components based on its FE value, which reduces the complexity of the model and improves the computational efficiency. Then, the mRMR algorithm is used to select several exogenous variables for each reconstructed component that have a large impact on it for prediction. Next, a two-stage intelligent optimization prediction model based on GWO algorithm and LSTM is developed to predict and nonlinearly integrate the reconstructed components to obtain the final PM2.5 concentration prediction results. Finally, based on the accurate PM2.5 concentration prediction results, an effective air pollution warning is achieved. In this paper, historical data of PM2.5 concentrations in three Chinese cities are selected to validate the proposed hybrid framework. Compared with other benchmark models, the proposed model has good performance and prediction accuracy.

As shown above, the main contributions and innovations of this paper are as follows:

(1) Considering the similar trend and complexity between different decomposition patterns, this paper develops a novel feature extraction method combining CEEMDAN and FE to effectively decompose PM2.5 concentration sequences and extract different types of components from them, which improves the computational efficiency and accuracy of the prediction model.

(2) Most previous studies have focused on prediction models based on historical PM2.5 concentration time series. This paper develops an mRMR-based feature selection method that uses PM2.5 data and multi-influence factor data as input features in the modeling process to construct a multi-influence factor-based hybrid prediction framework.

(3) In order to further improve the prediction performance and stability of the neural network, this paper uses GWO to intelligently seek the optimal hyperparameters of the LSTM. Based on GWO-LSTM, a two-stage intelligent optimization model is developed to model and predict each reconstruction component separately and integrate the predictions nonlinearly.

(4) In this paper, a two-stage deep learning hybrid prediction framework based on multi-factor multi-scale and intelligent optimization is proposed for the first time. The hybrid framework outperforms all comparative models and has good prediction accuracy and stability. Based on this hybrid prediction framework, an air pollution prediction and early warning system is established to achieve effective forecasting and warning of future air pollutant concentrations and air pollution conditions.

The rest of the paper is structured as follows: Sect. 2 outlines the methods used in this paper. Section 3 details the structure of the hybrid prediction framework and the evaluation metrics of prediction performance. Section 4 describes the data preprocessing, forecasting process and comparative experiments. Finally, Sect. 5 shows conclusions and outlook.

2 Methodology

2.1 Complete ensemble empirical mode decomposition with adaptive noise (CEEMDAN)

Huang et al. (1998) proposed EMD, an adaptive signal processing method for nonlinear and nonstationary data. EMD requires no prior assumptions about the data and can decompose complex nonstationary signals while preserving the time scale of the data. In practical applications, however, EMD often encounters the problem of modal mixing (Sun et al. 2022). Wu and Huang (2009) therefore proposed EEMD based on EMD. EEMD repeatedly adds different realizations of white noise to the original signal before decomposition and relies on the zero mean of white noise to cancel the added noise in the ensemble average. This effectively alleviates modal mixing, but EEMD leaves a large reconstruction error and requires a long computation time. For this reason, the EEMD-based CEEMDAN was proposed to solve these problems. By adding adaptive white noise, CEEMDAN not only overcomes the problem of modal mixing effectively but also removes the reconstruction error and reduces the computing cost (Li et al. 2021a, b, c). Therefore, CEEMDAN can handle nonstationary and nonlinear data more effectively.

The CEEMDAN algorithm is implemented in the following steps.

Step 1: Gaussian white noise with zero mean is first added to the original signal \(s\left( t \right)\) to obtain \(k\) noisy realizations \(s_{i} (t)\).

$$s_{i} (t) = \varepsilon \omega_{i} (t) + s(t), \, i = 1,2,...k$$
(1)

where \(\omega_{i} (t)\) is the Gaussian white noise of the ith realization and \(\varepsilon\) is the noise factor that scales the added noise. The first intrinsic mode function (IMF) component \(IMF_{1}^{i} (t)\) of each \(s_{i} (t)\) is then obtained by EMD, and the mean of these components is taken as the first IMF component \(IMF_{1} (t)\) of the CEEMDAN decomposition.

$$IMF_{1} (t) = \frac{1}{k}\sum\limits_{i = 1}^{k} {IMF_{1}^{i} (t)}$$
(2)

Step 2: To calculate the first residual \(r_{1} (t)\), subtract the first IMF from the initial sequence.

$$r_{1} (t) = s(t) - IMF_{1} (t)$$
(3)

Step 3: Gaussian white noise is added to the residual signal of stage \(j - 1\), and the EMD decomposition is continued to obtain the jth IMF and the jth residual.

$$IMF_{j} (t) = \frac{1}{k}\sum\limits_{i = 1}^{k} {E_{1} (r_{j - 1} (t) + \varepsilon_{j - 1} E_{j - 1} (\omega_{i} (t)))}$$
(4)
$$r_{j} (t) = r_{j - 1} (t) - IMF_{j} (t)$$
(5)

where \(IMF_{j} (t)\) is the jth IMF obtained from the CEEMDAN decomposition, \(E_{j - 1} ( \cdot )\) is the operator that extracts the \((j - 1)\)th IMF component by EMD, \(\omega_{i} (t)\) is the Gaussian white noise of Eq. (1), and \(\varepsilon_{j - 1}\) is the noise factor added to the residual component of stage \(j - 1\). Finally, \(r_{j} (t)\) is the residual component of the jth stage.

Step 4: Step 3 is repeated until the residual component has no more than two extreme points and can no longer be decomposed, at which point the CEEMDAN decomposition ends. The PM2.5 concentration sequence has then been decomposed into several IMF components and one residual component.
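As a quick consistency check on the steps above, Eqs. (3) and (5) can be applied recursively. Writing \(J\) for the number of IMFs obtained (a symbol introduced here only for illustration), the residuals telescope, so the original signal is recovered from the IMFs and the final residual:

$$s(t) = IMF_{1} (t) + r_{1} (t) = IMF_{1} (t) + IMF_{2} (t) + r_{2} (t) = \cdots = \sum\limits_{j = 1}^{J} {IMF_{j} (t)} + r_{J} (t)$$

This identity follows directly from the way the residuals are defined, which is consistent with the near-zero reconstruction error attributed to CEEMDAN above.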

2.2 Fuzzy entropy (FE)

To strike a balance between the computational efficiency of the model and the accuracy of the prediction, fuzzy entropy (FE) is adopted to measure the complexity of time series. The FE algorithm improves on the sample entropy (SE) and approximate entropy (AE) methods: it retains their advantages and addresses their imprecision in the presence of small fluctuations and baseline drift. In general, the larger the value of FE, the lower the serial autocorrelation. Therefore, the features can be recombined according to the values calculated by the FE algorithm to balance computational efficiency and prediction accuracy. FE is calculated as follows:

Step 1: With the phase space dimension defined as \(m\), phase space reconstruction is performed on the time series \(\{ x(1),x(2),...,x(N)\}\) of length \(N\) to obtain \(X_{i}^{m}\), where \(w_{0} (i)\) is the mean value of the \(m\) elements of the ith segment.

$$X_{i}^{m} = \{ x(i), \, x(i + 1), \, ... \, , \, x(i + m - 1)\} - w_{0} (i)$$
(6)
$$w_{0} (i) = \frac{1}{m}\sum\limits_{j = 0}^{m - 1} {x(i + j)}$$
(7)

Step 2: Define the distance \(d_{ij}^{m}\) between the vectors \(X_{i}^{m}\) and \(X_{j}^{m}\) as the maximum absolute difference of their corresponding elements, where \(j = 1,2,...,N - m + 1\), and \(j \ne i\).

$$d_{ij}^{m} = d[X_{i}^{m} ,X_{j}^{m} ] = \mathop {\max }\limits_{k = 0,1,...,m - 1} |(x(i + k) - w_{0} (i)) - (x(j + k) - w_{0} (j))|$$
(8)

Step 3: Next, a fuzzy function \(F(d_{ij}^{m} ,n,r)\) is introduced to define the correlation \(D_{ij}^{m}\) between \(X_{i}^{m}\) and \(X_{j}^{m}\). In Eq. (9), \(r\) denotes the boundary width and \(n\) denotes the boundary gradient.

$$D_{ij}^{m} = F(d_{ij}^{m} ,n,r) = \exp \left( { - \left( {\frac{{d_{ij}^{m} }}{r}} \right)^{n} } \right)$$
(9)

Step 4: Define the fuzzy degree similarity function as:

$$\delta^{m} (n,r) = \frac{1}{N - m}\sum\limits_{i = 1}^{N - m} {\left( {\frac{1}{N - m - 1}\sum\limits_{j = 1,j \ne i}^{N - m} {D_{ij}^{m} } } \right)}$$
(10)

Step 5: Change the phase space dimension to \(m + 1\) and repeat the above calculation steps to obtain the function.

$$\delta^{m + 1} (n,r) = \frac{1}{N - m}\sum\limits_{i = 1}^{N - m} {\left( {\frac{1}{N - m - 1}\sum\limits_{j = 1,j \ne i}^{N - m} {D_{ij}^{m + 1} } } \right)}$$
(11)

Step 6: Ultimately, the fuzzy entropy of this time series is defined as:

$$FuzzyEn(N,m,n,r) = \ln \delta^{m} (n,r) - \ln \delta^{m + 1} (n,r)$$
(12)

The results of FE algorithm are mainly determined by the parameters \(m\), \(n\) and \(r\). In general, \(m\) is often taken as 1 or 2. \(r\) is usually set to \(0.1\sigma_{{{\text{SD}}}}\) to \(0.25\sigma_{{{\text{SD}}}}\), and the \(\sigma_{{{\text{SD}}}}\) is the standard deviation of the original sequence. \(n\) is generally taken as a smaller integer value, such as 1 or 2.
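The six steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in this paper: the function name `fuzzy_entropy`, the default \(r = 0.2\sigma_{{{\text{SD}}}}\), and the two test signals are assumptions made for the example.

```python
import math
import random


def fuzzy_entropy(x, m=2, n=2, r=None):
    """Fuzzy entropy of a sequence, following Eqs. (6)-(12)."""
    N = len(x)
    if r is None:
        # Assumed default: r = 0.2 * standard deviation of the sequence.
        mean = sum(x) / N
        r = 0.2 * math.sqrt(sum((v - mean) ** 2 for v in x) / N)

    def similarity(dim):
        count = N - m  # number of vectors summed in Eqs. (10) and (11)
        vectors = []
        for i in range(count):
            segment = x[i:i + dim]
            w0 = sum(segment) / dim                     # Eq. (7): local mean
            vectors.append([v - w0 for v in segment])   # Eq. (6): baseline removed
        total = 0.0
        for i in range(count):
            for j in range(count):
                if i != j:
                    # Eq. (8): maximum absolute element-wise difference
                    d = max(abs(a - b) for a, b in zip(vectors[i], vectors[j]))
                    # Eq. (9): exponential fuzzy membership
                    total += math.exp(-((d / r) ** n))
        return total / (count * (count - 1))

    # Eq. (12): difference of the log-similarities at dimensions m and m + 1
    return math.log(similarity(m)) - math.log(similarity(m + 1))


random.seed(0)
sine = [math.sin(2 * math.pi * t / 50) for t in range(200)]   # regular signal
noise = [random.gauss(0.0, 1.0) for _ in range(200)]          # irregular signal
fe_sine = fuzzy_entropy(sine)
fe_noise = fuzzy_entropy(noise)
print(fe_sine, fe_noise)
```

The regular sine wave should receive a lower FE value than the white noise, which matches the use of FE as a complexity measure. The double loop is quadratic in the sequence length, so a vectorized version would be preferable for long series.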

2.3 Max-relevance and min-redundancy (mRMR)

The mRMR method is a typical filter method based on space search proposed by Peng et al. in 2005, which uses mutual information to measure the relevance and redundancy of features (González-Enrique et al. 2021). Let \(W_{n} = \{ z_{1} ,z_{2} ,...,z_{n} \}\) be the set of influencing factor features, from which \(m\) features that are highly correlated with PM2.5 are to be selected out of the \(n\) influencing factors. First, the mutual information \(MI(s(t),z_{i} )\) between the PM2.5 concentration \(s(t)\) and each influencing factor is calculated as:

$$MI(s(t),z_{i} ) = \int {\int {p(s(t),z_{i} )\log \frac{{p(s(t),z_{i} )}}{{p(s(t))p(z_{i} )}}ds(t)dz_{i} } }$$
(13)

The mutual information between the influencing factors is:

$$I(z_{i} ,z_{j} ) = \int {\int {p(z_{i} ,z_{j} )\log \frac{{p(z_{i} ,z_{j} )}}{{p(z_{i} )p(z_{j} )}}dz_{i} dz_{j} } }$$
(14)

where \(p\) is the probability density function, \(z_{i} ,z_{j} \in W_{n}\), \(i \ne j\), and \(s(t)\) is the PM2.5 concentration sequence. Then find the feature subset \(S_{m}\) containing \(m\) features, where \(m \le n\), \(S_{m} \subseteq W_{n}\). The formulae for the maximum relevance calculation principle and the minimum redundancy calculation principle are as follows:

$$D(S_{m} ,s(t)) = \frac{1}{{|S_{m} |}}\sum\limits_{{z_{i} \in S_{m} }} {MI(s(t),z_{i} )}$$
(15)
$$N(S_{m} ) = \frac{1}{{|S_{m} |^{2} }}\sum\limits_{{z_{i} ,z_{j} \in S_{m} }} {I(z_{i} ,z_{j} )}$$
(16)

where \(|S_{m} |\) is the number of features in the set \(S_{m}\). The formula for integrating the maximum relevance and minimum redundancy is as follows.

$$\max \phi (D,N), \, \phi = D - N$$
(17)

Suppose that the factor \(z_{k}\) with the largest mutual information with \(s(t)\) among the influencing factors is extracted as the first characteristic factor within \(S_{m}\), and the remaining influencing factors are \(W_{n} = W_{n} - z_{k}\). The mutual information of the influencing factors within \(W_{n}\) with \(s(t)\) is calculated separately, and \(\phi\) is maximized by selecting the characteristics. \(\phi\) is calculated as:

$$\max \vartriangle \phi = MI(s(t),z_{i} ) - \frac{1}{{|W_{n} | - 1}}\sum\limits_{{z_{j} \in W_{n} ,j \ne i}} {I(z_{i} ,z_{j} )}$$
(18)

In the above equation, \(\vartriangle \phi\) is the increment of the criterion \(\phi\), that is, the difference between the mutual information of the influencing factor \(z_{i}\) in \(W_{n}\) with \(s(t)\) and the average mutual information between \(z_{i}\) and the other influencing factors in \(W_{n}\). The magnitude of \(\vartriangle \phi\) can be used as a basis for evaluating the importance of features. In addition, \(|W_{n} |\) is the number of features in the set \(W_{n}\).
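The greedy selection can be illustrated with a small sketch. It assumes discrete (or already discretized) variables so that mutual information can be estimated from counts, and it uses the common incremental form of mRMR in which redundancy is averaged over the features already selected, which may differ in detail from Eq. (18). The names `mutual_information`, `mrmr_select`, and the toy features `z1` to `z3` are hypothetical.

```python
import math
from collections import Counter


def mutual_information(a, b):
    """Plug-in estimate of mutual information for two discrete sequences."""
    n = len(a)
    pa, pb, pab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(c / n * math.log(c * n / (pa[x] * pb[y]))
               for (x, y), c in pab.items())


def mrmr_select(target, features, m):
    """Greedily pick m feature names by relevance minus mean redundancy."""
    remaining = dict(features)
    selected = []
    while remaining and len(selected) < m:
        def score(name):
            relevance = mutual_information(target, remaining[name])
            if not selected:
                return relevance
            redundancy = sum(mutual_information(remaining[name], features[s])
                             for s in selected) / len(selected)
            return relevance - redundancy
        best = max(remaining, key=score)
        selected.append(best)
        del remaining[best]
    return selected


# Toy data: the target carries two bits; z1 and z2 both copy the high bit,
# z3 carries the low bit, so z2 is redundant once z1 is chosen (and vice versa).
target = [0, 1, 2, 3] * 10
features = {
    "z1": [y // 2 for y in target],
    "z2": [y // 2 for y in target],
    "z3": [y % 2 for y in target],
}
selected = mrmr_select(target, features, 2)
print(selected)
```

In this toy case the duplicated factor is skipped in favour of the complementary one, which is the behaviour the redundancy term is meant to produce. Continuous meteorological and pollutant variables would first need to be binned or handled with a continuous mutual information estimator.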

2.4 Long short-term memory (LSTM)

RNN, a type of neural network with a memory function, is well suited to time series problems. However, it suffers from problems such as vanishing and exploding gradients. LSTM, a modified RNN, consists of a memory cell, an input gate, a forget gate and an output gate. The input and forget gates determine whether new input information is added and whether past states are forgotten. The output gate, in turn, determines whether the long-term state is propagated to the final state. LSTM effectively avoids the vanishing gradient problem of RNN while retaining long-term memory capability (Barzegar et al. 2020). In this study, LSTM is used for the prediction and nonlinear integration of PM2.5 concentration.

The actual PM2.5 concentration value at time \(t\) is assumed to be \(x_{t}\), and \(\hat{x}_{t}\) is the predicted value corresponding to the PM2.5 concentration. Moreover, \(f_{t}\), \(i_{t}\) and \(o_{t}\) represent the forget gate, the input gate and the output gate, respectively. The main formulae for each component of the LSTM are shown below:

$$f_{t} = \sigma \left( {U_{f} x_{t} + V_{f} \hat{x}_{t} + b_{f} } \right)$$
(19)
$$i_{t} = \sigma \left( {U_{i} x_{t} + V_{i} \hat{x}_{t} + b_{i} } \right)$$
(20)
$$o_{t} = \sigma \left( {U_{o} x_{t} + V_{o} \hat{x}_{t} + b_{o} } \right)$$
(21)
$$\tilde{c}_{t} = \tanh \left( {U_{{\tilde{c}}} x_{t} + V_{{\tilde{c}}} \hat{x}_{t} + b_{{\tilde{c}}} } \right)$$
(22)
$$c_{t} = f_{t} * c_{t - 1} + i_{t} * \tilde{c}_{t}$$
(23)
$$h_{t} = \tanh (c_{t} ) * o_{t}$$
(24)

In the above equations, \(U\) and \(V\) are the weight matrices, \(b\) is the bias term, \(\tanh ( \cdot )\) and \(\sigma ( \cdot )\) are the activation functions, and '\(*\)' denotes the element-wise product. The LSTM is built from these memory blocks and is trained by backpropagation through time. Moreover, LSTM is prone to overfitting or gradient explosion when dealing with long sequences. Adding Dropout layers with a suitably tuned Dropout rate can improve the generalization ability of the model and help avoid overfitting.
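To make the gate equations concrete, the following sketch runs one scalar LSTM cell over a short sequence. It is only an illustration under stated assumptions: the hidden size is one, the weights in `params` are arbitrary, and the recurrent input is taken to be the previous hidden state \(h_{t - 1}\), as in the standard LSTM formulation, whereas Eqs. (19)-(22) write this term as \(\hat{x}_{t}\).

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x_t, h_prev, c_prev, p):
    """One forward step of a scalar LSTM cell, mirroring Eqs. (19)-(24)."""
    f_t = sigmoid(p["U_f"] * x_t + p["V_f"] * h_prev + p["b_f"])        # forget gate
    i_t = sigmoid(p["U_i"] * x_t + p["V_i"] * h_prev + p["b_i"])        # input gate
    o_t = sigmoid(p["U_o"] * x_t + p["V_o"] * h_prev + p["b_o"])        # output gate
    c_tilde = math.tanh(p["U_c"] * x_t + p["V_c"] * h_prev + p["b_c"])  # candidate state
    c_t = f_t * c_prev + i_t * c_tilde                                  # Eq. (23)
    h_t = math.tanh(c_t) * o_t                                          # Eq. (24)
    return h_t, c_t


# Arbitrary illustrative weights, not trained values.
params = {"U_f": 0.5, "V_f": 0.1, "b_f": 0.0,
          "U_i": 0.4, "V_i": 0.2, "b_i": 0.0,
          "U_o": 0.3, "V_o": 0.3, "b_o": 0.0,
          "U_c": 0.6, "V_c": 0.1, "b_c": 0.0}

h, c = 0.0, 0.0
hidden_states = []
for x in [0.2, 0.5, 0.9, 0.4]:          # a short normalized input sequence
    h, c = lstm_step(x, h, c, params)
    hidden_states.append(h)
print(hidden_states)
```

Because the output gate lies in (0, 1) and \(\tanh\) lies in (-1, 1), every hidden state stays strictly inside (-1, 1), which is why the inputs and targets of an LSTM are usually normalized before training.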

2.5 Grey wolf optimizer (GWO)

Mirjalili et al. (2014) proposed the GWO algorithm, which is a new swarm intelligence method. GWO has been reported to outperform many established algorithms such as particle swarm optimization (PSO) (Sulaiman et al. 2015). GWO simulates the hunting behavior and social hierarchy of grey wolves. In general, GWO divides the social hierarchy of wolves into four levels. The first level of the pyramid is the leader of the wolf pack, which is called \(\alpha\). The second level is \(\beta\), which is second only to \(\alpha\) in dominance. The third level is \(\delta\), which obeys the decisions of \(\alpha\) and \(\beta\). The bottom level is \(\omega\), which is responsible for the balance within the pack. The GWO optimization process involves the social hierarchy, searching for prey, tracking, encircling and attacking prey. GWO keeps the best three wolves (\(\alpha\), \(\beta\), \(\delta\)) in each iteration and updates the positions of the other wolves according to these three best solutions.
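A compact sketch of the search loop may help. The position update below follows the commonly cited GWO formulation (a control parameter decreasing linearly from 2 to 0, and each wolf moving to the average of three positions steered by \(\alpha\), \(\beta\) and \(\delta\)); these update rules are not spelled out in the text above, and the function name `gwo_minimize`, the sphere test function and all parameter values are assumptions for illustration.

```python
import random


def gwo_minimize(f, bounds, n_wolves=20, n_iter=100, seed=0):
    """Minimize f over box bounds with a basic grey wolf optimizer."""
    rng = random.Random(seed)
    dim = len(bounds)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    leaders = []  # best-so-far alpha, beta, delta positions
    for it in range(n_iter):
        # Keep the three best positions found so far.
        leaders = [list(w) for w in sorted(leaders + wolves, key=f)[:3]]
        a = 2.0 * (1.0 - it / n_iter)  # decreases linearly from 2 towards 0
        moved = []
        for w in wolves:
            position = []
            for d in range(dim):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    D = abs(C * leader[d] - w[d])   # distance to this leader
                    estimate += leader[d] - A * D   # position suggested by it
                lo, hi = bounds[d]
                position.append(min(max(estimate / 3.0, lo), hi))
            moved.append(position)
        wolves = moved
    best = min(leaders + wolves, key=f)
    return best, f(best)


def sphere(p):
    return sum(v * v for v in p)


best_position, best_value = gwo_minimize(sphere, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_position, best_value)
```

Large values of the control parameter early in the run encourage exploration, while small values late in the run make the pack close in on the three leaders.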

2.6 GWO-LSTM model

The training process of LSTM mainly updates the weights and biases, whereas the hyperparameters must be chosen beforehand, and their choice can significantly affect the prediction performance of LSTM. Related studies show that the number of neural units directly affects the fitting ability of the model: increasing the number of LSTM neural units can increase the fitting ability of the prediction model, but too many neural units may lead to overfitting. However, there is no clear method for selecting the number of neural units. In addition, the batch size is closely related to the weight update of the model and affects both the convergence speed and the prediction performance of the prediction model. Traditional prediction model research often relies on empirical selection when adjusting hyperparameters, repeatedly experimenting and adjusting them until the training set prediction error is minimized. This approach is time-consuming and makes it difficult to obtain the best hyperparameters for prediction models.

To balance model complexity and prediction accuracy, the hyperparameters of the LSTM network are optimized using the GWO algorithm. Before training, the look-back step of the historical data fed to the input layer of the LSTM is determined first, and then the number of iterations, the number of grey wolves, and the dimensionality of GWO are set. The fitness function is set as:

$$fitness = \frac{1}{N}\sum\nolimits_{i = 1}^{N} {|y_{i} - y_{i}^{^{\prime}} |}$$
(25)

where \(y_{i}\) is the training set data, \(y_{i}^{^{\prime}}\) is the predicted data, and \(N\) is the length of the data. The number of hidden layer neurons, the batch size and the Dropout rate are selected as the target hyperparameters for optimization. Each target hyperparameter corresponds to one dimension of the wolf positions in GWO, which transforms the hyperparameter selection of the neural network into a search for the best wolf position in a multidimensional space. For each individual, the hyperparameters are substituted into the LSTM to calculate the corresponding predicted values \(y_{i}^{^{\prime}}\), and the fitness value is calculated according to Eq. (25). After continued iteration, the optimal hyperparameter solution of the LSTM network is returned. The pseudo code of the GWO-LSTM algorithm is shown below.

Figure a: Pseudocode of the GWO-LSTM algorithm
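The link between a wolf position and an LSTM configuration can be sketched as follows. This is a hedged illustration of the mapping and of the fitness in Eq. (25) only: `decode_position`, `make_fitness`, the search bounds, and the stand-in `dummy_train_and_predict` are hypothetical, and a real run would train an LSTM inside that callable instead.

```python
def mae(y_true, y_pred):
    """Mean absolute error, the fitness of Eq. (25)."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def decode_position(position, bounds):
    """Map a continuous wolf position to LSTM hyperparameters."""
    clipped = [min(max(p, lo), hi) for p, (lo, hi) in zip(position, bounds)]
    return {
        "units": int(round(clipped[0])),       # hidden layer neurons
        "batch_size": int(round(clipped[1])),  # batch size
        "dropout": clipped[2],                 # Dropout rate
    }


def make_fitness(train_and_predict, y_train, bounds):
    """Wrap a training routine so that GWO can score a wolf position."""
    def fitness(position):
        hyperparameters = decode_position(position, bounds)
        return mae(y_train, train_and_predict(hyperparameters))
    return fitness


# Assumed search ranges for (units, batch size, Dropout rate).
bounds = [(8, 128), (16, 256), (0.0, 0.5)]
y_train = [1.0, 2.0, 3.0, 4.0]


def dummy_train_and_predict(hp):
    # Stand-in for LSTM training: pretends the error grows with the Dropout rate.
    return [y + hp["dropout"] for y in y_train]


fitness = make_fitness(dummy_train_and_predict, y_train, bounds)
print(decode_position([200.7, 3.2, 0.9], bounds), fitness([64.0, 32.0, 0.25]))
```

Rounding the first two dimensions keeps the search space continuous for GWO while still yielding integer layer sizes and batch sizes for the network.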

3 Structure of the proposed hybrid framework

3.1 Proposed hybrid framework

This study proposes a deep learning hybrid framework for air pollutant prediction and early warning based on multi-factor multi-scale and two-stage intelligent optimization, which combines CEEMDAN, FE, mRMR, GWO and LSTM. As shown in Fig. 1, the framework is mainly divided into four stages.

Stage 1: Feature extraction

CEEMDAN can decompose the PM2.5 concentration sequence adaptively into several patterns of different amplitudes and frequencies. The patterns obtained from the decomposition are arranged from high to low frequency. Compared with the original PM2.5 sequence, these patterns have a simpler structure, more stable fluctuations and more regularity, so they can be predicted more easily. However, some of these patterns share similar trends and complexity. Fuzzy entropy can effectively measure the complexity of different sequences: the higher the entropy value, the higher the complexity of the sequence, and the lower the entropy value, the lower the complexity of the sequence. According to the similar trend and complexity of different patterns, the decomposed patterns can be reconstructed into several new components. Each reconstructed component has unique characteristics and contains different intrinsic features of PM2.5 concentration.
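One simple way to carry out such a reconstruction is sketched below. The grouping rule is an assumption made for illustration, since the exact rule is not stated here: adjacent patterns are merged while their FE values stay within a tolerance `tol` of the first pattern in the current group, and the merged patterns are summed point by point. The function names and the toy numbers are hypothetical.

```python
def group_by_entropy(entropies, tol):
    """Group adjacent pattern indices whose FE values are within tol."""
    groups = [[0]]
    for k in range(1, len(entropies)):
        if abs(entropies[k] - entropies[groups[-1][0]]) <= tol:
            groups[-1].append(k)   # similar complexity: same component
        else:
            groups.append([k])     # different complexity: start a new component
    return groups


def reconstruct(patterns, groups):
    """Sum the patterns of each group into one reconstructed component."""
    length = len(patterns[0])
    return [[sum(patterns[k][t] for k in group) for t in range(length)]
            for group in groups]


# Toy FE values for five patterns ordered from high to low frequency.
entropies = [1.2, 1.1, 0.5, 0.45, 0.1]
patterns = [[1, 2, 3], [1, 1, 1], [0, 1, 0], [2, 2, 2], [5, 5, 5]]
groups = group_by_entropy(entropies, tol=0.2)
components = reconstruct(patterns, groups)
print(groups, components)
```

Because the components are plain sums of the original patterns, adding all components back together still reproduces the decomposed sequence, while the number of models to train drops from five to three in this toy case.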

Stage 2: Feature Selection

The fluctuation of PM2.5 concentration is affected by complex factors such as environmental factors, human factors and meteorological factors, which leads to complex characteristics such as nonlinearity and nonstationarity of PM2.5. In this study, meteorological factors and pollutant factors are taken into account in the prediction of PM2.5 concentrations to improve the prediction accuracy and generalization ability of the model. There is redundancy between different influencing factors. Directly using all influencing factors for prediction research may lead to problems such as the cumulative error of prediction model. At the same time, different reconstructed components contain different intrinsic characteristics, and the same influencing factors have different degrees of influence on different reconstructed components. Therefore, in this paper, the mRMR algorithm is used to select the features of different components obtained after the decomposition and reconstruction of PM2.5. Several exogenous variables that have a strong influence on different reconstructed components are selected as input variables to improve the prediction accuracy and generalization performance of the model.

Stage 3: Two-stage intelligent optimization model

PM2.5 concentrations are influenced by its own historical concentration data and related factors and change gradually over time. LSTM can effectively capture the nonlinear relationship in the sequence and has the ability of long-term memory, which can combine the historical and current information in the long-term memory to make effective prediction for the future. Therefore, LSTM is used to predict PM2.5 future concentrations.

To balance the computational efficiency and prediction accuracy of the prediction model, this paper uses the GWO algorithm to optimize the hyperparameters of the LSTM. Based on the GWO-LSTM, a two-stage intelligent optimization model is developed: in the first stage, a prediction model is built for each reconstructed subsequence, and in the second stage, all predictions are nonlinearly integrated to obtain the final PM2.5 concentration prediction results.

Stage 4: Air Pollution Prediction and Warning

China's Ambient Air Quality Standards, released in 2012 and implemented in 2016, set air quality requirements to further prevent and control air pollution and to protect people's physical and mental health. The standard divides ambient air functional areas into two categories: areas requiring special protection, such as nature reserves, and other areas, such as residential and industrial zones. It also sets concentration limits for pollutants, which provide scientific support for the monitoring and management of environmental quality nationwide. To make air pollution information easier to follow, private individuals in China have set up various environmental monitoring websites that publish air pollution information for major cities. These websites use more refined criteria for assessing air pollution levels, which makes the levels more intuitive for the public.

Based on the proposed hybrid prediction framework, this paper makes effective predictions of PM2.5 future concentrations. As shown in Table 1, this paper also makes reference to the air pollution level criteria from the PM2.5 real-time monitoring network (http://www.pm25china.net/) to provide early warnings of future air pollution levels, helping people to prepare coping strategies in advance and the government to take pollution prevention and control measures in advance in a targeted manner.

Table 1 PM2.5 air pollution standards \((\mu {\text{g/m}}^{3} )\)

3.2 Evaluation criteria

To assess the predictive performance of various models, we must choose appropriate evaluation metrics. In this paper, four popular evaluation metrics are chosen to measure the performance of models, including mean absolute error (MAE), root mean square error (RMSE), coefficient of determination (R2), and mean absolute percentage error (MAPE). These metrics have been widely used in pollution prediction studies (Sun and Li 2020; Wu et al. 2020), and the details of each metric are described as follows:

$$MAE = \frac{1}{k}\sum\nolimits_{i = 1}^{k} {|y_{i} - \hat{y}_{i} |}$$
(26)
$$RMSE = \sqrt {\frac{{\sum\nolimits_{i = 1}^{k} {(\hat{y}_{i} - y_{i} )^{2} } }}{k}}$$
(27)
$$R^{2} = 1 - \frac{{\sum\nolimits_{i = 1}^{k} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum\nolimits_{i = 1}^{k} {(y_{i} - \overline{y} )^{2} } }}$$
(28)
$$MAPE = \frac{1}{k}\sum\nolimits_{i = 1}^{k} {|\frac{{\hat{y}_{i} - y_{i} }}{{y_{i} }}|} \times 100\%$$
(29)

where \(k\) represents the number of samples in the test set, \(y_{i}\) represents the true value, \(\overline{y}\) represents the mean of the true values, and \(\hat{y}_{i}\) represents the predicted value.
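As a cross-check of Eqs. (26) to (29), the four metrics can be sketched in plain Python as follows. The function names are illustrative and are not taken from the paper's implementation.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error, Eq. (26)."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (27)."""
    return math.sqrt(sum((p - t) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination, Eq. (28)."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Eq. (29).

    Requires every true value to be non-zero.
    """
    return sum(abs((p - t) / t) for t, p in zip(y_true, y_pred)) / len(y_true) * 100.0
```

Lower MAE, RMSE and MAPE indicate better predictions, whereas an R2 closer to 1 indicates a better fit.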

4 Case analysis

4.1 Data collection

4.1.1 PM2.5 concentration data

With the rapid development of the economy, air pollution has become an urgent problem in China. Researchers have focused their PM2.5 prediction studies on economically developed cities in China, such as Beijing (Luo et al. 2018), Shanghai (Xu et al. 2017), and Wuhan (Wang et al. 2017). However, other industrial cities in China, where air pollution is more severe, are often neglected. In this paper, we fill this gap in the literature by selecting Xingtai and Anyang, two of the most polluted cities in China, as the study samples, with reference to the 2019 China Ecological Environment Status Bulletin published by the Chinese Ministry of Ecology and Environment in 2020. In addition, Beijing, the capital city of China, is added as a research sample to verify the validity of the hybrid framework proposed in this paper.

In 2020 and 2021, some meteorological observations were suspended because of COVID-19, and the relevant data are missing. Therefore, the data set for this study consists of the daily average PM2.5 concentrations of the three cities from January 1, 2018 to December 31, 2019. The data were obtained from the Ministry of Ecology and Environment of China (http://www.mee.gov.cn). As shown in Figs. 1 and 2, the sample data are divided into training and test sets at a ratio of 8:2.

Fig. 1 Flow chart of the framework

Fig. 2 Sample data of Xingtai, Anyang and Beijing

4.1.2 Influencing factors of PM2.5

The causes of PM2.5 pollution are complex. For example, pollutants such as nitrogen oxides and sulfur dioxide in the atmosphere readily produce secondary fine particulate pollutants through chemical reactions. Human industrial production and daily activities also emit large quantities of fine particles. Moreover, relevant studies show that meteorological factors have a stronger influence on PM2.5 than other factors (Chen et al. 2017). For example, wind speed and direction affect the range and speed of pollutant diffusion. PM2.5 is easily adsorbed to water vapor, so when the humidity is high, the concentration of PM2.5 is high. When the temperature increases, the concentration of PM2.5 decreases continuously, and when the temperature decreases, the concentration of PM2.5 increases significantly. Fig. 2 plots the daily PM2.5 concentration curves. The PM2.5 concentration changes markedly between seasons, showing a "double peak" distribution. The concentration of fine particles in winter and spring is significantly higher than that in summer and autumn, which is related to the temperature difference between seasons. Based on the relevant literature and the availability of data, this paper introduces 11 influencing factors of PM2.5: average wind speed, maximum sustained wind speed, average air temperature, average dew point, maximum temperature, minimum temperature, PM10, SO2, CO, NO2 and O3.

In artificial intelligence algorithms, normalizing the data to a dimensionless scale can accelerate convergence and prevent outlying samples from distorting the results. Min-max normalization is adopted in this study, and the formula is as follows:

$$x^{\prime} = \frac{x - \min (x)}{{\max (x) - \min (x)}}$$
(39)

where the original value is \(x\) and the normalized result is \(x^{\prime}\).
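A minimal sketch of this normalization and its inverse is given below. Fitting the minimum and maximum on the training portion only and reusing them for the test portion is an assumption made here to avoid information leakage; the paper does not state which portion the extrema are taken from.

```python
def fit_min_max(values):
    """Return the (min, max) pair used for scaling."""
    return min(values), max(values)

def normalize(values, lo, hi):
    """Apply x' = (x - min) / (max - min) with the given extrema."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]  # constant series: map everything to 0
    return [(v - lo) / span for v in values]

def denormalize(scaled, lo, hi):
    """Map normalized predictions back to the original unit."""
    return [s * (hi - lo) + lo for s in scaled]
```

Predictions produced on the normalized scale would be passed through `denormalize` before the evaluation metrics are computed in the original unit.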

4.2 Decomposition of original PM2.5 series by CEEMDAN

In the proposed framework, the original PM2.5 concentration sequence is decomposed by CEEMDAN. Before that, two parameters, \(k\) and \(\varepsilon\), need to be set for CEEMDAN. Based on the relevant literature and several trials, \(k\) and \(\varepsilon\) are set to 100 and 0.005, respectively. The Xingtai PM2.5 concentration sequence is split into seven subsequences, as illustrated in Fig. 3A: six intrinsic mode functions, named \(IMFi\left( {i = 1,2,..., \, 6} \right)\), and a Residual. Meanwhile, the original PM2.5 sequences of Anyang and Beijing are decomposed into 8 and 7 subsequences, respectively. In addition, the Pearson correlation coefficients between the original data and each IMF are calculated to explore the relationship between the decomposed subsequences and the original sequence; they are presented as bar charts in Figs. 3B, 4B and 5B.
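The correlation analysis described above can be sketched as follows. The helper names are illustrative, and `imfs` is assumed to be a list of equal-length subsequences produced by any CEEMDAN implementation.

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for a constant series; 0.0 is used as a convention
    return cov / math.sqrt(var_a * var_b)

def imf_correlations(original, imfs):
    """Correlation of each decomposed subsequence with the original series."""
    return [pearson(original, imf) for imf in imfs]
```

A subsequence with a coefficient close to 1 follows the original series closely, while a coefficient near 0 indicates that it contributes little to the overall trend.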

Fig. 3 Decomposition and reconstruction results of Xingtai dataset

Fig. 4 Decomposition and reconstruction results of Anyang dataset

Fig. 5 Decomposition and reconstruction results of Beijing dataset

4.3 Subsequence reconstruction by FE

Fuzzy entropy measures the complexity of a sequence, and Figs. 3C, 4C and 5C show the fuzzy entropy of each subsequence. In this paper, subsequences with similar trends and fuzzy entropy values are merged, so that the decomposed subsequences are reconstructed into three new components. Taking Xingtai as an example: (1) IMF1 forms the high-frequency component S-IMF1, which reflects the random fluctuation of PM2.5 concentration caused by various complex factors. Although the high-frequency component may lead to short-term drastic changes in PM2.5 concentration, it does not cause long-term effects on PM2.5 concentration fluctuations (Tai et al. 2010). (2) IMF2 and IMF3 are reconstructed as the intermediate-frequency component S-IMF2, which reflects the periodic variation of PM2.5 concentration caused by atmospheric quasi-biennial oscillations, weather-scale system cycles, or periodic human activities (Kim et al. 2010; You et al. 2009; Zhang et al. 2015). (3) IMF4, IMF5, IMF6 and the residual are reconstructed as the low-frequency component S-IMF3, which is more stable and effectively characterizes the trend of PM2.5 concentrations during seasonal change (Wang et al. 2006). The three components obtained from the reconstruction of each dataset are shown in Figs. 3D, 4D and 5D. Each component has unique characteristics, so selecting appropriate influencing factors and constructing a prediction model according to the data characteristics of each component improves the accuracy and stability of the prediction.
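The reconstruction step amounts to summing the subsequences assigned to each group. The sketch below encodes the Xingtai grouping described above; the zero-based index convention (index 6 denoting the residual) and the function name are assumptions for illustration.

```python
def reconstruct_components(imfs, groups):
    """Sum the subsequences listed in each group into one component.

    imfs: list of equal-length sequences, ordered IMF1, IMF2, ..., residual.
    groups: mapping from component name to the list of indices it merges.
    """
    length = len(imfs[0])
    components = {}
    for name, indices in groups.items():
        components[name] = [sum(imfs[i][t] for i in indices) for t in range(length)]
    return components

# Xingtai grouping from the text: IMF1 | IMF2 + IMF3 | IMF4 + IMF5 + IMF6 + residual.
XINGTAI_GROUPS = {
    "S-IMF1": [0],
    "S-IMF2": [1, 2],
    "S-IMF3": [3, 4, 5, 6],
}
```

Because every subsequence belongs to exactly one group, the three components still add up to the sum of all decomposed subsequences.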

Fig. 6 Prediction accuracy of the hybrid model on three data sets

4.4 Influencing factors selection by mRMR

After obtaining the three reconstructed components, this paper uses the mRMR method for each component to explore the main influencing factors of different components. Using Xingtai as an example, Table 2 demonstrates the mRMR results for the three components.

Table 2 Order of exogenous variables in Xingtai dataset

According to the results in Table 2, for the irregular fluctuation component of PM2.5, O3, PM10, mean dew point, maximum temperature and CO lead the ranking, indicating that they have the greatest influence on it. The short-term fluctuation component is strongly influenced by O3, CO, mean dew point, mean temperature and mean wind speed, while the low-frequency component is strongly affected by O3, CO, mean temperature, mean dew point and maximum sustained wind speed. Because the influencing factors are correlated with one another, using all of them as input features may reduce prediction performance. Therefore, this paper selects the top 5 exogenous variables for each component as the input variables for prediction.
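The greedy ranking behind mRMR can be illustrated with the sketch below. mRMR is usually defined with mutual information; here the absolute Pearson correlation is swapped in as a simple stand-in for both relevance and redundancy, so this is an approximation of the idea rather than the procedure used in this paper. All names are illustrative.

```python
import math

def _abs_corr(a, b):
    """Absolute Pearson correlation, used here as a proxy for mutual information."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return abs(cov) / math.sqrt(var_a * var_b)

def mrmr_rank(features, target, top_k=5):
    """Greedily pick features with high relevance and low redundancy.

    features: mapping from variable name to its sequence of values.
    target: the reconstructed component to be predicted.
    """
    relevance = {name: _abs_corr(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining and len(selected) < top_k:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(_abs_corr(features[name], features[s]) for s in selected)
            return relevance[name] - redundancy / len(selected)
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `top_k=5`, the function returns five variable names per component, mirroring the top-5 selection described above; a near-duplicate of an already selected variable is penalized by the redundancy term even if it is highly relevant.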

4.5 Two-stage intelligent optimization model

LSTM is well suited to processing and forecasting time series data, and the selection of hyperparameters is critical in LSTM training. Increasing the number of layers and neurons can effectively improve the fitting ability of the model, but it also increases the risk of overfitting. The Dropout mechanism counters this: given a Dropout ratio, the model randomly discards the corresponding proportion of neurons during training, which effectively prevents overfitting. After a large number of experiments and parameter adjustments, an LSTM with two hidden layers was found to give excellent prediction performance and robustness on the different data sets; the look-back is set to 4, and the upper limit of epochs for each experiment is set to 1000. To balance prediction efficiency and prediction accuracy, the GWO algorithm is used to optimize three hyperparameters: the number of hidden layer neurons, the batch size and the Dropout ratio. The gray wolf population size is set to 25, and the number of iterations is 100. Finally, a suitable prediction model is established for each reconstructed component. Table 3 shows the prediction accuracy of the hybrid framework on the three data sets.
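A look-back of 4 is read here as using the four most recent daily values to predict the next one; this interpretation and the helper below are assumptions for illustration, since the exact input layout is not given in the text.

```python
def make_windows(series, look_back=4):
    """Turn a sequence into supervised samples.

    Each input holds `look_back` consecutive values and the target is the
    value that immediately follows them.
    """
    inputs, targets = [], []
    for start in range(len(series) - look_back):
        inputs.append(list(series[start:start + look_back]))
        targets.append(series[start + look_back])
    return inputs, targets
```

A series of length N yields N - 4 samples under this scheme, so the first four days of each series serve only as inputs.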

Table 3 Model prediction accuracy of each reconstructed sub-sequences

The LSTM can effectively capture the information in nonlinear data, and the nonlinear integration can obtain higher prediction accuracy and prediction stability. Therefore, after obtaining the prediction results for each reconstructed component, the GWO-LSTM is used to nonlinearly integrate all the predictions to obtain the final prediction results. The performance of the hybrid framework proposed in this paper on three datasets is illustrated in Fig. 6.

4.6 Air pollutant forecasting and warning

The hybrid framework proposed in this paper obtains accurate PM2.5 concentration predictions on all three data sets, so effective forecasting of air pollutant concentrations can be achieved from the prediction results. Moreover, future air quality levels are flagged from the predictions according to the air quality criteria in Table 1, and the accuracy of the warnings is shown in Table 5. As two of the most polluted cities in China, Xingtai and Anyang have large fluctuations in pollutant concentrations. In the test set of 141 days, Xingtai has 14 days of light pollution, 6 days of medium pollution and 2 days of heavy pollution, and the warning accuracy of the proposed hybrid framework reaches 87%. In Anyang, there are 9 days of light pollution, 7 days of medium pollution and 7 days of heavy pollution, and the warning accuracy reaches 90%. In Beijing, which has a developed economy and vigorously combats air pollution, there are only 3 days of light pollution and the remaining days are good or excellent; the warning accuracy of the proposed hybrid framework reaches 93%. Therefore, the hybrid framework proposed in this paper can be used as an effective tool for air pollution forecasting and early warning.
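The warning step can be sketched as a threshold lookup followed by a level-by-level comparison. The cut-off values below are hypothetical placeholders, not the Table 1 criteria, and treating warning accuracy as the share of days whose predicted level equals the observed level is likewise an assumption.

```python
from bisect import bisect_right

# Hypothetical placeholder cut-offs in ug/m^3; replace with the Table 1 criteria.
EXAMPLE_BOUNDS = [35, 75, 115, 150]
EXAMPLE_LABELS = ["excellent", "good", "light", "medium", "heavy"]

def pollution_level(concentration, bounds=EXAMPLE_BOUNDS, labels=EXAMPLE_LABELS):
    """Map a PM2.5 concentration to a level; a value equal to a bound
    falls into the higher level."""
    return labels[bisect_right(bounds, concentration)]

def warning_accuracy(actual, predicted, bounds=EXAMPLE_BOUNDS, labels=EXAMPLE_LABELS):
    """Share of days on which the predicted level matches the observed level."""
    hits = sum(
        pollution_level(a, bounds, labels) == pollution_level(p, bounds, labels)
        for a, p in zip(actual, predicted)
    )
    return hits / len(actual)
```

Under this definition a prediction counts as correct whenever it falls in the same band as the observation, even if the two concentrations differ numerically.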

4.7 Comparative experiments

To verify the effectiveness and superiority of the hybrid model proposed in this paper, we designed two sets of comparison experiments. The first set uses three commonly used optimization algorithms, Genetic Algorithm (GA), Particle Swarm Optimization (PSO) and Simulated Annealing (SA), to optimize the hyperparameters of the LSTM, in order to verify the superiority and effectiveness of the GWO-optimized LSTM. The second set introduces nine comparison models, eight of which are taken from eight recent papers in the same research area. As shown in Tables 4 and 5, the prediction accuracy of the hybrid model proposed in this paper exceeds that of all the comparative models.

Table 4 Comparison results of optimization algorithms
Table 5 Results of eight comparative models based on relevant literature

In the first set of comparison experiments, the number of hidden layers of the LSTM is set to 1, and the number of hidden layer neurons, the batch size and the Dropout rate are determined by the optimization algorithm. The search range is [2, 128] for the number of hidden layer neurons, [2, 256] for the batch size, and [0, 0.6] for the Dropout rate. Taking the data of Xingtai as an example, the number of iterations of each optimization algorithm is 50, and the iteration time and prediction performance are shown in Table 4. GWO finds the optimal hyperparameters of the LSTM faster and more effectively than the other optimization algorithms, which effectively improves the prediction performance of the model.
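The search over these three ranges can be sketched with a basic grey wolf optimizer. The update rule below follows the commonly described alpha, beta and delta scheme. The objective `toy_validation_error` is a hypothetical stand-in for training an LSTM and returning its validation error, so the block only illustrates the search mechanics, not the results in Table 4.

```python
import random

# Search ranges from the text: neurons, batch size, Dropout rate.
SEARCH_BOUNDS = [(2, 128), (2, 256), (0.0, 0.6)]

def decode(position):
    """Round the first two coordinates to integers; keep the Dropout rate continuous."""
    return int(round(position[0])), int(round(position[1])), position[2]

def toy_validation_error(position):
    """Hypothetical surrogate objective; a real run would train and score an LSTM."""
    neurons, batch, dropout = decode(position)
    return (neurons - 64) ** 2 + (batch - 32) ** 2 + 100.0 * (dropout - 0.2) ** 2

def gwo_minimize(objective, bounds, n_wolves=25, n_iter=100, seed=0):
    """Minimize `objective` inside box `bounds`; returns (best_position, best_value)."""
    rng = random.Random(seed)
    wolves = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(n_wolves)]
    best_pos, best_val = None, float("inf")
    for it in range(n_iter + 1):
        # Rank the pack; copies keep the leaders fixed while the pack moves.
        scored = sorted(((objective(w), list(w)) for w in wolves), key=lambda p: p[0])
        if scored[0][0] < best_val:
            best_val, best_pos = scored[0]
        if it == n_iter:
            break
        leaders = [pos for _, pos in scored[:3]]  # alpha, beta, delta
        a = 2.0 * (1.0 - it / n_iter)             # decreases linearly from 2 to 0
        for w in wolves:
            for d, (lo, hi) in enumerate(bounds):
                estimate = 0.0
                for leader in leaders:
                    A = 2.0 * a * rng.random() - a
                    C = 2.0 * rng.random()
                    estimate += leader[d] - A * abs(C * leader[d] - w[d])
                w[d] = min(max(estimate / 3.0, lo), hi)  # average, then clip to bounds
    return best_pos, best_val
```

For example, `gwo_minimize(toy_validation_error, SEARCH_BOUNDS)` returns a position that `decode` turns into a candidate (neurons, batch size, Dropout rate) triple. Tracking the best position seen so far is a design choice of this sketch, since the basic update moves every wolf in each iteration.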

In the second set of comparison experiments, the SVR model based on random forest (RF) for feature selection has the worst prediction performance. Although the introduction of exogenous variables can improve the robustness and prediction accuracy of the model, this can be achieved only on the basis of a reasonable treatment of PM2.5 concentration series. In addition, the introduction of exogenous variables can easily lead to problems such as error accumulation, which in turn affects the prediction accuracy. The advantages of ANN and LSTM in handling nonlinear sequences make their prediction accuracy better than RF-SVR. However, the advantage of LSTM in temporal patterns does not make its prediction performance significantly better than ANN. This is because the original PM2.5 concentration sequence is more complex and more volatile, which makes it more difficult for the LSTM to learn valid information from it. Therefore, we additionally constructed CEEMDAN-FE-mRMR-GWO-ANN for comparison. The results show that the hybrid model proposed in this paper has higher prediction performance. The effective processing of PM2.5 sequences and the inclusion of PM2.5 influencing factors make it easier for the LSTM to capture the long-term dependence in the data, which further improves the prediction performance.

As shown in Table 5, models 4 through 8 are decomposition-integration frameworks, and these models show a significant improvement in predictive performance compared to models 1 through 3. Taking Anyang as an example, the MAE, RMSE and MAPE of EEMD-LSTM are improved by 50.82%, 51.81% and 52.96%, respectively, compared with LSTM. The warning accuracy of the EEMD-LSTM model reaches 77%, while that of the LSTM model is only 55%. Among the five decomposition-integration frameworks, CEEMD-GWO-SVR has the worst prediction performance, because SVR is less capable of handling nonlinear time series than RF and LSTM. Although the prediction performance of CEEMD-RF and EMD-GRU is good in Xingtai and Anyang, their warning accuracy is low. In addition, the prediction performance of both models decreases substantially on the Beijing dataset, which indicates that they cannot be effectively applied to datasets from different cities and that their prediction stability is poor. Among the five decomposition-integration frameworks, VMD-SE-LSTM and EEMD-LSTM show good prediction performance, early warning accuracy and prediction stability on the different datasets. On the Anyang dataset, the hybrid prediction framework proposed in this paper improves the MAE, RMSE and MAPE by 32.92%, 27.65% and 30.02%, respectively, compared with EEMD-LSTM, and by 40.08%, 48.64% and 38.02%, respectively, compared with VMD-SE-LSTM. In addition, the hybrid prediction model proposed in this paper outperforms VMD-SE-LSTM and EEMD-LSTM in terms of warning accuracy and prediction stability on the different data sets.
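The improvement percentages quoted above are presumably relative reductions in error with respect to the baseline model; the paper does not spell out the formula, so the sketch below is an assumption.

```python
def improvement_pct(baseline_error, proposed_error):
    """Relative error reduction of the proposed model over a baseline, in percent."""
    return (baseline_error - proposed_error) / baseline_error * 100.0
```

Under this reading, a model that lowers an error metric from 10 to 7.5 improves it by 25%.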

In summary, the hybrid prediction framework proposed in this paper outperforms all comparative models in terms of prediction accuracy, warning accuracy and prediction stability. This proves that the hybrid prediction framework is suitable for air pollution forecasting and warning.

5 Conclusion

To prevent air pollution and protect human health, this paper proposes a multi-factor, multi-scale, two-stage deep learning hybrid framework based on intelligent optimization for air pollution forecasting and warning. First, feature extraction is performed using CEEMDAN and FE to decompose and reconstruct the original sequence into three components. Then, the mRMR algorithm is used for feature selection, filtering out the influencing factors that have a greater impact on each reconstructed component. Next, a two-stage deep learning hybrid model predicts each reconstructed component and nonlinearly integrates the predictions. Finally, based on the proposed hybrid model, air pollution prediction and early warning are achieved. The results show that: (1) the feature extraction methods based on CEEMDAN and FE can effectively discover the multiscale relationships in PM2.5 sequences and reduce the complexity of prediction; (2) the mRMR-based influencing factor selection method not only reduces the complexity of the data, but also improves the performance of the model; (3) the two-stage GWO-LSTM can effectively improve the prediction accuracy; (4) the model has practical significance and application value, and can realize effective forecasting and early warning of air pollution.

Using PM2.5 concentration data from Xingtai, Anyang and Beijing as the study samples, the empirical results statistically support the effectiveness of the proposed hybrid model in terms of prediction accuracy and robustness, and the model outperforms all comparative models.

In conclusion, the hybrid framework has advantages in prediction stability, prediction accuracy and accuracy of air pollution warning. Beyond air pollution prediction, the framework can be applied to other complex systems to verify its generality and versatility. However, no technique is flawless. As the theory matures and research progresses, more advanced and effective algorithms will be proposed. In the future, such algorithms can be added to the hybrid framework proposed in this paper to further improve the prediction performance of the model. In addition, only daily PM2.5 concentration data were considered in this study, so prediction of air pollutant concentrations on other time scales is also an option for future research.