Introduction

With the acceleration and deepening of industrialization and urbanization, air pollution has become an increasingly serious problem that heavily threatens human health through a variety of respiratory diseases such as chronic pharyngitis, chronic bronchitis, and bronchial asthma (Chang et al. 2020; Schwartz 1993; Yan et al. 2020). Besides, heavy air pollution leads to haze, resulting in low atmospheric visibility, traffic accidents, flight delays, and so on. Therefore, accurate air quality forecasting has drawn extensive attention in recent years, owing to its importance in environmental protection (Liao et al. 2015), government decision-making (Zheng et al. 2015), people's daily health (Ha Chi and Kim Oanh 2021), etc.

So far, a large number of big cities have established air quality monitoring stations in urban areas to observe real-time PM2.5 and other air pollutants such as PM10, CO, O3, NO2, SO2, etc. (Li and Cheng 2021; Wang et al. 2022a, b). In China, the air quality of cities in the east, north, and northeast is of particular concern, and prior studies have reported the chemical composition and mass concentration of PM2.5 in these areas (Gautam et al. 2019). Long-term exposure to PM2.5 can cause respiratory diseases (Chai et al. 2019; Yang et al. 2020). As a result, air pollution caused by PM2.5 has been regarded as a crucial threat to people's daily health. Hence, early diagnosis of air pollution occurrence and accurate PM2.5 concentration estimation are of great importance for air quality forecasting. At present, tremendous efforts have been devoted to air quality forecasting (Janarthanan et al. 2021; Liu et al. 2021; Mao et al. 2021; Voukantsis et al. 2011; Yi et al. 2019; Zhu et al. 2018). Existing approaches can be divided into two categories: deterministic methods and statistical methods. Deterministic methods usually work in a model-driven manner: they utilize aerodynamic theory to construct a numerical model that simulates pollutant discharge and the diffusion of atmospheric pollutant concentrations. Representative deterministic methods include the Nested Air Quality Prediction Modeling System (NAQPMS) (Wang et al. 2001), Chemical Transport Models (CTMs) (Mihailovic et al. 2009; Ponomarev et al. 2020), Weather Research and Forecasting (WRF) (Powers et al. 2017), Community Multiscale Air Quality (CMAQ) (Zhang et al. 2014), the complicated WRF-SMOKE-CMAQ model (de Almeida Albuquerque et al. 2018), and so on. However, these deterministic methods may provide inaccurate prediction results owing to the lack of real observations (Kukkonen et al. 2003). In addition, since many of their parameters must be set by experience, they easily suffer from expensive computation costs (Xu et al. 2017).

By contrast, statistical methods usually work in a data-driven manner: based on the observed data, they directly employ a statistical modeling strategy to forecast air pollutant concentrations. Conventional linear statistical methods for air quality prediction include the Autoregressive Moving Average (ARMA) (Graupe et al. 1975), Autoregressive Integrated Moving Average (ARIMA) (Cekim 2020; Jian et al. 2012), and Autoregressive Distributed Lag (ARDL) (Abedi et al. 2020) models. Nevertheless, these linear statistical methods rest on the assumption that linear relationships exist between data variables and target labels, which does not conform to the non-linearity of real-world observed data. Therefore, they may not achieve promising performance on air quality forecasting tasks. An alternative is to adopt nonlinear statistical machine learning methods for air quality forecasting. Representative nonlinear statistical machine learning methods are Support Vector Regression (SVR) (Chu et al. 2021; Yang et al. 2018), Artificial Neural Networks (ANNs) (Agarwal et al. 2020; Arhami et al. 2013), Random Forest (RF) (Gariazzo et al. 2020), eXtreme Gradient Boosting (XGBoost) (Chen and Guestrin 2016), and so on. Among these, ANNs have become one of the most popular approaches for air quality forecasting. For instance, Ding et al. (2016) employed sparse-response back-propagation feedforward neural networks to predict air pollutant concentrations. Zhao et al. (2020) integrated feedforward neural networks and recurrent neural networks to predict hourly air quality in northwest China. Liu and Zhang (2021) developed a method for AQI (air quality index) time series prediction by means of hybrid data decomposition and echo state networks. In recent years, ensemble learning over different ANNs has been an attractive direction. In particular, an ensemble method based on 10 distinct ANNs was used to estimate air pollution health risks (Araujo et al. 2020), and Wang et al. (2020) proposed a double decomposition and optimal combination ensemble learning method for interval-valued AQI forecasting. However, owing to their single-layer network structures, these traditional nonlinear statistical learning methods are shallow learning methods, resulting in limited feature learning ability and prediction performance on air quality forecasting tasks.

To alleviate the above-mentioned problem, recently emerged deep learning techniques (Hinton and Salakhutdinov 2006; LeCun et al. 2015) present a possible solution. With the aid of deep multi-layer network structures, deep learning techniques are capable of learning high-level feature representations from input data and exhibit excellent performance in fields such as computer vision, natural language processing, and signal processing. Well-known deep learning techniques include the Deep Belief Network (DBN) (Hinton and Salakhutdinov 2006), Convolutional Neural Network (CNN) (Krizhevsky et al. 2012), and Recurrent Neural Network (RNN) (Elman 1990) with its variant, Long Short-Term Memory (LSTM) (Hochreiter and Schmidhuber 1997). At present, a variety of deep learning techniques have been successfully applied to air quality forecasting (Akbal and Ünlü 2022; Dhakal et al. 2021; Wong et al. 2021; Yang et al. 2021; Zhang et al. 2020a, 2022; Zhou et al. 2022). For instance, a deep stacked autoencoder (AE) model (Li et al. 2016), a variant of DBN, was used to learn inherent air features for air quality prediction. Image-based air quality prediction based on CNNs (Chakma et al. 2017; Zhang et al. 2016) was proposed, in which CNNs were leveraged to classify natural images into different categories on the basis of their PM2.5 concentrations. An end-to-end deep learning model comprising CNNs and a Gradient Boosting Machine (GBM) (Luo et al. 2020) was proposed for PM2.5 concentration prediction in Shanghai, China. A Graph-based LSTM (GLSTM) model (Gao and Li 2021) was presented to predict PM2.5 concentrations in Gansu Province in northwest China.

In recent years, various hybrid deep learning structures have drawn extensive attention for air quality forecasting. In particular, a hybrid deep learning framework combining Variational Mode Decomposition (VMD) and Bi-directional LSTM (BiLSTM) (Zhang et al. 2021) was developed to predict PM2.5 changes in Chinese cities. A transfer learning-based BiLSTM (Ma et al. 2019) was utilized to improve air quality prediction performance. A spatio-temporal Convolutional LSTM Extended (C-LSTME) model (Wen et al. 2019), in which CNNs and LSTMs were integrated to learn high-level spatio-temporal features, was presented to predict air quality concentrations. Although the deep learning methods mentioned above have achieved good performance on air quality forecasting tasks, they still share a drawback: owing to the vanishing and exploding gradient problems in RNNs and LSTMs, as well as the limited spatial learning ability of the convolutional filters in CNNs, these sequence-aligned methods are restricted in modeling long-term and complex relationships in time series PM2.5 data.

To mitigate the above-mentioned issue, the recently developed Transformer (Vaswani et al. 2017), originally proposed for machine translation tasks in natural language processing, provides possible cues for long-term air quality prediction. The original Transformer model is constructed from self-attention mechanisms without any recurrent structures or convolutions. The motivation for the self-attention mechanisms in the Transformer is twofold. First, compared with recurrent structures, they provide more direct information flow across the whole sequence, thereby allowing more direct gradient flow. Second, they enable faster training than recurrent structures, since most operations can be implemented in parallel. So far, self-attention-based Transformers have shown performance superior to RNNs and LSTMs in capturing long-range dependencies in machine translation (Neishi and Yoshinaga 2019; Vaswani et al. 2017), speech recognition (Chen et al. 2021; Zeyer et al. 2019), image segmentation and classification (Bazi et al. 2021; Duke et al. 2021; Lanchantin et al. 2021), electricity-consuming load analysis (Yue et al. 2020; Zhou et al. 2021), and so on. Although self-attention-based Transformers possess a powerful capability for modeling long-range dependencies in sequence data, they require time and memory that increase quadratically with the sequence length. Besides, few studies have attempted to explore Transformer-based methods for long-term air quality forecasting. To address these two issues, this paper proposes a new lightweight deep learning model for air quality forecasting based on sparse attention-based Transformer networks (STN), so as to model long-term and complex relationships in time series PM2.5 data. In our STN, a multi-head sparse attention mechanism is designed to learn long-term dependencies over long spans of time series PM2.5 data while reducing the time complexity. Moreover, with the aid of self-attention mechanisms, the proposed STN method can process the whole time series PM2.5 data at once.

The main contributions of this paper are summarized in three aspects: (1) a new lightweight deep learning model based on sparse attention-based Transformer networks (STN) is designed to learn long-term dependencies and complex relationships from time series PM2.5 data for deep air quality forecasting. The proposed STN method adopts a multi-head sparse attention mechanism in the encoder and decoder to learn long-term temporal dynamics from time series PM2.5 data while simultaneously reducing time complexity; (2) to the best of our knowledge, this is the first attempt to exploit deep sparse attention-based Transformer networks for air quality forecasting. The proposed STN method can process the entire time series PM2.5 data at the same time owing to the used self-attention mechanism; unlike previous sequence-aligned methods, it does not need to process time series PM2.5 data in sequential order; (3) this paper presents a comparative analysis of the traditional ARIMA, SVR, RF, and XGBoost, recently developed deep learning models such as CNN, LSTM, and the original Transformer, and our STN method. Extensive experiments on two real-world datasets in China, i.e., the Beijing PM2.5 dataset and the Taizhou PM2.5 dataset, show that our method not only has relatively small time complexity, but also outperforms state-of-the-art methods, demonstrating its effectiveness on both short-term and long-term air quality prediction tasks.

Materials and methods

To evaluate the performance of the proposed method on air quality forecasting tasks, we employ two real-world air quality PM2.5 datasets. One is the Beijing PM2.5 dataset (Liang et al. 2015), available at https://www.kaggle.com/djhavera/beijing-pm25-data-data-set. The other is the Taizhou PM2.5 dataset, which was collected by our team in Taizhou city.

Study area

In this work, we choose two typical cities, i.e., Beijing and Taizhou, for studying air quality prediction, as depicted in Fig. 1. Beijing is the capital of China, located at 116°66ʹ east longitude and 40°13ʹ north latitude. Taizhou is located in the southeast of Zhejiang Province, at 121°42ʹ east longitude and 28°65ʹ north latitude. Figure 1 shows the distribution of all of China's air quality monitoring stations and the rank of the PM2.5 value at each station on November 1, 2019. Here, the PM2.5 rank in Fig. 1 is determined by the Ambient Air Quality Standard (GB 3095-2012) of China (Zhang et al. 2020b).

Fig. 1

Distribution of China's air quality monitoring stations (the color of each station denotes the rank of its daily average PM2.5 on November 1, 2019, as depicted in the bottom right of the figure; for interpretation of the colors in this figure legend, readers are referred to https://www.aqistudy.cn/)

Data description

The Beijing PM2.5 dataset (Liang et al. 2015) is an hourly air quality dataset consisting of PM2.5 data (http://www.mee.gov.cn/) from the US Embassy in Beijing and meteorological data (http://tianqi.2345.com/) from Beijing Capital International Airport. It includes eight feature items: PM2.5 concentration (µg/m3), dew point, temperature, pressure, combined wind direction, cumulated wind speed (m/s), cumulated hours of snow, and cumulated hours of rain. The data are recorded at an hourly interval from 01/01/2010 to 12/31/2014, yielding around 43,800 records in total. For year-independent experiments, the first four years of data are used for training, whereas the last year (01/01/2014–12/31/2014) is selected as the testing set. For model validation, we randomly split off 10% of the training set as the validation set. In this way, the training and testing sets come from different years, making these year-independent air quality forecasting experiments more practical. Note that such year-independent experiments are more difficult than common year-dependent experiments in which the training and testing sets are derived from the same year.

The hourly Taizhou PM2.5 dataset is collected from the single Hongjia monitoring station, located in the Jiaojiang urban district of Taizhou city in Zhejiang Province. This dataset also contains eight feature items: PM2.5 concentration (µg/m3), dew point, temperature, pressure, combined wind direction, cumulated wind speed (m/s), cumulated hours of rain, and cumulated hours of relative humidity. It consists of around 26,000 hourly records from 01/01/2017 to 12/31/2019. In our experiments, the first two years of data are used as the training set, and the last year (01/01/2019–12/31/2019) is adopted as the testing set. A randomly selected 10% of the training set is employed as the validation set.

Methods

Figure 2 shows the methodology for modeling air quality PM2.5 forecasting based on shallow learning and deep learning methods. The pipeline starts with data collection and processing: historical PM2.5 concentration and meteorological data are collected from monitoring stations and then cleaned by eliminating outliers and filling missing values via linear interpolation. All air quality time series data are normalized before being fed into the models. In the temporal modeling stage, various models, including shallow learning models such as ARIMA, SVR, RF, and XGBoost, as well as deep learning models such as CNN, LSTM, Transformer, and our designed STN, are employed to model the temporal dynamics of time series PM2.5 data for air quality forecasting. All models are trained and evaluated on the collected training and testing sets. Finally, we compare and analyze the results using typical evaluation metrics: root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R2).
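For concreteness, the following minimal Python sketch illustrates this preprocessing stage; the column names, helper name, and min-max normalization scheme are our assumptions rather than the exact pipeline used in this work:

```python
import numpy as np
import pandas as pd

def prepare_series(df: pd.DataFrame, feature_cols, target_col="pm25", window=24):
    """Hypothetical preprocessing: interpolate gaps, min-max normalize to
    [0, 1], and slice sliding windows of historical observations."""
    # Fill missing values by linear interpolation along the time axis
    df = df.interpolate(method="linear").bfill().ffill()
    # Min-max normalize every feature column
    values = df[feature_cols].to_numpy(dtype=np.float32)
    v_min, v_max = values.min(axis=0), values.max(axis=0)
    values = (values - v_min) / (v_max - v_min + 1e-8)
    target = df[target_col].to_numpy(dtype=np.float32)
    # Pair each (window, n_features) slice with the next-hour target
    X = np.stack([values[i:i + window] for i in range(len(values) - window)])
    y = target[window:]
    return X, y
```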

Fig. 2

Methodology structure of modeling air quality PM2.5 forecasting based on shallow learning and deep learning methods

Similar to the conventional Transformer (Vaswani et al. 2017), our designed sparse attention-based Transformer networks (STN) consist of encoder and decoder layers built on self-attention mechanisms, as shown in Fig. 3. In order to learn long-term dependencies and complex relationships from time series PM2.5 data, this framework integrates two different self-attention mechanisms: a multi-head sparse attention mechanism used in the encoder and decoder, in which a sparse attention block is designed to select important queries and thereby reduce time complexity, and a standard multi-head attention mechanism (Vaswani et al. 2017) in the decoder. In the following, we elaborate on the details of the designed STN model.

Fig. 3

Framework of proposed STN model for air quality (PM2.5 concentration) forecasting

Problem description

Given input time series data \({\mathbf{x}} = \{x_{1}, x_{2}, \ldots, x_{L_{x}}\}\) (\(x_{i} \in {\mathbb{R}}^{d_{x}}\)) with length \(L_{x}\) (historical meteorological data and PM2.5 concentration data) and input dimension \(d_{x}\), the proposed method aims to predict the corresponding time series data \({\mathbf{y}} = \{y_{1}, y_{2}, \ldots, y_{L_{y}}\}\) (\(y_{i} \in {\mathbb{R}}^{d_{y}}\)) with length \(L_{y}\) and output dimension \(d_{y}\). The encoder maps the input time series data \({\mathbf{x}}\) into a hidden continuous representation \({\mathbf{z}} = \{z_{1}, z_{2}, \ldots, z_{L_{z}}\}\). Then, the decoder generates the output \({\mathbf{y}}\) from the given \({\mathbf{z}}\). This inference is realized by a step-by-step operation in which the decoder computes a new hidden representation \({\mathbf{z}}_{k+1}\) from the previous \({\mathbf{z}}_{k}\) and the other outputs of the \(k\)-th step, and then forecasts the \((k+1)\)-th time series value \({\mathbf{y}}_{k+1}\).

Position embedding

Since the original Transformer model (Vaswani et al. 2017) has no recurrent structures or convolutions, it cannot by itself leverage the temporal order of time series data. It is thus necessary to inject the relative or absolute position information of the tokens in the time series. To this end, position embedding, constructed with nonlinear sine and cosine functions (Vaswani et al. 2017), is utilized to encode the temporal information. Position embedding is added at the bottoms of the encoder and decoder of the used Transformer model, as depicted in Fig. 3.
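A minimal PyTorch sketch of this sinusoidal position embedding, following Vaswani et al. (2017); an even model dimension is assumed:

```python
import math
import torch

def positional_encoding(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position embedding (Vaswani et al. 2017):
    PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    position = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                         * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)  # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)  # odd dimensions
    return pe  # (seq_len, d_model); added to the input embeddings
```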

Encoder

Given input time series data \({\mathbf{x}}\), consisting of normalized historical meteorological data and PM2.5 concentration data, position embedding encodes the temporal information of \({\mathbf{x}}\), generating a vector of length \(L_{x}\) as the input of the encoder. The designed encoder computes the interrelationship of PM2.5-related data at each time point in the sequence using a sparse self-attention mechanism, in an effort to capture the relevance and importance of PM2.5-related data at different times. For this self-attention encoder, the attention weights are calculated using the scaled dot-product attention over the tuple input (query, key, value).

Different from the original Transformer model (Vaswani et al. 2017) with a single branch, the designed encoder contains two parallel pipelines: (1) one sparse attention block, and (2) two sparse attention blocks cascaded with a 1D convolution (kernel width 3) and max-pooling (stride 2). Each sparse attention block consists of a multi-head sparse attention layer and a fully connected feed-forward network, each followed by layer normalization. A residual connection (He et al. 2016) is used around each of the two sub-layers. The 1D convolution and max-pooling operations perform a self-attention distilling operation that extracts the dominant attention and decreases the network size; a sketch of this step is given below. In addition, the first branch, with one sparse attention block, receives inputs halved in length (\(L_{x}/2\)), reducing the number of self-attention distilling layers and improving robustness. A concatenation layer merges the learned feature maps of the two branches into the encoder output \({\mathbf{z}}\).
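The distilling step between attention blocks could look like the following sketch; the ELU activation and padding choices are our assumptions, not specified by this work:

```python
import torch.nn as nn

class DistillingLayer(nn.Module):
    """Self-attention distilling between encoder blocks: a 1D convolution
    (kernel width 3) followed by max-pooling with stride 2, halving the
    temporal length while keeping the dominant features."""
    def __init__(self, d_model: int):
        super().__init__()
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=3, padding=1)
        self.act = nn.ELU()  # activation choice is an assumption
        self.pool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):                  # x: (batch, seq_len, d_model)
        x = x.transpose(1, 2)              # Conv1d expects (batch, channels, length)
        x = self.pool(self.act(self.conv(x)))
        return x.transpose(1, 2)           # (batch, seq_len // 2, d_model)
```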

Decoder

The decoder aims to learn the weighted attention composition of feature maps and, meanwhile, output predicted PM2.5 concentration data in a generative manner. It is composed of a masked sparse attention block, a multi-head attention layer, and a fully connected feed-forward network, each followed by layer normalization. Similar to the encoder, a residual connection (He et al. 2016) is employed around each of the three sub-layers. A linear mapping layer at the top of the decoder outputs the PM2.5 prediction results \({\mathbf{y}}\). The masked sparse attention is obtained during sparse attention computation by setting masked dot products to \(-\infty\), preventing each position from attending to subsequent positions. The decoder receives time series input data \({\mathbf{x}}_{de} = \{ {\mathbf{x}}_{token}, {\mathbf{x}}_{0} \}\), where \({\mathbf{x}}_{token}\) represents the start tokens and \({\mathbf{x}}_{0}\) denotes the placeholder for the target time series data.
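A sketch of the masking step using a standard upper-triangular mask (the exact masking code of this work is not shown; this is one common realization):

```python
import torch

def apply_decoder_mask(scores: torch.Tensor) -> torch.Tensor:
    """Set dot products between the query at step t and keys at steps > t
    to -inf, so the softmax assigns them zero attention weight."""
    L = scores.size(-1)
    future = torch.triu(
        torch.ones(L, L, dtype=torch.bool, device=scores.device), diagonal=1
    )
    return scores.masked_fill(future, float("-inf"))
```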

Self-attention mechanism and sparse analysis

Given an input time series data matrix \({\mathbf{X}} \in {\mathbb{R}}^{L \times d}\) with length \(L\) and dimension \(d\), the standard self-attention mechanism (Vaswani et al. 2017) computes the scaled dot-product over the tuple input (query, key, value) as

$${\text{Att}}({\mathbf{Q}},{\mathbf{K}},{\mathbf{V}}) = {\text{softmax}}\left( \frac{{\mathbf{QK}}^{\text{T}}}{\sqrt{d}} \right){\mathbf{V}},$$
(1)

where the query matrix \({\mathbf{Q}} \in {\mathbb{R}}^{L \times d}\), key matrix \({\mathbf{K}} \in {\mathbb{R}}^{L \times d}\), value matrix \({\mathbf{V}} \in {\mathbb{R}}^{L \times d}\) are separately defined as

$${\mathbf{Q}} = {\mathbf{XW}}_{q}, \quad {\mathbf{K}} = {\mathbf{XW}}_{k}, \quad {\mathbf{V}} = {\mathbf{XW}}_{v},$$
(2)

where \({\mathbf{W}}_{q} ,{\mathbf{W}}_{k} ,{\mathbf{W}}_{v}\) denote the projection matrices. Equation (1) can be reformulated as its vector form. In particular, given the \(i\)-th query \({\text{q}}_{i}\) from \({\mathbf{Q}}\), the attention score on the \(j\)-th key from \({\mathbf{K}}\) can be computed by

$$p({\text{k}}_{j} \mid {\text{q}}_{i}) = \frac{e^{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}/\sqrt{d}}}{\sum_{l=1}^{L} e^{{\text{q}}_{i}{\text{k}}_{l}^{\text{T}}/\sqrt{d}}}.$$
(3)

Then, the self-attention score of \({\text{q}}_{i}\) over \({\mathbf{K}}\) can be defined as

$${\text{Att}}\,({\text{q}}_{i} ,{\mathbf{K}},{\mathbf{V}}) = \sum_{j = 1}^{L} {p({\text{k}}_{j} \left| {{\text{q}}_{i} } \right.)} {\text{v}}_{j} .$$
(4)
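For reference, Eqs. (1)–(4) amount to the following PyTorch computation; the batched tensor shapes are our assumption:

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    """Standard attention softmax(Q K^T / sqrt(d)) V of Eqs. (1)-(4).
    Q, K, V: (batch, L, d); time and memory grow as O(L^2) in L."""
    d = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d ** 0.5  # all L x L dot products
    weights = F.softmax(scores, dim=-1)          # row-wise p(k_j | q_i), Eq. (3)
    return weights @ V                           # weighted sum of values, Eq. (4)
```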

In this case, the time complexity of the standard self-attention mechanism (Vaswani et al. 2017) is \({\rm O}(L^{2})\). The query matrix exhibits potential sparsity: many redundant calculations are performed to obtain attention scores for all queries. We therefore need to select the important queries, i.e., those whose attention scores over all keys are far from the uniform distribution. To measure query importance, the Kullback–Leibler (K-L) divergence (Hershey and Olsen 2007) between the true distribution \(P\) of \(p({\text{k}}_{j} \mid {\text{q}}_{i})\) and the uniform distribution \(U\) is used, as described below.

$$KL(P \parallel U) = \ln \sum_{j=1}^{L} e^{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}/\sqrt{d}} - \frac{1}{L}\sum_{j=1}^{L} \frac{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}}{\sqrt{d}} - \ln L$$
(5)

After dropping the constant \(\ln L\), the sparsity measurement of \({\text{q}}_{i}\) can be expressed as

$$M_{\text{sparse}}({\text{q}}_{i}, {\mathbf{K}}) = \ln \sum_{j=1}^{L_{K}} e^{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}/\sqrt{d}} - \frac{1}{L_{K}}\sum_{j=1}^{L_{K}} \frac{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}}{\sqrt{d}}$$
(6)

A larger \(M_{\text{sparse}}\) corresponds to a more important query in the self-attention mechanism. However, computing Eq. (6) is still expensive, since it requires traversing all queries and calculating every dot-product pair. To further alleviate this computational cost, Eq. (6) can be approximated by random sampling:

$$\tilde{M}_{\text{sparse}}({\text{q}}_{i}, {\tilde{\mathbf{K}}}) = \max_{j}\left\{ \frac{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}}{\sqrt{d}} \right\} - \frac{1}{\tilde{L}}\sum_{j=1}^{\tilde{L}} \frac{{\text{q}}_{i}{\text{k}}_{j}^{\text{T}}}{\sqrt{d}}$$
(7)

where \({\tilde{\mathbf{K}}}\) denotes the randomly sampled key matrix and \(\tilde{L}\) denotes the number of sampled keys. After computing \(\tilde{M}_{\text{sparse}}\) for each query, only the top-\(u\) dominant queries are employed to calculate self-attention, and the other pairs are filled with zero. In this case, the time complexity is \({\rm O}(L\ln L)\) for a given sequence length \(L\), as sketched below.
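The sketch below illustrates this query-selection procedure under stated assumptions: equal query and key lengths, a hypothetical sampling constant c, and zero-filled rows for non-dominant queries as described above (some implementations instead fill them with the mean of the values):

```python
import math
import torch

def prob_sparse_attention(Q, K, V, c: int = 5):
    """Sparse attention sketch following Eqs. (5)-(7): score each query by
    the sampled measurement, keep only the top-u dominant queries for exact
    attention, and zero-fill the remaining rows. Q, K, V: (batch, L, d)."""
    B, L, d = Q.shape
    u = min(L, max(1, int(c * math.log(L))))               # number of dominant queries
    idx = torch.randint(0, L, (u,))                        # randomly sampled keys
    s = Q @ K[:, idx, :].transpose(-2, -1) / math.sqrt(d)  # (B, L, u) sampled scores
    M = s.max(dim=-1).values - s.mean(dim=-1)              # measurement of Eq. (7)
    top = M.topk(u, dim=-1).indices                        # dominant query indices
    batch = torch.arange(B).unsqueeze(-1)                  # (B, 1), broadcasts with top
    Q_top = Q[batch, top]                                  # (B, u, d) dominant queries
    attn = torch.softmax(Q_top @ K.transpose(-2, -1) / math.sqrt(d), dim=-1) @ V
    out = torch.zeros_like(V)                              # non-dominant rows zero-filled
    out[batch, top] = attn                                 # exact attention for top-u rows
    return out                                             # overall cost O(L ln L)
```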

Performance evaluation criteria

To evaluate the performance of different methods on air quality forecasting tasks, three typical evaluation metrics, i.e., root mean square error (RMSE), mean absolute error (MAE), and the coefficient of determination (R2), are utilized. These metrics are expressed below.

$${\text{RMSE}}\,(y,\hat{y}) = \sqrt {\frac{1}{n}\sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i} )^{2} } } ,$$
(8)
$${\text{MAE}}\,(y,\hat{y}) = \frac{1}{n}\sum_{i = 1}^{n} {\left| {y_{i} - \hat{y}_{i} } \right|} ,$$
(9)
$$R^{2} = 1 - \frac{{\sum_{i = 1}^{n} {(y_{i} - \hat{y}_{i} )^{2} } }}{{\sum_{i = 1}^{n} {(y_{i} - y_{i}^{mean} )^{2} } }},$$
(10)

where \(y_{i}\) represents the observed PM2.5 value of the \(i\)-th sample, \(\hat{y}_{i}\) denotes the predicted PM2.5 value of the \(i\)-th sample, \(y_{i}^{mean}\) is the mean of the observed PM2.5 values, and \(n\) is the total number of samples. The smaller the RMSE and MAE, the better the prediction performance; in that case, R2 is correspondingly larger.
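For completeness, Eqs. (8)–(10) translate directly into the following NumPy implementation:

```python
import numpy as np

def evaluate(y_true, y_pred):
    """Compute RMSE, MAE, and R^2 exactly as in Eqs. (8)-(10)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))  # Eq. (8)
    mae = np.mean(np.abs(y_true - y_pred))           # Eq. (9)
    r2 = 1.0 - (np.sum((y_true - y_pred) ** 2)
                / np.sum((y_true - y_true.mean()) ** 2))  # Eq. (10)
    return rmse, mae, r2
```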

Implementation details

All experiments are implemented on a PC server configured with an NVIDIA Quadro P6000 graphics card with 24 GB of memory. We adopt the open-source machine learning frameworks PyTorch (https://pytorch.org) and Sklearn (https://scikit-learn.org/) to build all machine learning methods for air quality forecasting; in addition, the open-source TensorFlow library (https://github.com/tensorflow/) is used to configure the deep learning and Transformer models. For these models, the Adam optimizer is employed, the initial learning rate is 1e−4, the batch size is 32, the maximum number of epochs is 200, and the mean squared error loss function is adopted. All air quality time series data are normalized to [0, 1]. The lookup size (window size), i.e., the number of historical observations fed to each model, is set to 24 for its best performance. We compare our STN method with other typical techniques, including traditional shallow learning models such as ARIMA, SVR, RF, and XGBoost, as well as the recently developed CNN, LSTM, and original Transformer methods. They are described briefly below.

ARIMA is a typical linear statistical model for forecasting time series data. SVR is a kernel model based on nonlinear statistical machine learning theory that can also be used for time series prediction; it is adopted with three different kernels (RBF, poly, and linear) under default parameter settings, i.e., a penalty coefficient of 1 and a polynomial degree of 3. RF is a simple ensemble learning technique based on decision tree predictors, and the number of trees in RF is set to 200. XGBoost is a tree-based boosting model that combines multiple weak tree models into a stronger one, and the number of trees in XGBoost is also set to 200. CNNs are a typical deep learning model for 2D image data processing; here, we use a 1D-CNN for air quality prediction since time series PM2.5 data are 1D. The used 1D-CNN contains 256 convolution kernels with a kernel width of 5 and a stride of 1, followed by a batch normalization layer, a max-pooling layer, a rectified linear unit layer, a dropout (0.3) layer, and a fully connected layer. LSTMs are a special kind of recurrent architecture that models long-range dependencies on time series data more accurately than simple RNNs. We adopt a BiLSTM for air quality forecasting, which comprises a forward LSTM and a backward LSTM; since air quality data change significantly over time and depend strongly on the states before and after, BiLSTM may be appropriate for predicting PM2.5 data. In this study, we use a two-layer BiLSTM, each layer with 256 hidden neurons, followed by a dropout (0.05) layer. For the original Transformer model (Vaswani et al. 2017) and the proposed STN method, we employ three encoder layers and two decoder layers for their promising performance. In the following section, we provide experimental results in two aspects: single-step forecasting for the next 1 h and multi-step forecasting for the next multiple hours.
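As a hedged sketch (not this work's released code), the shared training configuration above maps to the following PyTorch setup; the stand-in linear model and synthetic tensors are placeholders for the compared models and the real windowed data:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Shared training configuration described above: Adam optimizer,
# learning rate 1e-4, batch size 32, up to 200 epochs, MSE loss,
# and a lookup window of 24 hourly observations with 8 features.
window, n_features = 24, 8
model = nn.Sequential(nn.Flatten(), nn.Linear(window * n_features, 1))  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

X = torch.rand(1024, window, n_features)  # placeholder for normalized windows
y = torch.rand(1024, 1)                   # placeholder PM2.5 targets
loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

for epoch in range(200):                  # maximum of 200 epochs
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = criterion(model(xb), yb)
        loss.backward()
        optimizer.step()
```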

Results and discussion

Single-step forecasting results

Table 1 shows a comparative analysis of single-step PM2.5 forecasting results (RMSE, MAE, R2) for the next 1 h (h1) obtained by the different methods, including SVR (poly, RBF, and linear kernels), ARIMA, RF, XGBoost, CNN, LSTM, Transformer, and the proposed STN method, on two real-world datasets, i.e., the Beijing and Taizhou PM2.5 datasets. To evaluate computational efficiency, Table 1 also compares the execution time of all models, measured as each model's run-time on the testing data.

Table 1 Comparisons of different methods on single-step PM2.5 forecasting results for the next 1 h

From Table 1, we can make the following four observations.

1. Among all methods, our STN method obtains the smallest RMSE and MAE and the highest R2 on both real-world datasets. In particular, it achieves the largest R2 of 0.937 and reduces RMSE to 19.04 µg/m3 and MAE to 11.13 µg/m3 on the Beijing PM2.5 dataset. Likewise, it gives the largest R2 of 0.924 and reduces RMSE to 5.79 µg/m3 and MAE to 3.76 µg/m3 on the Taizhou PM2.5 dataset. This shows that, compared with SVR, ARIMA, RF, XGBoost, CNN, LSTM, and Transformer, our STN method has a more powerful ability to learn long-term dependencies and complex relationships from time series PM2.5 data. In particular, our STN method outperforms the original Transformer, because the multi-head sparse attention mechanism in STN has a stronger ability to model long-term temporal dynamics in time series PM2.5 data.

2. Most deep learning methods, such as LSTM, Transformer, and our STN method, are superior to traditional shallow learning methods like SVR, ARIMA, RF, and XGBoost on air quality prediction tasks, indicating the advantages of deep learning in this setting. Nevertheless, CNN does not perform better than SVR, ARIMA, RF, and XGBoost on single-step PM2.5 prediction tasks, showing that the CNN architecture, originally designed for image data, is not very effective at processing 1D time series PM2.5 data.

3. Among the shallow learning methods, tree-based methods such as RF and XGBoost outperform SVR and ARIMA, demonstrating the superiority of tree-based methods. In addition, RF performs slightly better than XGBoost in terms of RMSE, MAE, and R2.

4. As for computational efficiency, the execution times of the models rank in the order ARIMA, Transformer, STN, XGBoost, RF, LSTM, CNN, SVR-RBF, SVR-POLY, and SVR-LINEAR. Note that our STN method, as an improved version of the original Transformer, takes less execution time than the original Transformer: it saves 1.23 s and 1.54 s on the Beijing and Taizhou datasets, respectively. This is because the multi-head sparse attention mechanism in our STN method reduces the time complexity from \({\rm O}(L^{2})\) to \({\rm O}(L\ln L)\), demonstrating the advantage of STN over Transformer in computational complexity.

Multi-step forecasting results

Table 2 presents the multi-step quantitative results of different methods on PM2.5 forecasting tasks for the next 6 h on the two real-world datasets. In Table 2, the testing error of each model is the mean prediction error over the next 6 h (h1–h6), giving a comparative analysis of the RMSE, MAE, and R2 of SVR (poly, RBF, and linear kernels), RF, XGBoost, CNN, LSTM, Transformer, and our STN method.

Table 2 Comparisons of different methods on multi-step PM2.5 forecasting results for the next 6 h on two real-world datasets

As shown in Table 2, among all models our STN method still obtains the smallest RMSE and MAE and the highest R2 on the Beijing and Taizhou datasets, followed by Transformer, LSTM, CNN, RF, XGBoost, SVR-LINEAR, SVR-POLY, and SVR-RBF. In particular, our STN method yields the highest R2 of 0.782 on the Beijing PM2.5 dataset and 0.731 on the Taizhou PM2.5 dataset, and reduces MAE to 22.09 µg/m3 and 7.19 µg/m3, respectively. It is worth pointing out that CNN performs better than the traditional SVR-LINEAR and XGBoost on multi-step PM2.5 forecasting for the next 6 h (h1–h6), whereas it performs worse than them on single-step forecasting for the next 1 h (h1). This indicates that CNN's relative prediction performance improves as the forward prediction horizon increases from 1 to 6 h.

For longer prediction horizons, Tables 3, 4 and 5 present performance comparisons of different methods on multi-step PM2.5 forecasting for the next 12, 24, and 48 h on the two real-world datasets. Note that for horizons beyond 6 h, we split the horizon into several intervals, trained an independent model for each interval, and report the average prediction results per interval. For instance, the next 12 h (h1–h12) are divided into three groups: 1–3, 4–6, and 7–12 h, as shown in Tables 3 and 4. For the next 24 h (h1–h24), four groups (1–3, 4–6, 7–12, and 13–24 h) are adopted. For the next 48 h (h1–h48), four groups (1–6, 7–12, 13–24, and 25–48 h) are used.

Table 3 Comparisons of different methods on multi-step PM2.5 forecasting results for the next 12 h on Beijing and Taizhou PM2.5 datasets
Table 4 Comparisons of different methods on multi-step PM2.5 forecasting results for the next 24 h on Beijing and Taizhou PM2.5 datasets
Table 5 Comparisons of different methods on multi-step PM2.5 forecasting results for the next 48 h on Beijing and Taizhou PM2.5 datasets

From the results in Tables 3, 4 and 5, we can see that as the prediction horizon increases, the multi-step PM2.5 forecasting performance of all models gradually degrades. Nevertheless, compared with the other methods, our STN method still achieves the lowest prediction errors (RMSE, MAE) and the highest R2 across the different forward prediction horizons. In addition, for the next 48 h (h1–h48), CNN performs better than LSTM, RF, XGBoost, and SVR-LINEAR, showing that CNN's relative performance further improves on long-term air quality prediction.

To further exhibit the advantages of our STN method, we visualize the multi-step PM2.5 forecasting results of the four deep models for the next 48 h (h1–h48) on the two real-world datasets. Specifically, Fig. 4 compares the multi-step ground truth and predicted PM2.5 values for the next 48 h obtained by CNN, LSTM, Transformer, and our STN method during one month (10/01/2014–10/31/2014) on the Beijing PM2.5 dataset. Figure 5 presents the same comparison during one month (03/01/2019–03/31/2019) on the Taizhou PM2.5 dataset. The results in Figs. 4 and 5 indicate that our STN method tracks the PM2.5 values better than the other methods, especially around the valleys and peaks of the PM2.5 testing data. An illustration of the differences among the methods is labeled with a red circle in Figs. 4 and 5.

Fig. 4

Comparisons of multi-step ground truth and predicted PM2.5 values (µg/m3) for the next 48 h (h1–h48) obtained by CNN, LSTM, Transformer, and our STN method during one month (10/01/2014–10/31/2014) on the Beijing PM2.5 dataset (each observation point on the horizontal axis represents the timescale (hour) corresponding to the PM2.5 value on the vertical axis)

Fig. 5

Comparisons of multi-step ground truth and predicted hourly PM2.5 values (µg/m3) for the next 48 h (h1–h48) obtained by CNN, LSTM, Transformer, and our STN method during one month (03/01/2019–03/31/2019) on the Taizhou PM2.5 dataset (each observation point on the horizontal axis represents the timescale (hour) corresponding to the PM2.5 value on the vertical axis)

In summary, the results in Tables 1, 2, 3, 4 and 5 and Figs. 4 and 5 on the Beijing and Taizhou PM2.5 datasets indicate that our STN method not only has relatively small time complexity, but also outperforms the other methods, showing its advantages on both short-term and long-term air quality prediction tasks. More specifically, on single-step PM2.5 forecasting tasks our STN method achieves an R2 of 0.937, an RMSE of 19.04, and an MAE of 11.13 on the Beijing PM2.5 dataset, and an R2 of 0.924, an RMSE of 5.79, and an MAE of 3.76 on the Taizhou PM2.5 dataset. For long-term PM2.5 forecasting, our STN method still gives better performance than the other methods on multi-step forecasting for the next 6, 12, 24, and 48 h on both datasets. In addition, the performance of all methods decreases as the forward prediction horizon increases: the prediction results for the next 48 h are the worst, followed by those for the next 24, 12, 6, and 1 h. Besides, deep learning methods usually outperform shallow learning methods, especially on multi-step PM2.5 forecasting tasks.

Conclusion

In this paper, we present a new lightweight deep air quality forecasting method based on sparse attention-based Transformer networks (STN) for single-step and multi-step air quality PM2.5 prediction. Our STN method, which adopts a multi-head sparse attention mechanism in the encoder and decoder to reduce time complexity, is designed to learn long-term dependencies and complex relationships from time series PM2.5 data. It is capable of processing the entire time series PM2.5 data at once owing to the used self-attention mechanisms. We present a comparative analysis of the traditional ARIMA, SVR, RF, and XGBoost, the recently developed CNN, LSTM, and Transformer, and our STN method. Experimental results on the Beijing and Taizhou PM2.5 datasets demonstrate that our STN method not only has relatively small time complexity, but also achieves better performance than the other methods, i.e., the recently emerged deep models (the original Transformer, LSTM, CNN) and the traditional ARIMA, RF, XGBoost, SVR-LINEAR, SVR-POLY, and SVR-RBF, on both short-term and long-term air quality prediction tasks.

In the future, it will be interesting and challenging to take into account abrupt variations in air pollution time series data for air quality forecasting, because successfully forecasting sudden variations in air pollution in advance is very beneficial to environmental protection, government decision-making, people's daily health, etc. In addition, it is also meaningful to explore more advanced deep learning models for long-term air quality prediction under different forecasting conditions. Besides, this work evaluates the proposed method on measurement samples from two air monitoring sites in China; it is therefore also interesting to examine the generalizability of the proposed STN method over larger geographical regions. Moreover, although our STN method has lower time complexity than the original Transformer, its complexity is still larger than that of traditional shallow learning methods. How to further reduce the time complexity of our STN method is thus an important direction for future work.