1 Introduction

Time series forecasting is employed in diverse fields, including traffic [1], management [2], energy consumption [3, 4], weather forecasting [5], and disease propagation [6]. Long sequence time series forecasting (LSTF) focuses on predicting a series of length O from an input series of length I, where O is notably greater than I [7, 8]. Progress in deep learning techniques has considerably improved the precision of time series prediction methods. In particular, transformer-based models have demonstrated a remarkable capability to capture complex temporal relationships and dependencies among different positions within a sequence, outperforming traditional RNNs and CNNs [7, 9,10,11,12,13,14].

Fig. 1

Visualization of the attention map (heat map) on the ETTh1 dataset for Layer 1 of the Transformer, over the 512-dimensional variable dimension

Forecasting in the long-term setting poses significant challenges. Firstly, it is unreliable to directly identify temporal dependencies from long-term time series, since these dependencies can be masked by complex temporal patterns, i.e., situations where the trend, seasonal, and residual components of the series change randomly over time. Secondly, the traditional Transformer is computationally intensive for long-term prediction owing to the quadratic complexity of self-attention. Previous Transformer-based prediction methods attempt to address this issue with sparse versions of self-attention that aggregate point-wise representations; however, this sacrifices information utilization and creates a bottleneck for long-range forecasting. To infer complex temporal patterns, we capture local dependencies by applying 1D convolutions in the first branch and sliding convolutional kernels of size k along the time dimension in the second branch. Since CNNs excel at recognizing simple patterns in data [15], we connect these simple patterns in parallel to extract more complex, more predictable components from the time series and thereby enhance the model’s predictability. As depicted in Fig. 1, we visualize the variable dimensions of the first Transformer layer on the ETTh1 dataset. Both the horizontal and vertical coordinates index the 512-dimensional variables, and the brightness of each cell corresponds to its importance. The varying brightness across the heat map indicates the crucial role of the variable dimension D of the time series \(X \in {\mathbb {R}}^{I \times D}\) in accurate long-term forecasting. In contrast, Transformer-based models only explicitly exploit temporal dependence through the self-attention mechanism and ignore dependence along the variable dimension, which may limit their ability to exploit the available information.

Encouraged by Cao et al. [16], we use heat maps to visualize the attention in the first layer of the vanilla Transformer (Head 1 and Head 7) on the ETTh1 dataset. As depicted in Fig. 2, each row represents the sequence \(X_T = \{x_1,\ldots , x_T\}\) from left to right, and each column represents the same sequence from top to bottom. Each point on the heat map corresponds to the correlation value \(\alpha _{ij}\), \(i\in [0,T], j\in [0,T]\). We observe that the attention at different query positions (each row of the heat map) concentrates on almost the same columns, indicating that point-wise representation aggregation learns attention that is nearly identical across query positions (Q). The statistical results in Tables 1 and 2 show very small differences in attention across query positions, indicating that the model learns only query-independent dependencies and further supporting our observation.

Based on the motivation discussed above, our objective is to enhance the efficiency of Transformer-based long-term time series forecasting models. To this end, we propose a generic module called Double-Branch Attention that accounts for both temporal and variable-dimension dependencies and shares query-independent attention across all query positions. The proposed framework consists of two interconnected branches: one sequentially captures dependencies along the temporal and variable dimensions of the time series, while the other emphasizes temporal attention. With this design, Double-Branch Attention improves predictability and reduces the time complexity of the Transformer and its efficient variants to \(O(L\log L)\). Our model attains state-of-the-art performance on six real-world benchmarks and can be applied across a diverse range of Transformers to further improve their performance.

To sum up, we make the following contributions:

  • The conventional dot-product operation in the self-attention mechanism incurs significant time complexity. To address this issue, we simplify the self-attention module to learn only query-independent dependencies. This simplification reduces the per-layer computational complexity to \(O(L\log L)\) and effectively improves the overall running time.

  • Our approach addresses both temporal and variable dependencies by introducing a double-branch attention module. This module comprises two branches: one that captures both cross-temporal and cross-dimensional dependencies, and another that places greater emphasis on dependencies in the temporal dimension. By incorporating these features, our model achieves higher predictability.

  • We conduct extensive testing of our model on six real-world benchmarks, showcasing its superior effectiveness compared to previous state-of-the-art approaches. Integrated into Transformer-based models, it consistently enhances their predictive power.

Fig. 2

The visualization presents the attention (heatmap) of two heads (head 1 and head 7) in the Transformer at Layer 1 on the ETTh1 dataset. Notably, the attention from these two heads demonstrates consistent alignment along the same column for various query positions (\(Q_i\))

2 Related Work

2.1 Stationarization for Time Series Prediction

The stationarity of a time series plays a vital role in its predictability, and considerable effort has been devoted over the years to developing dependable and precise models for non-stationary time series. Traditional methods, such as hidden Markov models (HMMs) [17], dynamic Bayesian networks [18], Kalman filters [19], and other statistical models (e.g., ARIMA [20]), achieved significant advances in the field. More recently, recurrent neural networks (RNNs) [21, 22] have achieved better performance: they impose no assumptions on the temporal structure and can capture highly non-linear and intricate dependencies within time series. A wide range of deep learning architectures demonstrate excellent performance. DAIN [23] uses non-linear neural networks to adaptively stationarize time series based on the observed training distribution. The adaptive sparse Huber additive model [24] enables robust forecasting for non-Gaussian and (non-)stationary data. AdaRNN [25] uses an adaptive mechanism that automatically adjusts the weights and parameters of the model to accommodate changes in the data. Unlike previous methods, the Non-stationary Transformer [26] automatically learns and adapts to changing patterns and structures in the time series by introducing adaptive mechanisms.

2.2 Deep Models for Long Time Series Forecasting

Transformer-based models exhibit significant prowess in sequence modeling tasks [27,28,29]. However, self-attention has quadratic complexity, and subsequent research has focused on approaches to mitigate it. LogTrans [12] introduces convolutional self-attention and proposes the LogSparse Transformer, which selects time steps at exponentially increasing intervals, effectively reducing the complexity. Reformer [10] employs reversible residual layers, locality-sensitive hashing attention, and chunked computation to address the performance and memory limitations of traditional Transformers on long sequences, reducing the complexity from \(O(L^2)\) to \(O(L\log L)\). Informer [7] adopts KL-divergence to select the dominant queries and introduces ProbSparse self-attention, which achieves \(O(L\log L)\) complexity. Other Transformer-based models achieve excellent performance from other perspectives. Pyraformer [11] designs a Pyramid Attention Module (PAM) to capture temporal dependencies at multiple hierarchies. Autoformer [9] adds a decomposition block to extract the intrinsic complex temporal trends of the hidden states and replaces self-attention with an Auto-Correlation mechanism, which exploits the similarity between sub-series to better capture trends, achieving \(O(L\log L)\) complexity while preventing information loss. FEDformer [30] introduces a mixture-of-experts strategy to combine trend components obtained through moving-average kernels of different sizes. Preformer [31] proposes a multi-scale segment-correlation mechanism and employs segment-wise correlation-based attention for accurate forecasting.

3 Methodology

In time series forecasting, our objective is to predict the future values of a time series \(X_{I+1:I+\tau } \in {\mathbb {R}}^{\tau \times D}\) based on the historical data \(X_{1:I} \in {\mathbb {R}}^{I \times D}\). Here, \(\tau\) denotes the number of future time steps, I denotes the number of past time steps, and \(D>1\) denotes the number of feature dimensions. The long-term series prediction problem requires a longer output length \(\tau\) than previous work. To exploit the dependencies among all dimensions, we take the variable-dimension dependencies into account in Sect. 3.1, which improves the predictability of the model. In Sect. 3.2, we use a query-independent approach to simplify self-attention, and in Sect. 3.3, we construct a double-branch attention to exploit information from different dimensions for the final prediction.

3.1 Dimension Dependency

Previous Transformer-based LSTF models embed the data points at the same time step into a vector, denoted \(X_t \rightarrow h_t\), where \(X_t \in {\mathbb {R}}^D\) denotes all data points in the D dimensions at step t and \({h_t \in {\mathbb {R}}^{d_{model}}}\) is the corresponding vector. The input \(X_{1:T} \in {\mathbb {R}}^{T \times D}\) is then embedded into T vectors \({h_1, h_2, \dots , h_T}\). However, previous Transformer-based models mainly capture temporal dependencies and ignore variable-dimension dependencies, treating all dimensions of the input data as equally important, which limits their predictive power. We visualize a heat map over the 512 dimensions of the Transformer, as shown in Fig. 1, and find that different dimensions carry different levels of importance, with lighter colours indicating greater importance. Based on this finding, we investigate variable dependencies to improve information utilization by explicitly capturing the interdependencies between variables. Thus, we consider both temporal and variable dependencies, enabling the model to use global information to selectively emphasize useful features.
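As a concrete illustration of this embedding step, the following minimal PyTorch sketch (ours, not the authors' released code) projects each time step \(X_t \in {\mathbb {R}}^D\) to a \(d_{model}\)-dimensional vector; the class name and the sample sizes are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class ValueEmbedding(nn.Module):
    """Point-wise embedding X_t -> h_t described above.

    A minimal sketch: each time step (a D-dimensional vector) is mapped to a
    d_model-dimensional vector with a shared linear projection. The class name
    and default sizes are illustrative assumptions, not the paper's code.
    """
    def __init__(self, in_dim: int, d_model: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, T, D) -> h: (batch, T, d_model)
        return self.proj(x)

# Example with ETTh1-like shapes (7 variables, 96 input steps):
h = ValueEmbedding(in_dim=7)(torch.randn(32, 96, 7))   # h.shape == (32, 96, 512)
```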

3.2 Query-Independent Attention

3.2.1 Revisiting Self-attention

The canonical self-attention [32] performs the scaled dot-product as \(A(Q,K,V)=Softmax \left( \frac{QK^T}{\sqrt{d}}\right) V\), where \(Q\in {\mathbb {R}}^{L_Q \times d}\), \(K\in {\mathbb {R}}^{L_K \times d}\), and \(V\in {\mathbb {R}}^{L_V \times d}\), with \(L_Q\), \(L_K\), and \(L_V\) representing the sequence lengths of Q, K, and V, respectively, and d representing the input dimension. The \(Softmax(\cdot )\) operation is conducted row by row. Typically, the self-attention mechanism employs query-specific attention weights to indicate the significance of positions relative to the query position. In order to delve deeper into the self-attention, we use \(q_i\), \(k_i\), and \(v_i\) to represent the i-th row in Q, K, and V, respectively. The attention of the i-th query is then defined as

$$\begin{aligned} Attn(q_i,K,V) = \sum _{0\le j\le L} \frac{k(q_i,k_j)}{\sum _{0\le l\le L}k(q_i,k_l)}v_j. \end{aligned}$$
(1)

The function \(k(q_i,k_j)\) is chosen as the asymmetric exponential kernel \(\exp (q_i k_j^T/\sqrt{d})\). Nevertheless, the quadratic complexity of this formulation poses a notable drawback for predictive efficiency.
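For reference, the canonical attention of Eq. 1 can be written in a few lines of PyTorch. This is a generic sketch of scaled dot-product attention, not the authors' implementation, and it makes the \(L_Q \times L_K\) score matrix responsible for the quadratic cost explicit.

```python
import torch

def canonical_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V, as in Eq. 1."""
    d = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d ** 0.5   # (L_Q, L_K) score matrix: quadratic in length
    weights = torch.softmax(scores, dim=-1)       # row-wise softmax over key positions
    return weights @ V                            # (L_Q, d)

out = canonical_attention(torch.randn(96, 64), torch.randn(96, 64), torch.randn(96, 64))
```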

To delve deeper into the attention learned by the Transformer, we examine the attention distribution on the ETTh1 dataset. We select the attention scores of Head 1 and Head 7 at Layer 1 and visualize them as heat maps, as shown in Fig. 2. For the different query positions \(q_i\) (one per row), the global context scores concentrate in almost the same few columns. This indicates that effectively a single set of attention weights serves all queries and that the model performs redundant Q–K computation; in other words, the model learns only query-independent dependencies.

Analysis of variance (ANOVA) is a statistical method for examining the variation of a response variable, typically a continuous random variable, across conditions defined by discrete factors, often represented by classification variables with nominal levels. ANOVA assesses the influence of these factors on the response variable and determines whether there are statistically significant differences among the groups being compared [33]. It is commonly used to test for significant differences between two or more samples by comparing the between-group variance with the within-group variance (random error), and it can also be used to infer whether an independent variable has a statistically significant effect on the dependent variable.

Therefore, we use ANOVA to further validate the above findings. In this study, the different query positions (\(q_i\)) in self-attention represent the different sample groups, with 96 samples in each group. The null hypothesis (\(H_0\)) asserts that there is no significant difference between the group means, i.e., that the 96 groups are equivalent. The alternative hypothesis (\(H_1\)) asserts that the group means are not all equal, i.e., that there is a significant difference among the 96 groups. Tables 1 and 2 present the ANOVA results for Head 1 and Head 7 at Layer 1 of the vanilla Transformer.
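As a hedged illustration of this test, the sketch below runs a one-way ANOVA over the rows of an attention map with scipy.stats.f_oneway. The attention matrix here is randomly generated as a stand-in for the Head 1 / Head 7 scores, so its output will not reproduce Tables 1 and 2.

```python
import numpy as np
from scipy.stats import f_oneway

# Stand-in for the (96, 96) attention scores of one head at Layer 1; random values
# are used purely for illustration and will not reproduce Tables 1 and 2.
rng = np.random.default_rng(0)
attn = rng.normal(size=(96, 96))

# One-way ANOVA with each query position q_i (each row) as one group of 96 samples.
# H0: the group means are equal, i.e. the learned attention does not depend on q_i.
f_stat, p_value = f_oneway(*attn)
print(f"F = {f_stat:.4f}, p = {p_value:.4f}")  # F < F-crit (large p) => fail to reject H0
```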

Table 1 The result of the analysis of variance for Head1 on the ETTh1 dataset
Table 2 The result of the analysis of variance for Head7 on the ETTh1 dataset

As shown in Tables 1 and 2, SS (sum of squares) is the sum of squared deviations from the mean and quantifies the total variation in the data; DF (degrees of freedom) is the number of values that are free to vary in the calculation; MS is the mean square, equal to the corresponding SS divided by its DF; and F is the F statistic. The p-value is the probability of observing the obtained difference between the groups (or a more extreme one) under the assumption that the null hypothesis is true; a smaller p-value provides stronger evidence against the null hypothesis. In our results, the value of F-crit is greater than the value of F, so we fail to reject the null hypothesis, i.e., the sample groups are not significantly different. In other words, although self-attention is intended to compute attention for each query position, the trained attention is effectively independent of the query position. Computing attention for each query position is therefore redundant, and we can simplify the self-attention module.
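For reference, the quantities reported in Tables 1 and 2 are related by the standard one-way ANOVA identities (not specific to this paper):

$$\begin{aligned} MS_{between} = \frac{SS_{between}}{DF_{between}}, \quad MS_{within} = \frac{SS_{within}}{DF_{within}}, \quad F = \frac{MS_{between}}{MS_{within}}, \end{aligned}$$

and \(H_0\) is retained whenever \(F < F_{crit}\), or equivalently whenever the p-value exceeds the chosen significance level \(\alpha\).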

3.2.2 Simplifying the Self-attention Block

Observing that the attention scores at different query positions are highly similar, we simplify the self-attention mechanism by computing query-independent attention and sharing it across all query positions. This yields a simplified self-attention block that is more efficient and effective, defined as

$$\begin{aligned} z_i=\sum _{0\le j\le L} \sigma \left( \frac{k_j}{\sqrt{d_k}}\right) v_j=\sum _{0\le j\le L} \frac{e^{k_j/\sqrt{d_k}}}{\sum _{0\le l\le L}e^{k_l/\sqrt{d_k}}}v_j, \end{aligned}$$
(2)

where \(\sigma\) denotes the softmax function and L denotes the sequence length.

In contrast to the traditional self-attention block, the term in Eq. 2 does not depend on the query position i; it is common to all query positions. Exploiting this observation to simplify the attention mechanism eliminates the matrix operations on Q and K, resulting in a substantial reduction in the model’s complexity. Instead, we obtain the global information directly as a weighted average of the features at all positions:

$$\begin{aligned} z_i=\sum _{0\le j\le L} \alpha _j v_j. \end{aligned}$$
(3)
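A minimal PyTorch sketch of this simplified block follows. It reflects our reading of Eqs. 2–3 rather than the released implementation: in particular, producing a scalar score per position with a linear layer (in the spirit of the global-context block of Cao et al. [16]) and broadcasting the shared context to all query positions are assumptions on our part.

```python
import torch
import torch.nn as nn

class QueryIndependentAttention(nn.Module):
    """Simplified self-attention (Eqs. 2-3): one set of weights shared by all queries.

    Sketch under our reading of the paper: a scalar score per position is computed
    from the keys, turned into weights alpha_j by a softmax over positions, and the
    resulting weighted average of the values is broadcast to every query position,
    so the quadratic QK^T product is never formed.
    """
    def __init__(self, d_model: int):
        super().__init__()
        self.key = nn.Linear(d_model, 1)        # scalar score per position (assumption)
        self.value = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, L, d_model)
        scores = self.key(x) / x.shape[-1] ** 0.5              # (batch, L, 1)
        alpha = torch.softmax(scores, dim=1)                   # weights alpha_j over positions
        z = (alpha * self.value(x)).sum(dim=1, keepdim=True)   # shared context, (batch, 1, d_model)
        return z.expand_as(x)                                  # same output at every query position
```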

3.3 Double-Branch Attention Module

As mentioned earlier, capturing long-term dependencies and reducing time complexity, so as to improve information utilization and computational efficiency, are challenging aspects of long-term series prediction. Previous studies in the self-attention family achieve better prediction performance only by modelling dependencies in the temporal dimension, failing to effectively capture and utilize information from the variable dimension. To overcome this limitation, we introduce a Double-Branch attention module that incorporates convolutional neural networks (CNNs) [34] to preserve information from the dimensions of the latent features. This module serves as a simplified self-attention block that can be seamlessly integrated into existing architectures as a modular component. By leveraging the Double-Branch attention module, we better capture long-term dependencies and reduce time complexity, thereby improving the efficiency and accuracy of long-term series prediction.

Overall architecture Figure 3 showcases the proposed Double-Branch attention module, designed to integrate into Transformer-based models. We keep the encoder-decoder architecture and replace all self-attention in the encoder and the input self-attention in the decoder with double-branch attention, but we do not replace the cross-attention that connects the encoder and decoder: since Q and K in cross-attention come from the decoder input and the encoder output respectively, they are not equal, and the query-independent argument no longer applies. To prevent information leakage, we mask future information in the input stage of the decoder. Furthermore, the simple architecture of the DBA block enables seamless integration into state-of-the-art architectures by replacing their attention modules, thereby improving performance and speeding up computation. More results are presented in the experimental section.

Fig. 3

The outline of the proposed model. We substitute double-branch attention for the self-attention mechanisms within the Transformer-based model: specifically, we replace the encoder self-attention and the decoder input self-attention with double-branch attention, while retaining the cross-attention that connects the encoder and decoder. To prevent information leakage, we implement a masking mechanism in the input stage of the decoder that restricts access to future information

As the name Double-Branch Attention suggests, this module leverages a two-branch attention mechanism, with the two branches operating in parallel. One branch sequentially captures dependencies in the temporal and variable dimensions, while the other focuses on dependencies in the temporal dimension. Lastly, the outputs of the two branches are merged by simple averaging. In summary, the attention process can be described concisely as follows:

$$\begin{aligned} x^{\prime } &= M_v(x)\otimes x \otimes M_t(M_v(x)\otimes x), \end{aligned}$$
(4)
$$\begin{aligned} x^{\prime \prime } &= M_{ct}(x)\otimes x, \end{aligned}$$
(5)
$$\begin{aligned} y & = \frac{1}{2}(x^{\prime } + x^{\prime \prime }), \end{aligned}$$
(6)

where \(M_v\), \(M_t\), and \(M_{ct}\) denote the variable, temporal, and cross-temporal attention weights, respectively, and \(\otimes\) denotes element-wise multiplication. In this process, the attention values are appropriately broadcast. The output is denoted y.

Variables (Temporal) Attention Module By leveraging the relationships among the variable dimensions, we generate attention weights for the variables, which enables us to focus on what is truly meaningful across all variables. To compute the information interaction over the variable dimension only, and to do so efficiently, we first permute the input tensor and then extract the global context along the L dimension by max-pooling [35], which also helps reduce time complexity. The attention weights for the variables are obtained by passing the tensor through a 1D convolutional layer and applying a softmax activation function. The resulting attention weights are then permuted to restore the original shape and element-wise multiplied with the input tensor. A similar procedure is followed for temporal attention. Next, we combine the dimension attention and the temporal attention by cascading them, which yields the output of the first branch; our experiments indicate that the order of this combination has a negligible impact on the results. To summarize, the variable (temporal) attention weights are calculated as follows:

$$\begin{aligned} M_v & = \sigma (permute(C^1 (permute(x)))), \end{aligned}$$
(7)
$$\begin{aligned} M_t & = \sigma (C^1 (x)), \end{aligned}$$
(8)

where x is the input tensor, \(C^1\) denotes a 1D convolution with a kernel size of 1, and \(\sigma\) denotes the softmax function.
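A minimal sketch of this first branch is given below, under our reading of Eqs. 4, 7, and 8. The placement of the max-pooling, the pooled length, and the broadcast shapes (B, 1, F) and (B, L, 1) are assumptions made for illustration, since the equations specify only the permutation, the kernel-size-1 convolution, and the softmax.

```python
import torch
import torch.nn as nn

class VariableTemporalAttention(nn.Module):
    """First branch: variable attention M_v followed by temporal attention M_t.

    Sketch under our reading of Eqs. 4, 7, 8; pooling placement and broadcast
    shapes are our assumptions, not the authors' released code.
    """
    def __init__(self, seq_len: int, n_vars: int):
        super().__init__()
        self.conv_var = nn.Conv1d(n_vars, n_vars, kernel_size=1)     # C^1 along the variable axis
        self.conv_time = nn.Conv1d(seq_len, seq_len, kernel_size=1)  # C^1 along the time axis

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, F)
        # Variable attention (Eq. 7): permute -> (B, F, L), max-pool over L, conv, permute, softmax.
        m_v = x.permute(0, 2, 1).amax(dim=-1, keepdim=True)                # (B, F, 1)
        m_v = torch.softmax(self.conv_var(m_v).permute(0, 2, 1), dim=-1)   # (B, 1, F)
        x_v = x * m_v                                                      # variable-weighted input
        # Temporal attention (Eq. 8): max-pool over F, conv along time, softmax.
        m_t = torch.softmax(self.conv_time(x_v.amax(dim=-1, keepdim=True)), dim=1)  # (B, L, 1)
        return x_v * m_t                                                   # Eq. 4, with broadcasting
```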

The Cross-temporal Attention Module We produce temporal attention by leveraging the cross-dimension interaction, which allows us to focus on the more important temporal sections. To achieve this, we unsqueeze the second dimension of the input tensor to shape it as (B, 1, L, F), where B is the batch size, L is the sequence length, and F is the feature dimension. The reshaped tensor is fed into a standard 2D convolution layer with a kernel size of k. The output is then squeezed and passed through a softmax activation layer (\(\sigma\)), generating attention weights of shape (B, L, F), which are applied to the input tensor x. The tensors obtained from the two branches are subsequently aggregated by simple averaging.

$$\begin{aligned} M_{ct}= \sigma (squeeze(C^{k\times k}(unsqueeze(x)))). \end{aligned}$$
(9)

Summarizing, the computation that obtains the attention-weighted tensor y from an input tensor \(x \in {\mathbb {R}} ^{B\times L\times F}\) can be expressed as:

$$\begin{aligned} y=\frac{1}{2}(x \otimes M_v \otimes M_t +x \otimes M_{ct}). \end{aligned}$$
(10)
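Putting the two branches together, a minimal sketch of the full double-branch attention of Eq. 10 could look as follows. It reuses the VariableTemporalAttention sketch above, and the 'same' padding of the 2D convolution and the axis of the final softmax are assumptions on our part.

```python
import torch
import torch.nn as nn

class DoubleBranchAttention(nn.Module):
    """Double-branch attention (Eqs. 4-6 and 9-10): average of two attended tensors.

    Assembled from the formulas above, not the released implementation. Branch 1 is
    the VariableTemporalAttention sketch from the previous subsection; branch 2 applies
    cross-temporal attention via a k x k 2D convolution ('same' padding is our choice).
    """
    def __init__(self, seq_len: int, n_vars: int, kernel_size: int = 3):
        super().__init__()
        self.branch1 = VariableTemporalAttention(seq_len, n_vars)
        self.conv_ct = nn.Conv2d(1, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, L, F)
        x1 = self.branch1(x)                                 # first branch, Eq. 4
        # Cross-temporal branch (Eq. 9): unsqueeze -> (B, 1, L, F), 2D conv, squeeze, softmax.
        m_ct = self.conv_ct(x.unsqueeze(1)).squeeze(1)       # (B, L, F)
        m_ct = torch.softmax(m_ct, dim=1)                    # softmax over the time axis (assumption)
        x2 = x * m_ct                                        # Eq. 5
        return 0.5 * (x1 + x2)                               # Eq. 6 / Eq. 10

# Example with hypothetical sizes: y = DoubleBranchAttention(96, 512, 3)(torch.randn(32, 96, 512))
```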

4 Experiments

The performance of the Double-Branch Attention mechanism is rigorously evaluated through extensive prediction experiments conducted on six real-world benchmark datasets.

Datasets Below is an overview of the six datasets employed in our experimental analysis: (1) Exchange [34] collects daily exchange rates of different currencies from 1990 to 2016. (2) ILI collects the weekly ratio of patients with influenza-like illness reported by healthcare facilities and surveillance systems in the United States from 2002 to 2021. (3) ETTm2 [7] comprises load and oil temperature measurements from power transformers collected from 2016 to 2018, sampled at 15-minute intervals. (4) Electricity encompasses the hourly electricity consumption of 321 customers, spanning 2012 to 2014. (5) Traffic consists of road occupancy measurements, ranging between 0 and 1, recorded hourly by 862 freeway sensors from January 2015 to December 2016. (6) Weather comprises a meteorological time series with 21 indicators collected at one weather station in 2020, recorded every 10 minutes. We adhere to the standard procedure of sequentially partitioning each dataset into training, validation, and testing subsets, with a 6:2:2 ratio for ETTm2 and a 7:1:2 ratio for the remaining datasets.

Table 3 Multivariate prediction results for various prediction lengths

Baselines We use a total of 8 baselines. For multivariate forecasting, we select six recently developed Transformer-based models that represent the state of the art: Crossformer [36], Non-stationary Transformer [13], Autoformer [9], Pyraformer [11], Informer [7], and LogTrans [34], plus one RNN-based model, LSTNet [34]. For univariate forecasting, we include six baselines: Non-stationary Transformer [13], Autoformer [9], Pyraformer [11], Informer [7], LogTrans [34], and ARIMA [37]. Furthermore, we apply our method to the Transformer [32], Informer [7], and Non-stationary Transformer [13] to verify the generality of our framework.

Implementation Details Our experiments are implemented in PyTorch [26] and run on a single NVIDIA GeForce RTX 3090 24 GB GPU. Each model is trained with the Adam [38] optimizer using an L2 loss, an initial learning rate of \(10^{-4}\), and a batch size of 32. During training, early stopping is applied within 3 epochs. The hyperparameter k of the Double-Branch Attention is chosen from \(\{3, 5, 7, 9\}\) to trade off performance and efficiency. All experiments are performed three times with different random seeds.
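For clarity, the settings above translate roughly into the configuration sketched below; the stand-in model and the placeholder validation loss are illustrative only, and the data pipeline is omitted.

```python
import torch
import torch.nn as nn

# Training setup described above; the model is a stand-in for the DBA-equipped forecaster.
model = nn.Linear(7, 7)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)   # Adam, initial learning rate 1e-4
criterion = nn.MSELoss()                                    # L2 loss
batch_size, patience = 32, 3                                # early stopping within 3 epochs

best_val, bad_epochs = float("inf"), 0
for epoch in range(20):
    # ... train on the training split, then evaluate on the validation split ...
    val_loss = float(torch.rand(1))                         # placeholder for the validation MSE
    if val_loss < best_val:
        best_val, bad_epochs = val_loss, 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break
```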

4.1 Results and Analysis

Multivariate results For multivariate prediction, our framework-equipped Non-stationary Transformer achieves state-of-the-art performance compared with the majority of baseline models (Table 3). For the prediction length of 336, we achieve a 33% \((0.572\rightarrow 0.384)\) MSE reduction on Exchange, 25% \((0.416\rightarrow 0.310)\) on ETTm2, 4% \((0.202 \rightarrow 0.194)\) on Electricity, 27% \((0.639\rightarrow 0.467)\) on Traffic, and 10% \((0.357\rightarrow 0.321)\) on Weather. On the ILI dataset, in the input-36-predict-60 setting, we achieve a 22% \((2.653\rightarrow 2.062)\) MSE reduction. Furthermore, the performance of our model exhibits a consistent trend as the forecast length increases, indicating robustness over longer-term predictions, which is crucial for practical applications in real-world scenarios.

Univariate results For univariate prediction, we present the results on two representative datasets in Table 4. Our model consistently remains at the forefront, surpassing a wide range of baselines. Specifically, in the input-96-predict-336 setting, our model achieves a 7% \((0.486\rightarrow 0.452)\) MSE reduction on Exchange, which has no apparent periodicity, and a 29% \((0.180\rightarrow 0.128)\) MSE reduction on ETTm2, which has an observable periodicity. Notably, ARIMA is suited to the case where the characteristics of the stochastic process vary over time rather than being fixed, and where the non-stationarity of the series is random rather than deterministic. Since the Exchange dataset is not periodic, it becomes stable after differencing, making ARIMA more effective than some deep learning models for short-term predictions. However, once models take the non-stationarity of the dataset into account, ARIMA is no longer as effective as the non-stationarity-aware models. This demonstrates that our model performs well on series with or without periodicity.

Table 4 Univariate prediction results for various prediction lengths on two typical datasets

4.2 Ablation Study

Qualitative performance To qualitatively assess the validity of our model, we compare its predictions with those of two alternative models, Autoformer and the Non-stationary Transformer. The results are plotted for the last dimension of the ETTm2 predictions in Figs. 4, 5, 6, and 7. Notably, our model demonstrates superior prediction accuracy compared to the other models.

Fig. 4

Predictions for different models on ETTm2 with the input-96-prediction-96 setting. The grey lines represent the ground truth and the orange lines represent the predictions. The shared part of length 96 is the input

Fig. 5

Predictions for different models on ETTm2 with the input-96-prediction-192 setting

Fig. 6

Predictions for different models on ETTm2 with the input-96-prediction-336 setting

Fig. 7

Predictions for different models on ETTm2 with the input-96-prediction-720 setting

Quantitative performance To investigate the contribution of each sub-module in our framework, we compare the quantitative performance of each module. Table 5 displays the results obtained by adding the three modules separately. In traffic flow forecasting, the current and future traffic states are influenced by the historical traffic states, and a positive correlation exists between the trends of the traffic flow series and its history. Relying solely on variable attention or temporal attention therefore yields poor performance, whereas the two-branch attention mechanism highlights the significance of temporal dependence, which aligns with the characteristics of the traffic dataset. By combining variable attention and temporal attention, the model effectively leverages the impact of historical traffic states on present and future states, capturing complex temporal dependencies and improving predictive power. Overall, the prediction results of each module are comparable to those of the Non-stationary Transformer, indicating that the query-independent version in our framework can effectively replace the QKV computation. Furthermore, our proposed Double-Branch Attention outperforms the baselines on all six benchmarks, especially on highly non-stationary datasets such as Exchange (\(0.482 \rightarrow 0.423\)) and ETTm2 (\(0.558 \rightarrow 0.403\)). This comparison highlights the limited predictive power of the Transformer family when it attends only to the temporal dimension; by also attending to the variable dimension, our approach effectively unleashes the model’s potential for series forecasting.

Table 5 The prediction results of Non-stationary Transformer stacked with different sub-modules are compared

Efficiency Analysis We compare the training times of Autoformer, Informer, and the Transformer (Fig. 8). Remarkably, our proposed DBA model exhibits \(O(L\log L)\) complexity and notably shorter running times than the alternative models, highlighting its computational efficiency.

Fig. 8

Efficiency analysis. For the running time, we run all models \(10^3\) times to obtain the execution time per step

Framework generality We apply our framework to two popular Transformer-based models and evaluate their predictive performance (Table 6). Overall, our approach consistently enhances the predictive power of these distinct models: we achieve an average performance increase of 4.98% on the Transformer and 3.99% on Informer. These improvements come without adding significant parameters or computation, while reducing the computational complexity of the models. This confirms the effectiveness and efficiency of our DBA framework, which can be broadly employed in Transformer-based models to enhance their predictive capabilities and achieve exceptional performance.

Table 6 The performance gains obtained by applying the proposed framework to transformer and informer

5 Conclusion

We have put forward a solution to the challenge of long-term time series forecasting. Unlike previous studies that rely on temporally sparse versions of self-attention, we propose a more efficient method that captures dependencies in both the temporal and variable dimensions, improving information utilization. Additionally, we use query-independent attention to simplify self-attention, which reduces the prediction time of the model and achieves \(O(L\log L)\) complexity. Our approach demonstrates both strong generality and high accuracy on six real-world benchmarks. Furthermore, we provide detailed ablation experiments to evidence the validity of the proposed double-branch attention. Overall, our framework presents a versatile, lightweight solution with wide-ranging utility to boost the predictability of Transformer-based models and achieve state-of-the-art performance.