DBAFormer: A Double-Branch Attention Transformer for Long-Term Time Series Forecasting

Transformer-based approaches excel in long-term series forecasting. These models leverage stacked structures and self-attention mechanisms, enabling them to effectively model dependencies in series data. While some approaches rely on sparse attention to tackle the quadratic time complexity of self-attention, doing so can limit information utilization. We introduce a double-branch attention mechanism that simultaneously captures intricate dependencies from both temporal and variable perspectives. Moreover, we propose query-independent attention, motivated by the near-identical attention that self-attention allocates to different query positions. This enhances efficiency and reduces the impact of redundant information. We integrate the double-branch query-independent attention into popular transformer-based methods such as Informer, Autoformer, and Non-stationary Transformer. Experiments on six practical benchmarks consistently show that our attention mechanism substantially improves long-term series forecasting performance over the baseline approaches.


Introduction
Time series forecasting is employed in diverse fields, including traffic [1], management [2], energy consumption [3,4], weather forecasting [5], and disease propagation [6]. Long sequence time series forecasting (LSTF) focuses on predicting a series of length O using an input series of length I, where O is notably greater than I [7,8]. Progress in deep learning techniques has considerably improved the precision of time series prediction methods. Specifically, transformer-based models have shown a remarkable capability to capture complex temporal relationships and dependencies among different positions within a sequence, outperforming traditional RNNs and CNNs [7, 9-14].
Forecasting in the long-term setting poses significant challenges. Firstly, it is unreliable to directly identify temporal dependencies from long-term time series, since these dependencies can be masked by complex temporal patterns. Complex temporal patterns refer to situations where the trend, seasonal, and residual patterns of the time series change randomly over time. Secondly, the traditional transformer is computationally intensive in terms of long-term prediction owing to the quadratic complexity inherent in self-attention. Previous prediction methods based on Transformers attempt to address this issue by improving sparse versions of self-attention through point-by-point representation aggregation. However, this approach sacrifices information utilization and creates bottlenecks in long-range time series prediction. To infer complex temporal patterns, we capture local dependencies by applying 1D convolution on the first branch and K-dimensional convolutional kernels sliding along the time dimension on the second branch.
Since CNNs excel at recognizing simple patterns in data, we connect these simple patterns in parallel to extract more complex patterns and the more predictable components of the time series, which enhances the model's predictability [15]. As depicted in Fig. 1, we visualize the variable dimensions of the first layer of the transformer on the ETTh1 dataset. Both the horizontal and vertical coordinates represent the 512 variable dimensions. The brightness of the color corresponds to importance. The heat map shows that brightness varies considerably across variable dimensions, indicating the crucial role of the variable dimension D in the time series $X \in \mathbb{R}^{I \times D}$ for accurate long-term forecasting. In contrast, transformer-based models explicitly exploit only temporal dependence through the self-attention mechanism, ignoring dependencies across the variable dimension, which may limit their ability to exploit information.
Encouraged by Cao et al. [16], we use heatmaps to visualize the attention of head 1 and head 7 in the initial layer of the vanilla transformer on the ETTh1 dataset. As depicted in Fig. 2, every row corresponds to the sequence $X_T = \{x_1, \dots, x_T\}$ from left to right, and every column to the same sequence from top to bottom. Each point on the heat map is the correlation value between positions i and j, with i ∈ [0, T] and j ∈ [0, T]. We observe that the attention at different query positions (each row in the heatmap) concentrates in almost the same columns, indicating that aggregation with point-by-point representations learns attention that is almost the same across different query positions (Q). The statistical results in Tables 1 and 2 show very small differences in attention across query positions, indicating that the model only learns query-independent dependencies, which further supports our observation.
Based on this motivation, our objective is to enhance the efficiency of transformer-based models for long-term time series forecasting. To achieve this, we propose a generic module called Double-Branch Attention that takes into account both temporal and variable dimension dependencies and uses query-independent attention for all query positions. Our proposed framework consists of two interconnected branches: one captures dependencies of the temporal and variable dimensions sequentially, and the other focuses on dependencies in the temporal dimension.

Fig. 2 Attention heatmaps of two heads (head 1 and head 7) of the Transformer at Layer 1 on the ETTh1 dataset. Notably, the attention from these two heads aligns along the same columns for various query positions ($Q_i$).

To sum up, we make the following contributions:
• The conventional dot-product operation employed in the self-attention mechanism leads to significant time complexity. To deal with this issue, we simplify the self-attention module to learn only query-independent dependencies. This simplification reduces the computational complexity of each layer to O(LlogL) and effectively improves the overall running time.
• Our approach addresses both temporal and variable dependencies by introducing a double-branch attention module. This module comprises two branches: one that captures cross-temporal and cross-dimensional dependencies simultaneously, and another that places greater emphasis on temporal dimension dependencies. By incorporating these features, our model achieves higher predictability.
• We conduct extensive experiments on six real-world benchmarks, showing that our model is more effective than previous state-of-the-art approaches. By integrating it into transformer-based models, we consistently enhance the predictive power of these models.

Stationarization for Time Series Prediction
The stationarity of time series plays a vital role in their predictability, and considerable effort has been dedicated over the years to developing dependable and precise models for non-stationary time series. Traditional methods, such as hidden Markov models (HMMs) [17], dynamic Bayesian networks [18], Kalman filters [19], and other statistical models (e.g., ARIMA [20]), achieved significant advances in the field. More recently, recurrent neural networks (RNNs) [21,22] have achieved better performance. RNNs impose no assumptions on the temporal structure and can capture highly non-linear and intricate dependencies within time series. A wide range of deep learning architectures also demonstrate excellent performance. DAIN [23] uses a non-linear neural network to adaptively stationarize time series based on the observed training distribution. The adaptive sparse Huber additive model [24] enables robust forecasting for non-Gaussian and (non)stationary data. AdaRNN [25] uses an adaptive mechanism that automatically adjusts the weights and parameters of the model to accommodate changes in the data. Unlike previous methods, Non-stationary Transformers [26] automatically learn and adapt to changing patterns and structures in the time series by introducing adaptive mechanisms.

Deep Models for Long Time Series Forecasting
Transformer-based models exhibit significant prowess in sequence modeling tasks [27][28][29]. However, self-attention has quadratic complexity, and subsequent research has sought ways to reduce it. LogTrans [12] introduces convolutional self-attention and the LogSparse transformer, which selects time steps at exponentially increasing intervals, effectively reducing the complexity. Reformer [10] employs reversible layers, locality-sensitive hashing (LSH) attention, and chunking to address the performance and memory limitations of traditional Transformers on long sequences, reducing the complexity from $O(L^2)$ to O(LlogL). Informer [7] adopts KL-divergence to select the dominant queries and introduces ProbSparse self-attention, which achieves O(LlogL) complexity. Other Transformer-based models achieve excellent performance from other perspectives. Pyraformer [11] designs the Pyramid Attention Module (PAM) to capture temporal dependencies at various hierarchies. Autoformer [9] adds a decomposition block to extract the intrinsic complex temporal trends of the hidden states in the model and replaces self-attention with an Auto-Correlation mechanism, which takes into account the similarity between sub-series to better capture trends, achieving O(LlogL) complexity while also preventing information loss. FEDformer [30] introduces a mixture-of-experts decomposition to combine trend components obtained through moving average kernels of different sizes. Preformer [31] proposes a multi-scale segment-correlation mechanism and

Methodology
In time series forecasting, our objective is to predict the future values of a time series $X_{I+1:I+O} \in \mathbb{R}^{O \times D}$ based on the history data $X_{1:I} \in \mathbb{R}^{I \times D}$. Here, O represents the number of time steps into the future, I denotes the number of time steps in the past, and D > 1 represents the number of feature dimensions. The long-term series prediction problem requires longer output lengths O than previous work. To exploit the dependencies among all dimensions, we take into account the variable dimension dependencies in Sect. 3.1, which improves the predictability of the model. In Sect. 3.2, we use the Q-independent approach to simplify self-attention, and in Sect. 3.3, we construct a two-branch attention to exploit information from different dimensions for the final prediction.
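The problem setup above can be illustrated with a minimal sketch that slices a multivariate series into (history, future) pairs. The function name `make_windows` and the toy data are illustrative assumptions, not part of the original work; the lengths `input_len` and `output_len` play the roles of I and O.

```python
def make_windows(series, input_len, output_len):
    """Slice a series (list of T rows, each with D values) into
    (history, future) pairs of lengths input_len and output_len."""
    pairs = []
    last_start = len(series) - input_len - output_len
    for start in range(last_start + 1):
        history = series[start:start + input_len]
        future = series[start + input_len:start + input_len + output_len]
        pairs.append((history, future))
    return pairs


# Toy series with T = 6 time steps and D = 2 variables (hypothetical data).
toy = [[t, 10 * t] for t in range(6)]
pairs = make_windows(toy, input_len=3, output_len=2)
first_history, first_future = pairs[0]
```

Each pair holds a history of shape I × D and a target of shape O × D, matching the notation of the formulation.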

Dimension Dependency
Previous transformer-based LSTF models embed the data points at the same time step into a vector, $X_t \to h_t$, where $X_t \in \mathbb{R}^{D}$ denotes all data points across the D dimensions at step t and $h_t \in \mathbb{R}^{d_{model}}$ is the corresponding vector. The input $X_{1:T} \in \mathbb{R}^{T \times D}$ is thus embedded into T vectors $h_1, h_2, \dots, h_T$. However, previous Transformer-based models mainly capture temporal dependencies and ignore variable dimension dependencies, treating all dimensions of the input data as equally important. This limits the predictive power of the models. We visualize a 512-dimensional heat map of the dimensions in the Transformer, as shown in Fig. 1, and find that different dimensions have different levels of importance, with lighter colours indicating greater importance. Based on this finding, we investigate variable dependencies to improve information utilization by explicitly capturing the interdependencies between variables. Thus, we consider both temporal and variable dependencies, enabling the model to use global information to selectively emphasize useful features.

Revisiting the Self-attention
The canonical self-attention [32] performs the scaled dot-product on the tuple input (Q, K, V):

$$\mathrm{Attention}(Q, K, V) = \mathrm{Softmax}\left(\frac{QK^{\top}}{\sqrt{d}}\right)V,$$

where $Q \in \mathbb{R}^{L_Q \times d}$, $K \in \mathbb{R}^{L_K \times d}$, and $V \in \mathbb{R}^{L_V \times d}$, with $L_Q$, $L_K$, and $L_V$ representing the sequence lengths of Q, K, and V, respectively, and d representing the input dimension. The Softmax(⋅) operation is conducted row by row. Typically, the self-attention mechanism employs query-specific attention weights to indicate the significance of positions relative to the query position. To examine self-attention more closely, we use $q_i$, $k_i$, and $v_i$ to represent the i-th row of Q, K, and V, respectively. The attention of the i-th query is then defined as

$$\mathrm{Attention}(q_i, K, V) = \sum_{j} \frac{\exp\left(q_i k_j^{\top} / \sqrt{d}\right)}{\sum_{l} \exp\left(q_i k_l^{\top} / \sqrt{d}\right)} v_j.$$

Nevertheless, the quadratic complexity of this function is a notable drawback when it comes to improving predictive efficiency.
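As a small illustration of the per-query form above, the following sketch computes canonical scaled dot-product attention on plain Python lists. It is a toy rendering of the standard formula, not the authors' implementation; the inputs are made up, and the nested loops make the $L_Q \times L_K$ score computation explicit.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (weights, outputs); weights has one row per query."""
    d = len(K[0])
    weights, outputs = [], []
    for q in Q:  # one softmax row per query position
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        row = softmax(scores)
        weights.append(row)
        # Weighted sum of the value rows for this query.
        outputs.append([sum(w * v[c] for w, v in zip(row, V))
                        for c in range(len(V[0]))])
    return weights, outputs


# Hypothetical toy inputs: 2 queries, 2 keys/values, d = 2.
Q = [[0.0, 0.0], [1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
weights, outputs = scaled_dot_product_attention(Q, K, V)
```

The all-zero first query gives uniform weights, so its output is the mean of the value rows. The second query favours the key it aligns with.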
To delve deeper into the attention learned by the Transformer, we examine the attention distribution specifically on the ETTh1 dataset. We selected the attention scores of Head1, Head7 @Layer1 and visualized them by heat maps, as shown in Fig. 2. At different query positions q i in each row, the global context scores for the corresponding positions are concentrated in almost the same few columns. This indicates that only one Q is used to learn all features and that the model performs a redundant Q-K calculation. In other words, the model learns only query-independent dependencies.
Analysis of Variance (ANOVA) is a statistical method used to examine the variation in a response variable, which is typically a continuous random variable. This analysis is performed by considering different conditions defined by discrete factors, often represented by classification variables with nominal levels. ANOVA helps assess the influence of these factors on the response variable and determines if there are statistically significant differences among the groups being compared [33]. We often use ANOVA to test for significant differences between two or more samples by comparing the between-group variance with the within-group variance (random error). ANOVA can also be used to infer statistically whether the independent variable has a significant effect on the dependent variable.
Therefore, we use ANOVA to further validate the aforementioned findings. In this study, the different query positions ($q_i$) in self-attention represent the different samples, with 96 samples in each group. The null hypothesis ($H_0$) asserts that there is no significant difference between the means of the sample groups, suggesting that all 96 samples are equivalent. The alternative hypothesis ($H_1$) states that the means of the sample groups are not all equal, indicating a significant difference between the 96 samples. Tables 1 and 2 present the mean ANOVA results for Head1 and Head7 of the vanilla Transformer at Layer1.
As shown in Tables 1 and 2, SS (sum of squares) represents the sum of squared deviations from the mean and quantifies the variation in the data. DF (degrees of freedom) is the number of values that are free to vary when a statistic is computed. MS is the mean square, which is equal to the corresponding SS divided by DF, and F is the F statistic, the ratio of the between-group MS to the within-group MS. The p-value is the probability of observing a difference between the groups at least as extreme as the one obtained, assuming that the null hypothesis is true. A smaller p-value provides stronger evidence against the null hypothesis. The value of F-crit is greater than the value of F, so we fail to reject the null hypothesis, i.e., the samples in the sample group are not significantly different. In other words, although the purpose of self-attention is to compute attention for each query position, the trained attention is actually independent of the query position. Therefore, computing attention for each query position is redundant, and we can simplify the self-attention module.
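The quantities reported in the ANOVA tables (SS, DF, MS, F) can be computed as in the sketch below. The groups here are made-up toy numbers rather than attention scores from the paper, and the function name `one_way_anova` is illustrative.

```python
def one_way_anova(groups):
    """One-way ANOVA: return SS, DF, MS (between/within) and the F statistic."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]

    # Between-group variation: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    # Within-group variation: spread of samples around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between, df_within = k - 1, n - k
    ms_between = ss_between / df_between
    ms_within = ss_within / df_within
    return {
        "ss_between": ss_between, "ss_within": ss_within,
        "df_between": df_between, "df_within": df_within,
        "ms_between": ms_between, "ms_within": ms_within,
        "F": ms_between / ms_within,
    }


# Hypothetical toy groups (three groups of three samples each).
result = one_way_anova([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [3.0, 4.0, 5.0]])
```

For this toy input, F is 3.0 with (2, 6) degrees of freedom, below the 5% critical value of about 5.14, so the null hypothesis would not be rejected. The critical value itself comes from an F-distribution table or a statistics library.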

Simplifying the Self-attention Block
Having observed that the attention scores for different query positions are highly similar, we simplify the self-attention mechanism. Our approach computes a single query-independent attention and shares it across all query positions, which yields a simplified self-attention block that is more efficient. In the simplified block, the attention weights are obtained by applying the softmax function over the sequence positions, where L denotes the sequence length.
In contrast to the traditional self-attention block, the term in Eq. 2 does not depend on the query position i and is therefore common to all query positions. Using this observation to simplify the attention mechanism, we eliminate the matrix operations on Q and K, which substantially reduces the model's complexity. Instead, we obtain the global information directly as a weighted average of the features at all positions.
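One way to picture a query-independent attention of this kind is sketched below. It assumes a single scoring vector `w_k` that assigns one score per position, in the spirit of the global context block of Cao et al. [16]. The vector, the function name, and the toy input are assumptions for illustration; the exact parameterization used in the paper may differ.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_independent_attention(X, w_k):
    """X: L rows of d features. w_k: hypothetical length-d scoring vector.

    One weight per position is computed without any query, and the same
    weighted average of the features is shared by every query position.
    """
    scores = [sum(xc * wc for xc, wc in zip(x, w_k)) for x in X]
    weights = softmax(scores)  # length L, independent of the query
    d = len(X[0])
    context = [sum(w * x[c] for w, x in zip(weights, X)) for c in range(d)]
    # Broadcast the shared context to all L positions.
    return weights, [list(context) for _ in X]


# Hypothetical toy input with L = 3 positions and d = 2 features.
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, outputs = query_independent_attention(X, w_k=[0.0, 0.0])
```

This sketch makes a single pass over the L positions and never forms an L × L score matrix, which is where the saving over canonical self-attention comes from.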

Double-Branch Attention Module
As mentioned earlier, capturing long-term dependencies while reducing time complexity, so as to improve both information utilization and computational efficiency, is a challenging aspect of long-term series prediction. Previous studies on the self-attention family have achieved better prediction performance only by modelling dependencies in the temporal dimension, while failing to effectively capture and utilize information from the variable dimension. To overcome this limitation, we introduce a Double-Branch attention module that incorporates convolutional neural networks (CNNs) [34] to preserve information from the dimensions of the latent features. This module serves as a simplified self-attention block that can be integrated into existing architectures as a modular component. By leveraging the Double-Branch attention module, we can better capture long-term dependencies and reduce time complexity, thereby improving the efficiency and accuracy of long-term series prediction.
Overall architecture Figure 3 shows the proposed Double-Branch attention module, designed to integrate into transformer-based models. We keep the encoder-decoder architecture and replace all self-attention in the encoder and the input self-attention in the decoder with double-branch attention, but we do not replace the cross-attention that connects the encoder and decoder. Because Q and K in cross-attention come from the decoder input and the encoder output, respectively, they are not equal, and the Q-independent observation no longer applies. To prevent information leakage, we mask future information in the input stage of the decoder. Furthermore, the simple architecture of the DBA block enables integration into state-of-the-art architectures by replacing their attention modules, thereby improving performance and speeding up computation. More results are presented in the experimental section.
As the Double-Branch Attention name suggests, this module leverages a two-branch attention mechanism, with each branch operating in parallel. One branch sequentially captures temporal and variable dependencies in the temporal and variable dimensions, while the other focuses on temporal dependencies in the temporal dimension. Lastly, the outputs from the two branches are harmoniously merged using simple averaging. In summary, the attention process can be explained concisely in the following manner.
where $M_v$, $M_t$, and $M_{ct}$ represent the variable, temporal, and cross-temporal attention weights, respectively, and ⊗ represents element-wise multiplication. In this process, the attention values are broadcast as needed. The output is denoted as y.
Variables (Temporal) Attention Module By leveraging the relationships among variable dimensions, we generate attention weights for the variables, which enables us to focus on what is truly meaningful across all variables. To compute the information interaction efficiently on the variable dimension only, we first permute the input tensor and then extract the global context of the L dimension by max-pooling [35], which also helps reduce time complexity. The attention weights for the variables are then obtained by passing the tensor through a 1D convolutional layer followed by a softmax activation function. Subsequently, the resulting attention weights are permuted back to the original layout and element-wise multiplied with the input tensor. A similar procedure is followed for temporal attention. Next, we combine variable attention and temporal attention by concatenating them, which gives the output of the first branch. Our experiments indicate that the order of concatenation has a negligible impact on the results. To summarize, the variables (temporal) attention weight is calculated as in Eq. 7, where x is the input tensor, $C_1$ represents a 1D convolution with a kernel size of 1, and the outermost function is the softmax.
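The variable attention steps described above (permute, max-pool over L, kernel-size-1 convolution, softmax, permute back, multiply) can be sketched for a single sample as follows. Several details are assumptions made only for illustration: the convolution is reduced to a single-channel scalar weight and bias, the softmax is taken across the variable axis, and the function and parameter names are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def variable_attention(x, conv_weight=1.0, conv_bias=0.0):
    """x: one sample of shape (L, F) as nested lists.

    Returns the variable weights m_v (length F) and the re-weighted input.
    """
    L, F = len(x), len(x[0])
    # Permute (L, F) -> (F, L) so that each row holds one variable over time.
    x_perm = [[x[t][f] for t in range(L)] for f in range(F)]
    # Global max-pooling over the L dimension: one summary value per variable.
    pooled = [max(row) for row in x_perm]
    # Kernel-size-1 convolution, assumed here to be a scalar affine map.
    conv = [conv_weight * p + conv_bias for p in pooled]
    # Softmax across variables (assumed axis) gives the attention weights.
    m_v = softmax(conv)
    # Back in (L, F) layout: broadcast the weights over time and multiply.
    y = [[x[t][f] * m_v[f] for f in range(F)] for t in range(L)]
    return m_v, y


# Hypothetical toy sample with L = 2 time steps and F = 2 variables.
x = [[1.0, 2.0], [3.0, 0.0]]
m_v, y = variable_attention(x)
```

Temporal attention would follow the same pattern with the roles of the two axes swapped.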

The Cross-temporal Attention Module
We produce temporal attention by leveraging cross-dimension interactions, which allows us to focus on the more important temporal sections. To achieve this, we unsqueeze the second dimension of the input tensor to shape it as (B, 1, L, F), where B represents the batch size, L the sequence length, and F the feature dimension. The reshaped tensor is then fed into a conventional 2D convolution layer with a kernel size of k. Afterwards, the output is squeezed and passed through a softmax activation layer. This generates attention weights of shape (B, L, F), which are then applied to the input tensor x. The tensors obtained from the two branches are subsequently aggregated by simple averaging.
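A single-sample sketch of the cross-temporal branch and the final averaging is given below. It treats the (L, F) slice as a one-channel 2D map, applies a k × k convolution with zero padding, and takes the softmax along the time axis. The padding scheme, the softmax axis, the fixed averaging kernel, and all function names are assumptions for illustration rather than details confirmed by the text.

```python
import math


def conv2d_same(grid, kernel):
    """Single-channel 2D cross-correlation with zero padding (same output size)."""
    L, F = len(grid), len(grid[0])
    k = len(kernel)
    pad = k // 2
    out = [[0.0] * F for _ in range(L)]
    for i in range(L):
        for j in range(F):
            acc = 0.0
            for di in range(k):
                for dj in range(k):
                    r, c = i + di - pad, j + dj - pad
                    if 0 <= r < L and 0 <= c < F:
                        acc += kernel[di][dj] * grid[r][c]
            out[i][j] = acc
    return out


def cross_temporal_attention(x, kernel):
    """x: one sample of shape (L, F). Returns weights m_ct and weighted input."""
    L, F = len(x), len(x[0])
    conv = conv2d_same(x, kernel)  # mixes neighbouring time steps and features
    m_ct = [[0.0] * F for _ in range(L)]
    for j in range(F):
        # Softmax along the time axis for each feature column (assumed axis).
        col = [conv[i][j] for i in range(L)]
        peak = max(col)
        exps = [math.exp(v - peak) for v in col]
        total = sum(exps)
        for i in range(L):
            m_ct[i][j] = exps[i] / total
    y = [[x[i][j] * m_ct[i][j] for j in range(F)] for i in range(L)]
    return m_ct, y


def average_branches(y_first, y_second):
    """Merge the two branch outputs by simple element-wise averaging."""
    return [[(a + b) / 2.0 for a, b in zip(r1, r2)]
            for r1, r2 in zip(y_first, y_second)]


# Hypothetical toy sample (L = 3, F = 2) and a 3 x 3 averaging kernel.
x = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
kernel = [[1.0 / 9.0] * 3 for _ in range(3)]
m_ct, y_ct = cross_temporal_attention(x, kernel)
merged = average_branches(y_ct, x)  # x stands in for the first-branch output
```

In a trained model the kernel would be learned, and k would be picked from the candidate set given in the implementation details.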

$$M_v = \mathrm{Softmax}\big(\mathrm{permute}(C_1(\mathrm{permute}(x)))\big) \tag{7}$$

Fig. 3 The outline of the proposed model. We substitute the self-attention mechanisms within the transformer-based model with double-branch attention. Specifically, we substitute the self-attention of the encoder and the input self-attention of the decoder with double-branch attention, while retaining the cross-attention that connects the encoder and decoder. To prevent information leakage, we implement a masking mechanism in the input stage of the decoder that restricts access to future information.

Summarizing, the computation for obtaining the attention-applied tensor y from an input tensor $x \in \mathbb{R}^{B \times L \times F}$ can be expressed by:

Experiments
The performance of the Double-Branch Attention mechanism is evaluated through extensive prediction experiments conducted on six real-world benchmark datasets.
Datasets Below is an overview of the six datasets employed in our experimental analysis: (1) Exchange [34] collects daily exchange rates for different currencies from The data is recorded at a frequency of every 10 min. We adhere to the standard procedure of chronologically partitioning all datasets into training, validation, and testing subsets, with a ratio of 6:2:2 for the ETTm2 dataset and 7:1:2 for the remaining datasets.
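The chronological partitioning described above can be sketched as follows. The helper name `chronological_split` and the sample count of 1000 are illustrative assumptions. The ratios are the ones stated in the text.

```python
def chronological_split(n_samples, ratios):
    """Split indices 0..n_samples-1 in time order into train/val/test ranges."""
    total = sum(ratios)
    n_train = n_samples * ratios[0] // total
    n_val = n_samples * ratios[1] // total
    train = range(0, n_train)
    val = range(n_train, n_train + n_val)
    test = range(n_train + n_val, n_samples)  # the remainder goes to the test set
    return train, val, test


# 6:2:2 for ETTm2 and 7:1:2 for the remaining datasets (hypothetical length).
ett_train, ett_val, ett_test = chronological_split(1000, (6, 2, 2))
oth_train, oth_val, oth_test = chronological_split(1000, (7, 1, 2))
```

Keeping the splits contiguous in time avoids leaking future observations into the training subset.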
[32], Informer [7], and Non-stationary Transformer [13] to verify the generalization of our framework.
Implementation Details Our experiments are implemented in PyTorch [26] and run on a single NVIDIA GeForce RTX 3090 24GB GPU. Each model is trained using the Adam [38] optimizer with an L2 loss function, an initial learning rate of $10^{-4}$, and a batch size of 32. Training is stopped early within 3 epochs. The hyperparameter k for the Double-Branch Attention is chosen from {3, 5, 7, 9} to trade off performance and efficiency. We repeat all experiments three times, each time using a different random seed.

Results and Analysis
Multivariate results Regarding the multivariate prediction, our framework-equipped Non-stationary Transformer consistently achieves state-of-the-art performance compared to the majority of baseline models (Table 3). For the prediction length of 336, we achieve 33% (0.572 → 0.384) MSE reduction in Exchange, 25% (0.416 → 0.310) in ETTm2, 4% (0.202 → 0.194) in Electricity, 27% (0.639 → 0.467) in Traffic, and 6% (0.357 → 0.321) in Weather. In the ILI dataset, specifically in the input-36-predict-60 setting, we achieve a 22% (2.653 → 2.062) MSE reduction. Furthermore, we observe that the performance of our model exhibits a consistent trend as the forecast length increases. This indicates that our model maintains robustness over longer-term predictions, which is a crucial factor for practical applications in real-world scenarios.
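The relative reductions quoted above follow from (baseline − ours) / baseline. The short sketch below recomputes them from the MSE pairs in the text. The helper name is illustrative, and the pairs are copied from the paragraph rather than from Table 3 directly.

```python
def mse_reduction_pct(baseline, ours):
    """Relative MSE reduction in percent."""
    return 100.0 * (baseline - ours) / baseline


# (baseline MSE, our MSE) at prediction length 336; ILI uses input-36-predict-60.
pairs = {
    "Exchange": (0.572, 0.384),
    "ETTm2": (0.416, 0.310),
    "Electricity": (0.202, 0.194),
    "Traffic": (0.639, 0.467),
    "Weather": (0.357, 0.321),
    "ILI": (2.653, 2.062),
}
reductions = {name: mse_reduction_pct(b, o) for name, (b, o) in pairs.items()}
```

For Exchange this gives about 32.9%, which rounds to the quoted 33%. The Weather pair as printed evaluates to roughly 10% rather than 6%, so one of those figures may need to be checked against Table 3.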
Univariate results Regarding univariate prediction, we present the prediction outcomes for two representative datasets in Table 4. Our model consistently outperforms a wide range of baselines. Specifically, in the input-96-predict-336 setting, our model gains a 7% (0.486 → 0.452) reduction in MSE for Exchange, which has no apparent periodicity, and a 29% (0.180 → 0.128) reduction in MSE for ETTm2, which has an observable periodicity. Notably, ARIMA can handle series whose characteristics vary over time rather than being fixed and whose non-stationarity is stochastic rather than deterministic. Since the Exchange dataset is not periodic and becomes stationary after differencing, ARIMA is more effective than some deep learning models for short-term predictions. However, ARIMA is less effective than models that explicitly take the non-stationarity of the dataset into account. These results indicate that our model performs better on sequences both with and without periodicity.

Ablation Study
Qualitative performance To qualitatively assess the validity of our model, we compare its predictions with those of two alternative models, the Autoformer and the Non-stationary Transformer. The results are plotted for the last dimension of the ETTm2 prediction in Figs. 4, 5, 6, and 7. Our model's predictions track the ground truth more closely than those of the other models.
Quantitative performance To investigate the significance of each sub-module within our proposed framework, we conduct a quantitative comparison of the modules. Table 5 displays the results obtained by separately adding the three modules. In traffic flow forecasting, the current and future traffic states are influenced by the historical traffic states, and a positive correlation exists between the trends observed in the traffic flow time series and the historical time series. Relying solely on variable attention or temporal attention results in poor performance. However, the two-branch attention mechanism emphasizes temporal dependence, which aligns with the characteristics of the traffic flow dataset. By combining variable attention and temporal attention, the model effectively leverages the impact of historical traffic states on present and future states, capturing complex temporal dependencies and improving predictive power. Overall, the prediction results for each module and the Non-stationary Transformer are comparable, indicating that the Q-independent version in our framework can effectively replace QKV attention. Furthermore, our proposed Double-Branch Attention outperforms the baseline on all six benchmarks, especially on highly non-stationary datasets such as Exchange (0.482 → 0.423) and ETTm2 (0.558 → 0.403). This comparison highlights the limited predictive power of the transformer family when it only performs temporal attention. By incorporating attention to the variable dimension, our proposed approach effectively unleashes the model's potential for series forecasting.
Efficiency Analysis We compared the training times of Autoformer, Informer, and Transformer (Fig. 8). Remarkably, our proposed DBA model exhibits complexity of O(L log L) , which is notably reduced compared to alternative models. This highlights its computational efficiency advantages.
Framework generality We apply our framework to two popular transformer-based models and evaluate their predictive performance (Table 6). Overall, our approach consistently enhances the predictive power of these distinct models. Specifically, we achieve an average performance increase of 4.98% on Transformer and 3.99% on Informer. These improvements come without adding significant parameters or computation, while reducing the computational complexity of the models. This confirms the effectiveness and efficiency of our DBA Transformer, which can be extensively employed in Transformer-based models to

Conclusion
We have proposed a solution to the challenge of long-term time series forecasting. Unlike previous studies that relied on temporally sparse versions of self-attention, we propose a more efficient method to capture dependencies in the time and variable dimensions, improving information utilization. Additionally, we use Q-independent attention to simplify self-attention, which reduces the prediction time of the model and achieves O(LlogL) complexity. Our approach demonstrates both strong generality and high accuracy on six real-world benchmarks. Furthermore, we provide detailed ablation experiments as evidence of the validity of our proposed double-branch attention. Overall, our framework is a versatile, lightweight solution with wide-ranging utility to boost the predictability of transformer-based models and achieve state-of-the-art performance.