Learning evolving relations for multivariate time series forecasting

Multivariate time series forecasting is essential in various fields, including healthcare and traffic management, but it is a challenging task due to the strong dynamics in both intra-channel relations (temporal patterns within individual variables) and inter-channel relations (the relationships between variables), which can evolve over time with abrupt changes. This paper proposes ERAN (Evolving Relational Attention Network), a framework for multivariate time series forecasting that is capable of capturing such dynamics of these relations. On the one hand, ERAN represents inter-channel relations with a graph that evolves over time, modeled using a recurrent neural network. On the other hand, ERAN represents the intra-channel relations using a temporal attentional convolution, which adapts to the input data to capture local temporal dependencies. The evolving graph structure and the temporal attentional convolution are integrated in a unified model to capture both types of relations. The model is evaluated on a large number of real-life datasets including traffic flows, energy consumption, and COVID-19 transmission data. The experimental results show a significant improvement over state-of-the-art methods in multivariate time series forecasting, particularly for non-stationary data.


Introduction
In this paper, we study the problem of multivariate time series forecasting, which is to predict future data points given the previous data of a time series. Forecasting time series is critical for many real-life applications, including predicting traffic flow, electricity consumption, and COVID-19 transmission. Accurate forecasting is crucial for making informed decisions and planning for the future.
Multivariate time series forecasting is challenging due to the complexity in both intra- and inter-channel relations. Intra-channel relations involve the temporal patterns within individual variables, determining the dependencies of future values on prior ones. Intra-channel relations can be highly dynamic, particularly in non-stationary time series, making them difficult to forecast. For example, in COVID-19 data, changes in government policies may impact COVID-19 transmission in a city, resulting in non-stationarity in temporal patterns and making it challenging to predict new cases. Inter-channel relations, on the other hand, refer to the dependencies between pairs of variables. Again, these relations can evolve over time. For instance, the correlation between the spread of COVID-19 among cities or countries may vary over time due to adaptive government policies such as social distancing or border closures. Therefore, accurately capturing the dynamics of these types of relations is essential for multivariate time series forecasting.
Time series forecasting has been extensively studied with conventional models such as the autoregressive model (AR), the moving average model (MA), and the autoregressive integrated moving average model (ARIMA) [15]. Recently, deep neural network-based models, such as recurrent neural networks (RNNs) [7,14] and convolutional neural networks (CNNs) [3], have shown promising results due to their ability to capture nonlinear temporal patterns. These models represent the dependencies of future data points on previous data points using a set of learnable parameters. However, since these parameters are fixed after training, they can only capture invariant temporal patterns and are therefore insufficient to model time series with time-varying patterns, such as the non-stationary time series commonly observed in reality.
To model the inter-channel relations, recent works have applied graph neural networks to multivariate time series data. In this approach, a multivariate time series is viewed as a graph, with each variable represented as a node and the underlying correlations between variables represented as connections between nodes. The graph can be either pre-defined or learned from data. By combining a graph neural network with a temporal model such as an RNN [2] or a CNN [12,34,35], both types of relations can be modeled simultaneously. The primary limitation of this approach is its static graph structure, where a single graph is used throughout the entire lifespan of the time series. However, a static graph structure is inadequate for capturing relationships between variables that evolve over time, as in the COVID-19 transmission example above.
A highly plausible path to tackle these challenges is to learn the evolution of the spatio-temporal relations in time series. Unlike entities in static data, time series variables have unique evolving lives throughout space-time. As a series progresses, it changes its internal states and interacts with other series at arbitrary times. Guided by these principles, we introduce ERAN (Evolving Relational Attention Network), a novel method for modeling multivariate time series. ERAN learns to extract the graph structure underlying inter-channel relations at each time step. In ERAN, a graph structure is represented by an adjacency matrix. The evolution of the adjacency matrices is modelled by a recurrent neural network, where the adjacency matrix at each time step depends on that of the previous step and the currently observed data, allowing ERAN to accurately model the evolving inter-channel relations in the data.
Once discovered, the dynamic graph structure guides a reasoning process that jointly captures the intra- and inter-channel relations of a time series. This process happens within a multi-layer architecture that alternates between modeling inter-channel relations with a graph convolution network and intra-channel relations with a fused Temporal Attentional Convolution (TAC). The result is a concrete representation of the observed time series that facilitates convenient decoding into the future forecast.
The ERAN model stands out with its authentic and explicit modeling of evolving relations in time series, leading to an effective and stable forecasting process. These advantages are demonstrated through a comprehensive set of experiments across multiple domains, including traffic flow, electricity consumption, and COVID-19 transmission. ERAN consistently outperforms existing state-of-the-art models on these tasks. Analysis of the model reveals its effectiveness in exploring the underlying dynamic relations during the forecasting process.
In summary, we present the following innovations:
• A method to model the dynamics of inter-channel relations in multivariate time series, by learning a dynamic graph underlying such relations that evolves over time.

Traditional time series forecasting
Time series forecasting has been studied for decades. Traditional techniques are mainly based on the statistical approach. These methods include the well-known autoregressive integrated moving average (ARIMA) [15], support vector regression (SVR) [25], random forest (RF) [10], and vector autoregressive models (VARs) [15]. ARIMA is a generalisation of the autoregressive (AR) and autoregressive moving average (ARMA) models [22]. SVR [25] is a type of support vector machine that performs regression by finding a hyperplane that fits the data with the maximum margin. SVR can handle non-linear and high-dimensional data and provide probabilistic forecasts. RF [10] is an ensemble learning method that performs regression by combining multiple decision trees trained on different subsets of the data. RF can handle noisy and heterogeneous data, and reduces the variance and overfitting of individual trees.
Vector autoregressive models (VARs) extend AR and ARMA to extract linear correlations between variables in multivariate time series [15]. Nevertheless, traditional methods have several drawbacks. Firstly, they are linear models and hence cannot capture the non-linear dependencies present in complex data. Secondly, these methods train on time series individually and are therefore not scalable to large-scale datasets containing millions of time series. Furthermore, because of this individual training, they cannot leverage the common patterns shared between time series in the dataset.

Deep learning-based time series forecasting
Deep learning-based methods have recently shown promising results in time series forecasting by capturing non-linear dependencies in the data. Among these techniques, RNNs and their variants LSTM and GRU [6,7,14] have been used in models such as DeepAR [30], Deep State-Space Models (DSSM) [28], and TimeGrad [29], to name a few. CNN-based models such as WaveNet [24] and GluonTS [1] have also demonstrated their effectiveness in modeling time series data.
These models embed the observed multivariate time series into a sequence of vectors in a shared hidden space, and use RNN, CNN, or self-attention mechanisms to model the sequence along the temporal axis. Since the variables are encoded in a shared hidden space, the inter-dependencies between variables are not modeled explicitly.

Graph neural network for time series forecasting
Graph neural networks (GNNs) have shown great success on structured data such as social networks, protein networks, chemical networks, and human skeleton data. The main goal of a GNN is to capture the dependencies of nodes via a graph structure. A GNN can learn the representation of a node by leveraging not only that node's features but also those of its neighbor nodes. Various techniques have been proposed for this purpose, such as graph convolution [4,8,16] and message passing [11,23,27].
Inspired by the success of GNNs in other domains, researchers have recently applied them to multivariate time series, e.g., Graph WaveNet (GWNet) [34], Spatio-Temporal Graph Convolutional Networks (STGCN) [36], and the Attention-based Spatial-temporal Graph Convolutional Network (ASTGCN) [12]. A multivariate time series can be seen as a graph where variables correspond to nodes, and the edges are the underlying dependencies between the variables. By combining a GNN and a temporal model, e.g., an RNN or a CNN, these works can learn time series representations that capture both intra- and inter-channel relations. However, these models need a pre-defined graph, which is not always readily available.
To enable GNNs on data where a pre-defined graph is unavailable, researchers have proposed learning the latent structure from data. For time series, a number of models have been proposed to learn an underlying graph from data, such as MTGNN [35], the Adaptive Graph Convolutional Recurrent Network (AGCRN) [2], and the Spectral Temporal Graph Neural Network (StemGNN) [5]. Their main disadvantage is that they use a static graph structure over the entire time span of a time series, and thus cannot capture the dynamic dependencies between variables.

Problem formulation
A multivariate time series is represented by a matrix Y = [X_1, X_2, ..., X_T] ∈ R^{N×T}, where T is the number of time steps and N is the number of variables. In this notation, X_t ∈ R^N represents a slice of Y observed at time step t. Given a historical window of L observed time steps, Y_o = X_{T−L+1}, X_{T−L+2}, ..., X_T, and a forecast horizon τ, the task is to predict the values of the next τ steps in the future: Y_f = X_{T+1}, X_{T+2}, ..., X_{T+τ}. Often, historical data may be associated with covariates such as date, time, and location. Therefore, we assume that the input data for forecasting Y_f is S = S_{T−L+1}, S_{T−L+2}, ..., S_T, where S_t ∈ R^{D_in×N} is obtained by concatenating X_t and its covariates, and D_in is the dimensionality of the input features. Our goal is to build a model F that predicts Y_f from S.
Here, Ŷ_f ∈ R^{N×τ} denotes the values predicted over the forecasting horizon, which the model F produces from S using its set of learnable parameters.
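To make the notation concrete, the following sketch builds one (S, Y_f) input-target pair from synthetic data. The single time-of-day covariate and all variable names are illustrative assumptions, not taken from the paper:

```python
import numpy as np

N, T, L, tau = 5, 100, 12, 3          # variables, total steps, history length, horizon
Y = np.random.randn(N, T)             # Y in R^{N x T}, one row per variable

# A single covariate: normalised hour-of-day index, shared across variables
hour = np.tile((np.arange(T) % 24) / 23.0, (N, 1))

# One (input, target) pair whose history ends at step t0
t0 = 80
S = np.stack([np.vstack([Y[:, t], hour[:, t]])   # S_t in R^{D_in x N}, D_in = 2
              for t in range(t0 - L, t0)])       # -> shape (L, D_in, N)
Y_f = Y[:, t0:t0 + tau]                          # forecast target, shape (N, tau)

print(S.shape, Y_f.shape)                        # (12, 2, 5) (5, 3)
```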

Model overview
We propose a novel approach to the multivariate time series forecasting problem that leverages a dynamic graph G_t = (V, A_t) capturing the interactions between the variables at each time step of the series. The set of nodes V is of size |V| = N, and A_t ∈ R^{N×N} is the adjacency matrix whose entries reflect the strengths of the relations between pairs of variables at the t-th time step. It is worth noting that this graph is not pre-defined. Instead, our proposed model learns to extract the node features and to generate the corresponding adjacency matrix A_t and its evolution over time steps.
The proposed approach is implemented in ERAN, whose overall architecture is illustrated in Fig. 1a. Firstly, the multivariate time series is input into an Evolving Graph Learning (EGL) layer to generate a sequence of adjacency matrices, which are then consumed by a stack of ERAN blocks. Each ERAN block has two output branches: (i) the residual output R^(i), which serves as part of the input for the next ERAN block, and (ii) the skip output C^(i), which is fed directly to the output layer (described in the next paragraph) to contribute to the forecast.
The output layer utilises the skip-connection outputs of the ERAN blocks to generate the future values, and the model is trained to produce the most accurate prediction of these values. The Mean Absolute Error is chosen as the objective function to train ERAN, computed as the mean of |Y_f − Ŷ_f| over all variables and horizon steps, where Y_f is the ground truth of the forecasting horizon, Ŷ_f are the values predicted by the model, and the loss is minimised over all model parameters. The detailed design of these components is described in the subsequent sections.
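As a quick sanity check of the objective, the MAE over a forecast window can be computed as in this minimal sketch (the function name is illustrative):

```python
import numpy as np

def mae_loss(Y_f, Y_hat):
    """Mean absolute error between ground truth and prediction, both of shape (N, tau)."""
    return np.mean(np.abs(Y_f - Y_hat))

Y_f   = np.array([[1.0, 2.0], [3.0, 4.0]])
Y_hat = np.array([[1.5, 2.0], [2.0, 4.0]])
print(mae_loss(Y_f, Y_hat))   # 0.375
```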

EGL: Evolving Graph Learning layer
In a GNN-based method, the adjacency matrix plays a central role in learning the relationships between individual variables. In deterministic systems, this matrix can be pre-defined based on human knowledge. For instance, in traffic flow data, the adjacency matrix can be constructed from the road network and the distances between the sensors. However, in many cases, such a graph is not readily available or is too complex to be defined manually. Early learning-based works proposed generating such a structure from the data without prior knowledge of the graph [2,5,35]. However, these works generate a single adjacency matrix to cover the relations between variables throughout the entire time series, and thus they are not adaptive to the dynamics of the data. In contrast, EGL learns a series of adjacency matrices, one per time step, enabling the model to capture the evolving relationships of individual variables.

Now we present how EGL works. Our aim is to design a recurrent process that generates A_t, the adjacency matrix at time step t, conditioned on the previous step's adjacency matrix A_{t−1} and the current input value X_t. However, directly modeling the values of the adjacency matrices would require O(N^4) computational complexity and memory consumption (flattening the matrix results in a vector of length N^2, and mapping between two consecutive matrices requires N^4 operations), which is remarkably expensive for a large network. To reduce the computational complexity, we factorise the matrix into two low-rank matrices H_t, H'_t ∈ R^{d_e×N}, where d_e ≪ N, holding the states of the N variables at time step t. Instead of generating A_t, EGL learns to generate H_t and H'_t and approximates A_t by A_t = g(H_t, H'_t). In our implementation, we choose the function A_t = ReLU(tanh(H_t^T H'_t)). By using the ReLU and tanh functions, we aim to make the matrix A_t sparse, forcing the model to retain only important relations. The benefits of this factorisation are not limited to reducing computational complexity; it also maintains the low-rank property of A_t.
We use a GRU (Gated Recurrent Unit), a type of recurrent neural network, to model the evolution of H_t, where H_t is the hidden state of the GRU and the layer input is the GRU's input. Similarly, we use another GRU to model the evolution of H'_t. Specifically, we first embed X_t into embedding matrices E_t and E'_t ∈ R^{d_e×N} using linear transformations with learnable weights. Then, E_t and E'_t are fed into the two GRUs to compute the hidden states H_t and H'_t. Since a GRU originally operates on vectors, its design needs to be adapted to operate on the matrices X_t, H_t, and H'_t; for this purpose, we use the same approach presented in [26], and the GRU's computation is presented in the accompanying algorithm. The dynamic adjacency matrices built by this process contain the evolving relational structure between variables and will be used to extract the intra- and inter-channel relations, as described in Section 3.4.
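The factorised adjacency construction can be sketched as follows. This is a minimal illustration under a simplifying assumption: in ERAN, H_t and H'_t are GRU hidden states evolved over time, whereas here they are random matrices standing in for those states.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d_e = 6, 3                          # N variables, embedding size d_e << N

def relu(x):
    return np.maximum(x, 0.0)

# Low-rank state matrices (stand-ins for the GRU hidden states H_t, H'_t)
H  = rng.standard_normal((d_e, N))
Hp = rng.standard_normal((d_e, N))

# A_t = ReLU(tanh(H_t^T H'_t)): an N x N adjacency built from O(d_e * N) state
A = relu(np.tanh(H.T @ Hp))

assert A.shape == (N, N)
assert (A >= 0).all()                               # ReLU keeps only positive relations
assert np.linalg.matrix_rank(H.T @ Hp) <= d_e       # low-rank before the nonlinearity
```

The ReLU zeroes out negative correlations, which is what sparsifies A_t and keeps only the strong positive relations.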

ERAN block
The ERAN layer consists of multiple ERAN blocks stacked together to form a multi-layer network with skip connections. Each ERAN block is designed to capture the intra- and inter-channel relations by employing a Temporal Attentional Convolution (TAC) and a Temporal Graph Convolution (TGC). In this section, we introduce the motivation behind an ERAN block and its architecture.

TAC: Temporal Attentional Convolution Module
For local temporal pattern modeling, convolutional neural networks (CNNs) are a common choice for finding local motifs. A CNN learns a kernel that operates on a sliding window and computes the output from the input within a context. In a CNN, a single kernel is applied over the whole lifetime of a time series. However, a single kernel is insufficient to capture temporal dynamics where different parts of the same time series may have changing temporal patterns. Furthermore, once learned, the kernel is fixed, so it is poor at capturing temporal patterns when the test set and training set have different patterns.
To model such temporal dynamics, we design the temporal attentional convolution (TAC). Unlike a CNN, which aggregates the input within a local context using a fixed kernel, TAC employs a self-attention mechanism that learns to focus on important inputs when computing the output, enabling it to capture changes in temporal patterns over time. Attention mechanisms, with their ability to learn to focus on important parts within a context, have shown their effectiveness in natural language processing and computer vision. We extend that concept to multivariate time series, where we need to deal with N sequences corresponding to N variables.
For each sequence, the data point at each time step attends to its neighbors in the same sequence within a given window of size w (w = 3, 5, 7, ...), i.e., each time step attends to (w − 1)/2 time steps on each side and to itself. Specifically, at time step t, the attention region is {t − (w − 1)/2, ..., t, ..., t + (w − 1)/2}. This approach is in contrast to global attention, where each data point attends to all data points across all time steps. By limiting attention to a local region, the proposed mechanism reduces the computational complexity and memory consumption required for processing the sequence data. An example of TAC operating in a window of size w = 5 is illustrated in Fig. 2.
Given the input x_{i,t} of the i-th series at time step t, a single-headed attention computes the output feature z_{i,t} ∈ R^{d_out} as:

z_{i,t} = Σ_{t' ∈ R(t)} softmax_{t'}( q_{i,t}^T (k_{i,t'} + α_{t'−t}) ) v_{i,t'},   (5)

where R(t) is the attention region of t, and the queries q_{i,t}, keys k_{i,t'}, and values v_{i,t'} are linear transformations of x_{i,t} and its neighbourhood, computed as follows.
Here, W_Q, W_K, W_V ∈ R^{d_out×d_in} are learnable parameters, and α_{t'−t} ∈ R^{d_out} is the relative position embedding associated with the neighbor t' of t. The relative position embedding is introduced here to capture the temporal ordering of the input; it was proposed in [31], where its effectiveness over absolute position embeddings was suggested.
The way we compute z_{i,t} (see (5)) is similar to that of a convolutional operator across the temporal dimension. However, instead of using a fixed kernel, the kernel weights are computed from the content of the variables and their underlying dependencies, making the model adaptive to the dynamics of the temporal patterns in the data.
Multi-headed TAC In practice, multiple attention heads can be used to compute different representation sets from the input. To achieve this, each single-headed attention h uses its own linear transformations W_Q^h, W_K^h, W_V^h, where h = 1, ..., H and H is the number of attention heads. The head outputs are then concatenated to obtain the final output of the multi-headed attention: z_{i,t} = concat(z_{i,t}^1, ..., z_{i,t}^H).

Dilated TAC When m ERAN blocks are stacked together, the receptive field is m(w − 1) + 1. To further increase the receptive field, we use a "dilated" window, where neighbor data points are skipped regularly, similar to the dilated convolution presented in [24]. We use a dilation factor p to control the dilation of each layer, where the dilation of the i-th ERAN block is p^{i−1}. For example, if the dilation factor is 2, then the dilation of the first layer is 1, the second layer 2, the third layer 4, and so on. In summary, in a network of m ERAN blocks, the receptive field is 1 + (w − 1) Σ_{i=1}^{m} p^{i−1}, assuming the window size w is fixed for all layers.
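The single-headed local attention of (5) can be sketched as follows. All weights and relative position embeddings are random here, and the function and parameter names are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

def tac_step(x, t, W_q, W_k, W_v, alpha, w=5):
    """Output z_{i,t} of local attention for one series x of shape (d_in, T).
    alpha[j] is the relative position embedding for offset (t' - t) = j - w//2."""
    half = w // 2
    lo, hi = max(0, t - half), min(x.shape[1] - 1, t + half)
    q = W_q @ x[:, t]
    # Score each neighbor with its key plus the relative position embedding
    scores = np.array([q @ (W_k @ x[:, tp] + alpha[tp - t + half])
                       for tp in range(lo, hi + 1)])
    values = np.array([W_v @ x[:, tp] for tp in range(lo, hi + 1)])
    return softmax(scores) @ values          # attention-weighted sum of values

rng = np.random.default_rng(1)
d_in, d_out, T, w = 4, 8, 20, 5
x = rng.standard_normal((d_in, T))
W_q, W_k, W_v = (rng.standard_normal((d_out, d_in)) for _ in range(3))
alpha = rng.standard_normal((w, d_out))      # one embedding per relative offset
z = tac_step(x, 10, W_q, W_k, W_v, alpha, w)
assert z.shape == (d_out,)
```

Unlike a CNN kernel, the weights applied to each neighbor here depend on the content at step t, which is what makes the operator adaptive.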

TGC: Temporal graph convolution module
Graph convolutional networks (GCNs) generalise convolutional neural networks (CNNs) to graph-structured data, such as social networks or protein structures [4,8,16]. Given node features and a graph structure, a GCN generates output node features that capture the spatial dependencies between nodes. While there are existing works that use graph convolutions for time series forecasting, most of them use a pre-defined graph, or a static graph learned from data, for the entire time series. In contrast, we use an evolving graph, where each time step has a different adjacency matrix computed by the EGL layer. At each step, graph convolutions are performed with the corresponding adjacency matrix.
There are many ways to perform graph convolutions, such as spectral graph convolution [4], graph convolutional networks [16], and the approximation of convolutions using Chebyshev polynomials [8]. Here, we use the diffusion graph convolution proposed in [20] for its effectiveness in capturing inter-series relations. This formulation captures the relations of node features over K graph convolution iterations. Given node features X ∈ R^{d_in×N} and the learned adjacency matrix A, the output node features Z ∈ R^{d_out×N} are calculated as:

Z = Σ_{k=0}^{K} (W^{(k)})^T X A^k,   (8)

where A^k ∈ R^{N×N} is the k-th power of the adjacency matrix A, and W^{(k)} ∈ R^{d_in×d_out} are the learnable parameters of the k-th convolution iteration.
The graph diffusion convolution in (8) is applied at each step of the i-th ERAN block, given the list of adjacency matrices A_1, A_2, ..., A_L. In detail, at the i-th ERAN block, the convolution at time step t is

Z_t = Σ_{k=0}^{K} (W^{(k)})^T X_t A_t^k,

where Z_t ∈ R^{d_out×N} and X_t ∈ R^{d_in×N} are the output and input node features at time step t, respectively. It is important to note that at the i-th ERAN block, the length of the temporal model's output is L_i, which is smaller than L. Therefore, only the last L_i adjacency matrices are used.
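The diffusion graph convolution of (8) can be sketched with random inputs as follows; the K+1 weight matrices (one per power of A) and the column normalisation of A are illustrative assumptions:

```python
import numpy as np

def diffusion_gconv(X, A, Ws):
    """Z = sum_k (W^(k))^T X A^k for k = 0..K.
    X: (d_in, N) node features, A: (N, N) adjacency, Ws: list of (d_in, d_out)."""
    Z = np.zeros((Ws[0].shape[1], X.shape[1]))
    Ak = np.eye(A.shape[0])                  # A^0 = I
    for W in Ws:
        Z += W.T @ X @ Ak                    # mix features, then diffuse over the graph
        Ak = Ak @ A                          # next power of A
    return Z

rng = np.random.default_rng(2)
d_in, d_out, N, K = 4, 6, 5, 2
X = rng.standard_normal((d_in, N))
A = rng.random((N, N))
A /= A.sum(axis=0, keepdims=True)            # column-normalise for stable diffusion
Ws = [rng.standard_normal((d_in, d_out)) for _ in range(K + 1)]
Z = diffusion_gconv(X, A, Ws)
assert Z.shape == (d_out, N)
```

In ERAN this function would be called once per time step t with the corresponding A_t from the EGL layer.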

Residual and skip connections
The classical residual network (ResNet) architecture introduced a skip connection that adds the input tensor to the output tensor of a stack of layers, which is then passed to the next stack [13]. This helps alleviate the problem of vanishing gradients and improves the performance of deep neural networks. We adopt a similar approach in our proposed model to enhance its trainability. However, due to the downsampling effect of the TAC module, the length of the residual tensor may be shorter than that of the input tensor. To ensure that the input tensor and the residual tensor have the same dimensions for the addition, we truncate the input tensor along the temporal axis to match the residual tensor, i.e., only the last L_i steps of the input are added, where L_i is the length of R^(i), the residual output of the i-th ERAN block. Note that L_i is also the length of the input of the (i + 1)-th ERAN block.
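The truncated residual addition can be sketched as follows, assuming (for illustration only) a tensor layout of (channels, variables, time):

```python
import numpy as np

def residual_add(block_input, residual):
    """Add the block input to the (shorter) residual output, truncating the
    input to the residual's last L_i steps along the temporal axis."""
    L_i = residual.shape[-1]
    return residual + block_input[..., -L_i:]

x_in  = np.random.randn(8, 5, 12)    # block input, length L = 12
r_out = np.random.randn(8, 5, 10)    # residual output, shortened by the TAC
y = residual_add(x_in, r_out)
assert y.shape == (8, 5, 10)
```

Keeping the last L_i steps (rather than the first) aligns the residual with the most recent observations, matching the causal direction of the temporal convolution.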
The skip connection at each ERAN block consists of a 2D convolution with a kernel size of (1, L_i). Its purpose is to combine all time steps of C^(i), the skip output of the i-th TAC, into a single-step tensor in R^{D_c×N}.

Multi-step forecasting
The outputs of the skip connections are summed and fed to the output layer, which generates the prediction of the future values Ŷ_f.
The output layer consists of two 2D convolutions with a kernel size of (1, 1), which translate the input dimension (D_c) to the forecasting horizon dimension. In other words, the dimensionality of the output layer's input is D_c, while the dimensionality of its output is τ.

Experimental settings
To measure the time-series forecasting performance of ERAN, we use public datasets ranging from traffic flow and electricity consumption to COVID-19 transmission. A summary of these datasets is presented in Table 1.

Baseline methods
To verify the effectiveness of the proposed model, we compare it with state-of-the-art methods for time-series forecasting from the following groups: (i) traditional methods, (ii) deep learning-based methods, (iii) methods that use a pre-defined graph, and (iv) methods that automatically generate a graph from the data.The details of the baselines are as follows.
• LSTM [14]: A type of recurrent neural network (RNN) architecture designed to capture and remember longrange dependencies in sequential data.
• GRU [6]: A recurrent neural network for sequence data that efficiently learns dependencies and avoids gradient issues.
• TCN [3]: A model for sequential data using convolutional neural networks.
• DCRNN [20]: A convolutional recurrent neural network that combines graph convolutions with recurrent neural networks.
• ST-GCN [36]: A spatial-temporal graph convolutional network that combines graph convolutions with 1D convolutions.
• GWNet (Graph WaveNet) [34]: A spatial-temporal graph convolutional network that combines graph convolutions with 1D dilated convolutions.
• MT-GNN [35]: A method that learns to generate a static graph from the data and then combines graph convolutions with 1D convolutions.
• StemGNN [5]: A method that learns to generate a static graph from the data and uses graph neural networks and 1D convolutions in the spectral domain.
• AGCRN [2]: A method that learns to generate a static graph from the data and combines graph convolutions with recurrent neural networks.

Implementation details
In our proposed model, ERAN, we utilized a three-layer architecture with input and output dimensionalities of 128.
The selection of window size and dilation factor was contingent upon the length of the historical window. For a long historical window, our goal is an expansive receptive field that covers the entire sequence, so we use a long window size and a large dilation factor. In contrast, for a shorter historical window, a smaller window size and dilation factor are sufficient. In detail, for data with length smaller than or equal to 12 steps, we used a window size of 3 and a dilation factor of 1, while for data with length greater than 12 steps, we used a window size of 5 and a dilation factor of 2. We employed the Adam optimiser with a learning rate of 0.001 and weight decay of 0.0001. All deep learning-based models were implemented using PyTorch and trained on a machine equipped with a single NVIDIA GPU. We halted training after 100 epochs and reported results on the test set for the epoch that produced the lowest loss on the validation set.

Overall comparison
Tables 2, 3 and 4a provide a comprehensive comparison of the methods across the datasets. The results show that ERAN outperforms all competing methods on all datasets. In addition, we make the following observations: 1. In general, deep learning-based methods perform better than traditional methods, except for VAR, which performs comparably to some deep learning-based methods on certain traffic flow datasets. 2. On datasets where pre-defined graphs are unavailable, models that depend on them are excluded from the comparison. 3. By learning an evolving graph, ERAN captures dynamic dependencies between variables more effectively than methods that rely on a static graph.
To evaluate long-term forecasting ability, we report the mean absolute error (MAE) of the predictions at different forecasting horizons for the PEMS-04 and Electricity datasets in Fig. 3. Our results demonstrate that ERAN outperforms all competing methods across all forecasting horizons, showing its effectiveness in predicting long-term trends. In particular, for the longest horizon (60 minutes), the difference between ERAN and the other methods is significant, emphasising ERAN's superior long-term forecasting ability.

Significance testing
To assess the statistical significance of our results, we conducted pairwise hypothesis tests comparing the absolute errors generated by ERAN against those produced by the second-best model. The hypotheses were formulated as follows: H_0 (null hypothesis) posits that there is no difference between the mean absolute errors (MAE) of the two algorithms, while H_a (alternative hypothesis) asserts that the mean absolute errors of the two algorithms are statistically different.
A significance level of 0.05 (corresponding to a confidence level of 95%) was used to assess the p-values: if a p-value is smaller than this threshold, the null hypothesis is rejected.
The t-test results are outlined in Table 5. Notably, all pairwise tests exhibit p-values well below 0.05. This evidence supports the rejection of the null hypothesis, signifying a statistically significant difference between ERAN's MAE and that of the next best model.

Impact on non-stationary time series
To evaluate the efficacy of ERAN on non-stationary time series specifically, we investigate the stationarity of the time series and compare how ERAN performs against existing models in both stationary and non-stationary contexts. We begin by examining stationarity using the Augmented Dickey-Fuller (ADF) test [9], a widely employed statistical method for determining whether a given time series is stationary. The test formulates a null hypothesis assuming the presence of a unit root, indicating non-stationarity, and an alternative hypothesis suggesting stationarity. The test statistic is compared to critical values, and the resulting p-value determines the outcome: a p-value below a chosen significance level (commonly 0.05) leads to the rejection of the null hypothesis, providing evidence in favor of stationarity, whereas a p-value exceeding the significance level fails to reject the null hypothesis, implying non-stationarity in the time series. The results of the ADF test are presented in Table 6.

Ablation study
To gain more insight into the proposed model, we conducted an ablation study to evaluate the impact on performance of (i) learning the dynamics of the inter-channel relations, (ii) learning the dynamics of the intra-channel relations, (iii) the number of layers, and (iv) the dilation factor. Table 8 reports the results on the COVID-19 dataset. We observe that the Evolving graph variant, which uses the evolving graph, performs better than the other variants, demonstrating the importance of capturing the evolution of inter-channel relations over time. Additionally, we see that the Static graph variant, which can only capture a static relation between the variables, still improves forecasting accuracy over using no graph. Finally, the No graph variant performs the worst among the three, emphasising the effectiveness of capturing the relations between variables using graph convolution.

Impact of learning the dynamics of the intra-channel relations
As learning the dynamics of intra-channel relations is accomplished through the TAC module, we investigated the impact of this learning approach by comparing TAC with LSTM and TCN, which capture only invariant temporal patterns. Note that TAC alone is obtained by removing the graph from ERAN, making it equivalent to ERAN's "no graph" variant presented in Section 4.5.1.
The results in Table 9 show that TAC significantly outperforms LSTM and TCN on the COVID transmission data, a highly non-stationary time series. This indicates the effectiveness of TAC in capturing the dynamics of the temporal patterns.
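The intuition behind attending within a temporal window, where the mixing weights adapt to the input rather than being fixed as in a TCN kernel, can be illustrated with a minimal sketch. This is a simplified stand-in, not ERAN's actual TAC module (no learned query/key/value projections, no multi-head structure):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_window_attention(x, window):
    """Self-attention restricted to a causal temporal window.

    x: (T, d) sequence. Each step attends to itself and the previous
    `window - 1` steps; the attention weights depend on the input,
    unlike the fixed kernel weights of a standard convolution.
    """
    T, d = x.shape
    out = np.zeros_like(x)
    for t in range(T):
        lo = max(0, t - window + 1)
        keys = x[lo:t + 1]                    # local causal context
        scores = keys @ x[t] / np.sqrt(d)     # scaled dot-product scores
        w = softmax(scores)                   # input-dependent weights
        out[t] = w @ keys                     # weighted sum over the window
    return out

x = np.random.default_rng(1).normal(size=(10, 4))
y = temporal_window_attention(x, window=3)
print(y.shape)  # (10, 4)
```

Because the weights are recomputed from the data at every step, the same module can respond differently before and after an abrupt change, which fixed-kernel models cannot.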

Impact of the number of layers
One important parameter of ERAN is the number of layers. To demonstrate its impact, we plot the MAE on COVID-19 US death tolls in Fig. 4. The optimal forecasting accuracy is achieved when the number of layers is 3. With a small number of layers, the model's capacity is too limited to learn the data; increasing the number of layers raises the capacity but makes the model prone to over-fitting.

Impact of the dilation factor
We use a dilation factor p to control the dilation at each layer: the dilation of each layer is p times that of the previous one, so the dilation of layer l is p^(l-1). Table 10 shows the forecasting accuracy on COVID-19 in the US using dilation factors 1 and 2. Dilation factor 2 achieves slightly better forecasting accuracy than dilation factor 1, confirming the effectiveness of using a dilation factor larger than 1 to extend the receptive field.
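The geometric growth of the dilation, and its effect on the receptive field, can be computed directly. The receptive-field formula below assumes a standard stack of dilated causal convolutions with kernel size k; the exact formula for ERAN's blocks may differ:

```python
def layer_dilations(num_layers, p):
    """Dilation of layer l is p**(l-1), as stated in the text."""
    return [p ** (l - 1) for l in range(1, num_layers + 1)]

def receptive_field(num_layers, kernel_size, p):
    """Receptive field of a stack of dilated causal convolutions:
    1 + (k - 1) * (sum of all layer dilations)."""
    return 1 + (kernel_size - 1) * sum(layer_dilations(num_layers, p))

print(layer_dilations(4, 2))      # [1, 2, 4, 8]
print(receptive_field(4, 3, 2))   # 31
print(receptive_field(4, 3, 1))   # 9
```

With p = 2 the receptive field grows exponentially in depth, while p = 1 grows it only linearly, which is why a larger dilation factor covers longer histories with the same number of layers.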

Qualitative analysis
To gain more insight into the behavior of the methods, we plot in Fig. 5 two examples of daily new cases forecasting. Overall, in both examples, ERAN predicted the ground truth data better than the other methods, and was also better at capturing the trend as the ground truth went up and down. However, in both cases, there were some short but sharp spikes that no method could capture well.

Conclusion
We propose ERAN, a model that captures the dynamics of the intra- and inter-channel relations for multivariate time series forecasting. To model the intra-channel relations, ERAN utilises Temporal Attentional Convolution (TAC), which applies a self-attention mechanism within a temporal window. To model the inter-channel relations, ERAN uses a dynamic graph convolutional network wherein the graph structure evolves over time. Our experiments establish new state-of-the-art results on multiple types of time series data, from classical traffic flow and electricity consumption forecasting to newly emerging problems such as COVID-19 projections. Furthermore, ERAN exhibits significant improvement over existing methods, particularly on non-stationary time series. The representation power and generality of the model promise strong and wide applications in time series modeling. A notable limitation of the proposed model is that it currently excludes time-dependent covariates, such as weather and price indices, as inputs for forecasting. We identify the incorporation of these covariates as a potential avenue for improvement, which we leave for future work.

Fig. 1
Fig. 1 (a) The high-level architecture of the ERAN model. ERAN is composed of an Evolving Graph Learning (EGL) layer, which learns to generate the evolving adjacency matrices, and multiple ERAN blocks stacked together. Residual connections and skip connections are used to prevent vanishing gradients. (b) An ERAN block consists of two main components: a Temporal Attentional Convolution (TAC) and a Temporal Graph Convolution (TGC), which are integrated to capture the intra- and inter-channel relations

Table 1
Summary of the datasets

Table 3
Results on the energy consumption forecasting (historical win-

Table 5
Significance testing results between ERAN and the Next best model for each dataset

Table 6
The results of ADF test

Table 7
MAE Improvement of ERAN relative to the second best model across datasets

Table 8
Ablation study on the impact of the graphs. Static graph: the inter-channel relations are modelled by a static graph generated from the data, which is unchanged throughout the time series lifetime. No graph: inter-channel relations are ignored

Table 9
Study on the impact of TAC

Table 10
Ablation study on the impact of the dilation factor