1 Introduction

The ability to forecast the future is a valuable tool across a wide range of applications, such as finance, energy, and industry. Forecasting allows for better decision-making in the present, and even small improvements in accuracy can often provide great benefits.

The Transformer [1] has recently become the dominant method for most Natural Language Processing (NLP) tasks [2, 3]. It has also been successfully applied to a diverse set of challenging problems outside of NLP, such as protein folding [4] and Reinforcement Learning [5]. However, relatively little attention has been given to the use of Transformers for time series forecasting. Most prior work in this direction has focused on addressing the computational limitations of the Transformer by proposing computationally efficient alternatives to regular attention [6,7,8]. In contrast, this work is primarily focused on improving the forecasting accuracy of the Transformer on time series with shorter forecasting horizons, by addressing differences between time series data and text data.

Time series forecasting and natural language modeling might at first glance appear to be highly similar; they both form ordered sequences, and forecasting the next step of a time series can be seen as analogous to predicting the next token in a language modeling task.

However, there are also important differences between time series data and text data. First, time series forecasting is a continuous regression problem, while language modeling is a discrete classification problem. Consequently, in order to use the Transformer to forecast, the final softmax activation layer must be removed. We argue that this makes the model more sensitive to how the weights are initialized, as the initial forecasts will now be proportional to the weights of the final linear layer. Conversely, models including a final softmax layer are likely to be more robust to weight initialization. A random initialization of such a model likely has an approximately uniform output distribution, which is arguably a good starting point for a reasonably balanced classification task. Second, time series data often do not have any particular semantics associated with the beginning or conclusion of a sequence. In other words, for time series data, there is in general no reason to assume any meaning from the fact that the sequence started or ended at some particular point in time. As a consequence, time series sequences can often be subdivided into smaller sub-sequences, a technique which is commonly referred to as windowing. In contrast, for text sequences, both the start and the end of a sequence have semantic meaning, because they signify the bounds of a connected body of text.

We propose three modifications of the Transformer architecture, directly motivated by these differences. First, we propose an adaptation called Persistence Initialization (PI), which aims to improve the Transformer’s ability to forecast. It has long been known that initialization is an important component in the process of training deep neural networks [9,10,11]. Persistence Initialization works by implicitly initializing the model in such a way that the initial forecasts (before training) become equal to the forecasts of a persistence model. The persistence model, also known as a random walk method [12], is defined by letting the forecast \(\hat{x}_{t+1}\) be equal to the previous value \(x_t\). In order to implement PI, we add two components: a residual skip connection, and a scalar multiplicative gating parameter \(\gamma \). The residual skip connection has the effect of adding the value at time t (i.e. \(x_t\)) to the forecast value for time \(t+1\) (i.e. \(\hat{x}_{t+1}\)). The scalar multiplicative gating parameter \(\gamma \) is initialized to 0, and is multiplied with the outputs of the Transformer. As a consequence of this combination, only the skip connection contributes to the initial forecasts, which means that any complex model can be effectively initialized as a persistence model, regardless of the values of the randomly initialized parameters within the model.

Our second proposed adaptation attempts to further improve training stability by replacing the commonly used Layer Normalization [13] layer with ReZero normalization [14]. ReZero is a technique designed to improve the training stability of deep networks, and was proposed as an alternative to normalization layers such as Layer Norm and Batch Norm. Note that while the implementation of Persistence Initialization is almost identical to that of ReZero, these techniques are intended to solve different problems. The goal of ReZero normalization is to control the magnitude of gradients in deep networks, while the goal of Persistence Initialization is to improve the Transformer’s forecasting accuracy by providing an inductive bias towards models with a significant autoregressive component.

Our third proposed adaptation is related to the difference in the semantics of the time series sequences, compared to natural language sequences. Instead of using the absolute sinusoidal encoding [1], we propose to use the relative Rotary Encoding [15], which has been shown to outperform the sinusoidal encoding in some NLP tasks [15]. In the context of time series, we argue that a relative positional encoding provides a better inductive bias for forecasting. Time series sequences are often “windowed”, which means that the absolute position within the window has no semantic significance. Consequently, absolute positional encodings are ill-suited for forecasting, as they put undue emphasis on an arbitrary location in the sequence. In contrast, a relative encoding emphasizes the position of the outputs, i.e. the forecasts, which should result in a better inductive bias for forecasting.

In summary, our contributions are:

  1. We propose Persistence Initialization, a novel and general adaptation for autoregressive time series forecasting with neural networks. This adaptation initializes the model such that it starts off as a persistence model, which provides a good starting point for further learning.

  2. We propose the PI-Transformer architecture, a Transformer architecture with three main modifications: Persistence Initialization, ReZero normalization, and Rotary positional encodings. We perform two ablation studies to verify the importance of each modification. The first ablation study compares the effects of the components of Persistence Initialization, and the second compares the effect of positional encoding and normalization layers. Both studies show that the proposed modifications are necessary for good forecasting performance.

  3. We evaluate PI-Transformer on the challenging M4 forecasting dataset, and show that PI-Transformer achieves competitive accuracy, outperforming the winner of the original M4 competition. Furthermore, PI-Transformer is highly accurate without the need for a large ensemble of models, in contrast to other state-of-the-art methods on the M4 dataset. To the best of our knowledge, this is the first time a Transformer model has been successfully used to forecast the complete M4 dataset to a high degree of accuracy. We also compare PI-Transformer with recent existing Transformer architectures for time series forecasting, and show that PI-Transformer outperforms these by a large margin on the M4-Hourly dataset.

In order to ensure reproducibility, all the code related to our work is publicly available (Footnote 1). The rest of the paper is organized as follows: Section 2 provides background on the Transformer, Section 3 reviews existing related work, Section 4 describes our proposed adaptations, Section 5 describes the experiments, and Section 6 provides analysis and discussion of the results. Finally, Section 7 concludes with a summary.

2 Background

2.1 Decoder-only transformer

A decoder-only Transformer [1] consists of blocks of causal self-attention layers and feedforward layers, each followed by a residual skip connection and Layer Normalization [13]. The model can be defined recursively by letting \(X_i\) be the output of the ith block, as follows:

$$\begin{aligned} X_i(X_{i-1})&= \text {FF}_i ( \text {SA}_i (X_{i-1})) \end{aligned}$$
(1)
$$\begin{aligned} \text {SA}_i(X)&= \text {LayerNorm}(X + \text {SelfAttention}_i(X) ) \end{aligned}$$
(2)
$$\begin{aligned} \text {FF}_i(X)&= \text {LayerNorm}(X + \text {FeedForward}_i(X)), \end{aligned}$$
(3)

where \(X_i\) is a matrix of shape \(L \times d_\text {model}\), with L representing the “sequence” or “time” dimension and \(d_\text {model}\) representing the feature dimension. \(X_0\) is the base case of the recursion, and represents the initial input to the model. The number of blocks N is a hyperparameter, and the final output of the model is \(X_N\).

In order to define self-attention, we must first define multi-head attention. Multi-head attention combines multiple attention heads, by giving each head a separate set of learnable weights, ensuring that each head can perform a different operation. Self-attention is then defined as a special case of multi-head attention where the keys, queries, and values are all equal:

$$\begin{aligned} \text {SelfAttention}(X)&= \text {MHA}(X, X, X) \end{aligned}$$
(4)
$$\begin{aligned} \text {MHA}(Q, K, V)&= \text {Concat}(\text {head}_1, \ldots , \text {head}_h)W_O \end{aligned}$$
(5)
$$\begin{aligned} \text {head}_j&= \text {Attention}\left( QW_Q^{(j)}, KW_K^{(j)}, VW_V^{(j)}\right) \end{aligned}$$
(6)
$$\begin{aligned} \text {Attention}(Q,\ K,\ V)&= \text {softmax}\left( \frac{Q K^T}{\sqrt{d_\text {head}}} + M\right) V, \end{aligned}$$
(7)

where h is the number of attention heads, and \(d_\text {head} = d_\text {model}/h\). The learnable weight matrices \(W_Q^{(j)}\), \(W_K^{(j)}\), and \(W_V^{(j)}\) are of shape \(d_\text {model} \times d_\text {head}\). \(W_O\) is a learnable weight matrix of shape \(d_\text {model} \times d_\text {model}\), and has the effect of mixing the outputs of each head. M is an upper triangular masking matrix, with \(-\infty \) above the diagonal and 0 elsewhere, which ensures that the model does not attend to “future” time steps.

The feed-forward layer is applied point-wise, i.e. it only considers information in the current time step, like a 1-D convolution with kernel size 1. It is defined as two affine transformations with a ReLU non-linearity in between:

$$\begin{aligned} \text {FeedForward}(X)&= \text {ReLU}(XW_1 + b_1) W_2 + b_2, \end{aligned}$$
(8)

where \(W_1\) and \(W_2\) are learnable weight matrices of shape \(d_\text {model}~\times ~d_\text {ff}\) and \(d_\text {ff}~\times ~d_\text {model}\), and \(b_1\) and \(b_2\) are learnable bias vectors of shape \(d_\text {ff}\) and \(d_\text {model}\).
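To make the block structure concrete, the following is a minimal PyTorch sketch of one post-LN decoder block corresponding to (1)-(8); the class name and hyperparameter values are ours, and `nn.MultiheadAttention` is used in place of an explicit multi-head implementation (it already contains the \(W_O\) projection).

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One post-LN decoder block: SA_i followed by FF_i, as in (1)-(3)."""
    def __init__(self, d_model: int, n_heads: int, d_ff: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(  # (8): two affine maps with a ReLU in between
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        L = x.size(1)
        # Causal mask M: -inf above the diagonal, so position t cannot attend to t' > t.
        mask = torch.triu(torch.full((L, L), float("-inf"), device=x.device), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)   # SelfAttention_i(X)
        x = self.norm1(x + attn_out)                       # (2)
        x = self.norm2(x + self.ff(x))                     # (3)
        return x

# A stack of N = 2 blocks applied to X_0 of shape (batch, L, d_model):
x0 = torch.randn(8, 32, 64)
blocks = nn.Sequential(*[DecoderBlock(64, 4, 256) for _ in range(2)])
xn = blocks(x0)  # X_N
```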

2.2 Positional encoding

The Transformer cannot distinguish elements of a sequence based on their ordering, because the attention operation is permutation invariant. Consequently, it is necessary to provide explicit positional information to the model, which is the purpose of the positional encoding.

2.2.1 Absolute positional encoding

The sinusoidal positional encoding is one of the most commonly used absolute positional encodings. The encoding is applied by adding it to the inputs of the model, and can be implemented by creating a matrix E of size \(L \times d_\text {model}\), where L is the sequence length. Each row contains \(d_\text {model}/2\) pairs of sine and cosine functions with varying wavelengths, with each pair sharing the same wavelength. The wavelength is increased geometrically for each pair, which can be written as follows:

$$\begin{aligned} E_{i,2j}&= \sin {\left( \frac{i}{K^{2j/d_\text {model}}} \right) } \end{aligned}$$
(9)
$$\begin{aligned} E_{i,2j+1}&= \cos {\left( \frac{i}{K^{2j/d_\text {model}}} \right) } \end{aligned}$$
(10)

where \(i \in [1, L]\) is the position in the sequence, and \(j \in [1, d_\text {model}/2]\) is the index of the feature dimension. The value of K determines what the largest wavelength will be, and is commonly set to 10000.
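As a concrete illustration, the following is a short sketch of the encoding matrix E from (9)-(10), assuming an even \(d_\text {model}\) and 0-based indexing; the function name is ours.

```python
import torch

def sinusoidal_encoding(L: int, d_model: int, K: float = 10000.0) -> torch.Tensor:
    """Matrix E of shape (L, d_model): sin/cos pairs with geometrically increasing wavelength."""
    pos = torch.arange(L, dtype=torch.float32).unsqueeze(1)   # position i
    col = torch.arange(0, d_model, 2, dtype=torch.float32)    # even column indices 2j
    angle = pos / K ** (col / d_model)                        # i / K^{2j / d_model}
    E = torch.zeros(L, d_model)
    E[:, 0::2] = torch.sin(angle)
    E[:, 1::2] = torch.cos(angle)
    return E

# Applied additively to the model inputs: X_0 = X_0 + sinusoidal_encoding(L, d_model)
```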

2.2.2 Rotary encoding

The Rotary Encoding [15] is a relative positional encoding, which was introduced as an alternative to the absolute positional encoding and other previously proposed relative positional encodings. It encodes relative positional information in the angle between the key and query vectors. This is different to most other positional encodings, which are typically additive. The encoding function f is derived by considering a relation between \({\textbf{q}}_m\), a query vector at position m, and \({\textbf{k}}_n\), a key vector at position n. The dot product of the encoded vectors should be equal to the output of some function g, which only depends on the original vectors and the relative distance between their positions:

$$\begin{aligned} f(\textbf{q}_m, m) \cdot f(\textbf{k}_n, n)&= g(\textbf{q}_m,\ \textbf{k}_n,\ m - n) \end{aligned}$$
(11)

We will consider only the case of 2D vectors \(\textbf{q}\) and \(\textbf{k}\). (The general case of an arbitrarily sized vector is more cumbersome to state, but is a straightforward generalization of the 2D case.) The desired relation can be achieved by the following encoding:

$$\begin{aligned} f(\textbf{x}, k)&= \textbf{x} e^{ik\theta }, \end{aligned}$$
(12)

where the 2D vector \(\textbf{x}\) is considered as a number in the complex plane, and \(\theta \) is a real non-zero constant. This encoding satisfies the desired relation with the following function:

$$\begin{aligned} g(\textbf{q}_m, \textbf{k}_n, m - n)&= \text {Re}[\textbf{q}_m \textbf{k}_n^* e^{i(m-n)\theta }], \end{aligned}$$
(13)

where \(\text {Re}[\cdot ]\) is the real part of the complex number and \(\textbf{k}_n^*\) is the complex conjugate of \(\textbf{k}_n\).
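The following is a small numeric sketch of the 2D case, treating feature pairs as complex numbers as in (12) and checking that the rotated dot product depends only on the relative distance \(m - n\), as required by (11); the function name and the value of \(\theta \) are ours.

```python
import torch

def rotate_pairs(x: torch.Tensor, pos: torch.Tensor, theta: float) -> torch.Tensor:
    """Apply f(x, k) = x * e^{i k theta}, treating feature pairs as complex numbers (12)."""
    xc = torch.view_as_complex(x.float().reshape(*x.shape[:-1], -1, 2))
    phase = torch.exp(1j * pos[..., None] * theta)
    return torch.view_as_real(xc * phase).flatten(-2)

# Check relation (11): the rotated dot product depends only on m - n.
q, k = torch.randn(2), torch.randn(2)
theta = 0.5
for m, n in [(3, 1), (7, 5)]:   # both pairs have relative distance m - n = 2
    qm = rotate_pairs(q, torch.tensor(float(m)), theta)
    kn = rotate_pairs(k, torch.tensor(float(n)), theta)
    print(torch.dot(qm, kn))    # same value for both pairs (up to floating-point error)
```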

2.3 Normalization

Training deep neural networks can be challenging, due to issues such as the vanishing gradient problem. Normalization layers [13, 16] have been proposed to speed up the training process, and to make training more robust to random weight initialization. The Transformer architecture also includes normalization layers, specifically Layer Normalization [13]. There are mainly two alternatives for the location of the normalization layer within the architecture; the first is post-layer normalization, and the second is pre-layer normalization. The original Transformer [1] used post-layer normalization, however pre-layer normalization has been found by some to lead to more effective training [17]. The two orderings can be formalized as follows. Let \(\text {Sublayer}(\cdot )\) refer to either the multi-head attention or the feedforward layers of the Transformer. Then we have:

$$\begin{aligned} \text {PostLN}(x)&= \text {LayerNorm}(x + \text {Sublayer}(x)) \end{aligned}$$
(14)
$$\begin{aligned} \text {PreLN}(x)&= x + \text {Sublayer}(\text {LayerNorm}(x)) \end{aligned}$$
(15)

ReZero [14] is an alternative to Layer Normalization for training deep networks. Instead of calculating statistics and using these to normalize the data, it simply uses a multiplicative gating parameter \(\alpha \), which is initially set to 0:

$$\begin{aligned} \text {ReZero}(x)&= x + \alpha \cdot \text {Sublayer}(x) \end{aligned}$$
(16)

The same \(\alpha \) parameter is used within a single Transformer block, both for the multi-head attention layer and for the feedforward layer.
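The three placements can be contrasted in a few lines of PyTorch; this is only an illustrative sketch in which a single linear layer stands in for the attention or feedforward sublayer, and in a real model \(\alpha \) would be registered as a module parameter.

```python
import torch
import torch.nn as nn

d_model = 64
sublayer = nn.Linear(d_model, d_model)   # stands in for attention or feedforward
norm = nn.LayerNorm(d_model)
alpha = nn.Parameter(torch.zeros(1))     # ReZero gate, initialized to 0

def post_ln(x):  # (14)
    return norm(x + sublayer(x))

def pre_ln(x):   # (15)
    return x + sublayer(norm(x))

def rezero(x):   # (16): no statistics, just a learnable scalar gate
    return x + alpha * sublayer(x)

x = torch.randn(8, d_model)
assert torch.allclose(rezero(x), x)      # at initialization, the ReZero block is the identity
```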

3 Related work

3.1 Transformers for time series forecasting

Following the introduction of the Transformer [1] in 2017, researchers also started to use Transformers for time series tasks. However, the topic of using Transformers for time series tasks has received relatively little research attention, compared to the use of Transformers for other kinds of data.

The main focus of the work on Transformers for time series has been on the quadratic computational complexity of the attention operation. In the context of NLP, there are numerous works attempting to address this issue, see for instance the recent survey by Tay et al. [18]. We will not attempt to summarize this line of work here, but instead focus on works that specifically target time series forecasting problems.

The first work in the direction of efficient Transformers for time series was by Li et al. [6], who introduced the LogSparse Transformer. The architecture improves the efficiency of the attention operation by removing queries that are far away from the current time step, by exponentially increasing the space between consecutive queries, resulting in a complexity of \(O(N\log N)\) instead of \(O(N^2)\). Moreover, a causal convolution layer was added before the attention layer, to allow the model to easily discover similarities between ranges of time series points.

The Informer [7] is a Transformer designed for the task of Long Sequence Time-Series Forecasting, which was defined by the authors as forecasting horizons of size 48 or longer. The authors propose an efficient attention operation, which first approximates the query-key similarity and then selects the most important queries, resulting in \(O(N\log N)\) computational complexity. The model uses an encoder-decoder architecture that produces forecasts for the entire horizon in a single evaluation. Compared to the LogSparse Transformer, this leads to improved speed when forecasting long horizons, as the model does not need to iteratively generate forecasts.

The Autoformer [8] is another Transformer designed for Long Sequence Time-Series Forecasting, and similarly to the Informer, it also has an encoder-decoder architecture which forecasts the entire horizon in a single step. It proposes a different modification of the attention operation, replacing the dot-product attention mechanism with an auto-correlation based mechanism, again with \(O(N\log N)\) complexity. Like the Informer, it can evaluate long horizons quickly and efficiently, and moreover, the authors show improved accuracy compared to the Informer on several datasets.

In contrast to these works, we are primarily interested in improving the forecasting accuracy of the Transformer, and do not attempt to improve the computational complexity of the model.

3.2 M4 competition methods

Time series forecasting is an increasingly relevant area of research. However, in our opinion, there has not been a clear consensus within the Machine Learning community on how to benchmark forecasting models. Earlier work, such as the LogSparse Transformer by Li et al. [6], frequently used the traffic (Footnote 2) and electricity (Footnote 3) datasets. However, these datasets have some issues that make them less suitable as a benchmark dataset. There are at least three different train/test split points used for each dataset, which makes comparing performance across different splits difficult [19]. Moreover, these datasets contain missing data, which complicates the training and evaluation setup. The traffic dataset also has missing data during some public holidays, but lacks documentation regarding which specific dates have been removed [19].

We suggest that the M4 dataset should be used as a standard benchmark dataset for research into time series forecasting. The M4 dataset was introduced in the M4 forecasting competition [20], the fourth competition in a series of highly influential forecasting competitions, known as the Makridakis competitions. The dataset contains 100,000 time series; compared to previously used datasets in Machine Learning forecasting studies, this is a very large dataset. These time series were collected from a wide range of domains, and exhibit a wide range of behaviors. The organizers of the competition provided forecasts of several well-known baseline methods, including various naïve methods, exponential smoothing methods, and the ARIMA method. These baseline methods capture linear relationships well, and are often difficult to beat in real-world problems with strong auto-correlation. However, on the M4 dataset, the best methods outperform these baselines substantially, which indicates that capturing non-linear dynamics is necessary to achieve a high level of accuracy on this dataset.

The M4 dataset has several desirable properties compared to previously used forecasting datasets. It is quality controlled, and there are no missing data. The evaluation procedures are clearly defined, and there is no confusion regarding train/test split points. Furthermore, the M4 dataset is large, which arguably is a precondition for Deep Learning methods to be successful. (The small size of the previous M3 competition dataset is believed to be a major reason for why neural networks did not perform well in that competition [21].)

The M4 competition included 61 methods in total. These methods used a wide variety of techniques, the majority of which were not Deep Learning techniques. However, to the surprise of some, the winning method relied heavily on Deep Learning. This method was developed by Smyl [22], and was a hybrid model combining a recurrent neural network with multiple exponential smoothing models. The parameters of the exponential smoothing models were learned individually for each time series, while the neural network parameters were shared across time series. The recurrent neural network was a dilated LSTM [23] with attention [24] and residual connections [25]. Moreover, several such hybrid models were combined in ensembles in order to improve forecasting accuracy.

After the competition was finished, Oreshkin et al. proposed a new method which outperformed even the winner of the competition, called N-BEATS [19]. The authors wanted to show that a deep neural network could perform well on the M4 dataset, without the need for classical time series forecasting techniques, such as those used in Smyl’s hybrid model. N-BEATS consists of blocks of feed-forward networks, which are combined such that each block provides a partial forecast, and these are added together to produce the final forecast. Furthermore, instead of using a regular residual skip connection, the partial forecasts are subtracted from the input of the next block, which the authors call double residual stacking. The final model is an ensemble of 180 such networks, which are trained with 18 different configurations in order to ensure sufficient diversity in the ensemble.

None of the works on Transformers for time series from the previous section evaluate their method on the complete M4 dataset. While Li et al. [6] evaluate their LogSparse Transformer on the hourly portion of the M4 dataset, the authors do not report their performance in the metric used in the competition (OWA), making it difficult to compare their accuracy to the original M4 contestants.

To the best of our knowledge, our proposed method is the first to achieve competitive results on the complete M4 dataset using a Transformer model. Moreover, to the best of our knowledge, our method is also the first to achieve competitive results using a single neural network model, instead of using ensembles of Deep Learning models.

4 Method

Fig. 1: The proposed adaptation consists of a skip connection and a scalar multiplicative gating mechanism initialized to 0. The initial model becomes the naïve persistence model, i.e. the model that predicts \(\hat{x}_{t+1}=x_t\)

4.1 The time series forecasting task

A time series is defined to be a sequence of fully observable measurements \(\textbf{x}=[x_1, \ldots , x_T] \in \mathbb {R}^T \), where \(x_t\) is the observation at time t, and T is the length of the series. The goal of forecasting is to predict \(\textbf{y}\) given \(\textbf{x}\), where \(\textbf{y}=[x_{T+1}, \ldots , x_{T+H}] \in \mathbb {R}^H \), and H is the forecasting horizon.

4.2 The PI-transformer

We will now introduce the PI-Transformer architecture. The architecture can be divided into four parts: normalization, linear projections, a decoder-only Transformer with Rotary positional encodings and ReZero normalization, and Persistence Initialization. A diagram of the complete architecture is shown in Fig. 1.

4.2.1 Normalization

The first step of our method is the normalization step. We first divide by the mean \(\mu _H\) of the H most recent values of \(\textbf{x}\), and then perform a log transform:

$$\begin{aligned} \textbf{z} = f(\textbf{x}) = \ln \frac{\textbf{x}}{\mu _H}, \end{aligned}$$
(17)

By using only the H most recent values, instead of the entire sequence, we can better capture the trend of the series close to the forecasting window. To produce an output in the original data space, we perform the inverse transformation:

$$\begin{aligned} f^{-1}(\textbf{z}) = \mu _H \cdot e^{\textbf{z}} \end{aligned}$$
(18)

During training, gradients are back-propagated through the inverse transformation. From this point on, we will focus on how to forecast the value of \(z_{t+1}\), as this can be converted to a forecast in the original data space by using the inverse normalization function: \(\hat{x}_{t+1} = f^{-1}(\hat{z}_{t+1})\).
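A minimal sketch of the normalization and its inverse, (17) and (18), assuming strictly positive series (as in the M4 dataset) and a hypothetical batch layout of shape (batch, time):

```python
import torch

def normalize(x: torch.Tensor, H: int):
    """(17): divide by the mean of the H most recent values, then log-transform."""
    mu_H = x[..., -H:].mean(dim=-1, keepdim=True)
    return torch.log(x / mu_H), mu_H

def denormalize(z: torch.Tensor, mu_H: torch.Tensor) -> torch.Tensor:
    """(18): map forecasts back to the original data space."""
    return mu_H * torch.exp(z)

x = torch.rand(4, 72) + 1.0     # hypothetical batch of strictly positive series
z, mu_H = normalize(x, H=18)
x_back = denormalize(z, mu_H)   # recovers x up to floating-point error
```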

4.2.2 Transformer model

Our method generates forecasts autoregressively by using a decoder-only Transformer architecture, similar to generative language models in NLP [3]. The architecture consists of blocks of causal self-attention layers and feedforward layers. In order to improve the stability of training the Transformer for forecasting, we connect the layers using residual skip connections with ReZero gating [14].

In NLP, embedding layers are commonly used to transform text tokens into feature vectors with size \(d_\text {model}\). We instead need to transform a univariate time series into a sequence of feature vectors, which we do with a linear projection. In other words, the initial input to the Transformer, \(X_0\), is defined as follows:

$$\begin{aligned} X_0&= \textbf{z} W_\text {in}\ , \end{aligned}$$
(19)

where \(W_\text {in} \in \mathbb {R}^{1 \times d_\text {model}}\) is a learnable weight matrix and \(\textbf{z}\) is the normalized input vector. Now, if we let \(X_i\) be the output of the i-th Transformer block, the rest of the model can be defined recursively:

$$\begin{aligned} X_i(X_{i-1})&= \text {FF}_i ( \text {SA}_i (X_{i-1})) \end{aligned}$$
(20)
$$\begin{aligned} \text {SA}_i(X)&= X + \alpha _i \cdot \text {SelfAttention}_i(X) \end{aligned}$$
(21)
$$\begin{aligned} \text {FF}_i(X)&= X + \alpha _i \cdot \text {FeedForward}_i(X), \end{aligned}$$
(22)

where \(\alpha _i\) is the learnable ReZero scalar parameter shared between the self-attention and feedforward layer within each block.

The feedforward layer is defined as in the original Transformer, i.e. (8). However, for the self-attention layer, we replace the standard absolute sinusoidal positional encoding with a relative positional encoding, more specifically the Rotary encoding [15]. The effect of the Rotary encoding is to multiply each key and query with a rotation matrix, which causes positional information to be encoded in the angle between the vectors. The use of the Rotary encoding is motivated by the fact that the absolute position within a time series window has no semantic significance, in contrast to text data, where the start of the sequence often has a semantic meaning (for instance as the start of a document or a sentence). Instead of adding the positional encoding to the input features of the model, as is done with a standard sinusoidal encoding, the Rotary encoding is implemented by modifying the definition of self-attention:

$$\begin{aligned} \text {Attention}(Q,\ K,\ V)&= \text {softmax}\left( \frac{\widetilde{Q}_\textsc {r} \widetilde{K}_\textsc {r}^T}{\sqrt{d_{qk}}} + M\right) V, \end{aligned}$$
(23)

where \(\widetilde{Q}_\textsc {r}\) and \(\widetilde{K}_\textsc {r}\) represent the Q and K matrices with Rotary positional encoding applied.

Finally, we perform a linear projection on the output of the final Transformer block in order to go back to a univariate sequence. We define \(T(\textbf{z})\), the “Transformer function”, to be the value after this projection:

$$\begin{aligned} T(\textbf{z})&= X_N W_\text {out}, \end{aligned}$$
(24)

where \(X_N\) is the output of the Nth block, and \(W_\text {out} \in \mathbb {R}^{d_\text {model} \times 1} \) is a learnable weight matrix.

4.2.3 Persistence initialization

Persistence Initialization (PI) is a technique to implicitly initialize an autoregressive neural network for forecasting, in order to improve training stability and forecasting performance. Specifically, the neural network is initialized to become a persistence model, which is a model that simply uses the last known value at time t as the forecast for \(t+1\), i.e. \(\hat{z}_{t+1} = z_t\). One way of combining this persistence forecast with the forecast from the Transformer model defined above would be to add the outputs of both models: \(\hat{z}_{t+1} = z_t + T(\textbf{z})\). However, this combined model will not become a persistence model at initialization, as the initial outputs will depend on the randomly initialized weights of the network. In order to ensure that the neural network does not contribute to the initial forecasts, we introduce a new zero-initialized gating parameter \(\gamma \):

$$\begin{aligned} \hat{z}_{t+1} = z_t + \gamma \cdot T(\textbf{z}) \end{aligned}$$
(25)

This results in a combined architecture with the property that the forecasts produced by the initial model are exactly equal to the persistence forecast. However, the combined architecture is still able to improve upon this simple forecast by changing the value of \(\gamma \) through learning.
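The following is a minimal PyTorch sketch of the resulting architecture, covering (19)-(22), (24), and (25); the Rotary encoding of (23) is omitted for brevity, and all names and hyperparameter values are ours. The final assertion illustrates the defining property of Persistence Initialization: before any training, the forecasts are exactly the persistence forecasts.

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Decoder block with ReZero gating and a causal mask, as in (20)-(22)."""
    def __init__(self, d_model, n_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.alpha = nn.Parameter(torch.zeros(1))   # shared by both sublayers

    def forward(self, x, mask):
        a, _ = self.attn(x, x, x, attn_mask=mask)
        x = x + self.alpha * a
        return x + self.alpha * self.ff(x)

class PITransformer(nn.Module):
    """Sketch of (19), (24), (25): linear projections around N blocks,
    plus the persistence skip connection gated by a zero-initialized gamma."""
    def __init__(self, d_model=64, n_heads=4, d_ff=256, n_blocks=4):
        super().__init__()
        self.w_in = nn.Linear(1, d_model, bias=False)     # W_in, (19)
        self.blocks = nn.ModuleList(Block(d_model, n_heads, d_ff) for _ in range(n_blocks))
        self.w_out = nn.Linear(d_model, 1, bias=False)    # W_out, (24)
        self.gamma = nn.Parameter(torch.zeros(1))         # gamma = 0 at initialization

    def forward(self, z):
        # z: normalized series of shape (batch, L); returns one-step-ahead forecasts.
        L = z.size(1)
        mask = torch.triu(torch.full((L, L), float("-inf"), device=z.device), diagonal=1)
        h = self.w_in(z.unsqueeze(-1))
        for block in self.blocks:
            h = block(h, mask)
        t_z = self.w_out(h).squeeze(-1)                   # T(z)
        return z + self.gamma * t_z                       # (25)

model = PITransformer()
z = torch.randn(2, 54)
assert torch.allclose(model(z), z)   # before training, forecasts equal the persistence forecast
```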

5 Experimental settings

5.1 Dataset

Public datasets have played an important role in the development of both deep learning methods and forecasting methods. There are clear benefits to having a publicly available high quality dataset, of which the most important is that it allows researchers to measure progress in a standardized way. However, Machine Learning research focusing on time series forecasting has lacked a commonly agreed-upon benchmark dataset. We propose to use the M4 dataset [20] for this purpose.

Table 1 Descriptive statistics for each frequency in the M4 dataset

The M4 dataset was introduced in the fourth Makridakis competition, held in 2018. The previous Makridakis competitions have been very influential for the development of forecasting methods, and are well regarded in the forecasting community. The M4 dataset consists of 100,000 time series from various domains, divided into six sub-sets based on their sampling frequency. Each frequency has a corresponding forecasting horizon H: Yearly (\({H=6}\)), Quarterly (\({H=8}\)), Monthly (\({H=18}\)), Weekly (\({H=13}\)), Daily (\({H=14}\)), and Hourly (\({H=48}\)). Table 1 contains some descriptive statistics for each frequency. We consider each data frequency as an independent learning problem, and consequently train separate models for each data frequency.

5.2 Metrics

Performance on the M4 dataset is measured by a metric called Overall Weighted Average (OWA) [20]. OWA is a combination metric, which combines the Mean Absolute Scaled Error (MASE) and symmetric Mean Absolute Percentage Error (sMAPE) metrics. MASE and sMAPE are both scale-independent error metrics which are commonly used in the time series forecasting literature [26]. The purpose of these metrics is to enable comparisons of forecasting accuracy across time series data with varying scales. In order to combine the sMAPE and MASE metrics into the single OWA metric, these metric scores are scaled by the corresponding metric scores from a baseline model. In the M4 competition, the baseline model was the Naïve2 model, which is a persistence model that is seasonally adjusted by multiplicative decomposition [20, 27]. After scaling the metrics, their values are combined by taking the average of the two, as follows:

$$\begin{aligned} \text {OWA}&= \frac{1}{2} \left[ \frac{\text {sMAPE}}{\text {sMAPE}_{\text {Na}\ddot{\i }\text {ve2}}} + \frac{\text {MASE}}{\text {MASE}_{\text {Na}\ddot{\i }\text {ve2}}} \right] \end{aligned}$$
(26)

5.2.1 MASE

MASE is a scaled version of Mean Absolute Error (MAE). The scaling factor for MASE is the MAE of a baseline model on the training set. The baseline model used in the M4 competition is the seasonal naïve model, which always predicts the value S steps in the past (e.g. 24 for all hourly time series, 12 for all monthly time series, etc.). Let x be the training portion of the series, y the true continuation of the series, and \(\hat{y}\) be the forecast. Then MASE can be defined as follows:

$$\begin{aligned} \text {MASE}&= \frac{1}{N} \sum _{i=1}^N \frac{ \frac{1}{H} \sum _{j=1}^{H} |y_{j}^{(i)} - \hat{y}_{j}^{(i)} |}{\frac{1}{T^{(i)}-S}\sum _{j=S+1}^{T^{(i)}}|x_j^{(i)} - x_{j-S}^{(i)}|}, \end{aligned}$$
(27)

where N is the number of time series, \(T^{(i)}\) is the length of the time series, H is the forecasting horizon, and S is the seasonality. The superscript (i) (as in \(x^{(i)}\)) denotes the time series with index i, with \(1\le i \le N\).

5.2.2 sMAPE

sMAPE calculates the symmetric percentage difference between the forecast and the actual values. sMAPE scales the absolute error at each time step by the average between the forecast and ground truth at that time step, and can be defined as follows, using the previously introduced notation:

$$\begin{aligned} \text {sMAPE}&= 100 \cdot \frac{1}{N} \sum _{i=1}^N \frac{1}{H} \sum _{j=1}^{H} \frac{|y_{j}^{(i)} - \hat{y}_{j}^{(i)} |}{\ \bigl ( |y_{j}^{(i)} |+ |\hat{y}_{j}^{(i)} |\bigr ) / 2} \end{aligned}$$
(28)
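For reference, the following is a small NumPy sketch of the three metrics, (26)-(28), assuming a shared horizon H within a frequency and that the Naïve2 baseline scores are given; function and argument names are ours.

```python
import numpy as np

def smape(y, y_hat):
    """sMAPE (28), averaged over series and horizon; y and y_hat have shape (N, H)."""
    return 100.0 * np.mean(np.abs(y - y_hat) / ((np.abs(y) + np.abs(y_hat)) / 2))

def mase(x_list, y, y_hat, S):
    """MASE (27); x_list holds the (variable-length) training series, S is the seasonality."""
    scores = []
    for x_i, y_i, f_i in zip(x_list, y, y_hat):
        scale = np.mean(np.abs(x_i[S:] - x_i[:-S]))      # in-sample seasonal-naive MAE
        scores.append(np.mean(np.abs(y_i - f_i)) / scale)
    return np.mean(scores)

def owa(smape_model, mase_model, smape_naive2, mase_naive2):
    """OWA (26): metrics scaled by the Naive2 baseline, then averaged."""
    return 0.5 * (smape_model / smape_naive2 + mase_model / mase_naive2)
```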

5.3 Training

We used a sliding window approach to train our models. In other words, instead of full-length time series, fixed-length sub-sequences (windows) were used for training. This was done to avoid the computational issues related to attention over long sequences. The size of the sliding window was defined to be nH, where H is the forecasting horizon and n is a hyperparameter determining the size of the window relative to the forecasting horizon.

Fig. 2: The validation set was created by taking the rightmost sub-sequence of length \(L = nH + H\). The training set was created by enumerating all sub-sequences that do not overlap with the forecasting horizon (i.e. the final H time steps) of the validation set sub-sequence

During training, teacher forcing was used to produce H predictions in parallel. Consequently, it is necessary to sample sub-sequences of length \(L~=~nH~+~H\) to train the model, where the first nH elements are the sliding window inputs and the final H elements are targets. To construct a training mini-batch, we first sampled a time series i with uniform probability, and then sampled from the sub-sequences within that time series with (conditional) uniform probability.

We created our validation set by combining all the rightmost sub-sequences of length L. However, in order to have a greater number of sub-sequences available in the training set, we excluded the rightmost sub-sequences belonging to the shortest sequences of the dataset. Using set notation, the procedure can be described as follows: Create an index set \(\mathcal {I}~=~\{\ i\ \mid \ \forall _{i, 1\le i \le N}\ \ T^{(i)} \ge P_{25}\ \}\), where \(T^{(i)}\) is the length of time series i, and \(P_{25}\) is equal to the 25th percentile in the distribution of time series lengths. Then the validation set is \(\mathbb {X}_\text {val}~=~\bigcup _i \mathbb {X}^{(i)}_\text {val} = \{ x^{(i)}_{T^{(i)} -L < t\le T^{(i)} } \mid \ i \in ~\mathcal {I} \}\), where the \(x_{a\le t\le b}\) notation indicates the sub-sequence of x starting at a and ending at b, i.e.: \([x_a, x_{a+1}, \ldots , x_{b-1}, x_b ]\). The training set is then created by enumerating all possible sub-sequences without overlapping targets in the validation set: \(\mathbb {X}_\text {train} = \bigcup _i \mathbb {X}_\text {train}^{(i)} = \bigcup _i\ \{ x_{j\le t < j + L}^{(i)} \mid \ \forall _{j, 1\le j \le T^{(i)} - L - H\mathbb {I}_\mathcal {I}(i)} \}\), where \(\mathbb {I}_\mathcal {I}\) is the indicator function for \(\mathcal {I}\). See Fig. 2 for a graphical representation.
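The following is a rough sketch of this split, using 0-based indexing; exact boundary handling may differ from our implementation, and all names are ours.

```python
def make_splits(series, H, n, p25):
    """Sketch of the split in Fig. 2: windows of length L = nH + H.
    Series at least as long as the 25th-percentile length contribute their rightmost
    window to the validation set; training windows for those series must not overlap
    that window's final H steps (the validation targets)."""
    L = n * H + H
    train, val = [], []
    for x in series:                       # x: one time series as a 1-D array or list
        T = len(x)
        in_val = T >= p25
        if in_val:
            val.append(x[T - L:])          # rightmost sub-sequence of length L
        max_start = T - L - (H if in_val else 0)
        for j in range(max_start + 1):     # enumerate all admissible training windows
            train.append(x[j:j + L])
    return train, val
```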

5.4 Hyperparameters

We performed manual hyperparameter tuning with the goal of finding a general setting which could work well across all data frequencies of the M4 dataset. However, most of the tuning focused on the Monthly frequency. We were largely successful in finding a general setting for all data frequencies, except for the value of n, which determines the size of the input window relative to the forecasting horizon.

For the Yearly, Quarterly, Monthly, and Daily frequencies we set \(n=3\); while for the Weekly and Hourly frequencies we set \(n=4\). A value of \(n=3\) resulted in poor performance on the Weekly and Hourly frequencies, likely due to seasonal patterns that are only included with window sizes corresponding to \(n=4\). In the case of Weekly, \(n=4\) corresponds to 52 weeks, which indicates the presence of yearly seasonality. For the Hourly frequency, \(n=4\) corresponds to 192 hours, which is approximately 8 days, indicating a weekly seasonality.

The remaining hyperparameters were set to be identical for all data frequencies. The model has 4 layers, 4 attention heads, \(d_\text {model}=512\), and \(d_\text {ff}=2048\). We use the Lamb [28] optimizer with default hyperparameters, bias correction, and gradient clipping for gradients with norms greater than 10. Our loss function is defined to be identical to the MASE metric (27). We define a training epoch to consist of 128 mini-batches of size 1024. As our stopping criterion, we use early stopping with a patience value of 8, such that training was stopped after 8 epochs without improvement in the validation loss.

5.5 Ablation studies

In order to better understand the effects of the various components of our models, we perform two ablation studies. To reduce the complexity of these studies, we focus exclusively on the monthly portion of the M4 dataset, which contains 48% of the series in the M4 dataset.

The first ablation study focuses on the effects of Persistence Initialization, and the second ablation study focuses on the effect of the positional encoding and the normalization layer. Moreover, in both studies we are also interested in the effect of the size of the Transformer model, and possible interactions between architectural components and model size. For this reason we also vary the model size by setting the hyperparameter \(d_\text {model}\) to values in the set \(\{ 32, 64, 128, 256, 512 \}\), with the feedforward size \(d_\text {ff}\) set to \(4\cdot ~d_\text {model}\).

In order to ensure fair comparisons, we perform 9 repeated experiments for each model setting, such that each repeated experiment has different weight initialization and data sampling. This minimizes the effect of randomness due to weight initialization and data sampling, and ensures that we are not cherry-picking the best performing models after the fact.

5.5.1 First ablation study

The first ablation study investigates the effects of the skip connection and the multiplicative gating. We compare architectures with neither skip connections nor multiplicative gating (29), architectures with a skip connection but without multiplicative gating (30), and architectures with both a skip connection and multiplicative gating, i.e. Persistence Initialization (31):

$$\begin{aligned} \hat{z}_{t+1}&= T(\textbf{z}) \end{aligned}$$
(29)
$$\begin{aligned} \hat{z}_{t+1}&= z_t \ +\ T(\textbf{z}) \end{aligned}$$
(30)
$$\begin{aligned} \hat{z}_{t+1}&= z_t \ +\ \gamma \cdot T(\textbf{z}) \end{aligned}$$
(31)

5.5.2 Second ablation study

The second ablation study compares the effect of the positional encoding and the normalization layers. We compare two positional encodings: the original sinusoidal encoding [1], and the Rotary encoding [15], both of which are described in Section 2.2. We compare three kinds of normalization: ReZero, post-activation Layer Norm, and pre-activation Layer Norm, as described in Section 2.3. This results in a total of six combinations of architecture settings for the second ablation study.

5.6 M4 comparison

In our second experiment we want to measure the performance of our PI-Transformer on the complete M4 dataset, in order to compare it to other state-of-the-art methods that have been evaluated on the complete M4 dataset. In particular, we compare against the top 10 methods of the M4 competition. We also compare against two versions of the N-BEATS [19] method, which was developed after the conclusion of the competition.

All the top performing methods on the M4 dataset are ensemble methods, and for this reason we are mainly interested in two issues: forecasting performance and ensemble size. This presents an obvious difficulty, as an ensemble is likely to achieve better forecasting accuracy than a single model while being much more costly, so neither is strictly superior. To address this issue, we measure both the performance of a single PI-Transformer model, and the performance of an ensemble of PI-Transformers. First, we perform 9 repeated experiments for each subset of the M4 dataset. As in the previous ablation studies, each repeated experiment has different weight initialization and data sampling. We will consider the model with the median OWA score within each data subset to estimate the expected performance of a single PI-Transformer on that subset. The total OWA score is computed by concatenating the predictions of these median-score models from each data frequency into a single prediction on the complete dataset. Second, in order to estimate the effect of using a PI-Transformer in an ensemble, we compute the mean of the 9 predictions for each data subset. These mean predictions are then concatenated into a single prediction for the complete M4 dataset.

5.7 Comparison to other transformer models for time series

For the sake of completeness, we perform a final experiment where we compare the performance of our architecture against three Transformer models which have been applied to time series forecasting: LogSparse Transformer [6], Informer [7], and Autoformer [8].

In this comparison, we only use data from the Hourly sub-set of the M4 dataset, for two reasons. First, the Informer and Autoformer architectures were designed to deal with very long forecasting horizons. The Hourly sub-set of the M4 dataset has the longest forecasting horizon (\(H=48\)), so it is the part of the M4 dataset that most resembles the problems these architectures were designed for. Second, in the case of the LogSparse Transformer, the authors report the performance of their model on the Hourly sub-set of the M4 dataset. This allows us to refer to the authors’ own reported performance, instead of re-implementing the architecture.

Similarly to previous experiments, we perform repeated experiments and report the median score. However, we only perform 5 repeats in this comparison instead of 9, as was done previously. (For the LogSparse Transformer, we use the authors’ own reported performance, which was not the median of 5 repeated experiments.)

We used publicly available code (Footnotes 4 and 5) to implement the Informer and Autoformer. However, we found training the Autoformer to be challenging, as the loss values were generally high throughout training. This was especially the case as the number of parameters increased, and for this reason we decided to only consider the relatively small setting of \(d_\text {model}= 32\). Similarly to the previous experiments, we set \(d_\text {ff}= 4 \cdot d_\text {model}\). For the Autoformer and the Informer we used 2 encoder layers and 2 decoder layers, and for the Transformer we used 4 layers, as before. This setting results in a similar number of parameters for the three models. We use a context window of length H for the decoder of both the Informer and Autoformer.

We also found the previously used strategy of early stopping on the validation loss to be unreliable when training the Informer and Autoformer, as the validation loss would often have much larger variance than in the previous experiments. (We believe this difference comes from the one-shot forecasting approach taken by both methods. In contrast, our autoregressive model uses teacher forcing during training, leading to more stable loss values, as the model only has to perform 1-step predictions.) To provide a more fair comparison, we instead allocate a fixed amount of computation to each method by setting a limit of 100 epochs, and then selecting the weights with the lowest validation loss to compute the final test score.

The authors of the LogSparse Transformer measured performance on M4-Hourly using a 0.5-quantile loss. To be comparable, we also report the 0.5-quantile loss, which can be defined as follows:

$$\begin{aligned} R_{0.5} = \frac{ \sum _{i=1}^N \sum _{j=1}^H |y_{j}^{(i)} - \hat{y}_{j}^{(i)}|}{\sum _{i=1}^N \sum _{j=1}^H |y_{j}^{(i)}|} \end{aligned}$$
(32)
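A short sketch of (32), assuming forecasts and targets stored as arrays of shape (N, H):

```python
import numpy as np

def r05(y, y_hat):
    """0.5-quantile (normalized absolute) loss of (32)."""
    return np.abs(y - y_hat).sum() / np.abs(y).sum()
```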

6 Results and discussion

6.1 First ablation study

The first ablation study focuses on the components of Persistence Initialization. We compare models using Persistence Initialization (PI) to two different ablation settings: models lacking both skip connection and multiplicative gating, and models with a skip connection but no multiplicative gating. Additional details regarding this experiment can be found in Section 5.5.

Figure 3 shows box plots of OWA test scores for the three settings, and Fig. 4 shows the training and validation loss curves of the three settings. To give additional context about the accuracy of the settings relative to other methods, the box plots in Fig. 3 also include striped lines representing the first and second place entries in the M4 competition. As a shorthand we will refer to the three settings by their ordering in the plots; i.e. setting 1 refers to models lacking both skip connections and gating, setting 2 refers to the models with skip connections but no gating, and setting 3 refers to models with Persistence Initialization.

Fig. 3: Box plots of OWA test scores from the first ablation study. Each box represents 9 repeated runs. The striped lines correspond to the first and second place entries in the M4 competition

Fig. 4: Validation and training losses from the first ablation study. Each line represents the mean loss over 9 repeated runs, with the shaded area representing the standard deviation. Note that this plot contains a form of survival bias, as training is stopped once the validation loss flattens or increases

The box plots in Fig. 3 show that each setting has a different relationship between the size of the model and forecasting accuracy. The two ablation settings do not show improved accuracy as model size is increased. For setting 1, the smallest model performs worse than the largest model. In setting 2 all the model sizes perform at a similar level. Only models in setting 3 (i.e. with PI) improve in accuracy when model size is increased. Moreover, the largest model size has a lower median OWA than the winner of the M4 competition, which shows that Transformer models with Persistence Initialization are able to achieve a high level of accuracy.

Looking at the loss curves in Fig. 4, we see several indications as to why models with PI achieve better accuracy. Compared to the curves of the two ablation settings, the loss curves of setting 3 (i.e. PI) are shifted both down and to the left. In other words, models with PI start at a lower loss and end at a lower loss, and do so in fewer iterations. It is not surprising that these models start at lower initial loss values, as PI is an initialization technique, designed to improve the initial forecasts of a forecasting model. It might be more surprising that these models also achieve lower final loss values, as the Transformer is known to be a very powerful architecture. One might expect that a Transformer without PI would be able to achieve the same level of accuracy by quickly learning to select the previous time step in one of its attention heads. However, the fact that the models without PI train for more iterations, only to achieve worse results, suggests that learning this simple mapping is in fact non-trivial.

Fig. 5: Box plots of OWA test scores from the second ablation study. Each box represents 9 repeated runs. The striped lines correspond to the first and second place entries in the M4 competition. Note that some boxes are located entirely outside the bounds of the plot

In conclusion, this ablation study clearly shows that Persistence Initialization has a large effect on the training process, improving both performance and training stability. Both the residual skip connection and the multiplicative gating parameter are necessary to see these effects.

6.2 Second ablation study

The second ablation study compares the effect of the positional encodings and normalization layers. We compare two kinds of positional encodings and three kinds of normalization layers. The two positional encodings are the standard sinusoidal positional encoding and the Rotary encoding. The three kinds of normalization layers are post-activation Layer Normalization, pre-activation Layer Normalization, and ReZero normalization. Additional details regarding this experiment can be found in Section 5.5.

Figure 5 shows box plots of test scores for the six ablation settings. As in the previous experiment, we include striped lines representing the first and second place entries in the M4 competition to give additional context about the level of accuracy.

By inspecting the box plots we immediately see that the Rotary encoding outperforms the sinusoidal encoding in every setting of the experiment. This suggests that the Rotary encoding is better suited for time series tasks than the sinusoidal encoding. Furthermore, the models with ReZero normalization show improved performance compared to the other two options, indicating that ReZero might be the better choice of normalization function for time series tasks.

Table 2 OWA test scores for each frequency of the M4 dataset. Bold font is used to indicate the best model, an underline indicates second best. N is the number of time series, and H is the forecasting horizon. N-BEATS-18 is a version of the N-BEATS model with 18 models in its ensemble instead of 180. We report the OWA score for this model based on Fig. 3 from Oreshkin et al. [19]

6.3 M4 dataset performance

This experiment compares the PI-Transformer to other state-of-the-art methods on the complete M4 dataset. We are mainly interested in two issues: forecasting performance and ensemble size. Table 2 compares our method to the top 10 methods of the M4 competition. We also include two versions of the N-BEATS [19] method, which was proposed after the conclusion of the competition. See Section 5.6 for more details regarding this experiment.

The size of an ensemble of models is an important aspect of practical usefulness in real world settings. This is especially true when the ensemble members are themselves complex models, such as Deep Learning models. The current top performing method on the M4 dataset is the N-BEATS method by Oreshkin et al. [19], which consists of an ensemble of 180 feed-forward neural networks. To ensure diversity in the ensemble, the authors used three different loss functions and six different window sizes, resulting in a total of 18 different model configurations. The final ensemble was then formed by training 10 copies of each of the 18 model configurations, resulting in 180 models in total. The authors also reported the performance of a smaller ensemble which only used one copy of each model configuration, which we have called N-BEATS-18 in Table 2. The top performing method of the M4 competition was also an ensemble. The winner of the competition, Smyl [22], used a complex strategy which combined models at multiple conceptual levels. First, at the level of ensembles of models, the method combined forecasts from 6-9 independent training runs. Second, each of these training runs consisted of multiple models which were trained on subsets of the dataset. Finally, the forecasts of each such model were produced by taking the average of the predictions produced by the models in the final 4-5 training epochs.

Both of these methods are arguably highly complex, but they also substantially improved forecasting performance compared to simpler methods. The significance of the improvements in accuracy can most easily be seen by inspecting the total OWA of the methods from rank 2 to rank 6. The median difference between these consecutive ranks is 0.001 OWA, and the difference between the best and the worst method in this group is 0.010 OWA. In contrast, the difference between the rank 1 method (i.e. the winner, Smyl) and the rank 2 method is 0.017 OWA. This observation led the organizers of the competition to characterize the difference between ranks 6 to 2 as “miniscule”, while the difference between ranks 2 and 1 was characterized as “considerable” [29]. A similar argument can be made for the difference between N-BEATS and the rank 1 method, which is even greater: 0.026 OWA.

In this experiment, we trained 9 PI-Transformer models for each frequency of the M4 dataset. We measure the expected OWA of a single PI-Transformer by the median OWA of the 9 models. We measure the OWA of an ensemble of PI-Transformer models by using the mean of the 9 predictions, which is arguably a more fair comparison to the other methods of Table 2, as most of these are in fact also ensembles.

As can be seen from Table 2, the median-OWA PI-Transformer achieved a score between the winner of the M4 competition and the N-BEATS method. The difference to the winner of the competition is 0.006 OWA, which is relatively small, as we have discussed above. However, we would argue that the most important advantage of our method is that it is easier for Machine Learning practitioners to use, as it does not rely on a large ensemble of models.

The mean-ensemble PI-Transformer achieves a score between N-BEATS-18 and the full N-BEATS model. As before, the differences in OWA are small: 0.002 to 0.005 OWA. Considering that N-BEATS consists of an ensemble of 180 models, we believe that our approach represents a favorable trade-off between forecasting accuracy and ensemble size in this case.

6.4 Comparison with other transformer models for time series

In this experiment, we compare our proposed Transformer architecture against three other Transformer models recently proposed for time series forecasting: the LogSparse Transformer [6], the Informer [7], and the Autoformer [8]. The details regarding the experimental setup for this experiment can be found in Section 5.7.

Table 3 shows the results of the comparison. As can be seen from the table, our method outperforms the others, both in terms of OWA and in terms of \(R_{0.5}\).

Table 3 Comparison of Transformer models on M4-Hourly. We include the \(R_{0.5}\) score to be comparable with the LogSparse Transformer, as the authors do not report the OWA score

This shows that our method is able to achieve better forecasting accuracy compared to recently proposed Transformer methods for time series. However, we would emphasize that these architectures are inherently different, and were designed for different purposes. The PI-Transformer is a decoder-only architecture, while the Informer and Autoformer are encoder-decoder architectures. We have attempted to make the comparison between these models as fair as possible by keeping the number of parameters in these models approximately equal. This required using two encoder layers and two decoder layers for the Informer and Autoformer, compared to the four decoder layers of our PI-Transformer. However, this results in models with different depth; a better comparison might be to use a depth of 4 for both the encoder layers and decoder layers, regardless of the total parameter count. Moreover, our model is an autoregressive architecture which needs to perform several model evaluations whenever the forecasting horizon is greater than 1, in contrast to the Autoformer and Informer which produce a full horizon of forecasts in a single evaluation. Consequently, these models are likely much faster to evaluate.

6.5 Interpretations of persistence initialization

It is perhaps surprising that adding a single parameter, as Persistence Initialization does, can have such a big impact on forecasting accuracy. Persistence Initialization is a relatively simple change, and the Transformer is a powerful model. One might expect that it would be able to “discover” this pattern by itself, by using the attention mechanism to select the previous time step in one of its attention heads. In this section we will discuss some interpretations of what Persistence Initialization is doing.

One interpretation, which is the one suggested by our naming choice, is that the model is initialized to become a persistence model. During training the model changes from the naïve persistence model to become a more complex model. In other words, Persistence Initialization can be seen as a kind of implicit weight initialization.

A related interpretation is that Persistence Initialization is a re-parametrization of the forecasting problem. Instead of directly forecasting the values of the time series, the model must instead predict the difference to the previous time step. Furthermore, the way the model is initialized corresponds to a prior belief that these differences are zero. This interpretation is somewhat related to the concept of differencing, which is a technique commonly used in statistical forecasting methods to make time series more stationary. However, Persistence Initialization is not the same as differencing, as only the outputs of the PI-Transformer are (implicitly) differenced, and not the inputs.

A third interpretation is that we are combining two models in a way that is somewhat similar to boosting. In boosting, a sequence of models is trained iteratively to predict the residual errors of the previous models. We combine the naïve persistence model and a Transformer, such that the Transformer predicts the residuals of the persistence model. The persistence model has a fixed weight of 1, and the weight of the Transformer is the gating parameter \(\gamma \).

7 Conclusion

In this work, we have presented Persistence Initialization, a novel and general adaptation for autoregressive time series forecasting with neural networks. Furthermore, we introduced PI-Transformer, a Transformer model based on Persistence Initialization, Rotary positional encodings, and ReZero normalization. We performed two ablation studies, which showed that the PI-Transformer learns faster, is more accurate, and scales better than Transformer models without our proposed modifications. Moreover, we measured the performance of our proposed PI-Transformer model on the complete M4 dataset, and found that it achieves a high level of forecasting accuracy, similar to other state-of-the-art methods. Our method outperforms the original winner of the M4 competition, and, using an ensemble of only 9 PI-Transformers, achieves a level of accuracy comparable to N-BEATS, an ensemble of 180 deep neural networks.