1 Introduction

Electricity has become a necessity of daily life in the modern world. In recent years, global demand for and usage of electricity have been increasing drastically due to urban development, industrial expansion, climate change, population growth and other factors [1,2,3]. However, power scheduling and transmission are costly, and the available supply is often insufficient to meet global demand. As a solution, many studies apply various methods to forecast future electricity demand so that governments and power companies can plan ahead effectively and promote energy efficiency among customers [4].

Electrical load forecasting is of vital importance in intelligent power management and has been a topic of interest in both academic and business domains [5]. It not only guides power planning, but also helps improve the economy of the power system and ensures the safe operation of the electrical grid. Hence, as an essential function of power management, electrical load forecasting is crucial to the relevant decision-making. However, accurately forecasting electrical load from time series of historical electricity consumption remains a challenging task [6]. The data exhibit complex patterns and dynamics and may be affected by various factors, including temperature, seasons, the economy and unpredictable events [7]. How to incorporate these complex factors affecting power demand into prediction models is an urgent problem [8]. According to the forecasting time scale, the work can be categorized into three types: short-term, medium-term and long-term [9]. Short-term load forecasting offers strong support for real-time scheduling and operation planning of the power system and reduces excessive energy consumption [10]. It has always been a hot spot in power research, with more and more new methods being introduced, including statistical methods and machine learning methods. Statistical methods commonly used for power load and network traffic forecasting, such as ARIMA [11, 12], can effectively use historical data to predict future power load. But with increasing demand for higher forecast accuracy, the predictive power of these models is insufficient, since statistical approaches have difficulty handling complex patterns and dynamic electrical demand data. Machine learning methods such as Support Vector Regression (SVR) [13], Random Forest (RF) and Gradient Boosting Machines (GBM) [14] are also used for power load forecasting because of their powerful ability to process and analyze nonlinear and complex problems. In recent years, deep learning methods, which offer stronger feature extraction than traditional machine learning methods, have developed rapidly and are able to predict power load more accurately. Many deep learning models have been applied to short-term load forecasting, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN), LSTM, Bi-directional Long Short-Term Memory (Bi-LSTM) and Seq2Seq [15, 16].

The Transformer is also a deep learning method, with a network architecture initially designed for machine translation [17]. It relies entirely on attention mechanisms, without sequence-aligned recurrence or convolutions, to compute input–output representations and capture long-range dependencies. The Transformer performs well at capturing complex, dynamic, nonlinear dependencies over long input sequences, offering a new possibility for power load forecasting. In this work, we focus on short-term forecasting with multivariate time series data and propose a new model, the Time Augmented Transformer (TAT), based on an adaptation of the deep self-attention Transformer architecture that incorporates a time augmentation method for short-term load forecasting. The main contributions and novel findings are the following:

  1. A highly accurate short-term electrical load forecasting approach based on the Transformer model was developed. We modified the original Transformer to adapt it to electrical load forecasting and successfully improved its predictive capacity.

  2. The new Time Augmented Transformer model is proposed on the basis of an adaptation of the deep self-attention Transformer architecture. We extract additional time features as an augmentation encoding to enhance the temporal representation of the historical input sequences. The TAT model further improves the ability to learn the nonlinear relationships in the load data and achieves a clear improvement in accuracy.

  3. We carefully designed experiments to demonstrate that multivariate feature input is more appropriate for the proposed model in the short-term load forecasting task and that our approach can use less historical information to make more accurate predictions, which means lower memory occupancy and faster computation.

2 Related Work

Previous work on short-term electrical load forecasting can be classified into statistical approaches, machine learning and deep learning [4]. Many statistical methods, such as ARIMA, have been used for electrical load forecasting [18, 19]. Wei and Zhang [20] proposed an ARIMA model for short-term electrical load forecasting. However, statistical approaches have difficulty handling complex patterns and dynamic electrical demand data and place strong requirements on the stationarity of the data, so the accuracy of their predictions is insufficient and they fail to achieve the expected forecasting results.

In recent years, machine learning methods have gradually been investigated for power load forecasting. Artificial intelligence-based methods accounted for 90 percent of power forecasting research models between 2010 and 2020 [4]. Yi, Niu [21] used a wavelet transform with a least squares support vector machine (LSSVM) to predict power demand. A random forest was used for one-step, day-ahead short-term load prediction in Tunisia [22]. Besides, Zhang, Li [23] compared three kinds of models, multiple linear regression, RF and gradient boosting, for hourly electricity load forecasting in southern California; the results demonstrated that gradient boosting performed best.

Along with the rapid development of artificial intelligence, deep learning has been widely used in natural language understanding, image processing, autonomous driving and other fields [24, 25]. Deep learning methods can not only capture the complex dependencies in nonlinear dynamic systems, but also achieve remarkable performance in many prediction applications with higher accuracy [26, 27]. Tokgöz and Ünal [28] built a forecasting model based on an RNN with an ant colony optimization algorithm and improved prediction accuracy for electrical load forecasting. However, RNNs suffer from vanishing gradients when dealing with long input sequences: the back-propagated error either decays rapidly or grows without bound, making it difficult to capture long-distance dependencies between sequences. Long Short-Term Memory (LSTM), a further development of the RNN, uses gates to control the discarding or adding of information, realizing a forgetting-or-remembering mechanism that alleviates the vanishing gradient problem of RNNs [29]. Peng, Shuai [15] applied LSTM to improve the forecast accuracy of the traditional RNN model. Besides, CNNs have also been used for load forecasting because of their excellent ability to capture the trend of load data. Wang, Zhao [30] proposed a method based on the integration of CNN and LSTM, which achieved higher precision in short-term forecasting. To utilize global historical information, Gong, An [16] developed a short-term load prediction model based on Seq2Seq, which uses an encoder–decoder architecture and exhibits better performance. However, the Seq2Seq model uses a recurrent neural network as its encoder to compress historical information into an intermediate vector, which inevitably loses the dynamic dependencies between historical sequences.

3 The Proposed Approach

3.1 Problem Description

We convert power load forecasting into a supervised learning problem. In multi-step-ahead electric load forecasting under the rolling forecasting setting with a sliding window, a history time series of electrical load and related features \({{\varvec{X}}} = \left\{ {x_{t_1 } ,x_{t_2 } , \ldots ,x_{t_n } | x_{t_i } \in R^{d_x } } \right\}\) is given, and the output is the prediction of the next m-step electrical load sequence \({{\varvec{Y}}} = \left\{ {x_{t_{n + 1} } ,x_{t_{n + 2} } , \ldots ,x_{t_{n + m} } | x_{t_i } \in R^{d_y } } \right\}\), where \(d_x\) is the number of features in the input vector, \(x_{t_i }\) can be a scalar or a vector consisting of multiple features including historical electrical load, dry bulb temperature, wet bulb temperature, dew point temperature, hours and electricity price, and \(d_y = 1\). Figure 1 shows the sliding window for the input electrical load sequence. In this work, for short-term electrical load forecasting, we make predictions for 30 min, 1 h, 12 h and one day ahead, respectively, using historical data from the previous day as input; that is, \(m = 1, 2, 24, 48\) and \(n = 48\), where one time step denotes 30 min.

Fig. 1 Sliding windows to construct supervised learning examples for rolling forecasting
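As an illustration of this setup, the following minimal sketch (our own, with hypothetical array names and a toy random matrix standing in for the real data) builds such \((X, Y)\) pairs with a sliding window:

```python
import numpy as np

def make_supervised_pairs(series: np.ndarray, n: int = 48, m: int = 2):
    """Slide a fixed-length window over a (T, d_x) feature matrix to build
    (X, Y) pairs: X holds the previous n steps of all features, Y holds the
    next m steps of the load column (assumed to be column 0 here)."""
    X, Y = [], []
    for start in range(len(series) - n - m + 1):
        X.append(series[start:start + n])             # (n, d_x) history window
        Y.append(series[start + n:start + n + m, 0])  # (m,) future load values
    return np.stack(X), np.stack(Y)

# Example: half-hourly data with 7 columns (load + 6 features)
data = np.random.rand(5 * 365 * 48, 7)
X, Y = make_supervised_pairs(data, n=48, m=2)
print(X.shape, Y.shape)  # (N, 48, 7), (N, 2)
```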

3.2 Time Augmented Transformer Model

The Transformer model relies entirely on attention mechanisms, without sequence-aligned recurrence or convolutions, to compute input–output representations and capture long-range dependencies [17]. Furthermore, it does not process the data in sequential order but uses attention to process the entire sequence, learning dependencies regardless of their distance in the input sequence. Therefore, Transformer-based models have the potential to model the complex dynamics of electrical load data [31]. Because the Transformer was designed for machine translation, it cannot be used directly to forecast electrical load; to this end, we modified the Transformer to adapt it to our task.

The structure of our Time Augmented Transformer, named TAT, is shown in Fig. 2. The TAT model uses an encoder–decoder architecture. All the historical load and feature data are fed into the encoder, after time information has been fused in the input layer, to generate a global encoding of the historical information. The decoder uses the one-position-shifted future load data and the historical global attention vector produced by the encoder as input to predict the electrical load at the next step.

Fig. 2 Structure of the Time Augmented Transformer model for load forecasting

Input Layer: The input layer is composed of a fully connected layer, a positional encoding layer and a time augmented encoding layer. The historical electrical load data first enter the input layer. Unlike the original Transformer architecture, the historical observation \({{\varvec{X}}} \in {\mathbb{R}}^{n \times d_x }\) is transformed to \({{\varvec{X}}} \in {\mathbb{R}}^{n \times d_{{\rm model}} }\), mapping the input data to a vector of dimension \(d_{{\rm model}}\) with a fully connected layer, where \(n\) is the number of input time steps of historical data and \(d_x\) is the number of input features at a single time step. The positional encoding \(PE\) is added to the output of the fully connected layer; it injects relative and absolute position information of the input sequence using sine and cosine functions:

$$\begin{aligned} PE_{\left( {{\text{pos}}, 2i} \right)} & = {\text{sin}}\left( {\text{pos}/10000^{2i/{{\rm d}}_{{{\rm model}}} } } \right) \\ PE_{\left( {{\text{pos}}, 2i + 1} \right)} & = {\text{cos}}\left( {\text{pos}/10000^{2i/{{\rm d}}_{{{\rm model}}} } } \right) \\ \end{aligned}$$
(1)
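For reference, the sinusoidal encoding of Eq. (1) can be computed as in the short sketch below (a generic implementation assuming an even \(d_{{\rm model}}\), not the authors' released code):

```python
import math
import torch

def positional_encoding(n: int, d_model: int) -> torch.Tensor:
    """Sinusoidal positional encoding of Eq. (1): even dimensions use sine,
    odd dimensions use cosine, with geometrically spaced wavelengths.
    Assumes d_model is even."""
    pe = torch.zeros(n, d_model)
    pos = torch.arange(n, dtype=torch.float32).unsqueeze(1)             # (n, 1)
    div = torch.exp(torch.arange(0, d_model, 2, dtype=torch.float32)
                    * (-math.log(10000.0) / d_model))                    # (d_model/2,)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe  # (n, d_model), added to the projected input features
```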

Power load forecasting is a time-dependent forecasting task. However, feeding the global time information, split into "Year, Month, Day, etc.", as additional features alongside the other variables into the Transformer model decreases prediction accuracy, because too many feature inputs bring more noise to the model. The positional embedding of the basic Transformer can only represent the order of the input sequence; it fails to effectively represent where each point of the sequence lies in global time. For example, in real-world scenarios consumers consume more electricity at night than during the day, and more on weekends than on weekdays. It is difficult for the basic Transformer model to effectively utilize the time information in power load data. To better learn the temporal relationships in the historical data, we propose a time augmentation layer to enhance the temporal representation of the historical input sequence. For each time step of the input sequence, the basic timestamp \(T_{t_i }\), such as "2010/1/1 00:30", is used to generate derived features: year, \({\text{Y}}\); month, \({\text{M}}\); day, \({\text{D}}\); time-stamp of the day, divided into 30-min intervals, \({\text{H}}\); current day of the week, \({\text{W}}\); and a binary holiday label, \({\text{L}}\). We convert these discrete temporal features to one-hot encodings and concatenate them, giving \(T_i \in {\mathbb{R}}^{n \times d_{t} }\) over the input sequence, where \(d_t\) is the total dimension of the one-hot encoding of the temporal features:

$$T_i = {\text{Concat}}\left( {\text{one-hot}}\left( {Y,M,D,H,W,L} \right) \right)$$
(2)

Each time step’s time encoding \(T_i \in {\mathbb{R}}^{n \times d_{t} }\) is transformed to \(T_i \in {\mathbb{R}}^{n \times d_{{\rm model}} }\) by two fully connected layers with a ReLU activation function:

$${\text{FFN}}\left( {T_i } \right) = {\text{max}}\left( {0, T_i W_1 + b_1 } \right)W_2 + b_2$$
(3)

where \(W_1\), \(W_2\), \(b_1\) and \(b_2\) are learnable parameters of the linear mappings. In addition, to prevent vanishing gradients caused by the large number of layers in the overall model, we use a residual connection and layer normalization, which can be expressed as (4):

$$T_i = {\text{LayerNorm}}\left( {T_i + FFN\left( {T_i } \right)} \right)$$
(4)

Thus, we obtain the final vector \(X_i\) as input to the encoder and decoder, which contains the original sequence input, the absolute position information \(PE\) from the positional encoding and the time information \(T_i\) from the time augmentation layer:

$$X_i = X_i + T_i + PE$$
(5)
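A minimal PyTorch sketch of the time augmentation layer described by Eqs. (2)–(4) is given below. It is our own illustration, not the authors' implementation; in particular, an initial linear projection of the one-hot vector to \(d_{{\rm model}}\) is assumed so that the residual connection in Eq. (4) is dimensionally consistent with the FFN output.

```python
import torch
import torch.nn as nn

class TimeAugmentation(nn.Module):
    """Sketch of the time augmentation layer, Eqs. (2)-(4)."""

    def __init__(self, d_t: int, d_model: int, d_hidden: int = 1024):
        super().__init__()
        self.proj = nn.Linear(d_t, d_model)    # assumed projection of the one-hot vector
        self.ffn = nn.Sequential(              # two fully connected layers + ReLU, Eq. (3)
            nn.Linear(d_model, d_hidden),
            nn.ReLU(),
            nn.Linear(d_hidden, d_model),
        )
        self.norm = nn.LayerNorm(d_model)      # layer normalization of Eq. (4)

    def forward(self, t_onehot: torch.Tensor) -> torch.Tensor:
        # t_onehot: (batch, n, d_t) concatenated one-hot calendar features, Eq. (2)
        t = self.proj(t_onehot)
        return self.norm(t + self.ffn(t))      # residual connection + LayerNorm, Eq. (4)

# Final encoder/decoder input of Eq. (5): projected load features + time encoding + PE
# x_in = x_proj + time_aug(t_onehot) + pe.unsqueeze(0)
```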

Encoder: The encoder is composed of a stack of encoder layers, and the number of encoder layers is a free parameter. The vector produced by the input layer is fed into the encoder stack. Each encoder layer contains a multi-head self-attention sub-layer and a fully connected feed-forward sub-layer. As the name implies, self-attention computes the attention of the input sequence over itself within the encoder. The encoder encodes all historical load information from the input sequence and captures the interdependencies among its elements; the context vector it produces is passed to the decoder and provides global historical load information. Besides, to speed up training and reduce vanishing gradients, a residual connection [32] and layer normalization [33] are employed around each of the two sub-layers.

Decoder: The decoder also consists of a stack of decoder layers. In the training phase, the input of the decoder is the target output sequence shifted by one position, and the start token of the decoder is the load at the last step of the encoder's input sequence. In the predicting phase, the input of the decoder is only the load at the last step of the encoder's input sequence, and the load is then predicted step by step. The input sequence is transformed into a \(d_{{\rm model}}\)-dimensional vector representation through the input layer and positional encoding, and then fed into the stack of decoder layers. Each decoder layer has three sub-layers: a masked multi-head self-attention layer, an encoder–decoder attention layer and a fully connected feed-forward network layer. For the self-attention layer in the decoder, self-attention is modified to masked self-attention by setting the attention scores of positions after the current prediction step to \(- \infty\); otherwise, each position could attend to all positions during the attention calculation, which would leak future sequence information while the decoder makes predictions. To train the decoder in batches during the training phase, we use an upper triangular matrix as the mask to prevent the decoder from obtaining future information. Encoder–decoder attention performs multi-head attention over the decoder's input and the output of the encoder stack. It converts the encoded historical electrical load features into a global attention vector, used as input to the decoder, by building relationships between each historical time step and every future time step. Finally, the soft-max layer used for classification in the original Transformer is omitted; instead, a fully connected layer transforms the decoder output \({{\varvec{Y}}} \in {\mathbb{R}}^{m \times d_{{\rm model}} }\) to \({{\varvec{Y}}} \in {\mathbb{R}}^{m \times 1}\), where \(m\) is the number of steps forecast ahead, and we use the Mean Squared Error (MSE) loss to measure training loss:

$${\text{MSE}} = \frac{1}{N}\mathop \sum \limits_{i = 1}^N \left( {y_i - \hat{y}_i } \right)^2$$
(6)
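One common way to build the upper triangular mask mentioned above is sketched here (a generic example, not the authors' code):

```python
import torch

def causal_mask(m: int) -> torch.Tensor:
    """Upper-triangular mask: positions after the current step are set to -inf
    so the decoder's self-attention cannot peek at future load values."""
    return torch.triu(torch.full((m, m), float("-inf")), diagonal=1)

# The mask is added to the attention scores before the softmax.
print(causal_mask(4))
```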

Self-Attention: Attention is an indispensable and complex cognitive function of human beings that refers to the ability to selectively focus on some information while ignoring the rest [34]. The attention mechanism draws on the human brain to improve the ability of neural networks to process information. When a neural network processes a large amount of input information, the attention mechanism allows it to select only the key information as input.

The calculation of the attention mechanism can be divided into two steps: first, compute the attention distribution over all input positions; second, compute the weighted average of the inputs according to this distribution [35]. The \(N\) input vectors are represented as \(X = \left[ {x_1 , \ldots ,x_N } \right] \in {\mathbb{R}}^{D \times N}\), where each \(x_n \in {\mathbb{R}}^D ,n \in \left[ {1,N} \right]\) is a \(D\)-dimensional vector. The input information is represented in a query-key-value format: each input \(x_i\) is first mapped linearly to three different spaces to obtain the query vector \(q_i \in {\mathbb{R}}^{D_k }\), the key vector \(k_i \in {\mathbb{R}}^{D_k }\) and the value vector \(v_i \in {\mathbb{R}}^{D_v }\). For the entire input sequence X, the linear mappings can be expressed as (7), (8) and (9), where \(W_q \in {\mathbb{R}}^{D_k \times D_x }\), \(W_k \in {\mathbb{R}}^{D_k \times D_x }\) and \(W_v \in {\mathbb{R}}^{D_v \times D_x }\) are the parameter matrices of the linear mappings, and \(Q = \left[ {q_1 , \ldots ,q_N } \right]\), \(K = \left[ {k_1 , \ldots ,k_N } \right]\), \(V = \left[ {v_1 , \ldots ,v_N } \right]\) are the matrices composed of the query, key and value vectors, respectively.

$$Q = W_q X \in {\mathbb{R}}^{D_k \times N}$$
(7)
$$K = W_k X \in {\mathbb{R}}^{D_k \times N}$$
(8)
$$V = W_v X \in {\mathbb{R}}^{D_v \times N}$$
(9)

The Transformer uses the scaled dot product as the attention scoring function to compute the attention distribution. When the dimension \(D\) of the input vector is relatively high, the values of the plain dot product usually have a large variance, resulting in small gradients of the soft-max function. The scaled dot product alleviates this problem. The scoring function is:

$$s\left( {x,q} \right) = \frac{{x^{ \intercal } q}}{\sqrt D }$$
(10)

For each query vector \(q_n \in Q\), the key-value attention mechanism of formula (11) is used to obtain the output vector:

$$\begin{aligned} h_n & = {\text{att}}\left( {\left( {K,V} \right),q_n } \right) \\ & = \mathop \sum \limits_{j = 1}^N \alpha_{nj} v_j \\ & = \mathop \sum \limits_{j = 1}^N {\text{ softmax}}\left( {s\left( {k_j ,q_n } \right)} \right)v_j \\ \end{aligned}$$
(11)

where \(n,j \in \left[ {1,N} \right]\) are the positions in the output and input vector sequences, and \(\alpha_{nj}\) represents the weight of the \(n\)-th output on the \(j\)-th input. The output vector sequence can be abbreviated as:

$$\begin{aligned} H & = {\text{softmax}}\left( {\frac{{QK^{ \intercal } }}{{\sqrt {D_k } }}} \right)V \\ h_n & = \mathop \sum \limits_{j = 1}^N \frac{{{\text{exp}}\left( {k_j^{ \intercal } q_n /\sqrt {D_k } } \right)}}{{\mathop \sum \nolimits_{j^{\prime} = 1}^N {\text{exp}}\left( {k_{j^{\prime}}^{ \intercal } q_n /\sqrt {D_k } } \right)}}v_j \\ \end{aligned}$$
(12)

The self-attention module relates the historical load feature sequence and the future load sequence, so that the embedding representations of the source and target sequences contain richer information. The information passed from the attention layer to the subsequent FFN therefore has stronger representational ability. The self-attention mechanism is shown in Fig. 3.
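The scaled dot-product attention of Eq. (12) can be written compactly as follows (a generic sketch; the tensor shapes and the optional mask argument are our own assumptions):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Eq. (12): softmax(Q K^T / sqrt(D_k)) V, with an optional additive mask."""
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (..., N, N) attention logits
    if mask is not None:
        scores = scores + mask                       # e.g. the decoder's causal mask
    weights = F.softmax(scores, dim=-1)              # attention distribution alpha
    return weights @ V                               # weighted sum of the values
```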

Fig. 3 Self-attention mechanism

4 Experiment

4.1 Dataset and Preprocessing

The electrical load data of New South Wales were publicly obtained from the Australian National Electricity Market; data points are collected every half hour, covering 5 years from 2006 to 2010. Each data point consists of the target electrical load and six other features: hours, dry bulb temperature, wet bulb temperature, dew point temperature, humidity and electricity price.

We use the data from the first 5 years as the training set, the first 6 months of the last year as the validation set, and the last 6 months as the test set. All of the data were normalized via the zero-mean method. Then a fixed-length sliding window, shown in Fig. 1, was applied to construct \(\left( {X,Y} \right)\) pairs, in which \(X\) is the previous \(n\)-step feature vector including the target electrical load and \(Y\) is the next \(m\) steps of load data, our forecast target.
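A minimal sketch of this preprocessing is shown below; the split boundaries, the interpretation of "zero-mean" as standard-score normalization, and the use of training-set statistics for all splits are our assumptions, and the random matrix merely stands in for the real data:

```python
import numpy as np

# Illustrative array standing in for the half-hourly NSW data (random values here)
raw = np.random.rand(5 * 365 * 48, 7)

# Chronological split: training data first, then a validation block, then a test block
# (the boundaries below are placeholders, not the paper's exact dates).
n_valid = n_test = 183 * 48                       # roughly six months of half-hour steps
train = raw[: -(n_valid + n_test)]
valid = raw[-(n_valid + n_test): -n_test]
test = raw[-n_test:]

# Normalization fitted on the training data only and reused for validation and test.
mu, sigma = train.mean(axis=0), train.std(axis=0)
train_n, valid_n, test_n = (train - mu) / sigma, (valid - mu) / sigma, (test - mu) / sigma
```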

4.2 Experimental Design

We compared our Time Augmented Transformer model with the following forecasting models: ① ARIMA; ② SVR; ③ LSTM; ④ Bi-LSTM; ⑤ CNN-LSTM; ⑥ Seq2Seq; ⑦ basic Transformer.

For all methods, the length of the input history is 48 steps, and the prediction length is chosen from {1, 2, 24, 48} steps, corresponding to 30 min, 1 h, 12 h and 1 day. For ARIMA, we chose the parameters \(p = 1, d = 2\) and \(q = 1\) by analyzing the ACF and PACF diagrams produced from the dataset. For the SVR model, we used a multiple regression strategy to apply SVR to multi-step prediction. For LSTM, we used a stack of LSTM layers followed by a densely connected network: the data are fed into the LSTM layers to learn historical sequential information, and the final LSTM output is fed into the densely connected layer to produce the target number of prediction steps. For Seq2Seq, we used Gated Recurrent Units (GRU) and densely connected networks as the basic components. The encoder in Seq2Seq receives and processes the historical input data; the output of the GRU network in the decoder is fed into a fully connected feed-forward neural network, and predictions are then made step by step autoregressively. For LSTM, Bi-LSTM and Seq2Seq, the size of the hidden state was chosen from {16, 32, 64, 128, 256} and the number of layers from {1, 2, 3, 4}. For the CNN-LSTM model, we chose a one-dimensional convolution with 64 filters of kernel size 3, and the size of the hidden state is 200.

For the basic Transformer and TAT, the number of heads in multi-head attention was chosen from {8, 16}, the number of encoder and decoder layers from {2, 3, 4, 5, 6}, and the output dimension of multi-head attention from {16, 32, 64, 128, 256, 512}. We used grid search to select the optimal hyper-parameters according to performance on the validation set. We set the number of encoder layers to 4, the number of decoder layers to 2, the model dimension to 64, the number of heads to 8, the number of hidden units in the FFN layer to 2048 and the dimension of the attention q, k and v to 8. For the time augmentation layer, we set the hidden state size to 1024. The model was optimized with the Adam optimizer [36] with a learning rate of \(1 \times 10^{-5}\). For the best generalization performance, we trained all deep learning methods for 20 epochs with early stopping. A mini-batch size of 1024 was used for training, and a dropout rate of 0.2 was applied to our model.
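For readability, the selected hyper-parameters can be gathered into a single configuration, as in the hypothetical snippet below (the names are ours):

```python
# Hyper-parameters reported above, collected into one place for reference.
tat_config = dict(
    n_encoder_layers=4,
    n_decoder_layers=2,
    d_model=64,
    n_heads=8,
    d_ffn=2048,
    d_qkv=8,
    time_aug_hidden=1024,
    optimizer="Adam",
    learning_rate=1e-5,
    epochs=20,          # with early stopping
    batch_size=1024,
    dropout=0.2,
)
```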

We computed the Root Mean Squared Error (RMSE), Mean Absolute Error (MAE) and Mean Absolute Percentage Error (MAPE) between the actual data and the predicted values to evaluate the performance of all methods. The test error measures RMSE and MAPE are expressed as follows:

$${\text{RMSE}} = \sqrt {\frac{1}{N}\mathop \sum \limits_{i = 1}^N \left( {y_i - \hat{y}_i } \right)^2 }$$
(13)
$${\text{MAPE}} = \frac{100\% }{N}\mathop \sum \limits_{i = 1}^N \left| {\frac{{y_i - \hat{y}_i }}{y_i }} \right|$$
(14)
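These metrics can be computed directly, for example (a straightforward sketch of Eqs. (13) and (14), plus MAE):

```python
import numpy as np

def rmse(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.sqrt(np.mean((y - y_hat) ** 2)))       # Eq. (13)

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(np.mean(np.abs(y - y_hat)))

def mape(y: np.ndarray, y_hat: np.ndarray) -> float:
    return float(100.0 * np.mean(np.abs((y - y_hat) / y)))  # Eq. (14), in percent
```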

All experiments were carried out on a personal server with two Nvidia Tesla V100 (16 GB) GPUs.

4.3 Results and Discussion

4.3.1 Multi-step Ahead Forecasting for NSW Data

In our first experiment, we compare different methods for 30 min (1 step), 1 h (2 steps), 12 h (24 steps) and one day (48 steps) ahead forecasting on the New South Wales electrical dataset. For all predictive models, we used 24 h of historical data (48 historical steps) as the input vector. We compared our TAT model's performance with ARIMA, LSTM, Bi-LSTM, CNN-LSTM, Seq2Seq and the basic Transformer. Table 1 summarizes the MAE, MSE and MAPE values of each method for multi-step ahead forecasting; our model achieves the best results for all prediction horizons.

Table 1 Comparison of different models for forecasting multi-step electrical load

Figure 4 shows the predictions of the four models at prediction horizons of 1, 2, 24 and 48 steps ahead in subgraphs a, b, c and d, respectively. It can be seen that the predictions of every model are accurate when forecasting 1 step ahead. However, our method is more sensitive to local changes in the load curve and can more accurately predict subtle changes in the electrical load over a short period, as shown in the zoomed-in views in Fig. 4(a). As the prediction horizon increases, the deviation between the predicted curve and the real curve gradually grows for ARIMA, the machine learning methods and the other deep learning methods, while the prediction of our method stays closer to the actual curve, especially at the troughs and peaks of the power load curve in Fig. 4(b), (c) and (d), demonstrating that our model performs significantly better than the other forecasting models.

Fig. 4 Forecast test results for 30 min, 1 h, 12 h, and 1 day ahead, respectively, using four models

4.3.2 Multivariate and Univariate Variable Input with Multi-step Ahead Forecasting

Our model can be used for both univariate and multivariate input prediction by adjusting the input layer of the encoder. For univariate prediction, we use only load consumption as a single-variable time series and construct supervised learning pairs with sliding windows, using historical load to predict the subsequent multi-step load. For multivariate input, we use the electrical load and the six other features, including hours, dry bulb temperature, wet bulb temperature, dew point temperature, humidity and electricity price, as the input data. In this section, we validated the univariate model by feeding only historical power load data as a single-variable series into our TAT model. As with the multivariate TAT described above, we made predictions for 30 min, 1 h, 12 h and 1 day ahead using the historical univariate data as input, and compared them with the multivariate TAT model. As shown in Table 2, multivariate inputs produce better predictions than univariate inputs. This suggests that changes in electric load are related to many factors: they depend not only on the load's own history but are also directly affected by external and random factors. More prior knowledge is thus beneficial to the prediction accuracy of our model, because the multivariate input brings more dependent features to the model, and the self-attention mechanism has sufficient capacity to capture the complex dynamical patterns in the multivariate data.

Table 2 Comparison of performance for multi-step forecasting under the univariate and multivariate variable input

Besides, we ranked the importance of the multivariate features to explore how each variable contributes to the prediction results. By removing each input variable in turn, the decline in the model's accuracy reflects the contribution of that variable to the prediction; the results are shown in Fig. 5. The dry bulb temperature has the greatest effect on the predicted results, and the electricity price has the weakest effect among all the variables.

Fig. 5 Ranking of the importance of the multivariate features. The horizontal axis shows the decrease in model accuracy after the variable was removed; the greater the decrease, the greater that variable's contribution to the model's prediction

4.3.3 Comparison of Different Input Length for an Hour Ahead Forecasting

During the experiments, we found that the number of input time steps of historical data has a great influence on the prediction results, so we used historical sequences of different lengths as input to predict the power load for the next hour; the comparison is shown in Fig. 6. The prediction errors of all models except LSTM decrease as the length of the input history increases, because longer histories may contain more dependencies and provide more information to the model. For LSTM, however, further increasing the input length causes the RMSE to rise, since it cannot effectively capture the dependencies and regularities of the history in the case of longer input sequences; both Bi-LSTM and CNN-LSTM alleviate this defect. In the 1 h (2-step) ahead prediction, our TAT model always performs best regardless of the length of the historical input. Our experiments show that the model offers preferable prediction performance and practicality, as it can use less data to capture the load features and make more accurate predictions. Only 6 steps of historical data are needed to achieve the same prediction accuracy as the basic Transformer with 48 input steps, which means lower memory occupancy and faster computation.

Fig. 6 RMSE and MAPE for different input lengths for one-hour-ahead forecasting

5 Conclusion

In this paper, we developed a short-term forecasting model, TAT, for electrical load forecasting and tested it on electrical load data from New South Wales. Compared with six other methods (ARIMA, LSTM, Bi-LSTM, Seq2Seq, CNN-LSTM and the basic Transformer), our model has the best forecast performance. Moreover, we compared the model with univariate input, using only historical power load data as a single-variable series, against the multivariate TAT; multivariate inputs produce better predictions, suggesting that the multivariate input brings more dependent features to the model and that our approach better learns the dynamic dependencies in complex input sequences. In addition, we compared the predictive ability of the model with different input lengths and found that our approach can rely on less historical data to obtain better prediction results than other models. In summary, our model is a satisfactory approach to electrical load forecasting. Finally, although our approach is very effective for short-term electrical load forecasting, the prediction accuracy gradually decreases as the prediction horizon increases. In future work, we hope to further improve the performance of the model from the perspective of external factors affecting the power load.