Self-attention-based time-variant neural networks for multi-step time series forecasting

Gao, Changxia; Zhang, Ning; Li, Youru; Bian, Feng; Wan, Huaiyu

doi:10.1007/s00521-021-06871-1

Self-attention-based time-variant neural networks for multi-step time series forecasting

Original Article
Published: 05 February 2022

Volume 34, pages 8737–8754, (2022)
Cite this article

Neural Computing and Applications Aims and scope Submit manuscript

Changxia Gao^1,2,
Ning Zhang^1,2,
Youru Li^1,5,
Feng Bian³ &
…
Huaiyu Wan ORCID: orcid.org/0000-0002-0501-9363^1,4

1600 Accesses
19 Citations
1 Altmetric
Explore all metrics

Abstract

Time series forecasting is ubiquitous in various scientific and industrial domains. Powered by recurrent and convolutional and self-attention mechanism, deep learning exhibits high efficacy in time series forecasting. However, the existing forecasting methods are suffering some limitations. For example, recurrent neural networks are limited by the gradient vanishing problem, convolutional neural networks cost more parameters, and self-attention has a defect in capturing local dependencies. What’s more, they all rely on time invariant or stationary since they leverage parameter sharing by repeating a set of fixed architectures with fixed parameters over time or space. To address the above issues, in this paper we propose a novel time-variant framework named Self-Attention-based Time-Variant Neural Networks (SATVNN), generally capable of capturing dynamic changes of time series on different scales more accurately with its time-variant structure and consisting of self-attention blocks that seek to better capture the dynamic changes of recent data, with the help of Gaussian distribution, Laplace distribution and a novel Cauchy distribution, respectively. SATVNN obviously outperforms the classical time series prediction methods and the state-of-the-art deep learning models on lots of widely used real-world datasets.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

A review on the long short-term memory model

Article 13 May 2020

Deep learning for time series classification: a review

Article 02 March 2019

A CNN-BiLSTM-AM method for stock price prediction

Article 24 November 2020

Notes

The first layer is an input layer.

References

Yule GU (1927) On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers. Philos Trans R Soc A 226:267–298
MATH Google Scholar
Box GEP, Pierce DA (1970) Distribution of residual in autoregressive-integrated moving average time series. J Am Stat Assoc 65:1509–1526
Article MathSciNet Google Scholar
Chen S, Wang XX, Harris CJ (2007) NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Trans Control Syst Technol 16(1):78–84
Article Google Scholar
Bouchachia A, Bouchachia S (2008) Ensemble learning for time series prediction. NDES
Frigola R, Rasmussen CE (2013) Integrated pre-processing for Bayesian nonlinear system identification with gaussian processes. In: IEEE CDC
Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536
Article Google Scholar
Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780
Article Google Scholar
Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: continual prediction with LSTM. Neural Comput 12(10):2451–2471
Article Google Scholar
Gers FA, Eck D, Schmidhuber J (2001) Applying LSTM to time series predictable through time-window approaches. In: ICANN, pp 669–676
Khandelwal U, He H, Qi P, Jurafsky D (2018) Sharp nearby, fuzzy far away: how neural language models use context. ACL, pp 284–294
Mariet Z, Kuznetsov V (2019) Foundations of sequence-to-sequence modeling for time series. CoRR arXiv:1805.03714
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, Kaiser L, Polosukhin I (2017) Attention is all you need. In: NIPS
Guo M, Zhang Y, Liu T (2019) Gaussian transformer: a lightweight approach for natural language inference. AAAI 33:6489–6496
Article Google Scholar
Im J, Cho S (2017) Distance-based self-attention network for natural language inference. In: CoRR arXiv:1805.03714
Dabrowski JJ, Zhang Y, Rahman A ForecastNet: A time-variant deep feed-forward neural network architecture for multi-step-ahead time-series forecasting. In: ICONIP
Qin Y, Song D, Chen H, Cheng W, Jiang G, Cottrell G (2017) A dual-stage attention-based recurrent neural network for time series prediction. In: IJCAI
Lai G, Chang WC, Yang Y, Liu H (2018) Modeling long- and short-term temporal patterns with deep neural networks. ACM
Taieb SB, Atiya AF (2015) A bias and variance analysis for multistep-ahead time series forecasting. IEEE Trans Neural Netw Learn Syst 27:62–76
Article MathSciNet Google Scholar
Taieb SB, Bontempi G, Atiya AF, Sorjamaa A (2012) A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Syst Appl 39(8):7067–7083
Article Google Scholar
Salinas D, Flunkert V, Gasthaus J, Januschowski T (2020) DeepAR: probabilistic forecasting with autoregressive recurrent networks. Int J Forecast 36(3):1181–1191
Article Google Scholar
Rangapuram SS, Seeger MW, Gasthaus J, Stella L, Wang Y, Januschowski T (2018) Deep state space models for time series forecasting. In: NIPS, pp 7785–7794
Wang Y, Smola A, Maddix D, Gasthaus J, Foster D, Januschowski T (2019) Deep factors for forecasting. In: ICML, pp 6607–6617
Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 2:157–166
Article Google Scholar
Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. NIPS, pp 3104–3112
Yu R, Zheng S, Anandkumar A, Yue Y (2017) Long-term forecasting using higher order tensor RNNs. CoRR arXiv:1711.00073
Wen R, Torkkola K, Narayanaswamy B, Madeka D (2017) A multi-horizon quantile recurrent forecaster. NIPS
Guen VL, Thome N (2019) Shape and time distortion loss for training deep time series forecasting models. In: NIPS, pp 4189–4201
Laptev N, Yosinski J, Li LE, Smyl S (2017) Time-series extreme event forecasting with neural networks at uber. ICML 34:1–5
Google Scholar
Xiong H, He Z, Wu H, Wang H (2018) Modeling coherence for discourse neural machine translation. In: AAA I, pp 7338–7345
Yang B, Wang L, Wong DF, Chao LS, Tu Z (2019) Convolutional self-attention network. CoRR arXiv:1810.13320
Shaw P, Uszkoreit J, Vaswani A (2018) Self-attention with relative position representations. In: NAACL
Zhang J, Luan H, Sun M, Zhai F, Xu J, Zhang M, Liu Y (2018) Improving the transformer translation model with document-level context. In: EMNLP
Im J, Cho S (2017) Distance-based self-attention network for natural language inference. CoRR arXiv:1712.02047
Yang B, Tu Z, Wong DF, Meng F, Chao LS, Zhang T (2018) Modeling localness for self-attention networks. CoRR arXiv:1810.10182
Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp 4171–4186
Kang WC, Mcauley J (2018) Self-attentive sequential recommendation. In: ICDM
Sun F, Liu J, Wu J, Pei C, Lin X, Ou W, Jiang P (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. CoRR arXiv:1904.06690
Zhang S, Tay Y, Yao L, Sun A (2018) Next item recommendation with self-attention. CoRR arXiv:1808.06414
Chen Q, Zhao H, Li W, Huang P, Ou W (2019) Behavior sequence transformer for E-commerce recommendation in Alibaba. In: KDD
Liu PJ, Saleh M, Pot E, Goodrich B, Sepassi R, Kaiser L, Shazeer N (2018) Generating wikipedia by summarizing long sequences. In: ICLR
Hoang A, Bosselut A, Celikyilmaz A, Choi Y (2019) Efficient adaptation of pretrained transformers for abstractive summarization. CoRR arXiv:1906.00138
Subramanian S, Li R, Pilault J, Pal C (2019) On extractive and abstractive neural document summarization with transformer language models. CoRR arxiv:1909.03186
Zhang X, Meng K, Liu G (2019) Hie-transformer: a hierarchical hybrid transformer for abstractive article summarization. In: ICONIP, pp 248–258
Wu N, Green B, Ben X, O’Banion S (2020) Deep transformer models for time series forecasting: the influenza prevalence case. CoRR arXiv:2001.08317
Huang S, Wang D, Wu X, Tang A (2019) DSANet: dual self-attention network for multivariate time series forecasting. ACM, pp 2129–2132
Song H, Rajan D, Thiagarajan JJ, Spanias A (2018) Attend and diagnose: clinical time series analysis using attention models. In: AAAI
Yang B, Tu Z, Wong DF, Meng F, Chao LS, Zhang T (2018c) Modeling localness for self-attention networks. In: EMNLP, pp 4449–4458
Oppenheim AV, Schafer RW, Buck JR (1999) Pearson Education Signal Processing Series, Discrete-Time Signal Processing (2nd ed.)
Hyndman R Time series data library. https://robjhyndman.com/tsdl/
Guin A (2006) Travel time prediction using a seasonal autoregressive integrated moving average time series model. In: ITSC, pp 493–498
Fildes R (1992) Bayesian forecasting and dynamic models. Int J Forecast 8:635–637
Google Scholar
Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. In: ICLR
Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR arxiv:1803.01271
Hyndman RJ, Koehler AB (2006) Another look at measures of forecast accuracy. Int J Forecast 22(4):679–688
Article Google Scholar
Hyndman R, Athanasopoulos G (2018) Forecasting: principles and practice. OTexts
Makridakis S, Spiliotis E, Assimakopoulos V (2018) The M4 competition: results, findings, conclusion and way forward. Int J Forecast 34(4):802–808
Article Google Scholar
Sun M (2009) Iterative learning neurocomputing. In: WNIS, pp 158–161

Download references

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 2020-JBZD004).

Author information

Authors and Affiliations

School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
Changxia Gao, Ning Zhang, Youru Li & Huaiyu Wan
China Engineering Research Center of Network Management Technology for High Speed Railway of MOE, Beijing, 100044, China
Changxia Gao & Ning Zhang
College of Information Science and Technology, Beijing Normal University, Beijing, 100875, China
Feng Bian
Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing, 100044, China
Huaiyu Wan
Beijing Key Laboratory of Advanced Information Science and Network Technology, Beijing, China
Youru Li

Authors

Changxia Gao
View author publications
You can also search for this author in PubMed Google Scholar
Ning Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Youru Li
View author publications
You can also search for this author in PubMed Google Scholar
Feng Bian
View author publications
You can also search for this author in PubMed Google Scholar
Huaiyu Wan
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Huaiyu Wan.

Ethics declarations

Conflict of interest

All authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of convergence of SATVNN

The weight of the time-invariant network is often constant, and the mapping between the input and output of the network is fixed after training. However, the mapping relationship between the input and output of the time-variant system will change with time. In this paper, we propose a SATVNN with time-variant weights, which is used to establish a neural network each time, and the neural network at that time is used to approach the mapping relationship between the input and output of the system at that time. We regard time-variant weight learning as an unknown time-variant parameter estimation problem for nonlinear time-variant systems. In [57], an iterative learning least squares algorithm is proposed to train the weights of the time-variant network. Theoretical analysis shows that the iterative learning least squares algorithm can ensure that the modeling error converges to zero along the iteration axis in the whole time interval. We will elaborate on the reasons in the following paragraphs.

Time-variant neural network The time-variant neural network whose input, output and weight change with time can be expressed as follows:

$$\begin{aligned} y(t)={\mathcal {E}}^{\mathrm{T}}(x(t)) {W}(t) \end{aligned}$$

(31)

or

$$\begin{aligned} y(t)={\mathcal {E}}^{\mathrm{T}}(x(t),t) {W}(t) \end{aligned}$$

(32)

where $x(t)=\left[ x_{1}(t), x_{2}(t), \ldots , x_{n}(t)\right] ^{\mathrm{T}} \in {\mathbb{R}}^{n}$ is input vector. ${\mathcal {E}}(x(t)) = [{\mathcal {E}}_{1}(x(t)), {\mathcal {E}}_{2}(x(t)), \ldots , {\mathcal {E}}_{l}(x(t))]^{\mathrm{T}}$ is a set of activation function vectors. y(t) is output. ${W}(t)=\left[ w_{1}(t), \ldots , w_{l}(t)\right] ^{\mathrm{T}}$ is an adjustable time-variant weight vector. l is the number of neuron nodes.

Time-variant system Let the nonlinear discrete time-variant system be:

$$\begin{aligned} y_{m}(t)=f\left( x_{m}(t), t\right) \end{aligned}$$

(33)

where f is a nonlinear time-variant function, $x_{m}(t)$ is input vector. $t \in \{0,1, \ldots , N\}$ is a finite time interval. $m \in \{0,1, \ldots \}$ is the number of iterations and $y_{m}(t)$ is output. In the following discussion, we all assume that $x_{m}(t)$ is bounded.

Iterative learning least square method Let the nonlinear time-variant system (33) be run repeatedly for m time from the 0th time, and the input and output data pairs $\{\left( x_{i}(t), y_{i}(t)\right) , t=0,1, \ldots , N, i=0,1,\ldots ,m\}$ are generated by the system. We will use the time-variant neural network (32) to approximate the above input–output mapping.

First, we define ${{E}}_{m}(t)=[{\mathcal {E}}_{0}^{\mathrm{T}}(t), {\mathcal {E}}_{1}^{\mathrm{T}}(t), \ldots , {\mathcal {E}}_{m}^{\mathrm{T}}(t)]^{\mathrm{T}}$ and ${Y}_{m}(t)=[y_{0}(t), y_{1}(t), \ldots , y_{m}(t)]^{\mathrm{T}}$,where ${\mathcal {E}}_{i}(t)={\mathcal {E}} (x_{i}(t), t)$, according to Eq. (32),

$$\begin{aligned} Y_{m}(t)={E}_{m}(t) W(t) \end{aligned}$$

(34)

We hope to find out the least square solution for W(t), ${\hat{W}}_{m}(t)$, so that the following loss function is minimized.

$$\begin{aligned} \begin{aligned} J_{m}({\hat{W}}_{m}(t), t)&= \frac{1}{2}[Y_{m}(t)-{E_{m}(t)} {\hat{W}}_{m}(t)]^{\mathrm{T}} \\&\quad [{Y}_{m}(t)-{E_{m}(t)} {\hat{W}}_{m}(t)] \end{aligned} \end{aligned}$$

(35)

Write Eq. (35) further as:

$$\begin{aligned} \begin{aligned}&2 J_{m} \left( {\hat{W}}_{m}(t), t\right) ={Y}_{m}^{\mathrm{T}}(t) \left[ {I}-{{E}}_{m}(t) \left( {{E}}_{m}^{\mathrm{T}}(t) {E}_{m}\right) ^{-1}(t) {E}_{m}^{\mathrm{T}}(t) \right] \\&{Y}_{m}(t) + \left[ {\hat{W}}_{m}(t) - \left[ {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{{m}}(t)\right] ^{-1} {{E}}_{m}^{\mathrm{T}}(t) {Y}_{{m}}(t)\right] ^{\mathrm{T}} \\&{E}_{m}^{\mathrm{T}}(t) {E}_{m}(t) \left[ \hat{W}_{m}(t)-\left( {{E}}_{m}^\mathrm{T}(t) {E}_{m}(t)\right) ^{-1} {{E}}_{m}^{\mathrm{T}}(t) {Y}_{m}(t+1)\right] \end{aligned} \end{aligned}$$

(36)

Note that only the second term on the right side of the above formula is related to ${\hat{W}}_{m}(t)$, set this item zero. We assume that ${{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)$ is invertible and achieve the minimum value:

$$\begin{aligned} {\hat{W}}_{m}(t)=\left( {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\right) ^{-1} {{E}}_{m}^{\mathrm{T}}(t) Y_{m}(t) \end{aligned}$$

(37)

According to Eq. (37), calculating ${\hat{W}}_{m}(t)$ requires calculating the inverse of ${E}_{m}(t)$. The dimension of ${E}_{m}(t)$ will increase, and the inverse computation will become time-consuming as iteration increases. To solve this problem, an iterative learning least square algorithm is derived.

Let us define $Q_{m}^{-1}(t)={E}_{m}^{\mathrm{T}}(t) {E}_{m}(t)$, according to the definition of ${E}_{m}(t)$, we can get:

$Q_{m+1}^{-1}(t)=\sum _{i=0}^{m+1} {\mathcal {E}}_{i}(t) {\mathcal {E}}_{i}^{\mathrm{T}}(t)=\sum _{i=0}^{m} {\mathcal {E}}_{i}(t) {\mathcal {E}}_{i}^{\mathrm{T}}(t)+{\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t)$ and namely:

$$\begin{aligned} Q_{m+1}^{-1}(t)=Q_{m}^{-1}(t)+{\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) \end{aligned}$$

(38)

Use Matrix inversion lemma, $(A^{-1}+H R^{-1} H^{\mathrm{T}})^{-1}=A-A H({H}^{\mathrm{T}} A {H}+{R})^{-1} {H}^{T} {A}$, let ${A}={Q}_{m}(t)$, ${H}={\mathcal {E}}_{m+1}(t)$, ${R}=1$. we can get:

$$\begin{aligned} Q_{m+1}(t)=Q_{m}(t)-\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)} \end{aligned}$$

(39)

According to the definition of $Q_{m}^{-1}(t)$, we can compute ${\hat{W}}_{m}$ such that:

$$\begin{aligned} \begin{aligned} {\hat{W}}_{m+1}&= Q_{m+1} {E}_{m+1}^{\mathrm{T}} Y_{m+1} \\&=Q_{m+1}\left[ {E}_{m}^{\mathrm{T}} Y_{m}+{\mathcal {E}}_{m+1} y_{m+1}\right] \\&=Q_{m+1}\left[ Q_{m}^{-1} {\hat{W}}_{m}+{\mathcal {E}}_{m+1} y_{m+1}\right] \\&={\hat{W}}_{m}+Q_{m+1} {\mathcal {E}}_{m+1}\left[ y_{m+1}-{\mathcal {E}}_{m+1}^{\mathrm{T}} {\hat{W}}_{m}\right] \end{aligned} \end{aligned}$$

(40)

Define error $e_{m+1}(t)=y_{m+1}(t)-{\mathcal {E}}_{m+1}(t) {\hat{W}}_{m}(t)$

According to Eq. (39), we get:

$$\begin{aligned} {\hat{W}}_{m+1}(t)={\hat{W}}_{m}(t)+\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {Q}_{m}(t){{\mathcal {E}}_{m+1}}(t)} e_{m+1}(t) \end{aligned}$$

(41)

According to Eqs. (39) and (41), update laws for $Q_{m}(t)$ and ${{\hat{W}}}_{m}(t)$ indicate iterative learning least square algorithm of time-variant neural for network training.

Convergence analysis We analyze the convergence of iterative learning algorithm.

Theorem 1

For system (33), the learning algorithm Eqs. (39)–(41) has the following properties:

(i)
For $t=0,1, \ldots , N$ and $m=0,1, \ldots ,$
$$\begin{aligned} \left\| {\tilde{W}}_{m+1}(t)\right\| ^{2}&\leqslant \rho _{m}(t)\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \end{aligned}$$
(42)
$$\begin{aligned} \left\| {\tilde{W}}_{m}(t)\right\| ^{2}&\leqslant \rho _{0}(t)\left\| {\tilde{W}}_{0}(t)\right\| ^{2} \end{aligned}$$
(43)
where $\rho _{m}(t)={\lambda _{\max }(Q_{m}^{-1}(t))}/{\lambda _{\min }(Q_{m}^{-1}(t))}$, minimum eigenvalues and maximum eigenvalues of $Q_{m}^{-1}(t)$ are represented by $\lambda _{\min }(Q_{m}^{-1}(t))$ and $\lambda _{\max }(Q_{m}^{-1}(t))$, respectively.
(ii)
For $t=0,1, \ldots , N$,
$$\begin{aligned} \lim _{m \rightarrow \infty } e_{m+1}(t)=0 \end{aligned}$$
(44)

Proof

By definition, $e_{m+1}(t)={\mathcal {E}}_{m+1}^{\mathrm{T}}(t) \tilde{{W}}_{m}(t)$, according to Eqs. (39)–(41), we can get:

$$\begin{aligned}&Q_{m+1}(t) Q_{m}^{-1}(t)=I-\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)} \end{aligned}$$

(45)

$$\begin{aligned}&{\tilde{W}}_{m+1}(t)=\left[ I-\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)}\right] {\tilde{W}}_{m}(t) \end{aligned}$$

(46)

Combine Eqs. (45) and (46), we can get:

$$\begin{aligned} Q_{m+1}^{-1}(t) {\tilde{W}}_{m+1}(t)=Q_{m}^{-1}(t) {\tilde{W}}_{m}(t) \end{aligned}$$

(47)

Define $V_{m}(t)={\tilde{W}}_{m}^{\mathrm{T}}(t) Q_{m}^{-1}(t) {\tilde{W}}_{m}(t)$, use Eq. (45), we can get:

$$\begin{aligned} V_{m+1}(t)-V_{m}(t)=-\frac{{\tilde{W}}_{m}^{\mathrm{T}}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\tilde{W}}_{m}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)} \end{aligned}$$

(48)

Therefore, $V_{m}(t)$ are non-increasing and nonnegative for m, namely

$$\begin{aligned} V_{m+1}(t) \leqslant V_{m}(t) \end{aligned}$$

(49)

According to Eq. (38), we can get:

$$\begin{aligned} \lambda _{\min }(Q_{m+1}^{-1}(t)) \geqslant \lambda _{\min }(Q_{m}^{-1}(t)) \geqslant \lambda _{\min }(Q_{0}^{-1}(t)) \end{aligned}$$

(50)

Combine Eq. (49), we can get:

$$\begin{aligned} \lambda _{\min }(Q_{m}^{-1}(t))\left\| {\tilde{W}}_{m+1}(t)\right\| ^{2} \leqslant \lambda _{\max }(Q_{m}^{-1}(t))\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \end{aligned}$$

(51)

Therefore:

$$\begin{aligned} \left\| {\tilde{W}}_{m+1}(t)\right\| ^{2} \leqslant \rho _{m}(t)\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \end{aligned}$$

(52)

Apparently there is $V_{m}(t) \leqslant V_{0}(t)$, combine Eq. (49), we can get:

$$\begin{aligned} \lambda _{\min }(Q_{0}^{-1}(t))\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \leqslant \lambda _{\max }(Q_{0}^{-1}(t))\left\| {\tilde{W}}_{0}(t)\right\| ^{2} \end{aligned}$$

(53)

This ensures that $\left\| {\tilde{W}}_{m}(t)\right\| ^{2}$ is uniformly bounded. The property (i) is proved. $\hfill\square$

Proof

According to Eq. (48), we can get:

$$\begin{aligned} \frac{e_{m+1}^{2}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)}=V_{m}(t)-V_{m+1}(t) \end{aligned}$$

(54)

Sum the above formula from 0 to m, we get:

$$\begin{aligned} \sum _{i=0}^{m} \frac{e_{i+1}^{2}(t)}{1+{\mathcal {E}}_{i+1}^{\mathrm{T}}(t) Q_{i}(t) {\mathcal {E}}_{i+1}(t)}=V_{0}(t)-V_{m+1}(t) \leqslant V_{0}(t) \end{aligned}$$

(55)

Because $V_{0}(t)$ is bounded, therefore:

$$\begin{aligned} \lim _{m \rightarrow \infty } \frac{e_{m+1}^{2}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)}=0 \end{aligned}$$

(56)

Use $\lambda _{\max }\left( Q_{m+1}(t)\right) \leqslant \lambda _{\max }\left( Q_{m}(t)\right)$, we can get:

$$\begin{aligned} {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t) \leqslant \lambda _{\max }\left( Q_{0}(t)\right) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\mathcal {E}}_{m+1}(t) \end{aligned}$$

(57)

According to Eq. (56), we can get:

$$\begin{aligned} \lim _{m \rightarrow \infty } \frac{e_{m+1}(t)}{\sqrt{1+\lambda _{\max }\left( Q_{0}(t)\right) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\mathcal {E}}_{m+1}(t)}}=0 \end{aligned}$$

(58)

The input $x_{m}(t)$ is bounded; then, ${\mathcal {E}}_{m+1}(t)$ is bounded. In Eq. (58), the denominator term is bounded and nonzero; then, we can get:

$$\begin{aligned} \lim _{m \rightarrow \infty } e_{m+1}(t)=0, t=0,1, \ldots , N \end{aligned}$$

(59)

$\hfill\square$

This ensures the uniform convergence of the estimation error and proves the property (ii). Therefore, the learning algorithm we give ensures the boundedness of time-variant weights. The estimation error on a finite time interval can asymptotically converge to zero.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Gao, C., Zhang, N., Li, Y. et al. Self-attention-based time-variant neural networks for multi-step time series forecasting. Neural Comput & Applic 34, 8737–8754 (2022). https://doi.org/10.1007/s00521-021-06871-1

Download citation

Received: 10 February 2021
Accepted: 12 December 2021
Published: 05 February 2022
Issue Date: June 2022
DOI: https://doi.org/10.1007/s00521-021-06871-1

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Self-attention-based time-variant neural networks for multi-step time series forecasting

Abstract

Access this article

Similar content being viewed by others

A review on the long short-term memory model

Deep learning for time series classification: a review

A CNN-BiLSTM-AM method for stock price prediction

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix

Theorem 1

Proof

Proof

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Self-attention-based time-variant neural networks for multi-step time series forecasting

Abstract

Access this article

Similar content being viewed by others

A review on the long short-term memory model

Deep learning for time series classification: a review

A CNN-BiLSTM-AM method for stock price prediction

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Conflict of interest

Additional information

Publisher's Note

Appendix

Appendix

Theorem 1

Proof

Proof

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation