Self-attention-based time-variant neural networks for multi-step time series forecasting

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Time series forecasting is ubiquitous in various scientific and industrial domains. Powered by recurrent, convolutional, and self-attention mechanisms, deep learning exhibits high efficacy in time series forecasting. However, existing forecasting methods suffer from several limitations: recurrent neural networks are limited by the vanishing-gradient problem, convolutional neural networks require more parameters, and self-attention is weak at capturing local dependencies. Moreover, all of these methods assume time invariance or stationarity, since they share parameters by repeating a fixed architecture with fixed weights over time or space. To address these issues, this paper proposes a novel time-variant framework named Self-Attention-based Time-Variant Neural Networks (SATVNN). Its time-variant structure allows it to capture the dynamic changes of time series at different scales more accurately, and its self-attention blocks, built on a Gaussian distribution, a Laplace distribution and a novel Cauchy distribution, respectively, better capture the dynamic changes of recent data. SATVNN clearly outperforms classical time series prediction methods and state-of-the-art deep learning models on a wide range of real-world datasets.

Notes

  1. The first layer is an input layer.

References

  1. Yule GU (1927) On a method of investigating periodicities in disturbed series, with special reference to Wolfer’s sunspot numbers. Philos Trans R Soc A 226:267–298

  2. Box GEP, Pierce DA (1970) Distribution of residual autocorrelations in autoregressive-integrated moving average time series models. J Am Stat Assoc 65:1509–1526

  3. Chen S, Wang XX, Harris CJ (2007) NARX-based nonlinear system identification using orthogonal least squares basis hunting. IEEE Trans Control Syst Technol 16(1):78–84

  4. Bouchachia A, Bouchachia S (2008) Ensemble learning for time series prediction. NDES

  5. Frigola R, Rasmussen CE (2013) Integrated pre-processing for Bayesian nonlinear system identification with gaussian processes. In: IEEE CDC

  6. Rumelhart DE, Hinton GE, Williams RJ (1986) Learning representations by back-propagating errors. Nature 323(6088):533–536

  7. Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780

  8. Gers FA, Schmidhuber J, Cummins F (2000) Learning to forget: continual prediction with LSTM. Neural Comput 12(10):2451–2471

  9. Gers FA, Eck D, Schmidhuber J (2001) Applying LSTM to time series predictable through time-window approaches. In: ICANN, pp 669–676

  10. Khandelwal U, He H, Qi P, Jurafsky D (2018) Sharp nearby, fuzzy far away: how neural language models use context. ACL, pp 284–294

  11. Mariet Z, Kuznetsov V (2019) Foundations of sequence-to-sequence modeling for time series. CoRR arXiv:1805.03714

  12. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, Kaiser L, Polosukhin I (2017) Attention is all you need. In: NIPS

  13. Guo M, Zhang Y, Liu T (2019) Gaussian transformer: a lightweight approach for natural language inference. AAAI 33:6489–6496

  14. Im J, Cho S (2017) Distance-based self-attention network for natural language inference. CoRR arXiv:1712.02047

  15. Dabrowski JJ, Zhang Y, Rahman A (2020) ForecastNet: a time-variant deep feed-forward neural network architecture for multi-step-ahead time-series forecasting. In: ICONIP

  16. Qin Y, Song D, Chen H, Cheng W, Jiang G, Cottrell G (2017) A dual-stage attention-based recurrent neural network for time series prediction. In: IJCAI

  17. Lai G, Chang WC, Yang Y, Liu H (2018) Modeling long- and short-term temporal patterns with deep neural networks. ACM

  18. Taieb SB, Atiya AF (2015) A bias and variance analysis for multistep-ahead time series forecasting. IEEE Trans Neural Netw Learn Syst 27:62–76

  19. Taieb SB, Bontempi G, Atiya AF, Sorjamaa A (2012) A review and comparison of strategies for multi-step ahead time series forecasting based on the NN5 forecasting competition. Expert Syst Appl 39(8):7067–7083

  20. Salinas D, Flunkert V, Gasthaus J, Januschowski T (2020) DeepAR: probabilistic forecasting with autoregressive recurrent networks. Int J Forecast 36(3):1181–1191

  21. Rangapuram SS, Seeger MW, Gasthaus J, Stella L, Wang Y, Januschowski T (2018) Deep state space models for time series forecasting. In: NIPS, pp 7785–7794

  22. Wang Y, Smola A, Maddix D, Gasthaus J, Foster D, Januschowski T (2019) Deep factors for forecasting. In: ICML, pp 6607–6617

  23. Bengio Y, Simard P, Frasconi P (1994) Learning long-term dependencies with gradient descent is difficult. IEEE Trans Neural Netw 5(2):157–166

  24. Sutskever I, Vinyals O, Le QV (2014) Sequence to sequence learning with neural networks. NIPS, pp 3104–3112

  25. Yu R, Zheng S, Anandkumar A, Yue Y (2017) Long-term forecasting using higher order tensor RNNs. CoRR arXiv:1711.00073

  26. Wen R, Torkkola K, Narayanaswamy B, Madeka D (2017) A multi-horizon quantile recurrent forecaster. NIPS

  27. Guen VL, Thome N (2019) Shape and time distortion loss for training deep time series forecasting models. In: NIPS, pp 4189–4201

  28. Laptev N, Yosinski J, Li LE, Smyl S (2017) Time-series extreme event forecasting with neural networks at Uber. ICML 34:1–5

  29. Xiong H, He Z, Wu H, Wang H (2018) Modeling coherence for discourse neural machine translation. In: AAAI, pp 7338–7345

  30. Yang B, Wang L, Wong DF, Chao LS, Tu Z (2019) Convolutional self-attention network. CoRR arXiv:1810.13320

  31. Shaw P, Uszkoreit J, Vaswani A (2018) Self-attention with relative position representations. In: NAACL

  32. Zhang J, Luan H, Sun M, Zhai F, Xu J, Zhang M, Liu Y (2018) Improving the transformer translation model with document-level context. In: EMNLP

  33. Im J, Cho S (2017) Distance-based self-attention network for natural language inference. CoRR arXiv:1712.02047

  34. Yang B, Tu Z, Wong DF, Meng F, Chao LS, Zhang T (2018) Modeling localness for self-attention networks. CoRR arXiv:1810.10182

  35. Devlin J, Chang MW, Lee K, Toutanova K (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In: NAACL, pp 4171–4186

  36. Kang WC, Mcauley J (2018) Self-attentive sequential recommendation. In: ICDM

  37. Sun F, Liu J, Wu J, Pei C, Lin X, Ou W, Jiang P (2019) BERT4Rec: sequential recommendation with bidirectional encoder representations from transformer. CoRR arXiv:1904.06690

  38. Zhang S, Tay Y, Yao L, Sun A (2018) Next item recommendation with self-attention. CoRR arXiv:1808.06414

  39. Chen Q, Zhao H, Li W, Huang P, Ou W (2019) Behavior sequence transformer for E-commerce recommendation in Alibaba. In: KDD

  40. Liu PJ, Saleh M, Pot E, Goodrich B, Sepassi R, Kaiser L, Shazeer N (2018) Generating wikipedia by summarizing long sequences. In: ICLR

  41. Hoang A, Bosselut A, Celikyilmaz A, Choi Y (2019) Efficient adaptation of pretrained transformers for abstractive summarization. CoRR arXiv:1906.00138

  42. Subramanian S, Li R, Pilault J, Pal C (2019) On extractive and abstractive neural document summarization with transformer language models. CoRR arXiv:1909.03186

  43. Zhang X, Meng K, Liu G (2019) Hie-transformer: a hierarchical hybrid transformer for abstractive article summarization. In: ICONIP, pp 248–258

  44. Wu N, Green B, Ben X, O’Banion S (2020) Deep transformer models for time series forecasting: the influenza prevalence case. CoRR arXiv:2001.08317

  45. Huang S, Wang D, Wu X, Tang A (2019) DSANet: dual self-attention network for multivariate time series forecasting. ACM, pp 2129–2132

  46. Song H, Rajan D, Thiagarajan JJ, Spanias A (2018) Attend and diagnose: clinical time series analysis using attention models. In: AAAI

  47. Yang B, Tu Z, Wong DF, Meng F, Chao LS, Zhang T (2018c) Modeling localness for self-attention networks. In: EMNLP, pp 4449–4458

  48. Oppenheim AV, Schafer RW, Buck JR (1999) Discrete-time signal processing, 2nd edn. Pearson Education Signal Processing Series

  49. Hyndman R Time series data library. https://robjhyndman.com/tsdl/

  50. Guin A (2006) Travel time prediction using a seasonal autoregressive integrated moving average time series model. In: ITSC, pp 493–498

  51. Fildes R (1992) Bayesian forecasting and dynamic models. Int J Forecast 8:635–637

  52. Bahdanau D, Cho K, Bengio Y (2014) Neural machine translation by jointly learning to align and translate. In: ICLR

  53. Bai S, Kolter JZ, Koltun V (2018) An empirical evaluation of generic convolutional and recurrent networks for sequence modeling. CoRR arXiv:1803.01271

  54. Hyndman RJ, Koehler AB (2006) Another look at measures of forecast accuracy. Int J Forecast 22(4):679–688

  55. Hyndman R, Athanasopoulos G (2018) Forecasting: principles and practice. OTexts

  56. Makridakis S, Spiliotis E, Assimakopoulos V (2018) The M4 competition: results, findings, conclusion and way forward. Int J Forecast 34(4):802–808

  57. Sun M (2009) Iterative learning neurocomputing. In: WNIS, pp 158–161

Acknowledgements

This work was supported by the Fundamental Research Funds for the Central Universities (Grant No. 2020-JBZD004).

Author information

Corresponding author

Correspondence to Huaiyu Wan.

Ethics declarations

Conflict of interest

All authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

Proof of convergence of SATVNN


The weights of a time-invariant network are constant, and the mapping between the network's input and output is fixed after training. In a time-variant system, however, the input–output mapping changes over time. In this paper, we propose SATVNN with time-variant weights: a neural network is built at each time step, and that network approximates the input–output mapping of the system at that time. We regard time-variant weight learning as the problem of estimating the unknown time-variant parameters of a nonlinear time-variant system. In [57], an iterative learning least-squares algorithm is proposed to train the weights of a time-variant network. Theoretical analysis shows that this algorithm guarantees that the modeling error converges to zero along the iteration axis over the whole time interval. We elaborate on the reasons in the following paragraphs.


Time-variant neural network A time-variant neural network, whose input, output and weights all change with time, can be expressed as follows:

$$\begin{aligned} y(t)={\mathcal {E}}^{\mathrm{T}}(x(t)) {W}(t) \end{aligned}$$
(31)

or

$$\begin{aligned} y(t)={\mathcal {E}}^{\mathrm{T}}(x(t),t) {W}(t) \end{aligned}$$
(32)

where \(x(t)=\left[ x_{1}(t), x_{2}(t), \ldots , x_{n}(t)\right] ^{\mathrm{T}} \in {\mathbb{R}}^{n}\) is the input vector, \({\mathcal {E}}(x(t)) = [{\mathcal {E}}_{1}(x(t)), {\mathcal {E}}_{2}(x(t)), \ldots , {\mathcal {E}}_{l}(x(t))]^{\mathrm{T}}\) is a vector of activation functions, y(t) is the output, \({W}(t)=\left[ w_{1}(t), \ldots , w_{l}(t)\right] ^{\mathrm{T}}\) is an adjustable time-variant weight vector, and l is the number of neuron nodes.
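
For concreteness, here is a minimal Python sketch of evaluating Eq. (32); the radial-basis activation vector is a hypothetical stand-in (SATVNN's actual activations come from its self-attention blocks), and all names are illustrative.

```python
# Minimal sketch of Eq. (32): y(t) = E(x(t), t)^T W(t), with a placeholder
# RBF activation vector standing in for the network's real activations.
import numpy as np

def activation_vector(x_t, t, centers, width=1.0):
    """Hypothetical E(x(t), t) in R^l; t is kept only to mirror Eq. (32)."""
    d = np.linalg.norm(x_t[None, :] - centers, axis=1)  # distances to l centers
    return np.exp(-d ** 2 / (2.0 * width ** 2))         # shape (l,)

def tv_network_output(x_t, t, W_t, centers):
    """Output of the time-variant network with weight vector W(t)."""
    return activation_vector(x_t, t, centers) @ W_t
```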


Time-variant system Let the nonlinear discrete time-variant system be:

$$\begin{aligned} y_{m}(t)=f\left( x_{m}(t), t\right) \end{aligned}$$
(33)

where f is a nonlinear time-variant function, \(x_{m}(t)\) is the input vector, \(t \in \{0,1, \ldots , N\}\) indexes a finite time interval, \(m \in \{0,1, \ldots \}\) is the iteration index, and \(y_{m}(t)\) is the output. Throughout the following discussion we assume that \(x_{m}(t)\) is bounded.


Iterative learning least-squares method Let the nonlinear time-variant system (33) be run repeatedly from the 0th run to the mth run, generating the input–output data pairs \(\{\left( x_{i}(t), y_{i}(t)\right) , t=0,1, \ldots , N, i=0,1,\ldots ,m\}\). We use the time-variant neural network (32) to approximate this input–output mapping.

First, we define \({{E}}_{m}(t)=[{\mathcal {E}}_{0}^{\mathrm{T}}(t), {\mathcal {E}}_{1}^{\mathrm{T}}(t), \ldots , {\mathcal {E}}_{m}^{\mathrm{T}}(t)]^{\mathrm{T}}\) and \({Y}_{m}(t)=[y_{0}(t), y_{1}(t), \ldots , y_{m}(t)]^{\mathrm{T}}\), where \({\mathcal {E}}_{i}(t)={\mathcal {E}} (x_{i}(t), t)\). According to Eq. (32),

$$\begin{aligned} Y_{m}(t)={E}_{m}(t) W(t) \end{aligned}$$
(34)

We seek the least-squares solution \({\hat{W}}_{m}(t)\) for W(t), i.e. the one that minimizes the following loss function.

$$\begin{aligned} \begin{aligned} J_{m}({\hat{W}}_{m}(t), t)&= \frac{1}{2}[Y_{m}(t)-{E_{m}(t)} {\hat{W}}_{m}(t)]^{\mathrm{T}} \\&\quad [{Y}_{m}(t)-{E_{m}(t)} {\hat{W}}_{m}(t)] \end{aligned} \end{aligned}$$
(35)

Equation (35) can further be written as:

$$\begin{aligned} \begin{aligned} 2 J_{m}\left( {\hat{W}}_{m}(t), t\right)&= {Y}_{m}^{\mathrm{T}}(t)\left[ {I}-{{E}}_{m}(t)\left( {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\right) ^{-1} {{E}}_{m}^{\mathrm{T}}(t)\right] {Y}_{m}(t) \\&\quad +\left[ {\hat{W}}_{m}(t)-\left( {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\right) ^{-1} {{E}}_{m}^{\mathrm{T}}(t) {Y}_{m}(t)\right] ^{\mathrm{T}} {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t) \\&\quad \times \left[ {\hat{W}}_{m}(t)-\left( {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\right) ^{-1} {{E}}_{m}^{\mathrm{T}}(t) {Y}_{m}(t)\right] \end{aligned} \end{aligned}$$
(36)

Note that only the second term on the right-hand side of the above formula depends on \({\hat{W}}_{m}(t)\). Setting this term to zero and assuming that \({{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\) is invertible, we obtain the minimizer:

$$\begin{aligned} {\hat{W}}_{m}(t)=\left( {{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\right) ^{-1} {{E}}_{m}^{\mathrm{T}}(t) Y_{m}(t) \end{aligned}$$
(37)

According to Eq. (37), calculating \({\hat{W}}_{m}(t)\) requires inverting \({{E}}_{m}^{\mathrm{T}}(t) {{E}}_{m}(t)\). Since the number of rows of \({E}_{m}(t)\) grows with each iteration, recomputing this estimate from scratch becomes increasingly time-consuming. To avoid this cost, an iterative learning least-squares algorithm is derived.
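
Before deriving the recursive form, note that the batch estimate of Eq. (37) amounts to a single linear solve. A minimal numpy sketch, assuming the stacked activation matrix has shape (m+1, l) and \({{E}}_{m}^{\mathrm{T}}(t){{E}}_{m}(t)\) is invertible (names are illustrative):

```python
# Batch least-squares estimate of Eq. (37) at one fixed time step t.
import numpy as np

def batch_weight_estimate(E_m, Y_m):
    """W_hat_m(t) = (E_m^T E_m)^{-1} E_m^T Y_m.
    E_m: (m+1, l) stacked activation vectors; Y_m: (m+1,) stacked outputs."""
    return np.linalg.solve(E_m.T @ E_m, E_m.T @ Y_m)
```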

Let us define \(Q_{m}^{-1}(t)={E}_{m}^{\mathrm{T}}(t) {E}_{m}(t)\). According to the definition of \({E}_{m}(t)\), we have:

\(Q_{m+1}^{-1}(t)=\sum _{i=0}^{m+1} {\mathcal {E}}_{i}(t) {\mathcal {E}}_{i}^{\mathrm{T}}(t)=\sum _{i=0}^{m} {\mathcal {E}}_{i}(t) {\mathcal {E}}_{i}^{\mathrm{T}}(t)+{\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t)\), that is:

$$\begin{aligned} Q_{m+1}^{-1}(t)=Q_{m}^{-1}(t)+{\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) \end{aligned}$$
(38)

Using the matrix inversion lemma, \((A^{-1}+H R^{-1} H^{\mathrm{T}})^{-1}=A-A H({H}^{\mathrm{T}} A {H}+{R})^{-1} {H}^{\mathrm{T}} {A}\), with \({A}={Q}_{m}(t)\), \({H}={\mathcal {E}}_{m+1}(t)\) and \({R}=1\), we get:

$$\begin{aligned} Q_{m+1}(t)=Q_{m}(t)-\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)} \end{aligned}$$
(39)

According to the definition of \(Q_{m}^{-1}(t)\), we can compute \({\hat{W}}_{m+1}\) as:

$$\begin{aligned} \begin{aligned} {\hat{W}}_{m+1}&= Q_{m+1} {E}_{m+1}^{\mathrm{T}} Y_{m+1} \\&=Q_{m+1}\left[ {E}_{m}^{\mathrm{T}} Y_{m}+{\mathcal {E}}_{m+1} y_{m+1}\right] \\&=Q_{m+1}\left[ Q_{m}^{-1} {\hat{W}}_{m}+{\mathcal {E}}_{m+1} y_{m+1}\right] \\&={\hat{W}}_{m}+Q_{m+1} {\mathcal {E}}_{m+1}\left[ y_{m+1}-{\mathcal {E}}_{m+1}^{\mathrm{T}} {\hat{W}}_{m}\right] \end{aligned} \end{aligned}$$
(40)

Define the error \(e_{m+1}(t)=y_{m+1}(t)-{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\hat{W}}_{m}(t)\).

Substituting Eq. (39) into Eq. (40), we get:

$$\begin{aligned} {\hat{W}}_{m+1}(t)={\hat{W}}_{m}(t)+\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {Q}_{m}(t){{\mathcal {E}}_{m+1}}(t)} e_{m+1}(t) \end{aligned}$$
(41)

The update laws for \(Q_{m}(t)\) and \({\hat{W}}_{m}(t)\) in Eqs. (39) and (41) constitute the iterative learning least-squares algorithm for training the time-variant neural network; a sketch is given below.
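
A minimal Python sketch of one iteration of this update at a single fixed time step t follows; the variable names are illustrative and this is not the authors' implementation.

```python
# One step of the iterative learning least-squares update, Eqs. (39) and (41).
import numpy as np

def ils_update(W_hat, Q, eps_new, y_new):
    """Update W_hat_m(t) and Q_m(t) with a new pair (E_{m+1}(t), y_{m+1}(t)).
    W_hat, eps_new: shape (l,); Q: shape (l, l); y_new: scalar."""
    e = y_new - eps_new @ W_hat                  # error e_{m+1}(t)
    denom = 1.0 + eps_new @ Q @ eps_new          # 1 + E^T Q E (a scalar)
    Q_next = Q - np.outer(Q @ eps_new, eps_new @ Q) / denom   # Eq. (39)
    W_next = W_hat + (Q @ eps_new) * (e / denom)              # Eq. (41)
    return W_next, Q_next, e
```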


Convergence analysis We now analyze the convergence of the iterative learning algorithm.

Theorem 1

For system (33), the learning algorithm Eqs. (39)–(41) has the following properties:

  1. (i)

    For \(t=0,1, \ldots , N\) and \(m=0,1, \ldots ,\)

    $$\begin{aligned} \left\| {\tilde{W}}_{m+1}(t)\right\| ^{2}&\leqslant \rho _{m}(t)\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \end{aligned}$$
    (42)
    $$\begin{aligned} \left\| {\tilde{W}}_{m}(t)\right\| ^{2}&\leqslant \rho _{0}(t)\left\| {\tilde{W}}_{0}(t)\right\| ^{2} \end{aligned}$$
    (43)

    where \(\rho _{m}(t)={\lambda _{\max }(Q_{m}^{-1}(t))}/{\lambda _{\min }(Q_{m}^{-1}(t))}\), and \(\lambda _{\min }(Q_{m}^{-1}(t))\) and \(\lambda _{\max }(Q_{m}^{-1}(t))\) denote the minimum and maximum eigenvalues of \(Q_{m}^{-1}(t)\), respectively.

  2. (ii)

    For \(t=0,1, \ldots , N\),

    $$\begin{aligned} \lim _{m \rightarrow \infty } e_{m+1}(t)=0 \end{aligned}$$
    (44)

Proof

Define the weight estimation error \({\tilde{W}}_{m}(t)=W(t)-{\hat{W}}_{m}(t)\); then, by definition, \(e_{m+1}(t)={\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\tilde{W}}_{m}(t)\). According to Eqs. (39)–(41), we obtain:

$$\begin{aligned}&Q_{m+1}(t) Q_{m}^{-1}(t)=I-\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)} \end{aligned}$$
(45)
$$\begin{aligned}&{\tilde{W}}_{m+1}(t)=\left[ I-\frac{Q_{m}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)}\right] {\tilde{W}}_{m}(t) \end{aligned}$$
(46)

Combining Eqs. (45) and (46), we get:

$$\begin{aligned} Q_{m+1}^{-1}(t) {\tilde{W}}_{m+1}(t)=Q_{m}^{-1}(t) {\tilde{W}}_{m}(t) \end{aligned}$$
(47)

Define \(V_{m}(t)={\tilde{W}}_{m}^{\mathrm{T}}(t) Q_{m}^{-1}(t) {\tilde{W}}_{m}(t)\). Using Eqs. (46) and (47), we get:

$$\begin{aligned} V_{m+1}(t)-V_{m}(t)=-\frac{{\tilde{W}}_{m}^{\mathrm{T}}(t) {\mathcal {E}}_{m+1}(t) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\tilde{W}}_{m}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)} \end{aligned}$$
(48)

Therefore, \(V_{m}(t)\) is nonnegative and non-increasing in m, namely

$$\begin{aligned} V_{m+1}(t) \leqslant V_{m}(t) \end{aligned}$$
(49)

According to Eq. (38), we can get:

$$\begin{aligned} \lambda _{\min }(Q_{m+1}^{-1}(t)) \geqslant \lambda _{\min }(Q_{m}^{-1}(t)) \geqslant \lambda _{\min }(Q_{0}^{-1}(t)) \end{aligned}$$
(50)

Combining Eqs. (49) and (50), we get:

$$\begin{aligned} \lambda _{\min }(Q_{m}^{-1}(t))\left\| {\tilde{W}}_{m+1}(t)\right\| ^{2} \leqslant \lambda _{\max }(Q_{m}^{-1}(t))\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \end{aligned}$$
(51)

Therefore:

$$\begin{aligned} \left\| {\tilde{W}}_{m+1}(t)\right\| ^{2} \leqslant \rho _{m}(t)\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \end{aligned}$$
(52)

From Eq. (49) it follows that \(V_{m}(t) \leqslant V_{0}(t)\); combining this with Eq. (50), we get:

$$\begin{aligned} \lambda _{\min }(Q_{0}^{-1}(t))\left\| {\tilde{W}}_{m}(t)\right\| ^{2} \leqslant \lambda _{\max }(Q_{0}^{-1}(t))\left\| {\tilde{W}}_{0}(t)\right\| ^{2} \end{aligned}$$
(53)

This ensures that \(\left\| {\tilde{W}}_{m}(t)\right\| ^{2}\) is uniformly bounded, which proves property (i). \(\hfill\square\)

Proof

According to Eq. (48), we can get:

$$\begin{aligned} \frac{e_{m+1}^{2}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)}=V_{m}(t)-V_{m+1}(t) \end{aligned}$$
(54)

Summing the above formula from 0 to m, we get:

$$\begin{aligned} \sum _{i=0}^{m} \frac{e_{i+1}^{2}(t)}{1+{\mathcal {E}}_{i+1}^{\mathrm{T}}(t) Q_{i}(t) {\mathcal {E}}_{i+1}(t)}=V_{0}(t)-V_{m+1}(t) \leqslant V_{0}(t) \end{aligned}$$
(55)

Because \(V_{0}(t)\) is bounded, the nonnegative partial sums on the left are bounded, and therefore:

$$\begin{aligned} \lim _{m \rightarrow \infty } \frac{e_{m+1}^{2}(t)}{1+{\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t)}=0 \end{aligned}$$
(56)

Using \(\lambda _{\max }\left( Q_{m+1}(t)\right) \leqslant \lambda _{\max }\left( Q_{m}(t)\right)\), we get:

$$\begin{aligned} {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) Q_{m}(t) {\mathcal {E}}_{m+1}(t) \leqslant \lambda _{\max }\left( Q_{0}(t)\right) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\mathcal {E}}_{m+1}(t) \end{aligned}$$
(57)

Combining Eqs. (56) and (57), we get:

$$\begin{aligned} \lim _{m \rightarrow \infty } \frac{e_{m+1}(t)}{\sqrt{1+\lambda _{\max }\left( Q_{0}(t)\right) {\mathcal {E}}_{m+1}^{\mathrm{T}}(t) {\mathcal {E}}_{m+1}(t)}}=0 \end{aligned}$$
(58)

Since the input \(x_{m}(t)\) is bounded, \({\mathcal {E}}_{m+1}(t)\) is also bounded. The denominator in Eq. (58) is therefore bounded and nonzero, and we obtain:

$$\begin{aligned} \lim _{m \rightarrow \infty } e_{m+1}(t)=0, t=0,1, \ldots , N \end{aligned}$$
(59)

\(\hfill\square\)

This establishes the convergence of the estimation error and proves property (ii). The learning algorithm therefore keeps the time-variant weights bounded, and the estimation error over the finite time interval converges asymptotically to zero along the iteration axis.
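
As an illustrative (not rigorous) check of this convergence result, the sketch below runs the ils_update function from the earlier sketch on a synthetic, noise-free system with a random activation vector per run; the error shrinks toward zero along the iteration axis, consistent with property (ii). The large initial Q is a weakly informative initialization assumed by this sketch, not part of the proof.

```python
# Numerical sanity check of Theorem 1(ii) on synthetic data (illustrative only).
import numpy as np

rng = np.random.default_rng(0)
l = 5                               # number of neuron nodes
W_true = rng.normal(size=l)         # unknown system weights at a fixed time t
W_hat = np.zeros(l)                 # initial estimate W_hat_0(t)
Q = 1e3 * np.eye(l)                 # large initial Q_0(t) (weak initialization)

errors = []
for m in range(50):                 # repeated runs of the system (iteration axis)
    eps = rng.normal(size=l)        # activation vector E_{m+1}(t) for this run
    y = eps @ W_true                # system output y_{m+1}(t), cf. Eq. (33)
    W_hat, Q, e = ils_update(W_hat, Q, eps, y)
    errors.append(abs(e))

print(errors[0], errors[-1])        # the final error is close to zero
```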

About this article

Cite this article

Gao, C., Zhang, N., Li, Y. et al. Self-attention-based time-variant neural networks for multi-step time series forecasting. Neural Comput & Applic 34, 8737–8754 (2022). https://doi.org/10.1007/s00521-021-06871-1
