
Fast training of a transformer for global multi-horizon time series forecasting on tensor processing units

The Journal of Supercomputing

Abstract

Time Series Forecasting (TSF) is essential in key domains, and the Transformer neural network has advanced the state of the art on global, multi-horizon TSF benchmarks. The quadratic time and memory complexity of the Vanilla Transformer (VT) hinders its application to Big Data environments; consequently, multiple efficient VT variants that lower complexity through sparse self-attention have been proposed. However, lower algorithmic complexity does not directly translate into faster execution, and machine learning models for Big Data are typically trained on accelerators designed for dense-matrix computation, which perform poorly on sparse matrices. To compare the accuracy-speed trade-off of the VT and its variants fairly, it is essential to test them on such accelerators. To address this task, we implemented a cloud-based VT on Tensor Processing Units. Experiments on large-scale datasets show that our Transformer achieves predictive performance comparable to state-of-the-art models while reducing training times from hours to under two minutes.
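To make the TPU training setup concrete, the sketch below trains a small dense-attention forecaster with TensorFlow's TPUStrategy. It is a minimal illustration under our own assumptions (the architecture, synthetic data, and hyperparameters are placeholders, and it requires a Cloud TPU runtime); it is not the implementation evaluated in this paper.

import tensorflow as tf

# Connect to the TPU and create a distribution strategy; on a Cloud TPU VM
# the default resolver arguments are sufficient.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)
strategy = tf.distribute.TPUStrategy(resolver)

CONTEXT_LEN, HORIZON, D_MODEL = 168, 24, 64   # illustrative sizes

def make_model():
    # One encoder block with dense (full) self-attention, the computation
    # pattern that maps well onto the TPU's systolic matrix units.
    inputs = tf.keras.Input(shape=(CONTEXT_LEN, 1))
    x = tf.keras.layers.Dense(D_MODEL)(inputs)
    attn = tf.keras.layers.MultiHeadAttention(num_heads=4, key_dim=D_MODEL // 4)(x, x)
    x = tf.keras.layers.LayerNormalization()(x + attn)
    ff = tf.keras.layers.Dense(D_MODEL, activation="relu")(x)
    x = tf.keras.layers.LayerNormalization()(x + ff)
    x = tf.keras.layers.GlobalAveragePooling1D()(x)
    outputs = tf.keras.layers.Dense(HORIZON)(x)   # multi-horizon point forecast
    return tf.keras.Model(inputs, outputs)

# Variables must be created under the strategy scope so they are replicated
# across the TPU cores.
with strategy.scope():
    model = make_model()
    model.compile(optimizer=tf.keras.optimizers.Adam(1e-3), loss="mse")

# Synthetic stand-in data; the experiments in the paper use the UCI
# electricity and traffic datasets listed below.
x_train = tf.random.normal((1024, CONTEXT_LEN, 1))
y_train = tf.random.normal((1024, HORIZON))
model.fit(x_train, y_train, epochs=2, batch_size=256)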


Availability of data and materials

The datasets analyzed during the current study are available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (electricity), and http://archive.ics.uci.edu/ml/datasets/PEMS-SF (traffic). The datasets generated during the current study are available on request from the corresponding author.
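As a purely illustrative way of loading the electricity dataset with pandas (the archive path and file name are assumptions inferred from the UCI repository and should be verified against the landing page above):

import pandas as pd

URL = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00321/LD2011_2014.txt.zip")
# The file is semicolon-separated with comma decimals; the first column is
# the timestamp and the remaining columns are individual clients.
electricity = pd.read_csv(URL, sep=";", decimal=",", index_col=0, parse_dates=True)
print(electricity.shape)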

Notes

  1. For simplicity, we omit the fact that this is the conditional distribution of the prediction range of time series i given its own conditioning range, as well as the conditioning and prediction ranges of all other time series in the collection. We consider this conditioning implicit since the parameters \(\varvec{\Phi}\) are learned jointly from all the time series in the collection. A possible written-out form is sketched after these notes.

  2. Since the model applies to all the time series, the subscript i is omitted for simplicity.

  3. http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014.

  4. http://archive.ics.uci.edu/ml/datasets/PEMS-SF.

  5. Elapsed real time, or wall-clock time, is the actual time taken to complete the training process. Training wall time is reported by TensorBoard and does not include the time spent on evaluation or checkpoint-related operations.
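Written out under our own notation (not the paper's), the distribution abbreviated in Note 1 could be expressed as

\[
p\bigl(\mathbf{y}_{i,\,t_0:t_0+\tau} \,\big|\, \mathbf{y}_{i,\,1:t_0-1};\ \varvec{\Phi}\bigr),
\qquad \varvec{\Phi}\ \text{shared across all series } i = 1, \dots, N,
\]

where \([1, t_0-1]\) denotes the conditioning range, \([t_0, t_0+\tau]\) the prediction range, and the conditioning on the ranges of the other series is left implicit, as the note states.

As a purely illustrative example of the wall-time measurement described in Note 5 (the function names are placeholders, not the actual training code), the training call can be timed in isolation so that evaluation and checkpoint operations fall outside the measured interval:

import time

def train_model():        # placeholder for the actual TPU training run
    time.sleep(0.1)

def evaluate_model():     # placeholder for the evaluation pass
    time.sleep(0.1)

start = time.monotonic()
train_model()                                   # only training is timed
training_wall_time = time.monotonic() - start
evaluate_model()                                # evaluation excluded from the timer
print(f"training wall time: {training_wall_time:.2f} s")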


Acknowledgments

This research was supported by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) of Mexico.

Funding

J. Luis García-Nava’s Sc.D. program is supported by a National Scholarship granted by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) under CVU No. 737505. Victor M. Tellez’s Sc.D. program is supported by a National Scholarship granted by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) under CVU No. 816803.

Author information

Authors and Affiliations

Authors

Contributions

All authors contributed to the research conceptualization and design. JLG-N, JJF, and VMT collected, reviewed, and prepared data. The software was developed, deployed in the cloud, and executed by JLG-N. Results analysis was performed by JLG-N, JJF, and FC. JLG-N and JJF wrote the first draft of the manuscript. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to J.-Luis García-Nava.

Ethics declarations

Conflict of interest

All authors certify that they have no affiliations with or involvement in any organization or entity with any financial interest or non-financial interest in the subject matter or materials discussed in this manuscript.

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

García-Nava, JL., Flores, J.J., Tellez, V.M. et al. Fast training of a transformer for global multi-horizon time series forecasting on tensor processing units. J Supercomput 79, 8475–8498 (2023). https://doi.org/10.1007/s11227-022-05009-x

