Abstract
Time Series Forecasting (TSF) is essential to key domains, and the Transformer neural network has advanced the state of the art on global, multi-horizon TSF benchmarks. However, the quadratic time and memory complexity of the Vanilla Transformer (VT) hinders its application to Big Data environments, so multiple efficient variants that lower this complexity via sparse self-attention have been proposed. Yet lower asymptotic complexity does not directly translate into faster execution: machine learning models for Big Data are typically trained on accelerators designed for dense-matrix computation, on which sparse matrices perform poorly. To compare the accuracy-speed trade-off of the VT and its variants meaningfully, it is essential to test them on such accelerators. To address this task, we implemented a cloud-based VT on Tensor Processing Units (TPUs). Experiments on large-scale datasets show that our Transformer achieves predictive performance comparable to state-of-the-art models while reducing training times from hours to under two minutes.
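For context, the quadratic complexity mentioned above originates in the self-attention score matrix. The following minimal TensorFlow sketch (our illustration, not the paper's implementation) makes the [batch, L, L] intermediate explicit for inputs of sequence length L:

```python
# Minimal scaled dot-product attention; the O(L^2) cost comes from `scores`.
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """q, k, v: float tensors of shape [batch, L, d]."""
    d = tf.cast(tf.shape(k)[-1], q.dtype)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d)  # [batch, L, L]
    weights = tf.nn.softmax(scores, axis=-1)                 # quadratic in L
    return tf.matmul(weights, v)                             # [batch, L, d]
```

Because these are dense matrix multiplications, they map directly onto the systolic arrays of accelerators such as TPUs, which is the trade-off the abstract refers to: a dense VT can train quickly even though its attention is asymptotically more expensive than that of sparse variants.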
Availability of data and materials
The datasets analyzed during the current study are available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (electricity), and http://archive.ics.uci.edu/ml/datasets/PEMS-SF (traffic). The datasets generated during the current study are available on request from the corresponding author.
Notes
For simplicity, we omit the fact that this is the conditional distribution of the prediction range of time series i given the conditioning range of time series i, as well as the collection’s conditioning and prediction ranges of all other time series. We leave this implicit since the parameters \(\boldsymbol{\Phi}\) are learned jointly from all the time series in the collection.
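Written out under a common convention (our notation, for illustration only): with values \(z_{i,t}\) of series \(i\), a conditioning range \(1, \ldots, t_0 - 1\), and a prediction range \(t_0, \ldots, T\), the distribution the note refers to is

\[ p\left(z_{i,\, t_0:T} \;\middle|\; z_{i,\, 1:t_0-1};\; \boldsymbol{\Phi}\right), \]

with the dependence on the conditioning and prediction ranges of all other series in the collection left implicit, as stated above.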
Since the model applies to all the time series, the subscript i is omitted for simplicity.
Elapsed real time, or wall-clock time, is the actual time taken to complete the training process. Training wall time is reported by TensorBoard and does not include the time spent on evaluation or checkpoint-related operations.
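For illustration only (the names `model`, `train_input_fn`, and `max_steps` are hypothetical; the paper takes its figures from TensorBoard rather than manual timing), elapsed real time can be measured by bracketing the training call:

```python
# Sketch of measuring training wall-clock (elapsed real) time.
import time

start = time.perf_counter()
model.train(input_fn=train_input_fn, max_steps=max_steps)
elapsed_s = time.perf_counter() - start
# Note: unlike the TensorBoard-reported training wall time, this span also
# counts any checkpoint writes performed inside the training call.
print(f"Training wall time: {elapsed_s:.1f} s")
```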
Acknowledgments
This research was supported by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) of Mexico.
Funding
J. Luis García-Nava’s Sc.D. program is supported by a National Scholarship granted by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) under CVU No. 737505. Victor M. Tellez’s Sc.D. program is supported by a National Scholarship granted by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) under CVU No. 816803.
Author information
Contributions
All authors contributed to the research conceptualization and design. JLG-N, JJF, and VMT collected, reviewed, and prepared data. The software was developed, deployed in the cloud, and executed by JLG-N. Results analysis was performed by JLG-N, JJF, and FC. JLG-N and JJF wrote the first draft of the manuscript. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
García-Nava, JL., Flores, J.J., Tellez, V.M. et al. Fast training of a transformer for global multi-horizon time series forecasting on tensor processing units. J Supercomput 79, 8475–8498 (2023). https://doi.org/10.1007/s11227-022-05009-x