Abstract
Time Series Forecasting (TSF) is essential to key domains, and the Transformer neural network has advanced the state of the art on global, multi-horizon TSF benchmarks. However, the quadratic time and memory complexity of the Vanilla Transformer (VT) hinders its application to Big Data environments, so multiple efficient variants that lower this complexity via sparse self-attention have been proposed. Yet lower asymptotic complexity does not directly translate into faster execution: machine learning models for Big Data are typically trained on accelerators designed for dense-matrix computation, on which sparse matrices perform poorly. To compare the accuracy-speed trade-off of the VT and its variants meaningfully, it is essential to test them on such accelerators. To address this task, we implemented a cloud-based VT on Tensor Processing Units (TPUs). Experiments on large-scale datasets show that our Transformer achieves predictive performance comparable to state-of-the-art models while reducing training times from hours to under two minutes.
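For context, the quadratic complexity mentioned above originates in the self-attention score matrix. The following minimal TensorFlow sketch (our illustration, not the paper's implementation) makes the [batch, L, L] intermediate explicit for inputs of sequence length L:

```python
# Minimal scaled dot-product attention; the O(L^2) cost comes from `scores`.
import tensorflow as tf

def scaled_dot_product_attention(q, k, v):
    """q, k, v: float tensors of shape [batch, L, d]."""
    d = tf.cast(tf.shape(k)[-1], q.dtype)
    scores = tf.matmul(q, k, transpose_b=True) / tf.sqrt(d)  # [batch, L, L]
    weights = tf.nn.softmax(scores, axis=-1)                 # quadratic in L
    return tf.matmul(weights, v)                             # [batch, L, d]
```

Because these are dense matrix multiplications, they map directly onto the systolic arrays of accelerators such as TPUs, which is the trade-off the abstract refers to: a dense VT can train quickly even though its attention is asymptotically more expensive than that of sparse variants.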
Availability of data and materials
The datasets analyzed during the current study are available in the UCI Machine Learning Repository at http://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014 (electricity), and http://archive.ics.uci.edu/ml/datasets/PEMS-SF (traffic). The datasets generated during the current study are available on request from the corresponding author.
Notes
For simplicity, we omit the fact that this is the conditional distribution of the prediction range of time series i given the conditioning range of time series i, as well as the collection’s conditioning and prediction ranges of all other time series. We leave this implicit since the parameters \(\boldsymbol{\Phi}\) are learned jointly from all the time series in the collection.
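Written out under a common convention (our notation, for illustration only): with values \(z_{i,t}\) of series \(i\), a conditioning range \(1, \ldots, t_0 - 1\), and a prediction range \(t_0, \ldots, T\), the distribution the note refers to is

\[ p\left(z_{i,\, t_0:T} \;\middle|\; z_{i,\, 1:t_0-1};\; \boldsymbol{\Phi}\right), \]

with the dependence on the conditioning and prediction ranges of all other series in the collection left implicit, as stated above.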
Since the model applies to all the time series, the subscript i is omitted for simplicity.
Elapsed real time, or wall-clock time, is the actual time taken to complete the training process. Training wall time is reported by TensorBoard and does not include the time spent on evaluation or checkpoint-related operations.
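For illustration only (the names `model`, `train_input_fn`, and `max_steps` are hypothetical; the paper takes its figures from TensorBoard rather than manual timing), elapsed real time can be measured by bracketing the training call:

```python
# Sketch of measuring training wall-clock (elapsed real) time.
import time

start = time.perf_counter()
model.train(input_fn=train_input_fn, max_steps=max_steps)
elapsed_s = time.perf_counter() - start
# Note: unlike the TensorBoard-reported training wall time, this span also
# counts any checkpoint writes performed inside the training call.
print(f"Training wall time: {elapsed_s:.1f} s")
```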
Acknowledgments
This research was supported by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) of Mexico.
Funding
J. Luis García-Nava’s Sc.D. program is supported by a National Scholarship granted by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) under CVU No. 737505. Victor M. Tellez’s Sc.D. program is supported by a National Scholarship granted by the Consejo Nacional de Ciencia y Tecnología (National Council of Science and Technology) under CVU No. 816803.
Author information
Contributions
All authors contributed to the research conceptualization and design. JLG-N, JJF, and VMT collected, reviewed, and prepared data. The software was developed, deployed in the cloud, and executed by JLG-N. Results analysis was performed by JLG-N, JJF, and FC. JLG-N and JJF wrote the first draft of the manuscript. All authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
All authors certify that they have no affiliations with or involvement in any organization or entity with any financial or non-financial interest in the subject matter or materials discussed in this manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
García-Nava, JL., Flores, J.J., Tellez, V.M. et al. Fast training of a transformer for global multi-horizon time series forecasting on tensor processing units. J Supercomput 79, 8475–8498 (2023). https://doi.org/10.1007/s11227-022-05009-x