Abstract
In recent years, the fusion of HPC and AI, i.e., applying AI techniques to traditional HPC applications, has become a new trend. HPC and AI fusion requires support for the multiple numeric precisions used in both domains, yet supporting multiple precisions in a single computing unit is not easy. Prior research typically modifies a high-precision FMA to support low precisions. However, such designs suffer from increased latency and limited compute throughput for low-precision operations, high area and power overhead, and limited support for the new data formats that have appeared in the AI domain. To address these issues, we propose Haica, an architecture that fuses a double-precision FMA with a low-precision systolic array to achieve HPC and AI fusion. Our work has two innovations. First, we propose a low-cost multiple-low-precision FMA that exploits the commonality among FP16, BF16, and TF32. Second, inspired by the idea of splicing high precision out of low precision, we replace the multiply and merge modules of a double-precision FMA with a modified systolic array composed of the proposed low-precision FMAs. We implement the logic design of Haica in RTL and evaluate its overhead. Compared to the naive combination of a double-precision FMA and a single/half mixed-precision systolic array, Haica adds support for BF16 and TF32 with only 7.87% area and 33.26% power overhead, demonstrating that Haica achieves HPC and AI fusion in a cost-effective manner.
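The "commonality among FP16, BF16, and TF32" that the first innovation exploits can be illustrated with the standard bit layouts of these formats. The sketch below is not from the paper; it only tabulates the well-known (sign, exponent, mantissa) widths and computes the datapath widths a single unified low-precision FMA would need to cover all three formats. The helper name `shared_datapath_widths` is illustrative, not part of Haica.

```python
# Standard (sign, exponent, mantissa) bit widths of the formats involved.
FORMATS = {
    "FP16": (1, 5, 10),   # IEEE 754 half precision
    "BF16": (1, 8, 7),    # bfloat16: FP32-sized exponent, truncated mantissa
    "TF32": (1, 8, 10),   # NVIDIA TensorFloat-32 (19 bits used)
    "FP64": (1, 11, 52),  # IEEE 754 double precision
}

def shared_datapath_widths(names):
    """Widths one low-precision unit must provide to cover all given formats."""
    exps = [FORMATS[n][1] for n in names]
    mans = [FORMATS[n][2] for n in names]
    # The significand is the mantissa plus the implicit leading 1.
    return {"exponent_bits": max(exps), "significand_bits": max(mans) + 1}

print(shared_datapath_widths(["FP16", "BF16", "TF32"]))
# -> {'exponent_bits': 8, 'significand_bits': 11}
```

Since BF16 and TF32 share an 8-bit exponent while FP16 and TF32 share a 10-bit mantissa, an 8-bit-exponent, 11-bit-significand datapath subsumes all three formats, which is why a single multi-format FMA can be cheap relative to three separate units.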
Copyright information
© 2023 Springer Nature Switzerland AG
Cite this paper
Chen, Z., Zheng, F., Guo, F., Yu, Q., Chen, Z. (2023). Haica: A High Performance Computing & Artificial Intelligence Fused Computing Architecture. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_13
DOI: https://doi.org/10.1007/978-3-031-22677-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9
eBook Packages: Computer Science (R0)