Abstract
In recent years, the fusion of HPC and AI, i.e., applying AI techniques to traditional HPC applications, has become a new trend. HPC and AI fusion requires support for the multiple numeric precisions used in both domains, yet supporting multiple precisions in a single computing unit is not easy. Prior research typically modifies a high-precision FMA to support low precisions. However, such designs suffer from increased latency and limited compute throughput for low-precision operations, high area and power overhead, and limited support for the new data formats that have appeared in the AI domain. To address these issues, we propose Haica, an architecture that fuses a double-precision FMA with a low-precision systolic array to achieve HPC and AI fusion. Our work has two innovations. First, we propose a low-cost multiple-low-precision FMA that exploits the commonality among FP16, BF16, and TF32. Second, inspired by the idea of splicing high precision out of low precision, we replace the multiply and merge modules of a double-precision FMA with a modified systolic array composed of the proposed low-precision FMAs. We implement the logic design of Haica in RTL and evaluate its overhead. Compared to the naive combination of a double-precision FMA and a single/half mixed-precision systolic array, Haica adds support for BF16 and TF32 with only 7.87% area and 33.26% power overhead, demonstrating that Haica achieves HPC and AI fusion in a cost-effective manner.
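The "commonality among FP16, BF16, and TF32" that the first innovation exploits can be illustrated with the standard bit layouts of these formats. The sketch below is not from the paper; it only tabulates the well-known (sign, exponent, mantissa) widths and computes the datapath widths a single unified low-precision FMA would need to cover all three formats. The helper name `shared_datapath_widths` is illustrative, not part of Haica.

```python
# Standard (sign, exponent, mantissa) bit widths of the formats involved.
FORMATS = {
    "FP16": (1, 5, 10),   # IEEE 754 half precision
    "BF16": (1, 8, 7),    # bfloat16: FP32-sized exponent, truncated mantissa
    "TF32": (1, 8, 10),   # NVIDIA TensorFloat-32 (19 bits used)
    "FP64": (1, 11, 52),  # IEEE 754 double precision
}

def shared_datapath_widths(names):
    """Widths one low-precision unit must provide to cover all given formats."""
    exps = [FORMATS[n][1] for n in names]
    mans = [FORMATS[n][2] for n in names]
    # The significand is the mantissa plus the implicit leading 1.
    return {"exponent_bits": max(exps), "significand_bits": max(mans) + 1}

print(shared_datapath_widths(["FP16", "BF16", "TF32"]))
# -> {'exponent_bits': 8, 'significand_bits': 11}
```

Since BF16 and TF32 share an 8-bit exponent while FP16 and TF32 share a 10-bit mantissa, an 8-bit-exponent, 11-bit-significand datapath subsumes all three formats, which is why a single multi-format FMA can be cheap relative to three separate units.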
Copyright information
© 2023 Springer Nature Switzerland AG
Cite this paper
Chen, Z., Zheng, F., Guo, F., Yu, Q., Chen, Z. (2023). Haica: A High Performance Computing & Artificial Intelligence Fused Computing Architecture. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_13
DOI: https://doi.org/10.1007/978-3-031-22677-9_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9
eBook Packages: Computer Science (R0)