Performance evaluation of convolutional neural network on Tianhe-3 prototype

Chen, Weiduo; Dong, Xiaoshe; Chen, Heng; Wang, Qiang; Yu, Xingda; Zhang, Xingjun

doi:10.1007/s11227-021-03759-8

Published: 12 April 2021

Volume 77, pages 12647–12665, (2021)

The Journal of Supercomputing

Abstract

Exascale supercomputers are expected to help meet the growing computational demands of convolutional neural networks (CNNs). The prototype cluster of the Tianhe-3 supercomputer, built on two Chinese-made many-core processors, the Phytium-2000+ (FTP) and the Matrix-2000+ (MTP), is now in service. We evaluated CNN training performance on this prototype. To assess single-node performance, we tested image convolution and matrix multiplication on the FTP and MTP; to assess the scalability of distributed training on the prototype cluster, we tested the Allreduce operation. We also used the Roofline model to qualitatively analyze the performance bottlenecks of CNNs on the FTP and MTP processors, and we offer optimization suggestions for improving CNN performance on the Tianhe-3 prototype.
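For readers unfamiliar with the Roofline model mentioned in the abstract, the following is a minimal sketch of how it bounds kernel performance: attainable throughput is the smaller of peak compute and memory bandwidth multiplied by arithmetic intensity. The function names and the peak and bandwidth figures are hypothetical placeholders chosen for illustration; they are not measurements of the FTP or MTP and are not taken from the article.

```python
def attainable_gflops(peak_gflops, mem_bw_gbs, intensity):
    """Roofline bound: min(peak compute, bandwidth * arithmetic intensity)."""
    return min(peak_gflops, mem_bw_gbs * intensity)


def gemm_intensity(m, n, k, bytes_per_elem=4):
    """Arithmetic intensity (FLOP/byte) of C += A @ B, assuming each matrix
    is moved to or from memory exactly once (an idealized traffic model)."""
    flops = 2.0 * m * n * k
    # Read A (m*k), read B (k*n), read and write C (2*m*n).
    bytes_moved = bytes_per_elem * (m * k + k * n + 2 * m * n)
    return flops / bytes_moved


# Hypothetical machine parameters, NOT figures for the Tianhe-3 prototype.
PEAK_GFLOPS = 500.0
MEM_BW_GBS = 100.0

if __name__ == "__main__":
    ridge = PEAK_GFLOPS / MEM_BW_GBS  # intensity where the two roofs meet
    for size in (16, 64, 256, 1024):
        ai = gemm_intensity(size, size, size)
        bound = attainable_gflops(PEAK_GFLOPS, MEM_BW_GBS, ai)
        kind = "memory-bound" if ai < ridge else "compute-bound"
        print(f"n={size}: intensity={ai:.2f} FLOP/byte, bound={bound:.1f} GFLOP/s ({kind})")
```

A kernel whose intensity falls to the left of the ridge point is limited by memory bandwidth, so optimizations that raise data reuse (blocking, fusion) help more than additional vectorization; to the right of it, compute throughput is the limit.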

Acknowledgements

We thank the National Supercomputer Center in Tianjin for the opportunity to evaluate the Tianhe-3 prototype. This work is supported by the National Key R&D Program of China (Grant No. 2016YFB0200902).

Author information

Authors and Affiliations

School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, 710049, China
Weiduo Chen, Xiaoshe Dong, Heng Chen, Qiang Wang, Xingda Yu & Xingjun Zhang

Corresponding author

Correspondence to Xiaoshe Dong.

About this article

Cite this article

Chen, W., Dong, X., Chen, H. et al. Performance evaluation of convolutional neural network on Tianhe-3 prototype. J Supercomput 77, 12647–12665 (2021). https://doi.org/10.1007/s11227-021-03759-8

Accepted: 18 March 2021
Published: 12 April 2021
Issue Date: November 2021
DOI: https://doi.org/10.1007/s11227-021-03759-8
