Applying CNN on a scientific application accelerator based on dataflow architecture

  • Regular Paper
  • Published in CCF Transactions on High Performance Computing

Abstract

Convolutional neural networks (CNNs) are widely used in applications such as face recognition, intelligent monitoring, image recognition, and text recognition. Because of their high computational complexity, many efficient hardware accelerators have been proposed to exploit the high degree of parallelism in CNNs. However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption, while general-purpose accelerators such as GPUs consume more power. Fine-grained dataflow architectures, which break from the conventional Von Neumann model, show natural advantages in processing scientific applications. Meanwhile, CNNs share many vital characteristics with scientific applications, including high parallelism, simple loop structures, and regular memory access patterns. In this paper, we propose a scheme for implementing and optimizing CNNs on a fine-grained dataflow architecture designed for scientific applications, namely the Scientific Processing Unit (SPU). The experimental results show that with our scheme, AlexNet and VGG-19 run on average \(2.29\,\times\) faster on SPU than on an NVIDIA Titan Xp, while consuming on average \(5.76\,\times\) less energy.
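
To make the claimed similarity with scientific codes concrete, here is a minimal sketch (our illustration, not the paper's implementation; the dimension names K, C, H, W, R, S are conventional assumptions rather than the authors' notation) of the direct-convolution loop nest that dominates CNN inference. It exhibits exactly the three properties the abstract names: the output iterations are mutually independent (high parallelism), every loop is a simple counted loop, and all array subscripts are affine in the loop indices (regular memory access).

```c
/* Minimal sketch of a direct 2D convolution layer (unit stride, no padding).
 * Illustrative only, not the paper's code: K = output channels,
 * C = input channels, H x W = input size, R x S = kernel size. */
#include <stddef.h>

void conv2d(size_t K, size_t C, size_t H, size_t W, size_t R, size_t S,
            const float in[C][H][W],     /* input feature maps  */
            const float wgt[K][C][R][S], /* convolution kernels */
            float out[K][H - R + 1][W - S + 1])
{
    size_t OH = H - R + 1, OW = W - S + 1;   /* output feature-map size */
    for (size_t k = 0; k < K; k++)               /* output channel */
        for (size_t oh = 0; oh < OH; oh++)       /* output row     */
            for (size_t ow = 0; ow < OW; ow++) { /* output column  */
                float acc = 0.0f;
                /* Independent multiply-accumulate reduction per output
                 * element: affine indices, no data-dependent control flow. */
                for (size_t c = 0; c < C; c++)
                    for (size_t r = 0; r < R; r++)
                        for (size_t s = 0; s < S; s++)
                            acc += in[c][oh + r][ow + s] * wgt[k][c][r][s];
                out[k][oh][ow] = acc;
            }
}
```

Because each out[k][oh][ow] depends only on read-only inputs, a dataflow machine can, in broad terms, map the multiply-accumulate body to a small instruction graph and stream independent output iterations through its processing-element array without synchronization; this general property is what makes a scientific-application accelerator a plausible target for CNNs.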


Acknowledgements

This work was supported by the National Key Research and Development Plan of China under Grant no. 2017YFC0803401, the National Natural Science Foundation of China under Grant nos. 61872335 and 61732018, and the International Partnership Program of the Chinese Academy of Sciences under Grant no. 171111KYSB20170032.

Author information

Correspondence to Dongrui Fan.

About this article

Cite this article

Ye, X., Xiang, T., Tan, X. et al. Applying CNN on a scientific application accelerator based on dataflow architecture. CCF Trans. HPC 1, 177–195 (2019). https://doi.org/10.1007/s42514-019-00015-7
