Applying CNN on a scientific application accelerator based on dataflow architecture

  • Regular Paper
  • Published in CCF Transactions on High Performance Computing

Abstract

Convolutional neural networks (CNNs) are widely used in applications such as face recognition, intelligent monitoring, image recognition, and text recognition. Because of their high computational complexity, many efficient hardware accelerators have been proposed to exploit the high degree of parallelism in CNNs. However, accelerators implemented on FPGAs and ASICs usually sacrifice generality for higher performance and lower power consumption, while general-purpose accelerators such as GPUs consume more power. Fine-grained dataflow architectures, which break from the conventional Von Neumann model, show natural advantages in processing scientific applications. Meanwhile, CNNs share many vital characteristics with scientific applications, including high parallelism, simple loop structures, and regular memory access patterns. In this paper, we propose a scheme for implementing and optimizing CNNs on a fine-grained dataflow architecture designed for scientific applications, namely the Scientific Processing Unit (SPU). The experimental results show that with our scheme, AlexNet and VGG-19 run on average \(2.29\,\times\) faster on SPU than on an NVIDIA Titan Xp, while consuming on average \(5.76\,\times\) less energy.
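
To make the claimed similarity with scientific codes concrete, here is a minimal sketch (our illustration, not the paper's implementation; the dimension names K, C, H, W, R, S are conventional assumptions rather than the authors' notation) of the direct-convolution loop nest that dominates CNN inference. It exhibits exactly the three properties the abstract names: the output iterations are mutually independent (high parallelism), every loop is a simple counted loop, and all array subscripts are affine in the loop indices (regular memory access).

```c
/* Minimal sketch of a direct 2D convolution layer (unit stride, no padding).
 * Illustrative only, not the paper's code: K = output channels,
 * C = input channels, H x W = input size, R x S = kernel size. */
#include <stddef.h>

void conv2d(size_t K, size_t C, size_t H, size_t W, size_t R, size_t S,
            const float in[C][H][W],     /* input feature maps  */
            const float wgt[K][C][R][S], /* convolution kernels */
            float out[K][H - R + 1][W - S + 1])
{
    size_t OH = H - R + 1, OW = W - S + 1;   /* output feature-map size */
    for (size_t k = 0; k < K; k++)               /* output channel */
        for (size_t oh = 0; oh < OH; oh++)       /* output row     */
            for (size_t ow = 0; ow < OW; ow++) { /* output column  */
                float acc = 0.0f;
                /* Independent multiply-accumulate reduction per output
                 * element: affine indices, no data-dependent control flow. */
                for (size_t c = 0; c < C; c++)
                    for (size_t r = 0; r < R; r++)
                        for (size_t s = 0; s < S; s++)
                            acc += in[c][oh + r][ow + s] * wgt[k][c][r][s];
                out[k][oh][ow] = acc;
            }
}
```

Because each out[k][oh][ow] depends only on read-only inputs, a dataflow machine can, in broad terms, map the multiply-accumulate body to a small instruction graph and stream independent output iterations through its processing-element array without synchronization; this general property is what makes a scientific-application accelerator a plausible target for CNNs.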


Acknowledgements

This work was supported by the National Key Research and Development Plan of China under Grant no. 2017YFC0803401, the National Natural Science Foundation of China under Grant nos. 61872335 and 61732018, and the International Partnership Program of the Chinese Academy of Sciences under Grant no. 171111KYSB20170032.

Author information

Correspondence to Dongrui Fan.

About this article

Cite this article

Ye, X., Xiang, T., Tan, X. et al. Applying CNN on a scientific application accelerator based on dataflow architecture. CCF Trans. HPC 1, 177–195 (2019). https://doi.org/10.1007/s42514-019-00015-7
