Compiling Optimization for Neural Network Accelerators

  • Jin Song
  • Yimin Zhuang
  • Xiaobing Chen
  • Tian Zhi
  • Shaoli Liu
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11719)


Artificial neural networks are now among the most widely used computational models in intelligent methods. To cope with the ever-growing scale of neural networks and with constraints on system energy consumption, a number of neural network (NN) accelerators have been proposed. However, owing to their dedicated architectures, programming NN accelerators differs from programming general-purpose processors. To improve performance, the compiler needs to exploit the global structural information of the NN model. In this paper, we introduce a series of layer-based compile optimizations for NN accelerators. From top to bottom, we first define a type of computational graph that carries the necessary information, such as the relationships between layer nodes and data nodes. Then, following the pattern of an NN layer's computation process, we apply intra-layer loop unrolling and pipelining at two levels, fine-grained and coarse-grained. Similarly, we apply layer fusion optimization based on our computational graph and abstract pipelining stages. After expanding the pipelining stages of the layers, we can remove some redundant IO operations, which we call layer elimination optimization. The experimental results show that, with our proposed optimizations, the inference process achieves up to a 1.34x speedup over the version without fusion optimization.
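The idea of a graph of layer nodes and data nodes, and of removing redundant IO once layers are fused, can be illustrated with a minimal sketch. It is not the authors' implementation: the names `DataNode`, `LayerNode`, `io_ops` and `fuse_chain` are hypothetical, weights are ignored, and on-chip memory is assumed large enough to hold every intermediate tensor.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    """A tensor in the graph (hypothetical structure)."""
    name: str
    on_chip: bool = False  # True if the tensor never leaves on-chip memory


@dataclass
class LayerNode:
    """A layer with references to the data nodes it reads and writes."""
    name: str
    inputs: list
    outputs: list


def io_ops(layers):
    """Count off-chip loads and stores over all layers."""
    loads = sum(1 for l in layers for d in l.inputs if not d.on_chip)
    stores = sum(1 for l in layers for d in l.outputs if not d.on_chip)
    return loads + stores


def fuse_chain(layers):
    """Keep on chip every tensor produced by one layer and read by exactly one layer.

    Assumption: such a tensor fits in on-chip memory, so its store and
    reload are redundant and can be dropped.
    """
    uses = {}
    for l in layers:
        for d in l.inputs:
            uses[d.name] = uses.get(d.name, 0) + 1
    for l in layers:
        for d in l.outputs:
            if uses.get(d.name, 0) == 1:
                d.on_chip = True


# A three-layer chain: x -> conv -> a -> relu -> b -> pool -> y
x, a, b, y = (DataNode(n) for n in "xaby")
net = [
    LayerNode("conv", [x], [a]),
    LayerNode("relu", [a], [b]),
    LayerNode("pool", [b], [y]),
]

before = io_ops(net)  # every layer loads its input and stores its output
fuse_chain(net)
after = io_ops(net)   # only the network input and output touch off-chip memory
print(before, after)
```

In this toy chain the two intermediate tensors stay on chip after fusion, so only the load of `x` and the store of `y` remain. A real compiler would also have to check buffer capacity and tensors with several consumers, which this sketch deliberately leaves out.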


Neural network accelerator, Compile optimization, Layer fusion



This work is partially supported by the National Key Research and Development Program of China (under Grant 2017YFB1003104), the NSF of China (under Grants 61432016, 61532016, 61672491, 61602441, 61602446, 61732002, 61702478, 61732007 and 61732020), Beijing Natural Science Foundation (JQ18013), the 973 Program of China (under Grant 2015CB358800), National Science and Technology Major Project (2018ZX01031102), the Transformation and Transfer of Scientific and Technological Achievements of Chinese Academy of Sciences (KFJ-HGZX-013), Key Research Projects in Frontier Science of Chinese Academy of Sciences (QYZDB-SSW-JSC001), Strategic Priority Research Program of Chinese Academy of Science (XDB32050200, XDC01020000) and Standardization Research Project of Chinese Academy of Sciences (BZ201800001).



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. University of Chinese Academy of Sciences, Beijing, China
  2. SKL of Computer Architecture, Institute of Computing Technology, CAS, Beijing, China
  3. Cambricon Tech. Ltd., Beijing, China
