Compiler-assisted Operator Template Library for DNN Accelerators

Published: 25 March 2021

International Journal of Parallel Programming, Volume 49, pages 628–645 (2021)

Jiansong Li (ORCID: orcid.org/0000-0002-2924-5189)^1,2,
Wei Cao^1,
Xiao Dong^1,2,
Guangli Li^1,2,
Xueying Wang^1,2,
Peng Zhao^1,2,
Lei Liu^1 &
Xiaobing Feng^1,2

Abstract

Although dedicated accelerators are gaining popularity in the deep neural network (DNN) domain for their performance and energy efficiency, high-level programming support for these accelerators remains thin. In contrast to existing research that targets whole DNNs, we examine this problem at a finer-grained level: operators. Due to performance concerns, operator programmers may have to take hand-written assembly as their first choice, which is error-prone and involves many programming chores. To alleviate this problem, we propose TOpLib, a compiler-assisted template library. By providing a unified user-view abstraction, TOpLib allows programmers to express computational kernels with high-level tensor primitives, which are automatically lowered into low-level intrinsic primitives via expression templates. Moreover, because memory management is performance-critical and the optimization strategy of expression templates is limited to enumeration-based rewriting rules, we implement TOpLib with a compiler-assisted approach. We address the memory reuse challenges in the compiler, which allows TOpLib to make full use of on-chip buffers and results in better performance. Experiments over 55 typical DNN operators demonstrate that TOpLib can generate scalable code with performance faster than or on par with hand-written assembly versions.
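
To make the expression-template idea in the abstract concrete, the following is a minimal C++ sketch of the general technique, not TOpLib's actual API. All names here (`Tensor`, `Expr`, `BinaryExpr`, `intrin_add`, `intrin_mul`) are hypothetical, and the two `intrin_*` functions are plain scalar stand-ins for whatever intrinsic primitives a real accelerator would expose. The point it illustrates is that a high-level tensor statement builds a typed expression tree at compile time, and the assignment lowers that tree into a single loop over low-level calls without intermediate tensors.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-ins for low-level intrinsic primitives. A real
// accelerator would provide its own instructions; these are assumptions.
inline float intrin_add(float x, float y) { return x + y; }
inline float intrin_mul(float x, float y) { return x * y; }

// CRTP base: every node of an expression tree derives from Expr<Self>.
template <typename E>
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

// Operation tags that map a high-level operator to one intrinsic call.
struct AddOp { static float apply(float x, float y) { return intrin_add(x, y); } };
struct MulOp { static float apply(float x, float y) { return intrin_mul(x, y); } };

// Binary node: holds references to its operands and computes nothing until
// eval() is called. The references are only valid within the full
// expression that created the node, so nodes should not be stored.
template <typename Op, typename L, typename R>
struct BinaryExpr : Expr<BinaryExpr<Op, L, R>> {
    const L& lhs;
    const R& rhs;
    BinaryExpr(const L& l, const R& r) : lhs(l), rhs(r) {}
    float eval(std::size_t i) const { return Op::apply(lhs.eval(i), rhs.eval(i)); }
    std::size_t size() const { return lhs.size(); }
};

// Leaf node: a 1-D tensor that owns its data.
struct Tensor : Expr<Tensor> {
    std::vector<float> data;
    explicit Tensor(std::size_t n, float v = 0.0f) : data(n, v) {}
    float eval(std::size_t i) const { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression walks the tree once per element, so the whole
    // right-hand side becomes one fused loop with no temporary tensors.
    template <typename E>
    Tensor& operator=(const Expr<E>& e) {
        const E& src = e.self();
        assert(src.size() == data.size());
        for (std::size_t i = 0; i < data.size(); ++i) data[i] = src.eval(i);
        return *this;
    }
};

// High-level tensor primitives: each operator only builds a tree node.
template <typename L, typename R>
BinaryExpr<AddOp, L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return BinaryExpr<AddOp, L, R>(l.self(), r.self());
}

template <typename L, typename R>
BinaryExpr<MulOp, L, R> operator*(const Expr<L>& l, const Expr<R>& r) {
    return BinaryExpr<MulOp, L, R>(l.self(), r.self());
}
```

With these definitions, a statement such as `c = a + b * d;` on four equally sized `Tensor` objects compiles into one loop that calls `intrin_mul` and then `intrin_add` per element. Any rewriting in this style is fixed by the template rules at compile time, which is consistent with the limitation the abstract cites as the motivation for moving memory reuse decisions into the compiler.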

Notes

Due to space limits, the detailed data scales are listed in the GitHub repository: https://github.com/anonymous-0x00/npc20-benchmarks.

Acknowledgements

We would like to thank the anonymous reviewers for their comments and valuable feedback. This work is supported by the National Key R&D Program of China (under Grant No. 2017YFB1003103) and the Science Fund for Creative Research Groups of the National Natural Science Foundation of China (under Grant No. 61521092).

Author information

Authors and Affiliations

Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Jiansong Li, Wei Cao, Xiao Dong, Guangli Li, Xueying Wang, Peng Zhao, Lei Liu & Xiaobing Feng
University of Chinese Academy of Sciences, Beijing, China
Jiansong Li, Xiao Dong, Guangli Li, Xueying Wang, Peng Zhao & Xiaobing Feng

Corresponding author

Correspondence to Jiansong Li.

About this article

Cite this article

Li, J., Cao, W., Dong, X. et al. Compiler-assisted Operator Template Library for DNN Accelerators. Int J Parallel Prog 49, 628–645 (2021). https://doi.org/10.1007/s10766-021-00701-6

Received: 03 November 2020
Accepted: 01 March 2021
Published: 25 March 2021
Issue Date: October 2021
DOI: https://doi.org/10.1007/s10766-021-00701-6
