Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator

Abstract

Specialized hardware accelerators for deep learning are being introduced by many hardware vendors because of their high performance and efficiency. However, different vendors adopt different accelerator architectures, making it challenging for a compiler tool-chain to generate and optimize high-performance code. Moreover, the current tool-chains provided by the vendors are either highly abstract, which makes them hard to optimize, or contain too many hardware-related details, which makes them inconvenient to program. In this paper, we therefore propose a middle-layer compiler tool-chain for the Cambricon MLU-100 that fills the gap between the high-level runtime library and the low-level operator SDK. Our tool-chain is built on the operator-level SDK but abstracts away its redundant initialization and allocation statements. We also expose the interfaces of the major optimization knobs that the existing runtime hides, thus enabling a considerable optimization space. We evaluate our work on several state-of-the-art neural networks, using lines of code and available optimization knobs as evaluation metrics. We also compare the performance against the state-of-the-art tool-chain TensorRT under a simple optimization strategy and find that our work has great potential for optimization. Our tool-chain guarantees the user a vast optimization space with only around \( 20\% \) of the code, hiding the redundant initialization and allocation statements from users.

Introduction

Deep learning accelerator

With the evolution of computing power, computation-intensive deep learning has been increasingly applied in key application domains, including computer vision and natural language processing. Conventional general-purpose processors such as CPUs and GPUs can hardly meet the growing demand for computation power. On the other hand, the computation patterns in deep learning are good candidates for hardware specialization. A deep neural network contains only a few kinds of patterns, including convolution, pooling, activation, batch normalization, and fully connected layers. These computations are mostly linear, combined with linear transformations, matrix decompositions, etc. General-purpose CPUs that adopt deep and complex pipelines are highly inefficient in this scenario. Since linear computation operates on a huge amount of data, the optimal memory hierarchy is also different. Consequently, a growing number of vendors are releasing their own specialized accelerators (Jouppi et al. 2017; Zhang et al. 2016; Marchisio et al. 2019). These accelerators offer superior performance and energy efficiency on deep learning tasks, and they have simpler and more diverse architectures than general-purpose processors, as well as different memory subsystems. Meanwhile, in addition to these specially designed accelerators, more researchers are focusing on accelerator architectures with better universality across scalar, vector, matrix, and tensor computation instead of focusing only on convolution (Guo et al. 2020), which also brings challenges to compiler design.

Compiler tool-chain

The significant differences in architecture and memory hierarchy lead to great challenges in code scheduling and generation, instruction selection, and memory access on specialized hardware. Currently, most compiler tool-chains are designed for CPUs/GPUs, making it difficult to achieve the maximal efficiency of the specialized hardware. Although researchers have tried to extend existing tool-chains for better task scheduling on heterogeneous architectures, including Laius (Zhang et al. 2019) and Ebird (Cui et al. 2019), the overhead of these methods is higher than that of compiler-level optimization. On the other hand, existing neural network accelerators are based on different architectures, ranging from systolic arrays (Quinton 1994) to dot-product units (Chen et al. 2014, 2014; Liu et al. 2015), bringing significant challenges to compiler tool-chain design. Currently, hardware vendors provide their own compiler tool-chains, including TensorRT (NVIDIA Corp 2020) from NVIDIA Corp. and Cambricon Neuware (Cambricon Technologies 2019b) from Cambricon Technologies. However, how to deal with diverse hardware architectures remains an open question. Moreover, when it comes to the optimization of deep neural networks, very few vendors release their source code and algorithms to users, making the optimization process a black box. Even when a vendor provides a low-level SDK for its hardware, such tools are highly hardware-specific, making them hard to use for users with little hardware knowledge.

Optimization space

The challenges in compiler tool-chain design are also reflected in the optimization of neural network tasks. A deep neural network application has two stages, training and inference, whose optimization methodologies differ as well. In this research, we mainly focus on the inference process, where the goals are higher throughput (FPS) and lower power consumption. To satisfy the demand for throughput, researchers have proposed several optimization knobs, including sparsity, precision-accuracy trade-offs, and operator fusion/concatenation. We briefly introduce these factors below.

For sparsity: since the introduction of AlexNet (Krizhevsky et al. 2012), networks have applied the DropOut technique to eliminate inactive neurons and simplify the network structure. This technique aims to relieve the accuracy drop caused by over-fitting, and sparsity is its side product. The sparsity of a network significantly reduces the demands on data transfer and computation power; some sparse versions of state-of-the-art neural networks require only 10%–20% of the operations of the original version. The sparsity mentioned above is called static sparsity; most of it lies in the weight/bias matrices and can be determined before the network runs. Another kind is dynamic sparsity (Zhou et al. 2018), introduced by the wide use of the ReLU activation function, which generates many zeros at run time; the positions of the zeros depend on each specific input and cannot be determined in advance. Sparsity brings the problem of irregularity: the accelerator has difficulty determining the exact positions of zeros to skip. If not handled well, finding and skipping the zeros may even introduce extra overhead and result in an overall performance drop. For a compiler tool-chain, how to deal with the irregularity brought by sparsity remains unsolved.
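
To make the savings and the irregularity of static sparsity concrete, the following minimal C++ sketch stores a sparse weight matrix in CSR (compressed sparse row) form and runs a sparse matrix-vector product. It is a generic illustration, not the sparse format of any particular accelerator; the gathered accesses through colIdx are exactly the irregularity discussed above.

```cpp
#include <vector>

// CSR (compressed sparse row) storage: only non-zero weights and their column
// indices are kept, so a layer that retains 10-20% of its weights needs
// roughly 10-20% of the storage and multiply-accumulates (plus index
// bookkeeping, which is where the irregularity comes from).
struct CsrMatrix {
    int rows = 0, cols = 0;
    std::vector<float> values;  // non-zero weights
    std::vector<int>   colIdx;  // column index of each non-zero
    std::vector<int>   rowPtr;  // offset of each row's first entry in values/colIdx
};

CsrMatrix compress(const std::vector<std::vector<float>>& dense) {
    CsrMatrix m;
    m.rows = static_cast<int>(dense.size());
    m.cols = m.rows ? static_cast<int>(dense[0].size()) : 0;
    m.rowPtr.push_back(0);
    for (const auto& row : dense) {
        for (int j = 0; j < m.cols; ++j)
            if (row[j] != 0.0f) { m.values.push_back(row[j]); m.colIdx.push_back(j); }
        m.rowPtr.push_back(static_cast<int>(m.values.size()));
    }
    return m;
}

// Sparse matrix-vector product: the inner loop visits only non-zeros, but its
// reads of x are irregular (gathered through colIdx), which is hard to skip
// efficiently on accelerators.
std::vector<float> spmv(const CsrMatrix& m, const std::vector<float>& x) {
    std::vector<float> y(m.rows, 0.0f);
    for (int i = 0; i < m.rows; ++i)
        for (int k = m.rowPtr[i]; k < m.rowPtr[i + 1]; ++k)
            y[i] += m.values[k] * x[m.colIdx[k]];
    return y;
}
```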

For accuracy: the training process involves many non-linear computations related to gradients. Since gradients may be very small or very large during training, low precision may cause a great decrease in classification accuracy. In the inference process, however, all weights/biases are fixed, so the accuracy drop brought by properly reducing precision may be acceptable, often only 1% or 2%. There are several methods to decrease precision, including simply casting FP32 to FP16 and quantizing to INT8/INT4 (Dong et al. 2019). The problem is how to determine the precision that will not cause an unacceptable accuracy drop, which is a trade-off between precision and final classification accuracy.
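
As a small, generic illustration of precision cutting (not the quantization scheme of any particular vendor tool-chain), the sketch below performs symmetric per-tensor INT8 quantization with a single scale derived from the largest weight magnitude; the dequantization step makes the introduced rounding error, and hence the accuracy trade-off, measurable.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Quantized tensor with one scale per tensor: real_value ~ scale * int8_value.
struct QuantizedTensor {
    std::vector<int8_t> data;
    float scale = 1.0f;
};

// Symmetric INT8 quantization: scale = max|w| / 127, values rounded and clamped.
QuantizedTensor quantize(const std::vector<float>& w) {
    float maxAbs = 0.0f;
    for (float v : w) maxAbs = std::max(maxAbs, std::fabs(v));
    QuantizedTensor q;
    q.scale = (maxAbs > 0.0f) ? maxAbs / 127.0f : 1.0f;
    q.data.reserve(w.size());
    for (float v : w) {
        const int r = static_cast<int>(std::lround(v / q.scale));
        q.data.push_back(static_cast<int8_t>(std::clamp(r, -127, 127)));
    }
    return q;
}

// Dequantization recovers an approximation of the original weight; the gap
// between the two is the precision loss that must stay small enough to keep
// the final classification accuracy acceptable.
float dequantize(const QuantizedTensor& q, std::size_t i) {
    return q.scale * static_cast<float>(q.data[i]);
}
```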

Operator fusion and concatenation are widely used in the compiler tool-chains provided by various vendors (NVIDIA Corp 2020; Cambricon Technologies 2019b). Given two layers with a data dependency, fusion omits the intermediate output of the prior layer, thus reducing the data movement of intermediate results (Wang et al. 2010; Filipovic 2015). However, the absence of intermediate results inevitably introduces redundant computation; for example, the overlapping area of a 2D sliding-window convolution will be computed multiple times (Ragan-Kelley et al. 2013; Alwani et al. 2016). So operator fusion is a trade-off between reduced memory access and redundant computation. Operator concatenation, on the other hand, is an optimization between two layers/kernels without a data dependency. Through concatenation, the overhead of kernel launches can be reduced considerably, and the computational intensity can be increased as well.
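
A minimal 1D sketch of the memory-traffic side of this trade-off (all names and shapes here are ours, chosen only for brevity): fusing an activation into the preceding convolution removes the intermediate buffer entirely, while fusing two convolutions would additionally introduce the redundant halo computation mentioned above.

```cpp
#include <algorithm>
#include <vector>

// Unfused: the convolution result is materialized in an intermediate buffer
// and then re-read by the activation - two passes over the intermediate tensor.
std::vector<float> convThenRelu(const std::vector<float>& in,
                                const std::vector<float>& kernel) {
    const int k = static_cast<int>(kernel.size());
    const int outLen = static_cast<int>(in.size()) - k + 1;
    std::vector<float> tmp(outLen, 0.0f);  // intermediate result written to memory
    for (int i = 0; i < outLen; ++i)
        for (int j = 0; j < k; ++j) tmp[i] += in[i + j] * kernel[j];
    std::vector<float> out(outLen);
    for (int i = 0; i < outLen; ++i) out[i] = std::max(tmp[i], 0.0f);
    return out;
}

// Fused: the activation is applied while the convolution output is still in a
// register, so the intermediate tensor never touches memory.
std::vector<float> fusedConvRelu(const std::vector<float>& in,
                                 const std::vector<float>& kernel) {
    const int k = static_cast<int>(kernel.size());
    const int outLen = static_cast<int>(in.size()) - k + 1;
    std::vector<float> out(outLen);
    for (int i = 0; i < outLen; ++i) {
        float acc = 0.0f;
        for (int j = 0; j < k; ++j) acc += in[i + j] * kernel[j];
        out[i] = std::max(acc, 0.0f);  // ReLU fused into the convolution loop
    }
    return out;
}
```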

Contributions

In this work, we survey current specialized deep learning accelerators and their compiler tool-chains. We find that it is currently difficult for users to customize their inference sessions to achieve better performance on target platforms: vendors either abstract away all the details, leaving nothing for users to study, or expose too many redundant operations, making it inconvenient to code and optimize. Therefore, choosing the Cambricon MLU-100 as our target platform, we introduce another abstraction layer that hides redundant operations such as initialization and memory allocation but exposes the optimization interfaces. Our abstraction layer exposes a huge optimization space with only 20% of the code, and it has a negligible influence on performance compared to the original tool-chain.

How to program deep learning applications

In this section, we introduce the current frameworks for deploying a deep learning service on various platforms. It should be noted that what programmers do in deep-learning training and deep-learning inference is quite different, and we mainly focus on inference.

Training vs inference

Training in supervised machine learning means adjusting a human-constructed mathematical model by feeding it pairs of inputs and correct outputs, which enables the model to match the underlying distribution as closely as possible. In detail, the initialized model calculates the output for the current input, compares it with the correct output, and adjusts the model with mathematical methods. Inference means calculating the output for an unseen input given the trained model. Obviously, the training process includes the inference process, but their computational characteristics differ significantly: most of the computation in inference is linear, whereas training additionally involves a huge amount of non-linear computation related to gradients.

Moreover, the programming models of training and inference are quite different. When programming a training session, programmers mainly care about how to construct a model with higher accuracy, which means they may adjust the structure of the model repeatedly, introduce new types of computation, change hyper-parameter settings, etc. They care less about model optimization, execution on the target hardware, and actual deployment; they can choose whatever system offers enough computational power to train the network. So they need a programming model with a high-level hardware abstraction that makes it convenient to invoke operators and adjust parameter settings and, if possible, is flexible enough to support newly introduced operators. When programming an inference session, however, the structure of the model is fixed, programmers are mostly not allowed to adjust the model, and the target platform is generally fixed in advance. The programmers therefore care about the actual execution time, power consumption, throughput, stability, etc., on the actual hardware, which means the optimization is highly hardware-related, as shown in Fig. 1. They need a programming model that exposes enough hardware specifications and characteristics, in other words, enough optimization space, so that they can optimize the trained model for the specific hardware and reach peak speed and stability on the target platform.

Fig. 1 Training vs inference (Copeland 2020)

In this paper, we mainly focus on the technologies in the inference stage.

Problem of network format

Currently, there are many deep learning frameworks for training and inference, including TensorFlow, PyTorch, MXNet, Caffe (Jain et al. 2019; MXNet 2020; Jia et al. 2014), etc. For training, researchers choose frameworks that fit their coding habits, and different frameworks produce network files in different formats, as listed in Table 1.

Table 1 Difference of frameworks

However, things get complicated in the inference (deployment) stage. For example, Python-based frameworks such as TensorFlow and PyTorch are quite convenient for coding but inferior in execution speed, while C++-based frameworks like Caffe execute fast but are hard to configure and code. This creates a large gap between academia and industry: typically, a company has to re-implement a state-of-the-art network originally written in Python in C++, which is a time-consuming process. The Open Neural Network eXchange (ONNX) format addresses this problem. Framework providers now integrate conversion APIs that let users convert network specification files into ONNX files, making deployment easier. Currently, various frameworks, including PyTorch, TensorFlow, Caffe, MXNet, CNTK, Chainer, and PaddlePaddle, support conversion to and from the ONNX format (ONNX 2020). In this research, we therefore use ONNX directly as our standard, bypassing the diversity of frameworks.

Programming interface in inference

As mentioned previously, inference is strongly tied to the actual hardware, so hardware vendors provide their own interfaces for users. For example, Google provides a series of commands for users to access their cloud TPUs, allowing them to upload models and retrieve the TPU output, as shown in Fig. 2. NVIDIA provides TensorRT with a C++ API for users to load models and execute inference sessions, as shown in Fig. 3. However, these front-ends of existing inference tool-chains do not expose many hardware-related optimization knobs. They simply load a network specification file, parse it, and execute it, and users obtain the output directly from their input. This is convenient for end users but challenging for optimization.

Fig. 2 Cloud TPU usage

Fig. 3 TensorRT usage

Optimizer in inference

Given that end users cannot efficiently optimize the performance and power consumption of an inference session on specific hardware, hardware vendors provide their own built-in optimizers to exploit the advantages of their hardware. For example, TensorRT conducts tensor and layer fusion/concatenation, kernel auto-tuning, memory allocation, etc. with its own algorithms (NVIDIA Corp 2020), as shown in Fig. 4. Some of these optimizations are hardware-related, including memory allocation and tensor layout transformation. However, most vendors do not expose their optimization algorithms or strategies; many high-performance libraries and SDKs are manually well-tuned by human experts with considerable hand-set input parameters, and the details are unavailable to end users. Besides, each vendor conducts deep optimization toward its own products with different algorithms, which brings significant challenges to the universality of tool-chain design.

Fig. 4 TensorRT workflow (NVIDIA Corp 2020)

Cambricon MLU-100 and Software

In this paper, we use the Cambricon MLU-100 and its corresponding tool-chain and SDK to study a better way to program and optimize inference sessions on specialized hardware. The Cambricon MLU-100 is a dedicated deep learning accelerator designed by Cambricon Technologies, and the chip supports inference tasks. The hardware specifications are listed in Table 2.

Table 2 The MLU100 hardware specification

Along with the hardware, the vendor provides two tool-chains: a high-level runtime library and an operator-level SDK. The programming interface of the high-level runtime library is similar to that of the TPU/GPU interfaces mentioned before, as shown in Fig. 5 (the network file passed to cnrtLoadModel must be in the .cambricon format, which can be converted from other formats such as .pb/.caffemodel/.pkl by a provided script).

Fig. 5 CNRT usage

The operator-level SDK, in contrast, supports common operators such as convolution, ReLU, vector scaling, batch normalization, etc. In this research, we mainly use the operator-level SDK to implement and optimize our tool-chain.

Why is our tool-chain necessary?

Although a runtime library and an operator-level SDK already exist for programming the Cambricon MLU-100, this programming model still has problems ranging from actual performance to optimization, making it necessary to develop a tool-chain that offers acceptable performance, flexibility to optimize, and convenience to program.

Performance

The provided runtime library, Cambricon Neuware RunTime (CNRT), supports networks generated by various frameworks. It takes a network specification file in one of several formats as input, packs it into a .cambricon file, loads it, extracts the interface function of the model, and conducts the inference. However, networks in ONNX format, the standard selected for our research, experience a significant performance gap relative to networks in other formats, as shown in Fig. 6.

Fig. 6 Performance of various supported frameworks

The inference performance of ONNX network files is lower than that of other frameworks, and since the runtime library abstracts away all implementation details, it is hard to analyze the reason for the gap and optimize it. Our assumption is that, since all networks are packed into the .cambricon format in advance, the gap between ONNX and other frameworks results from different translation and scheduling processes. With access to the lower-level SDK, we should be able to close this gap.

Programming interface

As mentioned before, the provided runtime library is highly abstracted, leaving almost no optimization space to the user. With the provided operator-level SDK, the user can control the behavior of the hardware at the operator level and adjust settings including the sparse mode, fusion scheme, cores to use, etc. when constructing a network. However, this SDK requires many fragmented lines of code for configuring the device, allocating memory, initializing streams, setting timers, etc., as shown in Fig. 7, which lists the code for executing a single convolution operator. Most of the code is pre-running work; only lines 28–30 perform the actual inference. This is a considerable annoyance when building a network, since much of the work is redundant: the work done in lines 1–25 could be abstracted into a more convenient interface. It should be noted that we are currently not able to program new operators with the existing SDKs.

Fig. 7 Code segment of running a convolution operator with CNML

Under these circumstances, we want to develop a tool-chain that strikes a proper balance between high-level abstraction and low-level hardware specification, giving enough optimization space to the user while guaranteeing programming convenience.

Optimization

To optimize neural network inference, programmers mainly devote their effort to reducing kernel launches and off-chip memory accesses in order to increase FPS and decrease power consumption. The provided runtime library and operator-level SDK offer some optimization knobs, but they are not very convenient or flexible to use.

The optimization knobs provided by the runtime library contain only binary choices between fusion/no fusion and sparse/not sparse. Moreover, once the choice is made, the runtime library packs the original network specification file into the .cambricon format with the selected setting; this static selection cannot be changed after the network file is packed, which is not flexible enough. The fusion/sparsity algorithms are also unavailable, making further optimization on top of the runtime library all but impossible.

On the other hand, the optimization knobs provided by the operator-level SDK include detailed fusion schemes, the selection of cores to use (Model_Parallelism), sparsity, how to fuse layers (Fusion Scheme), etc., some of which are hardware-related. Our tool-chain focuses on the Fusion Scheme and Model_Parallelism knobs. However, programming and optimizing at the original operator level is also inconvenient because of the many redundant statements illustrated in Fig. 8, where lines 19–28 are responsible only for configuring the fusion operator and could be abstracted away.

Fig. 8 Code segment of running a fusion operator with CNML

So it is necessary to develop a tool-chain with a proper balance between software abstraction and hardware specification, with which programmers can optimize the inference session more conveniently. Moreover, principles for optimization are also needed to guide users in optimizing the inference session on the hardware.

Our approach: Paleozoic

In this research, we develop a tool-chain named Paleozoic, which stands for the lower but wider counterpart of the Cambrian in geological time. Our work bridges a higher-level IR (ONNX) and a lower-level IR (operator descriptions): the tool-chain generates C++ code that uses the operator-level SDK directly from a network specification file in ONNX format, so that an inference session can be compiled with g++. Moreover, we integrate several optimization knobs that allow users to conveniently optimize the inference session with their own findings, algorithms, etc. Figure 9 illustrates the structure of our work.

Fig. 9 Workflow of our tool-chain

Front-end

Our front-end is mainly responsible for parsing the network specification file into low-level operator descriptions that match the interface of the provided operator-level SDK.

First, our front-end receives network specification files only in ONNX format; users can convert network specification files from other frameworks using those frameworks' built-in APIs, as most frameworks support conversion to and from ONNX. Part of the normative specification of the ONNX IR semantics is shown in Table 3; to match the programming model of the provided SDK, we design our own lower-level IR, also shown in Table 3.

Table 3 ONNX IR and our IR

In our IR, we expose several hardware-related optimization knobs. It should be noted that a built-in ONNX library in Python already supports parsing ONNX network specification files, but its operator support is incomplete (for example, it lacks dilation in convolution). End users can easily optimize the model at the F-Block level through fusion schemes and part of the hyper-parameters. Besides, we reserve an optimization interface for some other hyper-parameters at the Layer level.

In this work, we use only the front-end part of TVM (TVM.Relay) to parse the network specification file into lower-level operator descriptors and network specifications. The reasons for not integrating the Cambricon MLU-100 into the existing TVM back-end are listed below:

  • The TVM back-end applies a genetic algorithm to every nested loop in the network to find an optimal configuration and thus schedule the code efficiently. However, the provided SDKs of the Cambricon MLU-100 do not expose such a low level.

  • We want to produce a template C++ file that is convenient to debug and profile, which does not fit well with the existing programming model of TVM.

For these reasons, we use only the TVM front-end as a parser.

Our front-end scans the network specification file from the beginning, acquiring the type and parameter settings layer by layer. The shapes of the tensors must also be acquired for the scheduler. The shapes of the tensors flowing through the network depend on the shape of the input tensor and the parameters of the layers, and the network specification file does not contain this information, so after processing a layer, our front-end calculates the shape of that layer's output tensor. Finally, the layer descriptions of the network and the shapes of the corresponding tensors are pushed into a queue to be scheduled. There is also an optimization pipeline before scheduling, which will be discussed in the following part (Fig. 10).

Fig. 10 TVM workflow (Chen et al. 2018)
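
As an illustration of this front-end bookkeeping, the sketch below shows a hypothetical layer descriptor (the field names are ours, not the exact fields of the IR in Table 3), the standard output-size rule used for convolution/pooling layers, and the queue handed to the scheduler.

```cpp
#include <deque>
#include <string>
#include <vector>

// Hypothetical layer descriptor in the spirit of our lower-level IR.
struct LayerDesc {
    std::string type;          // "Conv", "Pool", "Relu", ...
    std::vector<int> inShape;  // NCHW
    std::vector<int> outShape; // filled in by shape inference
    int kernel = 1, stride = 1, pad = 0, dilation = 1, outChannels = 0;
};

// Standard output-size rule for convolution/pooling:
// out = floor((in + 2*pad - dilation*(kernel-1) - 1) / stride) + 1.
int convOutDim(int in, int kernel, int stride, int pad, int dilation) {
    return (in + 2 * pad - dilation * (kernel - 1) - 1) / stride + 1;
}

// After a layer is parsed, its output shape is computed from the input shape
// and layer parameters, then the descriptor is appended to the scheduling queue.
void inferAndEnqueue(LayerDesc layer, std::deque<LayerDesc>& queue) {
    if (layer.type == "Conv" || layer.type == "Pool") {
        layer.outShape = {layer.inShape[0],
                          layer.type == "Conv" ? layer.outChannels : layer.inShape[1],
                          convOutDim(layer.inShape[2], layer.kernel, layer.stride,
                                     layer.pad, layer.dilation),
                          convOutDim(layer.inShape[3], layer.kernel, layer.stride,
                                     layer.pad, layer.dilation)};
    } else {
        layer.outShape = layer.inShape;  // element-wise layers keep the input shape
    }
    queue.push_back(layer);
}
```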

Back-end

The back-end is the main part of our design. As mentioned in Sect. 3, the existing tool-chains are inconvenient in both their programming interfaces and their optimization knobs. Moreover, it would be better if the gap between ONNX and other frameworks could be closed. We therefore design the back-end as follows.

First, we generate a template C++ file using the CNML SDK, which is later compiled into an executable inference session, instead of directly executing the inference session from the input network description file. In this way, users can still run the inference session conveniently by simply compiling the generated C++ file, or they can adjust, debug, and conduct their own optimizations on it. To make it convenient to operate on the template C++ file, we design a middle-layer interface to invoke CNML. Compared to the original CNML SDK, we fold the allocation of tensors into the constructor of each operator (layer) and the configuration of parameters into a single function. Part of our middle-layer interface design is shown in Fig. 12 in UML format, and we are continuing to add support for other common operators, aiming to cover state-of-the-art neural networks in CV and NLP.
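
The following sketch conveys the spirit of this middle layer under our assumptions: tensor allocation is folded into the layer constructor and parameter setup into a single configure call. All class and method names here are illustrative rather than the exact interface (which is shown in Fig. 12), and the underlying CNML calls for tensor creation, operator creation, compilation, and forward computation are elided as comments.

```cpp
#include <utility>
#include <vector>

// Illustrative middle-layer wrapper: the constructor hides tensor allocation,
// configure() hides operator attribute setup, and compile()/forward() hide the
// remaining SDK boilerplate.
class Layer {
public:
    virtual ~Layer() = default;
    virtual void configure(int modelParallelism) = 0;      // MP and other knobs
    virtual void compile() = 0;                             // compile the operator
    virtual void forward(const float* in, float* out) = 0;  // run one inference step
};

class ConvLayer : public Layer {
public:
    ConvLayer(std::vector<int> inShape, int outChannels, int kernel, int stride, int pad)
        : inShape_(std::move(inShape)), outChannels_(outChannels),
          kernel_(kernel), stride_(stride), pad_(pad) {
        // ... allocate input/output/filter tensors on the device here ...
    }
    void configure(int modelParallelism) override {
        mp_ = modelParallelism;
        // ... set sparse mode, precision, and other operator attributes here ...
    }
    void compile() override {
        // ... compile the operator for the chosen Model_Parallelism ...
    }
    void forward(const float* in, float* out) override {
        // ... launch the compiled operator on the device and synchronize ...
        (void)in; (void)out;
    }
private:
    std::vector<int> inShape_;
    int outChannels_, kernel_, stride_, pad_;
    int mp_ = 1;
};
```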

The parameters of the layers in the network are filled into the constructor and the configuration function according to the layer descriptions in the queue pushed by the front-end. It should be noted that the provided CNML SDK does not natively support some operators, so we must find equivalent implementations to guarantee the correctness of the network. For example, given the mean \( \mu \) and variance \( \sigma ^2 \), the natively supported Batch_Normalization computes the result as in Equation 1.

$$\begin{aligned} \hat{x}\leftarrow \frac{x-\mu }{\sqrt{\sigma ^2 + \epsilon }} \end{aligned}$$
(1)

However, the Batch_Normalization in ResNet needs to scale and shift the distribution after the original normalization, i.e., Equation 2 is applied after Equation 1. This variant, called Weighted_Batch_Normalization, is not natively supported by the CNML SDK, so we combine a BN layer with a Scaling layer.

$$\begin{aligned} y \leftarrow \omega \hat{x} + \beta \end{aligned}$$
(2)
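
To see the equivalence concretely, composing Equations 1 and 2 gives a single per-channel affine transform,

$$\begin{aligned} y = \omega \,\frac{x-\mu }{\sqrt{\sigma ^2 + \epsilon }} + \beta = \frac{\omega }{\sqrt{\sigma ^2 + \epsilon }}\,x + \left( \beta - \frac{\omega \mu }{\sqrt{\sigma ^2 + \epsilon }}\right) , \end{aligned}$$

so applying the scale-and-shift of Equation 2 to the normalized output of Equation 1 reproduces Weighted_Batch_Normalization exactly, which is why a BN layer followed by a Scaling layer is a correct substitute.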

In addition to Weighted_Batch_Normalization, some other operators must be substituted with equivalent operators, including GEMM, the LSTM cell, etc.

In terms of the programming interface, as Figure 7 shows, an operator must be compiled before it can run. Before actually compiling the operator, several parameters can be optimized, including the cores to be used (Model_Parallelism, abbreviated MP in the following) and how to fuse the layers (the layer fusion scheme). Our middle-layer interface therefore reserves programming interfaces for these two optimizations, making it convenient for users to optimize the network with their own algorithms. Figure 11 illustrates the usage of our programming interface.

Fig. 11 Usage of our back-end

Fig. 12 Part of interface design

Optimizer

Once we obtain the operator descriptions and network specification (low-level IR) from the high-level IR, we conduct the optimization procedure. In this stage, users can customize their own optimization algorithms with the given optimization knobs, producing an optimized low-level IR. The template C++ code is then scheduled according to the optimized low-level IR. Once the scheduling parameters are determined in the optimization procedure, they do not change dynamically during execution.

As mentioned before, the highly abstracted runtime library is very convenient but offers very limited space to optimize. In contrast, the low-level SDK has a much larger optimization space that may contain better schedule configurations, at the cost of increased complexity in both programming and searching. However, with the efficient search algorithms that have been proposed, including the reinforced GA in Google's REGAL (Paliwal et al. 2020), the parallel simulated annealing in TVM (Chen et al. 2018), and even Google CP-SAT, an approximate solver for NP-hard constraint programming problems (Google 2020), the optimization space that can be handled has increased dramatically. Our work therefore tries to expose as large an optimization space as possible while preserving programming convenience. Our back-end offers several optimization knobs, which we show with actual code.

The first knob is MP, representing the cores to be used by every layer in an inference session. The Cambricon MLU-100 has 32 physical cores (Cambricon Technologies 2019a), so MP can be selected from 1 to 32.

The second knob is the Fusion Scheme, representing how to fuse the layers in the network to exploit data locality. Fusion reduces off-chip memory accesses and thus increases inference throughput, but it may also introduce redundant computation (Ragan-Kelley et al. 2013), making it a trade-off between reduced memory access and redundant computation. This trade-off is left to the users.

A fusion block can also select its own optimal MP, so that MP and the Fusion Scheme can be optimized jointly. The code segment in Figure 13 shows how conveniently MP and layer fusion can be tuned in our back-end.

Fig. 13 Code sample for setting MP and layer fusion of our back-end

Since any two layers can be fused into a fusion block and every fusion block can be assigned its own MP, the total optimization space is huge, as shown in Equation 3, where \( \frac{\prod _{x=1}^{i}(n-x)}{i!} \) represents the choice of fusion scheme and \( 32^{i+1} \) represents the choice of MP for every fusion block.

$$\begin{aligned} Space(n)=\sum _{i=1}^{n-1}\left( 32^{i+1}\times \frac{\prod _{x=1}^{i}(n-x)}{i!}\right) \end{aligned}$$
(3)
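
As a quick sanity check of Equation 3, the short sketch below evaluates the sum directly in double precision (which comfortably holds numbers of this magnitude); the helper name is ours and purely illustrative.

```cpp
#include <cmath>
#include <cstdio>

// Evaluates Equation 3: Space(n) = sum_{i=1}^{n-1} 32^(i+1) * C(n-1, i),
// the number of (fusion scheme, MP) combinations for an n-layer network.
double optimizationSpace(int n) {
    double space = 0.0;
    double binom = 1.0;  // running value of C(n-1, i), starting from C(n-1, 0) = 1
    for (int i = 1; i <= n - 1; ++i) {
        binom *= static_cast<double>(n - i) / i;  // C(n-1, i) = C(n-1, i-1) * (n-i) / i
        space += std::pow(32.0, i + 1) * binom;
    }
    return space;
}

int main() {
    std::printf("Space(32) = %.2e configurations\n", optimizationSpace(32));  // ~3.79e+48
    return 0;
}
```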

This optimization space is too large for a brute-force search: for a network with 32 layers, there are about \( 3.79\times 10^{48} \) potential combinations of settings. User-customized search algorithms are therefore necessary.
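
As one example of such a user-customized algorithm, the hypothetical sketch below randomly samples fusion-scheme/MP configurations and keeps the best one according to a user-supplied latency measurement. All type and function names are illustrative and not part of our interface, and a practical search would use a smarter strategy such as simulated annealing or a genetic algorithm.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <limits>
#include <random>
#include <vector>

// Candidate configuration: a partition of the n layers into consecutive fusion
// blocks plus one MP value (1..32) per block.
struct Config {
    std::vector<int> blockEnd;  // index of the last layer in each fusion block
    std::vector<int> mp;        // Model_Parallelism chosen for each block
};

// Random search: measure() is a user-supplied callback that builds, compiles,
// and times the generated inference session for a candidate configuration.
Config randomSearch(int nLayers, int trials,
                    const std::function<double(const Config&)>& measure) {
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> mpDist(1, 32);
    std::bernoulli_distribution cutHere(0.5);  // end a fusion block at this layer?

    Config best;
    double bestLatency = std::numeric_limits<double>::max();
    for (int t = 0; t < trials; ++t) {
        Config c;
        for (int l = 0; l < nLayers; ++l)
            if (l == nLayers - 1 || cutHere(rng)) c.blockEnd.push_back(l);
        for (std::size_t b = 0; b < c.blockEnd.size(); ++b) c.mp.push_back(mpDist(rng));

        const double latency = measure(c);
        if (latency < bestLatency) { bestLatency = latency; best = c; }
    }
    std::printf("best: %.3f ms with %zu fusion blocks\n", bestLatency, best.blockEnd.size());
    return best;
}
```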

Evaluation

To evaluate our work, we mainly consider the convenience of programming and the space for optimization, comparing our work with the runtime library CNRT and the operator-level SDK CNML. The performance of specific optimization strategies is not evaluated. It should be noted that although there are many optimization knobs in the CNRT configuration file, CNRT only selects the network file that matches the human-supplied settings; once a network file is packed, users cannot make any changes. We select raw performance, amount of code, and optimization knobs as our evaluation metrics.

First, we evaluate the optimization knobs of the various tool-chains, as shown in Table 4. We are continuing to extend support for operators, networks, and optimization knobs. Apart from the Load Balance Mode, we reserve interfaces for sparsity and operator concatenation: sparsity can be configured in the configLayer() stage, and concatenation can be implemented by a ConcatLayer derived from Layer in Fig. 12. Compared to the runtime library CNRT, our back-end offers a much larger optimization space to search. However, a larger space results in increased programming complexity, so it is necessary to evaluate the convenience of programming.

Table 4 Supported optimization knobs

We evaluate the convenience of programming by counting the lines of code needed to program various state-of-the-art neural networks in the CV field. Together with the amount of code, we evaluate the optimization space by counting the potential optimization combinations. The line counts ignore the code responsible for model loading. Table 5 shows the lines of code of the file generated from the network specification file, and Fig. 14 shows the ratio of code amount and the corresponding optimization space. Our work reduces the lines of code considerably and thus reduces programming complexity, while providing users with a huge optimization space.

Table 5 Line of code
Fig. 14 Comparison of code amount and optimization space between existing tool-chain and our work

An additional abstraction layer may increase call and invocation overhead, resulting in a decrease in performance. We therefore also evaluate the performance of our back-end with no optimization and compare it to using the operator-level SDK CNML directly. Figure 15 illustrates the comparison: the gap is tiny, which means our abstraction layer does not have a significant negative influence on performance.

Fig. 15 Performance comparison between CNML and our work

It should be noted that any further optimization that greatly influences performance is done by the users. Here we only dispel the concern that our back-end abstraction might hurt performance, and we ignore the performance of the runtime library CNRT since the optimization strategies inside CNRT are unavailable.

Finally, we evaluate the performance of our work by comparing it with an NVIDIA RTX 2080Ti running TensorRT (NVIDIA Corp 2020, 2018); the hardware specifications of the RTX 2080Ti are listed in Table 6.

Table 6 The RTX 2080TI hardware specification (NVIDIA Corp 2019)

Table 7 provides a detailed description of the deep neural network models used in our evaluation.

Table 7 Networks description (Op is in the unit of TOPs)

We also apply a straightforward optimization strategy to improve the performance of the code generated by our compiler tool-chain. As Fig. 16 shows, with TensorRT, the RTX 2080Ti performs much better than the Cambricon MLU-100 with the existing tool-chain, reaching even half of its theoretical single/half-precision floating-point performance.

Fig. 16 Performance comparison between Cambricon MLU100 and NVIDIA RTX 2080Ti (TensorRT)

By applying a simple strategy of fusing all the layers into one block and setting the maximal MP, the performance of various models on the MLU100 improves significantly, showing that our work has great potential when applying user-customized optimization strategies.

Combining Table 7 and Fig. 16, we make several observations under our simple optimization strategy:

  • Networks with a low operation count per layer (e.g., ResNet and AlexNet) benefit greatly from the layer fusion provided by our work (compared to the rule-based fusion in TensorRT, where one fusion block contains at most one CONV).

  • For MobileNet, although it has a low operation count per layer, it uses depthwise separable convolution, whose access patterns and low data reuse (Shao et al. 2019; Sandler et al. 2018) differ from ordinary convolution and require manual tuning of code at an even lower level than the provided SDK. So the finely tuned TensorRT outperforms our work on MobileNet.

  • For networks with fewer layers but a much higher operation count per layer (VGG-19), the negative influence of the redundant operations introduced by fusion outweighs the positive influence of locality, resulting in lower performance compared to TensorRT.

Currently, we are studying a better optimization strategy for the Cambricon MLU100 accelerator.

Related work

To address the difficulties in programming accelerators and optimizing the generated code, researchers have proposed domain-specific languages (DSLs) to schedule hardware code with high efficiency. Representative DSLs include Halide (Ragan-Kelley et al. 2013) and Tensor Comprehensions (Vasilache et al. 2018). Halide was originally designed for image processing pipelines on NVIDIA GPUs, similar to Diesel (Elango et al. 2018), while Tensor Comprehensions is designed for machine learning applications. Compared to traditional languages like C/C++/Java, DSLs abstract away complex control logic; for example, Tensor Comprehensions infers loop bounds automatically, freeing users from specifying them. Since neural network tasks are highly similar to image processing pipelines, many neural network programming frameworks are based on Halide IR, such as TVM (Chen et al. 2018), which was proposed to address the problem of universality. To generate more efficient hardware code, several scheduling frameworks have also been introduced, including FlexTensor (Zheng et al. 2020) as a back-end optimizer and TASO (Jia et al. 2019) as a graph-level optimizer. In this research, we propose a DSL for the Cambricon MLU100 with lower complexity and higher flexibility compared to the existing tool-chain. In addition to delivering high-performance services, the robustness of such services is gaining increasing attention, including path-extraction-based adversarial defense (Qiu et al. 2019) and task-level error recovery in asymmetric architectures (Leng et al. 2020); integrating these features into existing accelerators efficiently also requires compiler support.

When it comes to optimization, many proposed hardware architectures consider sparsity (Zhu et al. 2019; Albericio et al. 2016), which must be combined with sparse algorithms. Fusion is another commonly applied optimization, including loop fusion (Qiao et al. 2019) and kernel fusion (Wang et al. 2010), though these techniques mainly target CPU/GPU code. For specialized accelerators, few vendors expose their algorithms. In this work, we reserve optimization interfaces for sparsity, fusion, and several other knobs, including tensor layout transformation (Kim et al. 2019) and accuracy adjustment, to enable users to apply their own optimization strategies.

Conclusion

In this paper, we propose a compiler tool-chain design for supporting the inference of DNN models on the specialized accelerator Cambricon MLU100. Our tool-chain includes a network parser, a code generator, and an optimization interface. Combined, it takes in ONNX-based network specification files and generates the corresponding C++ code that wraps the low-level operator library. Our tool-chain maintains programmer-friendly APIs similar to those of the high-level runtime library, while also exposing the programming interfaces for low-level optimization knobs so that users can easily customize their own optimization strategies.

References

  1. Albericio, J et al.: Cnvlutin: Ineffectual-Neuron-Free Deep Neural Network Computing. In: 43rd ACM/IEEE Annual International Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea, June 18–22, 2016, pp. 1–13 (2016). https://doi.org/10.1109/ISCA.2016.11

  2. Alwani, M., et al.: Fused-layer CNN accelerators. In: 49th Annual IEEE/ACM International Symposium on Microarchitecture. (2016)

  3. Chen, T. et al.: TVM: an automated end-to-end optimizing compiler for deep learning. In: 13th USENIX Symposium on Operating Systems Design and Implementation. (2018)

  4. Chen, T. et al.: DianNao: a small-footprint high-throughput accelerator for ubiquitous machine learning. In: Architectural Support for Programming Languages and Operating Systems, ASPLOS ’14, Salt Lake City, UT, USA, March 1–5, 2014. Ed. by Rajeev Balasubramonian, Al Davis, and Sarita V. Adve, pp. 269–284 (2014). https://doi.org/10.1145/2541940.2541967

  5. Chen, Yunji et al.: DaDianNao: a machine-learning supercomputer. In: 47th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2014, Cambridge, United Kingdom, December 13–17. pp. 609–622 (2014). https://doi.org/10.1109/MICRO.2014.58

  6. Copeland, M.: What’s the difference between deep learning training and inference? https://blogs.nvidia.com/blog/2016/08/22/difference-deep-learning-training-inference-ai/. Accessed Feb. 20 (2020)

  7. Cui, W. et al.: Ebird: Elastic batch for improving responsiveness and throughput of deep learning services. In: 37th IEEE International Conference on Computer Design, ICCD 2019, Abu Dhabi, United Arab Emirates, November 17–20, 2019. IEEE, pp. 497–505 (2019). https://doi.org/10.1109/ICCD46524.2019.00075

  8. Dong, Z. et al.: HAWQ: Hessian aware quantization of neural networks with mixed-precision. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27–November 2. pp. 293–302 (2019). https://doi.org/10.1109/ICCV.2019.00038

  9. Elango, V. et al.: Diesel: DSL for linear algebra and neural net computations on GPUs. In: Proceedings of the 2nd ACM SIGPLAN international workshop on machine learning and programming languages, MAPL@PLDI 2018, Philadelphia, PA, USA, June 18–22, 2018. Ed. by Justin Gottschlich and Alvin Cheung, pp. 42–51 (2018).https://doi.org/10.1145/3211346.3211354

  10. Filipovic, J., et al.: Optimizing CUDA code by kernel fusion: application on BLAS. J. Supercomput. 71(10), 3934–3957 (2015)

  11. Google. Route. Schedule. Plan. Assign. Pack. Solve. OR-Tools is fast and portable software for combinatorial optimization. https://developers.google.com/optimization. Accessed May 20, (2020)

  12. Guo, C. et al.: Flexibility for DNN Acceleration via Temporal GPU-Systolic Array Integration. In: CoRR abs/2002.08326 (2020). url: arXiv:2002.08326

  13. He, K. et al.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition. (2016)

  14. Jain, A. et al.: Performance Characterization of DNN Training using TensorFlow and PyTorch on Modern Clusters. In: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019, Albuquerque, NM, USA, September 23–26. pp. 1–11 (2019). https://doi.org/10.1109/CLUSTER.2019.8891042

  15. Jia, Y. et al.: Caffe: convolutional architecture for fast feature embedding. In: Proceedings of the ACM International Conference on Multimedia, MM ’14, Orlando, FL, USA, November 03–07, 2014. Ed. by Kien A. Hua et al. pp. 675–678 (2014). https://doi.org/10.1145/2647868.2654889

  16. Jia, Z. et al.: TASO: optimizing deep learning computation with automatic generation of graph substitutions. In: Proceedings of the 27th ACM Symposium on Operating Systems Principles. (2019)

  17. Jouppi, N.P. et al.: In-datacenter performance analysis of a tensor processing unit. In: Proceedings of the 44th Annual International Symposium on Computer Architecture. (2017)

  18. Kim, J. et al.: A code generator for high-performance tensor contractions on GPUs. In: IEEE/ACM International Symposium on Code Generation and Optimization. (2019)

  19. Krizhevsky, A., et al.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems. (2012)

  20. Leng, Jingwen et al.: Asymmetric Resilience: Exploiting Task-Level Idempotency for Transient Error Recovery in Accelerator-Based Systems. In: IEEE International Symposium on High Performance Computer Architecture, HPCA 2020, San Diego, CA, USA, February 22–26, 2020. IEEE, pp. 44–57 (2020). https://doi.org/10.1109/HPCA47549.2020.00014

  21. Liu, D-F et al.: PuDianNao: a polyvalent machine learning accelerator. In: Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. (2015)

  22. Marchisio, A., Hanif, M.A., Shafique, M.: CapsAcc: An Efficient Hardware Accelerator for CapsuleNets with Data Reuse. In: Design, Automation & Test in Europe Conference & Exhibition, DATE 2019, Florence, Italy, March 25-29, 2019. Ed. by Jürgen Teich and Franco Fummi. pp. 964–967 (2019). https://doi.org/10.23919/DATE.2019.8714922

  23. MXNet. A flexible and efficient library for deep learning. A truly open source deep learning framework suited for flexible research prototyping and production. https://mxnet.apache.org/. Accessed Feb. 20 (2020)

  24. NVIDIA Corp. Geforce RTX 2080Ti. User Guide. (2019)

  25. NVIDIA Corp. NVIDIA AI INFERENCE PLATFORM. Giant Leaps in Performance and Efficiency for AI Services, from the Data Center to the Network’s Edge. (2018)

  26. NVIDIA Corp. NVIDIA TensorRT. Programmable Inference Accelerator. (2020)

  27. ONNX. Open Neural Network Exchange. The open standard for machine learning interoperability. http://onnx.ai. Accessed Feb. 20 (2020)

  28. Paliwal, A. et al.: Reinforced genetic algorithm learning for optimizing computation graphs. In: 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26–30 (2020)

  29. Qiao, B. et al.: From loop fusion to kernel fusion: a domain-specific approach to locality optimization. In: IEEE/ACM International Symposium on Code Generation and Optimization. (2019)

  30. Qiu, Yuxian et al.: Adversarial Defense Through Network Profiling Based Path Extraction. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long Beach, CA, USA, June 16–20. pp. 4777–4786 (2019). https://doi.org/10.1109/CVPR.2019.00491

  31. Quinton, P.: Systolic arrays: why and how? In: Parcella 1994, VI. International Workshop on Parallel Processing by Cellular Automata and Arrays, Potsdam, Germany, September 21–23, 1994. Proceedings. Ed. by Chris R. Jesshope, Vesselin Jossifov, and Wolfgang Wilhelmi. Vol. 81. Mathematical Research. pp. 39–50 (1994)

  32. Ragan-Kelley. J., et al.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: Conference on Programming Language Design and Implementation. (2013)

  33. Sandler, M., et al.: MobileNetV2: Inverted Residuals and Linear Bottlenecks. In: Conference on Computer Vision and Pattern Recognition. (2018)

  34. Shao, Y.S. et al.: Simba: Scaling Deep-Learning Inference with Multi-Chip-Module-Based Architecture. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12–16, 2019. ACM, pp. 14–27 (2019). https://doi.org/10.1145/3352460.3358302

  35. Simonyan, K., Zisserman, A.: Very Deep Convolutional Networks for Large-Scale Image Recognition. In: 3rd International Conference on Learning Representations. (2015)

  36. Cambricon Technologies. Cambricon MLU100 Datasheet. Aug. (2019)

  37. Cambricon Technologies. Cambricon Neuware Whitesheet. Aug. (2019)

  38. Vasilache, N., et al.: Tensor comprehensions: framework-agnostic high-performance machine learning abstractions. In: CoRR 1802.04730 (2018)

  39. Wang, G., Lin, Y., Yi, W.: Kernel fusion: an effective method for better power efficiency on multithreaded GPU. In: 2010 IEEE/ACM Int’l Conference on Green Computing and Communications. (2010)

  40. Zhang, S. et al.: Cambricon-X: An accelerator for sparse neural networks. In: 49th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2016, Taipei, Taiwan, October 15–19, 20:1–20:12 (2016). https://doi.org/10.1109/MICRO.2016.7783723

  41. Zhang, W et al.: Laius: towards latency awareness and improved utilization of spatial multitasking accelerators in datacenters. In: Proceedings of the ACM International Conference on Supercomputing, ICS 2019, Phoenix, AZ, USA, June 26–28, 2019. Ed. by Rudolf Eigenmann, Chen Ding, and Sally A. McKee. ACM, pp. 58–68 (2019). https://doi.org/10.1145/3330345.3330351

  42. Zheng, S. et al.: FlexTensor: An Automatic Schedule Exploration and Optimization Framework for Tensor Computation on Heterogeneous System. In: ASPLOS ’20: Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, March 16-20, 2020 [ASPLOS 2020 was canceled because of COVID-19]. Ed. by James R. Larus, Luis Ceze, and Karin Strauss. pp. 859-873 (2020). https://doi.org/10.1145/3373376.3378508

  43. Zhou, X et al.: Cambricon-S: Addressing Irregularity in Sparse Neural Networks through A Cooperative Software/Hardware Approach. In: 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018, Fukuoka, Japan, October 20–24. pp. 15–28 (2018). https://doi.org/10.1109/MICRO.2018.00011

  44. Zhu, M. et al.: Sparse Tensor Core: Algorithm and Hardware Co-Design for Vector-wise Sparse Neural Networks on Modern GPUs. In: Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2019, Columbus, OH, USA, October 12-16, pp. 359-371 (2019). https://doi.org/10.1145/3352460.3358269

Acknowledgements

We thank the anonymous reviewers for their constructive feedback. This work was supported by the National Key R&D Program of China (2019YFF0302600) and the National Natural Science Foundation of China (NSFC) Grants (61702328 and 61832006). Any opinions, findings, and conclusions in this paper are those of the authors only and do not necessarily reflect the views of our sponsors.

Author information

Corresponding authors

Correspondence to Jingwen Leng or Minyi Guo.

Cite this article

Liu, Z., Leng, J., Lu, G. et al. Survey and design of paleozoic: a high-performance compiler tool chain for deep learning inference accelerator. CCF Trans. HPC 2, 332–347 (2020). https://doi.org/10.1007/s42514-020-00044-7

Keywords

  • Deep learning accelerator
  • Compiler tool-chain
  • Hardware-related optimization