Pipelined CNN Inference on Heterogeneous Multi-processor System-on-Chip

Chapter in: Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing

Abstract

Convolutional neural network (CNN) inference is a quintessential component of mobile machine-learning applications. Privacy and real-time response requirements compel applications to perform inference on the mobile (edge) devices themselves. Heterogeneous multi-processor systems-on-chip (HMPSoCs) within edge devices enable high-throughput, low-latency edge inference. An HMPSoC contains several processing cores, each capable of performing CNN inference independently. However, to meet stringent performance requirements, an application must engage all core types in inference simultaneously. A software-based CNN inference pipeline allows all the cores in an HMPSoC to be engaged synergistically for high-throughput, low-latency CNN inference. In this chapter, we present two CNN inference pipeline designs. The first creates a pipeline between two different types of CPU cores. The second extends the pipeline from the CPU to the GPU. We also provide a future perspective and research directions on the subject.
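To illustrate the pipelining idea described above, here is a minimal sketch of a two-stage software inference pipeline. It is not the chapter's actual implementation: the layer functions are stand-ins, and in a real HMPSoC deployment each stage thread would be pinned to a different core cluster (e.g., big vs. LITTLE CPU cores, or CPU vs. GPU) and would execute a partition of the CNN's layers through an inference framework.

```python
import queue
import threading

SENTINEL = None  # marks the end of the input frame stream


def front_layers(x):
    # Stand-in for the first partition of the network (e.g., run on big cores).
    return x * 2


def back_layers(x):
    # Stand-in for the remaining layers (e.g., run on LITTLE cores or the GPU).
    return x + 1


def stage(in_q, out_q, fn):
    """One pipeline stage: repeatedly pull a frame, process it, push the result."""
    while True:
        item = in_q.get()
        if item is SENTINEL:
            out_q.put(SENTINEL)  # propagate shutdown downstream
            break
        out_q.put(fn(item))


def pipelined_inference(frames):
    """Push frames through two concurrent stages; both stages overlap in time,
    which is what raises throughput relative to running the whole network
    on one core type at a time."""
    q_in, q_mid, q_out = queue.Queue(), queue.Queue(), queue.Queue()
    workers = [
        threading.Thread(target=stage, args=(q_in, q_mid, front_layers)),
        threading.Thread(target=stage, args=(q_mid, q_out, back_layers)),
    ]
    for w in workers:
        w.start()
    for f in frames:
        q_in.put(f)
    q_in.put(SENTINEL)

    results = []
    while True:
        r = q_out.get()
        if r is SENTINEL:
            break
        results.append(r)
    for w in workers:
        w.join()
    return results
```

Because each stage holds a different frame at any instant, steady-state throughput is bounded by the slowest stage rather than by the sum of all layers; balancing the layer split across core types is therefore the central design question the chapter's pipeline designs address.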



Acknowledgements

This research is partially supported by the National Research Foundation Singapore under its Competitive Research Program Award NRF-CRP23-2019-0003 and Singapore Ministry of Education Academic Research Fund T1 251RES1905.

Author information

Correspondence to Tulika Mitra.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this chapter

Cite this chapter

Aghapour, E., Zhang, Y., Pathania, A., Mitra, T. (2024). Pipelined CNN Inference on Heterogeneous Multi-processor System-on-Chip. In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-39932-9_16

  • DOI: https://doi.org/10.1007/978-3-031-39932-9_16

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39931-2

  • Online ISBN: 978-3-031-39932-9

  • eBook Packages: Engineering (R0)
