Abstract
Convolutional neural network (CNN)-based inference is a quintessential component of mobile machine-learning applications. Privacy and real-time response requirements compel applications to perform inference on the mobile (edge) devices themselves. Heterogeneous multi-processor systems-on-chip (HMPSoCs) within edge devices enable high-throughput, low-latency edge inference. An HMPSoC contains several types of processing cores, each capable of performing CNN inference independently. However, to meet stringent performance requirements, an application must engage all core types in inference simultaneously. A software-based CNN inference pipeline allows all the cores of an HMPSoC to work synergistically toward high-throughput, low-latency CNN inference. In this chapter, we present two such CNN inference pipeline designs. The first design creates a pipeline between the two different types of CPU cores. The second design extends this pipeline from the CPU to the GPU. We also provide a future perspective and research directions on the subject.
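The pipelining idea summarized above can be illustrated with a minimal sketch: split a network into two stages, run each stage in its own worker, and connect them with a bounded queue so that, while the second stage processes frame i, the first stage is already working on frame i+1. This is an illustrative assumption-laden toy (the stage functions here are trivial placeholders standing in for CNN layer groups mapped to, say, big CPU cores and the GPU), not the chapter's actual implementation.

```python
import queue
import threading

def stage_a(x):
    # Placeholder for the first group of CNN layers
    # (e.g., the sub-network mapped to the big CPU cores).
    return x + 1

def stage_b(x):
    # Placeholder for the remaining layers
    # (e.g., the sub-network mapped to the GPU).
    return x * 2

def pipelined_inference(frames):
    """Run the two stages concurrently on a stream of frames.

    A bounded queue decouples the stages while limiting buffering,
    which bounds both memory use and per-frame latency.
    """
    q = queue.Queue(maxsize=2)
    results = []

    def producer():
        for f in frames:
            q.put(stage_a(f))
        q.put(None)  # sentinel: end of stream

    def consumer():
        while True:
            item = q.get()
            if item is None:
                break
            results.append(stage_b(item))

    t1 = threading.Thread(target=producer)
    t2 = threading.Thread(target=consumer)
    t1.start(); t2.start()
    t1.join(); t2.join()
    return results

print(pipelined_inference([0, 1, 2]))  # [2, 4, 6]
```

With per-stage latencies roughly balanced, steady-state throughput approaches one frame per stage latency rather than one frame per whole-network latency, which is the motivation for engaging all core types at once.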
Acknowledgements
This research is partially supported by the National Research Foundation Singapore under its Competitive Research Program Award NRF-CRP23-2019-0003 and Singapore Ministry of Education Academic Research Fund T1 251RES1905.
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this chapter
Aghapour, E., Zhang, Y., Pathania, A., Mitra, T. (2024). Pipelined CNN Inference on Heterogeneous Multi-processor System-on-Chip. In: Pasricha, S., Shafique, M. (eds) Embedded Machine Learning for Cyber-Physical, IoT, and Edge Computing. Springer, Cham. https://doi.org/10.1007/978-3-031-39932-9_16
Print ISBN: 978-3-031-39931-2
Online ISBN: 978-3-031-39932-9