Deep learning parallel computing and evaluation for embedded system clustering architecture processor


In the era of intelligent systems, embedded devices are increasingly relied on to process large volumes of information and to run a variety of intelligent applications, and machine learning algorithms play an increasingly important role as a result. High-performance embedded computing is an effective way to address the limited computing power of embedded devices. New intelligent embedded applications based on machine learning demand more computation than traditional embedded systems can readily supply. To address this problem, this paper studies parallel optimization and implementation techniques for convolutional neural networks on the Parallella platform. A parallel optimization strategy is given for convolutional neural networks on the clustered-architecture processor of a heterogeneous multi-core system. The high-performance implementation of a convolutional neural network on the Parallella platform is then studied, and a functional convolutional neural network system is implemented. A set of performance evaluation methods for embedded parallel processors is also proposed. From the application standpoint of the S698P, the eCos operating system is selected as the platform; single-core and multi-core modes are compared on the GRSIM simulator, and a parallel performance evaluation is given. Experiments show that the efficiency of deep learning tasks is significantly improved compared with traditional parallel methods.




Author information



Corresponding author

Correspondence to Yue Zu.


About this article


Cite this article

Zu, Y. Deep learning parallel computing and evaluation for embedded system clustering architecture processor. Des Autom Embed Syst 24, 145–159 (2020).



  • Clustered architecture processor
  • Parallel computing
  • Deep learning
  • Performance evaluation