
StreamDrive: a Dynamic Dataflow Framework for Clustered Embedded Architectures

Journal of Signal Processing Systems

Abstract

In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies the development of dynamic dataflow applications starting from sequential reference C code and lets applications handle heterogeneous and application-specific processing elements seamlessly. We address the efficient implementation of a dynamic dataflow runtime system in constrained embedded environments, an issue that previous research has not sufficiently addressed. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, which is typical of the image processing domain. We used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation has less than 10% parallelization overhead and near-linear speedup as the number of processors increases from 1 to 8. It achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64 KB of shared memory, and 30 VGA frames per second with 8 processors and 128 KB of shared memory.

Notes

  1. This set includes relatively generic instructions, such as MAC4CLIP, which performs SIMD multiplication on the bytes of two input operands, saturates the two 16-bit results, and accumulates them with the result operand, as well as instructions dedicated to specific image processing functions, such as XORSBCW, used in Support Vector Machine (SVM) processing, which calculates the Hamming distance between two vectors.

  2. In this paper, we also use the term actor for KPN processes for the sake of consistency.

  3. In the actual implementation, we implemented two variants of ORB: (1) with a tightly coupled rescaler HW block, where the pyramid construction is part of the dataflow graph, and (2) with the pyramid construction as a pre-processing step.

  4. Any field of the Keyp_t structure can be used to communicate the number of corners.

  5. The synchronization overhead includes the actions required to verify token availability and the associated scheduler actions.

  6. Unless there is uncontrolled accumulation of tokens in a channel.

  7. This real-time requirement also takes the match part of the application into account.
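
Note 1 mentions an instruction that calculates the Hamming distance between two vectors. As an illustration only, the operation can be sketched in portable C as an XOR followed by a bit count. This is a scalar software model under our own assumptions, not the ASMP instruction itself: the names `popcount32` and `hamming_distance`, and the choice of 32-bit words as the vector storage unit, are hypothetical and do not come from the paper.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 32-bit word (portable, no compiler builtins).
 * Hypothetical helper, not part of StreamDrive or the ASMP instruction set. */
static unsigned popcount32(uint32_t x)
{
    unsigned count = 0;
    while (x != 0) {
        x &= x - 1;   /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Hamming distance between two bit vectors stored as arrays of 32-bit words.
 * XOR marks the bit positions where the vectors differ; the bit count then
 * tallies those positions. The word width is an assumption for illustration. */
unsigned hamming_distance(const uint32_t *a, const uint32_t *b, size_t nwords)
{
    unsigned dist = 0;
    for (size_t i = 0; i < nwords; i++) {
        dist += popcount32(a[i] ^ b[i]);
    }
    return dist;
}
```

A dedicated instruction would presumably fold the XOR and the bit count into a single operation, which is why this kind of kernel is a natural candidate for an application-specific extension.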

Acknowledgements

This research was partially funded by the H2020 Project OPRECOMP (CA 732631) and by the ERC-ADG Project Multitherman (CA 291125). The authors would also like to thank the STMicroelectronics Embedded Computing Systems management for supporting this research.

Author information

Corresponding author

Correspondence to Arthur Stoutchinin.

About this article

Cite this article

Stoutchinin, A., Benini, L. StreamDrive: a Dynamic Dataflow Framework for Clustered Embedded Architectures. J Sign Process Syst 91, 275–301 (2019). https://doi.org/10.1007/s11265-018-1351-1

  • DOI: https://doi.org/10.1007/s11265-018-1351-1
