StreamDrive: a Dynamic Dataflow Framework for Clustered Embedded Architectures

  • Arthur Stoutchinin
  • Luca Benini

Abstract

In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies the development of dynamic dataflow applications starting from sequential reference C code and lets applications handle heterogeneous and application-specific processing elements seamlessly. We address the efficient implementation of a dynamic dataflow runtime system in constrained embedded environments, an issue that previous research has not sufficiently addressed. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, which is typical of the image processing domain. We used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation has less than 10% parallelization overhead and near-linear speedup as the number of processors increases from 1 to 8. It achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64 KB of shared memory, and 30 VGA frames per second with 8 processors and 128 KB of shared memory.
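
To put the reported frame rates in terms of raw pixel throughput, the short sketch below works through the arithmetic. It assumes the standard VGA resolution of 640×480 pixels, which the abstract does not state explicitly, and the function names are purely illustrative. The frame rates and processor counts are the ones quoted in the abstract.

```python
# Illustrative arithmetic only. Frame rates and processor counts are taken
# from the abstract; VGA = 640x480 is an assumption (the standard resolution).
VGA_WIDTH, VGA_HEIGHT = 640, 480


def pixels_per_second(frames_per_second: int) -> int:
    """Pixel throughput implied by a given VGA frame rate."""
    return VGA_WIDTH * VGA_HEIGHT * frames_per_second


def scaling_efficiency(fps_small: float, cores_small: int,
                       fps_large: float, cores_large: int) -> float:
    """Throughput gain divided by the increase in processor count."""
    return (fps_large / fps_small) / (cores_large / cores_small)


if __name__ == "__main__":
    print(pixels_per_second(15))             # configuration with 4 processing elements
    print(pixels_per_second(30))             # configuration with 8 processors
    print(scaling_efficiency(15, 4, 30, 8))  # 1.0 would mean perfectly linear scaling
```

Because the quoted frame rates are rounded and the shared memory also doubles between the two configurations, the resulting ratio is only a rough consistency check on the near-linear speedup claim, not a measurement.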

Keywords

Embedded, Multicore, Shared memory, Dataflow, Kahn process, Heterogeneous, Accelerator

Notes

Acknowledgements

This research was partially funded by the H2020 Project OPRECOMP (CA 732631) and by the ERC-ADG Project Multitherman (CA 291125). The authors would also like to thank the ST Microelectronics’ Embedded Computing Systems management for supporting this research.


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  1. ST Microelectronics, Grenoble, France
  2. Electrical, Electronic, and Information Engineering Department, University of Bologna, Bologna, Italy
  3. Integrated Systems Laboratory, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland
