StreamDrive: a Dynamic Dataflow Framework for Clustered Embedded Architectures

Stoutchinin, Arthur; Benini, Luca

doi:10.1007/s11265-018-1351-1

Published: 08 March 2018

Volume 91, pages 275–301, (2019)

Journal of Signal Processing Systems

Abstract

In this paper, we present StreamDrive, a dynamic dataflow framework for programming clustered embedded multicore architectures. StreamDrive simplifies the development of dynamic dataflow applications starting from sequential reference C code and allows applications to handle heterogeneous and application-specific processing elements seamlessly. We address the efficient implementation of the dynamic dataflow runtime system in constrained embedded environments, an issue that has not been sufficiently addressed by previous research. We conducted a detailed performance evaluation of the StreamDrive implementation on our Application Specific MultiProcessor (ASMP) cluster using the Oriented FAST and Rotated BRIEF (ORB) algorithm, which is typical of the image processing domain. We used the proposed incremental development flow to transform the original ORB reference C code into an optimized dynamic dataflow implementation. Our implementation incurs less than 10% parallelization overhead, shows near-linear speedup as the number of processors increases from 1 to 8, and achieves 15 VGA frames per second with a small cluster configuration of 4 processing elements and 64 KB of shared memory, and 30 VGA frames per second with 8 processors and 128 KB of shared memory.
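As an illustration of what "dynamic" means in a dynamic dataflow model, the following is a minimal sketch in C of an actor whose token consumption depends on the data it reads. It is not StreamDrive's API: the names `fifo_t`, `fifo_push`, `fifo_pop`, `sum_actor_try_fire`, and the channel depth `FIFO_CAPACITY` are all hypothetical, chosen only for this example. The actor reads a header token `n` followed by `n` values and emits their sum; it fires only when its firing rule (enough input tokens, room on the output) is satisfied.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define FIFO_CAPACITY 8  /* hypothetical channel depth, in tokens */

/* A bounded FIFO channel carrying 32-bit tokens (illustrative only). */
typedef struct {
    uint32_t data[FIFO_CAPACITY];
    size_t head;   /* index of the oldest token */
    size_t count;  /* number of tokens currently stored */
} fifo_t;

/* Append a token; returns false if the channel is full. */
static bool fifo_push(fifo_t *f, uint32_t token) {
    if (f->count == FIFO_CAPACITY) return false;
    f->data[(f->head + f->count) % FIFO_CAPACITY] = token;
    f->count++;
    return true;
}

/* Remove and return the oldest token; the channel must not be empty. */
static uint32_t fifo_pop(fifo_t *f) {
    assert(f->count > 0);
    uint32_t token = f->data[f->head];
    f->head = (f->head + 1) % FIFO_CAPACITY;
    f->count--;
    return token;
}

/*
 * Hypothetical actor: the first input token is a length n, followed by
 * n values. The actor consumes n + 1 tokens and produces one token (the sum).
 * Because n is only known at run time, the consumption rate is dynamic.
 * Returns true if the actor fired, false if its firing rule was not met.
 */
static bool sum_actor_try_fire(fifo_t *in, fifo_t *out) {
    if (in->count == 0) return false;            /* no header token yet */
    size_t n = (size_t)in->data[in->head];       /* peek at the header */
    if (in->count <= n) return false;            /* fewer than n + 1 tokens */
    if (out->count == FIFO_CAPACITY) return false; /* no room for the result */

    (void)fifo_pop(in);                          /* consume the header */
    uint32_t sum = 0;
    for (size_t i = 0; i < n; i++) sum += fifo_pop(in);
    return fifo_push(out, sum);
}
```

A scheduler in such a model would call `sum_actor_try_fire` repeatedly and move on to another actor whenever it returns false. In this sketch the availability check is kept separate from the actor's computation.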

Notes

This set includes relatively generic instructions, such as MAC4CLIP, which performs SIMD multiplication on the bytes of two input operands, saturates the two 16-bit results, and accumulates them with the result operand, as well as instructions dedicated to specific image processing functions, such as XORSBCW, used in Support Vector Machine (SVM), which calculates the Hamming distance between two vectors.
In this paper, we also use the term actor for KPN processes for the sake of consistency.
In the actual implementation, we have implemented two variants of ORB: (1) with a tightly-coupled rescaler HW block, where the pyramid construction is part of the dataflow graph, and (2) with the pyramid construction as a pre-processing step.
Any field of the Keyp_t structure can be used to communicate the number of corners.
The synchronization overhead includes the actions required to verify token availability and the associated scheduler actions.
Unless there is uncontrolled accumulation of tokens in a channel.
This real-time requirement also takes the match part of the application into account.
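
The first note above describes XORSBCW as computing the Hamming distance between two vectors. As a point of reference, a plain C version of that computation, without any application-specific instruction, is an XOR followed by a bit count. The sketch below assumes bit vectors packed into 32-bit words; the names `popcount32` and `hamming_distance` are illustrative, and the code does not model the actual operand format or semantics of XORSBCW.

```c
#include <stddef.h>
#include <stdint.h>

/* Count the set bits in a 32-bit word by clearing the lowest set bit
 * on each iteration (portable, no compiler intrinsics). */
static unsigned popcount32(uint32_t x) {
    unsigned n = 0;
    while (x != 0) {
        x &= x - 1;  /* clears the lowest set bit */
        n++;
    }
    return n;
}

/* Hamming distance between two bit vectors stored as arrays of
 * `words` 32-bit words: the number of bit positions that differ. */
static unsigned hamming_distance(const uint32_t *a, const uint32_t *b,
                                 size_t words) {
    unsigned dist = 0;
    for (size_t i = 0; i < words; i++) {
        dist += popcount32(a[i] ^ b[i]);  /* differing bits in word i */
    }
    return dist;
}
```

The inner loop of `popcount32` runs once per set bit, which is the kind of per-word cost a dedicated instruction would be expected to reduce.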

Acknowledgements

This research was partially funded by the H2020 Project Opecomp (CA 732631) and by the ERC-ADG Project Multitherman (CA 291125). The authors would also like to thank the ST Microelectronics Embedded Computing Systems management for supporting this research.

Author information

Authors and Affiliations

ST Microelectronics, Grenoble, France
Arthur Stoutchinin
Electrical, Electronic, and Information Engineering Department, University of Bologna, Bologna, Italy
Luca Benini
Integrated Systems Laboratory, Swiss Federal Institute of Technology (ETH), Zurich, Switzerland
Luca Benini

Corresponding author

Correspondence to Arthur Stoutchinin.

About this article

Cite this article

Stoutchinin, A., Benini, L. StreamDrive: a Dynamic Dataflow Framework for Clustered Embedded Architectures. J Sign Process Syst 91, 275–301 (2019). https://doi.org/10.1007/s11265-018-1351-1

Received: 25 August 2017
Revised: 06 December 2017
Accepted: 23 February 2018
Published: 08 March 2018
Issue Date: March 2019
DOI: https://doi.org/10.1007/s11265-018-1351-1
