Evaluation of Speculation in Out-of-Order Execution of Synchronous Dataflow Networks

Published in: International Journal of Parallel Programming

Abstract

Dataflow process networks are a convenient formalism for implementing robust concurrent systems and have been successfully used for both hardware and software systems in the past. However, their strictly stream-based execution limits performance and requires careful balancing of the entire execution to avoid backpressure and idle nodes. Inspired by related techniques in processor architectures, we introduced out-of-order execution of dataflow process networks in previous work. In this paper, we extend this improvement with speculation on input values, allowing otherwise idle processes to start computations with speculated inputs. Clearly, outputs based on speculated inputs must be held back until the speculation is proven correct, and must be withdrawn if it turns out to be wrong. In contrast to related work, our approach is implemented purely in software on standard hardware, so that it targets a broad range of multicore processors. Moreover, a software implementation allows us to adapt parameters dynamically to the needs of the application. In particular, we can enforce a user-defined hit ratio of the speculation, which may even switch speculation off entirely. After a detailed description of the approach and a discussion of implementation options, we demonstrate its feasibility on a number of benchmarks, where speculation achieved an average speedup of 1.2 over non-speculative out-of-order execution.
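The mechanism described in the abstract can be illustrated with a minimal sketch of a single dataflow node that speculates on its next input. This is a hypothetical illustration, not the authors' implementation: it uses a simple last-value predictor, holds the speculative output back until the real input token arrives, commits it on a hit, recomputes on a miss, and disables speculation once the observed hit ratio falls below a user-defined threshold.

```python
# Hypothetical sketch of input-value speculation for one dataflow node.
# Assumptions (not from the paper): last-value prediction, a single
# pending speculation per node, and a simple hit-ratio cutoff.

class SpeculativeNode:
    def __init__(self, func, min_hit_ratio=0.5):
        self.func = func               # the node's computation
        self.min_hit_ratio = min_hit_ratio
        self.hits = 0
        self.tries = 0
        self.last_input = None         # state of the last-value predictor
        self.pending = None            # (predicted_input, held-back output)

    def speculation_enabled(self):
        # Speculate only while the observed hit ratio stays above the
        # user-defined threshold (always allow the first few attempts).
        return self.tries < 4 or self.hits / self.tries >= self.min_hit_ratio

    def idle(self):
        # Called when the node has no real input token: start a
        # speculative firing with the predicted input value.
        if self.last_input is not None and self.speculation_enabled():
            self.pending = (self.last_input, self.func(self.last_input))

    def fire(self, actual_input):
        # Called when the real input token arrives: commit the held-back
        # speculative output on a correct prediction, else withdraw it
        # and recompute with the actual input.
        out = None
        if self.pending is not None:
            predicted, spec_out = self.pending
            self.pending = None
            self.tries += 1
            if predicted == actual_input:
                self.hits += 1
                out = spec_out             # speculation proven correct
        if out is None:
            out = self.func(actual_input)  # misspeculation: recompute
        self.last_input = actual_input
        return out

node = SpeculativeNode(lambda x: x * x)
node.fire(3)         # ordinary firing, primes the predictor
node.idle()          # node is idle: speculate with last input 3
print(node.fire(3))  # hit: commits the held-back result 9
```

In a real network the held-back outputs would live in per-channel buffers so that downstream nodes never observe an unvalidated token; the hit-ratio cutoff corresponds to the paper's dynamically adapted, user-defined speculation threshold.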

[Figures 1–11 are not included in this preview.]


Notes

  1. http://www.openmp.org/mp-documents/spec30.pdf.

  2. http://threadingbuildingblocks.org.

  3. http://www.averest.org.


Author information

Correspondence to Daniel Baudisch.


Cite this article

Baudisch, D., Schneider, K. Evaluation of Speculation in Out-of-Order Execution of Synchronous Dataflow Networks. Int J Parallel Prog 43, 86–129 (2015). https://doi.org/10.1007/s10766-013-0277-2

