Evaluation of Speculation in Out-of-Order Execution of Synchronous Dataflow Networks

Published in: International Journal of Parallel Programming

Abstract

Dataflow process networks are a convenient formalism for implementing robust concurrent systems and have been successfully used for both hardware and software systems in the past. However, their strictly stream-based execution limits performance and requires careful balancing of the entire execution to avoid backpressure and idle nodes. Inspired by related techniques in processor architectures, we introduced out-of-order execution of dataflow process networks in previous work. In this paper, we extend this improvement with speculation on input values, allowing otherwise idle processes to start computations with speculated inputs. Clearly, outputs based on speculated inputs must be held back until the speculation is proven correct, and must be withdrawn if it turns out to be wrong. In contrast to related work, our approach is implemented purely in software on standard hardware, so that it targets a broad range of multicore processors. Moreover, a software implementation allows us to adapt parameters dynamically to the needs of the application. In particular, we can enforce a user-defined hit ratio of the speculation, which may even switch speculation off entirely. After a detailed description of the approach and a discussion of implementation options, we demonstrate its feasibility on a number of benchmarks, where speculation achieved an average speedup of 1.2 over non-speculative out-of-order execution.
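The mechanism described in the abstract can be illustrated with a minimal sketch of a single dataflow node that speculates on its next input. This is a hypothetical illustration, not the authors' implementation: it uses a simple last-value predictor, holds the speculative output back until the real input token arrives, commits it on a hit, recomputes on a miss, and disables speculation once the observed hit ratio falls below a user-defined threshold.

```python
# Hypothetical sketch of input-value speculation for one dataflow node.
# Assumptions (not from the paper): last-value prediction, a single
# pending speculation per node, and a simple hit-ratio cutoff.

class SpeculativeNode:
    def __init__(self, func, min_hit_ratio=0.5):
        self.func = func               # the node's computation
        self.min_hit_ratio = min_hit_ratio
        self.hits = 0
        self.tries = 0
        self.last_input = None         # state of the last-value predictor
        self.pending = None            # (predicted_input, held-back output)

    def speculation_enabled(self):
        # Speculate only while the observed hit ratio stays above the
        # user-defined threshold (always allow the first few attempts).
        return self.tries < 4 or self.hits / self.tries >= self.min_hit_ratio

    def idle(self):
        # Called when the node has no real input token: start a
        # speculative firing with the predicted input value.
        if self.last_input is not None and self.speculation_enabled():
            self.pending = (self.last_input, self.func(self.last_input))

    def fire(self, actual_input):
        # Called when the real input token arrives: commit the held-back
        # speculative output on a correct prediction, else withdraw it
        # and recompute with the actual input.
        out = None
        if self.pending is not None:
            predicted, spec_out = self.pending
            self.pending = None
            self.tries += 1
            if predicted == actual_input:
                self.hits += 1
                out = spec_out             # speculation proven correct
        if out is None:
            out = self.func(actual_input)  # misspeculation: recompute
        self.last_input = actual_input
        return out

node = SpeculativeNode(lambda x: x * x)
node.fire(3)         # ordinary firing, primes the predictor
node.idle()          # node is idle: speculate with last input 3
print(node.fire(3))  # hit: commits the held-back result 9
```

In a real network the held-back outputs would live in per-channel buffers so that downstream nodes never observe an unvalidated token; the hit-ratio cutoff corresponds to the paper's dynamically adapted, user-defined speculation threshold.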

[Figures 1–11 are not included in this preview.]


Notes

  1. http://www.openmp.org/mp-documents/spec30.pdf.

  2. http://threadingbuildingblocks.org.

  3. http://www.averest.org.


Author information

Correspondence to Daniel Baudisch.


Cite this article

Baudisch, D., Schneider, K. Evaluation of Speculation in Out-of-Order Execution of Synchronous Dataflow Networks. Int J Parallel Prog 43, 86–129 (2015). https://doi.org/10.1007/s10766-013-0277-2

