The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining

Abstract

This paper provides an extended description of the design and implementation of the Time Iterated Dependency Flow (TIDeFlow) execution model. TIDeFlow is a dataflow-inspired model that simplifies the scheduling of shared resources on many-core processors. To accomplish this, programs are specified as directed graphs, and the dataflow model is extended with intrinsic constructs for parallel loops and for the arbitrary pipelining of operations. The main contributions of this paper are: (1) a formal description of the TIDeFlow execution model and its programming model; (2) a description of the TIDeFlow implementation and its strengths over previous execution models, such as the ability to natively express parallel loops and task pipelining; (3) an analysis of experimental results showing the advantages of TIDeFlow for expressing parallel programs on many-core architectures; and (4) a presentation of the implementation of a low-overhead runtime system for TIDeFlow.
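The execution style the abstract describes — a program as a directed graph whose actors are parallel loops, fired when their input dependencies are satisfied, with tokens on arcs enabling pipelining between stages — can be illustrated with a minimal sketch. The following is a hypothetical toy scheduler in Python, not the TIDeFlow runtime or its API; the names `Actor`, `Arc`, `connect`, and `run`, and the single-token firing rule, are illustrative assumptions only.

```python
# Hypothetical sketch (NOT the paper's actual API) of a TIDeFlow-style graph:
# each actor is a parallel loop; an actor fires when every input arc holds a
# token, consuming one token per arc, running all of its iterations in
# parallel, then producing one token on each output arc.
from concurrent.futures import ThreadPoolExecutor

class Arc:
    def __init__(self, initial_tokens=0):
        self.tokens = initial_tokens

class Actor:
    def __init__(self, name, num_iterations, body):
        self.name = name
        self.num_iterations = num_iterations  # parallel-loop trip count
        self.body = body                      # body(i) executes iteration i
        self.inputs, self.outputs = [], []

def connect(src, dst, initial_tokens=0):
    arc = Arc(initial_tokens)
    src.outputs.append(arc)
    dst.inputs.append(arc)
    return arc

def run(actors, workers=4):
    # Naive scheduler: repeatedly fire any enabled actor until no actor can
    # fire (quiescence). A real runtime would do this with low overhead.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        fired = True
        while fired:
            fired = False
            for a in actors:
                if a.inputs and all(arc.tokens > 0 for arc in a.inputs):
                    for arc in a.inputs:
                        arc.tokens -= 1
                    list(pool.map(a.body, range(a.num_iterations)))
                    for arc in a.outputs:
                        arc.tokens += 1
                    fired = True

# Two-stage pipeline: 'produce' fills a buffer with an 8-iteration parallel
# loop; 'consume' reduces it once the whole loop has completed.
data = [0] * 8
result = []
produce = Actor("produce", 8, lambda i: data.__setitem__(i, i * i))
consume = Actor("consume", 1, lambda _i: result.append(sum(data)))
produce.inputs.append(Arc(initial_tokens=1))  # start token enables 'produce'
connect(produce, consume)
run([produce, consume])
# result[0] is now 140 (the sum of squares 0..7)
```

The point of the sketch is the firing rule: the parallel loop is a first-class graph node rather than something unrolled into individual dataflow operations, which is the extension over classical dataflow that the paper emphasizes.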

Acknowledgments

This research was made possible by the generous support of the NSF through Grants CCF-0833122, CCF-0925863, CCF-0937907, CNS-0720531, and OCI-0904534.

Corresponding author

Correspondence to Daniel Orozco.

Additional information

This research was, in part, funded by the U.S. Government. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

About this article

Cite this article

Orozco, D., Garcia, E., Pavel, R. et al. The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining. Int J Parallel Prog 44, 278–307 (2016). https://doi.org/10.1007/s10766-015-0373-6

Keywords

  • Dataflow
  • Task pipelining
  • Parallel execution models
  • TIDeFlow
  • Runtime system
  • Graph languages
  • Codelets
  • Iterated dataflow
  • Dependency graph