International Journal of Parallel Programming, Volume 44, Issue 2, pp. 278–307

The Design and Implementation of TIDeFlow: A Dataflow-Inspired Execution Model for Parallel Loops and Task Pipelining

  • Daniel Orozco
  • Elkin Garcia
  • Robert Pavel
  • Jaime Arteaga
  • Guang Gao


Abstract

This paper provides an extended description of the design and implementation of the Time Iterated Dependency Flow (TIDeFlow) execution model. TIDeFlow is a dataflow-inspired model that simplifies the scheduling of shared resources on many-core processors. To accomplish this, programs are specified as directed graphs, and the dataflow model is extended with intrinsic constructs for parallel loops and the arbitrary pipelining of operations. The main contributions of this paper are: (1) a formal description of the TIDeFlow execution model and its programming model, (2) a description of the TIDeFlow implementation and its strengths over previous execution models, such as the ability to natively express parallel loops and task pipelining, (3) an analysis of experimental results showing the advantages of TIDeFlow for expressing parallel programs on many-core architectures, and (4) a presentation of the implementation of a low-overhead runtime system for TIDeFlow.
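The abstract describes programs as directed graphs whose nodes carry intrinsic parallel-loop constructs and fire when their dependencies are satisfied. As a rough illustration only (the names `Actor`, `width`, and `run` are hypothetical and not part of the actual TIDeFlow API), a single graph iteration can be sketched as a simple dependency-order scheduler:

```python
from collections import deque

class Actor:
    """A node in a TIDeFlow-style program graph: a parallel loop of
    `width` independent iterations that fires once every predecessor
    actor has completed (all input dependencies satisfied)."""
    def __init__(self, name, width, preds):
        self.name = name            # actor identifier
        self.width = width          # parallel-loop iteration count
        self.preds = set(preds)     # names of actors this one depends on

def run(actors):
    """Fire actors in dependency order for one graph iteration and
    return the (name, width) firing sequence. A real runtime would
    launch the `width` loop iterations concurrently and re-fire actors
    across time iterations, forming a pipeline."""
    order, fired = [], set()
    ready = deque(a for a in actors if not a.preds)
    while ready:
        a = ready.popleft()
        order.append((a.name, a.width))
        fired.add(a.name)
        for b in actors:
            # Enable any actor whose predecessors have all fired.
            if b.name not in fired and b.preds <= fired and b not in ready:
                ready.append(b)
    return order

# Usage: a three-stage producer/transformer/consumer graph.
graph = [Actor("produce", 8, []),
         Actor("transform", 8, ["produce"]),
         Actor("consume", 4, ["transform"])]
run(graph)   # fires produce, then transform, then consume
```

This sketch handles only a single pass over the graph; the pipelining and time-iterated firing that distinguish TIDeFlow would require per-edge token counts rather than a one-shot `fired` set.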


Keywords: Dataflow · Task pipelining · Parallel execution models · TIDeFlow · Runtime system · Graph languages · Codelets · Iterated dataflow · Dependency graph



This research was made possible by the generous support of the NSF through Grants CCF-0833122, CCF-0925863, CCF-0937907, CNS-0720531, and OCI-0904534.



Copyright information

© Springer Science+Business Media New York 2015

Authors and Affiliations

  • Daniel Orozco¹
  • Elkin Garcia¹
  • Robert Pavel¹
  • Jaime Arteaga¹
  • Guang Gao¹

  1. University of Delaware, Newark, USA
