Abstract
Dataflow computing is an attractive paradigm for high-performance computing, as it triggers each computation as soon as its inputs become available. UPC++ DepSpawn is a novel task-based library that supports this model in hybrid shared/distributed memory systems on top of a Partitioned Global Address Space (PGAS) environment. While the initial version of the library achieved good results, it suffered from a key restriction that severely limited its performance and scalability: every process had to consider all the tasks in the application rather than only those relevant to it, an overhead that naturally grows with both the number of processes and the number of tasks in the system. In this paper we lift this restriction, enabling the library to reach higher levels of performance. In experiments on 768 cores, the new version improves performance by up to 40.1%, with an average improvement of 16.1%.
Acknowledgements
This research was supported by the Ministry of Science and Innovation of Spain (TIN2016-75845-P and PID2019-104184RB-I00, AEI/FEDER/EU, 10.13039/501100011033), and by the Xunta de Galicia co-funded by the European Regional Development Fund (ERDF) under the Consolidation Programme of Competitive Reference Groups (ED431C 2017/04). We also acknowledge the support of the Centro Singular de Investigación de Galicia "CITIC", funded by the Xunta de Galicia and the European Union (European Regional Development Fund, Galicia 2014–2020 Program) through Grant ED431G 2019/01, as well as the Centro de Supercomputación de Galicia (CESGA) for the use of its computers.
Cite this article
Fraguela, B.B., Andrade, D. High-performance dataflow computing in hybrid memory systems with UPC++ DepSpawn. J Supercomput 77, 7676–7689 (2021). https://doi.org/10.1007/s11227-020-03607-1