Abstract
We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, stencil-reduce, and, crucially, their usage in a loop, in data-parallel applications, streaming applications, or a combination of both. The pattern makes it possible to deploy a single stencil computation kernel on different GPUs. We discuss the implementation of Loop-of-stencil-reduce in FastFlow, a framework for implementing applications based on parallel patterns. Experiments are presented to illustrate the use of Loop-of-stencil-reduce in developing data-parallel kernels running on heterogeneous systems.
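To make the shape of the pattern concrete, the following is a minimal sequential sketch of its semantics on a one-dimensional array. It is not the FastFlow API: the names `loop_of_stencil_reduce`, `LoopResult`, and `smooth_until_converged`, as well as the three-point averaging stencil and the maximum-change reduction, are assumptions chosen purely for illustration. Each iteration applies a stencil to every element, reduces a per-element measure to a single value, and uses that value to decide whether to iterate again.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical result type: the final array and the number of iterations run.
struct LoopResult {
    std::vector<double> data;
    std::size_t iterations;
};

// Sequential sketch of the pattern (illustrative, not FastFlow code).
//   stencil(in, i)     -> new value of element i, read from a neighbourhood of in
//   measure(old, cur)  -> per-element value fed to the reduction
//   combine(x, y)      -> associative reduction operator with the given identity
//   keep_going(r, it)  -> loop condition on the reduced value and iteration count
template <typename Stencil, typename Measure, typename Combine, typename Cond>
LoopResult loop_of_stencil_reduce(std::vector<double> a, Stencil stencil,
                                  Measure measure, Combine combine,
                                  double identity, Cond keep_going) {
    assert(!a.empty());
    std::vector<double> next(a.size());
    std::size_t it = 0;
    double reduced = identity;
    do {
        // Stencil phase: every element is independent, hence data parallel.
        for (std::size_t i = 0; i < a.size(); ++i) next[i] = stencil(a, i);
        // Reduce phase: fold the per-element measures into one value.
        reduced = identity;
        for (std::size_t i = 0; i < a.size(); ++i)
            reduced = combine(reduced, measure(a[i], next[i]));
        std::swap(a, next);  // the output becomes the input of the next iteration
        ++it;
    } while (keep_going(reduced, it));
    return {std::move(a), it};
}

// Illustrative instance: three-point averaging with fixed boundary elements,
// iterated until the largest per-element change drops to eps or below.
LoopResult smooth_until_converged(std::vector<double> a, double eps,
                                  std::size_t max_iter) {
    auto stencil = [](const std::vector<double>& in, std::size_t i) {
        if (i == 0 || i + 1 == in.size()) return in[i];  // boundaries stay fixed
        return (in[i - 1] + in[i] + in[i + 1]) / 3.0;
    };
    auto measure = [](double old_v, double new_v) { return std::fabs(new_v - old_v); };
    auto combine = [](double x, double y) { return std::max(x, y); };
    auto keep_going = [eps, max_iter](double delta, std::size_t it) {
        return delta > eps && it < max_iter;
    };
    return loop_of_stencil_reduce(std::move(a), stencil, measure, combine, 0.0,
                                  keep_going);
}
```

Under these assumptions, calling `smooth_until_converged({1.0, 0.0, 1.0}, 1e-9, 1000)` drives the middle element towards 1.0 while the two boundary elements stay fixed. The simpler patterns fall out as special cases of the same skeleton: a stencil that reads only element `i` combined with a loop condition that is always false gives a map, and an identity stencil with a single iteration gives a plain reduce.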
Notes
We omit the dimension n in \(\sigma ^n_k\) here, as we assume the dimension n is the same as that of the array a: a single dimensional array will have \(n=1\), a 2D matrix \(n=2\), and so on.
The current implementation does not allow mixing of CPU and GPUs (or other accelerators) for deploying a single Loop-of-stencil-reduce instance.
An n-GPU pattern is a pattern deployed onto n GPU devices.
We implicitly define a FastFlow task as the computation to be performed over a single stream item by a FastFlow pattern.
Acknowledgments
This work was supported by EU FP7 project REPARA (No. 609666), the EU H2020 Project RePhrase (No. 644235), and by the NVidia GPU Research Center at the University of Torino.
Cite this article
Aldinucci, M., Danelutto, M., Drocco, M. et al. A parallel pattern for iterative stencil + reduce. J Supercomput 74, 5690–5705 (2018). https://doi.org/10.1007/s11227-016-1871-z
DOI: https://doi.org/10.1007/s11227-016-1871-z