The Journal of Supercomputing, Volume 74, Issue 11, pp 5690–5705

A parallel pattern for iterative stencil + reduce

  • M. Aldinucci
  • M. Danelutto
  • M. Drocco
  • P. Kilpatrick
  • C. Misale
  • G. Peretti Pezzi
  • M. Torquati


We advocate the Loop-of-stencil-reduce pattern as a means of simplifying the implementation of data-parallel programs on heterogeneous multi-core platforms. Loop-of-stencil-reduce is general enough to subsume map, reduce, map-reduce, stencil, and stencil-reduce and, crucially, their use inside a loop, in data-parallel applications, streaming applications, or a combination of both. The pattern makes it possible to deploy a single stencil computation kernel on different GPUs. We discuss the implementation of Loop-of-stencil-reduce in FastFlow, a framework for implementing applications based on parallel patterns. Experiments are presented to illustrate the use of Loop-of-stencil-reduce in developing data-parallel kernels running on heterogeneous systems.
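To make the shape of the pattern concrete, the following is a minimal sequential sketch in Python of one plausible reading of Loop-of-stencil-reduce: apply a stencil function to every cell, reduce the result to a single value, and repeat while a condition on that value holds. It is not the FastFlow API and not the authors' implementation. The names `stencil`, `loop_of_stencil_reduce`, `life_rule`, `keep_going`, and `max_iters` are hypothetical, and Conway's Game of Life with a live-cell count as the reduction is chosen here only as an illustration.

```python
from functools import reduce
from typing import Callable, List, Tuple

Grid = List[List[int]]


def stencil(grid: Grid, f: Callable[[Grid, int, int], int]) -> Grid:
    """Apply f at every cell. f reads only the old grid, so every cell
    is independent and could be computed in parallel."""
    return [[f(grid, i, j) for j in range(len(grid[0]))]
            for i in range(len(grid))]


def loop_of_stencil_reduce(
    grid: Grid,
    f: Callable[[Grid, int, int], int],
    combine: Callable[[int, int], int],
    identity: int,
    keep_going: Callable[[int], bool],
    max_iters: int,
) -> Tuple[Grid, int, int]:
    """Repeat (stencil, then reduce) while keep_going(reduced) holds,
    for at most max_iters iterations (a hypothetical safety cap)."""
    reduced = identity
    iters = 0
    while iters < max_iters:
        grid = stencil(grid, f)
        # Reduce all cells of the new grid to one value.
        reduced = reduce(combine, (c for row in grid for c in row), identity)
        iters += 1
        if not keep_going(reduced):
            break
    return grid, reduced, iters


def life_rule(grid: Grid, i: int, j: int) -> int:
    """Game of Life rule; cells outside the grid are treated as dead."""
    rows, cols = len(grid), len(grid[0])
    neighbours = sum(
        grid[i + di][j + dj]
        for di in (-1, 0, 1)
        for dj in (-1, 0, 1)
        if (di, dj) != (0, 0) and 0 <= i + di < rows and 0 <= j + dj < cols
    )
    alive = grid[i][j] == 1
    return 1 if neighbours == 3 or (alive and neighbours == 2) else 0


# A "blinker" oscillates with period 2.
blinker = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]

final, alive, iters = loop_of_stencil_reduce(
    blinker, life_rule,
    combine=lambda a, b: a + b, identity=0,   # reduce: count live cells
    keep_going=lambda n_alive: n_alive > 0,   # stop if the grid dies out
    max_iters=2,
)
```

In this reading the simpler patterns fall out as special cases: a stencil function that reads only its own cell gives a map, a single iteration gives stencil-reduce, and an identity stencil function gives a plain reduce.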


Keywords: Parallel patterns, OpenCL, GPUs, Heterogeneous multi-cores



This work was supported by the EU FP7 project REPARA (No. 609666), the EU H2020 project RePhrase (No. 644235), and the NVIDIA GPU Research Center at the University of Torino.



Copyright information

© Springer Science+Business Media New York 2016

Authors and Affiliations

  • M. Aldinucci (2)
  • M. Danelutto (1)
  • M. Drocco (2)
  • P. Kilpatrick (3)
  • C. Misale (2)
  • G. Peretti Pezzi (4)
  • M. Torquati (1)

  1. Department of Computer Science, University of Pisa, Pisa, Italy
  2. Department of Computer Science, University of Turin, Turin, Italy
  3. Department of Computer Science, Queen’s University Belfast, Belfast, UK
  4. Swiss National Supercomputing Centre, Lugano, Switzerland
