The Journal of Supercomputing

Volume 74, Issue 4, pp 1580–1608

A tuning approach for iterative multiple 3d stencil pipeline on GPUs: Anisotropic Nonlinear Diffusion algorithm as case study

  • S. Tabik
  • M. Peemen
  • L. F. Romero


This paper focuses on challenging applications that can be expressed as an iterative pipeline of multiple 3d stencil stages and explores their optimization space on GPUs. For this study, we selected a representative example from the field of digital signal processing, the Anisotropic Nonlinear Diffusion algorithm. An open issue for these applications is determining the optimal fission/fusion level of the stages involved and whether that combination benefits from data tiling. This requires exploring the large space of all possible fission/fusion combinations, with and without tiling, which makes the process non-trivial. This study provides insights that reduce the tuning space and the programming effort for iterative multiple 3d stencils. Our results demonstrate that all combinations that fuse the bottleneck stencil with a high halo update cost (\(>25\%\); this percentage can be measured or estimated experimentally for each single stencil) and a high number of register and shared memory accesses can be excluded from the exploration process. The optimal fission/fusion combination is up to 1.65\(\times \) faster than the fully decomposed version without tiling and 5.3\(\times \) faster than the fully fused version on NVIDIA GPUs.
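The fission/fusion trade-off described above can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a 1d array in plain Python instead of 3d CUDA kernels, and the two stages (`stage_a`, `stage_b`) are hypothetical 3-point stencils chosen only for illustration. The fissioned version runs each stage as its own pass over a full intermediate array; the fused, tiled version recomputes the first stage on a one-cell halo around every tile. The count of first-stage evaluations is only a rough proxy for halo update cost, not the measured percentage the abstract refers to.

```python
def stage_a(u, i):
    # Hypothetical first stage: central difference (3-point stencil).
    return u[i + 1] - u[i - 1]

def stage_b(g_left, g_mid, g_right):
    # Hypothetical second stage: 3-point average of stage A results.
    return (g_left + g_mid + g_right) / 3.0

def fissioned(u):
    # Fission: one full pass per stage, with a full intermediate array g.
    n = len(u)
    g = [0.0] * n
    for i in range(1, n - 1):
        g[i] = stage_a(u, i)
    out = [0.0] * n
    for i in range(2, n - 2):
        out[i] = stage_b(g[i - 1], g[i], g[i + 1])
    return out, n - 2  # result and number of stage A evaluations

def fused_tiled(u, tile):
    # Fusion with tiling: each tile computes stage A into a small local
    # buffer that includes a one-cell halo on each side, then applies
    # stage B. Halo cells are recomputed by neighbouring tiles.
    n = len(u)
    out = [0.0] * n
    evals = 0
    for start in range(2, n - 2, tile):
        end = min(start + tile, n - 2)
        local = [stage_a(u, i) for i in range(start - 1, end + 1)]
        evals += len(local)
        for i in range(start, end):
            k = i - (start - 1)  # position of cell i inside the local buffer
            out[i] = stage_b(local[k - 1], local[k], local[k + 1])
    return out, evals

if __name__ == "__main__":
    u = [float((i * i) % 7) for i in range(20)]
    out_f, evals_f = fissioned(u)
    out_t, evals_t = fused_tiled(u, tile=4)
    print(out_f == out_t, evals_f, evals_t)
```

Both versions produce the same output; the fused version avoids storing the full intermediate array but performs extra first-stage evaluations in the halos, and that overhead grows as tiles shrink or stencil radii increase.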


3d stencils · Fission · Fusion · Tiling · GPUs · Anisotropic Nonlinear Diffusion · 3d images



This work was partially supported by Junta de Andalusia under Projects TIC-8260 and P11-TIC-7176. Siham Tabik was supported by the Ramón y Cajal Programme (RYC-2015-18136).



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2017

Authors and Affiliations

  1. Department of Computer Science and Artificial Intelligence, University of Granada, Granada, Spain
  2. Department of Electrical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands
  3. Department of Computer Architecture, University of Malaga, Málaga, Spain
