
The Journal of Supercomputing, Volume 70, Issue 2, pp 830–844

Improving an autotuning engine for 3D Fast Wavelet Transform on manycore systems

  • Gregorio Bernabé
  • Javier Cuenca
  • Luis Pedro García
  • Domingo Giménez

Abstract

This paper presents an enhanced auto-optimization method to run the 3D Fast Wavelet Transform (3D-FWT) on the different computing units of a system (GPU, MIC, CPU). The proposed method automatically selects a set of parameter values (block size, number of streams, and number of threads) to reduce the total execution time, achieving performance close to the optimum while decreasing the number of evaluations required.
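As a rough illustration of the parameter space such a method explores, the following is a minimal sketch in CUDA C++ of an exhaustive timing loop over candidate block sizes and stream counts. The kernel name `fwt3d_kernel`, the candidate values, and the loop structure are illustrative assumptions, not the paper's engine, which also tunes the number of host threads and prunes the search to far fewer evaluations.

```cuda
#include <cstdio>
#include <initializer_list>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical stand-in for one 3D-FWT step; the real kernel operates on
// video volumes and is not reproduced here.
__global__ void fwt3d_kernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 0.5f;   // placeholder work
}

int main() {
    const int n = 1 << 22;                 // illustrative problem size
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    float best_ms = 1e30f;
    int best_block = 0, best_streams = 0;

    // Exhaustive search over candidate configurations; an autotuning engine
    // would evaluate only a subset (and would also vary CPU thread counts).
    for (int block : {64, 128, 256, 512}) {
        for (int nstreams : {1, 2, 4, 8}) {
            std::vector<cudaStream_t> streams(nstreams);
            for (auto &st : streams) cudaStreamCreate(&st);

            cudaEvent_t start, stop;
            cudaEventCreate(&start);
            cudaEventCreate(&stop);
            cudaEventRecord(start);

            // Split the data into chunks, one chunk per stream.
            int chunk = (n + nstreams - 1) / nstreams;
            for (int s = 0; s < nstreams; ++s) {
                int offset = s * chunk;
                int len = (offset + chunk <= n) ? chunk : n - offset;
                int grid = (len + block - 1) / block;
                fwt3d_kernel<<<grid, block, 0, streams[s]>>>(d_data + offset, len);
            }

            cudaEventRecord(stop);
            cudaEventSynchronize(stop);
            float ms;
            cudaEventElapsedTime(&ms, start, stop);

            if (ms < best_ms) {
                best_ms = ms;
                best_block = block;
                best_streams = nstreams;
            }

            for (auto &st : streams) cudaStreamDestroy(st);
            cudaEventDestroy(start);
            cudaEventDestroy(stop);
        }
    }

    printf("best config: block=%d streams=%d (%.3f ms)\n",
           best_block, best_streams, best_ms);
    cudaFree(d_data);
    return 0;
}
```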

Keywords

Autotuning · 3D-FWT · GPUs · CUDA streams · MIC

Notes

Acknowledgments

This work was supported by the Spanish MINECO, as well as by European Commission FEDER funds, under grant TIN2012-38341-C04-03. We are grateful to the reviewers for their valuable comments.


Copyright information

© Springer Science+Business Media New York 2014

Authors and Affiliations

  • Gregorio Bernabé (1)
  • Javier Cuenca (1)
  • Luis Pedro García (2)
  • Domingo Giménez (1)

  1. University of Murcia, Murcia, Spain
  2. Technical University of Cartagena, Cartagena, Spain
