Fast Wavelet Transform Utilizing a Multicore-Aware Framework

  • Markus Stürmer
  • Harald Köstler
  • Ulrich Rüde
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7134)


The move to multicore processors places new demands on software development if applications are to profit from the improved capabilities. Most importantly, algorithms and code must be parallelized wherever possible, and the growing memory wall must also be taken into account. Additionally, high computational performance can only be reached if architecture-specific features are exploited. To address this complexity, we developed a C++ framework that simplifies the development of performance-optimized, parallel, memory-efficient, stencil-based codes on standard multicore processors and on the heterogeneous Cell processor developed jointly by Sony, Toshiba, and IBM. We illustrate the implementation and optimization of the Fast Wavelet Transform and its inverse for Haar wavelets within our hybrid framework, with OpenMP, and with the Open Computing Language (OpenCL), and analyze performance results for different platforms.
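For readers unfamiliar with the transform named in the abstract, the following is a minimal, serial sketch of a one-dimensional orthonormal Haar Fast Wavelet Transform and its inverse in plain C++. It is not the framework described in the paper and does not reflect its API. The function names `haar_forward` and `haar_inverse`, the power-of-two length restriction, and the coefficient layout (coarsest approximation first, followed by detail coefficients from coarse to fine) are assumptions made for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative sketch only; names and layout are assumptions, not the paper's API.
// Forward Haar FWT: repeatedly split the leading `len` entries into
// scaled pairwise sums (approximation) and differences (detail).
std::vector<double> haar_forward(std::vector<double> data) {
    const std::size_t n = data.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // assumed: length is a power of two
    const double s = 1.0 / std::sqrt(2.0); // orthonormal scaling factor
    std::vector<double> tmp(n);
    for (std::size_t len = n; len > 1; len /= 2) {
        const std::size_t half = len / 2;
        // Iterations of this loop are independent of each other.
        for (std::size_t i = 0; i < half; ++i) {
            tmp[i]        = (data[2 * i] + data[2 * i + 1]) * s; // approximation
            tmp[half + i] = (data[2 * i] - data[2 * i + 1]) * s; // detail
        }
        for (std::size_t i = 0; i < len; ++i) {
            data[i] = tmp[i];
        }
    }
    return data;
}

// Inverse Haar FWT: undo the levels in reverse order, from coarse to fine.
std::vector<double> haar_inverse(std::vector<double> coeffs) {
    const std::size_t n = coeffs.size();
    assert(n != 0 && (n & (n - 1)) == 0);  // assumed: length is a power of two
    const double s = 1.0 / std::sqrt(2.0);
    std::vector<double> tmp(n);
    for (std::size_t len = 2; len <= n; len *= 2) {
        const std::size_t half = len / 2;
        for (std::size_t i = 0; i < half; ++i) {
            const double a = coeffs[i];        // approximation coefficient
            const double d = coeffs[half + i]; // detail coefficient
            tmp[2 * i]     = (a + d) * s;
            tmp[2 * i + 1] = (a - d) * s;
        }
        for (std::size_t i = 0; i < len; ++i) {
            coeffs[i] = tmp[i];
        }
    }
    return coeffs;
}
```

Within each level the inner loop touches disjoint pairs, so it would be a natural candidate for loop-level parallelization (for example with an OpenMP `parallel for`). The levels themselves depend on one another and run in sequence, and each level streams over the data once, which is why memory traffic matters for this kind of kernel.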


cache blocking, parallelization, CBEA, OpenCL, OpenMP





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Markus Stürmer (1)
  • Harald Köstler (1)
  • Ulrich Rüde (1)
  1. System Simulation Group, University of Erlangen-Nuremberg, Germany
