Abstract
The move to multicore processors places new demands on software development if programs are to profit from the improved capabilities. Most importantly, algorithms and code must be parallelized wherever possible, and the growing memory wall must also be taken into account. Additionally, high computational performance can only be reached if architecture-specific features are exploited. To address this complexity, we developed a C++ framework that simplifies the development of performance-optimized, parallel, memory-efficient, stencil-based codes on standard multicore processors and on the heterogeneous Cell processor developed jointly by Sony, Toshiba, and IBM. We illustrate the implementation and optimization of the Fast Wavelet Transform and its inverse for Haar wavelets using our hybrid framework, OpenMP, and the Open Computing Language (OpenCL), and we analyze performance results for different platforms.
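For orientation, one level of a 1D Haar transform and its inverse can be sketched in plain C++ as below. This is only a minimal illustration under stated assumptions, not the framework's interface: the function names `haar_forward` and `haar_inverse`, the orthonormal 1/sqrt(2) scaling, and the output layout (approximation coefficients first, then detail coefficients) are all choices made for this example. The OpenMP pragma marks the loop as parallelizable and is ignored when OpenMP is not enabled at compile time.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the paper's framework): one level of the
// orthonormal 1D Haar transform. The input length must be even. The first
// half of the result holds approximation coefficients, the second half
// holds detail coefficients.
std::vector<double> haar_forward(const std::vector<double>& x) {
    assert(x.size() % 2 == 0);
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(x.size() / 2);
    const double s = 1.0 / std::sqrt(2.0);
    std::vector<double> y(x.size());
    // Each output pair depends on exactly one input pair, so the
    // iterations are independent and can run in parallel.
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < half; ++i) {
        y[i]        = s * (x[2 * i] + x[2 * i + 1]);  // approximation
        y[half + i] = s * (x[2 * i] - x[2 * i + 1]);  // detail
    }
    return y;
}

// Hypothetical helper: inverse of haar_forward, using the same layout.
std::vector<double> haar_inverse(const std::vector<double>& y) {
    assert(y.size() % 2 == 0);
    const std::ptrdiff_t half = static_cast<std::ptrdiff_t>(y.size() / 2);
    const double s = 1.0 / std::sqrt(2.0);
    std::vector<double> x(y.size());
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < half; ++i) {
        x[2 * i]     = s * (y[i] + y[half + i]);
        x[2 * i + 1] = s * (y[i] - y[half + i]);
    }
    return x;
}
```

Applying `haar_forward` repeatedly to the approximation half would give a multi-level transform. Every input element is read once and every output element written once, so such a loop tends to be limited by memory bandwidth rather than arithmetic, which is where the memory wall mentioned above comes into play.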
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Stürmer, M., Köstler, H., Rüde, U. (2012). Fast Wavelet Transform Utilizing a Multicore-Aware Framework. In: Jónasson, K. (eds) Applied Parallel and Scientific Computing. PARA 2010. Lecture Notes in Computer Science, vol 7134. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-28145-7_31
DOI: https://doi.org/10.1007/978-3-642-28145-7_31
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-28144-0
Online ISBN: 978-3-642-28145-7
eBook Packages: Computer Science (R0)