Efficient Mapping of Streaming Applications for Image Processing on Graphics Cards

  • Richard Membarth
  • Hritam Dutta
  • Frank Hannig
  • Jürgen Teich
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11225)


In the last decade, research and development of massively parallel commodity graphics hardware has grown dramatically in both academia and industry. Graphics card architectures provide an optimal platform for the parallel execution of many number-crunching loop programs from fields such as image processing and linear algebra. However, mapping such algorithms efficiently to graphics hardware is hard, even with detailed insight into the architecture. This paper presents a multiresolution image processing algorithm and shows how this type of algorithm can be mapped efficiently to graphics hardware, together with double buffering concepts that hide memory transfers. Furthermore, the impact of the execution configuration is illustrated, and a method is proposed to determine the best configuration offline. Using CUDA as the programming model, it is demonstrated that the image processing algorithm is significantly accelerated and that a speedup of more than \(145\times\) can be achieved on NVIDIA’s Tesla C1060 compared to a parallelized implementation on a Xeon Quad Core. For deployment in a streaming application with continuously arriving data, it is shown that double buffering reduces the memory transfer overhead to the graphics card by a factor of six.
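The double buffering idea mentioned above can be illustrated with a minimal CUDA sketch. It is not the implementation from the paper: the kernel `scale_kernel`, the helper `count_mismatches`, the function `run_double_buffered`, and the frame contents are hypothetical stand-ins chosen for illustration. Two sets of pinned host buffers and device buffers are used alternately, each with its own stream, so that the transfers of one frame can overlap with the kernel of the other. Whether copies and kernels actually overlap depends on the device and driver.

```cuda
#include <cstddef>
#include <cuda_runtime.h>

// Hypothetical per-pixel kernel standing in for the real image filter.
__global__ void scale_kernel(const float* in, float* out, int n, float gain) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = gain * in[i];
}

// Counts the pixels of one output frame that differ from the expected value.
static int count_mismatches(const float* out, int n, float expected) {
    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (out[i] != expected) ++bad;
    return bad;
}

// Streams `frames` synthetic frames of `n` pixels through two buffer sets.
// Returns the number of wrong output pixels, or -1 on a CUDA error.
int run_double_buffered(int n, int frames) {
    const size_t bytes = static_cast<size_t>(n) * sizeof(float);
    float *h_in[2] = {}, *h_out[2] = {}, *d_in[2] = {}, *d_out[2] = {};
    cudaStream_t stream[2];
    bool ok = true;
    for (int b = 0; b < 2; ++b) {
        // Pinned (page-locked) host memory is needed for truly asynchronous copies.
        ok &= cudaMallocHost((void**)&h_in[b], bytes) == cudaSuccess;
        ok &= cudaMallocHost((void**)&h_out[b], bytes) == cudaSuccess;
        ok &= cudaMalloc((void**)&d_in[b], bytes) == cudaSuccess;
        ok &= cudaMalloc((void**)&d_out[b], bytes) == cudaSuccess;
        ok &= cudaStreamCreate(&stream[b]) == cudaSuccess;
    }
    if (!ok) return -1;  // cleanup on this error path is omitted for brevity

    int bad = 0;
    for (int f = 0; f < frames; ++f) {
        const int b = f % 2;  // alternate between the two buffer sets
        // Wait for frame f-2, which used this buffer set, and check its result.
        cudaStreamSynchronize(stream[b]);
        if (f >= 2) bad += count_mismatches(h_out[b], n, 2.0f * (f - 2));
        // "Acquire" frame f on the host while the other stream may still be busy.
        for (int i = 0; i < n; ++i) h_in[b][i] = static_cast<float>(f);
        cudaMemcpyAsync(d_in[b], h_in[b], bytes, cudaMemcpyHostToDevice, stream[b]);
        scale_kernel<<<(n + 255) / 256, 256, 0, stream[b]>>>(d_in[b], d_out[b], n, 2.0f);
        cudaMemcpyAsync(h_out[b], d_out[b], bytes, cudaMemcpyDeviceToHost, stream[b]);
    }
    // Drain the (at most two) frames that are still in flight.
    for (int f = (frames > 2 ? frames - 2 : 0); f < frames; ++f) {
        const int b = f % 2;
        cudaStreamSynchronize(stream[b]);
        bad += count_mismatches(h_out[b], n, 2.0f * f);
    }
    if (cudaDeviceSynchronize() != cudaSuccess || cudaGetLastError() != cudaSuccess)
        bad = -1;

    for (int b = 0; b < 2; ++b) {
        cudaStreamDestroy(stream[b]);
        cudaFreeHost(h_in[b]);
        cudaFreeHost(h_out[b]);
        cudaFree(d_in[b]);
        cudaFree(d_out[b]);
    }
    return bad;
}
```

On a machine with a CUDA-capable device, a call such as `run_double_buffered(1 << 20, 8)` is expected to return 0. The block size of 256 threads is an arbitrary choice here; as the abstract notes, the execution configuration affects performance and would have to be tuned for the actual kernel and device.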


Keywords: CUDA, OpenCL, Image processing, Mapping methodology, Streaming application



We are indebted to our colleagues Philipp Kutzer and Michael Glaß for providing the sample pictures.



Copyright information

© Springer-Verlag GmbH Germany, part of Springer Nature 2019

Authors and Affiliations

  1. DFKI GmbH, Saarbrücken, Germany
  2. Saarland University, Saarbrücken, Germany
  3. Robert Bosch GmbH, Stuttgart, Germany
  4. Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany