High Performance Parallel Summed-Area Table Kernels for Multi-core and Many-core Systems

  • Angelos Papatriantafyllou
  • Dimitris Sacharidis
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9833)


The summed-area table (SAT), also known as the integral image, is a data structure extensively used in computer graphics and vision for fast image filtering. The parallelization of its construction has been thoroughly investigated, and many algorithms have been proposed for GPUs. In general, state-of-the-art methods cannot solve this problem efficiently on multi-core and many-core (Xeon Phi) systems because of cache misses and strided and/or remote memory accesses. This work proposes three novel cache-aware parallel SAT algorithms, which generalize parallel block-based prefix-sums algorithms. In addition, we discuss 2D matrix partitioning policies, which play an important role in the efficient operation of the cache subsystem. The combination of a SAT algorithm and a partition is manually tuned according to the matrix layout and the number of threads. An experimental evaluation of our algorithms on two NUMA systems and Intel’s Xeon Phi, for three datatypes (int, float, double) and using all system cores, shows better performance in every experimental setting than the best known CPU and GPU approaches (up to 4.55× on NUMA and 2.8× on Xeon Phi).
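For readers unfamiliar with the data structure, the following is a minimal sequential sketch in Python of what a SAT is and why it enables fast filtering. It is not one of the three algorithms proposed in this work. The function names `build_sat`, `box_sum` and `blocked_prefix_sum` are hypothetical and chosen for illustration only. The last function sketches the textbook three-phase block-based prefix-sums scheme, assuming one block per thread; here the per-block phases run in a plain loop rather than in parallel.

```python
from itertools import accumulate


def build_sat(image):
    """Return the inclusive summed-area table of a 2D list of numbers."""
    # Pass 1: prefix sums along every row (copies the rows, input is untouched).
    rows = [list(accumulate(row)) for row in image]
    # Pass 2: prefix sums down every column, adding the row above.
    for y in range(1, len(rows)):
        above, current = rows[y - 1], rows[y]
        for x in range(len(current)):
            current[x] += above[x]
    return rows


def box_sum(sat, top, left, bottom, right):
    """Sum of image[top..bottom][left..right] (inclusive), at most four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


def blocked_prefix_sum(values, num_blocks):
    """Inclusive prefix sums via a three-phase block-based scheme.

    Assumption: num_blocks >= 1. Phases 1 and 3 are independent per block,
    so each block could be handled by its own thread; this sketch is serial.
    """
    n = len(values)
    size = max(1, -(-n // num_blocks))  # ceiling division, at least 1
    blocks = [values[i:i + size] for i in range(0, n, size)]
    # Phase 1: reduce each block to a single sum.
    block_sums = [sum(block) for block in blocks]
    # Phase 2: exclusive scan over the (few) block sums gives per-block offsets.
    offsets = [0] + list(accumulate(block_sums))[:-1]
    # Phase 3: scan each block locally and shift it by its offset.
    out = []
    for block, offset in zip(blocks, offsets):
        out.extend(offset + v for v in accumulate(block))
    return out
```

Once the table is built, `box_sum` returns the sum over any axis-aligned rectangle in constant time, which is what makes box filtering on the SAT independent of the filter size. A row-then-column construction like `build_sat` accesses memory with a stride in its second pass when the matrix is stored row by row, which is the kind of access pattern the abstract identifies as a performance problem on multi-core systems.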


NUMA System, Single Instruction Multiple Data, Matrix Partitioning, Block Column, Shared Buffer
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



We would like to thank the members of the TU Wien Research Group Parallel Computing and the anonymous reviewers for their valuable comments.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  • Angelos Papatriantafyllou (1)
  • Dimitris Sacharidis (2)
  1. Research Group Parallel Computing, Faculty of Informatics, Institute of Information Systems, TU Wien, Vienna, Austria
  2. E-Commerce Group, Faculty of Informatics, Institute of Software Technology and Interactive Systems, TU Wien, Vienna, Austria
