High Performance Parallel Summed-Area Table Kernels for Multi-core and Many-core Systems
The summed-area table (SAT), also known as the integral image, is a data structure used extensively in computer graphics and vision for fast image filtering. The parallelization of its construction has been thoroughly investigated, and many algorithms have been proposed for GPUs. Generally speaking, state-of-the-art methods cannot solve this problem efficiently on multi-core and many-core (Xeon Phi) systems because of cache misses and strided and/or remote memory accesses. This work proposes three novel cache-aware parallel SAT algorithms, which generalize parallel block-based prefix-sums algorithms. In addition, we discuss 2D matrix partitioning policies, which play an important role in the efficient operation of the cache subsystem. The combination of a SAT algorithm and a partition is manually tuned according to the matrix layout and the number of threads. Experimental evaluation of our algorithms on two NUMA systems and Intel’s Xeon Phi, for three datatypes (int, float, double) and utilizing all system cores, shows better performance in all experimental settings than the best known CPU and GPU approaches (up to 4.55× on NUMA and 2.8× on Xeon Phi).
Keywords: NUMA System, Single Instruction Multiple Data, Matrix Partitioning, Block Column, Shared Buffer
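For readers unfamiliar with the data structure, the following minimal Python sketch illustrates what a summed-area table is and what a block-based prefix sum looks like. It is not one of the three algorithms proposed in this work: the function names (`summed_area_table`, `box_sum`, `blocked_prefix_sum`) are hypothetical, the code is sequential, and it only marks where independent per-block work could, in principle, be distributed over threads.

```python
from itertools import accumulate


def summed_area_table(img):
    """Sequential reference SAT: prefix sums along rows, then along columns."""
    sat = [list(accumulate(row)) for row in img]
    for y in range(1, len(sat)):
        for x in range(len(sat[y])):
            sat[y][x] += sat[y - 1][x]
    return sat


def box_sum(sat, top, left, bottom, right):
    """Sum of img[top..bottom][left..right] (inclusive) using at most four lookups."""
    total = sat[bottom][right]
    if top > 0:
        total -= sat[top - 1][right]
    if left > 0:
        total -= sat[bottom][left - 1]
    if top > 0 and left > 0:
        total += sat[top - 1][left - 1]
    return total


def blocked_prefix_sum(values, num_blocks):
    """Three-phase block-based inclusive prefix sum (executed serially here).

    Phases 1 and 3 touch each block independently, which is what makes the
    scheme attractive for parallel execution; phase 2 is a short scan over
    one total per block.
    """
    n = len(values)
    if n == 0:
        return []
    size = max(1, -(-n // num_blocks))  # ceiling division
    blocks = [values[i:i + size] for i in range(0, n, size)]
    # Phase 1: local inclusive prefix sum inside every block.
    local = [list(accumulate(b)) for b in blocks]
    # Phase 2: exclusive scan over the block totals gives each block's offset.
    totals = list(accumulate(b[-1] for b in local))
    offsets = [0] + totals[:-1]
    # Phase 3: add the offset of each block to its local results.
    return [v + off for b, off in zip(local, offsets) for v in b]


if __name__ == "__main__":
    image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    table = summed_area_table(image)
    print(table)
    print(box_sum(table, 1, 1, 2, 2))
    print(blocked_prefix_sum([1, 2, 3, 4, 5, 6, 7], 3))
```

Once the table is built, the sum over any axis-aligned rectangle costs a constant number of lookups regardless of the rectangle size, which is why the structure is useful for image filtering. The blocked scan shows only the one-dimensional building block; how blocks are mapped to a 2D matrix and to threads is the subject of the partitioning policies discussed in the paper and is not modeled here.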
We would like to thank the members of the TU Wien Research Group Parallel Computing and the anonymous reviewers for their valuable comments.