Abstract
Spatiotemporal feature extraction algorithms are widely used in image processing and computer vision applications because the features they generate are robust. However, they are computationally expensive, so parallelizing them to speed up their execution is of great importance. In this paper, we propose new GPU-based parallel implementations of the two most widely used spatiotemporal feature extraction algorithms: the scale-invariant feature transform (SIFT) and speeded-up robust features (SURF). Our implementations address problems found in previous parallel implementations, such as load imbalance, thread synchronization, and the use of atomic operations. They speed up execution by processing all the work of each stage of the two algorithms simultaneously, without dividing that stage into smaller sequential ones. The way threads are allocated in our implementations also increases the occupancy of the GPU streaming multiprocessors (SMs). We compare the proposed implementations with previous CPU and GPU parallel implementations of the two algorithms. The results show that the proposed implementations can perform all the processing in real time with high accuracy. They also achieve higher speedup, frame rate, and SM occupancy than the best previously known parallel implementations of the two algorithms.
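To make the notion of SM occupancy concrete, the following is a minimal sketch of how theoretical occupancy is commonly estimated: the number of warps resident on an SM divided by the maximum number of warps the SM supports. It is not taken from the paper. The function name, the `SMLimits` type, and the per-SM limits are illustrative assumptions, and the calculation ignores register and shared-memory allocation granularity.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp on NVIDIA GPUs


@dataclass(frozen=True)
class SMLimits:
    """Per-SM resource limits (hypothetical defaults, not from the paper)."""
    max_warps: int = 64
    max_blocks: int = 32
    registers: int = 65536
    shared_mem_bytes: int = 65536


def theoretical_occupancy(threads_per_block, regs_per_thread,
                          smem_per_block, sm=SMLimits()):
    """Estimate occupancy as resident warps / maximum warps per SM."""
    if threads_per_block <= 0:
        raise ValueError("threads_per_block must be positive")

    # Warps needed by one block (ceiling division).
    warps_per_block = -(-threads_per_block // WARP_SIZE)

    # Each resource caps how many blocks can be resident at once.
    by_warps = sm.max_warps // warps_per_block
    regs_per_block = regs_per_thread * warps_per_block * WARP_SIZE
    by_regs = sm.registers // regs_per_block if regs_per_block else sm.max_blocks
    by_smem = sm.shared_mem_bytes // smem_per_block if smem_per_block else sm.max_blocks

    # The tightest cap decides the number of resident blocks.
    blocks = min(sm.max_blocks, by_warps, by_regs, by_smem)
    return blocks * warps_per_block / sm.max_warps


if __name__ == "__main__":
    # Same thread count per block, different register pressure.
    print(theoretical_occupancy(256, 32, 0))
    print(theoretical_occupancy(256, 64, 0))
    # Very small blocks hit the resident-block limit first.
    print(theoretical_occupancy(32, 32, 0))
```

Under these assumed limits, doubling the registers used per thread halves the estimated occupancy, and one-warp blocks are capped by the resident-block limit rather than by registers. This is the kind of trade-off that a thread allocation scheme has to balance.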
Cite this article
Mehrez, A., Morgan, A.A. & Hemayed, E.E. Speeding up spatiotemporal feature extraction using GPU. J Real-Time Image Proc 16, 2379–2407 (2019). https://doi.org/10.1007/s11554-018-0755-2
DOI: https://doi.org/10.1007/s11554-018-0755-2