Journal of Real-Time Image Processing, Volume 12, Issue 2, pp 549–562

Fast motion estimation for HEVC on graphics processing unit (GPU)

  • Dongkyu Lee
  • Donggyu Sim
  • Keeseong Cho
  • Seoung-Jun Oh
Special Issue Paper

Abstract

The recent video compression standard HEVC (High Efficiency Video Coding) is likely to be adopted in a wide range of applications in the near future. However, its encoding process is far too slow for real-time use. At the same time, the computing capability of GPUs (graphics processing units) has grown considerably. In this paper, we propose a GPU-based parallel motion estimation (ME) algorithm to enhance the performance of an HEVC encoder. A frame is partitioned into two subframes for pipelined execution to improve GPU utilization, and the processing flow is redefined to resolve data hazards in the pipelined execution. Two new methods are introduced in the proposed ME: decision of a representative search center position (RSCP) and warp-based concurrent parallel reduction (WCPR). RSCP employs the motion vectors of the co-located CTU (coding tree unit) in a previously encoded frame to resolve a dependency problem in parallel computation with negligible coding loss. WCPR executes several parallel reduction operations concurrently, which increases thread utilization from 20 to 89% without any thread synchronization. With the proposed ME, the portion of the encoder occupied by ME becomes negligible, at the cost of a 2.2% bitrate increase relative to the HEVC test model (HM) encoder. The proposed ME itself is 130.7 times faster than the ME of the HM encoder.
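The abstract does not show the WCPR kernel itself, so the following is only a minimal sketch of the general idea it names: several independent parallel reductions advancing together, one per warp-sized group of lanes, with no synchronization across groups. It is a sequential Python simulation, not GPU code. The function name `concurrent_warp_argmin`, the choice of a minimum-cost reduction, and the 32-lane grouping are assumptions made for illustration and are not taken from the paper.

```python
WARP_SIZE = 32  # CUDA warp width; lanes within a warp execute in lockstep


def concurrent_warp_argmin(costs, warp_size=WARP_SIZE):
    """Reduce each warp-sized slice of `costs` to its (min_cost, index) pair.

    Hypothetical helper: simulates many small reductions running side by
    side, one per warp, rather than one large block-wide reduction.
    """
    if warp_size <= 0 or warp_size & (warp_size - 1):
        raise ValueError("warp_size must be a power of two")
    if len(costs) % warp_size != 0:
        raise ValueError("len(costs) must be a multiple of warp_size")

    # Pair each cost with its global index so the winner can be traced back.
    vals = [(c, i) for i, c in enumerate(costs)]

    offset = warp_size // 2
    while offset > 0:
        # One lockstep step: every warp halves its own active range.
        # Warps never read each other's data, so no cross-warp barrier
        # would be needed on real hardware.
        for base in range(0, len(vals), warp_size):
            for lane in range(offset):
                a = vals[base + lane]
                b = vals[base + lane + offset]
                vals[base + lane] = min(a, b)  # ties go to the lower index
        offset //= 2

    # Lane 0 of each warp now holds that warp's result.
    return [vals[base] for base in range(0, len(vals), warp_size)]
```

For example, passing 64 candidate costs yields two results, one per group of 32, each giving the lowest cost and the position of the candidate that produced it. How the paper maps search candidates to warps, and how it reaches the reported utilization figures, is not specified in the abstract and is not modeled here.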

Keywords

HEVC, GPGPU, CUDA, Motion estimation, Parallel reduction

Notes

Acknowledgments

This work was partly supported by Institute for Information and communications Technology Promotion (IITP) Grant funded by the Korea government (MSIP) (R0101-15-293, Development of Object-based Knowledge Convergence Service Platform using Image Recognition in Broadcasting Contents) and the Research Grant of Kwangwoon University in 2013.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.


Copyright information

© Springer-Verlag Berlin Heidelberg 2015

Authors and Affiliations

  • Dongkyu Lee (1)
  • Donggyu Sim (2)
  • Keeseong Cho (3)
  • Seoung-Jun Oh (1)

  1. Department of Electronic Engineering, Kwangwoon University, Seoul, Korea
  2. Department of Computer Engineering, Kwangwoon University, Seoul, Korea
  3. Broadcasting and Telecommunications Media Research Laboratory, Electronics and Telecommunications Research Institute, Daejeon, Korea
