Can learned frame prediction compete with block motion compensation for video coding?

Abstract

Given recent advances in learned video prediction, we investigate whether a simple video codec can compete with standard video codecs based on block motion compensation. The simple codec uses a pretrained deep model to predict the next frame from previously encoded/decoded frames, without sending any motion side information. The difference between each frame and its learned prediction is encoded by a standard still-image (intra) codec. Experimental results show that the rate-distortion performance of the simple codec with symmetric complexity is on average better than that of the x264 codec on 10 MPEG test videos, but does not yet reach the level of the x265 codec. This result demonstrates the power of learned frame prediction (LFP), since, unlike motion compensation, LFP does not use information from the current picture. We also analyze the implications of training with \(\ell^1\), \(\ell^2\), or combined \(\ell^2\) and adversarial loss on prediction performance and compression efficiency.
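The closed-loop structure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `predict_next` is a hypothetical stand-in that simply repeats the last decoded frame in place of the pretrained deep model, and `intra_encode`/`intra_decode` are a toy uniform quantizer in place of a real still-image codec. Frames are flattened to lists of pixel values, and the first frame is assumed to be predicted from zeros, which here amounts to intra coding it. The point shown is that the encoder and decoder run the same predictor on the same decoded frames, so only the coded frame differences need to be transmitted.

```python
from typing import List, Sequence, Tuple

Frame = List[float]  # one frame, flattened to a list of pixel values


def predict_next(decoded: Sequence[Frame]) -> Frame:
    # Hypothetical stand-in for the learned predictor (assumption):
    # repeat the most recent decoded frame.
    return list(decoded[-1])


def intra_encode(residual: Frame, step: float) -> List[int]:
    # Toy "intra codec" (assumption): uniform quantization to integer symbols.
    return [round(r / step) for r in residual]


def intra_decode(symbols: List[int], step: float) -> Frame:
    return [s * step for s in symbols]


def reconstruct(decoded: Sequence[Frame], symbols: List[int], step: float) -> Frame:
    # Prediction uses only previously decoded frames, never the current one.
    pred = predict_next(decoded) if decoded else [0.0] * len(symbols)
    return [p + d for p, d in zip(pred, intra_decode(symbols, step))]


def encode_sequence(frames: Sequence[Frame], step: float = 4.0) -> Tuple[List[List[int]], List[Frame]]:
    stream: List[List[int]] = []
    decoded: List[Frame] = []
    for frame in frames:
        pred = predict_next(decoded) if decoded else [0.0] * len(frame)
        symbols = intra_encode([x - p for x, p in zip(frame, pred)], step)
        stream.append(symbols)
        # The encoder tracks the decoder's reconstruction to avoid drift.
        decoded.append(reconstruct(decoded, symbols, step))
    return stream, decoded


def decode_sequence(stream: Sequence[List[int]], step: float = 4.0) -> List[Frame]:
    decoded: List[Frame] = []
    for symbols in stream:
        decoded.append(reconstruct(decoded, symbols, step))
    return decoded
```

In this sketch the decoder reproduces the encoder's reconstructions exactly from the symbol stream alone, with no motion vectors or other side information. A better predictor shrinks the frame differences and therefore the bits an actual intra codec would spend on them.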

Author information

Corresponding author

Correspondence to A. Murat Tekalp.

Additional information

A. M. Tekalp acknowledges support from the TUBITAK project 217E033 and Turkish Academy of Sciences (TUBA).

About this article

Cite this article

Sulun, S., Tekalp, A.M.: Can learned frame prediction compete with block motion compensation for video coding? SIViP 15, 401–410 (2021). https://doi.org/10.1007/s11760-020-01751-y

Keywords

  • Deep learning
  • Frame prediction
  • Predictive frame difference
  • HEVC-Intra codec
  • Rate-distortion performance