
Dilated temporal relational adversarial network for generic video summarization

  • Yujia Zhang
  • Michael Kampffmeyer
  • Xiaodan Liang
  • Dingwen Zhang
  • Min Tan
  • Eric P. Xing
Article

Abstract

The vast number of videos uploaded every day makes it increasingly important to extract and understand the key information within a video in a short time. Video summarization, the task of finding the smallest subset of frames that still conveys the whole story of a given video, is thus of great significance for efficient video understanding. We propose a novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) for frame-level video summarization. Given a video, it selects the set of key frames that contains the most meaningful and compact information. Specifically, DTR-GAN learns a dilated temporal relational generator and a discriminator with a three-player loss in an adversarial manner. A new dilated temporal relation (DTR) unit is introduced to enhance the capture of temporal representations. The generator uses this unit to effectively exploit global multi-scale temporal context when selecting key frames and to complement the commonly used Bi-LSTM. To ensure that summaries capture enough key video representation from a global perspective, rather than being a trivially shortened random sequence, we present a discriminator that learns to enforce both the information completeness and the compactness of summaries via a three-player loss. The loss comprises a generated-summary loss, a random-summary loss, and a real-summary (ground-truth) loss, which together regularize the learned model to produce useful summaries. Comprehensive experiments on three public datasets show the effectiveness of the proposed approach.
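The abstract describes the DTR unit only at a high level: each frame is related to temporally distant frames at several dilation rates to capture global multi-scale context. As a rough illustration (the function name, the dilation rates, and the simple averaging fusion are our own assumptions, not the paper's exact formulation), multi-scale dilated temporal aggregation over per-frame features could be sketched as:

```python
import numpy as np

def dtr_unit(features, dilations=(1, 4, 16)):
    """Sketch of a dilated temporal relation (DTR) unit.

    For each dilation rate d, every frame t is related to its
    neighbours at t - d and t + d (clamped at the sequence
    boundaries), giving a multi-scale view of temporal context;
    the scales are then averaged into one enhanced representation.
    features: array of shape (T, D), one D-dim feature per frame.
    """
    T, _ = features.shape
    idx = np.arange(T)
    scales = []
    for d in dilations:
        left = features[np.clip(idx - d, 0, T - 1)]    # frame t - d
        right = features[np.clip(idx + d, 0, T - 1)]   # frame t + d
        scales.append((left + features + right) / 3.0)  # relate the triple
    return np.mean(scales, axis=0)  # fuse across dilation scales
```

A real implementation would learn the per-scale transforms (e.g. dilated temporal convolutions) rather than averaging, but the index arithmetic above shows how dilation widens the temporal receptive field without adding frames.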
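The three-player loss pits the discriminator against three inputs: the ground-truth summary, the generated summary, and a randomly sampled summary. A minimal sketch, assuming a least-squares form for the individual terms (the exact loss formulation in the paper may differ), makes the roles of the three players concrete:

```python
import numpy as np

def three_player_d_loss(d_real, d_generated, d_random):
    """Sketch of a discriminator three-player loss.

    The discriminator is pushed to score the ground-truth summary
    as real (target 1) and both the generated summary and a random
    summary as fake (target 0), so a trivially shortened random
    sequence cannot pass as a good summary.
    Each argument is an array of discriminator scores.
    """
    real_loss = np.mean((d_real - 1.0) ** 2)  # real-summary (ground-truth) loss
    gen_loss = np.mean(d_generated ** 2)      # generated-summary loss
    rand_loss = np.mean(d_random ** 2)        # random-summary loss
    return real_loss + gen_loss + rand_loss
```

With perfect scores (real summaries scored 1, generated and random summaries scored 0) the loss is zero; the random-summary term is what distinguishes this from a standard two-player GAN objective.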

Keywords

Video summarization · Dilated temporal relation · Generative adversarial network · Three-player loss

Notes

Acknowledgements

We would like to thank Xiaohui Zeng for her valuable discussions. This project is supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This work is also partially funded by the National Natural Science Foundation of China (Grant No. 61673378 and 61333016), and Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.

Supplementary material

(MOV 108 MB)


Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2019

Authors and Affiliations

  1. Institute of Automation, Chinese Academy of Sciences, Beijing, China
  2. University of Chinese Academy of Sciences, Beijing, China
  3. Machine Learning Group, UiT The Arctic University of Norway, Tromsø, Norway
  4. Machine Learning Department, Carnegie Mellon University, Pittsburgh, USA
  5. Xidian University, Xi’an, China
