
Dilated temporal relational adversarial network for generic video summarization

Published in Multimedia Tools and Applications

Abstract

The huge number of videos appearing every day makes it increasingly critical that key information within videos can be extracted and understood in a very short time. Video summarization, the task of finding the smallest subset of frames that still conveys the whole story of a given video, is thus of great significance for improving the efficiency of video understanding. We propose a novel Dilated Temporal Relational Generative Adversarial Network (DTR-GAN) to achieve frame-level video summarization. Given a video, it selects the set of key frames that contain the most meaningful and compact information. Specifically, DTR-GAN learns a dilated temporal relational generator and a discriminator with a three-player loss in an adversarial manner. A new dilated temporal relation (DTR) unit is introduced to enhance the capture of temporal representations. The generator uses this unit to effectively exploit global multi-scale temporal context when selecting key frames and to complement the commonly used Bi-LSTM. To ensure that summaries capture enough key video representation from a global perspective, rather than being a trivial, randomly shortened sequence, we present a discriminator that learns to enforce both the information completeness and the compactness of summaries via a three-player loss. The loss comprises the generated-summary loss, the random-summary loss, and the real-summary (ground-truth) loss, which together better regularize the learned model and lead to more useful summaries. Comprehensive experiments on three public datasets show the effectiveness of the proposed approach.
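The abstract's two technical ingredients can be illustrated with short, hedged sketches. First, a minimal PyTorch-style reading of the dilated temporal relation (DTR) unit, assuming parallel 1-D temporal convolutions with increasing dilation rates fused by summation; the dilation rates, kernel size, and fusion scheme are assumptions for illustration, not the paper's exact specification:

```python
import torch
import torch.nn as nn

class DTRUnit(nn.Module):
    """Hypothetical sketch of a dilated temporal relation (DTR) unit:
    parallel 1-D convolutions over the frame axis with increasing
    dilation rates, fused so every frame receives multi-scale temporal
    context. The rates and sum-fusion are assumptions."""

    def __init__(self, feat_dim: int, dilations=(1, 2, 4)):
        super().__init__()
        # Each branch covers a different temporal receptive field.
        self.branches = nn.ModuleList([
            nn.Conv1d(feat_dim, feat_dim, kernel_size=3,
                      padding=d, dilation=d)
            for d in dilations
        ])
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_frames, feat_dim) per-frame features.
        x = x.transpose(1, 2)                   # (batch, feat_dim, num_frames)
        fused = sum(branch(x) for branch in self.branches)
        return self.act(fused).transpose(1, 2)  # (batch, num_frames, feat_dim)
```

Second, one plausible rendering of the three-player loss: the discriminator rates the ground-truth summary as real while pushing both the generated and the random summaries toward fake, and the generator tries to make its own summary score as real. A binary cross-entropy form is assumed here; the paper's exact objective may differ:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(d_real, d_gen, d_rand):
    """Three-player discriminator loss (assumed BCE form).
    d_real, d_gen, d_rand are discriminator outputs in (0, 1),
    e.g. after a sigmoid, for the ground-truth, generated, and
    random summaries respectively."""
    return (F.binary_cross_entropy(d_real, torch.ones_like(d_real))
            + F.binary_cross_entropy(d_gen, torch.zeros_like(d_gen))
            + F.binary_cross_entropy(d_rand, torch.zeros_like(d_rand)))

def generator_loss(d_gen):
    """Adversarial generator loss: make the generated summary
    indistinguishable from a ground-truth one."""
    return F.binary_cross_entropy(d_gen, torch.ones_like(d_gen))
```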




Acknowledgements

We would like to thank Xiaohui Zeng for her valuable discussions. This project is supported by the Department of Defense under Contract No. FA8702-15-D-0002 with Carnegie Mellon University for the operation of the Software Engineering Institute, a federally funded research and development center. This work is also partially funded by the National Natural Science Foundation of China (Grant Nos. 61673378 and 61333016) and by Norwegian Research Council FRIPRO grant no. 239844 on developing the Next Generation Learning Machines.

Author information


Corresponding author

Correspondence to Yujia Zhang.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Work done while Yujia Zhang was at CMU.

Electronic supplementary material

Below is the link to the electronic supplementary material.

(MOV 108 MB)

About this article

Cite this article

Zhang, Y., Kampffmeyer, M., Liang, X. et al. Dilated temporal relational adversarial network for generic video summarization. Multimed Tools Appl 78, 35237–35261 (2019). https://doi.org/10.1007/s11042-019-08175-y

