Video Summarization with LSTM and Deep Attention Models

  • Luis Lebron Casas
  • Eugenia Koblents
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11296)

Abstract

In this paper we propose two video summarization models based on the recently proposed vsLSTM and dppLSTM deep networks, which model frame relevance and similarity. The proposed deep learning architectures additionally incorporate an attention mechanism to model user interest. The proposed models are compared to the original ones in terms of prediction accuracy and computational complexity. The proposed vsLSTM+Att method, which incorporates the attention model, outperforms the original methods when evaluated on common public datasets. Additionally, results obtained on a real video dataset containing terrorist-related content are provided to highlight the challenges faced in real-life applications. The proposed method yields outstanding results in this complex scenario when compared to the original methods.
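The attention idea underlying the proposed models can be illustrated with a minimal sketch: per-frame scores are normalized with a softmax so each frame receives a weight reflecting its estimated interest. All names below, and the dot-product scoring against a "user interest" query vector, are illustrative assumptions, not the paper's actual architecture (which builds attention on top of LSTM frame encodings).

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(frame_features, query):
    """Toy dot-product attention: score each frame feature vector against a
    query vector modeling user interest, then softmax-normalize the scores
    into per-frame attention weights that sum to one."""
    scores = [sum(f * q for f, q in zip(feat, query))
              for feat in frame_features]
    return softmax(scores)

# Toy example: three frames with 2-D features; the query emphasizes
# the first feature dimension, so frame 0 gets the largest weight.
frames = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
query = [1.0, 0.0]
weights = attention_weights(frames, query)
```

In a summarization pipeline, such weights would modulate the per-frame relevance scores produced by the LSTM before keyframe or key-shot selection.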

Keywords

Video summarization · LSTM · Attention model · Digital forensics

Notes

Acknowledgements

The work presented in this paper was supported by the European Commission under contract H2020-700367 DANTE.


Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. United Technology Research Center Ireland, Cork, Republic of Ireland