Video Summarization with LSTM and Deep Attention Models
In this paper we propose two video summarization models based on the recently proposed vsLSTM and dppLSTM deep networks, which model frame relevance and similarity. The proposed deep learning architectures additionally incorporate an attention mechanism to model user interest. The proposed models are compared with the original ones in terms of prediction accuracy and computational complexity. The proposed vsLSTM+Att method, which adds an attention model to vsLSTM, outperforms the original methods when evaluated on common public datasets. Additionally, results obtained on a real video dataset containing terrorist-related content are provided to highlight the challenges faced in real-life applications. In this complex scenario, the proposed method yields outstanding results compared with the original methods.
Keywords: Video summarization, LSTM, Attention model, Digital forensics
The work presented in this paper was supported by the European Commission under contract H2020-700367 DANTE.