Abstract
Video summarization compresses videos while preserving the most meaningful content for users. Many image-based works focus on how to effectively utilize video visual cues to choose keyframes. However, apart from visual content, videos also contain useful audio information. In this paper, we propose a novel attention-based audio-visual fusion framework which integrates the audio information with visual information. Our framework is composed of two key components: asymmetrical self-attention mechanism, and odd-even attention. The asymmetrical self-attention mechanism addresses the problem that visual information is more strongly related to video summarization than audio information. The odd-even attention focuses on alleviating the memory requirements. Besides, we create ViAu-SumMe, an audio-visual dataset, which is based on SumMe dataset. Experimental results on the dataset show that our proposed method outperforms the state-of-the-art methods.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
37 Mind blowing youtube facts, figures and statistics – 2019 (2019). https://merchdope.com/youtube-stats/
Fajtl, J., Sokeh, H.S., Argyriou, V., Monekosso, D., Remagnino, P.: Summarizing videos with attention (2018)
Gong, Y., Liu, X.: Video summarization using singular value decomposition. In: Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No. PR00662), vol. 2, pp. 174–180. IEEE (2000)
Gygli, M., Grabner, H., Riemenschneider, H., Van Gool, L.: Creating summaries from user videos. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8695, pp. 505–520. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10584-0_33
Hershey, S., et al.: CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE (2017)
Katsaggelos, A.K., Bahaadini, S., Molina, R.: Audiovisual fusion: challenges and new approaches. Proc. IEEE 103(9), 1635–1653 (2015)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Lahat, D., Adali, T., Jutten, C.: Multimodal data fusion: an overview of methods, challenges, and prospects. Proc. IEEE 103(9), 1449–1477 (2015)
Lu, J., Yang, J., Batra, D., Parikh, D.: Hierarchical question-image co-attention for visual question answering. In: Advances In Neural Information Processing Systems, pp. 289–297 (2016)
Nam, J., Tewfik, A.H.: Video abstract of video. In: 1999 IEEE Third Workshop on Multimedia Signal Processing (Cat. No. 99TH8451), pp. 117–122. IEEE (1999)
Petridis, S., Wang, Y., Li, Z., Pantic, M.: End-to-end audiovisual fusion with LSTMs. arXiv preprint arXiv:1709.04343 (2017)
Potapov, D., Douze, M., Harchaoui, Z., Schmid, C.: Category-specific video summarization. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8694, pp. 540–555. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10599-4_35
Rochan, M., Ye, L., Wang, Y.: Video summarization using fully convolutional sequence networks. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11216, pp. 358–374. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01258-8_22
Sterpu, G., Saam, C., Harte, N.: Attention-based audio-visual fusion for robust automatic speech recognition. In: Proceedings of the 2018 on International Conference on Multimodal Interaction, pp. 111–115. ACM (2018)
Wei, H., Ni, B., Yan, Y., Yu, H., Yang, X., Yao, C.: Video summarization via semantic attended networks. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Yuan, L., Tay, F.E., Li, P., Zhou, L., Feng, J.: Cycle-SUM: cycle-consistent adversarial LSTM networks for unsupervised video summarization. arXiv preprint arXiv:1904.08265 (2019)
Zhang, K., Chao, W.-L., Sha, F., Grauman, K.: Video summarization with long short-term memory. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9911, pp. 766–782. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46478-7_47
Zhou, K., Qiao, Y., Xiang, T.: Deep reinforcement learning for unsupervised video summarization with diversity-representativeness reward. In: Thirty-Second AAAI Conference on Artificial Intelligence (2018)
Zhou, P., Yang, W., Chen, W., Wang, Y., Jia, J.: Modality attention for end-to-end audio-visual speech recognition. arXiv preprint arXiv:1811.05250 (2018)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Fang, Y., Zhang, J., Lu, C. (2019). Attention-Based Audio-Visual Fusion for Video Summarization. In: Gedeon, T., Wong, K., Lee, M. (eds) Neural Information Processing. ICONIP 2019. Lecture Notes in Computer Science(), vol 11954. Springer, Cham. https://doi.org/10.1007/978-3-030-36711-4_28
Download citation
DOI: https://doi.org/10.1007/978-3-030-36711-4_28
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-36710-7
Online ISBN: 978-3-030-36711-4
eBook Packages: Computer ScienceComputer Science (R0)