Abstract
Audio-visual event localization requires identifying the event label across video frames by jointly observing visual and audio information. To address this task, we propose a deep learning framework of cross-modality co-attention for video event localization. Our proposed audiovisual transformer (AV-transformer) exploits intra- and inter-frame visual information, with audio features jointly observed to perform co-attention over these three modalities. With visual, temporal, and audio information observed across consecutive video frames, our model achieves promising capability in extracting informative spatial/temporal features for improved event localization. Moreover, our model produces instance-level attention, identifying the image regions at the instance level that are associated with the sound or event of interest. Experiments on a benchmark dataset confirm the effectiveness of our proposed framework, with ablation studies performed to verify the design of our proposed network model.
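As a concrete illustration of the cross-modality co-attention described above, the following is a minimal sketch (PyTorch, not the authors' released code) of how an audio feature can attend over per-frame visual region features and vice versa before event classification. All module names, feature dimensions, and tensor shapes here are illustrative assumptions rather than details taken from the paper.

import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Toy audio-visual co-attention for one video segment.

    Assumed (hypothetical) shapes:
      visual: (B, R, D)  -- R spatial regions per segment, D-dim features
      audio:  (B, 1, D)  -- one pooled audio feature per segment
    """
    def __init__(self, dim: int = 512, heads: int = 4):
        super().__init__()
        self.a2v = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.v2a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, visual: torch.Tensor, audio: torch.Tensor):
        # Audio query attends over visual regions -> audio-guided visual summary;
        # the attention weights serve as a coarse instance-level visual attention map.
        v_ctx, v_attn = self.a2v(query=audio, key=visual, value=visual)
        # Visual regions attend over the audio feature -> audio-aware visual features.
        a_ctx, _ = self.v2a(query=visual, key=audio, value=audio)
        return v_ctx, a_ctx, v_attn

if __name__ == "__main__":
    block = CoAttentionBlock()
    vis = torch.randn(2, 49, 512)   # e.g. a 7x7 CNN feature map flattened to 49 regions
    aud = torch.randn(2, 1, 512)    # one audio embedding per segment
    v_ctx, a_ctx, attn = block(vis, aud)
    print(v_ctx.shape, a_ctx.shape, attn.shape)  # (2,1,512) (2,49,512) (2,1,49)

In this sketch, the returned attention weights over the 49 regions play the role of the instance-level attention mentioned in the abstract: regions receiving high weight are those most associated with the sounding event.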
Acknowledgement
This work is supported in part by the Ministry of Science and Technology of Taiwan under grant MOST 109-2634-F-002-037.
Electronic Supplementary Material
Below is the link to the electronic supplementary material.
Supplementary material 1 (mp4 25718 KB)
Copyright information
© 2021 Springer Nature Switzerland AG
Cite this paper
Lin, Y.B., Wang, Y.C.F.: Audiovisual Transformer with Instance Attention for Audio-Visual Event Localization. In: Ishikawa, H., Liu, C.L., Pajdla, T., Shi, J. (eds.) Computer Vision – ACCV 2020. LNCS, vol. 12627. Springer, Cham (2021). https://doi.org/10.1007/978-3-030-69544-6_17
Print ISBN: 978-3-030-69543-9
Online ISBN: 978-3-030-69544-6