Abstract
Cross-modal audio–visual correlation learning, which aims to capture and understand semantic correspondences between audio and video, has attracted sustained research interest. It poses two challenges: (i) audio and visual feature sequences lie in different feature spaces, and (ii) semantic mismatches between audio and visual sequences inevitably occur. Existing works address these challenges mainly by efficiently extracting discriminative features, while ignoring the abundant granular features of the audio and visual modalities. In this work, we introduce the multi-scale network with shared cross-attention (MSNSCA) module for audio–visual correlation learning, a supervised representation learning framework for capturing semantic audio–visual correspondences that integrates a multi-scale feature extraction module and a shared cross-attention module into an end-to-end trainable deep network. MSNSCA extracts more effective fine-grained audio–visual features and exhibits strong audio–visual semantic matching capability. Experiments on various audio–visual learning tasks, including audio–visual matching and retrieval on benchmark datasets, demonstrate the effectiveness of the proposed MSNSCA model.
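To make the "shared cross-attention" idea concrete, the following is a minimal pure-Python sketch of bidirectional scaled dot-product cross-attention, in which the same attention function is applied in both directions (audio attending to visual, and visual attending to audio). This is an illustrative assumption, not the paper's implementation: the actual MSNSCA module operates on learned multi-scale projections of VGGish/Inception features rather than raw lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys
    and returns a weighted average of the corresponding values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def shared_cross_attention(audio_seq, visual_seq):
    """Apply the same attention operation in both directions, so each
    modality's features are re-weighted by their relevance to the other."""
    audio_attended = cross_attention(audio_seq, visual_seq, visual_seq)
    visual_attended = cross_attention(visual_seq, audio_seq, audio_seq)
    return audio_attended, visual_attended

if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]                 # 2 audio frames, dim 2
    visual = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.5]]    # 3 video frames, dim 2
    a_out, v_out = shared_cross_attention(audio, visual)
    print(len(a_out), len(v_out))  # 2 3
```

Each attended sequence keeps its own length but its entries become mixtures of the other modality's features, which is what lets a subsequent matching loss compare the two modalities in a common space.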
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
Jiwei Zhang was involved in this work when he worked as an assistant researcher at the National Institute of Informatics, Tokyo, Japan.
Cite this article
Zhang, J., Yu, Y., Tang, S. et al. Multi-scale network with shared cross-attention for audio–visual correlation learning. Neural Comput & Applic 35, 20173–20187 (2023). https://doi.org/10.1007/s00521-023-08817-1