Abstract
Cross-modal audio–visual correlation learning, which aims to capture and understand semantic correspondences between audio and video, has attracted sustained research interest. It poses two challenges: (i) audio and visual feature sequences lie in different feature spaces, and (ii) semantic mismatches between audio and visual sequences inevitably occur. Existing works address these challenges mainly by efficiently extracting discriminative features, while ignoring the abundant granular features of the audio and visual modalities. In this work, we introduce the multi-scale network with shared cross-attention (MSNSCA) module for audio–visual correlation learning, a supervised representation learning framework for capturing semantic audio–visual correspondences that integrates a multi-scale feature extraction module and a shared cross-attention module into an end-to-end trainable deep network. MSNSCA extracts more effective fine-grained audio–visual features and exhibits strong audio–visual semantic matching capability. Experiments on various audio–visual learning tasks, including audio–visual matching and retrieval on benchmark datasets, demonstrate the effectiveness of the proposed MSNSCA model.
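To make the "shared cross-attention" idea concrete, the following is a minimal pure-Python sketch of bidirectional scaled dot-product cross-attention, in which the same attention function is applied in both directions (audio attending to visual, and visual attending to audio). This is an illustrative assumption, not the paper's implementation: the actual MSNSCA module operates on learned multi-scale projections of VGGish/Inception features rather than raw lists.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys
    and returns a weighted average of the corresponding values."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def shared_cross_attention(audio_seq, visual_seq):
    """Apply the same attention operation in both directions, so each
    modality's features are re-weighted by their relevance to the other."""
    audio_attended = cross_attention(audio_seq, visual_seq, visual_seq)
    visual_attended = cross_attention(visual_seq, audio_seq, audio_seq)
    return audio_attended, visual_attended

if __name__ == "__main__":
    audio = [[1.0, 0.0], [0.0, 1.0]]                 # 2 audio frames, dim 2
    visual = [[1.0, 1.0], [0.5, 0.0], [0.0, 0.5]]    # 3 video frames, dim 2
    a_out, v_out = shared_cross_attention(audio, visual)
    print(len(a_out), len(v_out))  # 2 3
```

Each attended sequence keeps its own length but its entries become mixtures of the other modality's features, which is what lets a subsequent matching loss compare the two modalities in a common space.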
Ethics declarations
Conflict of interest
All authors declare that they have no conflicts of interest.
Jiwei Zhang was involved in this work when he worked as an assistant researcher at the National Institute of Informatics, Tokyo, Japan.
Cite this article
Zhang, J., Yu, Y., Tang, S. et al. Multi-scale network with shared cross-attention for audio–visual correlation learning. Neural Comput & Applic 35, 20173–20187 (2023). https://doi.org/10.1007/s00521-023-08817-1