Multi-scale network with shared cross-attention for audio–visual correlation learning

  • Original Article
  • Published:
Neural Computing and Applications

Abstract

Cross-modal audio–visual correlation learning, which aims to capture and understand the semantic correspondences between audio and video, has attracted sustained research interest. Two challenges arise in audio–visual correlation learning: (i) the audio and visual feature sequences lie in different feature spaces, and (ii) semantic mismatches between audio and visual sequences inevitably occur. Existing works address these challenges mainly by extracting discriminative features efficiently, while ignoring the abundant fine-grained features of the audio and visual modalities. In this work, we introduce the multi-scale network with shared cross-attention (MSNSCA) for audio–visual correlation learning, a supervised representation learning framework that captures semantic audio–visual correspondences by integrating a multi-scale feature extraction module with shared cross-attention into an end-to-end trainable deep network. MSNSCA extracts more effective fine-grained audio–visual features and achieves strong audio–visual semantic matching. Experiments on audio–visual learning tasks, including audio–visual matching and retrieval on benchmark datasets, demonstrate the effectiveness of the proposed MSNSCA model.
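To make the two components named above concrete, here is a minimal, self-contained PyTorch sketch of (a) a multi-scale temporal pooling module and (b) a cross-attention block whose weights are shared between the audio-to-visual and visual-to-audio directions. This is an illustration of the general technique, not the authors' implementation: the class names, the pooling scales (1, 2, 4), the head count, and the assumption that both modalities have already been projected to a common dimension dim are all hypothetical choices.

```python
import torch
import torch.nn as nn


class MultiScaleFeatures(nn.Module):
    """Pool a feature sequence at several temporal scales and
    concatenate the pooled sequences along the time axis
    (illustrative sketch, not the paper's exact module)."""

    def __init__(self, scales=(1, 2, 4)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.AvgPool1d(kernel_size=s, stride=s) for s in scales
        )

    def forward(self, x):
        # x: (batch, T, dim); AvgPool1d expects (batch, dim, T).
        x = x.transpose(1, 2)
        scaled = [pool(x).transpose(1, 2) for pool in self.pools]
        return torch.cat(scaled, dim=1)  # (batch, T + T//2 + T//4, dim)


class SharedCrossAttention(nn.Module):
    """Bidirectional cross-attention that reuses one attention module,
    i.e. one set of projection weights, for both directions."""

    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_a = nn.LayerNorm(dim)
        self.norm_v = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        # audio: (batch, T_a, dim), visual: (batch, T_v, dim).
        # Audio queries attend over visual keys/values and vice versa,
        # through the same shared parameters.
        a2v, _ = self.attn(query=audio, key=visual, value=visual)
        v2a, _ = self.attn(query=visual, key=audio, value=audio)
        # Residual connections keep the unimodal content alongside
        # the cross-modally attended one.
        return self.norm_a(audio + a2v), self.norm_v(visual + v2a)


if __name__ == "__main__":
    ms, xatt = MultiScaleFeatures(), SharedCrossAttention(dim=128)
    a = torch.randn(2, 8, 128)   # toy audio feature sequence
    v = torch.randn(2, 8, 128)   # toy visual feature sequence
    a_out, v_out = xatt(ms(a), ms(v))
    print(a_out.shape, v_out.shape)  # torch.Size([2, 14, 128]) each
```

Sharing one attention module roughly halves the parameter count relative to two direction-specific blocks and ties both alignment directions to the same learned projections, which is one plausible reading of how shared cross-attention encourages symmetric audio–visual matching.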

Data availability

The datasets used in the current study are the VEGAS [35] and AVE [36] datasets.

References

  1. Zhen L, Hu P, Wang X, Peng D (2019) Deep supervised cross-modal retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10394–10403

  2. Wang B, Yang Y, Xu X, Hanjalic A, Shen HT (2017) Adversarial cross-modal retrieval. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 154–162

  3. Wang K, He R, Wang L, Wang W, Tan T (2015) Joint feature selection and subspace learning for cross-modal retrieval. IEEE Trans Pattern Anal Mach Intell 38(10):2010–2023

  4. Lai PL, Fyfe C (2000) Kernel and nonlinear canonical correlation analysis. Int J Neural Syst 10(05):365–377

  5. Wang B, Yang Y, Xu X, Hanjalic A, Shen HT (2017) Adversarial cross-modal retrieval. In: Proceedings of the 25th ACM International Conference on Multimedia, pp. 154–162

  6. He L, Xu X, Lu H, Yang Y, Shen F, Shen HT (2017) Unsupervised cross-modal retrieval through adversarial learning. In: 2017 IEEE International Conference on Multimedia and Expo (ICME), pp. 1153–1158. IEEE

  7. Athanasiadis C, Hortal E, Asteriadis S (2020) Audio-visual domain adaptation using conditional semi-supervised generative adversarial networks. Neurocomputing 397:331–344

  8. Yu Y, Shen Z, Zimmermann R (2012) Automatic music soundtrack generation for outdoor videos from contextual sensor information. In: Proceedings of the 20th ACM International Conference on Multimedia, pp. 1377–1378

  9. Prétet L, Richard G, Peeters G (2021) Cross-modal music-video recommendation: a study of design choices. In: 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–9. IEEE

  10. Xue C, Zhong X, Cai M, Chen H, Wang W (2021) Audio-visual event localization by learning spatial and semantic co-attention. IEEE Trans Multim

  11. Zheng A, Hu M, Jiang B, Huang Y, Yan Y, Luo B (2021) Adversarial-metric learning for audio-visual cross-modal matching. IEEE Trans Multim

  12. Hotelling H (1992) Relations between two sets of variates. In: Breakthroughs in Statistics, pp. 162–190. Springer

  13. Gu W, Gu X, Gu J, Li B, Xiong Z, Wang W (2019) Adversary guided asymmetric hashing for cross-modal retrieval. In: Proceedings of the 2019 on International Conference on Multimedia Retrieval, pp. 159–167

  14. Zeng D, Wu J, Hattori G, Xu R, Yu Y (2022) Learning explicit and implicit dual common subspaces for audio-visual cross-modal retrieval. ACM Trans Multim Comput Commun Appl (TOMM)

  15. Wang S, Pan P, Lu Y, Xie L (2015) Improving cross-modal and multi-modal retrieval combining content and semantics similarities with probabilistic model. Multim Tools Appl 74:2009–2032

  16. Chun S, Oh SJ, De Rezende RS, Kalantidis Y, Larlus D (2021) Probabilistic embeddings for cross-modal retrieval. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8415–8424

  17. Zhu Y, Wu Y, Latapie H, Yang Y, Yan Y (2021) Learning audio-visual correlations from variational cross-modal generation. In: ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4300–4304. IEEE

  18. Zhang J, Yu Y, Tang S, Wu J, Li W (2022) Variational autoencoder with CCA for audio-visual cross-modal retrieval. ACM Trans Multim Comput Commun Appl

  19. Yu Y, Tang S, Aizawa K, Aizawa A (2018) Category-based deep CCA for fine-grained venue discovery from multimodal data. IEEE Trans Neural Netw Learn Syst 30(4):1250–1258

  20. Wang C, Yang H, Meinel C (2015) Deep semantic mapping for cross-modal retrieval. In: 2015 IEEE 27th International Conference on Tools with Artificial Intelligence (ICTAI), pp. 234–241. IEEE

  21. Zeng D, Yu Y, Oyama K (2018) Audio-visual embedding for cross-modal music video retrieval through supervised deep CCA. In: 2018 IEEE International Symposium on Multimedia (ISM), pp. 143–150. IEEE

  22. Chung S-W, Chung JS, Kang H-G (2019) Perfect match: improved cross-modal embeddings for audio-visual synchronisation. In: ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3965–3969. IEEE

  23. Wu F, Jing X-Y, Wu Z, Ji Y, Dong X, Luo X, Huang Q, Wang R (2020) Modality-specific and shared generative adversarial network for cross-modal retrieval. Pattern Recogn 104:107335

  24. Surís D, Duarte A, Salvador A, Torres J, Giró-i-Nieto X (2018) Cross-modal embeddings for video and audio retrieval. In: Proceedings of the European Conference on Computer Vision (ECCV) Workshops

  25. Zhang L, Ma B, He J, Li G, Huang Q, Tian Q (2017) Adaptively unified semi-supervised learning for cross-modal retrieval. In: IJCAI, pp. 3406–3412

  26. Yu E, Sun J, Li J, Chang X, Han X-H, Hauptmann AG (2018) Adaptive semi-supervised feature selection for cross-modal retrieval. IEEE Trans Multim 21(5):1276–1288

  27. Mandal D, Rao P, Biswas S (2019) Semi-supervised cross-modal retrieval with label prediction. IEEE Trans Multim 22(9):2345–2353

  28. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I (2017) Attention is all you need. Advances in Neural Information Processing Systems 30

  29. Wang X, Girshick R, Gupta A, He K (2018) Non-local neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794–7803

  30. Cao D, Yu Z, Zhang H, Fang J, Nie L, Tian Q (2019) Video-based cross-modal recipe retrieval. In: Proceedings of the 27th ACM International Conference on Multimedia, pp. 1685–1693

  31. Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B, et al (2017) CNN architectures for large-scale audio classification. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 131–135. IEEE

  32. Abu-El-Haija S, Kothari N, Lee J, Natsev P, Toderici G, Varadarajan B, Vijayanarasimhan S (2016) YouTube-8M: a large-scale video classification benchmark. arXiv preprint arXiv:1609.08675

  33. Wen Y, Zhang K, Li Z, Qiao Y (2016) A discriminative feature learning approach for deep face recognition. In: European Conference on Computer Vision, pp. 499–515. Springer

  34. Bottou L (2012) Stochastic gradient descent tricks. In: Neural Networks: Tricks of the Trade, pp. 421–436. Springer

  35. Zhou Y, Wang Z, Fang C, Bui T, Berg TL (2018) Visual to sound: Generating natural sound for videos in the wild. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3550–3558

  36. Tian Y, Shi J, Li B, Duan Z, Xu C (2018) Audio-visual event localization in unconstrained videos. In: Proceedings of the European Conference on Computer Vision (ECCV), pp. 247–263

  37. Gemmeke JF, Ellis DP, Freedman D, Jansen A, Lawrence W, Moore RC, Plakal M, Ritter M (2017) Audio set: An ontology and human-labeled dataset for audio events. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 776–780. IEEE

  38. Van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9(11)

  39. Andrew G, Arora R, Bilmes J, Livescu K (2013) Deep canonical correlation analysis. In: International Conference on Machine Learning, pp. 1247–1255. PMLR

  40. Rasiwasia N, Mahajan D, Mahadevan V, Aggarwal G (2014) Cluster canonical correlation analysis. In: Artificial Intelligence and Statistics, pp. 823–831. PMLR

  41. Zeng D, Yu Y, Oyama K (2020) Deep triplet neural networks with cluster-CCA for audio-visual cross-modal retrieval. ACM Trans Multim Comput Commun Appl (TOMM) 16(3):1–23

  42. Zhang J, Peng Y, Yuan M (2018) Unsupervised generative adversarial cross-modal hashing. In: Thirty-Second AAAI Conference on Artificial Intelligence

Author information

Corresponding author

Correspondence to Yi Yu.

Ethics declarations

Conflict of interest

All authors declare that they have no conflicts of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Jiwei Zhang contributed to this work while working as an assistant researcher at the National Institute of Informatics, Tokyo, Japan.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Zhang, J., Yu, Y., Tang, S. et al. Multi-scale network with shared cross-attention for audio–visual correlation learning. Neural Comput & Applic 35, 20173–20187 (2023). https://doi.org/10.1007/s00521-023-08817-1
