Maximizing mutual information inside intra- and inter-modality for audio-visual event retrieval

  • Short Paper
  • Published:
International Journal of Multimedia Information Retrieval

Abstract

The human brain processes sound and visual information in overlapping areas of the cerebral cortex, which means that audio and visual information are deeply correlated as we explore the world. To simulate this function of the human brain, audio-visual event retrieval (AVER) has been proposed: using data from one modality (e.g., audio) to query data from another. In this work, we aim to improve the performance of audio-visual event retrieval. To achieve this goal, we first propose a novel network, InfoIIM, which enhances the accuracy of intra-modal feature representation and inter-modal feature alignment. The backbone of this network is a parallel connection of two VAEs with two different encoders and a shared decoder. Second, to enable the VAE to learn better feature representations and to improve intra-modal retrieval performance, we use InfoMax-VAE instead of the vanilla VAE. Additionally, we study the influence of modality-shared features on the effectiveness of audio-visual event retrieval. To verify the effectiveness of our proposed method, we validate our model on the AVE dataset; the results show that our model outperforms several existing algorithms on most metrics. Finally, we present our future research directions, hoping to inspire relevant researchers.
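To make the architecture described above concrete, the sketch below illustrates one plausible reading of the InfoIIM backbone: two modality-specific VAE encoders in parallel, a shared decoder, an InfoMax-style mutual-information term per modality, and an inter-modal alignment term. This is a minimal sketch in PyTorch; the layer sizes, the Jensen-Shannon critic, and the alignment loss are assumptions for illustration, not the authors' exact implementation.

```python
# Hypothetical sketch of the InfoIIM backbone: an audio encoder and a visual
# encoder in parallel, a shared decoder, an InfoMax-style mutual-information
# term, and an inter-modal alignment term. All dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Maps a modality-specific feature vector to a Gaussian latent code."""

    def __init__(self, in_dim: int, latent_dim: int = 128):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(in_dim, 512), nn.ReLU())
        self.mu = nn.Linear(512, latent_dim)
        self.logvar = nn.Linear(512, latent_dim)

    def forward(self, x):
        h = self.hidden(x)
        return self.mu(h), self.logvar(h)


class InfoIIM(nn.Module):
    """Two parallel VAEs (audio/visual encoders) sharing one decoder."""

    def __init__(self, audio_dim=128, visual_dim=512, latent_dim=128):
        super().__init__()
        self.audio_enc = Encoder(audio_dim, latent_dim)
        self.visual_enc = Encoder(visual_dim, latent_dim)
        joint_dim = audio_dim + visual_dim
        # Shared decoder: reconstructs the concatenated audio+visual features
        # from either modality's latent code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 512), nn.ReLU(), nn.Linear(512, joint_dim)
        )
        # Critic used for an InfoMax-style lower bound on I(x; z).
        self.critic = nn.Sequential(
            nn.Linear(latent_dim + joint_dim, 256), nn.ReLU(), nn.Linear(256, 1)
        )

    @staticmethod
    def reparameterize(mu, logvar):
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def vae_loss(self, x, encoder, target):
        mu, logvar = encoder(x)
        z = self.reparameterize(mu, logvar)
        recon = self.decoder(z)
        recon_loss = F.mse_loss(recon, target)
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
        return z, recon_loss + kl

    def infomax_loss(self, z, target):
        # Jensen-Shannon estimator: score matched (z, x) pairs against
        # pairs with x shuffled across the batch.
        joint = self.critic(torch.cat([z, target], dim=1))
        shuffled = target[torch.randperm(target.size(0))]
        marginal = self.critic(torch.cat([z, shuffled], dim=1))
        return F.softplus(-joint).mean() + F.softplus(marginal).mean()

    def forward(self, audio, visual):
        target = torch.cat([audio, visual], dim=1)
        za, loss_a = self.vae_loss(audio, self.audio_enc, target)
        zv, loss_v = self.vae_loss(visual, self.visual_enc, target)
        mi = self.infomax_loss(za, target) + self.infomax_loss(zv, target)
        # Inter-modal alignment: pull paired audio/visual latents together
        # so either modality can be used to query the other.
        align = F.mse_loss(za, zv)
        return loss_a + loss_v + mi + align
```

At retrieval time, one would encode a query from one modality and rank candidates from the other by distance in the shared latent space; this usage pattern is likewise an assumption rather than the paper's exact protocol.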

Data availability

All data analyzed during this study are included in the published article “Audio-Visual Event Localization in Unconstrained Videos.”

References

  1. Wang K, Yin Q, Wang W, Wu S, Wang L (2016) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215

  2. Suresha M, Kuppa S, Raghukumar D (2020) A study on deep learning spatiotemporal models and feature extraction techniques for video understanding. Int J Multimed Inf Retr 9(2):81–101

  3. Feng F, Wang X, Li R (2014) Cross-modal retrieval with correspondence autoencoder. In: Proceedings of the 22nd ACM international conference on multimedia, pp 7–16

  4. Wang H, Zhang Y, Yu X (2020) An overview of image caption generation methods. Comput Intell Neurosci 2020

  5. Wang L, Shang C, Qiu H, Zhao T, Qiu B, Li H (2020) Multi-stage tag guidance network in video caption. In: Proceedings of the 28th ACM international conference on multimedia, pp 4610–4614

  6. Heller S, Gsteiger V, Bailer W, Gurrin C, Jónsson BÞ, Lokoč J, Leibetseder A, Mejzlík F, Peška L, Rossetto L et al (2022) Interactive video retrieval evaluation at a distance: comparing sixteen interactive video search systems in a remote setting at the 10th video browser showdown. Int J Multimed Inf Retr 11(1):1–18

  7. Nagrani A, Albanie S, Zisserman A (2018) Seeing voices and hearing faces: cross-modal biometric matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8427–8436

  8. Nagrani A, Albanie S, Zisserman A (2018) Learnable pins: cross-modal embeddings for person identity. In: Proceedings of the European conference on computer vision (ECCV), pp 71–88

  9. Wen P, Xu Q, Jiang Y, Yang Z, He Y, Huang Q (2021) Seeking the shape of sound: an adaptive framework for learning voice-face association. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16347–16356

  10. Ning H, Zheng X, Lu X, Yuan Y (2021) Disentangled representation learning for cross-modal biometric matching. IEEE Trans Multimed 24:1763–1774

  11. Saeed MS, Khan MH, Nawaz S, Yousaf MH, Del Bue A (2022) Fusion and orthogonal projection for improved face-voice association. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7057–7061

  12. Arandjelovic R, Zisserman A (2017) Look, listen and learn. In: Proceedings of the IEEE international conference on computer vision, pp 609–617

  13. Hong S, Im W, Yang HS (2017) Content-based video-music retrieval using soft intra-modal structure constraint. arXiv preprint arXiv:1704.06761

  14. Arandjelovic R, Zisserman A (2018) Objects that sound. In: Proceedings of the European conference on computer vision (ECCV), pp 435–451

  15. Zhu Y, Wu Y, Latapie H, Yang Y, Yan Y (2021) Learning audio-visual correlations from variational cross-modal generation. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4300–4304

  16. Chung JS, Huh J, Mun S, Lee M, Heo HS, Choe S, Ham C, Jung S, Lee B-J, Han I (2020) In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982

  17. Li J, Jing M, Zhu L, Ding Z, Lu K, Yang Y (2020) Learning modality-invariant latent representations for generalized zero-shot learning. In: Proceedings of the 28th ACM international conference on multimedia, pp 1348–1356

  18. Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B et al (2017) CNN architectures for large-scale audio classification. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 131–135

  19. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  20. McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264(5588):746–748

  21. Smith HM, Dunn AK, Baguley T, Stacey PC (2016) Matching novel face and voice identity using static and dynamic facial images. Atten Percept Psychophys 78(3):868–879

  22. Kim C, Shin HV, Oh T-H, Kaspar A, Elgharib M, Matusik W (2018) On learning associations of faces and voices. In: Asian conference on computer vision. Springer, pp 276–292

  23. Aslaksen K, Lorås H (2018) The modality-specific learning style hypothesis: a mini-review. Front Psychol 9:1538

  24. Wang Y, Peng Y (2021) MARS: learning modality-agnostic representation for scalable cross-media retrieval. IEEE Trans Circuits Syst Video Technol

  25. Wu F, Jing X-Y, Wu Z, Ji Y, Dong X, Luo X, Huang Q, Wang R (2020) Modality-specific and shared generative adversarial network for cross-modal retrieval. Pattern Recognit 104:107335

  26. Chen X, Kingma DP, Salimans T, Duan Y, Dhariwal P, Schulman J, Sutskever I, Abbeel P (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731

  27. Tian Y, Shi J, Li B, Duan Z, Xu C (2018) Audio-visual event localization in unconstrained videos. In: Proceedings of the European conference on computer vision (ECCV), pp 247–263

  28. Takida Y, Liao W-H, Uesaka T, Takahashi S, Mitsufuji Y (2021) Preventing posterior collapse induced by oversmoothing in gaussian VAE. arXiv preprint arXiv:2102.08663

  29. Bowman SR, Vilnis L, Vinyals O, Dai AM, Jozefowicz R, Bengio S (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349

Funding

This project is supported by Funding for General Scientific Research of Macau University of Science and Technology (Grant No. FRG-22-102-FIE).

Author information

Corresponding author

Correspondence to Nannan Li.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Li, R., Li, N. & Wang, W. Maximizing mutual information inside intra- and inter-modality for audio-visual event retrieval. Int J Multimed Info Retr 12, 10 (2023). https://doi.org/10.1007/s13735-023-00276-7

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s13735-023-00276-7

Keywords
