Abstract
The human brain processes auditory and visual information in overlapping areas of the cerebral cortex, so the audio and visual streams we perceive while exploring the world are deeply correlated. To mimic this capability, audio-visual event retrieval (AVER) has been proposed: using data from one modality (e.g., audio) to query data from another. In this work, we aim to improve audio-visual event retrieval performance. First, we propose a novel network, InfoIIM, which improves both intra-modal feature representation and inter-modal feature alignment. Its backbone is a parallel connection of two VAE branches with two modality-specific encoders and a shared decoder. Second, to help the VAE learn better feature representations and to improve intra-modal retrieval performance, we use InfoMax-VAE in place of the vanilla VAE. Additionally, we study the influence of modality-shared features on the effectiveness of audio-visual event retrieval. To verify the effectiveness of our proposed method, we evaluate the model on the AVE dataset; the results show that it outperforms several existing algorithms on most metrics. Finally, we outline our future research directions, hoping to inspire relevant researchers.
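The architecture the abstract describes, two modality-specific encoders feeding a shared decoder, with an InfoMax-style mutual-information term added to the usual VAE objective, can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the linear layers, the InfoNCE-style MI estimator, and the weight `lam` are all hypothetical stand-ins chosen for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(in_dim, out_dim):
    # Toy random linear-layer weights (stand-in for real learned encoders).
    return rng.standard_normal((in_dim, out_dim)) * 0.1

class ModalityEncoder:
    """Maps one modality's features to Gaussian latent parameters (mu, logvar)."""
    def __init__(self, in_dim, z_dim):
        self.W_mu = linear(in_dim, z_dim)
        self.W_logvar = linear(in_dim, z_dim)
    def __call__(self, x):
        return x @ self.W_mu, x @ self.W_logvar

def reparameterize(mu, logvar):
    # Standard VAE reparameterization trick: z = mu + sigma * eps.
    return mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)

def kl_to_standard_normal(mu, logvar):
    # KL(q(z|x) || N(0, I)), averaged over the batch.
    return 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar, axis=1).mean()

def infonce_mi(za, zv):
    # Crude InfoNCE-style lower bound on MI between paired latents:
    # matched (audio, visual) pairs sit on the diagonal of the score matrix.
    sim = za @ zv.T
    sim = sim - sim.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(sim)
    return np.mean(np.log(np.diag(probs) / probs.sum(axis=1)))

# Toy batch: audio and visual features assumed projected to a common dim D.
B, D, Z = 8, 64, 16
audio = rng.standard_normal((B, D))
video = rng.standard_normal((B, D))

enc_a, enc_v = ModalityEncoder(D, Z), ModalityEncoder(D, Z)
W_dec = linear(Z, D)   # one decoder shared by both branches -> common latent space

mu_a, lv_a = enc_a(audio)
mu_v, lv_v = enc_v(video)
z_a = reparameterize(mu_a, lv_a)
z_v = reparameterize(mu_v, lv_v)

recon_a, recon_v = z_a @ W_dec, z_v @ W_dec    # shared decoder
rec_loss = np.mean((recon_a - audio) ** 2) + np.mean((recon_v - video) ** 2)
kl_loss = kl_to_standard_normal(mu_a, lv_a) + kl_to_standard_normal(mu_v, lv_v)
mi_est = infonce_mi(z_a, z_v)

lam = 1.0   # hypothetical weight on the mutual-information term
loss = rec_loss + kl_loss - lam * mi_est   # maximize MI = subtract its estimate
print(f"loss = {loss:.3f}")
```

Sharing a single decoder across both VAE branches forces the audio and visual latents into one space, which is what makes cross-modal (audio-to-visual and visual-to-audio) retrieval by latent-space nearest neighbors possible.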
Data availability
All data analyzed during this study are included in the published article "Audio-Visual Event Localization in Unconstrained Videos."
References
Wang K, Yin Q, Wang W, Wu S, Wang L (2016) A comprehensive survey on cross-modal retrieval. arXiv preprint arXiv:1607.06215
Suresha M, Kuppa S, Raghukumar D (2020) A study on deep learning spatiotemporal models and feature extraction techniques for video understanding. Int J Multimed Inf Retr 9(2):81–101
Feng F, Wang X, Li R (2014) Cross-modal retrieval with correspondence autoencoder. In: Proceedings of the 22nd ACM international conference on multimedia, pp 7–16
Wang H, Zhang Y, Yu X (2020) An overview of image caption generation methods. Comput Intell Neurosci 2020
Wang L, Shang C, Qiu H, Zhao T, Qiu B, Li H (2020) Multi-stage tag guidance network in video caption. In: Proceedings of the 28th ACM international conference on multimedia, pp 4610–4614
Heller S, Gsteiger V, Bailer W, Gurrin C, Jónsson BÞ, Lokoč J, Leibetseder A, Mejzlík F, Peška L, Rossetto L et al (2022) Interactive video retrieval evaluation at a distance: comparing sixteen interactive video search systems in a remote setting at the 10th video browser showdown. Int J Multimed Inf Retr 11(1):1–18
Nagrani A, Albanie S, Zisserman A (2018) Seeing voices and hearing faces: cross-modal biometric matching. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 8427–8436
Nagrani A, Albanie S, Zisserman A (2018) Learnable pins: cross-modal embeddings for person identity. In: Proceedings of the European conference on computer vision (ECCV), pp 71–88
Wen P, Xu Q, Jiang Y, Yang Z, He Y, Huang Q (2021) Seeking the shape of sound: an adaptive framework for learning voice-face association. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 16347–16356
Ning H, Zheng X, Lu X, Yuan Y (2021) Disentangled representation learning for cross-modal biometric matching. IEEE Trans Multimed 24:1763–1774
Saeed MS, Khan MH, Nawaz S, Yousaf MH, Del Bue A (2022) Fusion and orthogonal projection for improved face-voice association. In: ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 7057–7061
Arandjelovic R, Zisserman A (2017) Look, listen and learn. In: Proceedings of the IEEE international conference on computer vision, pp 609–617
Hong S, Im W, Yang HS (2017) Content-based video-music retrieval using soft intra-modal structure constraint. arXiv preprint arXiv:1704.06761
Arandjelovic R, Zisserman A (2018) Objects that sound. In: Proceedings of the European conference on computer vision (ECCV), pp 435–451
Zhu Y, Wu Y, Latapie H, Yang Y, Yan Y (2021) Learning audio-visual correlations from variational cross-modal generation. In: ICASSP 2021-2021 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4300–4304
Chung JS, Huh J, Mun S, Lee M, Heo HS, Choe S, Ham C, Jung S, Lee B-J, Han I (2020) In defence of metric learning for speaker recognition. arXiv preprint arXiv:2003.11982
Li J, Jing M, Zhu L, Ding Z, Lu K, Yang Y (2020) Learning modality-invariant latent representations for generalized zero-shot learning. In: Proceedings of the 28th ACM international conference on multimedia, pp 1348–1356
Hershey S, Chaudhuri S, Ellis DP, Gemmeke JF, Jansen A, Moore RC, Plakal M, Platt D, Saurous RA, Seybold B et al (2017) CNN architectures for large-scale audio classification. In: 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 131–135
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
McGurk H, MacDonald J (1976) Hearing lips and seeing voices. Nature 264(5588):746–748
Smith HM, Dunn AK, Baguley T, Stacey PC (2016) Matching novel face and voice identity using static and dynamic facial images. Atten Percept Psychophys 78(3):868–879
Kim C, Shin HV, Oh T-H, Kaspar A, Elgharib M, Matusik W (2018) On learning associations of faces and voices. In: Asian conference on computer vision. Springer, pp 276–292
Aslaksen K, Lorås H (2018) The modality-specific learning style hypothesis: a mini-review. Front Psychol 9:1538
Wang Y, Peng Y (2021) Mars: learning modality-agnostic representation for scalable cross-media retrieval. IEEE Trans Circuits Syst Video Technol
Wu F, Jing X-Y, Wu Z, Ji Y, Dong X, Luo X, Huang Q, Wang R (2020) Modality-specific and shared generative adversarial network for cross-modal retrieval. Pattern Recognit 104:107335
Chen X, Kingma DP, Salimans T, Duan Y, Dhariwal P, Schulman J, Sutskever I, Abbeel P (2016) Variational lossy autoencoder. arXiv preprint arXiv:1611.02731
Tian Y, Shi J, Li B, Duan Z, Xu C (2018) Audio-visual event localization in unconstrained videos. In: Proceedings of the European conference on computer vision (ECCV), pp 247–263
Takida Y, Liao W-H, Uesaka T, Takahashi S, Mitsufuji Y (2021) Preventing posterior collapse induced by oversmoothing in gaussian VAE. arXiv preprint arXiv:2102.08663
Bowman SR, Vilnis L, Vinyals O, Dai AM, Jozefowicz R, Bengio S (2015) Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349
Funding
This project is supported by Funding for General Scientific Research of Macau University of Science and Technology (Grant No. FRG-22-102-FIE).
Cite this article
Li, R., Li, N. & Wang, W. Maximizing mutual information inside intra- and inter-modality for audio-visual event retrieval. Int J Multimed Info Retr 12, 10 (2023). https://doi.org/10.1007/s13735-023-00276-7