
Multi-speaker DoA Estimation Using Audio and Visual Modality

Published in Neural Processing Letters

Abstract

Traditionally, direction of arrival (DoA) estimation approaches rely on a single audio modality. Humans, however, naturally locate sound sources using both auditory and visual cues. Motivated by this observation, we adopt audio and visual modalities for DoA estimation, with video serving as a prominent supplementary modality for sound source localization. This paper introduces a novel transformer-based sound source localization framework, in which self-attention mechanisms capture the temporal dependencies in the multi-channel audio signals. The whole model is trained to predict an ideal spatial spectrum using likelihood-based output coding. The framework is evaluated on a publicly available multi-speaker sound source localization dataset and compared against state-of-the-art methods in terms of DoA estimation error and localization accuracy. Experimental results show that the proposed audio-visual multi-speaker DoA estimation method outperforms the baselines.
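The abstract states that the model is trained to predict an ideal, likelihood-based spatial spectrum rather than a single angle. The encoding details are not given here, but a common way to build such a training target is to place a Gaussian-shaped likelihood peak on a discretized azimuth grid around each ground-truth DoA. The sketch below is a minimal illustration under that assumption; the 1-degree grid resolution, the Gaussian width sigma_deg, and the function name likelihood_spectrum are illustrative choices, not taken from the paper.

    import numpy as np

    def likelihood_spectrum(doa_degrees, n_bins=360, sigma_deg=8.0):
        # Candidate azimuths on a 1-degree grid; resolution and sigma are illustrative.
        grid = np.arange(n_bins, dtype=float)
        spectrum = np.zeros(n_bins)
        for doa in doa_degrees:
            # Circular angular distance between every grid point and this source's DoA.
            diff = np.abs(grid - doa)
            diff = np.minimum(diff, n_bins - diff)
            # Element-wise maximum so peaks of nearby speakers do not add up above 1.
            spectrum = np.maximum(spectrum, np.exp(-(diff ** 2) / (2.0 * sigma_deg ** 2)))
        return spectrum

    # Example: training target for two simultaneous speakers at 45 and 120 degrees.
    target = likelihood_spectrum([45.0, 120.0])

Taking the element-wise maximum over sources keeps each speaker's peak at 1, so a simple peak-picking step with a threshold can recover multiple DoAs from the predicted spectrum at test time.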


Notes

  1. https://www.idiap.ch/dataset/sslr/.

  2. https://pytorch.org/.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (62271358).

Author information

Corresponding author

Correspondence to Ruimin Hu.


About this article


Cite this article

Wu, Y., Hu, R., Wang, X. et al. Multi-speaker DoA Estimation Using Audio and Visual Modality. Neural Process Lett 55, 8887–8901 (2023). https://doi.org/10.1007/s11063-023-11183-7
