
Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning

  • Special Issue: BMVC'21
  • Published in: International Journal of Computer Vision

Abstract

Binaural audio provides human listeners with an immersive spatial sound experience, but most existing videos lack binaural audio recordings. We propose an audio spatialization method that draws on visual information in videos to convert their monaural (single-channel) audio to binaural audio. Whereas existing approaches leverage visual features extracted directly from video frames, our approach explicitly disentangles the geometric cues present in the visual stream to guide the learning process. In particular, we develop a multi-task framework that learns geometry-aware features for binaural audio generation by accounting for the underlying room impulse response, the visual stream’s coherence with the sound source(s) positions, and the consistency in geometry of the sounding objects over time. Furthermore, we introduce two new large video datasets: one with realistic binaural audio simulated for real-world scanned environments, and the other with pseudo-binaural audio obtained from ambisonic sounds in YouTube \(360^{\circ }\) videos. On three datasets, we demonstrate the efficacy of our method, which achieves state-of-the-art results.
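To make the setup concrete, below is a minimal sketch of the general visually guided mono-to-binaural backbone this line of work builds on (Gao and Grauman 2019a): an encoder-decoder predicts a complex mask over the mono mixture's spectrogram to recover the left-right difference channel, conditioned on a pooled visual feature. All layer sizes and names are illustrative assumptions, not the paper's exact architecture.

    import torch
    import torch.nn as nn

    class MonoToBinaural(nn.Module):
        """Sketch of a visually conditioned mono-to-binaural network.

        Predicts a complex mask over the mono mixture's STFT to recover the
        (L - R)/2 difference channel, following the general 2.5D visual sound
        formulation (Gao & Grauman, 2019a). Layer sizes are illustrative.
        """

        def __init__(self, visual_dim=512):
            super().__init__()
            # Encoder over the complex mono spectrogram (2 channels: real/imag).
            self.audio_enc = nn.Sequential(
                nn.Conv2d(2, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, kernel_size=4, stride=2, padding=1), nn.ReLU(),
            )
            # 1x1 conv fuses the tiled visual feature with the audio feature map.
            self.fuse = nn.Conv2d(128 + visual_dim, 128, kernel_size=1)
            # Decoder upsamples back to a 2-channel (real/imag) complex mask.
            self.audio_dec = nn.Sequential(
                nn.ConvTranspose2d(128, 64, kernel_size=4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 2, kernel_size=4, stride=2, padding=1),
            )

        def forward(self, mono_spec, visual_feat):
            # mono_spec: (B, 2, F, T) real/imag STFT of the mono mix (L + R),
            # with F and T assumed divisible by 4 for this toy sketch.
            # visual_feat: (B, visual_dim) pooled feature from the video frame.
            a = self.audio_enc(mono_spec)
            v = visual_feat[:, :, None, None].expand(-1, -1, a.shape[2], a.shape[3])
            mask = self.audio_dec(self.fuse(torch.cat([a, v], dim=1)))
            # Complex multiplication: mask * mono spectrogram.
            real = mask[:, 0] * mono_spec[:, 0] - mask[:, 1] * mono_spec[:, 1]
            imag = mask[:, 0] * mono_spec[:, 1] + mask[:, 1] * mono_spec[:, 0]
            return torch.stack([real, imag], dim=1)  # predicted difference channel

The left and right channels are then recovered as the mono mixture plus/minus the predicted difference and inverted with an inverse STFT; the paper's contribution lies in the geometry-aware multi-task training of the visual features that condition such a backbone, which is not sketched here.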

Notes

  1. The SimBinaural dataset was constructed at, and will be released by, The University of Texas at Austin.

  2. SoundSpaces (Chen et al. 2020a) provides room impulse responses at a spatial resolution of 1 m. These state-of-the-art RIRs capture how sound from each source propagates and interacts with the surrounding geometry and materials, modeling all the major real-world features of the RIR: direct sounds, early specular/diffuse reflections, reverberations, binaural spatialization, and frequency-dependent effects from materials and air absorption. (A minimal sketch of applying such binaural RIRs appears after these notes.)

  3. This is the typical case. However, there can be instances where a sound source is not visible in the video at all; for example, music playing from a small radio that is outside the camera's view. While the binaural data generated in such cases is still correct, it may be harder for any model (including ours) to learn from such samples. Empirically, such clips form a very small portion of the data.

  4. The pre-trained model provided by PseudoBinaural (Xu et al. 2021) is trained on a different split than the standard one from Gao and Grauman (2019a), and hence is not directly comparable in Table 2. We evaluate on the new split in Table 3.
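To make footnote 2 concrete, binaural audio for a static source can be rendered by convolving the mono source signal with the left- and right-ear room impulse responses. The sketch below is a simplified illustration under that single-source assumption (function and variable names are ours, and the toy RIRs are synthetic); the actual SoundSpaces RIRs already bake materials, reflections, and air absorption into the precomputed responses.

    import numpy as np
    from scipy.signal import fftconvolve

    def render_binaural(mono, rir_left, rir_right):
        """Convolve a mono source with left/right binaural room impulse
        responses (illustrative sketch for a single static source)."""
        left = fftconvolve(mono, rir_left)[: len(mono)]
        right = fftconvolve(mono, rir_right)[: len(mono)]
        return np.stack([left, right])  # (2, num_samples) binaural waveform

    # Example: a 1 s mono tone rendered with two toy 0.5 s RIRs at 16 kHz.
    sr = 16000
    mono = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    decay = np.exp(-np.arange(sr // 2) / 2000)  # exponentially decaying tail
    rir_l = np.random.default_rng(0).standard_normal(sr // 2) * decay
    rir_r = np.roll(rir_l, 8)  # crude interaural delay, for illustration only
    binaural = render_binaural(mono, rir_l, rir_r)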

References

  • Afouras, T., Chung, J. S., & Zisserman, A. (2019). My lips are concealed: Audio-visual speech enhancement through obstructions. In ICASSP.

  • Arandjelovic, R., & Zisserman, A. (2017). Look, listen and learn. In ICCV.

  • Arandjelović, R., & Zisserman, A. (2018). Objects that sound. In ECCV.

  • Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representations from unlabeled video. In NeurIPS.

  • Chang, A., Dai, A., Funkhouser, T., Halber, M., Niessner, M., Savva, M., Song, S., Zeng, A., & Zhang, Y. (2017). Matterport3D: Learning from RGB-D data in indoor environments. In International Conference on 3D Vision (3DV). Matterport3D dataset license available at: http://kaldir.vc.in.tum.de/matterport/MP_TOS.pdf

  • Chen, C., Al-Halah, Z., & Grauman, K. (2021). Semantic audio-visual navigation. In CVPR.

  • Chen, C., Gao, R., Calamia, P., & Grauman, K. (2022). Visual acoustic matching. In CVPR.

  • Chen, C., Jain, U., Schissler, C., Gari, S. V. A., Al-Halah, Z., Ithapu, V. K., Robinson, P., & Grauman, K. (2020). Soundspaces: Audio-visual navigation in 3D environments. In ECCV.

  • Chen, C., Majumder, S., Al-Halah, Z., Gao, R., Ramakrishnan, S. K., & Grauman, K. (2020). Learning to set waypoints for audio-visual navigation. In ICLR.

  • Chen, P., Zhang, Y., Tan, M., Xiao, H., Huang, D., & Gan, C. (2020). Generating visually aligned sound from videos. IEEE Transactions on Image Processing.

  • Christensen, J. H., Hornauer, S., & Yu, S. X. (2020). BatVision: Learning to see 3D spatial layout with two ears. In ICRA.

  • Chung, J. S., Senior, A., Vinyals, O., & Zisserman, A. (2017). Lip reading sentences in the wild. In CVPR.

  • Dean, V., Tulsiani, S., & Gupta, A. (2020). See, hear, explore: Curiosity via audio-visual association. In NeurIPS.

  • Engel, J., Agrawal, K. K., Chen, S., Gulrajani, I., Donahue, C., & Roberts, A. (2019). Gansynth: Adversarial neural audio synthesis. In ICLR.

  • Ephrat, A., Mosseri, I., Lang, O., Dekel, T., Wilson, K., Hassidim, A., Freeman, W. T., & Rubinstein, M. (2018). Looking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation. In SIGGRAPH.

  • Font, F., Roma, G., & Serra, X. (2013). Freesound technical demo. In Proceedings of the 21st ACM International Conference on Multimedia.

  • Gabbay, A., Shamir, A., & Peleg, S. (2018). Visual speech enhancement. In INTERSPEECH.

  • Gan, C., Huang, D., Chen, P., Tenenbaum, J. B., & Torralba, A. (2020). Foley music: Learning to generate music from videos. In ECCV.

  • Gan, C., Huang, D., Zhao, H., Tenenbaum, J. B., & Torralba, A. (2020). Music gesture for visual sound separation. In CVPR.

  • Gan, C., Zhang, Y., Wu, J., Gong, B., & Tenenbaum, J. B. (2020). Look, listen, and act: Towards audio-visual embodied navigation. In ICRA.

  • Gao, R., Chen, C., Al-Halah, Z., Schissler, C., & Grauman, K. (2020). Visualechoes: Spatial image representation learning through echolocation. In ECCV.

  • Gao, R., Feris, R., & Grauman, K. (2018). Learning to separate object sounds by watching unlabeled video. In ECCV.

  • Gao, R., & Grauman, K. (2019a). 2.5D visual sound. In CVPR.

  • Gao, R., & Grauman, K. (2019b). Co-separating sounds of visual objects. In ICCV.

  • Gao, R., & Grauman, K. (2021). Visualvoice: Audio-visual speech separation with cross-modal consistency. In CVPR.

  • Gao, R., Oh, T.-H., Grauman, K., & Torresani, L. (2020). Listen to look: Action recognition by previewing audio. In CVPR.

  • Garg, R., Gao, R., & Grauman, K. (2021). Geometry-aware multi-task learning for binaural audio generation from video. In BMVC.

  • Griffin, D., & Lim, J. (1984). Signal estimation from modified short-time Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing.

  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In CVPR.

  • Hu, D., & Li, X. (2016). Temporal multimodal learning in audiovisual speech recognition. In CVPR.

  • Hu, D., Qian, R., Jiang, M., Tan, X., Wen, S., Ding, E., Lin, W., & Dou, D. (2020). Discriminative sounding objects localization via self-supervised audiovisual matching. In NeurIPS.

  • Kingma, D., & Ba, J. (2015). Adam: A method for stochastic optimization. In ICLR.

  • Korbar, B., Tran, D., & Torresani, L. (2018). Co-training of audio and video representations from self-supervised temporal synchronization. In NeurIPS.

  • Lu, Y.-D., Lee, H.-Y., Tseng, H.-Y., & Yang, M.-H. (2019). Self-supervised audio spatialization with correspondence classifier. In ICIP.

  • Majumder, S., Al-Halah, Z., & Grauman, K. (2021). Move2Hear: Active audio-visual source separation. In ICCV.

  • Majumder, S., & Grauman, K. (2022). Active audio-visual separation of dynamic sound sources. In ECCV.

  • Morgado, P., Li, Y., & Vasconcelos, N. (2020). Learning representations from audio-visual spatial alignment. In NeurIPS.

  • Morgado, P., Vasconcelos, N., Langlois, T., & Wang, O. (2018). Self-supervised generation of spatial audio for 360\({}^\circ \) video. In NeurIPS.

  • Murphy, D. T., & Shelley, S. (2010). OpenAIR: An interactive auralization web resource and database. In Audio Engineering Society Convention 129.

  • Owens, A., & Efros, A. A. (2018). Audio-visual scene analysis with self-supervised multisensory features. In ECCV.

  • Owens, A., Isola, P., McDermott, J., Torralba, A., Adelson, E. H., & Freeman, W. T. (2016). Visually indicated sounds. In CVPR.

  • Owens, A., Wu, J., McDermott, J. H., Freeman, W. T., & Torralba, A. (2016). Ambient sound provides supervision for visual learning. In ECCV.

  • Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Köpf, A., Yang, E. Z., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., & Chintala, S. (2019). PyTorch: An imperative style, high-performance deep learning library. In NeurIPS.

  • Perraudin, N., Balazs, P., & Søndergaard, P. L. (2013). A fast Griffin-Lim algorithm. In WASPAA.

  • Purushwalkam, S., Gari, S. V. A., Ithapu, V. K., Schissler, C., Robinson, P., Gupta, A., & Grauman, K. (2021). Audio-visual floorplan reconstruction. In ICCV.

  • Rayleigh, L. (1875). On our perception of the direction of a source of sound. In Proceedings of the Musical Association.

  • Richard, A., Markovic, D., Gebru, I. D., Krenn, S., Butler, G., de la Torre, F., & Sheikh, Y. (2021). Neural synthesis of binaural speech from mono audio. In ICLR.

  • Ronneberger, O., Fischer, P., & Brox, T. (2015). U-Net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI).

  • Rouditchenko, A., Zhao, H., Gan, C., McDermott, J., & Torralba, A. (2019). Self-supervised audio-visual co-segmentation. In ICASSP.

  • Savva, M., Kadian, A., Maksymets, O., Zhao, Y., Wijmans, E., Jain, B., Straub, J., Liu, J., Koltun, V., Malik, J., Parikh, D., & Batra, D. (2019). Habitat: A platform for embodied AI research. In ICCV.

  • Schissler, C., Loftin, C., & Manocha, D. (2017). Acoustic classification and optimization for multi-modal rendering of real-world scenes. IEEE Transactions on Visualization and Computer Graphics.

  • Schroeder, M. R. (1965). New method of measuring reverberation time. The Journal of the Acoustical Society of America, 37(6), 1187–1188.

  • Senocak, A., Oh, T.-H., Kim, J., Yang, M.-H., & So Kweon, I. (2018). Learning to localize sound source in visual scenes. In CVPR.

  • Tang, Z., Bryan, N. J., Li, D., Langlois, T. R., & Manocha, D. (2020). Scene-aware audio rendering via deep acoustic analysis. IEEE Transactions on Visualization and Computer Graphics.

  • Tian, Y., Li, D., & Xu, C. (2020). Unified multisensory perception: Weakly-supervised audio-visual video parsing. In ECCV.

  • Tian, Y., Shi, J., Li, B., Duan, Z., & Xu, C. (2018). Audio-visual event localization in unconstrained videos. In ECCV.

  • Tzinis, E., Wisdom, S., Jansen, A., Hershey, S., Remez, T., Ellis, D. P., & Hershey, J. R. (2021). Into the wild with AudioScope: Unsupervised audio-visual separation of on-screen sounds. In ICLR.

  • Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-SNE. Journal of Machine Learning Research.

  • Wu, Y., Zhu, L., Yan, Y., & Yang, Y. (2019). Dual attention matching for audio-visual event localization. In ICCV.

  • Xu, X., Dai, B., & Lin, D. (2019). Recursive visual sound separation using minus-plus net. In ICCV.

  • Xu, X., Zhou, H., Liu, Z., Dai, B., Wang, X., & Lin, D. (2021). Visually informed binaural audio generation without binaural audios. In CVPR.

  • Yang, K., Russell, B., & Salamon, J. (2020). Telling left from right: Learning spatial correspondence of sight and sound. In CVPR.

  • Yu, J., Zhang, S.-X., Wu, J., Ghorbani, S., Wu, B., Kang, S., Liu, S., Liu, X., Meng, H., & Yu, D. (2020). Audio-visual recognition of overlapped speech for the LRS2 dataset. In ICASSP.

  • Zaunschirm, M., Schörkhuber, C., & Höldrich, R. (2018). Binaural rendering of ambisonic signals by head-related impulse response time alignment and a diffuseness constraint. The Journal of the Acoustical Society of America, 143(6), 3616–3627.


  • Zhao, H., Gan, C., Ma, W.-C., & Torralba, A. (2019). The sound of motions. In ICCV.

  • Zhao, H., Gan, C., Rouditchenko, A., Vondrick, C., McDermott, J., & Torralba, A. (2018). The sound of pixels. In ECCV.

  • Zhou, H., Liu, Y., Liu, Z., Luo, P., & Wang, X. (2019). Talking face generation by adversarially disentangled audio-visual representation. In AAAI.

  • Zhou, H., Xu, X., Lin, D., Wang, X., & Liu, Z. (2020). Sep-stereo: Visually guided stereophonic audio generation by associating source separation. In ECCV.

  • Zhou, Y., Wang, Z., Fang, C., Bui, T., & Berg, T. L. (2018). Visual to sound: Generating natural sound for videos in the wild. In CVPR.


Author information


Corresponding author

Correspondence to Rishabh Garg.



About this article


Cite this article

Garg, R., Gao, R. & Grauman, K. Visually-Guided Audio Spatialization in Video with Geometry-Aware Multi-task Learning. Int J Comput Vis 131, 2723–2737 (2023). https://doi.org/10.1007/s11263-023-01816-8

