Abstract
Stereophonic audio is an indispensable ingredient for enhancing the human auditory experience. Recent research has explored using visual information as guidance to generate binaural or ambisonic audio from mono recordings, with stereo supervision. However, this fully supervised paradigm suffers from an inherent drawback: recording stereophonic audio usually requires delicate devices that are too expensive for wide accessibility. To overcome this challenge, we propose to leverage the vast amount of available mono data to facilitate stereophonic audio generation. Our key observation is that visually indicated audio separation also maps independent audio sources to their corresponding visual positions, an objective it shares with stereophonic audio generation. We integrate stereo generation and source separation into a unified framework, Sep-Stereo, by treating source separation as a particular type of audio spatialization. Specifically, we carefully design a novel associative pyramid network architecture for audio-visual feature fusion. Extensive experiments demonstrate that our framework improves stereophonic audio generation while performing accurate sound separation with a shared backbone. (Code, models, and a demo video are available at https://hangz-nju-cuhk.github.io/projects/Sep-Stereo.)
H. Zhou and X. Xu contributed equally.
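To make the audio-visual fusion idea concrete, below is a minimal PyTorch sketch of one fusion step in the spirit the abstract describes: pooled visual features act as 1x1 convolution kernels over spectrogram features, so that visual content steers the audio stream. This is an illustrative simplification under assumed shapes and names (AssociativeFusion, vis_dim, aud_dim, and the pooling choice are ours), not the released Sep-Stereo implementation; see the project page above for the authors' code.

# Hypothetical, simplified sketch of visually guided audio feature fusion.
# Not the paper's actual associative pyramid; shapes and names are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AssociativeFusion(nn.Module):
    """Fuse a visual feature map with an audio (spectrogram) feature map
    by turning pooled visual vectors into per-sample 1x1 conv kernels."""
    def __init__(self, vis_dim: int, aud_dim: int, out_dim: int):
        super().__init__()
        # Project visual features to one 1x1 kernel per output channel.
        self.to_kernels = nn.Linear(vis_dim, out_dim * aud_dim)
        self.aud_dim, self.out_dim = aud_dim, out_dim

    def forward(self, vis: torch.Tensor, aud: torch.Tensor) -> torch.Tensor:
        # vis: (B, vis_dim, Hv, Wv) frame features, e.g. from a CNN backbone
        # aud: (B, aud_dim, F, T)  spectrogram features, e.g. from a U-Net
        v = vis.mean(dim=(2, 3))                 # global pool -> (B, vis_dim)
        k = self.to_kernels(v)                   # (B, out_dim * aud_dim)
        k = k.view(-1, self.out_dim, self.aud_dim, 1, 1)
        # Apply each sample's visual kernels to its own audio features.
        fused = [F.conv2d(a.unsqueeze(0), w) for a, w in zip(aud, k)]
        return torch.cat(fused, dim=0)           # (B, out_dim, F, T)

if __name__ == "__main__":
    fusion = AssociativeFusion(vis_dim=512, aud_dim=64, out_dim=64)
    vis = torch.randn(2, 512, 7, 7)    # visual backbone output (assumed size)
    aud = torch.randn(2, 64, 128, 32)  # audio U-Net features (assumed size)
    print(fusion(vis, aud).shape)      # torch.Size([2, 64, 128, 32])

In the paper's setting, fused features of this kind would feed a decoder that predicts the left-right difference channel for stereo generation, or per-source masks for separation; the sketch only shows the shared fusion step.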
Acknowledgements
This work is supported by SenseTime Group Limited, the General Research Fund through the Research Grants Council of Hong Kong under Grants CUHK14202217, CUHK14203118, CUHK14205615, CUHK14207814, CUHK14208619, CUHK14213616, CUHK14203518, and Research Impact Fund R5001-18.
Electronic Supplementary Material
Supplementary material: demo video (mp4, 86,709 KB)
Cite this paper
Zhou, H., Xu, X., Lin, D., Wang, X., Liu, Z.: Sep-Stereo: visually guided stereophonic audio generation by associating source separation. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.M. (eds.) Computer Vision – ECCV 2020. LNCS, vol. 12357. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58610-2_4