
A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement

Published in Circuits, Systems, and Signal Processing (Springer, 2024).

Abstract

Speech enhancement in 3D reverberant environments is a challenging and significant problem for many downstream applications, such as speech recognition, speaker identification, and audio analysis. Existing deep neural network models have shown efficacy for 3D speech enhancement, but they often introduce distortions or unnatural artifacts into the enhanced speech. In this work, we propose a novel two-stage refiner system that integrates a neural beamforming network and a diffusion model for robust 3D speech enhancement. The neural beamforming network performs spatial filtering to suppress noise and reverberation, while the diffusion model leverages its generative capability to restore missing or distorted speech components from the beamformed output. To the best of our knowledge, this is the first work to apply a diffusion model as a backend refiner for 3D speech enhancement. We investigate the effect of training the diffusion model on either enhanced speech or clean speech, and find that training on clean speech better captures the prior distribution of speech components and improves speech recovery. We evaluate the proposed system on different datasets and beamformer architectures, and show that it achieves consistent improvements in metrics such as word error rate (WER) and NISQA, indicating that the diffusion model generalizes well and can serve as a backend refinement module for 3D speech enhancement regardless of the front-end beamforming network. Our work demonstrates the effectiveness of integrating discriminative and generative models for robust 3D speech enhancement, and opens a new direction for applying generative diffusion models, as a backend to various beamforming-based enhancement methods, in 3D speech processing tasks.
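To make the two-stage design concrete, the sketch below (Python/PyTorch) shows one plausible form of the inference pipeline: a learned beamformer collapses the multichannel mixture to a single enhanced channel, and a score-based refiner runs a few annealed Langevin steps initialized from that estimate rather than from pure noise. This is a minimal illustration under our own assumptions, not the authors' implementation; TinyBeamformer, ScoreNet, and refine are hypothetical placeholder names.

import torch
import torch.nn as nn

class TinyBeamformer(nn.Module):
    # Stage 1 (stand-in): maps a multichannel mixture (B, C, T) to one
    # channel (B, T) via learned per-channel weights; a real front end
    # would be a full neural beamforming network.
    def __init__(self, channels: int = 4):
        super().__init__()
        self.mix = nn.Conv1d(channels, 1, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.mix(x).squeeze(1)

class ScoreNet(nn.Module):
    # Stage 2 (stand-in): predicts the score (gradient of the log-density
    # of clean speech) for a noisy input at a given noise level sigma.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(2, 16, 5, padding=2), nn.ReLU(),
            nn.Conv1d(16, 1, 5, padding=2),
        )

    def forward(self, y: torch.Tensor, sigma: torch.Tensor) -> torch.Tensor:
        # Condition on the noise level by appending it as an extra channel.
        s = sigma.view(-1, 1, 1).expand(-1, 1, y.shape[-1])
        return self.net(torch.cat([y.unsqueeze(1), s], dim=1)).squeeze(1)

@torch.no_grad()
def refine(score_net: ScoreNet, y: torch.Tensor, steps: int = 30,
           sigma_max: float = 0.5) -> torch.Tensor:
    # Annealed Langevin refinement: start from the beamformed estimate
    # plus mild noise, so only residual distortion has to be removed.
    x = y + sigma_max * torch.randn_like(y)
    for sigma in torch.linspace(sigma_max, 1e-3, steps):
        alpha = sigma ** 2  # step size annealed with the noise level
        x = x + 0.5 * alpha * score_net(x, sigma.expand(x.shape[0]))
        x = x + torch.sqrt(alpha) * torch.randn_like(x)
    return x

# Usage: 4-channel mixture -> beamformed estimate -> refined output.
mixture = torch.randn(1, 4, 16000)          # one second at 16 kHz
enhanced = TinyBeamformer()(mixture)        # stage 1: spatial filtering
refined = refine(ScoreNet(), enhanced)      # stage 2: generative refinement

Initializing the reverse process from the beamformed signal, instead of pure noise, is what lets a short refinement schedule suffice; a practical system would train the score network on clean speech (per the finding above) and would typically operate on complex STFT coefficients rather than raw waveforms.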


Data Availability

All L3DAS challenge datasets used in this study are publicly available at https://www.l3das.com/editions.html. Code is available at https://github.com/flchenwhu/3D-SE-Diffusion.


Acknowledgements

This work was supported in part by the Jiangxi Province Degree and Postgraduate Education Teaching Reform Project (No. JXYJG-2023-134), the Nanchang Hangkong University PhD Foundation (No. EA201904283), and the Nanchang Hangkong University Graduate Foundation (No. YC2022-044).

Author information

Corresponding author

Correspondence to Feilong Chen.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Cite this article

Chen, F., Lin, W., Sun, C. et al. A Two-Stage Beamforming and Diffusion-Based Refiner System for 3D Speech Enhancement. Circuits Syst Signal Process (2024). https://doi.org/10.1007/s00034-024-02652-y
