
A New Neural Beamformer for Multi-channel Speech Separation

Published in the Journal of Signal Processing Systems

Abstract

Speech separation is key to many downstream speech tasks, such as multi-speaker speech recognition. In recent years, aided by advances in deep learning, many single-channel speech separation models have shown good performance in weakly reverberant environments. In the presence of reverberation, however, multi-channel speech separation models still hold a clear advantage. Among them, deep neural network (DNN) based beamformers, also known as neural beamformers, have achieved significant improvements in separation quality. Current neural beamformers cannot jointly optimize the beamforming layers and the DNN layers while exploiting prior knowledge from existing beamforming algorithms, which may prevent the model from reaching its optimal separation performance. To address this problem, this paper employs a set of beamformers that uniformly sample the space as a learnable module in the neural network, with their initial coefficients determined by the existing maximum directivity factor (DF) beamformer. Furthermore, a cross-attention mechanism is introduced to obtain beam representations of the source signals when their directions are unknown. Experimental results show that, on separation tasks with reverberation, the proposed method outperforms the current state-of-the-art time-domain neural beamformer, the filter-and-sum network (FasNet), as well as several mainstream multi-channel speech separation approaches, in terms of scale-invariant signal-to-noise ratio (SI-SNR), perceptual evaluation of speech quality (PESQ), and short-time objective intelligibility (STOI).
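The abstract describes two concrete ideas: a bank of fixed beamformers that uniformly sample the space, used as a learnable layer whose coefficients are initialized from maximum-DF beamformers, and a cross-attention step that forms per-source beam representations when the source directions are unknown. The following PyTorch sketch illustrates that idea only; it is not the authors' implementation, and the module names, tensor shapes, and learned per-source queries are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class LearnableBeamformerBank(nn.Module):
    """A bank of B beamformers over M microphones and F frequency bins (hypothetical sketch).

    The complex coefficients w[b, f, m] are initialized from precomputed
    maximum-DF (superdirective) beamformers that uniformly sample the space,
    then fine-tuned jointly with the rest of the network.
    """

    def __init__(self, init_weights: torch.Tensor):
        super().__init__()
        # init_weights: (B, F, M) complex tensor of max-DF coefficients
        self.weights = nn.Parameter(init_weights.clone())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, F, T, M) complex STFT of the microphone signals
        # y[n, b, f, t] = w[b, f, :]^H x[n, f, t, :]   (filter-and-sum per beam)
        return torch.einsum("bfm,nftm->nbft", self.weights.conj(), x)


class BeamCrossAttention(nn.Module):
    """Cross-attention that combines the B beam outputs into one representation
    per source when the source directions are unknown (hypothetical sketch)."""

    def __init__(self, dim: int, num_sources: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_sources, dim))  # one learned query per source
        self.key_proj = nn.Linear(dim, dim)

    def forward(self, beam_feats: torch.Tensor) -> torch.Tensor:
        # beam_feats: (batch, B, dim) embedding of each beam's output
        keys = self.key_proj(beam_feats)                            # (batch, B, dim)
        scores = torch.matmul(self.queries, keys.transpose(1, 2))   # (batch, num_sources, B)
        attn = torch.softmax(scores / keys.shape[-1] ** 0.5, dim=-1)
        return torch.matmul(attn, beam_feats)                       # (batch, num_sources, dim)
```

Because the bank is an ordinary set of trainable parameters, the beamforming coefficients and the downstream DNN layers can be optimized jointly by backpropagation while still starting from prior beamforming knowledge, which is the gap the abstract identifies in existing neural beamformers.

For reference, the SI-SNR metric named in the abstract is the standard scale-invariant signal-to-noise ratio; a common definition is:

```python
def si_snr(est: torch.Tensor, ref: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Scale-invariant SNR in dB between estimate and reference waveforms of shape (..., T)."""
    est = est - est.mean(dim=-1, keepdim=True)
    ref = ref - ref.mean(dim=-1, keepdim=True)
    # Project the estimate onto the reference; the metric is invariant to the estimate's scale.
    target = (est * ref).sum(-1, keepdim=True) / (ref.pow(2).sum(-1, keepdim=True) + eps) * ref
    noise = est - target
    return 10 * torch.log10((target.pow(2).sum(-1) + eps) / (noise.pow(2).sum(-1) + eps))
```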

Author information

Corresponding author

Correspondence to Yi Zhou.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Liu, R., Zhou, Y., Liu, H. et al. A New Neural Beamformer for Multi-channel Speech Separation. J Sign Process Syst 94, 977–987 (2022). https://doi.org/10.1007/s11265-022-01770-7
