Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios


Published in: Journal of Shanghai Jiaotong University (Science)

Abstract

Conventional target speech separation directly estimates the target source and ignores the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts every speaker's voice from the mixed speech, rather than optimally estimating only the target source. Moreover, we propose a speaker-diarization-aware MTSS system (SD-MTSS). By exploiting target-speaker voice activity detection (TSVAD) and the estimated masks, the SD-MTSS system can extract the speech signal of each speaker concurrently from a conversational recording, without requiring additional enrollment audio in advance. Experimental results show that the MTSS model achieves improvements of 1.38 dB in signal-to-distortion ratio (SDR), 1.34 dB in scale-invariant signal-to-distortion ratio (SI-SDR), and 0.13 in perceptual evaluation of speech quality (PESQ) over the baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate on the AliMeeting dataset.
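
To make the reported separation metrics concrete, the following is a minimal sketch of how SI-SDR is conventionally computed between a separated waveform and its clean reference; the improvements quoted above are differences in such scores relative to the baseline system. This NumPy implementation and the toy usage at the end are illustrative assumptions, not code from the paper.

import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB for 1-D waveforms."""
    # Remove DC offsets so the score is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the projection is the "target"
    # component, and the residual is counted as distortion.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))

# Hypothetical usage: score a separated waveform against the clean source.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                       # 1 s of "clean" speech at 16 kHz
separated = clean + 0.1 * rng.standard_normal(16000)     # imperfect separation output
print(f"SI-SDR: {si_sdr(separated, clean):.2f} dB")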




Acknowledgment

The authors thank the Advanced Computing East China Sub-Center for providing computational resources.

Author information


Corresponding author

Correspondence to Ming Li (李明).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

Foundation item: the National Natural Science Foundation of China (No. 62171207), the Science and Technology Program of Suzhou City (No. SYC2022051), and OPPO


About this article


Cite this article

Zeng, B., Suo, H., Wan, Y. et al. Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2739-7

