Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios


Published in: Journal of Shanghai Jiaotong University (Science)

Abstract

Conventional target speech separation directly estimates the target source and ignores the interrelationship between different speakers at each frame. We propose a multiple-target speech separation (MTSS) model that simultaneously extracts every speaker's voice from the mixed speech, rather than optimally estimating only the target source. Moreover, we propose a speaker-diarization-aware MTSS system (SD-MTSS). By exploiting target-speaker voice activity detection (TSVAD) and the estimated masks, the SD-MTSS system can extract the speech signal of each speaker concurrently from a conversational recording, without requiring additional enrollment audio in advance. Experimental results show that the MTSS model achieves improvements of 1.38 dB in signal-to-distortion ratio (SDR), 1.34 dB in scale-invariant signal-to-distortion ratio (SI-SDR), and 0.13 in perceptual evaluation of speech quality (PESQ) over the baseline on the WSJ0-2mix-extr dataset. The SD-MTSS system achieves a 19.2% relative reduction in speaker-dependent character error rate on the AliMeeting dataset.
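
To make the reported separation metrics concrete, the following is a minimal sketch of how SI-SDR is conventionally computed between a separated waveform and its clean reference; the improvements quoted above are differences in such scores relative to the baseline system. This NumPy implementation and the toy usage at the end are illustrative assumptions, not code from the paper.

import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio (SI-SDR) in dB for 1-D waveforms."""
    # Remove DC offsets so the score is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference: the projection is the "target"
    # component, and the residual is counted as distortion.
    scale = np.dot(estimate, reference) / np.dot(reference, reference)
    target = scale * reference
    distortion = estimate - target
    return 10.0 * np.log10(np.sum(target ** 2) / np.sum(distortion ** 2))

# Hypothetical usage: score a separated waveform against the clean source.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                       # 1 s of "clean" speech at 16 kHz
separated = clean + 0.1 * rng.standard_normal(16000)     # imperfect separation output
print(f"SI-SDR: {si_sdr(separated, clean):.2f} dB")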




Acknowledgment

The authors thank the Advanced Computing East China Sub-Center for providing computational resources.

Author information


Corresponding author

Correspondence to Ming Li (李明).

Ethics declarations

Conflict of Interest: The authors declare that they have no conflict of interest.

Additional information

Foundation item: the National Natural Science Foundation of China (No. 62171207), the Science and Technology Program of Suzhou City (No. SYC2022051), and OPPO


About this article


Cite this article

Zeng, B., Suo, H., Wan, Y. et al. Simultaneous Speech Extraction for Multiple Target Speakers Under Meeting Scenarios. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2739-7

