Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

Xu, Luzhen; Yan, Haoyin; He, Maokui; Guo, Zixian; Zhou, Yeping; Liu, Peiqi; Zhang, Jie; Dai, Lirong

doi:10.1007/s12204-024-2715-2

Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

基于多帧跨通道注意力和说话人日志的多通道多方会议转录说话人相关自动语音识别系统

Published: 08 April 2024

(2024)
Cite this article

Journal of Shanghai Jiaotong University (Science) Aims and scope Submit manuscript

Luzhen Xu (许露真)¹,
Haoyin Yan (严浩尹)¹,
Maokui He (何茂奎)¹,
Zixian Guo (郭子娴)¹,
Yeping Zhou (周叶萍)²,
Peiqi Liu (刘沛奇)²,
Jie Zhang (张结)¹ &
…
Lirong Dai (戴礼荣)¹

40 Accesses
Explore all metrics

Abstract

This paper describes a speaker-attributed automatic speech recognition (SA-ASR) system submitted to the multi-channel multi-party meeting transcription challenge, which aims to address the “who spoke what” problem. We align the serialized output training-based multi-speaker ASR hypotheses and speaker diarization (SD) results to obtain speaker-attributed transcriptions. We use a pre-trained multi-frame cross-channel attention (MFCCA) model as the ASR module. We build a cascade system which includes a pre-trained speaker overlap-aware neural diarization and target-speaker voice activity detection model as the SD module. Decoding and alignment strategies are further used to improve the SA-ASR performance. Our proposed system outperforms the baseline with a relative improvement of 40.3% in terms of concatenated minimum-permutation character error rate on the AliMeeting dataset, which ranks top-3 on the fixed sub-track.

摘要

本文介绍了我们提交给多通道多方会议转录(M2MeT2.0)比赛的说话人相关自动语音识别 (SA-ASR)系统, 该系统旨在解决“谁说了什么”问题。将基于序列化输出训练的多说话人语音识别转录和说话人日志结果对齐, 以获得说话人相关的转录。使用预训练的多帧跨通道注意力(MFCCA)模型作为语音识别模块。构建了一个级联系统, 其中包括一个预训练的说话人重叠感知神经日志和目标说话人语音活动检测模型作为说话人日志模块。使用解码和对齐策略来进一步提高SA-ASR性能。提出的系统在AliMeeting数据集上的级联最小排列字符错误率方面优于基线, 且相对提高了40.3%, 在限定数据子赛道上排名前三。

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Speaker extraction network with attention mechanism for speech dialogue system

Article 13 June 2022

Exploring single channel speech separation for short-time text-dependent speaker verification

Article 13 January 2022

Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

Article Open access 01 July 2023

References

FISCUS J G, AJOT J, GAROFOLO J S. The rich transcription 2007 meeting recognition evaluation [M]//Multimodal technologies for perception of humans. Berlin, Heidelberg: Springer Berlin Heidelberg, 2008: 373–389.
Chapter Google Scholar
YU D, CHANG X K, QIAN Y M. Recognizing multi-talker speech with permutation invariant training [C]//Interspeech 2017. Stockholm: ISCA, 2017: 2456–2460.
Chapter Google Scholar
SHI M, DU Z, CHEN Q, et al. CASA-ASR: Context-aware speaker-attributed ASR [DB/OL]. (2023-05-21). https://arxiv.org/abs/2305.12459
SEKI H, HORI T, WATANABE S, et al. A purely end-to-end system for multi-speaker speech recognition [C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: ACL, 2018: 2620–2630.
Google Scholar
KANDA N, GAUR Y, WANG X F, et al. Serialized output training for end-to-end overlapped speech recognition [C]//Interspeech 2020. Shanghai: ISCA, 2020: 2797–2801.
Chapter Google Scholar
YU F, ZHANG S L, FU Y H, et al. M2MeT: The ICASSP 2022 multi-channel multi-party meeting transcription challenge [DB/OL]. (2021-10-14). http://arxiv.org/abs/2110.07393
YU F, ZHANG S L, GUO P C, et al. Summary on the ICASSP 2022 multi-channel multi-party meeting transcription grand challenge [DB/OL]. (2022-02-08). http://arxiv.org/abs/2202.03647
YU F, DU Z H, ZHANG S L, et al. A comparative study on speaker-attributed automatic speech recognition in multi-party meetings [C]//Interspeech 2022. Incheon: ISCA, 2022: 560–564.
Chapter Google Scholar
FU Y H, CHENG L Y, LV S B, et al. AISHELL-4: An open source dataset for speech enhancement, separation, recognition and speaker diarization in conference scenario [C]//Interspeech 2021. Brno: ISCA, 2021: 3665–3669.
Chapter Google Scholar
FAN Y, KANG J W, LI L T, et al. CN-CELEB: A challenging Chinese speaker recognition dataset [DB/OL]. (2019-10-31). http://arxiv.org/abs/1911.01799
HE M K, LV X, ZHOU W L, et al. The USTC-Ximalaya system for the ICASSP 2022 multichannel multi-party meeting transcription (M2MeT) challenge [DB/OL]. (2022-02-10). http://arxiv.org/abs/2202.04855
DU Z H, ZHANG S L, ZHENG S Q, et al. Speaker overlap-aware neural diarization for multi-party meeting analysis [C]//Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing. Abu Dhabi: ACL, 2022: 7458–7469.
Chapter Google Scholar
SHI M H, ZHANG J, DU Z H, et al. A comparative study on multichannel speaker-attributed automatic speech recognition in multi-party meetings [DB/OL]. (2022-11-01). http://arxiv.org/abs/2211.00511
YU F, ZHANG S L, GUO P C, et al. MFCCA: Multiframe cross-channel attention for multi-speaker ASR in multi-party meeting scenario [C]//2022 IEEE Spoken Language Technology Workshop. Doha: IEEE, 2023: 144–151.
Chapter Google Scholar
CHANG F J, RADFAR M, MOUCHTARIS A, et al. Multi-channel transformer transducer for speech recognition [C]//Interspeech 2021. Brno: ISCA, 2021: 296–300.
Chapter Google Scholar
WANG W Q, QIN X Y, LI M. Cross-channel attention-based target speaker voice activity detection: Experimental results for the M2met challenge [C]//ICASSP 2022–2022 IEEE International Conference on Acoustics, Speech and Signal Processing. Singapore: IEEE, 2022: 9171–9175.
Chapter Google Scholar
MEDENNIKOV I, KORENEVSKY M, PRISYACH T, et al. Target-speaker voice activity detection: A novel approach for multi-speaker diarization in a dinner party scenario [C]//Interspeech 2020. Shanghai: ISCA, 2020: 274–278.
Chapter Google Scholar
HE M K, RAJ D, HUANG Z L, et al. Target-speaker voice activity detection with improved i-vector estimation for unknown number of speaker [C]//Interspeech 2021. Brno: ISCA, 2021: 3555–3559.
Chapter Google Scholar
SNYDER D, CHEN G G, POVEY D. MUSAN: A music, speech, and noise corpus [DB/OL]. (2015-10-28). http://arxiv.org/abs/1510.08484
YOSHIOKA T, NAKATANI T. Generalization of multi-channel linear prediction methods for blind MIMO impulse response shortening [J]. IEEE Transactions on Audio, Speech, and Language Processing, 2012, 20(10): 2707–2720.
Article Google Scholar
ZHANG H Y, CISSE M, DAUPHIN Y N, et al. Mixup: Beyond empirical risk minimization [DB/OL]. (2017-10-25). http://arxiv.org/abs/1710.09412
KINGMA D P, BA J. Adam: A method for stochastic optimization [DB/OL]. (2014-12-22). http://arxiv.org/abs/1412.6980
FISCUS J G, AJOT J, MICHEL M, et al. The rich transcription 2006 spring meeting recognition evaluation [M]//Machine learning for multimodal interaction. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006: 309–322.
Chapter Google Scholar

Download references

Author information

Authors and Affiliations

Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, 230026, China
Luzhen Xu (许露真), Haoyin Yan (严浩尹), Maokui He (何茂奎), Zixian Guo (郭子娴), Jie Zhang (张结) & Lirong Dai (戴礼荣)
China Merchants Bank, Shenzhen, Guangdong, 518048, China
Yeping Zhou (周叶萍) & Peiqi Liu (刘沛奇)

Authors

Luzhen Xu (许露真)
View author publications
You can also search for this author in PubMed Google Scholar
Haoyin Yan (严浩尹)
View author publications
You can also search for this author in PubMed Google Scholar
Maokui He (何茂奎)
View author publications
You can also search for this author in PubMed Google Scholar
Zixian Guo (郭子娴)
View author publications
You can also search for this author in PubMed Google Scholar
Yeping Zhou (周叶萍)
View author publications
You can also search for this author in PubMed Google Scholar
Peiqi Liu (刘沛奇)
View author publications
You can also search for this author in PubMed Google Scholar
Jie Zhang (张结)
View author publications
You can also search for this author in PubMed Google Scholar
Lirong Dai (戴礼荣)
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Jie Zhang (张结).

Ethics declarations

Conflict of Interest The authors declare that they have no conflict of interest.

Additional information

Foundation item: the National Natural Science Foundation of China (No. 62101523), the Joint AI Laboratory of CMB-USTC (No. FTIT2022058), and the USTC Research Funds of the Double First-Class Initiative (No. YD2100002008)

Rights and permissions

Reprints and permissions

About this article

Cite this article

Xu, L., Yan, H., He, M. et al. Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2715-2

Download citation

Received: 19 December 2023
Accepted: 05 January 2024
Published: 08 April 2024
DOI: https://doi.org/10.1007/s12204-024-2715-2

Keywords

关键词

CLC number

TN912.34

Document code

A

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

Abstract

摘要

Access this article

Similar content being viewed by others

Speaker extraction network with attention mechanism for speech dialogue system

Exploring single channel speech separation for short-time text-dependent speaker verification

Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Additional information

Rights and permissions

About this article

Cite this article

Keywords

关键词

CLC number

Document code

Navigation

Multi-Frame Cross-Channel Attention and Speaker Diarization Based Speaker-Attributed Automatic Speech Recognition System for Multi-Channel Multi-Party Meeting Transcription

Abstract

摘要

Access this article

Similar content being viewed by others

Speaker extraction network with attention mechanism for speech dialogue system

Exploring single channel speech separation for short-time text-dependent speaker verification

Multi-task deep cross-attention networks for far-field speaker verification and keyword spotting

References

Author information

Authors and Affiliations

Corresponding author

Ethics declarations

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

关键词

CLC number

Document code

Search

Navigation