A reweighting method for speech recognition with imbalanced data of Mandarin and sub-dialects

Wu, Jiaju; Wen, Zhengchang; Huang, Haitian; Su, Hanjing; Liu, Fei; Wang, Huan; Ding, Yi; Wu, Qingyao

doi:10.1007/s11761-024-00384-0

Original Research Paper
Published: 26 March 2024

Service Oriented Computing and Applications (2024)

Jiaju Wu^1,3,
Zhengchang Wen^1,3,
Haitian Huang⁶,
Hanjing Su⁵,
Fei Liu¹,
Huan Wang⁷,
Yi Ding² &
…
Qingyao Wu^1,3,4 (ORCID: orcid.org/0000-0002-8564-7289)


Abstract

Automatic speech recognition (ASR) is an important technology in many fields, such as video-sharing services, online education and live broadcasting. Most recent ASR methods are based on deep learning. A dataset containing training samples of standard Mandarin and its sub-dialects can be used to train a neural network-based ASR model that recognizes both. Because the cost of collecting data differs across sub-dialects, the dataset usually contains far more training samples of standard Mandarin than of the sub-dialects, so the model recognizes standard Mandarin much better than the sub-dialects. In this paper, to improve recognition of sub-dialects, we propose to reweight the recognition loss for each sub-dialect based on its similarity to standard Mandarin. The proposed reweighting method makes the model pay more attention to sub-dialects with larger loss weights, alleviating their poor recognition performance. Our model was trained and validated on KeSpeech, an open-source dataset that includes standard Mandarin and its eight sub-dialects. Experimental results show that the proposed model recognizes most sub-dialects better than the baseline and achieves a Character Error Rate about 0.5 lower than the baseline.
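
The abstract does not state the weighting formula, so the following is only a minimal sketch of similarity-based loss reweighting under stated assumptions. The similarity scores, the dialect names, the linear rule `1 + alpha * (1 - similarity)`, and the choice to give less similar sub-dialects larger weights are all hypothetical and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# Hypothetical similarity scores in [0, 1]; these are NOT values from the paper.
SIMILARITY_TO_MANDARIN: Dict[str, float] = {
    "Mandarin": 1.0,
    "dialect_a": 0.8,
    "dialect_b": 0.5,
}


def dialect_weights(similarity: Dict[str, float], alpha: float = 1.0) -> Dict[str, float]:
    """Assumed rule: the loss weight grows linearly as similarity to Mandarin falls."""
    return {name: 1.0 + alpha * (1.0 - s) for name, s in similarity.items()}


def reweighted_loss(batch: List[Tuple[str, float]], weights: Dict[str, float]) -> float:
    """Weighted mean of per-utterance losses; batch holds (dialect, loss) pairs."""
    total_weight = sum(weights[dialect] for dialect, _ in batch)
    return sum(weights[dialect] * loss for dialect, loss in batch) / total_weight


weights = dialect_weights(SIMILARITY_TO_MANDARIN, alpha=1.0)

# A toy batch dominated by Mandarin, with one harder sub-dialect utterance.
batch = [("Mandarin", 0.2), ("Mandarin", 0.2), ("dialect_b", 1.0)]
plain = sum(loss for _, loss in batch) / len(batch)
weighted = reweighted_loss(batch, weights)
print(f"plain mean loss: {plain:.3f}, reweighted loss: {weighted:.3f}")
```

In this toy batch the reweighted loss is larger than the plain mean because the sub-dialect utterance contributes more, which is the sense in which a model trained on it would "pay more attention" to that sub-dialect. In a real training loop the per-utterance losses would come from the ASR model rather than from fixed numbers.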



Funding

This work was supported by the National Natural Science Foundation of China (NSFC) under Grant 62272172, the Guangdong Basic and Applied Basic Research Foundation under Grant 2023A1515012920, and the Basic and Applied Basic Research Project of the Guangzhou Basic Research Program under Grant No. 2023A04J1051.

Author information

Authors and Affiliations

School of Software Engineering, South China University of Technology, Guangzhou, China
Jiaju Wu, Zhengchang Wen, Fei Liu & Qingyao Wu
Hunan University of Arts and Science, Changde, China
Yi Ding
Key Laboratory of Big Data and Intelligent Robot, Ministry of Education, Guangzhou, China
Jiaju Wu, Zhengchang Wen & Qingyao Wu
Pazhou Lab, Guangzhou, China
Qingyao Wu
Tencent Wechat Department, Shenzhen, China
Hanjing Su
Shenzhen Zhenhua Microelectronics, Ltd. - ZHM, Shenzhen, China
Haitian Huang
Industrial Technology Research Center, Guangdong Institute of Scientific and Technical Information, Guangzhou, China
Huan Wang

Contributions

Jiaju Wu and Zhengchang Wen developed the proposed method and drafted the manuscript. Yi Ding supervised the project, contributed to the discussion and analysis, and provided important suggestions for the paper. Haitian Huang, Hanjing Su, Fei Liu, Huan Wang and Qingyao Wu participated in the discussion of the proposed method. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Fei Liu or Yi Ding.

Ethics declarations

Conflict of interest

The authors declare no potential conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wu, J., Wen, Z., Huang, H. et al. A reweighting method for speech recognition with imbalanced data of Mandarin and sub-dialects. SOCA (2024). https://doi.org/10.1007/s11761-024-00384-0

Received: 15 March 2023
Revised: 12 January 2024
Accepted: 22 January 2024
Published: 26 March 2024
DOI: https://doi.org/10.1007/s11761-024-00384-0
