
Exploring Generation of Pronunciation Lexicon for Low-Resource Language Automatic Speech Recognition Based on Generic Phone Recognizer


Published in: Journal of Shanghai Jiaotong University (Science)

Abstract

The lexicon is an essential component of a hybrid automatic speech recognition (ASR) system. However, a high-quality lexicon requires significant effort from linguistic experts and is difficult to obtain, especially for low-resource languages. This paper addresses the problem of using a well-trained universal phone recognizer, obtained by training on multilingual speech data and pronunciation lexicons, to generate pronunciation lexicons for low-resource languages in a speech-data-driven way. We propose a simple pipeline that generates pronunciation lexicons with this approach and applies them to ASR systems. The steps are simple and generic: apply the International Phonetic Alphabet (IPA) phone recognizer to the speech, align its output with the reference word sequence, and filter the alignments to obtain a series of AUTO-subwords, which are then used to generate the AUTO-subword lexicon and the AUTO-IPA lexicon. We used the generated pronunciation lexicon both in a hybrid system and for fine-tuning a pre-trained model. According to the experimental results, we are able to construct the lexicon without resorting to linguistic experts. Furthermore, the generated lexicon outperforms a grapheme-based lexicon and is comparable to an expert lexicon.
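The pipeline sketched above (recognize IPA phones, align them to the reference words, filter, and emit a lexicon) can be illustrated with a minimal sketch. The sketch assumes the recognizer and alignment stages have already produced (word, phone-sequence) pairs, and uses a simple frequency threshold `min_count` as a hypothetical stand-in for the filtering step, whose exact criterion the abstract does not specify.

```python
# Minimal sketch of lexicon generation from aligned phone-recognizer output.
# Input: (word, phone_sequence) pairs, assumed to come from aligning the
# IPA phone recognizer's output with the reference word sequence.
from collections import Counter

def build_lexicon(aligned_pairs, min_count=2):
    """Count each word's candidate pronunciations and keep the ones seen
    at least min_count times, yielding a word -> pronunciations mapping
    in the spirit of the AUTO-IPA lexicon."""
    counts = {}
    for word, phones in aligned_pairs:
        counts.setdefault(word, Counter())[" ".join(phones)] += 1
    lexicon = {}
    for word, pron_counts in counts.items():
        kept = [p for p, c in pron_counts.most_common() if c >= min_count]
        if kept:  # drop words with no pronunciation surviving the filter
            lexicon[word] = kept
    return lexicon

# Toy example: two consistent recognitions of "cat" survive the filter,
# the one-off variant and the singleton "dog" entry are filtered out.
pairs = [
    ("cat", ["k", "æ", "t"]),
    ("cat", ["k", "æ", "t"]),
    ("cat", ["k", "a", "t"]),
    ("dog", ["d", "ɒ", "g"]),
]
lexicon = build_lexicon(pairs)
```

In a real system the kept pronunciations would then be written out in the lexicon format of the downstream toolkit (e.g. one `word pronunciation` line per entry).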




Author information

Corresponding author

Correspondence to Weiqiang Zhang  (张卫强).

Ethics declarations

Conflict of interest: The authors declare that they have no conflict of interest.

Additional information

Foundation item: the National Natural Science Foundation of China (Nos. 62276153 and 62206171)


About this article


Cite this article

Li, J., Chen, X. & Zhang, W. Exploring Generation of Pronunciation Lexicon for Low-Resource Language Automatic Speech Recognition Based on Generic Phone Recognizer. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2730-3

