Abstract
The lexicon is an essential component of a hybrid automatic speech recognition (ASR) system. However, a high-quality lexicon requires significant effort from linguistic experts and is difficult to obtain, especially for low-resource languages. This paper addresses the problem of using a well-trained universal phone recognizer, obtained by training on multilingual speech data and pronunciation lexicons, to generate pronunciation lexicons for low-resource languages in a speech-data-driven manner. We propose a simple pipeline that generates pronunciation lexicons and applies them to ASR systems. The steps are simple and generic: apply the International Phonetic Alphabet (IPA) phone recognizer to the speech, align the result with the reference word sequence, and filter to obtain a series of AUTO-subwords, which are then used to generate the AUTO-subword lexicon and the AUTO-IPA lexicon. We used the generated pronunciation lexicon both in the hybrid system and for fine-tuning the pre-trained model. According to the experimental results, we are able to construct the lexicon without resorting to linguistic experts. Furthermore, the generated lexicon outperforms the grapheme-based lexicon and is comparable to the expert lexicon.
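The filtering step of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the recognizer output has already been aligned into (word, phone-sequence) pairs, and the function name `build_lexicon` and its thresholds (`min_count`, `max_prons`) are hypothetical.

```python
from collections import Counter, defaultdict

def build_lexicon(word_phone_pairs, min_count=2, max_prons=3):
    """Aggregate aligned (word, IPA-phone-sequence) pairs into a lexicon.

    A pronunciation is kept only if it was observed at least `min_count`
    times, and at most `max_prons` variants are retained per word --
    a simple frequency filter standing in for the paper's filtering step.
    """
    counts = defaultdict(Counter)
    for word, phones in word_phone_pairs:
        counts[word][tuple(phones)] += 1

    lexicon = {}
    for word, pron_counts in counts.items():
        prons = [p for p, n in pron_counts.most_common(max_prons)
                 if n >= min_count]
        if prons:
            lexicon[word] = [" ".join(p) for p in prons]
    return lexicon

# Toy usage: three consistent recognitions plus one outlier.
pairs = [("cat", ["k", "æ", "t"])] * 3 + [("cat", ["k", "a", "t"])]
print(build_lexicon(pairs))  # {'cat': ['k æ t']}
```

In practice the alignment between the phone hypothesis and the reference word sequence would come from a forced alignment or edit-distance pass; the filter above only shows how noisy per-word pronunciations could be consolidated into lexicon entries.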
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
Foundation item: the National Natural Science Foundation of China (Nos. 62276153 and 62206171)
Rights and permissions
About this article
Cite this article
Li, J., Chen, X. & Zhang, W. Exploring Generation of Pronunciation Lexicon for Low-Resource Language Automatic Speech Recognition Based on Generic Phone Recognizer. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2730-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s12204-024-2730-3
Keywords
- International Phonetic Alphabet (IPA)
- lexicon learning
- phone recognition
- low-resource speech recognition