Abstract
The lexicon is an essential component of a hybrid automatic speech recognition (ASR) system. However, a high-quality lexicon requires significant effort from linguistic experts and is difficult to obtain, especially for low-resource languages. This paper addresses the problem of using a well-trained universal phone recognizer, obtained by training on multilingual speech data and pronunciation lexicons, to generate pronunciation lexicons for low-resource languages in a speech-data-driven manner. We propose a simple pipeline that generates pronunciation lexicons and applies them to ASR systems. The steps are simple and generic: apply the International Phonetic Alphabet (IPA) phone recognizer to the speech, align the result with the reference word sequence, and filter to obtain a series of AUTO-subwords, which are then used to generate the AUTO-subword lexicon and the AUTO-IPA lexicon. We used the generated pronunciation lexicon both in the hybrid system and for fine-tuning the pre-trained model. According to the experimental results, we are able to construct the lexicon without resorting to linguistic experts. Furthermore, the generated lexicon outperforms the grapheme-based lexicon and is comparable to the expert lexicon.
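The filtering step of the pipeline can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the recognizer output has already been aligned into (word, phone-sequence) pairs, and the function name `build_lexicon` and its thresholds (`min_count`, `max_prons`) are hypothetical.

```python
from collections import Counter, defaultdict

def build_lexicon(word_phone_pairs, min_count=2, max_prons=3):
    """Aggregate aligned (word, IPA-phone-sequence) pairs into a lexicon.

    A pronunciation is kept only if it was observed at least `min_count`
    times, and at most `max_prons` variants are retained per word --
    a simple frequency filter standing in for the paper's filtering step.
    """
    counts = defaultdict(Counter)
    for word, phones in word_phone_pairs:
        counts[word][tuple(phones)] += 1

    lexicon = {}
    for word, pron_counts in counts.items():
        prons = [p for p, n in pron_counts.most_common(max_prons)
                 if n >= min_count]
        if prons:
            lexicon[word] = [" ".join(p) for p in prons]
    return lexicon

# Toy usage: three consistent recognitions plus one outlier.
pairs = [("cat", ["k", "æ", "t"])] * 3 + [("cat", ["k", "a", "t"])]
print(build_lexicon(pairs))  # {'cat': ['k æ t']}
```

In practice the alignment between the phone hypothesis and the reference word sequence would come from a forced alignment or edit-distance pass; the filter above only shows how noisy per-word pronunciations could be consolidated into lexicon entries.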
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interest The authors declare that they have no conflict of interest.
Additional information
Foundation item: the National Natural Science Foundation of China (Nos. 62276153 and 62206171)
Rights and permissions
About this article
Cite this article
Li, J., Chen, X. & Zhang, W. Exploring Generation of Pronunciation Lexicon for Low-Resource Language Automatic Speech Recognition Based on Generic Phone Recognizer. J. Shanghai Jiaotong Univ. (Sci.) (2024). https://doi.org/10.1007/s12204-024-2730-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1007/s12204-024-2730-3
Keywords
- International Phonetic Alphabet (IPA)
- lexicon learning
- phone recognition
- low-resource speech recognition