Abstract
This paper reports improvements to a Lithuanian Automatic Speech Recognition (ASR) system, focusing on "vocabulary expansion", i.e., enabling the ASR system to recognize words never seen during training. These unseen words are called out-of-vocabulary (OOV) words and include: 1) regular Lithuanian words that appear because of topics or domains not covered in training; 2) complicated cases, i.e., foreign names, brand names, and loanwords that are not pronounced according to regular Lithuanian pronunciation rules. In weighted finite-state transducer (WFST) ASR, the OOV problem is typically addressed in one of the following ways: (1) making the ASR vocabulary unlimited by performing recognition at the sub-word level, (2) adding words directly to the WFST decoding graph, or (3) reconstructing OOV words from the ASR result. Our baseline Lithuanian ASR system already follows the first approach; however, many OOV words are still not recognized because the corresponding sub-word sequences have low probability. The proposed approach can therefore be seen as a combination of the first two solutions: we boost the probabilities of sub-word sequences (corresponding to the words being "added" to the vocabulary) in the sub-word WFST ASR system, thereby "expanding" the vocabulary of the ASR. The proposed approach yields a significant improvement over the baseline: the percentage of misrecognized out-of-vocabulary words drops by ~7%, while F1 reaches 85.6%.
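As an illustration of the general idea only, the minimal Python sketch below shows one way such boosting could be prepared: OOV words are segmented into known sub-word units, and each resulting sub-word sequence is paired with a negative-log boost weight that could later be compiled into a biasing WFST and composed with the decoding graph. The sub-word inventory, example words, and boost value are hypothetical, and the sketch is not the implementation described in the paper.

```python
# Illustrative sketch (not the authors' implementation): segment OOV words into
# sub-word units by greedy longest match and emit boosted sub-word sequences
# that could be compiled into a biasing WFST. Inventory and words are made up.
import math

SUBWORD_INVENTORY = {"vil", "nius", "kau", "nas"}  # assumed sub-word units

def segment(word, inventory, max_len=6):
    """Greedy longest-match segmentation of a word into known sub-word units."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(min(len(word), i + max_len), i, -1):
            if word[i:j] in inventory:
                pieces.append(word[i:j])
                i = j
                break
        else:
            return None  # word cannot be covered by the inventory
    return pieces

def boosted_entries(oov_words, inventory, boost_prob=0.1):
    """Map each OOV word to (sub-word sequence, negative-log boost weight)."""
    weight = -math.log(boost_prob)  # tropical-semiring style cost
    entries = []
    for word in oov_words:
        pieces = segment(word.lower(), inventory)
        if pieces:
            entries.append((pieces, weight))
    return entries

if __name__ == "__main__":
    for pieces, w in boosted_entries(["Vilnius", "Kaunas"], SUBWORD_INVENTORY):
        print(" ".join(pieces), f"{w:.3f}")
```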
Keywords
- Speech recognition
- Out-of-vocabulary
- Vocabulary expansion
- Sub-word units
- The Lithuanian language
Acknowledgments
This research has been supported by the ICT Competence Centre (www.itkc.lv) within the project "2.8. Automated voice communication solutions for the healthcare industry" of EU Structural Funds, ID no. 1.2.1.1/18/A/003.
Cite this paper
Salimbajevs, A., Kapočiūtė-Dzikienė, J. (2023). Vocabulary Expansion for the Sub-word WFST-Based Automatic Speech Recognition System. In: Arai, K. (ed.) Proceedings of the Future Technologies Conference (FTC) 2022, Volume 3. Lecture Notes in Networks and Systems, vol. 561. Springer, Cham. https://doi.org/10.1007/978-3-031-18344-7_41