Vocabulary Expansion for the Sub-word WFST-Based Automatic Speech Recognition System

Part of the Lecture Notes in Networks and Systems book series (LNNS, volume 561)

Abstract

This paper reports improvements to a Lithuanian automatic speech recognition (ASR) system, focusing on “vocabulary expansion”, i.e., enabling the ASR system to recognize words never seen during training. These unseen words are called out-of-vocabulary (OOV) words and include: 1) regular Lithuanian words that appear because of topics or domains not covered in training; 2) complicated cases, i.e., foreign names, brand names, and loanwords that are not pronounced according to regular Lithuanian pronunciation rules. In weighted finite-state transducer (WFST) based ASR, the OOV problem is typically addressed in one of three ways: (1) making the ASR vocabulary unlimited by performing recognition at the sub-word level, (2) adding words directly to the WFST decoding graph, or (3) reconstructing OOV words from the ASR output. Our baseline Lithuanian ASR system already follows the first approach; however, many OOV words are still not recognized because the corresponding sub-word sequences have low probability. Our proposed approach can therefore be seen as a combination of the first two solutions: in a sub-word WFST ASR system, we boost the probabilities of the sub-word sequences that correspond to the words being “added” to the vocabulary. In this way the vocabulary of the ASR system is “expanded”. The proposed approach achieved a significant improvement over the baseline: the percentage of misrecognized OOV words dropped by approximately 7%, while F1 reached 85.6%.
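The idea of boosting sub-word sequences can be illustrated with a toy example. The following is only a minimal sketch under stated assumptions, not the method used in the paper: it assumes a bigram sub-word language model stored as a dictionary of log-probabilities, a greedy longest-match segmenter, and a fixed additive bonus. The names `segment`, `boost`, `score`, `FLOOR`, the `<w>`/`</w>` boundary markers, and the unit inventory are all hypothetical. In an actual WFST decoder such scores would typically appear as arc costs (negative log-probabilities), so a boost corresponds to lowering the cost along the path of the added word.

```python
import math

# Assumed back-off score for bigrams the toy model has never seen.
FLOOR = math.log(1e-6)


def segment(word, units):
    """Greedy longest-match split of `word` into known sub-word units."""
    pieces, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in units:
                pieces.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot segment {word!r} at position {i}")
    return pieces


def boost(logprobs, pieces, bonus):
    """Return a copy of `logprobs` with each bigram on the word's path raised.

    Scores are capped at 0.0 (probability 1). The result is no longer a
    normalized distribution; this sketch does not renormalize.
    """
    boosted = dict(logprobs)
    path = ["<w>"] + pieces + ["</w>"]
    for bigram in zip(path, path[1:]):
        old = boosted.get(bigram, FLOOR)
        boosted[bigram] = min(0.0, old + bonus)
    return boosted


def score(logprobs, pieces):
    """Sum of bigram log-probabilities along the sub-word path of one word."""
    path = ["<w>"] + pieces + ["</w>"]
    return sum(logprobs.get(bigram, FLOOR) for bigram in zip(path, path[1:]))


# Hypothetical sub-word inventory and a nearly empty bigram model.
units = {"goo", "gle", "g", "o", "l", "e"}
lm = {("<w>", "goo"): math.log(0.01)}

pieces = segment("google", units)
boosted = boost(lm, pieces, bonus=2.0)
print(pieces, score(lm, pieces), score(boosted, pieces))
```

With these toy values the word is split into two units, so three bigrams lie on its path and the path score rises by three times the bonus. How large the bonus should be, and whether the model is renormalized afterwards, are design choices that this sketch leaves open.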

Keywords

  • Speech recognition
  • Out-of-vocabulary
  • Vocabulary expansion
  • Sub-word units
  • The Lithuanian language

Notes

  1. https://xn--ratija-ckb.lt/liepa-2/infrastrukturines-paslaugos/garsynas/

  2. https://xn--ratija-ckb.lt/liepa-2/paslaugos-vartotojams/interneto-naujienu-skaitytuvas/

Acknowledgments

This research has been supported by the ICT Competence Centre (www.itkc.lv) within the project “2.8. Automated voice communication solutions for the healthcare industry” of EU Structural funds, ID no 1.2.1.1/18/A/003.

Author information

Corresponding author

Correspondence to Jurgita Kapočiūtė-Dzikienė.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Salimbajevs, A., Kapočiūtė-Dzikienė, J. (2023). Vocabulary Expansion for the Sub-word WFST-Based Automatic Speech Recognition System. In: Arai, K. (ed.) Proceedings of the Future Technologies Conference (FTC) 2022, Volume 3. FTC 2022. Lecture Notes in Networks and Systems, vol 561. Springer, Cham. https://doi.org/10.1007/978-3-031-18344-7_41
