
Modeling under-resourced languages for speech recognition

Original Paper · Language Resources and Evaluation

Abstract

One particular problem in large-vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of every possible word sequence. For Finnish, Estonian, and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also arises in other language technology applications, such as machine translation and information retrieval, and, to some extent, in other morphologically rich languages. In this paper we present methods and evaluations for four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
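As a concrete illustration of the first topic, the following Python sketch selects sentences by cross-entropy difference scoring, a standard criterion for this kind of filtering: a sentence is kept when it is more probable under an in-domain (conversational) language model than under a general-text model. The add-one-smoothed unigram models, toy Finnish sentences, and zero threshold are illustrative assumptions only, not a reproduction of the paper's actual filtering scripts.

import math
from collections import Counter

def train_unigram(sentences):
    # Add-one smoothed unigram log-probabilities over whitespace tokens.
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen words
    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))
    return logprob

def ce_difference(sentence, in_domain_lp, general_lp):
    # Average per-word log-probability difference; higher means the
    # sentence looks more like the in-domain (conversational) data.
    words = sentence.split()
    if not words:
        return float("-inf")
    return sum(in_domain_lp(w) - general_lp(w) for w in words) / len(words)

# Toy stand-ins for transcribed conversations and generic web text.
in_domain = ["joo kyllä mä tykkäsin siitä", "no mitä sä siitä tykkäsit"]
general = ["hallitus päätti asiasta tänään", "yhtiön liikevaihto kasvoi"]
candidates = ["joo mä tykkäsin siitä kyllä", "liikevaihto kasvoi viime vuonna"]

in_lp = train_unigram(in_domain)
gen_lp = train_unigram(general)
selected = [s for s in candidates if ce_difference(s, in_lp, gen_lp) > 0.0]
print(selected)  # keeps only the conversational-sounding candidate

With real data, both scoring models would be higher-order n-gram (or neural) language models trained on much larger corpora, and the selection threshold would be tuned on held-out conversational text.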




Acknowledgments

This work was partially funded by the Estonian Ministry of Education and Research target-financed research theme no. 0140007s12, by the Tallinn University of Technology project Estonian Speech Recognition System for Medical Applications, by the Academy of Finland under Grant Number 251170 (Finnish Centre of Excellence Program, 2012–2017), and by the Finnish Cultural Foundation. We acknowledge the computational resources provided by the Aalto Science-IT project.

Author information

Correspondence to Seppo Enarvi.


About this article


Cite this article

Kurimo, M., Enarvi, S., Tilk, O. et al. Modeling under-resourced languages for speech recognition. Lang Resources & Evaluation 51, 961–987 (2017). https://doi.org/10.1007/s10579-016-9336-9
