
Modeling under-resourced languages for speech recognition

Original Paper · Language Resources and Evaluation

Abstract

One particular problem in large-vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of every possible word sequence. For Finnish, Estonian, and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also arises in other language technology applications, such as machine translation and information retrieval, and, to some extent, in other morphologically rich languages. In this paper we present methods and evaluations for four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
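As a concrete illustration of the first topic, the following Python sketch selects sentences by cross-entropy difference scoring, a standard criterion for this kind of filtering: a sentence is kept when it is more probable under an in-domain (conversational) language model than under a general-text model. The add-one-smoothed unigram models, toy Finnish sentences, and zero threshold are illustrative assumptions only, not a reproduction of the paper's actual filtering scripts.

import math
from collections import Counter

def train_unigram(sentences):
    # Add-one smoothed unigram log-probabilities over whitespace tokens.
    counts = Counter(word for s in sentences for word in s.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot for unseen words
    def logprob(word):
        return math.log((counts[word] + 1) / (total + vocab))
    return logprob

def ce_difference(sentence, in_domain_lp, general_lp):
    # Average per-word log-probability difference; higher means the
    # sentence looks more like the in-domain (conversational) data.
    words = sentence.split()
    if not words:
        return float("-inf")
    return sum(in_domain_lp(w) - general_lp(w) for w in words) / len(words)

# Toy stand-ins for transcribed conversations and generic web text.
in_domain = ["joo kyllä mä tykkäsin siitä", "no mitä sä siitä tykkäsit"]
general = ["hallitus päätti asiasta tänään", "yhtiön liikevaihto kasvoi"]
candidates = ["joo mä tykkäsin siitä kyllä", "liikevaihto kasvoi viime vuonna"]

in_lp = train_unigram(in_domain)
gen_lp = train_unigram(general)
selected = [s for s in candidates if ce_difference(s, in_lp, gen_lp) > 0.0]
print(selected)  # keeps only the conversational-sounding candidate

With real data, both scoring models would be higher-order n-gram (or neural) language models trained on much larger corpora, and the selection threshold would be tuned on held-out conversational text.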




Acknowledgments

This work was partially funded by the Estonian Ministry of Education and Research target-financed research theme no. 0140007s12, by the Tallinn University of Technology project Estonian Speech Recognition System for Medical Applications, by the Academy of Finland under Grant Number 251170 (Finnish Centre of Excellence Program, 2012–2017), and by the Finnish Cultural Foundation. We acknowledge the computational resources provided by the Aalto Science-IT project.

Author information

Correspondence to Seppo Enarvi.


About this article


Cite this article

Kurimo, M., Enarvi, S., Tilk, O. et al. Modeling under-resourced languages for speech recognition. Lang Resources & Evaluation 51, 961–987 (2017). https://doi.org/10.1007/s10579-016-9336-9
