
Language Resources and Evaluation, Volume 51, Issue 4, pp. 961–987

Modeling under-resourced languages for speech recognition

  • Mikko Kurimo
  • Seppo Enarvi
  • Ottokar Tilk
  • Matti Varjokallio
  • André Mansikkaniemi
  • Tanel Alumäe
Original Paper

Abstract

One particular problem in large vocabulary continuous speech recognition for low-resourced languages is finding relevant training data for the statistical language models. A large amount of data is required, because the models should estimate the probability of every possible word sequence. For Finnish, Estonian, and the other Finno-Ugric languages, a special problem with the data is the huge number of different word forms that are common in normal speech. The same problem also exists in other language technology applications, such as machine translation and information retrieval, and to some extent in other morphologically rich languages. In this paper we present methods and evaluations for four recent language modeling topics: selecting conversational data from the Internet, adapting models for foreign words, multi-domain and adapted neural network language modeling, and decoding with subword units. Our evaluations show that the same methods work in more than one language and that they scale down to smaller data resources.
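The first of the four topics, selecting conversational training text from the Internet, is commonly approached by scoring each candidate sentence against an in-domain model and an out-of-domain model and keeping the sentences that resemble the in-domain data. The sketch below illustrates cross-entropy difference selection in the style of Moore and Lewis, using simple add-one-smoothed unigram models; the corpora, function names, and threshold convention are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter

def unigram_cross_entropy(tokens, counts, total, vocab_size):
    """Per-token cross-entropy of a sentence under an add-one-smoothed
    unigram model built from the word counts in `counts`."""
    logprob = sum(
        math.log((counts[t] + 1) / (total + vocab_size)) for t in tokens
    )
    return -logprob / len(tokens)

def cross_entropy_difference(sentence, in_domain, out_domain):
    """Moore-Lewis-style score H_in(s) - H_out(s): sentences scoring
    below a chosen threshold look more like the in-domain
    (conversational) data than the general out-of-domain data."""
    in_counts = Counter(w for s in in_domain for w in s.split())
    out_counts = Counter(w for s in out_domain for w in s.split())
    vocab_size = len(set(in_counts) | set(out_counts))
    tokens = sentence.split()
    h_in = unigram_cross_entropy(
        tokens, in_counts, sum(in_counts.values()), vocab_size)
    h_out = unigram_cross_entropy(
        tokens, out_counts, sum(out_counts.values()), vocab_size)
    return h_in - h_out
```

In practice the models would be higher-order n-gram (or neural) models trained on a small conversational seed corpus and a large web crawl, and the threshold on the score would be tuned on held-out data; the unigram models here only keep the example self-contained.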

Keywords

Large vocabulary speech recognition, Statistical language modeling, Subword units, Data filtering, Adaptation

Notes

Acknowledgments

This work was partially funded by the Estonian Ministry of Education and Research target-financed research theme no. 0140007s12, by the Tallinn University of Technology project Estonian Speech Recognition System for Medical Applications, by the Academy of Finland under Grant Number 251170 [Finnish Centre of Excellence Program (2012–2017)], and by the Finnish Cultural Foundation. We acknowledge the computational resources provided by the Aalto Science-IT project.


Copyright information

© Springer Science+Business Media Dordrecht 2016

Authors and Affiliations

  • Mikko Kurimo (1)
  • Seppo Enarvi (1)
  • Ottokar Tilk (2)
  • Matti Varjokallio (1)
  • André Mansikkaniemi (1)
  • Tanel Alumäe (2)

  1. Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland
  2. Institute of Cybernetics, Tallinn University of Technology, Tallinn, Estonia
