A Bilingual Kazakh-Russian System for Automatic Speech Recognition and Synthesis

  • Olga Khomitsevich
  • Valentin Mendelev
  • Natalia Tomashenko
  • Sergey Rybin
  • Ivan Medennikov
  • Saule Kudubayeva
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9319)


The paper presents a speech recognition and synthesis system for the Kazakh and Russian languages. It is designed for use by speakers of Kazakh; because bilingualism is widespread among Kazakh speakers, the system was designed to be bilingual from the outset. Development involved building a text processing and transcription component that handles both Kazakh and Russian text and serves both the speech synthesis and recognition applications. We created a Kazakh TTS voice and an additional Russian voice from recordings of the same bilingual voice artist. A Kazakh speech database was collected and used to train deep neural network acoustic models for the speech recognition system. The resulting models demonstrated performance sufficient for practical use in interactive voice response and keyword spotting scenarios.
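As a rough illustration of what bilingual text routing can involve, the sketch below assigns a language tag to each token of mixed Kazakh-Russian text. It relies on the fact that Kazakh Cyrillic orthography uses nine letters absent from Russian (ә, ғ, қ, ң, ө, ұ, ү, һ, і); the function name and the fallback-to-Russian rule are assumptions for illustration, not the method described in the paper, which tokens without marker letters would in practice require dictionaries or context to disambiguate.

```python
# Kazakh-specific Cyrillic letters (upper- and lowercase); Russian uses none of these.
KAZAKH_ONLY = set("әғқңөұүһіӘҒҚҢӨҰҮҺІ")

def guess_language(token: str) -> str:
    """Tag a token 'kk' if it contains a Kazakh-specific letter, else 'ru'.

    Crude stand-in: tokens shared by both languages default to Russian here.
    """
    return "kk" if any(ch in KAZAKH_ONLY for ch in token) else "ru"

# Tag each word of a mixed Kazakh/Russian phrase.
tagged = [(w, guess_language(w)) for w in "қазақ тілі и русский язык".split()]
print(tagged)
```

A real front end would run such per-token decisions before selecting language-specific transcription rules, falling back to pronunciation dictionaries for ambiguous tokens.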


Keywords: Speech recognition · Speech synthesis · ASR · TTS · Kazakh



The work was financially supported by the Government of the Russian Federation, Grant 074-U01.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Olga Khomitsevich (1, 2)
  • Valentin Mendelev (1, 2)
  • Natalia Tomashenko (1, 2) (corresponding author)
  • Sergey Rybin (2)
  • Ivan Medennikov (2, 3)
  • Saule Kudubayeva (4)
  1. Speech Technology Center, Saint Petersburg, Russia
  2. ITMO University, Saint Petersburg, Russia
  3. STC-innovations Ltd., Saint Petersburg, Russia
  4. Kostanay State University named after A. Baytursynov, Kostanay, Kazakhstan
