Deep Neural Network Based Continuous Speech Recognition for Serbian Using the Kaldi Toolkit

  • Branislav Popović
  • Stevan Ostrogonac
  • Edvin Pakoci
  • Nikša Jakovljević
  • Vlado Delić
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9319)


This paper presents a deep neural network (DNN) based large vocabulary continuous speech recognition (LVCSR) system for Serbian, developed using the open-source Kaldi speech recognition toolkit. The DNNs are initialized using stacked restricted Boltzmann machines (RBMs) and trained with the standard error backpropagation procedure, using cross-entropy as the objective function, to provide posterior probability estimates for the hidden Markov model (HMM) states. Emission densities of HMM states are represented as Gaussian mixture models (GMMs). The recipes were adapted to the particularities of the Serbian language to obtain the best results. A corpus of approximately 90 hours of speech (21,000 utterances) is used for training. The performance of the baseline GMM-HMM system and of various DNN settings is compared on two different sets of utterances.
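As an illustration of the cross-entropy objective mentioned above, the following minimal sketch computes softmax posteriors over HMM states for each frame and the mean cross-entropy against frame-level state labels. It is not the Kaldi implementation; the function names, the toy activations, and the three-state setup are assumptions made only for this example.

```python
import math

def softmax(logits):
    """Turn raw output-layer activations into posterior probabilities."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def frame_cross_entropy(logits, target_state):
    """Cross-entropy for one frame: -log posterior of the aligned HMM state."""
    return -math.log(softmax(logits)[target_state])

def mean_cross_entropy(batch_logits, targets):
    """Average the per-frame cross-entropy over a batch of frames."""
    losses = [frame_cross_entropy(l, t) for l, t in zip(batch_logits, targets)]
    return sum(losses) / len(losses)

# Hypothetical toy data: 2 frames, 3 HMM states (not taken from the paper).
toy_logits = [[2.0, 0.5, -1.0], [0.0, 0.0, 0.0]]
toy_targets = [0, 2]  # aligned state index for each frame

posteriors = [softmax(l) for l in toy_logits]
loss = mean_cross_entropy(toy_logits, toy_targets)
print(posteriors, loss)
```

Backpropagation would then adjust the network weights to reduce this loss; in such a setup the state labels typically come from a forced alignment produced by an existing GMM-HMM system.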


Keywords: Kaldi speech recognition toolkit, continuous speech recognition, deep neural networks, Serbian



The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project TR32035: “Development of Dialogue Systems for Serbian and Other South Slavic Languages”.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Branislav Popović (1)
  • Stevan Ostrogonac (1)
  • Edvin Pakoci (1)
  • Nikša Jakovljević (1)
  • Vlado Delić (1)

  1. Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
