End-to-End Large Vocabulary Speech Recognition for the Serbian Language

  • Branislav Popović
  • Edvin Pakoci
  • Darko Pekar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10458)


This paper presents the results of large vocabulary speech recognition for the Serbian language, developed using the Eesen end-to-end framework. Eesen trains a single deep recurrent neural network, containing a number of bidirectional long short-term memory (LSTM) layers, that models the mapping between speech and a set of context-independent lexicon units. This approach reduces the amount of expert knowledge needed to develop a competitive speech recognition system. Training is based on connectionist temporal classification (CTC), while decoding uses weighted finite-state transducers (WFSTs), which provides much faster and more efficient decoding than other similar systems. A corpus of approximately 215 h of audio data (about 171 h of speech and 44 h of silence, from 243 male and 239 female speakers) was employed for training (about 90%) and testing (about 10%). On a test set of more than 120000 words, a word error rate of 14.68% and a character error rate of 3.68% are achieved.
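The CTC objective mentioned in the abstract can be illustrated with the forward (alpha) recursion from Graves et al.: it sums the probability of every frame-level path that collapses to a given label sequence. The sketch below uses illustrative hand-picked posteriors, not the paper's BLSTM outputs; a real system would also work in log space for numerical stability.

```python
def ctc_forward(probs, target, blank=0):
    """probs[t][k]: posterior of label k at frame t; target: label ids (no blanks)."""
    # Extend the target with blanks: l' = [_, l1, _, l2, ..., _]
    ext = [blank]
    for label in target:
        ext += [label, blank]
    T, S = len(probs), len(ext)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][blank]          # start with a blank...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]     # ...or with the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s] + (alpha[t - 1][s - 1] if s >= 1 else 0.0)
            # Skipping over a blank is allowed only between distinct labels,
            # so repeated labels always require an intervening blank.
            if s >= 2 and ext[s] != blank and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]
            alpha[t][s] = a * probs[t][ext[s]]
    # Valid endings: the final label, or a trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)

# Two frames over labels {0: blank, 1: 'a'}; target "a" is produced by the
# paths (a,a), (a,_) and (_,a): 0.36 + 0.24 + 0.24 = 0.84.
p = ctc_forward([[0.4, 0.6], [0.4, 0.6]], [1])
```

During training, the negative log of this quantity is the loss backpropagated through the recurrent network, so no frame-level alignment of the transcripts is needed.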


Keywords: Eesen, End-to-end, LSTM, Speech recognition, Serbian



The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project “Development of Dialogue Systems for Serbian and Other South Slavic Languages”; by the EUREKA project DANSPLAT, “A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region”, ID E! 9944; and by the Provincial Secretariat for Higher Education and Scientific Research, within the project “Central Audio-Library of the University of Novi Sad”, No. 114-451-2570/2016-02.



Copyright information

© Springer International Publishing AG 2017

Authors and Affiliations

  • Branislav Popović (1, 2)
  • Edvin Pakoci (1, 2)
  • Darko Pekar (1, 2)
  1. Department for Power Electronic and Telecommunication Engineering, Faculty of Technical Sciences, University of Novi Sad, Novi Sad, Serbia
  2. AlfaNum Speech Technologies, Novi Sad, Serbia
