End-to-End Large Vocabulary Speech Recognition for the Serbian Language
This paper presents the results of a large vocabulary speech recognition for the Serbian language, developed by using Eesen end-to-end framework. Eesen involves training a single deep recurrent neural network, containing a number of bidirectional long short-term memory layers, modeling the connection between the speech and a set of context-independent lexicon units. This approach reduces the amount of expert knowledge needed in order to develop other competitive speech recognition systems. The training is based on a connectionist temporal classification, while decoding allows the usage of weighted finite-state transducers. This provides much faster and more efficient decoding in comparison to other similar systems. A corpus of approximately 215 h of audio data (about 171 h of speech and 44 h of silence, or 243 male and 239 female speakers) was employed for the training (about 90%) and testing (about 10%) purposes. On a set of more than 120000 words, the word error rate of 14.68% and the character error rate of 3.68% is achieved.
KeywordsEesen End-to-end LSTM Speech recognition Serbian
The work described in this paper was supported in part by the Ministry of Education, Science and Technological Development of the Republic of Serbia, within the project “Development of Dialogue Systems for Serbian and Other South Slavic Languages”, EUREKA project DANSPLAT, “A Platform for the Applications of Speech Technologies on Smartphones for the Languages of the Danube Region”, id E! 9944, and the Provincial Secretariat for Higher Education and Scientific Research, within the project “Central Audio-Library of the University of Novi Sad”, No. 114-451-2570/2016-02.
- 1.Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlíček, P., Qian, Y., Schwarz, P., Silovský, J., Stemmer, G., Veselý, K.: The Kaldi speech recognition toolkit. In: IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 1–4. IEEE Signal Processing Society (2011)Google Scholar
- 2.Popović, B., Pakoci, E., Ostrogonac, S., Pekar, D.: Large vocabulary continuous speech recognition for Serbian using the Kaldi toolkit. In: 10th Digital Speech and Image Processing, DOGS, pp. 31–34. Novi Sad, Serbia (2014)Google Scholar
- 3.Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., Visweswariah, K.: Boosted MMI for model and feature-space discriminative training. In: 33rd International Conference on Acoustics, Speech and Signal Processing, ICASSP, Las Vegas, pp. 4057–4060 (2008)Google Scholar
- 4.Povey, D., Woodland, P.C.: Minimum phone error and I-smoothing for improved discriminative training. In: 27th International Conference on Acoustics, Speech and Signal Processing ICASSP, Orlando, pp. I-105–I-108 (2002)Google Scholar
- 5.Povey, D., Kuo, H-K.J., Soltau, H.: Fast speaker adaptive training for speech recognition. In: 9th Annual Conference of the International Speech Communication Association, INTERSPEECH, Brisbane, pp. 1245–1248 (2008)Google Scholar
- 7.Popović, B., Ostrogonac, S., Pakoci, E., Jakovljević, N., Delić, V.: Deep neural network based continuous speech recognition for serbian using the Kaldi toolkit. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS, vol. 9319, pp. 186–192. Springer, Cham (2015). doi: 10.1007/978-3-319-23132-7_23 CrossRefGoogle Scholar
- 8.Miao, Y., Gowayyed, M., Metze, F.: EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In: Automatic Speech Recognition and Understanding Workshop, ASRU 2015, arXiv:1507.08240 (2015)
- 10.Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: 23rd International Conference on Machine Learning, pp. 369–376. ACM (2006)Google Scholar
- 12.Kneser, R., Ney, H.: Improved backing-off for M-gram language modeling. In: 20th International Conference on Acoustics, Speech and Signal Processing, ICASSP, Detroit, pp. 181–184 (1995)Google Scholar