SPECOM 2015: Speech and Computer, pp. 234–242
Improving Acoustic Models for Russian Spontaneous Speech Recognition
Abstract
This paper investigates ways to improve acoustic models for Russian spontaneous speech recognition. We applied the main steps of the Kaldi Switchboard recipe to a Russian dataset but obtained lower accuracy than that obtained for English spontaneous telephone speech. We found two methods to be especially useful for Russian spontaneous speech: i-vector-based deep neural network adaptation and speaker-dependent bottleneck features, which provide 8.6% and 11.9% relative word error rate reduction over the baseline system, respectively.
Keywords
Speech recognition, Russian spontaneous speech, Deep neural networks, Speaker adaptation, I-vectors, Bottleneck features
Acknowledgements
The work was partially financially supported by the Government of the Russian Federation, Grant 074-U01, and by the Ministry of Education and Science of the Russian Federation, contract 14.579.21.0057, ID RFMEFI57914X0057.