A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

  • Natalia Tomashenko
  • Yuri Khokhlov
  • Yannick Estève
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9918)

Abstract

In this paper we investigate GMM-derived features for the adaptation of context-dependent deep neural network HMM (CD-DNN-HMM) acoustic models, focusing on the fusion of the adapted GMM-derived features with conventional bottleneck features. We analyze and compare different types of fusion, such as feature-level, posterior-level, and lattice-level combination, in order to find the most effective one. Experimental results on the TED-LIUM corpus show that the proposed adaptation technique can be effectively integrated into a DNN setup at different levels and provides an additional gain in recognition performance: up to 6% relative word error rate reduction (WERR) over a strong speaker-adapted DNN baseline, and up to 22% relative WERR compared with a speaker-independent DNN baseline trained on conventional features.
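As a rough illustration of two of the fusion strategies mentioned in the abstract, the sketch below (Python with NumPy) shows feature-level fusion as per-frame concatenation of GMMD and bottleneck features, and posterior-level fusion as a weighted log-linear combination of the state posteriors produced by two acoustic models. The function names, the log-linear averaging, and the 0.5 default weight are illustrative assumptions rather than the exact recipe used in the paper; lattice-level and confusion-network combination are not shown.

```python
import numpy as np

def feature_level_fusion(gmmd_feats, bn_feats):
    """Concatenate per-frame GMM-derived (GMMD) features with bottleneck
    features before feeding them to the DNN acoustic model.
    Both inputs are (num_frames, dim) arrays aligned frame by frame."""
    assert gmmd_feats.shape[0] == bn_feats.shape[0]
    return np.concatenate([gmmd_feats, bn_feats], axis=1)

def posterior_level_fusion(post_a, post_b, weight=0.5):
    """Combine frame-level state posteriors from two acoustic models by a
    weighted log-linear average, then renormalize each frame to sum to one."""
    eps = 1e-10  # avoid log(0) for zero posteriors
    log_post = weight * np.log(post_a + eps) + (1.0 - weight) * np.log(post_b + eps)
    fused = np.exp(log_post)
    return fused / fused.sum(axis=1, keepdims=True)
```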

Keywords

Speaker adaptation · Deep neural networks (DNN) · MAP · fMLLR · CD-DNN-HMM · GMM-derived (GMMD) features · Fusion · Posterior fusion · Confusion network combination

Notes

Acknowledgements

This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call and by the Government of the Russian Federation, Grant 074-U01.


Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  • Natalia Tomashenko (1, 2, 3)
  • Yuri Khokhlov (3)
  • Yannick Estève (1)
  1. University of Le Mans, Le Mans, France
  2. ITMO University, Saint Petersburg, Russia
  3. STC-innovations Ltd, Saint Petersburg, Russia
