A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

Conference paper

Statistical Language and Speech Processing (SLSP 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9918)

Abstract

In this paper we investigate GMM-derived features for the adaptation of context-dependent deep neural network hidden Markov model (CD-DNN-HMM) acoustic models, focusing on the fusion of the adapted GMM-derived features with conventional bottleneck features. We analyze and compare fusion at different levels, such as the feature level, the posterior level, and the lattice level, in order to find the most effective combination. Experimental results on the TED-LIUM corpus show that the proposed adaptation technique can be effectively integrated into a DNN setup at these different levels and provides an additional gain in recognition performance: up to 6% relative word error rate reduction (WERR) over a strong speaker-adapted DNN baseline, and up to 22% relative WERR compared with a speaker-independent DNN baseline trained on conventional features.
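As an illustration of two of the fusion levels named above: feature-level fusion amounts to frame-synchronous concatenation of the two feature streams before they enter the DNN, while posterior-level fusion combines the per-frame state posteriors produced by two systems. The sketch below is not the authors' code; it is a minimal NumPy illustration with hypothetical helper names, assuming both streams are arrays of shape (num_frames, dim):

```python
import numpy as np

def fuse_feature_level(gmm_feats, bn_feats):
    """Feature-level fusion: concatenate the per-frame GMM-derived and
    bottleneck feature vectors along the feature dimension."""
    assert gmm_feats.shape[0] == bn_feats.shape[0], "frame counts must match"
    return np.concatenate([gmm_feats, bn_feats], axis=1)

def fuse_posterior_level(post_a, post_b, weight=0.5):
    """Posterior-level fusion: frame-wise log-linear interpolation of two
    systems' state posterior distributions, renormalised per frame."""
    log_p = weight * np.log(post_a + 1e-10) + (1.0 - weight) * np.log(post_b + 1e-10)
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

# Toy example: 3 frames, 40-dim GMM-derived + 30-dim bottleneck features.
gmm = np.random.randn(3, 40)
bn = np.random.randn(3, 30)
print(fuse_feature_level(gmm, bn).shape)  # (3, 70)
```

Lattice-level fusion, in contrast, operates after decoding, combining the recognition hypotheses of the two systems rather than their features or posteriors.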


Notes

  1. http://cantabresearch.com/cantab-TEDLIUM.tar.bz2


Acknowledgements

This work was partially funded by the European Commission through the EUMSSI project, under the contract number 611057, in the framework of the FP7-ICT-2013-10 call and by the Government of the Russian Federation, Grant 074-U01.

Author information

Correspondence to Natalia Tomashenko.


Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Tomashenko, N., Khokhlov, Y., Estève, Y. (2016). A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation. In: Král, P., Martín-Vide, C. (eds) Statistical Language and Speech Processing. SLSP 2016. Lecture Notes in Computer Science, vol 9918. Springer, Cham. https://doi.org/10.1007/978-3-319-45925-7_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45924-0

  • Online ISBN: 978-3-319-45925-7

