Fuse Deep Neural Network and Gaussian Mixture Model Systems

Chapter in: Automatic Speech Recognition

Part of the book series: Signals and Communication Technology (SCT)

Abstract

In this chapter, we introduce techniques that fuse deep neural networks (DNNs) and Gaussian mixture models (GMMs). We first describe the Tandem and bottleneck approaches, in which DNNs are used as feature extractors: the hidden-layer activations, which provide a better representation than the raw input features, serve as the features of the GMM systems. We then introduce techniques that fuse the recognition results and frame-level scores of the DNN-HMM hybrid system with those of the GMM-HMM system.
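To make the Tandem/bottleneck idea concrete, here is a minimal NumPy sketch of using a trained DNN as a feature extractor. It is illustrative only, not the chapter's implementation: the sigmoid nonlinearity, the 39-512-40-512 layer sizes, and the bottleneck_features helper are assumptions chosen for the example. In practice the DNN would be trained on phonetic or senone targets, and its narrow hidden layer would feed a conventional GMM-HMM front end, often after a dimensionality-reducing transform.

```python
# Minimal sketch of tandem/bottleneck feature extraction (illustrative only).
# A trained DNN is run forward up to a narrow hidden layer; the activations
# of that layer replace (or augment) the raw acoustic features fed to the
# GMM-HMM system. Network shape and nonlinearity are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bottleneck_features(frames, weights, biases, bottleneck_index):
    """Forward-propagate acoustic frames through the DNN layers and return
    the activations of the bottleneck layer as the new feature vectors.

    frames: (T, D) matrix of T acoustic frames (e.g., MFCC or filterbank)
    weights, biases: per-layer parameters of an already-trained DNN
    bottleneck_index: index of the narrow hidden layer used as features
    """
    h = frames
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ W + b)
        if i == bottleneck_index:
            return h  # (T, bottleneck_dim) tandem features for the GMM-HMM
    raise ValueError("bottleneck_index out of range")

# Toy usage: 100 frames of 39-dim features through a 39-512-40-512 net;
# the 40-dim layer (index 1) serves as the bottleneck.
rng = np.random.default_rng(0)
dims = [39, 512, 40, 512]
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
feats = bottleneck_features(rng.standard_normal((100, 39)), weights, biases, 1)
print(feats.shape)  # (100, 40)
```

The design point worth noting is that the decoder and all of the GMM machinery stay unchanged; only the front-end features are swapped, which is what makes the Tandem approach easy to drop into an existing GMM-HMM system.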
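The score-fusion direction can be sketched just as briefly. A common instantiation, assumed here rather than taken from the chapter, is a log-linear interpolation of the per-frame, per-state log-likelihoods of the two systems before decoding; the weight alpha is a hypothetical parameter that would be tuned on a development set.

```python
# Minimal sketch of frame-level score fusion (illustrative only): the
# per-frame, per-state log-likelihoods of a DNN-HMM and a GMM-HMM system
# are combined log-linearly, and the fused scores are passed to the
# decoder in place of either system's own scores.
import numpy as np

def fuse_frame_scores(loglik_dnn, loglik_gmm, alpha=0.5):
    """Log-linear combination of two (T, S) matrices of log-likelihoods
    for T frames and S HMM states. alpha is a hypothetical interpolation
    weight, typically tuned on held-out data."""
    return alpha * loglik_dnn + (1.0 - alpha) * loglik_gmm

# Toy usage with random scores for 10 frames and 4 states.
rng = np.random.default_rng(1)
fused = fuse_frame_scores(rng.standard_normal((10, 4)),
                          rng.standard_normal((10, 4)), alpha=0.7)
print(fused.shape)  # (10, 4)
```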



Author information

Corresponding author: Dong Yu.


Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Yu, D., Deng, L. (2015). Fuse Deep Neural Network and Gaussian Mixture Model Systems. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_10

  • DOI: https://doi.org/10.1007/978-1-4471-5779-3_10

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5778-6

  • Online ISBN: 978-1-4471-5779-3

  • eBook Packages: Engineering, Engineering (R0)
