Abstract
In this chapter, we introduce techniques that fuse deep neural networks (DNNs) and Gaussian mixture models (GMMs). We first describe the Tandem and bottleneck approaches, in which DNNs serve as feature extractors: the hidden-layer activations, which form a better representation than the raw input features, are used as features in GMM systems. We then introduce techniques that fuse the recognition results and frame-level scores of the DNN-HMM hybrid system with those of the GMM-HMM system.
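To make the Tandem/bottleneck idea concrete, the following is a minimal numpy sketch of extracting bottleneck features from a pre-trained DNN. Everything in it (the function name, the sigmoid activation, the layer sizes) is illustrative rather than taken from the chapter; a real system would forward-propagate through the actual trained network and typically decorrelate the bottleneck activations (e.g., with PCA or HLDA) before handing them to the GMM-HMM, often concatenated with the original acoustic features.

```python
# Hedged sketch: tap a narrow hidden layer of a trained MLP and use its
# activations as per-frame features for a GMM-HMM. All names and sizes
# here are hypothetical illustrations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_bottleneck(frames, weights, biases, bottleneck_index):
    """Forward-propagate acoustic frames through the DNN and return the
    activations of the (narrow) bottleneck layer.

    frames:  (num_frames, input_dim) array of spliced acoustic features
    weights: list of weight matrices, one per layer
    biases:  list of bias vectors, one per layer
    bottleneck_index: which hidden layer to tap (0-based)
    """
    h = frames
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ W + b)
        if i == bottleneck_index:
            return h  # narrow layer: these are the bottleneck features
    raise ValueError("bottleneck_index out of range")

# Toy example: 39-dim input, hidden layers 512 -> 40 (bottleneck) -> 512.
rng = np.random.default_rng(0)
dims = [39, 512, 40, 512]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
frames = rng.standard_normal((100, 39))
feats = extract_bottleneck(frames, weights, biases, bottleneck_index=1)
print(feats.shape)  # (100, 40): per-frame bottleneck features for the GMM-HMM
```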
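For the second family of techniques, one common frame-level scheme is a log-linear interpolation of the two systems' per-state acoustic scores. The sketch below assumes both systems score the same tied-state (senone) set; the interpolation weight `alpha` and the function names are hypothetical, not the book's notation. Since a hybrid DNN outputs state posteriors, they are first converted to scaled log-likelihoods by subtracting the log state priors.

```python
# Hedged sketch: frame-level score fusion between a DNN-HMM hybrid and a
# GMM-HMM, assuming a shared senone set. `alpha` is a tunable weight.
import numpy as np

def posterior_to_scaled_loglik(log_posteriors, log_priors):
    """Hybrid-system conversion: log p(x_t|s) = log p(s|x_t) - log p(s) + C,
    where C = log p(x_t) is shared by all states and can be dropped."""
    return log_posteriors - log_priors

def fuse_frame_scores(dnn_loglik, gmm_loglik, alpha=0.5):
    """Log-linear interpolation of per-frame, per-state log scores:
    fused = alpha * DNN + (1 - alpha) * GMM. The fused matrix replaces a
    single system's scores inside the usual Viterbi decoder."""
    return alpha * dnn_loglik + (1.0 - alpha) * gmm_loglik

# Toy usage: 100 frames, 2000 tied states.
rng = np.random.default_rng(1)
log_post = np.log(rng.dirichlet(np.ones(2000), size=100))  # DNN posteriors
log_prior = np.log(np.full(2000, 1.0 / 2000))              # uniform priors
gmm_ll = rng.standard_normal((100, 2000)) - 50.0           # stand-in GMM log-likelihoods
fused = fuse_frame_scores(posterior_to_scaled_loglik(log_post, log_prior),
                          gmm_ll, alpha=0.7)
print(fused.shape)  # (100, 2000)
```

Result-level fusion (e.g., ROVER or lattice/confusion-network combination) instead operates on the decoded hypotheses of the two systems and is complementary to this frame-level scheme.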
Copyright information
© 2015 Springer-Verlag London
Cite this chapter
Yu, D., Deng, L. (2015). Fuse Deep Neural Network and Gaussian Mixture Model Systems. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_10
DOI: https://doi.org/10.1007/978-1-4471-5779-3_10
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5778-6
Online ISBN: 978-1-4471-5779-3
eBook Packages: Engineering