Abstract
In this chapter, we introduce techniques that fuse deep neural networks (DNNs) and Gaussian mixture models (GMMs). We first describe the Tandem and bottleneck approaches, in which DNNs serve as feature extractors: the hidden-layer activations, which form a better representation than the raw input features, are used as features in GMM systems. We then introduce techniques that fuse the recognition results and frame-level scores of the DNN-HMM hybrid system with those of the GMM-HMM system.
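To make the Tandem/bottleneck idea concrete, the following is a minimal numpy sketch of extracting bottleneck features from a pre-trained DNN. Everything in it (the function name, the sigmoid activation, the layer sizes) is illustrative rather than taken from the chapter; a real system would forward-propagate through the actual trained network and typically decorrelate the bottleneck activations (e.g., with PCA or HLDA) before handing them to the GMM-HMM, often concatenated with the original acoustic features.

```python
# Hedged sketch: tap a narrow hidden layer of a trained MLP and use its
# activations as per-frame features for a GMM-HMM. All names and sizes
# here are hypothetical illustrations.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def extract_bottleneck(frames, weights, biases, bottleneck_index):
    """Forward-propagate acoustic frames through the DNN and return the
    activations of the (narrow) bottleneck layer.

    frames:  (num_frames, input_dim) array of spliced acoustic features
    weights: list of weight matrices, one per layer
    biases:  list of bias vectors, one per layer
    bottleneck_index: which hidden layer to tap (0-based)
    """
    h = frames
    for i, (W, b) in enumerate(zip(weights, biases)):
        h = sigmoid(h @ W + b)
        if i == bottleneck_index:
            return h  # narrow layer: these are the bottleneck features
    raise ValueError("bottleneck_index out of range")

# Toy example: 39-dim input, hidden layers 512 -> 40 (bottleneck) -> 512.
rng = np.random.default_rng(0)
dims = [39, 512, 40, 512]
weights = [rng.standard_normal((a, b)) * 0.01 for a, b in zip(dims[:-1], dims[1:])]
biases = [np.zeros(b) for b in dims[1:]]
frames = rng.standard_normal((100, 39))
feats = extract_bottleneck(frames, weights, biases, bottleneck_index=1)
print(feats.shape)  # (100, 40): per-frame bottleneck features for the GMM-HMM
```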
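For the second family of techniques, one common frame-level scheme is a log-linear interpolation of the two systems' per-state acoustic scores. The sketch below assumes both systems score the same tied-state (senone) set; the interpolation weight `alpha` and the function names are hypothetical, not the book's notation. Since a hybrid DNN outputs state posteriors, they are first converted to scaled log-likelihoods by subtracting the log state priors.

```python
# Hedged sketch: frame-level score fusion between a DNN-HMM hybrid and a
# GMM-HMM, assuming a shared senone set. `alpha` is a tunable weight.
import numpy as np

def posterior_to_scaled_loglik(log_posteriors, log_priors):
    """Hybrid-system conversion: log p(x_t|s) = log p(s|x_t) - log p(s) + C,
    where C = log p(x_t) is shared by all states and can be dropped."""
    return log_posteriors - log_priors

def fuse_frame_scores(dnn_loglik, gmm_loglik, alpha=0.5):
    """Log-linear interpolation of per-frame, per-state log scores:
    fused = alpha * DNN + (1 - alpha) * GMM. The fused matrix replaces a
    single system's scores inside the usual Viterbi decoder."""
    return alpha * dnn_loglik + (1.0 - alpha) * gmm_loglik

# Toy usage: 100 frames, 2000 tied states.
rng = np.random.default_rng(1)
log_post = np.log(rng.dirichlet(np.ones(2000), size=100))  # DNN posteriors
log_prior = np.log(np.full(2000, 1.0 / 2000))              # uniform priors
gmm_ll = rng.standard_normal((100, 2000)) - 50.0           # stand-in GMM log-likelihoods
fused = fuse_frame_scores(posterior_to_scaled_loglik(log_post, log_prior),
                          gmm_ll, alpha=0.7)
print(fused.shape)  # (100, 2000)
```

Result-level fusion (e.g., ROVER or lattice/confusion-network combination) instead operates on the decoded hypotheses of the two systems and is complementary to this frame-level scheme.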
Copyright information
© 2015 Springer-Verlag London
Cite this chapter
Yu, D., Deng, L. (2015). Fuse Deep Neural Network and Gaussian Mixture Model Systems. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_10
DOI: https://doi.org/10.1007/978-1-4471-5779-3_10
Publisher Name: Springer, London
Print ISBN: 978-1-4471-5778-6
Online ISBN: 978-1-4471-5779-3
eBook Packages: Engineering