Abstract
In Automatic Speech Recognition (ASR), the acoustic model (AM) is typically realized by a Deep Neural Network (DNN). The DNN learns a posterior probability in a supervised fashion from input features and ground-truth labels. Current approaches combine a DNN with a Hidden Markov Model (HMM) in a hybrid system, which has achieved good results in recent years. Similar approaches using a discrete variant, i.e., a Discrete Hidden Markov Model (DHMM), have been largely disregarded in the recent past. Our approach revisits the idea of a discrete system, more precisely the so-called Deep Neural Network Quantizer (DNNQ), and demonstrates how a DNNQ is created and trained. We introduce a novel method to train a DNNQ in a supervised fashion with an arbitrary output layer size, even though suitable target values are not available. The proposed method provides a mapping function that exploits the fixed ground-truth labels, so that a frame-based cross-entropy (CE) training can be applied. Our experiments demonstrate that the DNNQ reduces the Word Error Rate (WER) by 17.6% on monophones and by 2.2% on triphones, respectively, compared to a continuous HMM-Gaussian Mixture Model (GMM) system.
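The frame-based cross-entropy training mentioned above can be illustrated with a minimal sketch: once the mapping function has assigned each frame's ground-truth label to a codeword index of the quantizer, the loss is an ordinary per-frame cross entropy over an output layer of arbitrary size. The function names, the codebook size, and the toy labels below are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frame_cross_entropy(logits, targets):
    """Mean frame-level cross entropy.

    logits:  (num_frames, num_codewords) raw DNN outputs;
             num_codewords is the arbitrary output layer size.
    targets: (num_frames,) integer codeword indices produced by
             the mapping from fixed ground-truth labels.
    """
    probs = softmax(logits)
    frame_idx = np.arange(logits.shape[0])
    return -np.mean(np.log(probs[frame_idx, targets] + 1e-12))

# Toy example: 4 frames, a quantizer with 8 output units.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
targets = np.array([2, 5, 1, 7])  # hypothetical mapped labels
loss = frame_cross_entropy(logits, targets)
```

In a real system the gradient of this loss with respect to the logits would be backpropagated through the DNN, exactly as in standard hybrid DNN/HMM training; only the target indices differ, coming from the label-to-codeword mapping.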
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Watzel, T., Li, L., Kürzinger, L., Rigoll, G. (2019). Deep Neural Network Quantizers Outperforming Continuous Speech Recognition Systems. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science(), vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3