Abstract
In Automatic Speech Recognition (ASR), the acoustic model (AM) is typically realized by a Deep Neural Network (DNN). The DNN learns a posterior probability in a supervised fashion from input features and ground-truth labels. Current approaches combine a DNN with a Hidden Markov Model (HMM) in a hybrid system, which has achieved good results in recent years. Similar approaches using a discrete variant, i.e., a Discrete Hidden Markov Model (DHMM), have been largely disregarded in the recent past. Our approach revisits the idea of a discrete system, more precisely the so-called Deep Neural Network Quantizer (DNNQ), and demonstrates how a DNNQ is created and trained. We introduce a novel method to train a DNNQ in a supervised fashion with an arbitrary output layer size, even though suitable target values are not available. The proposed method provides a mapping function that exploits the fixed ground-truth labels, so that a frame-based cross-entropy (CE) training can be applied. Our experiments demonstrate that the DNNQ reduces the Word Error Rate (WER) by 17.6% on monophones and by 2.2% on triphones, respectively, compared to a continuous HMM-Gaussian Mixture Model (GMM) system.
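The frame-based cross-entropy training mentioned above can be illustrated with a minimal sketch: once the mapping function has assigned each frame's ground-truth label to a codeword index of the quantizer, the loss is an ordinary per-frame cross entropy over an output layer of arbitrary size. The function names, the codebook size, and the toy labels below are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the last axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def frame_cross_entropy(logits, targets):
    """Mean frame-level cross entropy.

    logits:  (num_frames, num_codewords) raw DNN outputs;
             num_codewords is the arbitrary output layer size.
    targets: (num_frames,) integer codeword indices produced by
             the mapping from fixed ground-truth labels.
    """
    probs = softmax(logits)
    frame_idx = np.arange(logits.shape[0])
    return -np.mean(np.log(probs[frame_idx, targets] + 1e-12))

# Toy example: 4 frames, a quantizer with 8 output units.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 8))
targets = np.array([2, 5, 1, 7])  # hypothetical mapped labels
loss = frame_cross_entropy(logits, targets)
```

In a real system the gradient of this loss with respect to the logits would be backpropagated through the DNN, exactly as in standard hybrid DNN/HMM training; only the target indices differ, coming from the label-to-codeword mapping.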
Copyright information
© 2019 Springer Nature Switzerland AG
Cite this paper
Watzel, T., Li, L., Kürzinger, L., Rigoll, G. (2019). Deep Neural Network Quantizers Outperforming Continuous Speech Recognition Systems. In: Salah, A., Karpov, A., Potapova, R. (eds) Speech and Computer. SPECOM 2019. Lecture Notes in Computer Science(), vol 11658. Springer, Cham. https://doi.org/10.1007/978-3-030-26061-3_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-26060-6
Online ISBN: 978-3-030-26061-3