
Recurrent DNNs and Its Ensembles on the TIMIT Phone Recognition Task

Conference paper in: Speech and Computer (SPECOM 2018)

Abstract

In this paper, we investigate recurrent deep neural networks (DNNs) in combination with regularization techniques such as dropout, zoneout, and a regularization post-layer. As a benchmark, we chose the TIMIT phone recognition task due to its popularity and broad availability in the community. It also simulates a low-resource scenario, which makes it relevant to minority languages. Moreover, we prefer the phone recognition task because it is much more sensitive to the quality of the acoustic model than a large-vocabulary continuous speech recognition task. In recent years, recurrent DNNs have pushed down the error rates in automatic speech recognition, but no proposed architecture has emerged as a clear winner. Dropout was used as the regularization technique in most cases, while its combination with other regularization techniques and with model ensembles was left unexplored. In our experiments, an ensemble of recurrent DNNs performed best: it achieved an average phone error rate (PER) of 14.84% over 10 experiments (minimum 14.69%) on the core test set, which is slightly lower than the best published PER to date, to our knowledge. Finally, in contrast to most papers, we publish open-source scripts to make it easy to replicate the results and to help continue the development.
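
To make the regularization and ensembling concrete, here is a minimal NumPy sketch; the function names are our own illustration and are not taken from the authors' published scripts. Zoneout stochastically preserves a unit's previous hidden state instead of dropping its activation, and the ensemble averages frame-level phone posteriors over its members before decoding.

    import numpy as np

    def zoneout(h_prev, h_new, rate, training=True, rng=np.random):
        # Zoneout: with probability `rate`, each hidden unit keeps its
        # previous value h_prev instead of the freshly computed h_new.
        if training:
            keep = rng.binomial(1, rate, size=h_prev.shape)
            return keep * h_prev + (1 - keep) * h_new
        # At test time, mix both states with the expected preservation rate.
        return rate * h_prev + (1 - rate) * h_new

    def ensemble_posteriors(per_model_posteriors):
        # Average frame-level phone posteriors over the ensemble members
        # before decoding; input shape: (n_models, n_frames, n_phones).
        return np.mean(np.asarray(per_model_posteriors), axis=0)

Consistent with the abstract, it is this posterior average over the ensemble, not any single member, that reaches the reported 14.84% average PER.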



Acknowledgments

This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506, and by a grant of the University of West Bohemia, project No. SGS-2016-039. Access to the computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information


Correspondence to Josef Michálek.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Vaněk, J., Michálek, J., Psutka, J. (2018). Recurrent DNNs and Its Ensembles on the TIMIT Phone Recognition Task. In: Karpov, A., Jokisch, O., Potapova, R. (eds.) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol. 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_74


  • DOI: https://doi.org/10.1007/978-3-319-99579-3_74

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99578-6

  • Online ISBN: 978-3-319-99579-3

  • eBook Packages: Computer Science, Computer Science (R0)
