
Recurrent DNNs and Its Ensembles on the TIMIT Phone Recognition Task

Conference paper in: Speech and Computer (SPECOM 2018)

Abstract

In this paper, we investigate recurrent deep neural networks (DNNs) in combination with regularization techniques such as dropout, zoneout, and a regularization post-layer. As a benchmark, we chose the TIMIT phone recognition task due to its popularity and broad availability in the community. It also simulates a low-resource scenario, which makes it relevant to minority languages. Moreover, we prefer the phone recognition task because it is much more sensitive to the quality of the acoustic model than a large-vocabulary continuous speech recognition task. In recent years, recurrent DNNs have pushed down the error rates in automatic speech recognition, but no proposed architecture has emerged as a clear winner. Dropout was used as the regularization technique in most cases, while its combination with other regularization techniques and with model ensembles was left unexplored. In our experiments, an ensemble of recurrent DNNs performed best: it achieved an average phone error rate (PER) of 14.84% over 10 experiments (minimum 14.69%) on the core test set, which is slightly lower than the best published PER to date, to our knowledge. Finally, in contrast to most papers, we publish open-source scripts to make it easy to replicate the results and to help continue the development.
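
To make the regularization and ensembling concrete, here is a minimal NumPy sketch; the function names are our own illustration and are not taken from the authors' published scripts. Zoneout stochastically preserves a unit's previous hidden state instead of dropping its activation, and the ensemble averages frame-level phone posteriors over its members before decoding.

    import numpy as np

    def zoneout(h_prev, h_new, rate, training=True, rng=np.random):
        # Zoneout: with probability `rate`, each hidden unit keeps its
        # previous value h_prev instead of the freshly computed h_new.
        if training:
            keep = rng.binomial(1, rate, size=h_prev.shape)
            return keep * h_prev + (1 - keep) * h_new
        # At test time, mix both states with the expected preservation rate.
        return rate * h_prev + (1 - rate) * h_new

    def ensemble_posteriors(per_model_posteriors):
        # Average frame-level phone posteriors over the ensemble members
        # before decoding; input shape: (n_models, n_frames, n_phones).
        return np.mean(np.asarray(per_model_posteriors), axis=0)

Consistent with the abstract, it is this posterior average over the ensemble, not any single member, that reaches the reported 14.84% average PER.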



Acknowledgments

This work was supported by the Ministry of Education, Youth and Sports of the Czech Republic, project No. LO1506, and by a grant of the University of West Bohemia, project No. SGS-2016-039. Access to the computing and storage facilities owned by parties and projects contributing to the National Grid Infrastructure MetaCentrum, provided under the programme “Projects of Large Research, Development, and Innovations Infrastructures” (CESNET LM2015042), is greatly appreciated.

Author information


Correspondence to Josef Michálek.


Copyright information

© 2018 Springer Nature Switzerland AG

About this paper


Cite this paper

Vaněk, J., Michálek, J., Psutka, J. (2018). Recurrent DNNs and Its Ensembles on the TIMIT Phone Recognition Task. In: Karpov, A., Jokisch, O., Potapova, R. (eds.) Speech and Computer. SPECOM 2018. Lecture Notes in Computer Science, vol. 11096. Springer, Cham. https://doi.org/10.1007/978-3-319-99579-3_74


  • DOI: https://doi.org/10.1007/978-3-319-99579-3_74

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-99578-6

  • Online ISBN: 978-3-319-99579-3

  • eBook Packages: Computer Science, Computer Science (R0)
