A Survey of Recent DNN Architectures on the TIMIT Phone Recognition Task

  • Josef Michálek
  • Jan Vaněk
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11107)

Abstract

In this survey paper, we evaluate several recent deep neural network (DNN) architectures on the TIMIT phone recognition task. We chose the TIMIT corpus for its popularity and broad availability in the community; it also simulates a low-resource scenario, which is relevant for minor languages. We prefer the phone recognition task because it is much more sensitive to acoustic model quality than a large vocabulary continuous speech recognition (LVCSR) task. In recent years, many DNN papers have reported results on TIMIT. However, the reported phone error rates (PERs) were often much higher than the PER of a simple feed-forward (FF) DNN. That was the main motivation of this paper: to provide baseline DNNs with the lowest possible PERs, together with open-source scripts that let future papers easily replicate the baseline results. To our knowledge, the best PER achieved in this survey is better than the best PER published to date.

Keywords

Neural networks · Acoustic model · Survey · Review · TIMIT · LSTM · Phone recognition

References

  1. A flexible framework of neural networks for deep learning. https://chainer.org
  2. Garofolo, J.S., et al.: TIMIT Acoustic-Phonetic Continuous Speech Corpus. Linguistic Data Consortium LDC93S1 (1993)
  3. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)
  4. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Comput. 9(8), 1735–1780 (1997)
  5. Kaldi speech recognition toolkit. https://github.com/kaldi-asr/kaldi
  6. Larochelle, H., Bengio, Y., Louradour, J., Lamblin, P.: Exploring strategies for training deep neural networks. J. Mach. Learn. Res. 10(Jan), 1–40 (2009)
  7. Lopes, C., Perdigao, F.: Phoneme recognition on the TIMIT database. Speech Technologies (2011). https://doi.org/10.5772/17600, http://www.intechopen.com/books/speech-technologies/phoneme-recognition-on-the-timit-database
  8. Mohamed, A., Dahl, G.E., Hinton, G.: Acoustic modeling using deep belief networks. IEEE Trans. Audio Speech Lang. Process. 20(1), 14–22 (2012). https://doi.org/10.1109/TASL.2011.2109382
  9. Moon, T., Choi, H., Lee, H., Song, I.: RNNDROP: a novel dropout for RNNs in ASR. In: Proceedings of the ASRU (2015)
  10. Peddinti, V., Povey, D., Khudanpur, S.: A time delay neural network architecture for efficient modeling of long temporal contexts. In: Sixteenth Annual Conference of the International Speech Communication Association (2015)
  11. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
  12. Tóth, L.: Convolutional deep rectifier neural nets for phone recognition. In: Proceedings of INTERSPEECH, pp. 1722–1726, August 2013
  13. Tóth, L.: Convolutional deep maxout networks for phone recognition. In: Proceedings of INTERSPEECH, pp. 1078–1082 (2014). https://doi.org/10.1186/s13636-015-0068-3
  14. Vaněk, J., Zelinka, J., Soutner, D., Psutka, J.: A regularization post layer: an additional way how to make deep neural networks robust. In: Camelin, N., Estève, Y., Martín-Vide, C. (eds.) SLSP 2017. LNCS (LNAI), vol. 10583, pp. 204–214. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-68456-7_17
  15. Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., Lang, K.J.: Phoneme recognition using time-delay neural networks. In: Readings in Speech Recognition, pp. 393–404. Elsevier (1990)

Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  1. University of West Bohemia, Pilsen, Czech Republic