
Training Maxout Neural Networks for Speech Recognition Tasks

  • Conference paper

In: Text, Speech, and Dialogue (TSD 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9924)


Abstract

The topic of the paper is the training of deep neural networks that use tunable piecewise-linear activation functions called “maxout” for speech recognition tasks. Maxout networks are compared to conventional fully-connected DNNs when trained with both the cross-entropy and the sequence-discriminative (sMBR) criteria. Experiments are carried out on the CHiME Challenge 2015 corpus of multi-microphone noisy dictation speech and on the Switchboard corpus of conversational telephone speech. A clear advantage of maxout networks over DNNs is demonstrated when using the cross-entropy criterion on both corpora. It is also argued that maxout networks are prone to overfitting during sequence training, but in some cases this can be successfully overcome with KL-divergence based regularization.
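The KL-divergence based regularization mentioned in the abstract can be sketched, in the spirit of Yu et al. [30], as interpolating the sequence-training targets with the posteriors of a reference (e.g. cross-entropy trained) model; the function name and the interpolation weight `rho` below are illustrative assumptions, not the paper's exact recipe:

```python
import numpy as np

def kld_regularized_targets(hard_targets, ref_posteriors, rho=0.5):
    # Mix the hard training targets with the reference model's posteriors.
    # Minimizing cross-entropy against this mixture is equivalent to adding
    # a KL-divergence penalty that keeps the model close to the reference.
    return (1.0 - rho) * hard_targets + rho * ref_posteriors

hard = np.array([0.0, 1.0, 0.0])  # one-hot sequence-training target
ref = np.array([0.2, 0.5, 0.3])   # reference model posteriors
mixed = kld_regularized_targets(hard, ref, rho=0.5)  # still a distribution
```

With `rho=0` this reduces to ordinary sequence training, and larger `rho` pulls the model more strongly toward the reference.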


Notes

  1. Similar ideas of training deep networks were proposed much earlier (see, for example, the review in [25]), but for some reason they did not receive proper attention.

  2. The fraction of the neurons turned off is called the dropout rate and can differ from 0.5.
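As a minimal sketch of the scheme this note describes (an "inverted dropout" variant; the helper name and seeding are assumptions for the example, not the paper's implementation):

```python
import numpy as np

def dropout(h, rate, rng):
    # Zero each activation with probability `rate` and rescale the
    # survivors by 1/(1 - rate), so no rescaling is needed at test time.
    mask = rng.random(h.shape) >= rate
    return h * mask / (1.0 - rate)

rng = np.random.default_rng(0)
h = np.ones(8)
out = dropout(h, rate=0.5, rng=rng)  # surviving units are scaled to 2.0
```

At the default rate of 0.5, roughly half of the units are dropped in each forward pass; a different rate simply changes both the drop probability and the rescaling factor.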

  3. Hereinafter the maxout group size \(k=2\) is used unless otherwise stated. Increasing the group size makes both training and recognition slower and makes training less stable, but does not provide significant improvements in our experiments.
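For illustration (not the authors' code), a maxout unit with group size \(k\) takes the maximum over \(k\) linear outputs, so \(k=2\) halves the number of activations a layer emits:

```python
import numpy as np

def maxout(z, k=2):
    # Split the linear outputs into groups of size k along the last
    # axis and keep the maximum in each group.
    assert z.shape[-1] % k == 0, "layer width must be divisible by k"
    return z.reshape(z.shape[:-1] + (-1, k)).max(axis=-1)

z = np.array([1.0, -2.0, 0.5, 3.0])  # 4 linear outputs
a = maxout(z, k=2)                   # 2 maxout activations: 1.0 and 3.0
```

Because the unit simply selects the larger of its \(k\) linear pieces, the resulting activation is piecewise linear and its shape is learned through the weights of the pieces.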

  4. We do not compare Maxout + AD to ReLU + AD because, in our experience, AD training of ReLU networks does not provide a significant WER reduction (this is also observed in [19, Sect. 4.8]), whilst carefully tuned sigmoidal DNNs with \(L_2\) weight decay often outperform ReLU DNNs with dropout and other types of regularization.

References

  1. Abdel-Hamid, O., Mohamed, A.R., Jiang, H., Penn, G.: Applying convolutional neural networks concepts to hybrid NN-HMM model for speech recognition. In: 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4277–4280. IEEE (2012)

  2. Barker, J., Marxer, R., Vincent, E., Watanabe, S.: The third ‘CHiME’ speech separation and recognition challenge: dataset, task and baselines. In: Proceedings of 2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), pp. 504–511 (2015)

  3. Bengio, Y.: Learning deep architectures for AI. Found. Trends Mach. Learn. 2(1), 1–127 (2009)

  4. Bengio, Y., Lamblin, P., Popovici, D., Larochelle, H.: Greedy layer-wise training of deep networks. In: Schölkopf, B., Platt, J.C., Hoffman, T. (eds.) Advances in Neural Information Processing Systems, vol. 19, pp. 153–160. MIT Press, Cambridge (2007)

  5. Cai, M., Shi, Y., Liu, J.: Deep maxout neural networks for speech recognition. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 291–296. IEEE (2013)

  6. de-la-Calle-Silos, F., Gallardo-Antolín, A., Peláez-Moreno, C.: Deep maxout networks applied to noise-robust speech recognition. In: IberSPEECH 2014. LNCS, vol. 8854, pp. 109–118. Springer, Heidelberg (2014)

  7. Carreira-Perpinan, M.A., Hinton, G.: On contrastive divergence learning. In: AISTATS, vol. 10, pp. 33–40 (2005)

  8. Dahl, G.E., Sainath, T.N., Hinton, G.E.: Improving deep neural networks for LVCSR using rectified linear units and dropout. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8609–8613. IEEE (2013)

  9. Goodfellow, I.J., Warde-Farley, D., Mirza, M., Courville, A., Bengio, Y.: Maxout networks. arXiv preprint arXiv:1302.4389 (2013)

  10. Graves, A., Mohamed, A.R., Hinton, G.: Speech recognition with deep recurrent neural networks. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6645–6649. IEEE (2013)

  11. Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14), pp. 1764–1772 (2014)

  12. Hinton, G.E., Osindero, S., Teh, Y.W.: A fast learning algorithm for deep belief nets. Neural Comput. 18(7), 1527–1554 (2006)

  13. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)

  14. Miao, Y., Metze, F., Rawat, S.: Deep maxout networks for low-resource speech recognition. In: 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 398–403. IEEE (2013)

  15. Nair, V., Hinton, G.E.: Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp. 807–814 (2010)

  16. Povey, D., Ghoshal, A., Boulianne, G., Burget, L., Glembek, O., Goel, N., Hannemann, M., Motlicek, P., Qian, Y., Schwarz, P., Silovsky, J., Stemmer, G., Vesely, K.: The Kaldi speech recognition toolkit. In: Proceedings of 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2011) (2011)

  17. Prudnikov, A., Korenevsky, M., Aleinik, S.: Adaptive beamforming and adaptive training of DNN acoustic models for enhanced multichannel noisy speech recognition. In: Proceedings of 2015 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2015), pp. 401–408 (2015)

  18. Prudnikov, A., Medennikov, I., Mendelev, V., Korenevsky, M., Khokhlov, Y.: Improving acoustic models for Russian spontaneous speech recognition. In: Ronzhin, A., Potapova, R., Fakotakis, N. (eds.) SPECOM 2015. LNCS, vol. 9319, pp. 234–242. Springer, Heidelberg (2015)

  19. Rennie, S.J., Goel, V., Thomas, S.: Annealed dropout training of deep networks. In: 2014 IEEE Spoken Language Technology Workshop (SLT), pp. 159–164. IEEE (2014)

  20. Rumelhart, D., Hinton, G., Williams, R.: Learning internal representations by error propagation. Parallel Distrib. Process. 1, 318–362 (1986)

  21. Sainath, T., Rao, K., et al.: Acoustic modelling with CD-CTC-sMBR LSTM RNNs. In: 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 604–609. IEEE (2015)

  22. Sainath, T.N., Mohamed, A.R., Kingsbury, B., Ramabhadran, B.: Deep convolutional neural networks for LVCSR. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8614–8618. IEEE (2013)

  23. Saon, G., Soltau, H., Nahamoo, D., Picheny, M.: Speaker adaptation of neural network acoustic models using i-vectors. In: Proceedings of 2013 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU 2013), pp. 55–59 (2013)

  24. Saon, G., Kuo, H.K.J., Rennie, S., Picheny, M.: The IBM 2015 English conversational telephone speech recognition system. arXiv preprint arXiv:1505.05899 (2015)

  25. Schmidhuber, J.: Deep learning in neural networks: an overview. Neural Netw. 61, 85–117 (2015)

  26. Seide, F., Li, G., Chen, X., Yu, D.: Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: 2011 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 24–29. IEEE (2011)

  27. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  28. Swietojanski, P., Li, J., Huang, J.T.: Investigation of maxout networks for speech recognition. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 7699–7703. IEEE (2014)

  29. Yu, D., Deng, L.: Automatic Speech Recognition: A Deep Learning Approach. Springer, London (2015)

  30. Yu, D., Yao, K., Su, H., Li, G., Seide, F.: KL-divergence regularized deep neural network adaptation for improved large vocabulary speech recognition. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)

  31. Zeiler, M.D., Ranzato, M., Monga, R., Mao, M., Yang, K., Le, Q.V., Nguyen, P., Senior, A., Vanhoucke, V., Dean, J., et al.: On rectified linear units for speech processing. In: 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3517–3521. IEEE (2013)

  32. Zhang, X., Trmal, J., Povey, D., Khudanpur, S.: Improving deep neural network acoustic models using generalized maxout networks. In: 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 215–219. IEEE (2014)


Acknowledgments

This work was financially supported by the Ministry of Education and Science of the Russian Federation, Contract 14.575.21.0033 (ID RFMEFI57514X0033).

Author information

Correspondence to Maxim Korenevsky.


Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Prudnikov, A., Korenevsky, M. (2016). Training Maxout Neural Networks for Speech Recognition Tasks. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2016. Lecture Notes in Computer Science, vol 9924. Springer, Cham. https://doi.org/10.1007/978-3-319-45510-5_51

  • DOI: https://doi.org/10.1007/978-3-319-45510-5_51

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-45509-9

  • Online ISBN: 978-3-319-45510-5

  • eBook Packages: Computer Science (R0)
