
Feature Representation Learning in Deep Neural Networks

Chapter in the book Automatic Speech Recognition

Part of the book series: Signals and Communication Technology (SCT)

Abstract

In this chapter, we show that deep neural networks jointly learn the feature representation and the classifier. Through many layers of nonlinear processing, DNNs transform the raw input feature into a more invariant and discriminative representation that can be better classified by the log-linear model. In addition, DNNs learn a hierarchy of features. The lower-level features typically capture local patterns, which are very sensitive to changes in the raw feature. The higher-level features, built upon the low-level ones, are more abstract and invariant to variations in the raw feature. We demonstrate that the learned high-level features are robust to speaker and environment variations.
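The layered transformation described in the abstract can be sketched as a minimal feed-forward network: the hidden layers map the raw input feature into progressively more abstract representations, and the final softmax layer plays the role of the log-linear classifier. The layer sizes and random weights below are illustrative placeholders, not the models studied in the chapter.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Hypothetical sizes: a 40-dim raw feature, two hidden layers, 10 classes.
sizes = [40, 128, 128, 10]
weights = [rng.standard_normal((m, n)) * 0.1 for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Hidden layers transform the raw feature; the last layer is log-linear."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = relu(h @ W + b)  # each layer yields a more abstract representation
    return softmax(h @ weights[-1] + biases[-1])  # log-linear classifier on top

probs = forward(rng.standard_normal(40))
```

The point of the sketch is the division of labor: everything before the final layer is learned feature extraction, and only the last matrix multiply plus softmax is the classifier proper.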


Notes

  1.

    Good raw features still help, however: with the existing DNN learning algorithms, the system may underperform even when only a linear transformation, such as the discrete cosine transform (DCT), is applied to the log filter-bank features.
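The DCT mentioned in this note is a fixed linear transform. As a minimal sketch, assuming a 1-D NumPy vector of log filter-bank energies, the orthonormal DCT-II that produces cepstral (MFCC-like) coefficients can be written directly:

```python
import numpy as np

def dct2(x):
    """Orthonormal DCT-II of a 1-D vector: the linear transform that maps
    log filter-bank energies to cepstral (MFCC-like) coefficients."""
    N = len(x)
    n = np.arange(N)
    basis = np.cos(np.pi / N * (n[:, None] + 0.5) * n[None, :])  # basis[n, k]
    scale = np.full(N, np.sqrt(2.0 / N))
    scale[0] = np.sqrt(1.0 / N)
    return scale * (x @ basis)

# Hypothetical 40-band log filter-bank frame; keep the first 13 coefficients,
# as is common for MFCC features.
log_fbank = np.log(np.random.default_rng(0).uniform(0.1, 1.0, size=40))
mfcc_like = dct2(log_fbank)[:13]
```

Because the transform is orthonormal, it preserves the norm of the input; it only rotates the feature space, which is exactly why the note's observation is notable.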

  2.

    This behavior can be alleviated by dynamically adding small random noise to each training sample during training.
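This noise-injection trick can be sketched as follows; the minibatch and the noise scale are illustrative values, not ones taken from the chapter:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(batch, noise_std=0.01):
    """Add fresh, small Gaussian noise to each sample on every pass,
    so the network never sees exactly the same input twice."""
    return batch + noise_std * rng.standard_normal(batch.shape)

features = rng.standard_normal((32, 40))  # hypothetical minibatch of 40-dim features
for epoch in range(3):
    noisy = augment(features)             # a different perturbation each epoch
    # ... a training step on (noisy, labels) would go here ...
```

Because the noise is resampled every epoch rather than fixed once, the perturbed inputs act as a mild regularizer on the learned feature mapping.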

  3.

    Huang et al. [8] also tried other definitions of speaking rate, such as the number of vowels per second and the rate normalized by the average duration of different phonemes. They reported that the WER pattern is very similar no matter which definition is used.

References

  1. Andreou, A., Kamm, T., Cohen, J.: Experiments in vocal tract normalization. In: Proceedings of the CAIP Workshop: Frontiers in Speech Recognition II (1994)

  2. Davis, S., Mermelstein, P.: Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences. IEEE Trans. Acoust. Speech Signal Process. 28(4), 357–366 (1980)

  3. Flego, F., Gales, M.J.: Discriminative adaptive training with VTS and JUD. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 170–175 (2009)

  4. Gales, M.J., Woodland, P.: Mean and variance adaptation within the MLLR framework. Comput. Speech Lang. 10(4), 249–264 (1996)

  5. Gales, M.J.: Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 12(2), 75–98 (1998)

  6. Gunawardana, A., Mahajan, M., Acero, A., Platt, J.C.: Hidden conditional random fields for phone classification. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 1117–1120 (2005)

  7. Hinton, G.E., Srivastava, N., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.R.: Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580 (2012)

  8. Huang, Y., Yu, D., Liu, C., Gong, Y.: A comparative analytic study on the Gaussian mixture and context-dependent deep neural network hidden Markov models. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH) (2014)

  9. ImageNet. http://www.image-net.org/

  10. Jarrett, K., Kavukcuoglu, K., Ranzato, M., LeCun, Y.: What is the best multi-stage architecture for object recognition? In: Proceedings of the IEEE International Conference on Computer Vision (ICCV), pp. 2146–2153 (2009)

  11. Kalinli, O., Seltzer, M.L., Droppo, J., Acero, A.: Noise adaptive training for robust automatic speech recognition. IEEE Trans. Audio Speech Lang. Process. 18(8), 1889–1901 (2010)

  12. Kim, D.Y., Kwan Un, C., Kim, N.S.: Speech recognition in noisy environments using first-order vector Taylor series. Speech Commun. 24(1), 39–49 (1998)

  13. Kumar, N., Andreou, A.G.: Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition. Speech Commun. 26(4), 283–297 (1998)

  14. Li, J., Deng, L., Yu, D., Gong, Y., Acero, A.: High-performance HMM adaptation with joint compensation of additive and convolutive distortions via vector Taylor series. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 65–70 (2007)

  15. Li, J., Deng, L., Yu, D., Gong, Y., Acero, A.: HMM adaptation using a phase-sensitive acoustic distortion model for environment-robust speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4069–4072 (2008)

  16. Li, J., Yu, D., Huang, J.T., Gong, Y.: Improving wideband speech recognition using mixed-bandwidth training data in CD-DNN-HMM. In: Proceedings of the IEEE Spoken Language Technology Workshop (SLT), pp. 131–136 (2012)

  17. Lowe, D.G.: Object recognition from local scale-invariant features. In: Proceedings of the Seventh IEEE International Conference on Computer Vision, vol. 2, pp. 1150–1157 (1999)

  18. Mohamed, A.R., Hinton, G., Penn, G.: Understanding how deep belief networks perform acoustic modelling. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4273–4276 (2012)

  19. Moreno, P.J., Raj, B., Stern, R.M.: A vector Taylor series approach for environment-independent speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 2, pp. 733–736 (1996)

  20. Parihar, N., Picone, J.: Aurora working group: DSR front end LVCSR evaluation AU/384/02. Institute for Signal and Information Processing, Mississippi State University, Technical Report (2002)

  21. Povey, D., Kanevsky, D., Kingsbury, B., Ramabhadran, B., Saon, G., Visweswariah, K.: Boosted MMI for model and feature-space discriminative training. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 4057–4060 (2008)

  22. Povey, D., Kingsbury, B., Mangu, L., Saon, G., Soltau, H., Zweig, G.: fMPE: Discriminatively trained features for speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. 961–964 (2005)

  23. Povey, D., Woodland, P.C.: Minimum phone error and I-smoothing for improved discriminative training. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, pp. I–105 (2002)

  24. Ragni, A., Gales, M.: Derivative kernels for noise robust ASR. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 119–124 (2011)

  25. Ratnaparkhi, A.: A simple introduction to maximum entropy models for natural language processing. IRCS Technical Reports Series, p. 81 (1997)

  26. Sainath, T.N., Kingsbury, B., Mohamed, A.R., Ramabhadran, B.: Learning filter banks within a deep neural network framework. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU) (2013)

  27. Seide, F., Li, G., Chen, X., Yu, D.: Feature engineering in context-dependent deep neural networks for conversational speech transcription. In: Proceedings of the IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 24–29 (2011)

  28. Seltzer, M., Yu, D., Wang, Y.: An investigation of deep neural networks for noise robust speech recognition. In: Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP) (2013)

  29. Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., Fergus, R.: Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199 (2013)

  30. Wang, Y., Gales, M.J.: Speaker and noise factorization for robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 20(7), 2149–2158 (2012)

  31. Yu, D., Deng, L., Acero, A.: Hidden conditional random field with distribution constraints for phone classification. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 676–679 (2009)

  32. Yu, D., Ju, Y.C., Wang, Y.Y., Zweig, G., Acero, A.: Automated directory assistance system—from theory to practice. In: Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH), pp. 2709–2712 (2007)

  33. Yu, D., Seltzer, M.L., Li, J., Huang, J.T., Seide, F.: Feature learning in deep neural networks—studies on speech recognition tasks. In: Proceedings of the ICLR (2013)

  34. Zeiler, M.D., Fergus, R.: Visualizing and understanding convolutional neural networks. arXiv preprint arXiv:1311.2901 (2013)


Author information

Correspondence to Dong Yu.


Copyright information

© 2015 Springer-Verlag London

About this chapter

Cite this chapter

Yu, D., Deng, L. (2015). Feature Representation Learning in Deep Neural Networks. In: Automatic Speech Recognition. Signals and Communication Technology. Springer, London. https://doi.org/10.1007/978-1-4471-5779-3_9

Download citation

  • DOI: https://doi.org/10.1007/978-1-4471-5779-3_9

  • Publisher Name: Springer, London

  • Print ISBN: 978-1-4471-5778-6

  • Online ISBN: 978-1-4471-5779-3

  • eBook Packages: Engineering (R0)
