Long, Deep and Wide Artificial Neural Nets for Dealing with Unexpected Noise in Machine Recognition of Speech

  • Hynek Hermansky
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8082)


Most emphasis in current deep-learning, artificial-neural-network-based automatic recognition of speech is placed on deep net architectures with multiple sequential levels of processing. The current work argues that benefits can also be gained by extending the nets longer in the temporal direction and wider into multiple parallel processing streams.
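The two directions named in the abstract can be illustrated with a small sketch. It is not the author's system: the band grouping, the context length of 11 frames, the layer sizes, the random weights, and the plain averaging of stream outputs are all assumptions made here for illustration. "Longer" is shown as splicing each frame with its temporal neighbours, and "wider" as running one small net per group of spectral bands in parallel.

```python
import numpy as np

def stack_context(frames, left, right):
    """Splice each frame with its temporal neighbours (the "longer" input)."""
    n_frames = frames.shape[0]
    # Repeat the edge frames so every position has a full context window.
    padded = np.pad(frames, ((left, right), (0, 0)), mode="edge")
    # Row t holds frames t-left .. t+right, flattened into one vector.
    return np.stack([padded[t:t + left + right + 1].ravel() for t in range(n_frames)])

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def stream_posteriors(x, w_hidden, w_out):
    """One small feed-forward stream: tanh hidden layer, softmax output."""
    return softmax(np.tanh(x @ w_hidden) @ w_out)

def multistream_posteriors(frames, band_groups, weights, left=5, right=5):
    """Run one stream per band group and average the outputs (the "wider" net).

    Averaging is an assumed, simplest-possible fusion rule.
    """
    outputs = []
    for bands, (w_hidden, w_out) in zip(band_groups, weights):
        x = stack_context(frames[:, bands], left, right)
        outputs.append(stream_posteriors(x, w_hidden, w_out))
    return np.mean(outputs, axis=0)

# Toy data: 20 frames of 12 spectral bands, split into 3 parallel streams.
rng = np.random.default_rng(0)
frames = rng.standard_normal((20, 12))
band_groups = [slice(0, 4), slice(4, 8), slice(8, 12)]
n_classes, n_hidden, context = 5, 16, 11
# Untrained random weights, used only to exercise the shapes.
weights = [(rng.standard_normal((4 * context, n_hidden)) * 0.1,
            rng.standard_normal((n_hidden, n_classes)) * 0.1)
           for _ in band_groups]
post = multistream_posteriors(frames, band_groups, weights)
```

Here `post` holds one class-posterior vector per frame. In a real system each stream would be trained, and the fusion rule could weight or drop streams that appear corrupted rather than averaging them equally.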


artificial neural networks; machine recognition of speech; robustness to noise; unexpected distortions; parallel processing





Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Hynek Hermansky
    Center for Language and Speech Processing, The Johns Hopkins University, Baltimore, USA
