Do We Need STRFs for Cocktail Parties? On the Relevance of Physiologically Motivated Features for Human Speech Perception Derived from Automatic Speech Recognition

  • Conference paper
Basic Aspects of Hearing

Part of the book series: Advances in Experimental Medicine and Biology (volume 787)

Abstract

Complex auditory features such as spectro-temporal receptive fields (STRFs) derived from cortical auditory neurons appear to be advantageous in sound processing. However, their physiological and functional relevance is still unclear. To assess the utility of such feature processing for speech reception in noise, automatic speech recognition (ASR) performance using feature sets obtained from physiological and/or psychoacoustical data and models is compared to human performance. Time-frequency representations with a nonlinear compression are compared with standard features such as mel-scaled spectrograms. Both alternatives serve as input to model estimators that infer spectro-temporal filters (and a subsequent nonlinearity) from physiological measurements in auditory brain areas of zebra finches. Alternatively, a filter bank of two-dimensional Gabor functions is employed, which covers a wide range of modulation frequencies in the time and frequency domains. The results indicate a clear increase in ASR robustness using complex features (modeled by Gabor functions), while the benefit from physiologically derived STRFs is limited. In all cases, the use of power-normalized spectral representations increases performance, indicating that substantial dynamic compression is advantageous for level-independent pattern recognition. The methods employed may help physiologists to look for more relevant STRFs and to better understand specific differences in estimated STRFs.
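The two processing ideas named in the abstract, a compressive power-law nonlinearity and a two-dimensional Gabor filter applied to a time-frequency representation, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the Hann envelope, the filter sizes, and the exponent 0.1 are all assumptions chosen for illustration, and the input is a synthetic spectrogram rather than speech.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def power_law_compress(power_spec, exponent=0.1):
    """Compress a non-negative power spectrogram with a power law.

    The exponent is an illustrative assumption, not a value from the paper.
    """
    return np.power(np.maximum(power_spec, 0.0), exponent)


def gabor_2d(omega_t, omega_f, n_t=25, n_f=9):
    """Complex 2D Gabor kernel of shape (n_f, n_t).

    A Hann envelope (assumed here) is multiplied by a complex sinusoid with
    temporal modulation omega_t (rad/frame) and spectral modulation
    omega_f (rad/channel).
    """
    t = np.arange(n_t) - (n_t - 1) / 2.0
    f = np.arange(n_f) - (n_f - 1) / 2.0
    envelope = np.outer(np.hanning(n_f), np.hanning(n_t))
    carrier = np.exp(1j * (omega_f * f[:, None] + omega_t * t[None, :]))
    return envelope * carrier


def filter_spectrogram(spec, kernel):
    """Zero-padded 2D cross-correlation; output has the shape of `spec`.

    Kernel dimensions must be odd so the output stays centred.
    """
    n_f, n_t = kernel.shape
    padded = np.pad(spec, ((n_f // 2, n_f // 2), (n_t // 2, n_t // 2)))
    windows = sliding_window_view(padded, (n_f, n_t))  # (F, T, n_f, n_t)
    return np.einsum("ftij,ij->ft", windows, kernel)


# Synthetic "spectrogram": 23 channels, 100 frames, with a purely temporal
# modulation of period 8 frames (hypothetical test signal).
frames = np.arange(100)
mod_rate = 2.0 * np.pi / 8.0
spec = np.tile(1.0 + np.cos(mod_rate * frames), (23, 1))

compressed = power_law_compress(spec)
matched = filter_spectrogram(spec, gabor_2d(mod_rate, 0.0))
mismatched = filter_spectrogram(spec, gabor_2d(2.0 * np.pi / 3.0, 0.0))
```

In this toy setting the filter tuned to the modulation rate of the input responds strongly in the interior of the spectrogram, while the filter tuned to a different rate responds weakly. A filter bank in the sense of the abstract would repeat this for many combinations of `omega_t` and `omega_f`.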


Acknowledgment

This work was supported by Deutsche Forschungsgemeinschaft (SFB-TRR 31).

Author information

Corresponding author

Correspondence to B. Kollmeier.

Copyright information

© 2013 Springer Science+Business Media New York

About this paper

Cite this paper

Kollmeier, B., Schädler, M.R.R., Meyer, A., Anemüller, J., Meyer, B.T. (2013). Do We Need STRFs for Cocktail Parties? On the Relevance of Physiologically Motivated Features for Human Speech Perception Derived from Automatic Speech Recognition. In: Moore, B., Patterson, R., Winter, I., Carlyon, R., Gockel, H. (eds) Basic Aspects of Hearing. Advances in Experimental Medicine and Biology, vol 787. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1590-9_37
