Do We Need STRFs for Cocktail Parties? On the Relevance of Physiologically Motivated Features for Human Speech Perception Derived from Automatic Speech Recognition

Kollmeier, B.; Schädler, M. R. René; Meyer, A.; Anemüller, J.; Meyer, B. T.

doi:10.1007/978-1-4614-1590-9_37

Part of the book series: Advances in Experimental Medicine and Biology (volume 787)

Abstract

Complex auditory features such as spectro-temporal receptive fields (STRFs) derived from cortical auditory neurons appear to be advantageous in sound processing. However, their physiological and functional relevance is still unclear. To assess the utility of such feature processing for speech reception in noise, automatic speech recognition (ASR) performance with feature sets obtained from physiological and/or psychoacoustical data and models is compared to human performance. Time-frequency representations with a nonlinear compression are compared with standard features such as mel-scaled spectrograms. Both alternatives serve as input to model estimators that infer spectro-temporal filters (and a subsequent nonlinearity) from physiological measurements in auditory brain areas of zebra finches. Alternatively, a filter bank of two-dimensional Gabor functions is employed that covers a wide range of modulation frequencies in both the time and the frequency domain. The results indicate a clear increase in ASR robustness with complex features modeled by Gabor functions, whereas the benefit from physiologically derived STRFs is limited. In all cases, power-normalized spectral representations increase performance, indicating that substantial dynamic compression is advantageous for level-independent pattern recognition. The methods employed may help physiologists to search for more relevant STRFs and to better understand specific differences between estimated STRFs.
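The abstract names two processing ingredients: a nonlinear compression of the time-frequency representation, and a bank of two-dimensional Gabor filters tuned to temporal and spectral modulation frequencies. The Python sketch below illustrates both on a toy spectrogram. It is not the authors' implementation: the power-law exponent of 1/15 (borrowed from power-normalized cepstral coefficients), the Hann envelope, the DC-removal step, the filter sizes, and all function names (`power_law`, `gabor_filter`, `apply_filter`, `mean_response`) are hypothetical choices made for illustration.

```python
import cmath
import math


def power_law(spec, exponent=1.0 / 15.0):
    """Compress a power spectrogram (list of frames, each a list of channel powers).

    The exponent 1/15 is an assumption borrowed from power-normalized cepstral
    coefficients; a log compression would be the mel-spectrogram counterpart.
    """
    return [[p ** exponent for p in frame] for frame in spec]


def hann(n):
    """Hann window of length n without the zero-valued end points."""
    return [0.5 - 0.5 * math.cos(2.0 * math.pi * (i + 1) / (n + 1)) for i in range(n)]


def gabor_filter(omega_t, omega_f, size_t, size_f):
    """Hypothetical complex 2-D Gabor filter: sinusoidal carrier under a Hann envelope.

    omega_t: temporal modulation frequency (radians per frame)
    omega_f: spectral modulation frequency (radians per frequency channel)
    The DC component is removed so that the filter coefficients sum to zero.
    """
    wt, wf = hann(size_t), hann(size_f)
    ct, cf = (size_t - 1) / 2.0, (size_f - 1) / 2.0
    env = [[wt[n] * wf[k] for k in range(size_f)] for n in range(size_t)]
    filt = [[env[n][k] * cmath.exp(1j * (omega_t * (n - ct) + omega_f * (k - cf)))
             for k in range(size_f)] for n in range(size_t)]
    # Subtract a scaled copy of the envelope so that sum(filt) == 0.
    dc = sum(map(sum, filt)) / sum(map(sum, env))
    return [[filt[n][k] - dc * env[n][k] for k in range(size_f)] for n in range(size_t)]


def apply_filter(spec, filt):
    """Slide filt over spec ('valid' positions only) and return response magnitudes.

    Taking the magnitude of the complex response is an assumed choice;
    using the real part would be another option.
    """
    st, sf = len(filt), len(filt[0])
    out = []
    for t in range(len(spec) - st + 1):
        row = []
        for f in range(len(spec[0]) - sf + 1):
            acc = sum(spec[t + n][f + k] * filt[n][k]
                      for n in range(st) for k in range(sf))
            row.append(abs(acc))
        out.append(row)
    return out


def mean_response(resp):
    """Average magnitude over all valid filter positions."""
    return sum(map(sum, resp)) / (len(resp) * len(resp[0]))


if __name__ == "__main__":
    # Toy compressed spectrogram: 64 frames x 8 channels, modulated in time
    # with a period of 8 frames and flat across frequency.
    spec = [[1.0 + 0.5 * math.cos(2 * math.pi * t / 8) for _ in range(8)]
            for t in range(64)]
    tuned = gabor_filter(2 * math.pi / 8, 0.0, 17, 5)
    detuned = gabor_filter(2 * math.pi / 3, 0.0, 17, 5)
    print("tuned:  ", mean_response(apply_filter(spec, tuned)))
    print("detuned:", mean_response(apply_filter(spec, detuned)))
```

The filter tuned to the 8-frame modulation period responds much more strongly than the detuned one. Because each filter sums to zero, adding a constant to the compressed spectrogram (which is what a uniform level change becomes after log compression) leaves its output unchanged. A full filter bank would repeat `gabor_filter` over a grid of `(omega_t, omega_f)` pairs.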

Acknowledgment

This work was supported by Deutsche Forschungsgemeinschaft (SFB-TRR 31).

Author information

Authors and Affiliations

Medizinische Physik, Carl von Ossietzky University, Oldenburg, D-26111, Germany
B. Kollmeier, M. R. René Schädler, A. Meyer, J. Anemüller & B. T. Meyer

Corresponding author

Correspondence to B. Kollmeier.

Editor information

Editors and Affiliations

Department of Experimental Psychology, University of Cambridge, Cambridge, United Kingdom
Brian C. J. Moore
Physiology Department, University of Cambridge, Cambridge, United Kingdom
Roy D. Patterson
Physiology Department, University of Cambridge, Cambridge, United Kingdom
Ian M. Winter
MRC Cognition and Brain Sciences Unit, Cambridge, United Kingdom
Robert P. Carlyon
MRC Cognition and Brain Sciences Unit, Cambridge, United Kingdom
Hedwig E Gockel

About this paper

Cite this paper

Kollmeier, B., Schädler, M.R.R., Meyer, A., Anemüller, J., Meyer, B.T. (2013). Do We Need STRFs for Cocktail Parties? On the Relevance of Physiologically Motivated Features for Human Speech Perception Derived from Automatic Speech Recognition. In: Moore, B., Patterson, R., Winter, I., Carlyon, R., Gockel, H. (eds) Basic Aspects of Hearing. Advances in Experimental Medicine and Biology, vol 787. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-1590-9_37

DOI: https://doi.org/10.1007/978-1-4614-1590-9_37
Published: 16 April 2013
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4614-1589-3
Online ISBN: 978-1-4614-1590-9
eBook Packages: Biomedical and Life Sciences; Biomedical and Life Sciences (R0)