Abstract
Acoustic features of speech include various spectral and temporal cues. The temporal envelope is known to play a critical role in speech recognition by human listeners, whereas automatic speech recognition (ASR) relies heavily on spectral analysis. This study compared sentence-recognition scores of humans and an ASR program (Dragon) when spectral and temporal-envelope cues were manipulated in background noise. The temporal fine structure of meaningful sentences was reduced by noise or tone vocoders. Three types of background noise were introduced: white noise, time-reversed multi-talker noise, and fake-formant noise. Spectral information was manipulated by changing the number of frequency channels. With a 20-dB signal-to-noise ratio (SNR) and four vocoding channels, white noise had a stronger disruptive effect than fake-formant noise; with 22 channels, the same pattern emerged when the SNR was lowered to 0 dB. In contrast, the ASR was unable to function with four vocoding channels even at a 20-dB SNR. Its performance was least affected by white noise and most affected by fake-formant noise. Increasing the number of channels, which improved spectral resolution, produced non-monotonic ASR behavior with white noise but not with colored noise. The ASR also performed markedly better with tone vocoders. It is possible that fake-formant noise degraded the software's performance by disrupting spectral cues, whereas white noise degraded performance by compromising speech segmentation. Overall, these results suggest that human listeners and ASR use different listening strategies in noise.
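The two signal manipulations described above, channel vocoding (replacing temporal fine structure with a noise carrier while preserving per-band envelopes) and mixing speech with noise at a fixed SNR, can be sketched as follows. This is a minimal illustration, not the study's actual processing chain: the band edges, filter orders, and the 160-Hz envelope cutoff below are assumed values chosen for the sketch.

```python
# Minimal noise-vocoder and SNR-mixing sketch (hypothetical parameters;
# the study's exact filter settings are not reproduced here).
import numpy as np
from scipy.signal import butter, sosfilt, sosfiltfilt

def noise_vocode(x, fs, n_channels=4, lo=80.0, hi=6000.0, env_cut=160.0):
    """Replace temporal fine structure with band-limited noise carriers,
    keeping each band's temporal envelope (Shannon-style vocoding)."""
    edges = np.geomspace(lo, hi, n_channels + 1)   # log-spaced band edges
    env_sos = butter(2, env_cut / (fs / 2), btype="low", output="sos")
    rng = np.random.default_rng(0)
    out = np.zeros_like(x, dtype=float)
    for k in range(n_channels):
        band_sos = butter(4, [edges[k] / (fs / 2), edges[k + 1] / (fs / 2)],
                          btype="band", output="sos")
        band = sosfilt(band_sos, x)
        env = sosfiltfilt(env_sos, np.abs(band))   # envelope: rectify + lowpass
        carrier = sosfilt(band_sos, rng.standard_normal(len(x)))
        out += np.clip(env, 0.0, None) * carrier   # envelope-modulated noise
    return out

def mix_at_snr(speech, noise, snr_db):
    """Scale the noise so the speech-to-noise power ratio equals snr_db."""
    ps = np.mean(speech ** 2)
    pn = np.mean(noise ** 2)
    scale = np.sqrt(ps / (pn * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise
```

Raising `n_channels` improves spectral resolution (the 4- vs. 22-channel contrast in the abstract); swapping the noise carrier for a sinusoid at each band's center frequency would give a tone vocoder instead.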
Acknowledgments
We thank L. Carney for significant input on the manuscript and L. Calandruccio for providing the sentences. We also thank the reviewers for their helpful insights on the manuscript.
Ethics declarations
Conflict of Interest
The authors declare that they have no conflict of interest.
Cite this article
Hu, G., Determan, S.C., Dong, Y. et al. Spectral and Temporal Envelope Cues for Human and Automatic Speech Recognition in Noise. JARO 21, 73–87 (2020). https://doi.org/10.1007/s10162-019-00737-z