Abstract
The concept of modulation has been ubiquitously linked to the notion of timbre. Modulation describes the variations of an acoustic signal (both spectrally and temporally) that shape how the acoustic energy fluctuates as the signal evolves over time. These fluctuations are largely shaped by the physics of a sound source or acoustic event and, as such, are inextricably reflective of the sound identity or its timbre. How one extracts these variations or modulations remains an open research question. The manifestation of signal variations not only spans the time and frequency axes but also bridges various resolutions in the joint spectrotemporal space. The additional variations driven by linguistic and musical constructs (e.g., semantics, harmony) further compound the complexity of the spectrotemporal space. This chapter examines common techniques that are used to explore the modulation space in such signals, which include signal processing, psychophysics, and neurophysiology. The perceptual and neural interpretations of modulation representations are discussed in the context of biological encoding of sounds in the central auditory system and the psychophysical manifestations of these cues. This chapter enumerates various representations of modulations, including the signal envelope, the modulation spectrum, and spectrotemporal receptive fields. The review also examines the effectiveness of these representations for understanding how sound modulations convey information to the listener about the timbre of a sound and, ultimately, how sound modulations shape the complex perceptual experience evoked by everyday sounds such as speech and music.
Acknowledgements
Dr. Elhilali’s work is supported by grants from the National Science Foundation (NSF), the National Institutes of Health (NIH), and the Office of Naval Research (ONR).
Compliance with Ethics Requirements
Mounya Elhilali declares she has no conflicts of interest.
Copyright information
© 2019 Springer Nature Switzerland AG
About this chapter
Cite this chapter
Elhilali, M. (2019). Modulation Representations for Speech and Music. In: Siedenburg, K., Saitis, C., McAdams, S., Popper, A., Fay, R. (eds) Timbre: Acoustics, Perception, and Cognition. Springer Handbook of Auditory Research, vol 69. Springer, Cham. https://doi.org/10.1007/978-3-030-14832-4_12
DOI: https://doi.org/10.1007/978-3-030-14832-4_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-14831-7
Online ISBN: 978-3-030-14832-4
eBook Packages: Biomedical and Life Sciences (R0)