Abstract
Information is carried in changes of a signal. The paper starts by revisiting Dudley’s concept of the carrier nature of speech. It points to the close connection of this concept to the modulation spectra of speech and argues against short-term spectral envelopes as the dominant carriers of linguistic information in speech. The history of spectral representations of speech is briefly discussed. Next comes some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR), pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. The frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is then reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a global historical overview of attempts to use spectral dynamics in machine recognition of speech, and does not always provide enough detail on the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail.
References
Arai T, Pavel M, Hermansky H, Avendano C 1999 Syllable intelligibility for temporally filtered LPC cepstral trajectories. J. Acoust. Soc. Am. 105(5): 2783–2791
Athineos M, Ellis D P W 2007 Autoregressive modelling of temporal envelopes. IEEE Trans. Signal Process. 55(11): 5237–5245
Athineos M, Hermansky H, Ellis D P W 2004 LP-TRAPS: Linear predictive temporal patterns. Proc. Interspeech 2004, Jeju Island, Korea
Avendano C 1997 Temporal processing of speech in a time-feature space. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland
Avendano C, Hermansky H 1997 On the properties of temporal processing for speech in adverse environments. Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y.
Bourlard H, Wellekens C J 1989 Links between Markov models and multilayer perceptrons, in D S Touretzky (ed), Advances in neural information processing systems I, Morgan Kaufmann, Los Altos, CA, 502–510
Cernocky J 2003 Temporal processing for feature extraction in speech recognition. Habilitation Thesis, FIT, Brno University of Technology, Czech Republic
Chen B Y 2005 Learning discriminant narrow-band temporal patterns for automatic recognition of conversational telephone speech. Ph.D. Thesis, University of California at Berkeley
Chen B, Zhu Q, Morgan N 2004 Learning long-term temporal features in LVCSR using neural networks. Proc. Interspeech 2004, Jeju Island, Korea
Cohen J 1990 Personal communications at the International Computer Science Institute, Berkeley, California
Dau T, Kollmeier B, Kohlrausch A 1997 Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102(5): 2892–2905
Dau T, Pueschel D, Kohlrausch A 1996 A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. 99(6): 3615–3622
de Veth J, Boves L 1997 Phase-corrected RASTA for automatic speech recognition over the phone. ICASSP’97, Munich
Drullman R, Festen J M, Plomp R 1994 Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Am. 95(5): 2670–2680
Dudley H 1939 Remaking speech. J. Acoust. Soc. Am. 11(2): 169–177
Dudley H 1940 The carrier nature of speech. Bell System Tech. J. 19: 495–513
Elhilali M, Chi T, Shamma S A 2003 A spectro-temporal modulation index (STMI) assessment of speech intelligibility. Speech Commun. 41(2–3): 331–348
Fousek P, Lamel L, Gauvain J 2008 Transcribing broadcast data using MLP features. Proc. Interspeech 2008, Brisbane
Furui S 1981 Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29(2): 254–272
Ganapathy S, Thomas S, Hermansky H 2009 Modulation frequency features for phoneme recognition in noisy speech. J. Acoust. Soc. Am. 125(1): EL8–EL12
Gold B 1998 Personal communications, Berkeley, California
Greenberg S 1999 Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation. Speech Commun. 29(2–4): 159–176
Grézl F 2007 TRAP-based probabilistic features for automatic speech recognition. Ph.D. Thesis, FIT, Brno University of Technology, Czech Republic
Grézl F, Karafiat M, Kontar S, Cernocky J 2007 Probabilistic and bottle-neck features for LVCSR of meetings. Proc. ICASSP’07, Honolulu
Hermansky H 1990 Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4): 1738–1752
Hermansky H 1994 Speech beyond 10 ms (temporal filtering in feature domain). International Workshop on Human Interface Technology 1994, Aizu, Japan
Hermansky H 1997 The modulation spectrum in automatic recognition of speech. Proc. 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA
Hermansky H 1998a Modulation spectrum in speech processing, in Procházka A, Uhlíř J, Rayner P J W, Kingsbury N G (eds) Signal analysis and prediction. Boston: Birkhäuser
Hermansky H 1998b Data-driven analysis of speech. Invited Paper, Proceedings of the International Conference on Text, Speech and Dialogue, Brno, Czech Republic
Hermansky H 1998c Should recognizers have ears? Speech Commun. 25(1–3): 3–27
Hermansky H, Ellis D P W, Sharma S 2000 Connectionist feature extraction for conventional HMM systems. ICASSP’00, Istanbul
Hermansky H, Fousek P 2005 Multi-resolution RASTA filtering for TANDEM-based ASR. Proc. Interspeech 2005, Lisbon, 361–364
Hermansky H, Greenberg S, Pavel M 1995 A brief (100–200 ms) history of time in feature extraction of speech. The XV Annual Speech Research Symposium, Baltimore, MD
Hermansky H, Morgan N 1994 RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4): 578–589
Hermansky H, Morgan N, Bayya A, Kohn P 1991 Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP). Proc. Eurospeech’91, 1367–1370
Hermansky H, Sharma S 1998 TRAPS – Classifiers of temporal patterns. ICSLP’98, Sydney
Houtgast T 1989 Frequency selectivity in amplitude-modulation detection. J. Acoust. Soc. Am. 85(4): 1676–1680
Houtgast T, Steeneken H J M 1973 The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28: 66–73
Jain P 2003 Temporal patterns of frequency localized features in ASR. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland
Kajarekar S, Malayath N, Hermansky H 2000 ANOVA in modulation spectral domain. ICASSP’00, Istanbul
Kanedera N, Arai T, Hermansky H, Pavel M 1999 On the relative importance of various components of the modulation spectrum of speech. Speech Commun. 28(1): 43–55
Kanedera N, Hermansky H, Arai T 1998 Desired characteristics of modulation spectrum for robust automatic speech recognition. ICASSP’98, Seattle, WA, 2: 613–616
Kim J, Choi S, Park S 2002 Performance analysis of automatic lip reading based on inter-frame filtering. Proc. 2002 Multimodal Speech Recognition Workshop, Greensboro, NC
Kingsbury B E D, Morgan N 1997 The modulation spectrogram: In pursuit of an invariant representation of speech. Proc. ICASSP’97, Munich, 1259–1262
Kollmeier B, Wesselkamp M, Hansen M, Dau T 1999 Modeling speech intelligibility and quality on the basis of the “effective” signal processing in the auditory system (A). J. Acoust. Soc. Am. 105(2): 1305–1305
Kowalski N, Depireux D A, Shamma S A 1996 Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J. Neurophysiol. 76(5): 3503–3523
Kozhevnikov V A, Chistovich L A 1967 Speech: Articulation and perception. Trans. U.S. Department of Commerce, Clearing House for Federal Scientific and Technical Information (Washington, D.C.: Joint Publications Research Service), 250–251
Ladefoged P 1967 Three areas of experimental phonetics (London: Oxford University Press)
Makhoul J 1975 Spectral linear prediction: properties and applications. IEEE Trans. Acoust. Speech Signal Process. 23(3): 283–296
Marr D 1982 Vision: A computational investigation into the human representation and processing of visual information (San Francisco: W.H. Freeman and Company)
Mermelstein P 1976 Distance measures for speech recognition, psychological and instrumental, in R C H Chen (ed) Pattern recognition and artificial intelligence, New York: Academic Press, 374–388
Mesgarani N, Thomas S, Hermansky H 2011 Toward optimizing stream fusion in multistream recognition of speech. J. Acoust. Soc. Am. 130(1): EL14–EL18
Mlouka M, Lienard J S 1975 Word recognition based on either stationary items or on transitions. Speech Commun. 3: 257–263, G Fant (ed.) (Stockholm: Almqvist & Wiksell Int.)
Park J, Diehl F, Gales M J F, Tomalin M, Woodland P C 2009 Training and adapting MLP features for Arabic speech recognition. Proc. ICASSP’09, Taipei
Peterson G E, Barney H L 1952 Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24: 175–184
Plahl C, Hoffmeister B, Heigold G, Loeoef J, Schlueter R, Ney H 2009 Development of the GALE 2008 Mandarin LVCSR System. Proc. Interspeech 2009, Brighton, UK, 2107–2111
Potter R K, Kopp G A, Green H C 1947 Visible speech (New York: D Van Nostrand)
Riesz R 1928 Differential intensity sensitivity of the ear for pure tones. Phys. Rev. 31(5): 867–875
Schroeder M R 1998 Personal communications, Il Ciocco NATO Advanced Study Institute
Schwarz P 2008 Phoneme recognition based on long temporal context. Ph.D. Thesis, FIT, Brno University of Technology, Czech Republic
Sharma S 1999 Multi-stream approach to robust speech recognition. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland
Thomas S, Ganapathy S, Hermansky H 2008a Hilbert envelope based spectro-temporal features for phoneme recognition in telephone speech. Proc. Interspeech 2008, Brisbane
Thomas S, Ganapathy S, Hermansky H 2008b Recognition of reverberant speech using frequency domain linear prediction. IEEE Signal Process. Lett. 15: 681–684
Thomas S, Ganapathy S, Hermansky H 2009 Tandem representations of spectral envelope and modulation frequency features for ASR. Proc. Interspeech 2009, Brighton, UK
Thomas S, Patil K, Ganapathy S, Mesgarani N, Hermansky H 2010 A phoneme recognition framework based on auditory spectro-temporal receptive fields. Proc. Interspeech 2010, Tokyo, 2458–2461
Tibrewala S, Hermansky H 1997 Multi-stream approach in acoustic modeling. LVCSR-Hub5 Workshop, Baltimore
Valente F, Hermansky H 2006 Discriminant linear processing of time-frequency plane. Proc. Interspeech 2006, Pittsburgh
van Vuuren S, Hermansky H 1997 Data-driven design of RASTA-like filters. Eurospeech’97, ESCA, Rhodes, Greece
van Vuuren S, Hermansky H 1998 On the importance of components of the modulation spectrum for speaker verification. ICSLP’98, Sydney
von Helmholtz H 1863 Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik (On the sensations of tone as a physiological basis for the theory of music) Trans. Ellis. Kaufmann, London: Longmans, Green, and Co., 1875
Cite this article
HERMANSKY, H. Speech recognition from spectral dynamics. Sadhana 36, 729–744 (2011). https://doi.org/10.1007/s12046-011-0044-2
DOI: https://doi.org/10.1007/s12046-011-0044-2
Keywords
- carrier nature of speech
- modulation spectrum
- spectral dynamics of speech
- coding of linguistic information in speech
- machine recognition of speech
- data-guided signal processing techniques