Speech recognition from spectral dynamics

Abstract

Information is carried in changes of a signal. The paper starts by revisiting Dudley’s concept of the carrier nature of speech, points to its close connection to the modulation spectrum of speech, and argues against short-term spectral envelopes as the dominant carriers of linguistic information in speech. The history of spectral representations of speech is briefly discussed. Next comes some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR), pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features and RelAtive SpecTrAl (RASTA) filtering. The frequency domain perceptual linear prediction technique, which derives autoregressive models of the temporal trajectories of spectral power in individual frequency bands, is then reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature and aims at a global historical overview of attempts to use spectral dynamics in machine recognition of speech; it does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for this lack of detail.
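
As a rough illustration of how dynamic speech features relate to filtering in the modulation domain, the sketch below computes regression-based delta features along the temporal trajectory of a single feature. It is a minimal example under stated assumptions, not the paper's method: the function name `delta_features`, the window length `half_window`, and the edge-replication rule are all choices made here for illustration. The point it shows is that the delta operation acts as a filter along time with zero gain for an unchanging input, so a constant offset (for instance, a fixed channel effect in a log-spectral domain) is removed while steady change is preserved.

```python
def delta_features(trajectory, half_window=2):
    """Regression-based delta of one feature's temporal trajectory.

    Acts as an FIR filter along time with zero gain at 0 Hz modulation
    frequency, so constant offsets in the trajectory are removed.
    `half_window` (frames on each side) is an assumed, illustrative value.
    """
    n_frames = len(trajectory)
    norm = 2 * sum(n * n for n in range(1, half_window + 1))
    deltas = []
    for t in range(n_frames):
        acc = 0.0
        for n in range(1, half_window + 1):
            # Assumption: replicate edge frames where the window runs
            # past the start or end of the trajectory.
            ahead = trajectory[min(t + n, n_frames - 1)]
            behind = trajectory[max(t - n, 0)]
            acc += n * (ahead - behind)
        deltas.append(acc / norm)
    return deltas


# A fixed offset versus a steadily changing feature (synthetic data).
constant = [3.0] * 8
ramp = [float(t) for t in range(8)]

constant_deltas = delta_features(constant)
ramp_deltas = delta_features(ramp)
```

Here `constant_deltas` is zero in every frame, while the interior frames of `ramp_deltas` recover the slope of the ramp; the frames near the edges differ only because of the assumed edge replication.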

References

  • Arai T, Pavel M, Hermansky H, Avendano C 1999 Syllable intelligibility for temporally filtered LPC cepstral trajectories. J. Acoust. Soc. Am. 105(5): 2783–2791

  • Athineos M, Ellis D P W 2007 Autoregressive modelling of temporal envelopes. IEEE Trans. Signal Process. 55(11): 5237–5245

  • Athineos M, Hermansky H, Ellis D P W 2004 LP-TRAPS: Linear predictive temporal patterns. Proc. Interspeech 2004, Jeju Island, Korea

  • Avendano C 1997 Temporal processing of speech in a time-feature space. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland

  • Avendano C, Hermansky H 1997 On the properties of temporal processing for speech in adverse environments. Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y.

  • Bourlard H, Wellekens C J 1989 Links between Markov models and multilayer perceptrons, in D S Touretzky (ed), Advances in neural information processing systems I, Morgan Kaufmann, Los Altos, CA, 502–510

  • Cernocky J 2003 Temporal processing for feature extraction in speech recognition. Habilitation Thesis, FIT, Brno University of Technology, Czech Republic

  • Chen B Y 2005 Learning discriminant narrow-band temporal patterns for automatic recognition of conversational telephone speech. Ph.D. Thesis, University of California at Berkeley

  • Chen B, Zhu Q, Morgan N 2004 Learning long-term temporal features in LVCSR using neural networks. Proc. Interspeech 2004, Jeju Island, Korea

  • Cohen J 1990 Personal communications at the International Computer Science Institute, Berkeley, California

  • Dau T, Kollmeier B, Kohlrausch A 1997 Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102(5): 2892–2905

  • Dau T, Pueschel D, Kohlrausch A 1996 A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. 99(6): 3615–3622

  • de Veth J, Boves L 1997 Phase-corrected RASTA for automatic speech recognition over the phone. ICASSP’97, Munich

  • Drullman R, Festen J M, Plomp R 1994 Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Am. 95(5): 2670–2680

  • Dudley H 1939 Remaking speech. J. Acoust. Soc. Am. 11(2): 169–177

  • Dudley H 1940 The carrier nature of speech. Bell System Tech. J. 19: 495–513

  • Elhilali M, Chi T, Shamma S A 2003 A spectro-temporal modulation index (STMI) assessment of speech intelligibility. Speech Commun. 41(2–3): 331–348

  • Fousek P, Lamel L, Gauvain J 2008 Transcribing broadcast data using MLP features. Proc. Interspeech 2008, Brisbane

  • Furui S 1981 Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29(2): 254–272

  • Ganapathy S, Thomas S, Hermansky H 2009 Modulation frequency features for phoneme recognition in noisy speech. J. Acoust. Soc. Am. 125(1): EL8–EL12

  • Gold B 1998 Personal communications, Berkeley, California

  • Greenberg S 1999 Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation. Speech Commun. 29(2–4): 159–176

  • Grézl F 2007 TRAP-based probabilistic features for automatic speech recognition. Ph.D. Thesis, FIT, Brno University of Technology, Czech Republic

  • Grézl F, Karafiat M, Kontar S, Cernocky J 2007 Probabilistic and bottle-neck features for LVCSR of meetings. Proc. ICASSP’07, Honolulu

  • Hermansky H 1990 Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4): 1738–1752

  • Hermansky H 1994 Speech beyond 10 ms (temporal filtering in feature domain). International Workshop on Human Interface Technology 1994, Aizu, Japan

  • Hermansky H 1997 The modulation spectrum in automatic recognition of speech. Proc. 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA

  • Hermansky H 1998a Modulation spectrum in speech processing, in Procházka A, Uhlíř J, Rayner P J W, Kingsbury N G (eds) Signal analysis and prediction. Boston: Birkhäuser

  • Hermansky H 1998b Data-driven analysis of speech. Invited Paper, Proceedings of the International Conference on Text, Speech and Dialogue, Brno, Czech Republic

  • Hermansky H 1998c Should recognizers have ears? Speech Commun. 25(1–3): 3–27

  • Hermansky H, Ellis D P W, Sharma S 2000 Connectionist feature extraction for conventional HMM systems. ICASSP’00, Istanbul

  • Hermansky H, Fousek P 2005 Multi-resolution RASTA filtering for TANDEM-based ASR. Proc. Interspeech 2005, Lisbon, 361–364

  • Hermansky H, Greenberg S, Pavel M 1995 A brief (100–200 ms) history of time in feature extraction of speech. The XV Annual Speech Research Symposium, Baltimore, MD

  • Hermansky H, Morgan N 1994 RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4): 578–589

  • Hermansky H, Morgan N, Bayya A, Kohn P 1991 Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP). Proc. Eurospeech 1991, 1367–1370

  • Hermansky H, Sharma S 1998 TRAPS – Classifiers of temporal patterns. ICSLP’98, Sydney

  • Houtgast T 1989 Frequency selectivity in amplitude-modulation detection. J. Acoust. Soc. Am. 85(4): 1676–1680

  • Houtgast T, Steeneken H J M 1973 The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28: 66–73

  • Jain P 2003 Temporal patterns of frequency localized features in ASR. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland

  • Kajarekar S, Malayath N, Hermansky H 2000 ANOVA in modulation spectral domain. ICASSP’00, Istanbul

  • Kanedera N, Arai T, Hermansky H, Pavel M 1999 On the relative importance of various components of the modulation spectrum of speech. Speech Commun. 28(1): 43–55

  • Kanedera N, Hermansky H, Arai T 1998 Desired characteristics of modulation spectrum for robust automatic speech recognition. ICASSP’98, Seattle, WA, 2: 613–616

  • Kim J, Choi S, Park S 2002 Performance analysis of automatic lip reading based on inter-frame filtering. Proc. 2002 Multimodal Speech Recognition Workshop, Greensboro, NC

  • Kingsbury B E D, Morgan N 1997 The modulation spectrogram: In pursuit of an invariant representation of speech. Proc. ICASSP’97, Munich, 1259–1262

  • Kollmeier B, Wesselkamp M, Hansen M, Dau T 1999 Modeling speech intelligibility and quality on the basis of the “effective” signal processing in the auditory system (A). J. Acoust. Soc. Am. 105(2): 1305–1305

  • Kowalski N, Depireux D A, Shamma S A 1996 Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J. Neurophysiol. 76(5): 3503–3523

  • Kozhevnikov V A, Chistovich L A 1967 Speech: Articulation and perception. Trans. U.S. Department of Commerce, Clearing House for Federal Scientific and Technical Information (Washington, D.C.: Joint Publications Research Service), 250–251

  • Ladefoged P 1967 Three areas of experimental phonetics (London: Oxford University Press)

  • Makhoul J 1975 Spectral linear prediction: properties and applications. IEEE Trans. Acoust. Speech Signal Process. 23(3): 283–296

  • Marr D 1982 Vision: A computational investigation into the human representation and processing of visual information (San Francisco: W.H. Freeman and Company)

  • Mermelstein P 1976 Distance measures for speech recognition, psychological and instrumental, in C H Chen (ed) Pattern recognition and artificial intelligence, New York: Academic Press, 374–388

  • Mesgarani N, Thomas S, Hermansky H 2011 Toward optimizing stream fusion in multistream recognition of speech. J. Acoust. Soc. Am. 130(1): EL14–EL18

  • Mlouka M, Lienard J S 1975 Word recognition based on either stationary items or on transitions. Speech Commun. 3: 257–263, G Fant (ed.) (Stockholm: Almqvist & Wiksell Int.)

  • Park J, Diehl F, Gales M J F, Tomalin M, Woodland P C 2009 Training and adapting MLP features for Arabic speech recognition. Proc. ICASSP’09, Taipei

  • Peterson G E, Barney H L 1952 Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24: 175–184

  • Plahl C, Hoffmeister B, Heigold G, Loeoef J, Schlueter R, Ney H 2009 Development of the GALE 2008 Mandarin LVCSR System. Proc. Interspeech 2009, Brighton, UK, 2107–2111

  • Potter R K, Kopp G A, Green H C 1947 Visible speech (New York: D Van Nostrand)

  • Riesz R 1928 Differential intensity sensitivity of the ear for pure tones. Phys. Rev. 31(5): 867–875

  • Schroeder M R 1998 Personal communications, Il Ciocco NATO Advanced Study Institute

  • Schwarz P 2008 Phoneme recognition based on long temporal context. Ph.D. Thesis, FIT, Brno University of Technology, Czech Republic

  • Sharma S 1999 Multi-stream approach to robust speech recognition. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland

  • Thomas S, Ganapathy S, Hermansky H 2008a Hilbert envelope based spectro-temporal features for phoneme recognition in telephone speech. Proc. Interspeech 2008, Brisbane

  • Thomas S, Ganapathy S, Hermansky H 2008b Recognition of reverberant speech using frequency domain linear prediction. IEEE Signal Process. Lett. 15: 681–684

  • Thomas S, Ganapathy S, Hermansky H 2009 Tandem representations of spectral envelope and modulation frequency features for ASR. Proc. Interspeech 2009, Brighton, UK

  • Thomas S, Patil K, Ganapathy S, Mesgarani N, Hermansky H 2010 A phoneme recognition framework based on auditory spectro-temporal receptive fields. Proc. Interspeech 2010, Tokyo, 2458–2461

  • Tibrewala S, Hermansky H 1997 Multi-stream approach in acoustic modeling. LVCSR-Hub5 Workshop, Baltimore

  • Valente F, Hermansky H 2006 Discriminant linear processing of time-frequency plane. Proc. Interspeech 2006, Pittsburgh

  • van Vuuren S, Hermansky H 1997 Data-driven design of RASTA-like filters. Eurospeech’97, ESCA, Rhodes, Greece

  • van Vuuren S, Hermansky H 1998 On the importance of components of the modulation spectrum for speaker verification. ICSLP’98, Sydney

  • von Helmholtz H 1863 Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik (On the sensations of tone as a physiological basis for the theory of music) Trans. Ellis. Kaufmann, London: Longmans, Green, and Co., 1875

Author information

Corresponding author

Correspondence to HYNEK HERMANSKY.

About this article

Cite this article

HERMANSKY, H. Speech recognition from spectral dynamics. Sadhana 36, 729–744 (2011). https://doi.org/10.1007/s12046-011-0044-2

  • DOI: https://doi.org/10.1007/s12046-011-0044-2

Keywords

  • Carrier nature of speech
  • modulation spectrum
  • spectral dynamics of speech
  • coding of linguistic information in speech
  • machine recognition of speech
  • data-guided signal processing techniques