Volume 36, Issue 5, pp 729–744

Speech recognition from spectral dynamics

  • HYNEK HERMANSKY


Information is carried in changes of a signal. The paper starts by revisiting Dudley’s concept of the carrier nature of speech. It points to the close connection of this concept to modulation spectra of speech and argues against short-term spectral envelopes as dominant carriers of the linguistic information in speech. The history of spectral representations of speech is briefly discussed. Some of the history of the gradual infusion of the modulation spectrum concept into automatic recognition of speech (ASR) comes next, pointing to the relationship of modulation spectrum processing to well-accepted ASR techniques such as dynamic speech features or RelAtive SpecTrAl (RASTA) filtering. Next, the frequency domain perceptual linear prediction technique for deriving autoregressive models of temporal trajectories of spectral power in individual frequency bands is reviewed. Finally, posterior-based features, which allow for straightforward application of modulation frequency domain information, are described. The paper is tutorial in nature, aims at a global historical overview of attempts to use spectral dynamics in machine recognition of speech, and does not always provide enough detail of the described techniques. However, extensive references to earlier work are provided to compensate for the lack of detail in the paper.
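The relationship between dynamic speech features and modulation spectrum processing can be illustrated with a minimal sketch. It assumes the common regression formula for delta features; the abstract does not specify one, and the function name `delta_features`, the window half-width `N`, and the edge handling are hypothetical choices made for illustration only. The point is that a delta feature is a fixed FIR filter applied along the temporal trajectory of one feature, so it removes the constant (0 Hz modulation) component and responds to change.

```python
def delta_features(traj, N=2):
    """Regression-style delta of one temporal trajectory (a list of floats).

    Assumed formula (not taken from the paper):
        d[t] = sum_{n=1..N} n * (c[t+n] - c[t-n]) / (2 * sum_{n=1..N} n^2)
    Frames beyond either end are replaced by the nearest edge frame.
    """
    denom = 2.0 * sum(n * n for n in range(1, N + 1))
    T = len(traj)
    out = []
    for t in range(T):
        acc = 0.0
        for n in range(1, N + 1):
            ahead = traj[min(t + n, T - 1)]   # repeat last frame at the end
            behind = traj[max(t - n, 0)]      # repeat first frame at the start
            acc += n * (ahead - behind)
        out.append(acc / denom)
    return out


# A constant trajectory carries no change, so its delta is zero everywhere.
constant = [3.0] * 8
print(delta_features(constant))

# A linear ramp has unit slope away from the edges.
ramp = [float(t) for t in range(10)]
print(delta_features(ramp)[2:-2])
```

Because the filter taps are antisymmetric and sum to zero, a stationary feature value is suppressed regardless of its level, which is one way to read the claim that dynamic features already operate on the modulation spectrum.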


Carrier nature of speech; modulation spectrum; spectral dynamics of speech; coding of linguistic information in speech; machine recognition of speech; data-guided signal processing techniques




  1. Arai T, Pavel M, Hermansky H, Avendano C 1999 Syllable intelligibility for temporally filtered LPC cepstral trajectories. J. Acoust. Soc. Am. 105(5): 2783–2791
  2. Athineos M, Ellis D P W 2007 Autoregressive modelling of temporal envelopes. IEEE Trans. Signal Process. 55(11): 5237–5245
  3. Athineos M, Hermansky H, Ellis D P W 2004 LP-TRAPS: Linear predictive temporal patterns. Proc. Interspeech 2004, Jeju Island, Korea
  4. Avendano C 1997 Temporal processing of speech in a time-feature space. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland
  5. Avendano C, Hermansky H 1997 On the properties of temporal processing for speech in adverse environments. Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, N.Y.
  6. Bourlard H, Wellekens C J 1989 Links between Markov models and multilayer perceptrons, in D S Touretzky (ed), Advances in neural information processing systems I, Morgan Kaufmann, Los Altos, CA, 502–510
  7. Cernocky J 2003 Temporal processing for feature extraction in speech recognition. Habilitation Thesis, FIT, Brno University of Technology, Czech Republic
  8. Chen B Y 2005 Learning discriminant narrow-band temporal patterns for automatic recognition of conversational telephone speech. Ph.D. Thesis, University of California at Berkeley
  9. Chen B, Zhu Q, Morgan N 2004 Learning long-term temporal features in LVCSR using neural networks. Proc. Interspeech 2004, Jeju Island, Korea
  10. Cohen J 1990 Personal communications at the International Computer Science Institute, Berkeley, California
  11. Dau T, Kollmeier B, Kohlrausch A 1997 Modeling auditory processing of amplitude modulation. I. Detection and masking with narrow-band carriers. J. Acoust. Soc. Am. 102(5): 2892–2905
  12. Dau T, Pueschel D, Kohlrausch A 1996 A quantitative model of the “effective” signal processing in the auditory system. I. Model structure. J. Acoust. Soc. Am. 99(6): 3615–3622
  13. de Veth J, Boves L 1997 Phase-corrected RASTA for automatic speech recognition over the phone. ICASSP’97, Munich
  14. Drullman R, Festen J M, Plomp R 1994 Effect of reducing slow temporal modulations on speech reception. J. Acoust. Soc. Am. 95(5): 2670–2680
  15. Dudley H 1939 Remaking speech. J. Acoust. Soc. Am. 11(2): 169–177
  16. Dudley H 1940 The carrier nature of speech. Bell System Tech. J. 19: 495–513
  17. Elhilali M, Chi T, Shamma S A 2003 A spectro-temporal modulation index (STMI) assessment of speech intelligibility. Speech Commun. 41(2–3): 331–348
  18. Fousek P, Lamel L, Gauvain J 2008 Transcribing broadcast data using MLP features. Proc. Interspeech 2008, Brisbane
  19. Furui S 1981 Cepstral analysis technique for automatic speaker verification. IEEE Trans. Acoust. Speech Signal Process. 29(2): 254–272
  20. Ganapathy S, Thomas S, Hermansky H 2009 Modulation frequency features for phoneme recognition in noisy speech. J. Acoust. Soc. Am. 125(1): EL8–EL12
  21. Gold B 1998 Personal communications, Berkeley, California
  22. Greenberg S 1999 Speaking in shorthand – A syllable-centric perspective for understanding pronunciation variation. Speech Commun. 29(2–4): 159–176
  23. Grézl F 2007 TRAP-based probabilistic features for automatic speech recognition. Ph.D. Thesis, FIT, Brno University of Technology, Czech Republic
  24. Grézl F, Karafiat M, Kontar S, Cernocky J 2007 Probabilistic and bottle-neck features for LVCSR of meetings. Proc. ICASSP’07, Honolulu
  25. Hermansky H 1990 Perceptual linear predictive (PLP) analysis of speech. J. Acoust. Soc. Am. 87(4): 1738–1752
  26. Hermansky H 1994 Speech beyond 10 ms (temporal filtering in feature domain). International Workshop on Human Interface Technology 1994, Aizu, Japan
  27. Hermansky H 1997 The modulation spectrum in automatic recognition of speech. Proc. 1997 IEEE Workshop on Automatic Speech Recognition and Understanding, Santa Barbara, CA
  28. Hermansky H 1998a Modulation spectrum in speech processing, in Procházka A, Uhlíř J, Rayner P J W, Kingsbury N G (eds) Signal analysis and prediction. Boston: Birkhauser
  29. Hermansky H 1998b Data-driven analysis of speech. Invited Paper, Proceedings of the International Conference on Text, Speech and Dialogue, Brno, Czech Republic
  30. Hermansky H 1998c Should recognizers have ears? Speech Commun. 25(1–3): 3–27
  31. Hermansky H, Ellis D P W, Sharma S 2000 Connectionist feature extraction for conventional HMM systems. ICASSP’00, Istanbul
  32. Hermansky H, Fousek P 2005 Multi-resolution RASTA filtering for TANDEM-based ASR. Proc. Interspeech 2005, Lisbon, 361–364
  33. Hermansky H, Greenberg S, Pavel M 1995 A brief (100–200 ms) history of time in feature extraction of speech. The XV Annual Speech Research Symposium, Baltimore, MD
  34. Hermansky H, Morgan N 1994 RASTA processing of speech. IEEE Trans. Speech Audio Process. 2(4): 578–589
  35. Hermansky H, Morgan N, Bayya A, Kohn P 1991 Compensation for the effect of the communication channel in auditory-like analysis of speech (RASTA-PLP), in EUROSPEECH-1991, 1367–1370
  36. Hermansky H, Sharma S 1998 TRAPS – Classifiers of temporal patterns. ICSLP’98, Sydney
  37. Houtgast T 1989 Frequency selectivity in amplitude-modulation detection. J. Acoust. Soc. Am. 85(4): 1676–1680
  38. Houtgast T, Steeneken H J M 1973 The modulation transfer function in room acoustics as a predictor of speech intelligibility. Acustica 28: 66–73
  39. Jain P 2003 Temporal patterns of frequency localized features in ASR. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland
  40. Kajarekar S, Malayath N, Hermansky H 2000 ANOVA in modulation spectral domain. ICASSP’00, Istanbul
  41. Kanedera N, Arai T, Hermansky H, Pavel M 1999 On the relative importance of various components of the modulation spectrum of speech. Speech Commun. 28(1): 43–55
  42. Kanedera N, Hermansky H, Arai T 1998 Desired characteristics of modulation spectrum for robust automatic speech recognition. ICASSP’98, Seattle, WA, 2: 613–616
  43. Kim J, Choi S, Park S 2002 Performance analysis of automatic lip reading based on inter-frame filtering. Proc. 2002 Multimodal Speech Recognition Workshop, Greensboro, NC
  44. Kingsbury B E D, Morgan N 1997 The modulation spectrogram: In pursuit of an invariant representation of speech. Proc. ICASSP’97, Munich, 1259–1262
  45. Kollmeier B, Wesselkamp M, Hansen M, Dau T 1999 Modeling speech intelligibility and quality on the basis of the “effective” signal processing in the auditory system (A). J. Acoust. Soc. Am. 105(2): 1305–1305
  46. Kowalski N, Depireux D A, Shamma S A 1996 Analysis of dynamic spectra in ferret primary auditory cortex. I. Characteristics of single-unit responses to moving ripple spectra. J. Neurophysiol. 76(5): 3503–3523
  47. Kozhevnikov V A, Chistovich L A 1967 Speech: Articulation and perception. Trans. U.S. Department of Commerce, Clearing House for Federal Scientific and Technical Information (Washington, D.C.: Joint Publications Research Service), 250–251
  48. Ladefoged P 1967 Three areas of experimental phonetics (London: Oxford University Press)
  49. Makhoul J 1975 Spectral linear prediction: properties and applications. IEEE Trans. Acoust. Speech Signal Process. 23(3): 283–296
  50. Marr D 1982 Vision: A computational investigation into the human representation and processing of visual information (San Francisco: W.H. Freeman and Company)
  51. Mermelstein P 1976 Distance measures for speech recognition, psychological and instrumental, in R C H Chen (ed) Pattern recognition and artificial intelligence, New York: Academic Press, 374–388
  52. Mesgarani N, Thomas S, Hermansky H 2011 Toward optimizing stream fusion in multistream recognition of speech. J. Acoust. Soc. Am. 130(1): EL14–EL18
  53. Mlouka M, Lienard J S 1975 Word recognition based on either stationary items or on transitions. Speech Commun. 3: 257–263, G Fant (ed.) (Stockholm: Almqvist & Wiksell Int.)
  54. Park J, Diehl F, Gales M J F, Tomalin M, Woodland P C 2009 Training and adapting MLP features for Arabic speech recognition. Proc. ICASSP’09, Taipei
  55. Peterson G E, Barney H L 1952 Control methods used in a study of the vowels. J. Acoust. Soc. Am. 24: 175–184
  56. Plahl C, Hoffmeister B, Heigold G, Loeoef J, Schlueter R, Ney H 2009 Development of the GALE 2008 Mandarin LVCSR System. Proc. Interspeech 2009, Brighton, UK, 2107–2111
  57. Potter R K, Kopp G A, Green H C 1947 Visible speech (New York: D Van Nostrand)
  58. Riesz R 1928 Differential intensity sensitivity of the ear for pure tones. Phys. Rev. 31(5): 867–875
  59. Schroeder M R 1998 Personal communications, Il Ciocco NATO Advanced Study Institute
  60. Schwarz P 2008 Phoneme recognition based on long temporal context. Ph.D. Thesis, FIT, Brno University of Technology, Czech Republic
  61. Sharma S 1999 Multi-stream approach to robust speech recognition. Ph.D. Thesis, Oregon Graduate Institute of Science and Technology, Portland
  62. Thomas S, Ganapathy S, Hermansky H 2008a Hilbert envelope based spectro-temporal features for phoneme recognition in telephone speech. Proc. Interspeech 2008, Brisbane
  63. Thomas S, Ganapathy S, Hermansky H 2008b Recognition of reverberant speech using frequency domain linear prediction. IEEE Signal Process. Lett. 15: 681–684
  64. Thomas S, Ganapathy S, Hermansky H 2009 Tandem representations of spectral envelope and modulation frequency features for ASR. Proc. Interspeech 2009, Brighton, UK
  65. Thomas S, Patil K, Ganapathy S, Mesgarani N, Hermansky H 2010 A phoneme recognition framework based on auditory spectro-temporal receptive fields. Proc. Interspeech 2010, Tokyo, 2458–2461
  66. Tibrewala S, Hermansky H 1997 Multi-stream approach in acoustic modeling. LVCSR-Hub5 Workshop, Baltimore
  67. Valente F, Hermansky H 2006 Discriminant linear processing of time-frequency plane. ICSLP’06, Pittsburgh
  68. van Vuuren S, Hermansky H 1997 Data-driven design of RASTA-like filters. Eurospeech’97, ESCA, Rhodes, Greece
  69. van Vuuren S, Hermansky H 1998 On the importance of components of the modulation spectrum for speaker verification. ICSLP’98, Sydney
  70. von Helmholtz H 1863 Die Lehre von den Tonempfindungen als physiologische Grundlage für die Theorie der Musik (On the sensations of tone as a physiological basis for the theory of music) Trans. Ellis. Kaufmann, London: Longmans, Green, and Co., 1875

Copyright information

© Indian Academy of Sciences 2011

Authors and Affiliations

  1. The Johns Hopkins University, Baltimore, USA
