Contemporary Methods for Speech Parameterization

Short-Time Cepstrum-Based Speech Features
  • Todor Ganchev
Part of the SpringerBriefs in Electrical and Computer Engineering book series


This brief book offers a general view of short-time cepstrum-based speech parameterization and provides a common ground for further in-depth studies on the subject. Specifically, it offers a comprehensive description, comparative analysis, and empirical performance evaluation of eleven contemporary speech parameterization methods that compute short-time cepstrum-based speech features. Among these are five discrete wavelet packet transform (DWPT)-based and six discrete Fourier transform (DFT)-based speech features, along with some of their variants, which have been used in speech recognition, speaker recognition, and other related speech processing tasks. The main similarities and differences in their computation are discussed, and empirical results from performance evaluation under common experimental conditions are presented. The recognition accuracy obtained on the monophone recognition, continuous speech recognition, and speaker recognition tasks is contrasted with that obtained for the well-known and widely used Mel Frequency Cepstral Coefficients (MFCC). It is shown that many of these methods lead to speech features that offer competitive performance in certain speech processing setups when compared to the venerable MFCC. This comparison is not intended to promote particular speech features. Instead, it aims to enhance the common understanding of the advantages and disadvantages of the various speech parameterization techniques available today and to provide a basis for selecting an appropriate speech parameterization in each particular case.

In brief, this volume consists of nine sections. Section 1 summarizes the main concepts on which contemporary speech parameterization is based and offers some background information about their origins. Section 2 introduces the objectives of speech pre-processing and describes the processing steps that are commonly used in contemporary speech parameterization methods. Sections 3 and 4 offer a comprehensive description and a comparative analysis of the DFT- and DWPT-based speech parameterization methods of interest. Sections 5–7 present results from experimental evaluation on the monophone recognition, continuous speech recognition, and speaker recognition tasks, respectively. Section 8 offers concluding remarks and an outlook on possible future targets of speech parameterization research. Finally, Section 9 provides some links to other sources of information and to publicly available software that offers ready-to-use implementations of these speech features.
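To make the notion of a short-time cepstrum-based feature concrete, the following is a minimal sketch of a plain real cepstrum computed frame by frame with a DFT. It is not one of the eleven methods evaluated in the book (MFCC, for example, additionally applies a Mel-spaced filter bank and a discrete cosine transform). The function names, the Hamming window, the frame length, the hop size, the number of retained coefficients, and the toy two-tone signal are all assumptions chosen for illustration.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n (an assumed, commonly used choice)."""
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(x):
    """Naive O(N^2) discrete Fourier transform; adequate for short frames."""
    n = len(x)
    return [
        sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def real_cepstrum(frame, floor=1e-10):
    """Real cepstrum of one frame: inverse DFT of the log-magnitude spectrum."""
    window = hamming(len(frame))
    spectrum = dft([s * w for s, w in zip(frame, window)])
    # Floor the magnitude to avoid log(0) on silent or degenerate frames.
    log_mag = [math.log(max(abs(c), floor)) for c in spectrum]
    n = len(log_mag)
    return [
        sum(log_mag[k] * cmath.exp(2j * math.pi * k * q / n) for k in range(n)).real / n
        for q in range(n)
    ]


def short_time_cepstra(signal, frame_len=64, hop=32, n_coeffs=13):
    """Split the signal into overlapping frames and keep the first
    n_coeffs cepstral coefficients of each frame."""
    starts = range(0, len(signal) - frame_len + 1, hop)
    return [real_cepstrum(signal[s:s + frame_len])[:n_coeffs] for s in starts]


# Toy input (hypothetical): two sinusoids sampled at 8 kHz, 256 samples.
sample_rate = 8000
signal = [
    math.sin(2 * math.pi * 440 * t / sample_rate)
    + 0.5 * math.sin(2 * math.pi * 880 * t / sample_rate)
    for t in range(256)
]
features = short_time_cepstra(signal)
```

Each row of `features` is the feature vector of one frame. Practical implementations would normally use an FFT routine rather than the naive DFT shown here, which is kept only so that the sketch has no external dependencies.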


Speech pre-processing; Speech parameterization; Mel-scale; Critical bands; Cepstrum; Sub-band processing of speech; Time-frequency decomposition of speech; Cepstral analysis of speech; Speech features; Linear frequency cepstral coefficients; Mel frequency cepstral coefficients; Human factor cepstral coefficients; Perceptual linear prediction cepstral coefficients; Wavelet packet transform-based speech features; Wavelet packet features; Monophone recognition; Continuous speech recognition; Speaker recognition



Copyright information

© Springer Science+Business Media, LLC 2011

Authors and Affiliations

  1. Wire Communications Laboratory, Department of Electrical & Computer Engineering, University of Patras, Rion-Patras, Greece
