Comparison of different implementations of MFCC

Abstract

The performance of the Mel-Frequency Cepstrum Coefficients (MFCC) may be affected by (1) the number of filters, (2) the shape of the filters, (3) the way in which the filters are spaced, and (4) the way in which the power spectrum is warped. In this paper, several comparison experiments are carried out to find the best implementation. The traditional MFCC calculation excludes the 0th coefficient because it is regarded as somewhat unreliable. Based on analysis and experiments, the authors find that the 0th coefficient can be regarded as the generalized frequency band energy (FBE) and is hence useful, which results in the FBE-MFCC. The authors also propose a better analysis of the frame energy, namely the auto-regressive analysis, which outperforms its first- and/or second-order differential derivatives. Experiments with the “863” Speech Database show that, compared with the traditional MFCC with its corresponding auto-regressive analysis coefficients, the FBE-MFCC and the frame energy with their corresponding auto-regressive analysis coefficients form the best combination, reducing the Chinese syllable error rate (CSER) by about 10%, while the FBE-MFCC with its corresponding auto-regressive analysis coefficients reduces the CSER by 2.5%. Comparison experiments are also carried out on a spontaneous, casual-style Chinese speech database, the Chinese Annotated Spontaneous Speech (CASS) corpus, where the FBE-MFCC reduces the error rate by about 2.9% on average.
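The role of the 0th coefficient can be illustrated with a minimal sketch. It is not the authors' implementation: the triangular filter shape, the common 2595·log10(1 + f/700) mel formula, the Hamming window, the unnormalized DCT-II, and all function names and parameter values are assumptions chosen for illustration. Under these assumptions the 0th cepstral coefficient equals the sum of the log filter-bank energies, which is one way to read the "generalized frequency band energy" interpretation above.

```python
import numpy as np

def hz_to_mel(f_hz):
    # Common mel-scale formula (an assumption; the paper compares several warpings).
    return 2595.0 * np.log10(1.0 + np.asarray(f_hz, dtype=float) / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (np.asarray(mel, dtype=float) / 2595.0) - 1.0)

def triangular_mel_filterbank(n_filters, n_fft, sample_rate):
    # Triangular filters with edges equally spaced on the mel axis (assumed shape and spacing).
    n_bins = n_fft // 2 + 1
    bin_freqs = np.linspace(0.0, sample_rate / 2.0, n_bins)
    top_mel = float(hz_to_mel(sample_rate / 2.0))
    edges_hz = mel_to_hz(np.linspace(0.0, top_mel, n_filters + 2))
    bank = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges_hz[i], edges_hz[i + 1], edges_hz[i + 2]
        rising = (bin_freqs - lo) / (mid - lo)
        falling = (hi - bin_freqs) / (hi - mid)
        bank[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return bank

def log_filterbank_energies(frame, bank):
    # Power spectrum of one Hamming-windowed frame, pooled by the filter bank.
    n_fft = 2 * (bank.shape[1] - 1)
    windowed = frame * np.hamming(len(frame))
    power = np.abs(np.fft.rfft(windowed, n_fft)) ** 2
    return np.log(bank @ power + 1e-10)  # small floor avoids log(0)

def cepstrum(log_energies, n_ceps, keep_c0):
    # Unnormalized DCT-II of the log energies; row k = 0 is all ones,
    # so c0 is simply the sum of the log filter-bank energies.
    n_filters = len(log_energies)
    k = np.arange(n_ceps + 1)[:, None]
    m = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (m + 0.5) / n_filters)
    ceps = basis @ log_energies
    return ceps if keep_c0 else ceps[1:]

# Illustrative usage on a synthetic frame (all values hypothetical).
rng = np.random.default_rng(0)
frame = rng.standard_normal(400)                    # 25 ms at 16 kHz
bank = triangular_mel_filterbank(24, 512, 16000)
log_e = log_filterbank_energies(frame, bank)
with_c0 = cepstrum(log_e, 12, keep_c0=True)         # 13 values: c0..c12
without_c0 = cepstrum(log_e, 12, keep_c0=False)     # 12 values: c1..c12
```

Keeping or dropping c0 is a single slicing decision here, so the two feature sets can be compared without changing any other step of the pipeline.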

Author information

Corresponding author

Correspondence to Zheng Fang.

About this article

Cite this article

Zheng, F., Zhang, G. & Song, Z. Comparison of different implementations of MFCC. J. Comput. Sci. & Technol. 16, 582–589 (2001). https://doi.org/10.1007/BF02943243

Keywords

  • MFCC
  • frequency band energy
  • auto-regressive analysis
  • generalized initial/final