Journal of Computer Science and Technology

Volume 16, Issue 6, pp 582–589

Comparison of different implementations of MFCC

Abstract

The performance of the Mel-Frequency Cepstrum Coefficients (MFCC) may be affected by (1) the number of filters, (2) the shape of the filters, (3) the way in which the filters are spaced, and (4) the way in which the power spectrum is warped. In this paper, several comparison experiments are carried out to find the best implementation. The traditional MFCC calculation excludes the 0th coefficient because it is regarded as somewhat unreliable. Based on analysis and experiments, the authors find that it can be regarded as the generalized frequency band energy (FBE) and is hence useful, which results in the FBE-MFCC. The authors also propose a better analysis of the frame energy, namely the auto-regressive analysis, which outperforms its 1st and/or 2nd order differential derivatives. Experiments with the “863” Speech Database show that, compared with the traditional MFCC with its corresponding auto-regressive analysis coefficients, the FBE-MFCC and the frame energy with their corresponding auto-regressive analysis coefficients form the best combination, reducing the Chinese syllable error rate (CSER) by about 10%, while the FBE-MFCC with its corresponding auto-regressive analysis coefficients reduces the CSER by 2.5%. Comparison experiments are also done with a database of casual Chinese speech, the Chinese Annotated Spontaneous Speech (CASS) corpus, on which the FBE-MFCC reduces the error rate by about 2.9% on average.
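The abstract does not give the filter design or the exact FBE-MFCC definition, so the following is only a minimal Python sketch of a generic MFCC computation that keeps the 0th coefficient. It assumes triangular filters equally spaced on the common mel formula 2595·log10(1 + f/700), a Hamming window, and an unnormalized DCT-II; the function names and the defaults (24 filters, 13 coefficients) are illustrative choices, not values taken from the paper. Under these assumptions the 0th coefficient is simply the sum of the log filter-bank energies, which suggests why it can be read as an energy-like term.

```python
import numpy as np

def hz_to_mel(f):
    # Common mel-scale formula (an assumption; the paper compares several warpings).
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters whose edges are equally spaced on the mel scale."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)  # rfft bin centres
    bank = np.zeros((n_filters, freqs.size))
    for i in range(n_filters):
        lo, mid, hi = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank

def mfcc_frame(frame, sample_rate, n_filters=24, n_ceps=13):
    """Cepstral coefficients c0..c(n_ceps-1) for one frame; c0 is kept, not dropped."""
    n_fft = frame.size
    power = np.abs(np.fft.rfft(frame * np.hamming(n_fft))) ** 2
    energies = mel_filterbank(n_filters, n_fft, sample_rate) @ power
    log_energies = np.log(energies + 1e-10)  # small floor avoids log(0)
    # Unnormalized DCT-II; row k = 0 is all ones, so c0 = sum of log energies.
    k = np.arange(n_ceps)[:, None]
    j = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi * k * (j + 0.5) / n_filters)
    return basis @ log_energies

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(512)  # synthetic noise frame, not real speech
    print(mfcc_frame(frame, 16000).shape)
```

A traditional front end would discard the first element of the returned vector; keeping it is the only difference this sketch illustrates.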

Keywords

MFCC; frequency band energy; auto-regressive analysis; generalized initial/final

Copyright information

© Science Press, Beijing China and Allerton Press Inc. 2001

Authors and Affiliations

  • Zheng Fang (1)
  • Zhang Guoliang (1)
  • Song Zhanjiang (1)
  1. Center of Speech Technology, State Key Laboratory of Intelligent Technology and Systems, Department of Computer Science and Technology, Tsinghua University, Beijing, P.R. China