Low-variance Multitaper Mel-frequency Cepstral Coefficient Features for Speech and Speaker Recognition Systems

Cognitive Computation

Abstract

In this paper, we investigate low-variance multitaper spectrum estimation methods for computing mel-frequency cepstral coefficient (MFCC) features for robust speech and speaker recognition systems. In speech and speaker recognition, MFCC features are usually computed from a single-tapered (e.g., Hamming-windowed) direct spectrum estimate, that is, the squared magnitude of the Fourier transform of the observed signal. Compared with the periodogram, a power spectrum estimate that uses a smooth window function, such as the Hamming window, can reduce spectral leakage. Windowing may therefore help to reduce spectral bias, but the variance of the estimate often remains high. A multitaper spectrum estimation method that uses well-selected tapers can gain from the bias-variance trade-off, giving an estimate whose bias remains small relative to a single-taper spectrum estimate but whose variance is substantially lower. Speech recognition experiments on the AURORA-2 and AURORA-4 corpora and speaker verification experiments on the NIST 2010 speaker recognition evaluation corpus (telephone as well as microphone speech) show that the multitaper methods outperform the Hamming-windowed spectrum estimation method. In the speaker verification task, the sinusoidal weighted cepstrum estimator, multi-peak, and Thomson multitaper techniques provide relative improvements in equal error rate of 20.25%, 18.73%, and 12.83%, respectively, over the Hamming window technique.
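The variance reduction described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes orthonormal sine tapers with uniform weights, a simple stand-in for the Thomson, multi-peak, and sinusoidal weighted cepstrum estimator methods, whose specific tapers and weights are not reproduced here. The function names, frame length, taper count, and white-noise test signal are all hypothetical choices made for illustration.

```python
import numpy as np

def sine_tapers(frame_len, num_tapers):
    """Return a (num_tapers, frame_len) array of orthonormal sine tapers."""
    n = np.arange(1, frame_len + 1)
    k = np.arange(1, num_tapers + 1)[:, None]
    return np.sqrt(2.0 / (frame_len + 1)) * np.sin(np.pi * k * n / (frame_len + 1))

def single_taper_spectrum(frame, window):
    """Squared-magnitude spectrum of one frame using a unit-energy window."""
    window = window / np.linalg.norm(window)
    return np.abs(np.fft.rfft(frame * window)) ** 2

def multitaper_spectrum(frame, tapers, weights=None):
    """Weighted average of the per-taper squared-magnitude spectra."""
    if weights is None:
        # Assumption: uniform weights; published methods use other weightings.
        weights = np.full(len(tapers), 1.0 / len(tapers))
    eigenspectra = np.abs(np.fft.rfft(tapers * frame, axis=1)) ** 2
    return weights @ eigenspectra

# Hypothetical settings: 256-sample frames, 6 tapers, 500 white-noise frames.
rng = np.random.default_rng(0)
frame_len, num_tapers, num_trials = 256, 6, 500
hamming = np.hamming(frame_len)
tapers = sine_tapers(frame_len, num_tapers)

frames = rng.standard_normal((num_trials, frame_len))
single = np.array([single_taper_spectrum(f, hamming) for f in frames])
multi = np.array([multitaper_spectrum(f, tapers) for f in frames])

# Compare the variance across trials, averaged over interior frequency bins
# (the bins near DC and Nyquist behave differently and are skipped).
var_single = single.var(axis=0)[8:-8].mean()
var_multi = multi.var(axis=0)[8:-8].mean()
print(f"Hamming variance:    {var_single:.3f}")
print(f"Multitaper variance: {var_multi:.3f}")
```

For unit-variance white noise the true spectrum is flat, so both estimators should average close to 1 while the multitaper estimate fluctuates noticeably less from frame to frame. In an MFCC front end, the multitaper spectrum would take the place of the Hamming-windowed spectrum before the mel filterbank, logarithm, and discrete cosine transform.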


Author information

Corresponding author

Correspondence to Md. Jahangir Alam.

About this article

Cite this article

Alam, M.J., Kenny, P. & O’Shaughnessy, D. Low-variance Multitaper Mel-frequency Cepstral Coefficient Features for Speech and Speaker Recognition Systems. Cogn Comput 5, 533–544 (2013). https://doi.org/10.1007/s12559-012-9197-5


  • DOI: https://doi.org/10.1007/s12559-012-9197-5
