International Journal of Speech Technology, Volume 16, Issue 1, pp 103–113

Speaker recognition utilizing distributed DCT-II based Mel frequency cepstral coefficients and fuzzy vector quantization

Abstract

In this paper, a novel Automatic Speaker Recognition (ASR) system is presented. The system introduces new feature extraction and vector classification steps that use distributed Discrete Cosine Transform (DCT-II) based Mel Frequency Cepstral Coefficients (MFCC) and Fuzzy Vector Quantization (FVQ). The ASR algorithm uses an MFCC-based approach to identify dynamic features for Speaker Recognition (SR). A series of experiments was performed using three feature extraction methods: (1) conventional MFCC; (2) Delta-Delta MFCC (DDMFCC); and (3) DCT-II based DDMFCC. The experiments were then expanded to include four classifiers: (1) FVQ; (2) K-means Vector Quantization (VQ); (3) Linde, Buzo and Gray VQ; and (4) Gaussian Mixture Model (GMM). Among the VQ based classifiers, the combination of DCT-II based MFCC, Delta MFCC (DMFCC) and DDMFCC with FVQ had the lowest Equal Error Rate. The results improved on previously reported non-GMM methods and approached those achieved by the computationally expensive GMM based method. Speaker verification tests highlighted the overall performance improvement of the new ASR system. Speaker source data for the experiments came from the National Institute of Standards and Technology Speaker Recognition Evaluation corpora.
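
The abstract contrasts K-means VQ with FVQ without spelling out how the two differ. As a minimal sketch, assuming the standard fuzzy c-means membership rule (the paper's exact FVQ formulation is not given on this page), the example below compares a hard nearest-codeword assignment with fuzzy memberships for a single feature vector. The function names, the fuzzifier value `m = 2.0`, and the toy two-dimensional codebook are illustrative assumptions, not values from the article.

```python
def squared_distance(x, c):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((xi - ci) ** 2 for xi, ci in zip(x, c))


def hard_assignment(x, codebook):
    """K-means style VQ: index of the single nearest codeword."""
    distances = [squared_distance(x, c) for c in codebook]
    return distances.index(min(distances))


def fuzzy_memberships(x, codebook, m=2.0):
    """Fuzzy c-means memberships of x in every codeword (they sum to 1).

    m > 1 is the fuzzifier. m = 2.0 is a common default and is an
    assumption here, not a value taken from the article.
    """
    distances = [squared_distance(x, c) for c in codebook]
    # A vector that coincides with a codeword belongs to it entirely.
    if 0.0 in distances:
        zero_count = distances.count(0.0)
        return [1.0 / zero_count if d == 0.0 else 0.0 for d in distances]
    # With squared distances D_i, u_i is proportional to (1 / D_i)^(1/(m-1)).
    exponent = 1.0 / (m - 1.0)
    inverse = [(1.0 / d) ** exponent for d in distances]
    total = sum(inverse)
    return [w / total for w in inverse]


# Toy example (hypothetical data): two codewords, one feature vector.
codebook = [[0.0, 0.0], [2.0, 0.0]]
x = [0.5, 0.0]
print(hard_assignment(x, codebook))    # nearest codeword index: 0
print(fuzzy_memberships(x, codebook))  # approximately [0.9, 0.1]
```

In the hard case the vector contributes only to codeword 0, whereas the fuzzy memberships let it contribute to every codeword in proportion to its closeness, which is the general property that distinguishes fuzzy from crisp quantization.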

Keywords

Speaker recognition; Discrete cosine transform; Fuzzy vector quantization; K-Means, Linde–Buzo–Gray; Mel frequency cepstral coefficients; Speech feature extraction

Copyright information

© Springer Science+Business Media, LLC 2012

Authors and Affiliations

  1. RMIT University, Melbourne, Australia
