Audio-Visual Feature Fusion for Speaker Identification
Conventional speaker identification systems analyze facial and audio features separately. We propose a robust algorithm for text-independent speaker identification based on decision-level and feature-level fusion of facial and audio features. The approach uses Mel-frequency cepstral coefficients (MFCCs) for audio signal processing, the Viola-Jones Haar cascade algorithm for face detection in video, and eigenface features (EFF) together with Gaussian mixture models (GMMs) for feature-level and decision-level fusion of audio and video. Decision-level fusion combines the PCA-based face result and the GMM-based audio result through AND voting. Feature-level fusion combines MFCC (audio) and PCA (face) features to construct a hybrid GMM for each speaker. Tests on GRID, a multi-speaker audio-visual database, show that decision-level fusion of PCA (face) and GMM (audio) achieves 98.2 % accuracy and is almost 15 % more efficient than feature-level fusion.
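The two fusion strategies can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define the AND vote or the feature combination in detail, so the sketch assumes that AND voting accepts an identity only when the face and audio classifiers agree, and that feature-level fusion concatenates the two feature vectors before GMM training. The names `and_vote` and `fuse_features` are hypothetical.

```python
from typing import List, Optional, Sequence


def and_vote(face_id: Optional[str], audio_id: Optional[str]) -> Optional[str]:
    """Decision-level fusion (assumed rule): accept a speaker identity only
    when both modalities name the same speaker; otherwise reject with None."""
    if face_id is not None and face_id == audio_id:
        return face_id
    return None


def fuse_features(mfcc: Sequence[float], face_pca: Sequence[float]) -> List[float]:
    """Feature-level fusion (assumed rule): concatenate the audio (MFCC) and
    face (PCA projection) vectors into one hybrid feature vector, which
    would then be used to train a per-speaker GMM."""
    return list(mfcc) + list(face_pca)


# Both classifiers agree, so the identity is accepted.
agreed = and_vote("speaker_03", "speaker_03")
# The classifiers disagree, so the sample is rejected.
rejected = and_vote("speaker_03", "speaker_07")
# Toy vectors only; real MFCC and eigenface vectors are much longer.
hybrid = fuse_features([1.0, 2.0], [3.0])
```

Under this reading, decision-level fusion keeps the two models independent and combines only their outputs, while feature-level fusion forces a single model to handle features of different scale and dimensionality.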
Keywords: principal component analysis, Gaussian mixture models, speaker identification, audio-visual fusion, Mel-frequency cepstral coefficients