Audio-Visual Feature Fusion for Speaker Identification

  • Noor Almaadeed
  • Amar Aggoun
  • Abbes Amira
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7663)


Conventional speaker identification systems analyse facial and audio features separately. Herein, we propose a robust algorithm for text-independent speaker identification based on decision-level and feature-level fusion of facial and audio features. The approach uses Mel-frequency cepstral coefficients (MFCCs) for audio signal processing, the Viola-Jones Haar cascade algorithm for face detection from video, and eigenface features (EFF) and Gaussian mixture models (GMMs) for feature-level and decision-level fusion of audio and video. Decision-level fusion combines PCA for face and GMM for audio through AND voting. Feature-level fusion combines MFCC (audio) and PCA (face) features to construct a hybrid GMM for each speaker. Testing on GRID, a multi-speaker audio-visual database, shows that decision-level fusion of PCA (face) and GMM (audio) achieves 98.2% accuracy and is almost 15% more efficient than feature-level fusion.
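The two fusion strategies named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `and_vote` and `fuse_features`, the toy scores, the rule that AND voting rejects a sample when the two modalities disagree, and the choice to append one face vector to every audio frame are all assumptions made for illustration.

```python
import numpy as np

def and_vote(face_distances, audio_log_likelihoods):
    """Decision-level fusion (assumed rule): accept a speaker only if the
    face classifier and the audio classifier pick the same index."""
    face_id = int(np.argmin(face_distances))          # closest match in eigenface space
    audio_id = int(np.argmax(audio_log_likelihoods))  # most likely speaker GMM
    return face_id if face_id == audio_id else None   # None means "rejected"

def fuse_features(mfcc_frames, face_projection):
    """Feature-level fusion (assumed pairing): append one PCA face vector
    to every MFCC frame so a single hybrid model can be trained."""
    tiled = np.tile(face_projection, (mfcc_frames.shape[0], 1))
    return np.hstack([mfcc_frames, tiled])

# Toy scores for three enrolled speakers (made-up numbers).
face_dist = np.array([4.1, 0.7, 3.3])
audio_ll = np.array([-52.0, -41.5, -60.2])
agreed = and_vote(face_dist, audio_ll)

# Same face scores, but the audio model now prefers speaker 0.
disagree = and_vote(face_dist, np.array([-40.0, -41.5, -60.2]))

# Five frames of 13 MFCCs fused with a 10-dimensional face projection.
hybrid = fuse_features(np.zeros((5, 13)), np.ones(10))
```

In this sketch the decision-level path keeps the two classifiers independent and only compares their outputs, whereas the feature-level path produces one longer vector per frame that a single per-speaker GMM would then model.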


Keywords: Principal component analysis, Gaussian mixture models, speaker identification, audio-visual fusion, Mel-frequency cepstral coefficients





Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Noor Almaadeed (1)
  • Amar Aggoun (1)
  • Abbes Amira (2, 3)
  1. Department of Computer Engineering, Brunel University, London, UK
  2. NIBEC, University of Ulster, Jordanstown, UK
  3. College of Engineering, Qatar University, Qatar
