Audio Visual Person Authentication by Multiple Nearest Neighbor Classifiers

  • Amitava Das
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4642)


We propose a low-complexity audio-visual person authentication framework based on multiple features and multiple nearest-neighbor classifiers. Instead of a single template per user, it uses a set of codebooks (a collection of templates). Several novel, highly discriminative speech and face-image features are introduced, along with a novel “text-conditioned” speaker recognition approach. Powered by discriminative scoring and a novel fusion method, the proposed MCCN method delivers excellent performance (0% EER). It also yields a large separation between client and impostor scores, as observed in trials on a unique multilingual 120-user audio-visual biometric database created for this research.
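The abstract describes verification in four stages:

- nearest-neighbor matching against per-user collections of VQ codebooks;
- per-modality scoring;
- score fusion across speech and face;
- evaluation by equal error rate (EER).

The sketch below illustrates those stages under stated assumptions and is not the paper's MCCN implementation. The following are all assumptions of this sketch:

- the Euclidean VQ distortion;
- the cohort-ratio normalization, which stands in for the paper's unspecified discriminative scoring;
- the weighted-sum fusion, which stands in for its novel fusion method;
- the toy data;
- every function name (`vq_distortion`, `template_distortion`, `normalized_score`, `fused_score`, `verify`, `equal_error_rate`). These are hypothetical.

```python
import math

def vq_distortion(features, codebook):
    """Mean Euclidean distance from each feature vector to its nearest codeword."""
    return sum(min(math.dist(x, c) for c in codebook) for x in features) / len(features)

def template_distortion(features, codebooks):
    """A user is modelled by a collection of codebooks; keep the best-matching one."""
    return min(vq_distortion(features, cb) for cb in codebooks)

def normalized_score(features, claimed, models):
    """Hypothetical cohort normalization (assumption, not the paper's scoring):
    distortion of the claimed user divided by the lowest distortion among all
    other enrolled users, so values below 1 favour the claim."""
    d_claim = template_distortion(features, models[claimed])
    d_cohort = min(template_distortion(features, cbs)
                   for user, cbs in models.items() if user != claimed)
    return d_claim / max(d_cohort, 1e-12)  # guard against division by zero

def fused_score(test, claimed, models_by_modality, weights):
    """Hypothetical weighted-sum fusion of per-modality normalized scores."""
    return sum(w * normalized_score(test[m], claimed, models_by_modality[m])
               for m, w in weights.items())

def verify(test, claimed, models_by_modality, weights, threshold=1.0):
    """Accept the identity claim when the fused score falls below the threshold."""
    return fused_score(test, claimed, models_by_modality, weights) < threshold

def equal_error_rate(client_scores, impostor_scores):
    """Approximate EER for lower-is-better scores by trying every observed score
    as a threshold and taking the point where FAR and FRR are closest."""
    best_gap, eer = float("inf"), None
    for t in sorted(set(client_scores) | set(impostor_scores)):
        frr = sum(s > t for s in client_scores) / len(client_scores)
        far = sum(s <= t for s in impostor_scores) / len(impostor_scores)
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# Toy enrolment (hypothetical): 2-D features, per-user codebook collections per modality.
models = {
    "speech": {"alice": [[(0, 0), (1, 0)], [(0, 1)]], "bob": [[(5, 5), (6, 5)]]},
    "face": {"alice": [[(0, 0)]], "bob": [[(3, 3)]]},
}
test = {"speech": [(0.1, 0.0), (0.9, 0.1)], "face": [(0.2, 0.1)]}
weights = {"speech": 0.5, "face": 0.5}

print(verify(test, "alice", models, weights))    # True
print(verify(test, "bob", models, weights))      # False
print(equal_error_rate([0.1, 0.2], [0.8, 0.9]))  # 0.0
```

An EER of 0, as in the last call, means that some threshold separates every client score from every impostor score.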


Keywords: speaker recognition, face recognition, audio-visual biometric authentication, fusion, multiple classifiers, feature extraction, VQ, multimodal



Copyright information

© Springer-Verlag Berlin Heidelberg 2007

Authors and Affiliations

  • Amitava Das, Microsoft Research India, 196/36 2nd Main, Sadashivnagar, Bangalore 560 080, India
