User Verification by Combining Speech and Face Biometrics in Video

  • Imran Naseem
  • Ajmal Mian
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5359)


In this paper, physiological biometrics from the face are combined with behavioral biometrics from speech in video to achieve robust user authentication. The choice of biometrics is motivated by user convenience and by robustness to forgery, as it is hard to forge these two biometrics simultaneously. We used Mel Frequency Cepstral Coefficients for text-independent speaker recognition and local scale-invariant features for video-based face recognition. The results of the two classifiers were fused using a weighted sum rule, and an equal error rate of 0.6% was achieved on the VidTIMIT audio-visual database. We also performed identification experiments and achieved a combined identification rate of 99.13% on the same database.
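The weighted sum rule mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state how scores were normalized or what weights were used, so the min-max normalization, the `w_speech` parameter, and all function names below are assumptions made for illustration only, not the authors' implementation.

```python
def min_max_normalize(scores):
    """Map raw similarity scores to [0, 1].

    Assumed normalization step; the abstract does not specify one.
    """
    lo, hi = min(scores), max(scores)
    if hi == lo:
        # All scores identical: no candidate can be preferred.
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def weighted_sum_fusion(speech_scores, face_scores, w_speech=0.5):
    """Fuse per-candidate scores from two classifiers with a weighted sum.

    w_speech is a hypothetical weight in [0, 1]; the face classifier
    receives the remaining weight (1 - w_speech).
    """
    if len(speech_scores) != len(face_scores):
        raise ValueError("both classifiers must score the same candidates")
    if not 0.0 <= w_speech <= 1.0:
        raise ValueError("w_speech must lie in [0, 1]")
    s_norm = min_max_normalize(speech_scores)
    f_norm = min_max_normalize(face_scores)
    return [w_speech * s + (1.0 - w_speech) * f
            for s, f in zip(s_norm, f_norm)]


def identify(fused_scores):
    """Return the index of the candidate with the highest fused score."""
    return max(range(len(fused_scores)), key=fused_scores.__getitem__)


def verify(fused_score, threshold):
    """Accept the identity claim when the fused score reaches the threshold."""
    return fused_score >= threshold
```

For verification, the threshold would typically be tuned on held-out data; the equal error rate is the operating point at which the false acceptance and false rejection rates coincide.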


Hidden Markov Model, Face Recognition, Video Sequence, Recognition Rate, Scale Invariant Feature Transform
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.





Copyright information

© Springer-Verlag Berlin Heidelberg 2008

Authors and Affiliations

  • Imran Naseem (1)
  • Ajmal Mian (2)
  1. School of Electrical, Electronic and Computer Engineering, Australia
  2. School of Computer Science and Software Engineering, The University of Western Australia, Australia
