Robust Automatic Human Identification Using Face, Mouth, and Acoustic Information

  • Niall A. Fox
  • Ralph Gross
  • Jeffrey F. Cohn
  • Richard B. Reilly
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3723)


Discriminatory information about person identity is multimodal, yet most person recognition systems are unimodal, relying for example on facial appearance alone. To exploit the complementary nature of different modes of information and to make pattern recognition more robust to test signal degradation, we developed a multiple-expert biometric person identification system that combines information from three experts: face, visual speech, and audio. The system performs multimodal fusion in an automatic, unsupervised manner, adapting to the local performance and output reliability of each expert. The expert weightings are chosen automatically such that the reliability measure of the combined scores is maximized. To test system robustness to train/test mismatch, we used a broad range of Gaussian noise and JPEG compression levels to degrade the audio and visual signals, respectively. Experiments were carried out on the XM2VTS database. The multimodal expert system outperformed each of the single experts in all comparisons. At the severe audio and visual mismatch levels tested, the audio, mouth, face, and tri-expert fusion accuracies were 37.1%, 48%, 75%, and 92.7%, respectively, a relative improvement of 23.6% over the best-performing expert.
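The weighting idea described in the abstract can be illustrated with a minimal sketch. The abstract does not define the reliability measure, the score normalization, or the search procedure, so everything below is an assumption made for illustration: scores are min-max normalized per expert, reliability is taken to be the gap between the best and second-best fused scores, and the weights are found by a coarse grid search over combinations that sum to one. The function names (`min_max_normalize`, `top_two_gap`, `fuse_experts`) are hypothetical and not taken from the paper.

```python
from itertools import product


def min_max_normalize(scores):
    """Map one expert's scores to [0, 1] so experts are comparable (assumed scheme)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def top_two_gap(scores):
    """Assumed reliability measure: best score minus second-best score.

    Requires at least two enrolled identities.
    """
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]


def fuse_experts(expert_scores, steps=10):
    """Weighted-sum fusion with weights chosen to maximize the assumed reliability.

    expert_scores: one list per expert, each holding one score per enrolled
    identity (higher means a better match).
    Returns (index of identified person, chosen weights, reliability).
    """
    normalized = [min_max_normalize(s) for s in expert_scores]
    n_ids = len(normalized[0])
    best = None
    # Enumerate weight vectors on a grid whose entries sum to 1.
    for combo in product(range(steps + 1), repeat=len(normalized)):
        if sum(combo) != steps:
            continue
        weights = [c / steps for c in combo]
        fused = [
            sum(w * s[i] for w, s in zip(weights, normalized))
            for i in range(n_ids)
        ]
        r = top_two_gap(fused)
        if best is None or r > best[0]:
            best = (r, weights, fused)
    r, weights, fused = best
    identity = max(range(n_ids), key=fused.__getitem__)
    return identity, weights, r


if __name__ == "__main__":
    # Made-up scores for four enrolled identities; not data from the paper.
    audio = [2.0, 9.0, 8.5, 1.0]
    mouth = [0.3, 0.9, 0.2, 0.8]
    face = [10.0, 50.0, 20.0, 15.0]
    print(fuse_experts([audio, mouth, face]))
```

Because the grid includes the weight vectors that select a single expert, the fused reliability under this assumed measure can never fall below that of the most reliable individual expert on the same test sample. When the experts have different runner-up identities, a mixed weighting can widen the gap further.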


Keywords (added by machine, not by the authors): Quality Factor, Face Recognition, Discrete Cosine Transform, JPEG Compression, Visual Speech





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Niall A. Fox (1)
  • Ralph Gross (2)
  • Jeffrey F. Cohn (2)
  • Richard B. Reilly (1)
  1. Dept. of Electronic and Electrical Engineering, University College Dublin, Belfield, Dublin 4, Ireland
  2. Robotics Institute, Carnegie Mellon University, Pittsburgh
