Performance enhancement for audio-visual speaker identification using dynamic facial muscle model

  • Original Article
  • Published in: Medical and Biological Engineering and Computing

Abstract

The science of human identification from physiological characteristics, or biometry, has long been of great concern in security systems. However, robust multimodal identification systems based on audio-visual information have not yet been thoroughly investigated. The aim of this work is therefore to propose a model-based feature extraction method that employs the physiological characteristics of the facial muscles producing lip movements. This approach adopts intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. These parameters depend exclusively on the neuro-muscular properties of the speaker; consequently, imitation of valid speakers could be reduced to a large extent. The parameters are applied to a hidden Markov model (HMM) audio-visual identification system. In this work, audio and video features are combined through a multistream pseudo-synchronized HMM training method. Noise-robust audio features, namely Mel-frequency cepstral coefficients (MFCC), spectral subtraction (SS), and relative spectra perceptual linear prediction (J-RASTA-PLP), were used to evaluate the performance of the multimodal system when efficient audio feature extraction methods are employed. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits, along with a phonetically rich sentence. To evaluate the robustness of the algorithms, experiments were performed on genetically identical twins; furthermore, changes in speaker voice were simulated with drug inhalation tests. At a 3 dB signal-to-noise ratio (SNR), the dynamic muscle model improved the identification rate of the audio-visual system from 91 to 98%. Results on identical twins revealed a clear improvement for the dynamic muscle model-based system, whose audio-visual identification rate was enhanced from 87 to 96%.
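To make the muscle-parameter idea concrete, the following is a minimal sketch of how viscosity and elasticity parameters might be estimated from a lip-opening trajectory tracked in video, assuming a linear second-order mass-spring-damper form for the dynamic lip model. The function name, the 100 Hz sampling rate, and the free-response, mass-normalized simplification are illustrative assumptions; the abstract does not specify the authors' actual estimation procedure.

```python
import numpy as np

def estimate_muscle_params(x, fs):
    """Fit a mass-normalized second-order muscle model
        x''(t) + (b/m) x'(t) + (k/m) x(t) = c
    to a lip-opening trajectory x sampled at fs Hz.

    Without knowledge of the neural driving force, the mass, viscosity,
    and elasticity are identifiable only up to a common scale, so the
    mass is normalized to 1 and the ratios b/m (viscosity) and k/m
    (elasticity) are returned, along with a constant offset c.
    """
    dt = 1.0 / fs
    v = np.gradient(x, dt)   # velocity via central differences
    a = np.gradient(v, dt)   # acceleration
    # Linear least squares on  a = -(b/m) v - (k/m) x + c
    A = np.column_stack([-v, -x, np.ones_like(x)])
    (b_m, k_m, c), *_ = np.linalg.lstsq(A, a, rcond=None)
    return b_m, k_m, c

if __name__ == "__main__":
    # Synthetic damped lip-opening trace with true b/m = 4, k/m = 100:
    # x = exp(-2t) cos(sqrt(96) t) solves x'' + 4 x' + 100 x = 0.
    fs = 100.0
    t = np.arange(0.0, 2.0, 1.0 / fs)
    x = np.exp(-2.0 * t) * np.cos(np.sqrt(96.0) * t)
    print(estimate_muscle_params(x, fs))  # approximately (4.0, 100.0, 0.0)
```

Per-segment estimates of (b/m, k/m) obtained in this way could then serve as the speaker-dependent visual feature vector fed to the video stream of the multistream HMM described above.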



Acknowledgments

We would like to acknowledge the ITRC organisation for its support of this work, and M. S. Moein (ITRC) for help with the introduction to audio-visual biometry systems.

Author information

Corresponding author

Correspondence to Vahid Asadpour.

About this article

Cite this article

Asadpour, V., Towhidkhah, F. & Homayounpour, M.M. Performance enhancement for audio-visual speaker identification using dynamic facial muscle model. Med Bio Eng Comput 44, 919–930 (2006). https://doi.org/10.1007/s11517-006-0106-5

