Abstract
The science of human identification using physiological characteristics, or biometry, has long been of great concern in security systems. However, robust multimodal identification systems based on audio-visual information have not yet been thoroughly investigated. The aim of this work is therefore to propose a model-based feature extraction method that employs the physiological characteristics of the facial muscles producing lip movements. This approach adopts intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. These parameters depend exclusively on the neuro-muscular properties of the speaker; consequently, imitation of valid speakers can be reduced to a large extent. The parameters are applied to a hidden Markov model (HMM) audio-visual identification system. In this work, audio and video features are combined by adopting a multistream pseudo-synchronized HMM training method. Noise-robust audio features, namely Mel-frequency cepstral coefficients (MFCC), spectral subtraction (SS), and relative spectra perceptual linear prediction (J-RASTA-PLP), were used to evaluate the performance of the multimodal system when efficient audio feature extraction methods are employed. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits, along with a phonetically rich sentence. To evaluate the robustness of the algorithms, experiments were also performed on genetically identical twins, and changes in speaker voice were simulated with drug inhalation tests. At a signal-to-noise ratio (SNR) of 3 dB, the dynamic muscle model improved the identification rate of the audio-visual system from 91 to 98%.
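The viscosity-elasticity-mass parameters described above can be pictured as a second-order mechanical model of lip displacement, m·x″ + b·x′ + k·x = f(t). The sketch below is a minimal, hypothetical illustration (not the authors' implementation): it simulates such a system and recovers the speaker-dependent parameters (m, b, k) from a displacement trace by least squares.

```python
import numpy as np

# Illustrative sketch: lip motion as a second-order mechanical system
#   m*x'' + b*x' + k*x = f(t)
# where m (mass), b (viscosity) and k (elasticity) are speaker-dependent.
def simulate_lip(m, b, k, force, dt):
    """Semi-implicit Euler simulation of displacement x(t) under input force."""
    x, v = 0.0, 0.0
    xs = []
    for f in force:
        a = (f - b * v - k * x) / m
        v += a * dt
        x += v * dt
        xs.append(x)
    return np.array(xs)

def estimate_params(x, force, dt):
    """Least-squares fit of (m, b, k) from a displacement trace and its input."""
    v = np.gradient(x, dt)           # finite-difference velocity
    a = np.gradient(v, dt)           # finite-difference acceleration
    A = np.column_stack([a, v, x])   # regressors multiplying m, b, k
    theta, *_ = np.linalg.lstsq(A, force, rcond=None)
    return theta                     # [m, b, k]

dt = 1e-3
t = np.arange(0, 1, dt)
force = np.sin(2 * np.pi * 5 * t)    # synthetic articulation input
x = simulate_lip(0.02, 0.5, 40.0, force, dt)
m_hat, b_hat, k_hat = estimate_params(x, force, dt)
```

Estimated in this way, (m, b, k) summarize the dynamics of the observed lip trajectory rather than its shape, which is what makes them hard to imitate.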
Results on identical twins revealed a clear improvement in performance for the dynamic muscle model-based system, whose audio-visual identification rate was enhanced from 87 to 96%.
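Of the noise-robust audio front ends listed in the abstract, spectral subtraction is the simplest to illustrate. The sketch below assumes the standard magnitude-subtraction formulation with a noise spectrum estimated from noise-only frames; it is an illustrative example, not necessarily the paper's exact SS front end.

```python
import numpy as np

# Minimal spectral-subtraction (SS) sketch: subtract an average noise
# magnitude spectrum, estimated from noise-only frames, from each frame's
# magnitude spectrum, then resynthesize with the noisy phase.
def spectral_subtract(frames, noise_frames, floor=0.01):
    """frames: (n_frames, frame_len) array; returns denoised frames."""
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frames.shape[1], axis=1)

rng = np.random.default_rng(0)
n_frames, frame_len = 32, 256
t = np.arange(frame_len) / 8000.0                    # 8 kHz sampling assumed
clean = np.tile(np.sin(2 * np.pi * 440 * t), (n_frames, 1))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)
noise_only = 0.3 * rng.standard_normal(clean.shape)  # noise-only frames
denoised = spectral_subtract(noisy, noise_only)
```

In an MFCC pipeline, a step like this would run on the frame spectra before the Mel filterbank, so the cepstral features are computed from the cleaned spectrum.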
Acknowledgments
We would like to acknowledge the ITRC for their support of this work, and M. S. Moein (ITRC) for help with the introduction of audio-visual biometry systems.
Asadpour, V., Towhidkhah, F. & Homayounpour, M.M. Performance enhancement for audio-visual speaker identification using dynamic facial muscle model. Med Bio Eng Comput 44, 919–930 (2006). https://doi.org/10.1007/s11517-006-0106-5