Abstract
The science of human identification using physiological characteristics, or biometry, has long been of great concern in security systems. However, robust multimodal identification systems based on audio-visual information have not yet been thoroughly investigated. The aim of this work is therefore to propose a model-based feature extraction method that employs the physiological characteristics of the facial muscles producing lip movements. This approach adopts intrinsic muscle properties such as viscosity, elasticity, and mass, which are extracted from a dynamic lip model. These parameters depend exclusively on the neuro-muscular properties of the speaker; consequently, imitation of valid speakers can be reduced to a large extent. The parameters are applied to a hidden Markov model (HMM) audio-visual identification system. In this work, audio and video features are combined by adopting a multistream pseudo-synchronized HMM training method. Noise-robust audio features, namely Mel-frequency cepstral coefficients (MFCC), spectral subtraction (SS), and relative spectra perceptual linear prediction (J-RASTA-PLP), were used to evaluate the performance of the multimodal system when efficient audio feature extraction methods are employed. The superior performance of the proposed system is demonstrated on a large multispeaker database of continuously spoken digits, along with a phonetically rich sentence. To evaluate the robustness of the algorithms, experiments were also performed on genetically identical twins, and changes in speaker voice were simulated with drug inhalation tests. At a signal-to-noise ratio (SNR) of 3 dB, the dynamic muscle model improved the identification rate of the audio-visual system from 91 to 98%.
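The viscosity-elasticity-mass parameters described above can be pictured as a second-order mechanical model of lip displacement, m·x″ + b·x′ + k·x = f(t). The sketch below is a minimal, hypothetical illustration (not the authors' implementation): it simulates such a system and recovers the speaker-dependent parameters (m, b, k) from a displacement trace by least squares.

```python
import numpy as np

# Illustrative sketch: lip motion as a second-order mechanical system
#   m*x'' + b*x' + k*x = f(t)
# where m (mass), b (viscosity) and k (elasticity) are speaker-dependent.
def simulate_lip(m, b, k, force, dt):
    """Semi-implicit Euler simulation of displacement x(t) under input force."""
    x, v = 0.0, 0.0
    xs = []
    for f in force:
        a = (f - b * v - k * x) / m
        v += a * dt
        x += v * dt
        xs.append(x)
    return np.array(xs)

def estimate_params(x, force, dt):
    """Least-squares fit of (m, b, k) from a displacement trace and its input."""
    v = np.gradient(x, dt)           # finite-difference velocity
    a = np.gradient(v, dt)           # finite-difference acceleration
    A = np.column_stack([a, v, x])   # regressors multiplying m, b, k
    theta, *_ = np.linalg.lstsq(A, force, rcond=None)
    return theta                     # [m, b, k]

dt = 1e-3
t = np.arange(0, 1, dt)
force = np.sin(2 * np.pi * 5 * t)    # synthetic articulation input
x = simulate_lip(0.02, 0.5, 40.0, force, dt)
m_hat, b_hat, k_hat = estimate_params(x, force, dt)
```

Estimated in this way, (m, b, k) summarize the dynamics of the observed lip trajectory rather than its shape, which is what makes them hard to imitate.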
Results on identical twins revealed a clear improvement in performance for the dynamic muscle model-based system, whose audio-visual identification rate was enhanced from 87 to 96%.
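Of the noise-robust audio front ends listed in the abstract, spectral subtraction is the simplest to illustrate. The sketch below assumes the standard magnitude-subtraction formulation with a noise spectrum estimated from noise-only frames; it is an illustrative example, not necessarily the paper's exact SS front end.

```python
import numpy as np

# Minimal spectral-subtraction (SS) sketch: subtract an average noise
# magnitude spectrum, estimated from noise-only frames, from each frame's
# magnitude spectrum, then resynthesize with the noisy phase.
def spectral_subtract(frames, noise_frames, floor=0.01):
    """frames: (n_frames, frame_len) array; returns denoised frames."""
    noise_mag = np.abs(np.fft.rfft(noise_frames, axis=1)).mean(axis=0)
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    clean_mag = np.maximum(mag - noise_mag, floor * mag)  # spectral floor
    return np.fft.irfft(clean_mag * np.exp(1j * phase), n=frames.shape[1], axis=1)

rng = np.random.default_rng(0)
n_frames, frame_len = 32, 256
t = np.arange(frame_len) / 8000.0                    # 8 kHz sampling assumed
clean = np.tile(np.sin(2 * np.pi * 440 * t), (n_frames, 1))
noisy = clean + 0.3 * rng.standard_normal(clean.shape)
noise_only = 0.3 * rng.standard_normal(clean.shape)  # noise-only frames
denoised = spectral_subtract(noisy, noise_only)
```

In an MFCC pipeline, a step like this would run on the frame spectra before the Mel filterbank, so the cepstral features are computed from the cleaned spectrum.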
Acknowledgments
We would like to acknowledge the ITRC for their support of this work, and M. S. Moein (ITRC) for help with the introduction of audio-visual biometry systems.
Asadpour, V., Towhidkhah, F. & Homayounpour, M.M. Performance enhancement for audio-visual speaker identification using dynamic facial muscle model. Med Bio Eng Comput 44, 919–930 (2006). https://doi.org/10.1007/s11517-006-0106-5