Improved Audio-Visual Speaker Recognition via the Use of a Hybrid Combination Strategy

  • Conference paper
Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2688))

Abstract

In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio and visual modalities for the purposes of text-dependent speaker recognition. Our work is based on the well-known hidden Markov model (HMM) classifier framework for modelling speech. A framework is proposed to handle the mismatch between train and test observation sets, so as to provide effective classifier combination between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speaker recognition applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than to model any bimodal speech dependencies. To this end, a strategy is recommended, based on theory and empirical evidence, that uses a hybrid of the weighted product and weighted sum rules in the presence of varying acoustic noise. Results are presented on the M2VTS database.
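The combination rules named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fixed weights, the SNR threshold, and the idea of switching on an acoustic SNR estimate are assumptions standing in for the paper's theoretically motivated hybrid strategy.

```python
def weighted_product(scores_a, scores_v, alpha=0.7):
    """Weighted product rule: geometric fusion of per-speaker likelihoods,
    equivalent to a weighted sum of log-likelihoods. A near-zero score from
    either modality can veto a speaker, so it suits well-matched conditions."""
    return [a ** alpha * v ** (1.0 - alpha) for a, v in zip(scores_a, scores_v)]

def weighted_sum(scores_a, scores_v, alpha=0.7):
    """Weighted sum rule: arithmetic fusion, which dampens independent
    errors when one classifier is degraded by train/test mismatch."""
    return [alpha * a + (1.0 - alpha) * v for a, v in zip(scores_a, scores_v)]

def hybrid_combination(scores_a, scores_v, acoustic_snr_db, snr_threshold=10.0):
    """Hypothetical hybrid: trust the product rule when the audio is clean,
    and fall back to a sum rule that down-weights audio as noise increases."""
    if acoustic_snr_db >= snr_threshold:
        return weighted_product(scores_a, scores_v, alpha=0.7)
    return weighted_sum(scores_a, scores_v, alpha=0.3)

# Toy per-speaker likelihoods from two independent HMM classifiers
audio = [0.60, 0.30, 0.10]   # acoustic scores favour speaker 0
video = [0.20, 0.50, 0.30]   # visual scores favour speaker 1

clean = hybrid_combination(audio, video, acoustic_snr_db=20.0)
noisy = hybrid_combination(audio, video, acoustic_snr_db=0.0)
print(clean.index(max(clean)), noisy.index(max(noisy)))  # → 0 1
```

Under clean audio the product rule follows the confident acoustic classifier (speaker 0); under noise the sum rule shifts the decision toward the visual evidence (speaker 1), which is the behaviour the hybrid strategy is designed to exploit.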




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lucey, S., Chen, T. (2003). Improved Audio-Visual Speaker Recognition via the Use of a Hybrid Combination Strategy. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_108

  • DOI: https://doi.org/10.1007/3-540-44887-X_108

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40302-9

  • Online ISBN: 978-3-540-44887-7

  • eBook Packages: Springer Book Archive
