Improved Audio-Visual Speaker Recognition via the Use of a Hybrid Combination Strategy

  • Conference paper
Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 2688))

Abstract

In this paper, an in-depth analysis is undertaken into effective strategies for integrating the audio and visual modalities for the purposes of text-dependent speaker recognition. Our work is based on the well-known hidden Markov model (HMM) classifier framework for modelling speech. A framework is proposed to handle the mismatch between train and test observation sets, so as to provide effective classifier combination between the acoustic and visual HMM classifiers. From this framework, it can be shown that strategies for combining independent classifiers, such as the weighted product or sum rules, naturally emerge depending on the influence of the mismatch. Based on the assumption that poor performance in most audio-visual speaker recognition applications can be attributed to train/test mismatches, we propose that the main impetus of practical audio-visual integration is to dampen the independent errors resulting from the mismatch, rather than to model any bimodal speech dependencies. To this end, a strategy is recommended, based on theory and empirical evidence, that uses a hybrid of the weighted product and weighted sum rules in the presence of varying acoustic noise. Results are presented on the M2VTS database.
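The combination rules named in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the fixed weights, the SNR threshold, and the idea of switching on an acoustic SNR estimate are assumptions standing in for the paper's theoretically motivated hybrid strategy.

```python
def weighted_product(scores_a, scores_v, alpha=0.7):
    """Weighted product rule: geometric fusion of per-speaker likelihoods,
    equivalent to a weighted sum of log-likelihoods. A near-zero score from
    either modality can veto a speaker, so it suits well-matched conditions."""
    return [a ** alpha * v ** (1.0 - alpha) for a, v in zip(scores_a, scores_v)]

def weighted_sum(scores_a, scores_v, alpha=0.7):
    """Weighted sum rule: arithmetic fusion, which dampens independent
    errors when one classifier is degraded by train/test mismatch."""
    return [alpha * a + (1.0 - alpha) * v for a, v in zip(scores_a, scores_v)]

def hybrid_combination(scores_a, scores_v, acoustic_snr_db, snr_threshold=10.0):
    """Hypothetical hybrid: trust the product rule when the audio is clean,
    and fall back to a sum rule that down-weights audio as noise increases."""
    if acoustic_snr_db >= snr_threshold:
        return weighted_product(scores_a, scores_v, alpha=0.7)
    return weighted_sum(scores_a, scores_v, alpha=0.3)

# Toy per-speaker likelihoods from two independent HMM classifiers
audio = [0.60, 0.30, 0.10]   # acoustic scores favour speaker 0
video = [0.20, 0.50, 0.30]   # visual scores favour speaker 1

clean = hybrid_combination(audio, video, acoustic_snr_db=20.0)
noisy = hybrid_combination(audio, video, acoustic_snr_db=0.0)
print(clean.index(max(clean)), noisy.index(max(noisy)))  # → 0 1
```

Under clean audio the product rule follows the confident acoustic classifier (speaker 0); under noise the sum rule shifts the decision toward the visual evidence (speaker 1), which is the behaviour the hybrid strategy is designed to exploit.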




Copyright information

© 2003 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Lucey, S., Chen, T. (2003). Improved Audio-Visual Speaker Recognition via the Use of a Hybrid Combination Strategy. In: Kittler, J., Nixon, M.S. (eds) Audio- and Video-Based Biometric Person Authentication. AVBPA 2003. Lecture Notes in Computer Science, vol 2688. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44887-X_108

  • DOI: https://doi.org/10.1007/3-540-44887-X_108

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-40302-9

  • Online ISBN: 978-3-540-44887-7

  • eBook Packages: Springer Book Archive
