Multi-level Fusion of Audio and Visual Features for Speaker Identification

  • Zhiyong Wu
  • Lianhong Cai
  • Helen Meng
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3832)


This paper explores the fusion of audio and visual evidence through a multi-level hybrid fusion architecture based on the dynamic Bayesian network (DBN), which combines model-level and decision-level fusion to achieve higher performance. For model-level fusion, a new DBN-based audio-visual correlative model (AVCM) is proposed, which describes both the inter-correlations and the loose timing synchronicity between the audio and video streams. Experiments on the CMU database and our own homegrown database both demonstrate that the methods improve the accuracy of audio-visual bimodal speaker identification at all acoustic signal-to-noise ratios (SNR) from 0 dB to 30 dB under varying acoustic conditions.
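As a rough illustration of what decision-level fusion can look like, the sketch below combines per-speaker log-likelihood scores from an audio classifier and a visual classifier with a weighted sum and picks the best-scoring speaker. This is a generic textbook rule, not the fusion scheme of the paper, which the abstract does not specify. The function names, the `audio_weight` parameter, and the example scores are all hypothetical.

```python
def fuse_scores(audio_ll, visual_ll, audio_weight):
    """Weighted-sum fusion of per-speaker log-likelihoods.

    audio_ll / visual_ll: dicts mapping speaker id -> log-likelihood
    (hypothetical outputs of two separately trained classifiers).
    audio_weight: reliability weight in [0, 1] for the audio stream;
    in practice it would be lowered as the acoustic SNR drops.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    # Only speakers scored by both modalities can be fused.
    speakers = audio_ll.keys() & visual_ll.keys()
    return {
        s: audio_weight * audio_ll[s] + (1.0 - audio_weight) * visual_ll[s]
        for s in speakers
    }


def identify(audio_ll, visual_ll, audio_weight):
    """Return the speaker id with the highest fused score."""
    fused = fuse_scores(audio_ll, visual_ll, audio_weight)
    return max(fused, key=fused.get)


# Made-up scores: audio favours "alice", video favours "bob".
audio_scores = {"alice": -10.0, "bob": -12.0}
visual_scores = {"alice": -9.0, "bob": -5.0}

clean_speech = identify(audio_scores, visual_scores, audio_weight=0.9)
noisy_speech = identify(audio_scores, visual_scores, audio_weight=0.2)
```

With a high audio weight the decision follows the audio stream; with a low weight, as might be chosen in noisy conditions, it follows the visual stream. Model-level fusion, by contrast, would model both streams jointly inside a single network rather than merging separate scores afterwards.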


Keywords: Video Stream; Dynamic Bayesian Network; Feature Fusion; Speaker Identification; Decision Fusion



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Zhiyong Wu (1, 2)
  • Lianhong Cai (1)
  • Helen Meng (2)
  1. Department of Computer Science and Technology, Tsinghua University, Beijing, China
  2. Department of Systems Engineering and Engineering Management, The Chinese University of Hong Kong, Shatin, N.T., Hong Kong SAR, China
