Multi-stream Articulator Model with Adaptive Reliability Measure for Audio Visual Speech Recognition

  • Lei Xie
  • Zhi-Qiang Liu
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3930)


We propose a multi-stream articulator model (MSAM) for audio visual speech recognition (AVSR). This model extends the articulator modelling technique recently used in audio-only speech recognition to the audio-visual domain. A multiple-stream structure with a shared articulator layer is used in the model to mimic the speech production process. We also present an adaptive reliability measure (ARM) based on two local dispersion indicators, which integrates the audio and visual streams according to their local, temporal reliability. Experiments on the AVCONDIG database show that our model achieves recognition performance comparable to that of the multi-stream hidden Markov model (MSHMM) under various noisy conditions. With the help of the ARM, our model even performs best at some test SNRs.
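
The abstract does not define the ARM or its two dispersion indicators, so the following is only a minimal sketch of the general idea of dispersion-based stream weighting, not the authors' method. It assumes one dispersion value per stream, computed as the average pairwise difference among a frame's N-best class log-likelihoods (a form often used in the AVSR literature), and it maps the two values to a stream weight with a simple ratio. The function names `dispersion`, `audio_weight`, and `fuse` are hypothetical.

```python
def dispersion(log_likelihoods):
    """Average pairwise difference among the sorted N-best log-likelihoods.

    A large value means the stream separates the candidate classes clearly
    at this frame; a value near zero means the stream is uninformative.
    """
    top = sorted(log_likelihoods, reverse=True)
    n = len(top)
    if n < 2:
        return 0.0
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            total += top[i] - top[j]
    return 2.0 * total / (n * (n - 1))


def audio_weight(audio_ll, visual_ll):
    """Assumed mapping: audio weight = its share of the total dispersion."""
    d_a = dispersion(audio_ll)
    d_v = dispersion(visual_ll)
    if d_a + d_v == 0.0:
        return 0.5  # neither stream is informative; weight them equally
    return d_a / (d_a + d_v)


def fuse(audio_ll, visual_ll):
    """Weighted sum of per-class log-likelihoods for one frame."""
    lam = audio_weight(audio_ll, visual_ll)
    return [lam * a + (1.0 - lam) * v for a, v in zip(audio_ll, visual_ll)]


# Usage: the audio stream discriminates well, the visual stream is flat,
# so the fused scores follow the audio stream at this frame.
fused = fuse([0.0, -1.0, -2.0], [-1.0, -1.0, -1.0])
```

Because the weight is recomputed for every frame, the balance between streams can shift over time, which is the sense of "local, temporal reliability" this sketch tries to illustrate.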


Keywords: Visual Speech, Visual Stream, Audio Speech, Babble Noise, Reliability Pair





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Lei Xie (1)
  • Zhi-Qiang Liu (1)
  1. Center for Media Technology, School of Creative Media, City University of Hong Kong, Kowloon, Hong Kong
