DBN Based Models for Audio-Visual Speech Analysis and Recognition

  • Ilse Ravyse
  • Dongmei Jiang
  • Xiaoyue Jiang
  • Guoyun Lv
  • Yunshu Hou
  • Hichem Sahli
  • Rongchun Zhao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 4261)

Abstract

We present an audio-visual automatic speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system consists of three components: (i) a visual module, (ii) an acoustic module, and (iii) a Dynamic Bayesian Network (DBN)-based recognition module. The visual module locates and tracks the speaker's head and mouth movements, and extracts relevant speech features represented by contour information and 3D deformations of the lips. The acoustic module extracts noise-robust features, namely Mel-Frequency Cepstral Coefficients (MFCCs). Finally, we propose two DBN-based models: one that handles the audio and video streams individually, and one that integrates the features from both streams. We also compare the proposed DBN-based system with a classical Hidden Markov Model. The novelty of the developed framework is that the characteristics of the audio-visual speech signal are preserved from the extraction step through the learning step. Experiments on continuous audio-visual speech show that the segmentation boundaries of phones in the audio stream and visemes in the video stream are close to manual segmentation boundaries.
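
The closeness of automatic and manual segmentation boundaries mentioned in the abstract could be quantified along the following lines. This is only an illustrative sketch, not the evaluation procedure used in the paper: the function names, the 20 ms tolerance, and the boundary times are all hypothetical, and the sketch assumes the automatic and manual boundaries are already paired one-to-one.

```python
from statistics import mean


def boundary_deviation_ms(auto_ms, manual_ms):
    """Mean absolute deviation (ms) between paired boundary times."""
    if len(auto_ms) != len(manual_ms) or not manual_ms:
        raise ValueError("boundary lists must be non-empty and equally long")
    return mean(abs(a - m) for a, m in zip(auto_ms, manual_ms))


def within_tolerance(auto_ms, manual_ms, tol_ms=20.0):
    """Fraction of automatic boundaries within tol_ms of the manual ones.

    The default tolerance is an assumption made for this example.
    """
    if len(auto_ms) != len(manual_ms) or not manual_ms:
        raise ValueError("boundary lists must be non-empty and equally long")
    hits = sum(1 for a, m in zip(auto_ms, manual_ms) if abs(a - m) <= tol_ms)
    return hits / len(manual_ms)


if __name__ == "__main__":
    # Hypothetical phone boundary times in milliseconds, invented for
    # illustration only; they are not results from the paper.
    manual = [120, 310, 455, 640]
    auto = [130, 300, 480, 645]
    print(boundary_deviation_ms(auto, manual))
    print(within_tolerance(auto, manual))
```

The same two measures would apply unchanged to viseme boundaries in the video stream, since they only compare lists of time stamps.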

Keywords

  • Speech Recognition
  • Automatic Speech Recognition
  • Dynamic Bayesian Network
  • Speech Recognition System
  • Visual Speech
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Ilse Ravyse (1)
  • Dongmei Jiang (3)
  • Xiaoyue Jiang (3)
  • Guoyun Lv (3)
  • Yunshu Hou (3)
  • Hichem Sahli (1, 2)
  • Rongchun Zhao (3)
  1. Department ETRO, Joint Research Group on Audio Visual Signal Processing (AVSP), Vrije Universiteit Brussel, Brussel
  2. IMEC, Leuven
  3. School of Computer Science, Northwestern Polytechnical University, Xi’an, P.R. China
