Audio-Visual Isolated Words Recognition for Voice Dialogue System

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6800)

Abstract

This contribution reports experiments in audio-visual recognition of isolated words. The results will be used to improve our voice dialogue system by adding visual speech recognition. Voice dialogue systems can be deployed in train or bus stations and in similar places where noise levels are relatively high, so the visual part of speech can improve the recognition rate, mainly in noisy conditions. The audio-visual recognition of isolated words in our experiments was based on two-stream Hidden Markov Models (HMMs) and on HMMs of individual Czech phonemes and visemes. Different visual speech features and different numbers of HMM states and mixtures were evaluated in separate tests. In the subsequent experiments, isolated words were recognized after the HMMs had been trained, and babble noise was added to the acoustic speech signal in successive steps.
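
The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function names, the constraint that the stream weights sum to one, and the power-based scaling of the noise to a target signal-to-noise ratio (SNR) are all assumptions made for illustration. The first function combines per-state audio and visual log-likelihoods with stream weights, as is commonly done in two-stream HMMs; the second adds a noise recording to a speech signal at a chosen SNR.

```python
import numpy as np


def two_stream_log_likelihood(log_b_audio, log_b_visual, audio_weight):
    """Weighted sum of audio and visual state log-likelihoods.

    Assumption: the visual weight is 1 - audio_weight, so the two
    stream exponents sum to one.
    """
    if not 0.0 <= audio_weight <= 1.0:
        raise ValueError("audio_weight must lie in [0, 1]")
    log_b_audio = np.asarray(log_b_audio, dtype=float)
    log_b_visual = np.asarray(log_b_visual, dtype=float)
    return audio_weight * log_b_audio + (1.0 - audio_weight) * log_b_visual


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech so that the resulting SNR equals snr_db.

    Assumption: SNR is defined on mean signal power over the whole
    utterance; the noise is truncated to the speech length.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Gain that brings the noise power to p_speech / 10^(snr_db / 10).
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

In such a setup, lowering `audio_weight` as `snr_db` decreases lets the visual stream contribute more when the acoustic signal is less reliable, which matches the motivation given in the abstract.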

Copyright information

© 2011 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chaloupka, J. (2011). Audio-Visual Isolated Words Recognition for Voice Dialogue System. In: Esposito, A., Vinciarelli, A., Vicsi, K., Pelachaud, C., Nijholt, A. (eds) Analysis of Verbal and Nonverbal Communication and Enactment. The Processing Issues. Lecture Notes in Computer Science, vol 6800. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-25775-9_8

  • DOI: https://doi.org/10.1007/978-3-642-25775-9_8

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-25774-2

  • Online ISBN: 978-3-642-25775-9

  • eBook Packages: Computer Science, Computer Science (R0)
