Using Speech in Visual Object Recognition

  • Conference paper
Mustererkennung 2000

Part of the book series: Informatik aktuell (INFORMAT)

Abstract

Automatic understanding of multi-modal input is a central topic in modern human-computer interfaces. Yet the basic question of how the interpretations provided by different modalities can be connected in a universal and robust manner remains an open problem. The most intuitive input modalities, speech perception and vision, can only be correlated on a qualitative, content-based interpretation level, which is extremely difficult to accomplish due to vague meanings and erroneous processing results. A simple frame-based integration scheme that fills appropriate slots with new analysis results will fail when ambiguous or contradictory information appears. In this paper we propose a new probabilistic framework to overcome these drawbacks. The integration model is built from data collected in labeled test sets and psycholinguistic experiments; thereby, the correspondence problem is solved in a robust and universal manner. In particular, we show that erroneous visual interpretations can be corrected by a joint analysis of visual and speech input data.
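The abstract's central claim, that a joint probabilistic analysis of speech and vision can correct an erroneous visual interpretation, can be illustrated with a minimal Bayesian fusion sketch. This is not the paper's model: the object labels, distributions, and the `fuse` function below are hypothetical, standing in for the paper's integration of visual posteriors with speech-derived evidence.

```python
def fuse(visual_post, speech_lik):
    """Combine a visual posterior P(object | image) with a speech-derived
    likelihood P(utterance | object) via Bayes' rule and renormalise.
    Illustrative only; the paper's actual model is richer than this."""
    joint = {obj: visual_post[obj] * speech_lik.get(obj, 0.0)
             for obj in visual_post}
    z = sum(joint.values())
    return {obj: p / z for obj, p in joint.items()}

# Vision alone narrowly prefers the wrong label for the object.
visual_post = {"pen": 0.5, "bolt": 0.4, "cube": 0.1}
# The spoken description is far more likely for a bolt than a pen.
speech_lik = {"pen": 0.05, "bolt": 0.9, "cube": 0.05}

fused = fuse(visual_post, speech_lik)
best = max(fused, key=fused.get)
print(best)  # the joint analysis now prefers "bolt"
```

Even this toy example shows the mechanism the abstract describes: neither modality is trusted outright; instead their uncertainties are combined, so a weak but correct speech cue can overturn a marginally wrong visual hypothesis.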


Copyright information

© 2000 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wachsmuth, S., Fink, G.A., Kümmert, F., Sagerer, G. (2000). Using Speech in Visual Object Recognition. In: Sommer, G., Krüger, N., Perwass, C. (eds) Mustererkennung 2000. Informatik aktuell. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-59802-9_54

  • DOI: https://doi.org/10.1007/978-3-642-59802-9_54

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-67886-1

  • Online ISBN: 978-3-642-59802-9

  • eBook Packages: Springer Book Archive
