
VIsual TRAnslator: Linking Perceptions and Natural Language Descriptions

  • Chapter
Integration of Natural Language and Vision Processing

Abstract

Although image understanding and natural language processing constitute two major areas of AI, there have been only a few attempts to integrate computer vision with the generation of natural language expressions for describing image sequences. In this contribution we report on practical experience gained in the project Vitra (VIsual TRAnslator) concerning the design and construction of integrated knowledge-based systems capable of translating visual information into natural language descriptions. In Vitra, different domains, such as traffic scenes and short sequences from soccer matches, have been investigated.

Our approach to simultaneous scene description emphasizes concurrent image sequence evaluation and natural language processing, carried out on an incremental basis, which is an important prerequisite for real-time performance. One major achievement of our cooperation with the vision group at the Fraunhofer Institute (IITB, Karlsruhe) is the automatic generation of natural language descriptions for recognized trajectories of objects in real-world image sequences. In this survey, the different processes pertaining to high-level scene analysis and natural language generation are discussed.




Copyright information

© 1994 Springer Science+Business Media Dordrecht

About this chapter

Cite this chapter

Herzog, G., Wazinski, P. (1994). VIsual TRAnslator: Linking Perceptions and Natural Language Descriptions. In: Mc Kevitt, P. (eds) Integration of Natural Language and Vision Processing. Springer, Dordrecht. https://doi.org/10.1007/978-94-011-0273-5_6


  • DOI: https://doi.org/10.1007/978-94-011-0273-5_6

  • Publisher Name: Springer, Dordrecht

  • Print ISBN: 978-94-010-4121-8

  • Online ISBN: 978-94-011-0273-5

  • eBook Packages: Springer Book Archive
